Hi, mssmb-ga:
Your excellent question has two parts, which I will take the liberty
of rephrasing:
1. How does one convert data from a relational database into XML?
2. Why would it be important or useful to do this?
Your wording, "algorithm (steps) to generate xml from relational
data," tells me that you want a concrete answer to the first part of
this question, which I will endeavor to provide. But the second part
of your question reminds us that to do it well, "how" we convert the
data depends on "why" we need to convert it.
Therefore my approach to answering will be a little different than
usual. I want to begin with a brief summary of relational databases
and XML representations of data, and then give a simple but explicit
example of converting from the former to the latter. In describing
this example I will point out the "manual" steps involved in
conversion.
What follows will depend to an extent on your feedback. I do not know
how familiar you are with either relational databases or XML.
Assuming you have some significant experience with each, I will
proceed to give a more complex example which illustrates one way in
which client-server applications commonly use such a conversion. Then
I will enumerate a number of toolsets or "frameworks" which are
available for automating most of the conversion, noting which of
these ignore, require, or supply a DTD for validating the XML output.
My hope is that by putting before you an assortment of mechanisms for
conversion, I can give a more satisfactory answer to the second part
of your question. The different frameworks for generating XML from
relational data have many similarities, but they also expose an
assortment of purposes for these conversions, and thus show how the
"algorithms" of conversion may be attuned to these various purposes.
[SUMMARY OF RELATIONAL DATABASES AND XML REPRESENTATIONS]
For a comparatively long time, relational databases have dominated
managing "ownership" of data above a certain threshold of complexity.
Relational databases exploit the "table" as a primitive data
structure. Along with a few other concepts like keys and indexes,
tables represent "business rules" as constraints to the valid data for
an application. Essentially from the outset of the RDBMS "theory",
Codd and Date advocated "normalization" as a set of
guidelines/prescriptions for good design of relational databases.
XML is a comparatively recent text-based formatting standard for data
documents, aimed at simplifying the transmission or exchange of data
between disparate architectures (data feeds). As such XML is both
flexible and regularly structured. The XML standard has given
rise to numerous derivative standards, such as XSLT, XPath, and
XML Schema, and has insinuated itself into almost every aspect of
modern application development, from Web site management to program
documentation. This "surprising" success is due in part to an
abstract "tree" model underlying XML (see Don Box's book, Essential
XML), which is a counterpoint to the flattened "table" designs of RDBMS.
However XML lacks any "philosophy" of normalization comparable to
what has been prescribed for relational databases, and designing XML
schemes remains a much more unstructured art than designing relational
database schemes.
Thus converting from relational (table-based) data to an XML
(tree-based) format calls for a number of choices, ideally
informed by our grasp of the ends to be served by this conversion.
There are always so many possible ways to accomplish this that one
might wish for a "magic wand" (framework) that takes some or all of
the decision making out of our hands.
Let's begin our progress towards a discussion of these "magic wands"
with a simple example in which the choices are easily made.
[EXAMPLE 1]
Suppose we have a single relational database table EMPLOYEE, defined
as follows:
CREATE table EMPLOYEE (
    empid          char(6)     NOT NULL,
    firstname      varchar(15) NOT NULL,
    middleinitial  varchar(2)  NULL,
    lastname       varchar(20) NOT NULL,
    suffix         varchar(3)  NULL
);
and a pair of records stored within as follows:
INSERT into EMPLOYEE (
    empid, firstname, middleinitial, lastname, suffix
)
VALUES ('100000','Thomas','A.','Edison','Sr.');
INSERT into EMPLOYEE (
    empid, firstname, middleinitial, lastname, suffix
)
VALUES ('100001','Bill',null,'Gates',null);
[XML DOCUMENT REQUIREMENTS]
To express these records' data in an XML format, we must produce a
well-formed XML document that conserves the "complete sentence"
semantics entailed by the business rules underlying the table. Let us
assume, for the sake of illustration here, that field empid is the
primary key on this table.
Every XML document _should_ begin with an XML declaration, minimally
like this:
<?xml version="1.0"?>
Fortunately the folks at W3C did such a good job with the basic XML
recommendation that version 1.0 remains the one in general use
(although much has been built atop this foundation, and a version 1.1
deals mainly with character set and name issues). Other than the
version parameter (which is required in any XML declaration), the XML
1.0 recommendation allows two optional parameters (encoding and
standalone). But the above is fine for our immediate purposes.
The rest of our XML document, its "body", will consist of a suitably
nested arrangement of "elements". The W3C recommendation _allows_ for
certain additional kinds of contents (comments, processing
instructions, and CDATA sections), but for the sake of simplicity we
will not use any of those constructs in this example.
An XML element consists of an opening and a closing "tag", the
contents of the element being everything in between. Tags are
bracketed by an opening "<" and a closing ">", and the name of an
element (a text identifier) is the first thing one finds after the
opening "<".
In addition to the element's name and its contents (which may be
comprised of character data and/or a further arrangement of nested
"child" elements), an element may possess attributes. The attributes
are specified within the element's opening tag, separated by
whitespace from the name and from each other. An empty element (one
without contents) may be abbreviated into a single tag that is both
opening and closing, as we shall illustrate below.
[CHOOSE A NAME FOR THE DOCUMENT ELEMENT]
It is required that an XML document have a single element at the root
of its hierarchy of elements. So the first choice we'll make here is
what to name this "document element". Under the circumstances
(describing a collection of EMPLOYEE records), I would choose to name
it "employees" (plural, hinting that the contents will consist of a
repeating group of "employee" elements). So let's add a minimal
opening and closing tag for our document element, giving the result:
<?xml version="1.0"?>
<employees>
</employees>
In itself that would be a well-formed XML document. However we must
still provide further contents to reflect the records held in the
relational database table. As foreshadowed above, I plan to lay out
one "employee" element for each of the two records in the EMPLOYEE
table. That brings us to the next choices to be made.
[DELINEATE THE ATTRIBUTES AND CONTENTS OF THE DOCUMENT ELEMENT]
Since the contents of distinct fields on the record are separate, we
should preserve that separation in the way we lay out these <employee>
elements. We could keep these values separate by representing
them either as child elements or as attributes of <employee>. In the
first case we might do something like this with subelements:
<employee>
    <empid>100000</empid>
    <firstname>Thomas</firstname>
    <middleinitial>A.</middleinitial>
    <lastname>Edison</lastname>
    <suffix>Sr.</suffix>
</employee>
while an alternative approach using only attributes might look this
way:
<employee empid="100000"
          firstname="Thomas"
          middleinitial="A."
          lastname="Edison"
          suffix="Sr.">
</employee>
Notice that in this last approach the <employee> element is empty. In
such a case one can "abbreviate" the opening and closing tags into a
single tag as follows:
<employee empid="100000"
          firstname="Thomas"
          middleinitial="A."
          lastname="Edison"
          suffix="Sr." />
Applications which parse the XML should treat these last two
alternatives equivalently.
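As a quick check of that equivalence, here is a minimal sketch using Python's standard xml.etree.ElementTree parser. The helper name summarize is only an illustration, and other parsers are assumed (not shown) to behave the same way:

```python
import xml.etree.ElementTree as ET

# The same empty element, written with separate tags and with the
# abbreviated single tag.
long_form = ('<employee empid="100000" firstname="Thomas" middleinitial="A." '
             'lastname="Edison" suffix="Sr."></employee>')
short_form = ('<employee empid="100000" firstname="Thomas" middleinitial="A." '
              'lastname="Edison" suffix="Sr." />')

def summarize(xml_text):
    """Return (tag, attributes, text, number of children) of the root element."""
    root = ET.fromstring(xml_text)
    return (root.tag, dict(root.attrib), root.text or "", len(root))

# Both forms yield the same parsed element.
print(summarize(long_form) == summarize(short_form))
```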
Deciding whether values are to be represented as elements or as
attributes is not always an obvious choice. Both extremes (all child
elements with no attributes, versus empty elements characterized
solely by attributes at some depth in the hierarchy) have proponents.
However I would choose in this case a combination approach:
<employee empid="100000">
    <firstname>Thomas</firstname>
    <middleinitial>A.</middleinitial>
    <lastname>Edison</lastname>
    <suffix>Sr.</suffix>
</employee>
The reasoning behind my choice is that the field empid has a
distinguished role, as the primary key, in uniquely identifying the
individual employee. Therefore it is attractive to link it closely
with the <employee> element. Also it is important to note that the
key field empid appears to be a "synthetic" sort of value, without
useful substructure. This is important when deciding if a value ought
to be an attribute, because while elements can be nested in a manner
that reflects substructure (or grouped in a manner that reflects a
common semantic purpose, as we might do here by combining all the
other values as child elements under an <empname>...</empname>
construct), there is no nesting or grouping of attributes.
Here then is my completed conversion to an XML document of this
relational data:
<?xml version="1.0"?>
<employees>
    <employee empid="100000">
        <firstname>Thomas</firstname>
        <middleinitial>A.</middleinitial>
        <lastname>Edison</lastname>
        <suffix>Sr.</suffix>
    </employee>
    <employee empid="100001">
        <firstname>Bill</firstname>
        <middleinitial />
        <lastname>Gates</lastname>
        <suffix />
    </employee>
</employees>
There are several ways in which this example was especially simple.
Not only were we treating a single table or view (or more generally a
single recordset; stored procedures may return multiple recordsets),
the records were few and the values were all naturally mapped into
character form. In particular we did not bump into issues about
representing floating point numbers or dates in character form. Some
closely related issues would surface if we delved into what aspects of
this converted XML document might be "validated" under a DTD.
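The manual steps of this example can be sketched in code. What follows is only one possible rendering, assuming Python with its standard sqlite3 and xml.etree.ElementTree modules and an in-memory SQLite database standing in for whatever RDBMS actually holds the table. The function name employees_to_xml is made up for the illustration:

```python
import sqlite3
import xml.etree.ElementTree as ET

def employees_to_xml(conn):
    """Build the <employees> document element from the EMPLOYEE table."""
    root = ET.Element("employees")              # step 1: the document element
    rows = conn.execute(
        "SELECT empid, firstname, middleinitial, lastname, suffix "
        "FROM EMPLOYEE ORDER BY empid"
    )
    for empid, first, middle, last, suffix in rows:
        # step 2: one <employee> per record, primary key as an attribute
        emp = ET.SubElement(root, "employee", empid=empid)
        # step 3: other fields as child elements; a NULL (None) value
        # leaves the element empty
        for name, value in (("firstname", first), ("middleinitial", middle),
                            ("lastname", last), ("suffix", suffix)):
            ET.SubElement(emp, name).text = value
    return root

# In-memory stand-in for the real database (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE EMPLOYEE (
    empid char(6) NOT NULL, firstname varchar(15) NOT NULL,
    middleinitial varchar(2) NULL, lastname varchar(20) NOT NULL,
    suffix varchar(3) NULL)""")
conn.executemany("INSERT INTO EMPLOYEE VALUES (?,?,?,?,?)", [
    ("100000", "Thomas", "A.", "Edison", "Sr."),
    ("100001", "Bill", None, "Gates", None),
])

root = employees_to_xml(conn)
print('<?xml version="1.0"?>')
print(ET.tostring(root, encoding="unicode"))
```

The output is written on a single line without indentation, but it carries the same elements and attributes as the document above.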
[PROCEEDING RECURSIVELY]
But the example does point the way toward an important aspect of the
general conversion process, and that is recursion.
Here we began by naming the document element and delineating its
contents. Actually we slid very subtly into performing these same
activities on the next level deeper of child elements, progressing
from <employees> layout to <employee> layout.
A similar pattern is applicable to any XML conversion scheme: define
the mapping at the level of the document element, and from there
repeat the process with each child of those elements already mapped.
It is a classic instance of definition by "induction", with the basis
step formed naturally by the mapping of the document element.
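A small sketch of that recursive pattern, again assuming Python's xml.etree.ElementTree. The nested description format (a dict with "attrs" and "children" keys) is invented here purely for illustration. The mapping starts at the document element and repeats for each child until it reaches elements that hold only text:

```python
import xml.etree.ElementTree as ET

def build(name, node):
    """Recursively map a nested description onto an XML element.

    node is either a string or None (text-only content), or a dict with
    optional "attrs" (attribute values) and "children" (a list of
    (name, node) pairs).
    """
    elem = ET.Element(name)
    if node is None or isinstance(node, str):
        elem.text = node                  # leaf: text content only
        return elem
    elem.attrib.update(node.get("attrs", {}))
    for child_name, child_node in node.get("children", []):
        elem.append(build(child_name, child_node))   # same mapping, one level down
    return elem

doc = build("employees", {"children": [
    ("employee", {"attrs": {"empid": "100000"}, "children": [
        ("firstname", "Thomas"),
        ("lastname", "Edison"),
    ]}),
]})
print(ET.tostring(doc, encoding="unicode"))
```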
[FEEDBACK]
As proposed at the outset, I would like to "pause" here to collect
feedback from you. Have I told you already more than you ever wanted
to know about XML and relational databases? Or have I told you only
things which were already well known to you? Have I started us along
a worthwhile track to a deeper discussion (as I proposed, a more real
world example of conversion together with a survey of "automated"
conversion frameworks)? Or have I begun the discussion in a way that
either misses your concerns altogether or "passes overhead at a high
rate of speed"?
In short, should I proceed apace, backtrack to fill in gaps, or quit
and give another researcher a crack at your question?
regards, mathtalk-ga
Clarification of Answer by mathtalk-ga on 23 Dec 2002 21:34 PST
Hi, mssmb-ga:
Thanks for the kind words and encouragement. You have a good
background in XML and relational databases, and your focus seems to be
on how DTDs can be used to validate XML documents that represent
relational data. I think it best to revisit DTD concepts first in the
context of the earlier example.
[Example 1 DTD]
Before proceeding to our "real world" example, let's elaborate the
first XML document with a DTD (Document Type Definition). The DTD
syntax, while also "inherited" from SGML, is quite separate from what
we have used so far. See Secs. 3 and 4 of the W3C XML 1.0
Recommendation here:
http://www.w3.org/TR/REC-xml
The purpose of the DTD, as you know, is to define a particular class
or "type" of XML document. The DTD can then be used by an XML
processor to "validate" whether an XML document instance conforms to
these rules. Any DTD must be referenced in the XML document after the
XML declaration and before the document element.
The DTD may have a part "internal" to the XML document and another
part "external" to it. The usual syntax matches a DOCTYPE identifier
to the document element name in this fashion:
<?xml version="1.0"?>
<!DOCTYPE myDocElementName
    SYSTEM "http://www.myWebName.com/myDtdName.dtd">
<myDocElementName>
    . . .
</myDocElementName>
This external form of the DOCTYPE declaration uses a URL "location" to
identify where the DTD file can be found. The internal portion of the
DTD, when present, follows the external SYSTEM "..." reference inside
a pair of square brackets. The internal portion is treated as coming
first, so its attribute-list and entity declarations take precedence
over declarations of the same names in the external portion. The
external portion may be omitted if only the internal portion is
needed. We illustrate this below for the sake of having a
self-contained presentation.
Recall our first example:
<?xml version="1.0"?>
<employees>
    <employee empid="100000">
        <firstname>Thomas</firstname>
        <middleinitial>A.</middleinitial>
        <lastname>Edison</lastname>
        <suffix>Sr.</suffix>
    </employee>
    <employee empid="100001">
        <firstname>Bill</firstname>
        <middleinitial />
        <lastname>Gates</lastname>
        <suffix />
    </employee>
</employees>
Since our document element is called "employees", our DOCTYPE
declaration will wrap various DTD declarations (more precisely,
"markup" declarations if you read the W3C specs) inside a pair of
square brackets:
<!DOCTYPE employees
[
.
.
.
]
>
By convention the various DTD declarations are listed on separate
lines (as are the opening and closing square brackets) for legibility.
There are essentially four kinds of DTD declarations, not counting
comments: ELEMENT, ATTLIST, ENTITY, and NOTATION. We will only be
concerned here with the first two of these. The last two are useful in
"parameterizing" an XML document by tying in macros and (externally
defined) XML or non-XML content. Coordinating IGNORE and INCLUDE
directives with ENTITY declarations allows one to create "conditional"
declarations, but this is outside the present scope of our discussion.
An ELEMENT declaration which looks like this:
<!ELEMENT myElementName ... >
constrains the content of the element <myElementName> by either a
content category or a content model.
The syntax options for the first case are quite limited; one has two
choices:
<!ELEMENT myElementName ANY >
<!ELEMENT myElementName EMPTY >
It's really all or nothing with these content categories. The ANY
option allows all combinations whatsoever of text content and child
elements (including no text content and/or no child elements). The
EMPTY option requires no text content and no child elements.
A content model allows more flexibility, at least in terms of
specifying the varieties of child elements permitted. In our example
we have <employee> elements. These always contain just these four
child elements exactly once in the given order: firstname,
middleinitial, lastname, suffix. The ELEMENT syntax that expresses
this is:
<!ELEMENT employee (firstname,middleinitial,lastname,suffix) >
This is an example of a "sequence list" of child elements. Let's back
up from the <employee> element to its parent in this example, the
document element <employees>. The particular instance of the document
type might be validated with this declaration:
<!ELEMENT employees (employee,employee) >
But this is inadequate. It asserts exactly two occurrences of
<employee>, no more and no less, and we know from the intended
application to retrieving records from a relational database that one
should contemplate a variable number of records, including the
possibility of no records returned.
The ELEMENT syntax that expresses this is:
<!ELEMENT employees (employee)* >
The * is called a cardinality operator, and its sense is that zero or
more instances are allowed. Therefore * has the meaning of "optional
element(s)". For the sake of completeness we mention that ? has the
sense of zero or one instance, hence meaning "an optional (singular)
element", and that + has the sense of one or more instances, meaning
"required element(s)".
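The cardinality operators behave much like their namesakes in regular expressions. The following sketch, in Python, makes that analogy concrete by hand-translating a few content models into regular expressions over a comma-terminated list of child names. It is only an analogy under that assumption, not a DTD validator, and the helper matches_model is hypothetical:

```python
import re

def matches_model(child_names, model_regex):
    """Check a sequence of child element names against a content model
    that has been hand-translated into a regular expression."""
    # Each name is followed by a comma so that names cannot run together.
    text = "".join(name + "," for name in child_names)
    return re.fullmatch(model_regex, text) is not None

EXACTLY_TWO = r"employee,employee,"   # (employee,employee)
ZERO_OR_MORE = r"(employee,)*"        # (employee)*
ZERO_OR_ONE = r"(employee,)?"         # (employee)?
ONE_OR_MORE = r"(employee,)+"         # (employee)+

no_rows = []
three_rows = ["employee"] * 3
for label, model in (("exactly two", EXACTLY_TWO), ("*", ZERO_OR_MORE),
                     ("?", ZERO_OR_ONE), ("+", ONE_OR_MORE)):
    print(label, matches_model(no_rows, model), matches_model(three_rows, model))
```

Only the * model accepts both an empty result set and a result set of three records, which is why it suits the <employees> element.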
Finally we would simply like to say that the child elements of
<employee> each have text-only content:
<!ELEMENT firstname (#PCDATA) >
<!ELEMENT middleinitial (#PCDATA) >
<!ELEMENT lastname (#PCDATA) >
<!ELEMENT suffix (#PCDATA) >
A couple of words are in order about how the nullable fields
EMPLOYEE.middleinitial and EMPLOYEE.suffix have been mapped into our
XML document. Authors (such as C.J. Date) have observed that "null"
as a value is open to a variety of interpretations, e.g. a nonexistent
value, an accidentally missing value, a truly unknown value, a
circumstantially inapplicable value, a "none of the above" optional
value, etc. The handling of nulls is therefore dependent on "business
rules" that exist outside of database design and contents per se.
Here I have chosen to treat the null values for middle initial and
suffixes in a manner similar to how they would be treated if instead a
zero length text string were supplied. I have included
<middleinitial> and <suffix> elements which happen to be empty. This
is allowed, incidentally, by the #PCDATA characterization; one doesn't
need to add a cardinality operator to indicate that the (necessarily
singular) text contents are optional (no text amounts to zero length
text).
An alternative approach, one that would distinguish a "null" value
from one which is an "empty" string, would be to omit the corresponding
elements from the XML document when a field is populated with a null.
In our case this would mean changing the ELEMENT declaration for
<employee> to:
<!ELEMENT employee (firstname,middleinitial?,lastname,suffix?) >
so that the child elements <middleinitial> and <suffix> of <employee>
can appear zero or one times. But that is not how I chose to map the
elements here.
Again this is something of a special case because we are mapping
character strings as text content. Other sorts of values, like
numbers or dates, would naturally be considered "nulls" if the
corresponding element were empty of text content.
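The two ways of mapping nulls can be put side by side in a short sketch, assuming Python's xml.etree.ElementTree. The function employee_element and its omit_nulls flag are names invented for this illustration:

```python
import xml.etree.ElementTree as ET

FIELDS = ("firstname", "middleinitial", "lastname", "suffix")

def employee_element(empid, values, omit_nulls=False):
    """Map one record to <employee>, treating None as a database null."""
    emp = ET.Element("employee", empid=empid)
    for name, value in zip(FIELDS, values):
        if value is None and omit_nulls:
            continue      # alternative mapping: null means the element is absent
        ET.SubElement(emp, name).text = value   # chosen mapping: null gives an empty element
    return emp

row = ("Bill", None, "Gates", None)
as_empty = ET.tostring(employee_element("100001", row), encoding="unicode")
as_omitted = ET.tostring(employee_element("100001", row, omit_nulls=True),
                         encoding="unicode")
print(as_empty)
print(as_omitted)
```

The first result fits the content model (firstname,middleinitial,lastname,suffix). The second needs the variant with the ? operators on middleinitial and suffix.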
This completes the descriptions of element contents, but one important
item remains: the attributes! In our example only the <employee>
element was assigned an attribute, but it was a critical one, the
primary key EMPLOYEE.empid.
The syntax used to describe the attributes of an element is like this:
<!ATTLIST myElementName
.
.
.
>
in which the individual lines (as formatted here) will describe
attribute names, their types, and how to treat their default values
(if any). Without going into complete detail about all possibilities,
the "natural" way to describe a primary key associated "uniquely" with
a record from a relational database is with an attribute of type ID,
and which (from a "default" behavior standpoint) is REQUIRED. Hence
we would like to say:
<!ATTLIST employee
empid ID #REQUIRED
>
because the ID attribute type means specifically that this attribute's
value is to be unique among all ID values in the particular XML
document (not merely among instances of this element). Unfortunately
there is a problem with this approach.
The problem is that the values of ID attributes must be legal XML
names. While XML names can contain both alphabetic and numeric
characters, there is a restriction on the first character of a legal
XML name. It cannot be a digit. However an underscore is a valid
first character of XML names, so "mangling" of the empid values would
provide one workaround. There are other attribute
types for which the EMPLOYEE.empid values would qualify without
mangling, such as CDATA or NMTOKEN, but these would not carry any
sense of uniqueness.
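A minimal sketch of the underscore workaround, in Python. The pattern ASCII_NAME is a deliberately simplified, ASCII-only approximation of an XML name (the full rule in the XML 1.0 Recommendation admits many more characters), and the helpers mangle and unmangle are hypothetical names:

```python
import re

# Simplified ASCII-only approximation of an XML name: a letter, underscore
# or colon first, then letters, digits, underscores, periods, colons, hyphens.
ASCII_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_.:\-]*")

def mangle(empid):
    """Prefix a key value with an underscore so it can serve as an ID value."""
    return "_" + empid

def unmangle(xml_id):
    """Recover the original key value from a mangled ID value."""
    if not xml_id.startswith("_"):
        raise ValueError("not a mangled id: %r" % xml_id)
    return xml_id[1:]

print(ASCII_NAME.fullmatch("100000") is not None)          # raw key as a name?
print(ASCII_NAME.fullmatch(mangle("100000")) is not None)  # mangled key as a name?
print(unmangle(mangle("100000")))
```

Whatever reads the XML back into the database has to apply the reverse step, so the convention needs to be agreed on by both ends of the exchange.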
Further, as we shall see with the more complex example to come, the ID
attributes are useful in linking data from one element type to another
as IDREF attributes come into play. This is analogous to primary
key/foreign key relationships in a relational database design.
For now I will take the "underscore mangling" approach in order to
capture the semantics of uniqueness. The completed example is then:
<?xml version="1.0"?>
<!DOCTYPE employees
[
    <!ELEMENT firstname (#PCDATA) >
    <!ELEMENT middleinitial (#PCDATA) >
    <!ELEMENT lastname (#PCDATA) >
    <!ELEMENT suffix (#PCDATA) >
    <!ELEMENT employee (firstname,middleinitial,lastname,suffix) >
    <!ATTLIST employee
        empid ID #REQUIRED
    >
    <!ELEMENT employees (employee)* >
]
>
<employees>
    <employee empid="_100000">
        <firstname>Thomas</firstname>
        <middleinitial>A.</middleinitial>
        <lastname>Edison</lastname>
        <suffix>Sr.</suffix>
    </employee>
    <employee empid="_100001">
        <firstname>Bill</firstname>
        <middleinitial />
        <lastname>Gates</lastname>
        <suffix />
    </employee>
</employees>
Additional Links:
[W3C Recommendation: XML Information Set]
(or XML InfoSet, abstraction of XML tree structure)
http://www.w3.org/TR/xml-infoset/
[W3C Candidate Recommendation: XML 1.1]
(formerly XML Blueberry, mainly character set & name issues)
http://www.w3.org/TR/xml11/