Programmers and academics often think and theorize about XML as kind of tree data structure. And so indeed it is. But it is also allows much more: it is a series of different graph structures composed into or imposed on top of that tree.
Why care? Well I am not saying you need to! Many people only use the tree structures in XML by choice (Content, Attributes, Elements and Comments: CACE?), and then find themselves having to revisit and perhaps reinvent the same kinds of data structures provided by the bits of XML they don't use. This is neither good nor bad: it depends on the case. I am not saying that XML IDs or entities are perfect or imperfect, for example: merely that they provide since they spring out of solving particular problems, we can see them as one way of revealing a (more interesting?) general problem or solution space.
So below is a table showing what I think are four different simultaneous data structures that are available in vanilla XML. (The default OReilly.com style does not like tables, so it may be preceded by some whitespace: apologies.)
In this view, XML is
- The elements in XML form a tree (a single-rooted, directed, acyclic graph with no shared nodes). And a particular kind of tree: an ordered, typeable tree with labelled nodes that can have properties (a kind of attribute-value tree?) and unique identifiers, and with unlabelled text with no properties possible as leaves, but whose edges are unlabelled and have no properties.
- Imposed on this we have a graph structure made using the ID/IDREF links.
- Underneath the elements, the document is composed as a structure of parsed entities. (Most XML documents are made of a single entity: one file or one web resource of course, but the entity mechanism allows a document to be constructed from multiple sources of text.) These parsed entities form an acyclic directed ordered graph.
- Then, above this are links that point outside the document (again using the entity mechanism): non-XML entities such as graphics form a star, XML documents form a graph (e.g., the Web), and I've tacked in for completeness the old SGML SUBDOC feature which allows documents to be nested.
Most of the built-in XML layers have more ambitious re-workings: instead of XML parseable entities you may be able to use XInclude, instead of ID/IDREF you may find XSD's KEY/KEYREF better (but probably not), instead of external entities you may find XLinks (for navigation) or RDF (for semantic links) better. No-one would (or could) use SUBDOC but would go for XML Namespaces plus XInclude.
Looking at these data structures, we can see some order: that the composition of documents is acyclic means they are not infinite, and that the element tree built on them is not a infinite tree (given finite entities in the first place.) [However, sharing of nodes is still useful: in a sense, the entities are normalized while the tree is denormalized.]
Looking at this table, the thought comes as to whether some of the additional XML standards are attempts to convert more of the Ns into Ys, or at least whether it would be a more rational approach to improving XML (or its successor's) expressive power while keeping things simple.
For example, XML Schemas has its own composition mechanism (xs:include) and its own hierarchy mechanism (type derivation), its own internal linking system (global/local types) and an external linking system (import). But in the composition layer, the how to handle circular inclusions or multiple inclusions (graphs or acyclic graphs) is necessarily defined. I would expect that a complex quasi-standard such as XBRL could also be analyzed in these terms: how are taxonomies composed? what is the primary structre of taxonomies? how are taxonomies internally linked? how are taxonomies externally linked (e.g., for a Discoverable Taxonomy Set)?
For example, in XML the parent/child edges on the element tree are unlabelled. Would it be more useful to have labelled edges? Does XPath fit the bill? XPath axes might look like edge labels, but their aim is to allow a kind of multi-rooted tree effect, where every node is reachable from every other node, rather than labelling edges with any specific information. What about XLink? it allows the declaration of labelled edges between nodes but these have to be declared as links. RDF Schema sort of works the right way, but it is not general for any arbitrary XML. I'd say the closest thing is GRDDL, but it operates as an XSLT transformation that converts a file to RDF rather than merely labelling edges (which is not to say that some specialized implementation that acted in-place could not be developed.)
(Table below. You may have to scroll.)
|Node label?||N||Y (element)|
|Node Id?||N||Y |
* For SGML this would be Y since entity declarations can have data attributes
** SUBDOC is an SGML feature removed from XML
There could be another column for XML Schemas, since the Post Schema Validation Infoset makes an implied DAG from element (or attribute) nodes to types (actually, to type declarations). Contrast this with RELAX NG, which is not annotative. Schematron's new properties mechanism does allow annotative validation, but it allows many-to-many links from document nodes to properties with values may be dynamically generated.
I cannot believe anyone would make it down here!: I have written this just to log some thoughts. In that sense, it is perhaps a companion to the series I put up a year ago on some parser aspects: Jotting on parsers for SGML-family document languages.