XML Schema development approaches

Can we free our documents from the straitjacket of structure?

By Rick Jelliffe
August 8, 2011 | Comments: 2

The way that people approach developing schemas has evolved over the years: each new approach grows out of problems with the status quo (see Hegelian dialectic) but enriches rather than supplants.

I thought I would take a little walk through the various generations of approaches I have seen over the last couple of decades, also noting the important books, but without attempting to be complete. Each of these generations does not die immediately with the birth of the next; there is a lot of life left in even the oldest mule. Bored readers may care to correlate each generation with phrases from KLF's Justified and Ancient.

First Generation: Declarative and Rigorous

The initial rationales of XML (when it was called SGML) were firstly to promote declarative markup (the separation of processing and data) and secondly to promote rigorous markup (then termed Document Type Definitions, now termed XML schemas). In this we can see SGML as a child of its time, the 70s, in the same groove as its sibling, the database movement.

The earliest books such as Charles Goldfarb's The SGML Handbook and Martin Bryan's SGML: An Author's Guide concentrated on syntax.

Second Generation: Battling WYSIWYG

Early books (such as Brian Travis and Dale Waldt's) did not really have worked-out methodologies as such: they were mostly concerned with helping readers make that first intuitive leap, that material due for publication could be represented separately from its presentation, and fighting an initially losing battle with the proponents of WYSIWYG and disconnected personal computers. That battle was finally won courtesy of HTML and the WWW, and WYSIWYG was pronounced dead by the late 90s. (The death notice was a little premature: stylesheets are ubiquitous now, even inside word processors, but the markup that the styles apply to is not high-level markup, analogous to database columns, but merely publication-neutral generic markup.)

One of the contradictions at this stage was that practitioners would say "just model the data". However, "Markup reflects a theory of text" (Sperberg-McQueen), which meant hierarchical in abstract structure and, critically, hierarchical in the actual markup: flat or externally-linked structures were not well-supported.

Third Generation: Inspection

The next milestone was the book by Eve Maler and Jeanne El Andaloussi, Developing SGML DTDs. This gave a systematic method for document analysis based on the inspection and generalization of published examples.

The flaw in this approach is similar to one of the issues that Christopher Alexander reported with his pattern approach to architecture: that lay people applying the pattern methodology ended up making buildings which were slightly eccentric versions of the kinds of buildings they would have built without the pattern methodology. There still needed to be a mind-broadening phase, where the designers would acquire a large enough vocabulary of implementation techniques and forms. (Alexander thinks we can and should go beyond the forms to an objective "quality without a name" when implementing designs with our pattern language, based on anthropological reality.)

For this generation, good design still required gurus in markup, rather than domain experts.

Fourth Generation: Patterns and Fragments

The need for richer structures than DTDs provided was increasingly being felt just prior to XML's emergence: ISO made a stab at it with something called Architectural Forms, but it failed the test of simplifying life for developers.

A good book at this time to give more exposure to diverse approaches was Dave Megginson's Structuring XML Documents.

This was also the spirit of my book The XML & SGML Cookbook: there are multiple ways of representing any data in markup. A professional, no matter which methodology they use, still needs to be aware of each different way, and to make deliberate choices about the trade-offs. These issues never go away: people who are used to HTML find markup that does not look like HTML (such as RDF) entirely odd; similarly, people who are used to RDF markup may not understand why it is entirely inappropriate for embedding in HTML. (More recently, if you were used to ODF's representation (mixed content, potentially hierarchical) then you might be startled by Open XML's representation (no mixed content, linear): but ugly is not the same as inappropriate or indefensible!)

So the cookbook gave many alternative fragments out of which documents could be constructed: document analysis did not need to start from scratch; instead you could run through various design options and select the most appropriate.

(The seed of Schematron came at this time too: as it became clearer to me that many decisions about how to represent data "independently" of its presentation (such as the decision about whether to have flat, nested or dimensionalized documents) could only be answered by having a model of the processing and use of the elements: in the absence of such a model, following convention and idiom was often the best bet. Grammar based schema languages effectively prevented specification of the "why" of markup, only allowing "what" to be expressed.)

Fifth Generation: Vocabularies

At this stage XML and the WWW had arrived, and ideas stopped being transmitted through books. Books became product manuals.

The next approach to markup to emerge was one of standardized, modular sub-vocabularies made by domain experts rather than markup gurus: XML Namespaces was designed to enable this, and the W3C led the way with initiatives such as MathML.

The flaw in this approach was the growth of kitchen-sink schemas: lazy, or just quick and dirty, construction of schemas by adopting standard vocabularies holus bolus. Bloated schemas built on bloated schema languages.

Sixth Generation: Namespaces as Data Dictionaries

A very common development pattern emerged, based on Conway's Law really: split schema development into two efforts: one to make data dictionaries and the other to select these into particular schemas. Rather than adopting the whole namespace, you select items from it. (This pattern is the basis, for example, of the Standard Business Reporting project here in Australia, built on XBRL: there is a project-wide definitional taxonomy, and numerous form-specific reporting taxonomies.) The domain experts could be involved in the dictionary phase, the developers and gurus could be more restricted to the schema stage.
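As a rough sketch of this split (all namespaces and element names here are invented for illustration), the dictionary effort publishes a namespace of shared definitions, and each form-specific schema imports that namespace and references only the items it needs:

    <!-- dictionary.xsd: the project-wide data dictionary (hypothetical) -->
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               targetNamespace="urn:example:dictionary"
               elementFormDefault="qualified">
      <xs:element name="partNumber" type="xs:string"/>
      <xs:element name="quantity" type="xs:positiveInteger"/>
      <xs:element name="deliveryDate" type="xs:date"/>
      <!-- ...and many more shared definitions... -->
    </xs:schema>

    <!-- order.xsd: one form-specific schema, selecting just two items -->
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               xmlns:d="urn:example:dictionary"
               targetNamespace="urn:example:order"
               elementFormDefault="qualified">
      <xs:import namespace="urn:example:dictionary" schemaLocation="dictionary.xsd"/>
      <xs:element name="order">
        <xs:complexType>
          <xs:sequence>
            <xs:element ref="d:partNumber"/>
            <xs:element ref="d:quantity"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>

The domain experts can own the first file; the developers and gurus own the second.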

The namespace technology that encouraged standardized vocabularies of fragments such as Dublin Core also allowed an inversion: standard envelopes such as SOAP, on which the Web Services effort was built.
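The inversion looks roughly like this: the standard vocabulary supplies the outer wrapper, and the application-specific payload (here a hypothetical purchase-order vocabulary) sits inside it:

    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Header/>
      <soap:Body>
        <!-- the payload vocabulary belongs to the application, not to SOAP -->
        <po:purchaseOrder xmlns:po="urn:example:purchasing">
          <po:partNumber>ABC-123</po:partNumber>
          <po:quantity>10</po:quantity>
        </po:purchaseOrder>
      </soap:Body>
    </soap:Envelope>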

And on top of XML Namespaces came the dreaded W3C XML Schemas: it added popular datatyping features to a subset of DTD-style grammars, plus many other low-bang-per-buck features that boldly eschew any attempt to be layered. (The ISO languages such as RELAX NG and Schematron have more bang per buck. It is tempting to say that the ISO languages are less well supported by vendors, but which particular features any given implementation of W3C XML Schemas supports is a lottery, so neither has a very satisfactory story.)

Seventh Generation: Semantic Markup

If schemas could be made using standard vocabularies identified by URLs, could we invert the structure of markup from being annotation of existing text to being slots into which facts are represented explicitly linked to their definitions? This was the initial premise of RDF 1.0, and it proved very shaky: the syntax was too far removed from the needs of publishing and dynamic documents to be workable, and RDF did not fit in with XML Schemas and DTDs, the dominant schema languages, at a time when the pressure was arising to move XML to the centre rather than the periphery of processing.

The question people were asking was not "give me a standard way to represent facts independently of my application" but "give me an easy way to get the facts in my application out there".

When the RDF people argued that the information content of an RDF document allowed it to be transformed into lesser formats, the response was, well, in that case why not keep it in the less descriptive but more convenient format and just transform it to RDF if that was ever needed? The result was W3C GRDDL, perhaps the final nail in the coffin of RDF 1.0 as a viable exchange syntax.
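In practice GRDDL amounted to little more than a marker on the convenient format saying how to get RDF out of it when needed: something like the following, where the purchase-order vocabulary and the stylesheet name are invented for the example.

    <po:purchaseOrder xmlns:po="urn:example:purchasing"
                      xmlns:grddl="http://www.w3.org/2003/g/data-view#"
                      grddl:transformation="po-to-rdf.xsl">
      <po:partNumber>ABC-123</po:partNumber>
      <po:quantity>10</po:quantity>
    </po:purchaseOrder>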

Eighth Generation: Data binding

But there already was a community with lots of data arranged as individual simple facts and tuples, independent of RDF: relational databases. They were not concerned with linking their databases; they were concerned with sending query and CRUD data between databases and middleware.

It is quite nice that the desire of programmers to reduce work and errors by automatically generating XML schemas, object classes and database schemas from each other (for example, in the SEAM framework) has come at the same time as programming languages have introduced markup-language-inspired annotation features.

Rather than necessarily analyzing documents, you just take your object model and pick out the classes to serialize, or you take your database schema and pick the tables and columns to serialize to.
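The resulting schema tends simply to mirror the source structure. Here is a sketch of the kind of schema a data-binding tool might emit for a single database table (the table and column names are hypothetical): no document analysis is visible, the column list just becomes the content model.

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               targetNamespace="urn:example:customers"
               xmlns="urn:example:customers"
               elementFormDefault="qualified">
      <!-- one complex type per table, one element per selected column -->
      <xs:element name="customer">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="id" type="xs:long"/>
            <xs:element name="name" type="xs:string"/>
            <xs:element name="created" type="xs:dateTime"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>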

The advent of this kind of automated data-binding compromised most of the core value propositions of XML as sold (simplicity, flexibility, standards, ease of reading): rather than the web-service-y loose-coupling using the clear interfaces of public schemas and standard vocabularies that allows decentralized development, automatic data-binding necessarily marginalizes the role and (often) the quality of the target. The justification is that the XML is ephemeral and the analysis has already been done.

Ninth Generation: Data modeling

A frequently more satisfactory approach has been to loosely couple XML schema development to data modeling, for example using UML. The loose coupling reduces the chances that the schema designers will get sidetracked by hobby-horses about XML, and lets them concentrate on their analysis.

The schema design effort then concentrates on the best way for serialization and representation, not content analysis. This can be rather like how some people keep their database or warehouse schemas at arm's length from their data models: with the intent that the performance, maintainability and structure of the database should not be compromised by abstract modeling methodologies.

But then the real problem emerges: the more the data model is made without regard to the limitations of the grammar-based schema languages, the more constraints you may have to give up checking when representing the data model in XML.

Tenth Generation: Linked Data

XML DTDs' use of regular grammars was a strength conceptually, but also a flaw for modeling: they do not provide good support for constraints that flow across links. If you want these non-regular constraints, you have to use hierarchy: the structures possible in your document will be determined by the capabilities of the schema language.

W3C XML Schemas has a small but useful improvement over DTDs: the KEY/KEYREF mechanism lets you require that a reference within a document resolves to an element selected by a declared key, in effect declaring the type of the element at the other end of the link. (ISO RELAX NG supports a more powerful class of grammars, which in theory allows structures internal to an early part of a document to constrain elements or attributes later in the document: it can address a different part of the problem of specifying structures beyond what the XML Schema/DTD grammars allow.)
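A minimal sketch of the KEY/KEYREF mechanism (element names invented, no target namespace, xs bound to the XML Schema namespace): every item's contextRef must match the id of some context in the same report.

    <xs:element name="report">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="context" maxOccurs="unbounded">
            <xs:complexType>
              <xs:attribute name="id" type="xs:string" use="required"/>
            </xs:complexType>
          </xs:element>
          <xs:element name="item" maxOccurs="unbounded">
            <xs:complexType>
              <xs:simpleContent>
                <xs:extension base="xs:decimal">
                  <xs:attribute name="contextRef" type="xs:string" use="required"/>
                </xs:extension>
              </xs:simpleContent>
            </xs:complexType>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
      <!-- identity constraints are declared on the containing element -->
      <xs:key name="contextKey">
        <xs:selector xpath="context"/>
        <xs:field xpath="@id"/>
      </xs:key>
      <xs:keyref name="itemContext" refer="contextKey">
        <xs:selector xpath="item"/>
        <xs:field xpath="@contextRef"/>
      </xs:keyref>
    </xs:element>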

So this does not help when, for example, you want to specify and enforce constraints between branches of your document, or even between documents. You need additional validation.

For example, you may want to link some data value to a generic container which holds information specific to that data value: I suppose you could call these tunneled constraints. An example of this can be found in XBRL: you have a data item (say "Profit") which then links to a context (by an ID, say "C123") that in turn specifies the entity and period information that every XBRL fact requires (units are referenced separately, by a unitRef attribute). However, when you are using XBRL Dimensions, the context element may also link to or specify other information to place your item into a position in a hypercube (e.g., which country your item belongs to).
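A heavily trimmed instance fragment, to make the shape concrete (namespace declarations omitted; the my: taxonomy and the dimension names are invented for the example):

    <xbrli:context id="C123">
      <xbrli:entity>
        <xbrli:identifier scheme="http://example.com/companies">ACME</xbrli:identifier>
      </xbrli:entity>
      <xbrli:period>
        <xbrli:instant>2011-06-30</xbrli:instant>
      </xbrli:period>
      <xbrli:scenario>
        <!-- XBRL Dimensions: places any fact using this context into a hypercube position -->
        <xbrldi:explicitMember dimension="my:CountryDimension">my:Australia</xbrldi:explicitMember>
      </xbrli:scenario>
    </xbrli:context>

    <xbrli:unit id="AUD">
      <xbrli:measure>iso4217:AUD</xbrli:measure>
    </xbrli:unit>

    <my:profit contextRef="C123" unitRef="AUD" decimals="0">1000000</my:profit>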

In this case, it is beyond the power of regular-grammar-based schema languages to say "This <my:profit> element's contextRef attribute needs to point to an <xbrli:context> element which contains some other element that specifies the various dimension properties." XBRL has addressed this by making W3C XML Schemas even more complicated: there is an XLink-based system where you specify the dimensions using XML Schema syntax, but these schemas are interpreted by specific XBRL processors in order to perform validation.

The best way, in most cases, to specify and validate constraints in XML documents beyond those that are convenient in XML Schema languages is to use ISO Schematron. It uses XPaths, and has many features to support clear layering. It is the only mainstream schema language that supports validation of constraints between documents.
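For instance, here is a Schematron sketch of the XBRL constraint above (the my: names are again hypothetical, and comparing @dimension to a prefixed string is a simplification of proper QName handling):

    <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt">
      <sch:ns prefix="xbrli" uri="http://www.xbrl.org/2003/instance"/>
      <sch:ns prefix="xbrldi" uri="http://xbrl.org/2006/xbrldi"/>
      <sch:ns prefix="my" uri="http://example.com/taxonomy"/>
      <sch:pattern id="profit-needs-country">
        <sch:title>Profit facts must be placed in a country dimension</sch:title>
        <sch:rule context="my:profit">
          <sch:let name="ref" value="@contextRef"/>
          <sch:assert test="//xbrli:context[@id = $ref]/xbrli:scenario/
              xbrldi:explicitMember[@dimension = 'my:CountryDimension']">
            A profit item must point to a context that assigns it to a country.
          </sch:assert>
        </sch:rule>
      </sch:pattern>
    </sch:schema>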

Where most of the grammar-based schema languages are weak, from a modeling point of view, is that they do not allow clear representation of patterns except where the pattern corresponds to a namespace boundary (e.g., SOAP is an envelope pattern) or to a single element (e.g. html:p is mixed content). For example, the XML Schemas specification is built around the idea of "components", which are strongly related groups of elements: however, XML Schemas as a technology has no concept or representation of components.

Schematron is very strong here, because it allows you to group rules and assertions into patterns (with names and documentation) regardless of the particular markup used to represent them.

Eleventh Generation?: Views and Pivots and Multi-valency

But there is one big problem that has remained constant, and is unaddressed: it comes from the idea that we only need one representation for a document. XML tried to address this in part by making DTDs optional: if you needed to put information out in some different arrangement, you could. RDF 1.0 tried to address this in part by having structures that let you know what each item of information was about, but assumed that the information was in some way complete. XML Schemas tried to address this in part with the ALL content model and with features such as substitution groups. XBRL addresses it in part by making information flat, and using various links.

In my blog Highly Generic Schemas a year ago, I gave one approach, which allows data to be as highly denormalized or as highly normalized as possible, and which is also perhaps friendly for object-oriented programmers. The method in it is not to look at the data, but to look at the users: in particular, to ask "what do developers need the markup to tell them, in order to make efficient and effective use of the data?"

But developers are only one set of users. If I were to say what we need to be supporting now, what methodology we need to develop, or what our schema languages need to support, I would suggest that perhaps we need to wean ourselves off the idea that there is only one true, optimal, standard form for our documents: we want rigorous, declarative markup, but we need convenient markup.

Where to begin? One approach is to have so-called multi-valent documents: see Can a file be ODF and Open XML at the same time? I see that Murata-san and others at ISO SC34 are spending more time on the ISO MCE technology (Markup Compatibility and Extensibility) that allows medium-grained multi-valent document branches.

But the most promising approach I see, acting at a stage before MCE or ZIP, is to pursue the kind of ideas that I have written up as PRESTO. It works with existing infrastructure. The PRESTO material is couched in terms of providing multiple representations of documents (JPEG and PNG, PDF and HTML, ODF and Open XML, etc.) which is increasingly happening, as is the provision of documents for multiple platforms (e-readers and print, tablets and desktops, etc.)

But the same mechanism can also be used for supplying the same information (resource) in very different schemas (representations). A developer who wants to put some information into a data warehouse might like a flat, dimensionalized representation; a developer converting the data into HTML might find flat, dimensionalized data distressing to work with and prefer hierarchical data that can be put out as tables trivially.
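For example, the same three sales figures might be served flat and dimensionalized for the warehouse developer, or nested for the HTML developer (the vocabulary is invented for the illustration):

    <!-- flat, dimensionalized -->
    <facts>
      <fact region="NSW" quarter="2011-Q1">100</fact>
      <fact region="NSW" quarter="2011-Q2">120</fact>
      <fact region="VIC" quarter="2011-Q1">80</fact>
    </facts>

    <!-- the same information, pivoted into a hierarchy -->
    <sales>
      <region name="NSW">
        <quarter period="2011-Q1">100</quarter>
        <quarter period="2011-Q2">120</quarter>
      </region>
      <region name="VIC">
        <quarter period="2011-Q1">80</quarter>
      </region>
    </sales>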

These kinds of gross representation differences are not really the kinds of things we can expect stylesheets to do. We can expect our GUI to handle sorting tables based on various columns (e.g. jQuery and tablesorter, for HTML pages), but we don't have facilities to handle pivoting or sorting data fields that are rendered outside tables.

Imagine a calendar, for example: we have 12 tables of 7 columns, each apparently on a separate web page, delivered by an AJAX system as one XML file and rendered by our browser. What if it is more convenient to view just the Saturdays, in one table? This is not something that we would expect our table widget on the GUI to be able to do: the table for one month will not know anything about the table for another month. And it is not something that we traditionally expect our nice pure XML document to provide. But it is potentially the kind of generic operation that might be quite easy for the server/middleware/serializer to provide.

I think there is great scope for getting XML documents with URLs parameterized by various pivot and normalization depths. The challenge then is for schema languages which allow specification of document structures that are independent of the pivoting. Schematron gets part of the way here, with its abstract patterns. But I suspect that what a pivotable document needs is not just a pivoting schema, but also a pivoting API: the pivoting document is better for driving push-based (event-based) processing, but there still need to be pull-based systems where you can access information reliably regardless of the structure.
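To give a flavour of how abstract patterns help: the constraint is written once against placeholders, and each pivot of the document (say, the flat and nested forms sketched above) instantiates it with its own XPaths. The names are again invented for the example.

    <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
      <!-- the constraint, stated independently of any particular pivot -->
      <sch:pattern abstract="true" id="figure-needs-period">
        <sch:rule context="$figure">
          <sch:assert test="$period">Every sales figure must be tied to a reporting period.</sch:assert>
        </sch:rule>
      </sch:pattern>

      <!-- instantiation for the flat, dimensionalized form -->
      <sch:pattern is-a="figure-needs-period" id="flat-form">
        <sch:param name="figure" value="fact"/>
        <sch:param name="period" value="@quarter"/>
      </sch:pattern>

      <!-- instantiation for the nested form -->
      <sch:pattern is-a="figure-needs-period" id="nested-form">
        <sch:param name="figure" value="region/quarter"/>
        <sch:param name="period" value="@period"/>
      </sch:pattern>
    </sch:schema>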


This blog is a response to a question Roger Costello asked over on the XML-DEV mail list:
Do you know a good book or article that describes the *process* of finding or determining the basic/core items in a newly-to-be-created markup language (or the basic/core combinators to be created for a functional program)?

2 Comments

Are you kidding me?
First you write about RDF as "odd", "shaky" and with "a nail in the coffin", and then you start using the term "Linked Data" out of the blue sky.
You should know that Linked Data is probably the first practical method to globally interlink data which is enabled exclusively by the RDF data model. You should also check out the developments in this area - and you will see that they are very much alive.
XBRL is on the other hand the most terribly complicated XML standard I have seen, which tries to imitate most of RDF's features while lacking its grace and simplicity.

Ah, I was not meaning "Linked Data TM" but just "data that is linked". It might make more sense if you read it like that, since there is nothing about RDF there. I suppose I should change the title of the section.

As for XBRL: yes, I agree that XBRL can be madly complex, especially in practice. Yes, I think that it would be better for XBRL to keep the current instance syntax, but to ditch the schemas in favour of GRDDL transformations to RDF, with Schematron for extra constraints.

Do you think RDF Schemas and so on are effective for specifying schemas that can be generated into user interfaces, the way that XML Schemas can do? Remembering that the effort in XBRL is not in making the instance documents but in making the taxonomy.
