This is a bit of a brain dump, not a tutorial.
The Rise and Fall of Mandatory Validation
SGML was based on a strict regime: the document had to have a grammar (the DTD), and it had to conform to that grammar (declared by the DOCTYPE). This idea (the jargon term is rigorous markup) springs from a workflow that was likely to be distributed in time (archiving), task (re-targeting), and responsibility (different organizations). You don't want a bad document to propagate down the line (in terms of tasks, organizations or years) before anyone discovers it is flawed.
Mandatory validation had other benefits too, notably that it enforced Design by Contract.
However, mandatory validation requires schemas, which add to the expertise investment. And it had become clear that high-volume, low-stakes data delivered from a debugged system did not benefit from being validated before being sent: validation was appropriate for debugging and verification. At the receiver end, the increasing use of schemas to generate code (data binding) built some kinds of validation directly into the code: generated code rarely checks the full set of constraints, but it necessarily barfs when the document errs against the parts of the schema that it has used.
If there has been a trend away from mandatory validation, there has also been a trend away from voluntary validation. In large part, this has been due to the unworkable complexity of W3C XML Schemas, which has repelled the current generation of developers. The less rigidly structured and non-facty the documents need to be, the more alternative schema languages, notably RELAX NG, have been taken up.
When I first developed Schematron, almost a decade ago, one of my concerns was the idea that validation needed to be more than just a binary yes or no.
While standards may be couched in terms of a simple yes/no, much of the usefulness of validation comes from the information that accompanies that answer: can you explain the problem? Can you identify the cause? Can you provide enough information to suggest remedies? I came to the conclusion that what mattered was, in fact, the provision of information to humans, phrased in terms specific to the document type, and carrying dynamic information extracted from the actual document being validated.
The next insight was that validation could be regarded as a transformation of the instance document into a report. This raises the point that there is in fact a spectrum of reports that can be generated from a document for management and workflow purposes, with validation being only one of them. While we could regard storage structures, business rules, unit testing and data summarization as different tasks, it was in fact possible to make a single technology capable of making a good stab at all of them: a tool for finding patterns in XML documents using XPaths. Hence Schematron.
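To make that concrete, here is a minimal ISO Schematron sketch. The schema, pattern, rule, assert and report elements are standard Schematron; the invoice vocabulary being checked is invented for illustration:

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern id="invoice-checks">
    <rule context="invoice">
      <!-- assert fires its message when the test is FALSE -->
      <assert test="@number">An invoice must have a number attribute.</assert>
      <!-- report fires its message when the test is TRUE, and can pull
           dynamic information out of the document being validated -->
      <report test="total &lt; 0">Invoice <value-of select="@number"/>
        has a negative total: check for a refund entered as a sale.</report>
    </rule>
  </pattern>
</schema>
```

Note that the output is natural-language messages with values extracted from the instance: exactly the human-oriented diagnostics, specific to the document, described above.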
Schematron was designed to be useful as a companion to other schema languages. Eddie Robertsson's open source code allowed Schematron embedded in XSD or RELAX NG to be extracted and the gaps filled in. The ISO standard for Schematron even has an annex detailing how a standard that wanted to implement Schematron functionality in a modular way could embed ISO Schematron, ratifying Robertsson's approach.
The wheel turns, and XSD 1.1 is adding bits of Schematron's approach: I have mentioned that XSD 1.1 is adding assertions using a subset of XPath 2. I don't consider the XSD 1.1 assertions to be "assertions", since they make no natural-language statement; they are just constraints. But they are very useful nonetheless. Another very welcome addition that I think is influenced by Schematron is the availability of open content models: a combination of interleave and wildcards which lets you specify a partial schema. (While I think many of the changes in XSD 1.1 are steps forward, elsewhere I have been pretty scathing about XSD 1.1 for its utter lack of modularization --whether this is the cause or symptom of an NIH syndrome is not for me to say-- and for its lack of support for developers who need something simpler.)
Schematron takes advantage of partial schemas in an interesting way: it allows you to group and name different collections of patterns (the containers for rules with assertions) so that you can switch sets of assertions in and out of a test. This mechanism (phases) supports a variety of different uses, including workflows.
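For example, a schema might define an authoring phase that checks structure only, and a delivery phase that switches the business rules in as well. The phase and pattern elements below are standard ISO Schematron; the names and pattern contents are invented:

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" defaultPhase="authoring">
  <phase id="authoring">
    <active pattern="structure"/>
  </phase>
  <phase id="delivery">
    <active pattern="structure"/>
    <active pattern="business-rules"/>
  </phase>
  <pattern id="structure">
    <!-- structural rules go here -->
  </pattern>
  <pattern id="business-rules">
    <!-- business rules go here -->
  </pattern>
</schema>
```

The validator is invoked with a phase name (or falls back to the defaultPhase), so the same schema document can serve several points in a workflow.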
XSD 1.1 does define a rich set of outcomes for validation (the PSVI), but these are predefined. The new version of Schematron (implemented in the XSLT2 version) supports a property mechanism which allows arbitrary properties to be assigned and calculated as part of validation: this extends Schematron to feed machines as well as people with useful information. RELAX NG's Murata Makoto and I have briefly discussed the possibility of adding properties to the left-hand side of RELAX NG grammar rules, to allow annotations to interact better with, or round-trip, hierarchical type systems (e.g. labelling RELAX NG rules with the equivalent component name from XSD, to support XSD's type declaration hierarchy in a nice clean layer).
The use of XSD in standards like WSDL continues to position XSD as a language for contracts and code generation rather than validation: they say "this is what we accept and generate" as a one-party declaration. Validation comes in more where two parties agree but need to verify.
The Rise and Fall of Extensibility
With XML, well-formedness checking is Draconian, but validation against a schema is not mandatory. So a different approach was taken to enabling large successful systems: modularity. The XML Namespaces spec allows the names in your documents to be partitioned into different namespaces, with each namespace being in effect a vocabulary or sublanguage. Generic software or libraries that understand only a single vocabulary, such as SVG, became possible.
In this view of modularity, documents would increasingly be composed from items in different standard namespaces: the most extreme implication of this was teased out by XSD 1.1, which does not provide any facility to specify the top-level element. The SGML notion of a document type was replaced entirely. Again, those people who still needed document types switched to RELAX NG or used a simple Schematron assertion.
Extensibility sprang up in another place too. As even the most pragmatic HTML people grokked the need for rich structured data, rather than use XML generic identifiers on elements, they started to use the HTML class attribute. As this proved insufficient for labels that do not necessarily have any rendering difference, or which live in the head element, the microformats movement arose: defining subvocabularies of values for rel attributes and so on, allowing collections of information on a subject (e.g. the different parts of an address) to be labelled.
While some microformats are defined using grammars, I don't imagine that anyone uses those schemas to validate HTML documents. You could use Schematron to report where a particular microformat has been used, or to require it, I suppose. But microformats are not very susceptible to validation, particularly because they are open-ended: how can you check for a spelling error if there is no limit to the allowed data values? Rigorous markup it ain't.
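For instance, a Schematron pattern could simply report each occurrence of the hCard microformat. The "vcard" class value is hCard's real root marker; the rest of the rule is a sketch:

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern id="microformat-usage">
    <!-- match any element whose class attribute contains the token "vcard" -->
    <rule context="*[contains(concat(' ', @class, ' '), ' vcard ')]">
      <report test="true()">An hCard microformat is used here.</report>
    </rule>
  </pattern>
</schema>
```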
Again some attempt has been made to overcome this by moving to a namespace (and prefix) system, so that keywords in microformats could be somehow traced back to vocabularies.
In the electronic business XML community, a related approach has been developing, spearheaded by Ken Holman: the UBL Methodology for Code List and Value Validation, which allows a much more rigorous and distributed approach.
Minor and Major Versions, and Subsets
But extensibility has its problems. The flexibility of not having mandatory validation creates a vacuum for unmanaged change. Extensible software is often written (as distinct from code generated from schemas) with policies to cope with unexpected elements, with HTML being the poster boy.
However, what if the change relates to something regarded as intrinsic to the information? It is popular to say that the application entirely determines how the information is processed (if you have pure data).
However, some information items have strong cohesion, so you cannot have one without the other. For example, if you changed the HTML td element to be called bigboybygum (in order to placate those voices in your head, I suppose), then you would not expect HTML software to work correctly merely by stripping the unknown element out and continuing with its contents.
In fact, one of the ways to characterize the difference between Schematron and the other schema languages is to say that the other schema languages are interested in specifying the characteristics of an element and its children, while Schematron is interested in specifying strong cohesion (and strong repulsion). Aphoristically: grammars are about coupling, and Schematron is about cohesion.
Coping with change is not an easy problem.
One very simple way to support versions is for the document to carry a version number. Schematron, RELAX NG and now draft XSD 1.1 have features to support using a version number in a schema.
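In Schematron this is just another assertion. A sketch, assuming a vocabulary whose root element carries a version attribute with these particular values:

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern id="version-check">
    <rule context="/*">
      <!-- refuse documents from versions this schema does not cover -->
      <assert test="@version = '1.0' or @version = '1.1'"
        >This schema only covers versions 1.0 and 1.1 of the vocabulary.</assert>
    </rule>
  </pattern>
</schema>
```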
Very often, however, version numbers are approximately useless. This is because their use is often only defined when the second version is coming, which is too late for software written for the first version to make use of it. This shortsightedness by committees (merely providing a slot for some future versioning, without considering that such versioning needs to include current documents) is one of the few constants in the world. XML's own versions are useless for this reason, for example.
And recently I called for the ODF TC to include better support for versioning. The result? Put it on the list for later.
So what policies might be minimally useful? In the case of version numbers, the main thing is to establish the backwards or forwards compatibility relationship, or to characterize the changes in terms of subsets.
The expectation for minor and major version numbers is that there would be minimal graceful degradation between documents and systems with the same major version and consecutive minor versions, but that the more the minor version changed, the more possibility of degradation. And the more that the major version changes, the more chance of high degradation or failure or information loss.
But this idea does not correspond to the strict world of schemas and validation, of mission-critical data and contracts and guaranteed interoperability. Pulling minor and major versions into that world involves, for example, defining that all documents valid against a previous minor version are valid against all subsequent minor versions. There are lots of different concrete policies that can be implemented.
Markup Compatibility and Extensibility
But while the minor/major version scheme is powerful when these simple subsetting relationships hold, it breaks down in many real circumstances, because a new version often brings a raft of changes, both breaking and non-breaking.
XML, being a Web technology, is particularly prone to the issue of how old software should handle new data. CORBA was sunk by its tight binding to schemas, and in XML the problem is not entirely eradicated.
Indeed, the problem is resurfacing with a vengeance with the arrival of the mass XML-in-ZIP formats, such as ODF and OOXML. We may laugh that IBM takes whatever Tweedle Dee position will oppose Microsoft's Tweedle Dum, but the recent discussions on the correct policy for handling ODF formulas (until OpenFormula is baked and out of the kitchen) show that the issue I identified at the beginning of this piece as central to the move from SGML to XML is still very much alive.
The disciplines and approaches needed for mission-critical data are different from those needed for popular data. How this played out in the recent formula discussions was Tweedle Rob huffing that his wedding planner's spreadsheet would break, while Tweedle Doug puffed about the need to be 100% complete and accurate. (I recognize that both POVs have their merits, and I think the user, not the vendor, should be the one to decide the policy: the dreaded popup warning about degradation is the right way to go.)
One powerful method is the most direct: label what is important. This smells good, because one of the axioms of markup is that adequate labelling solves everything.
SOAP led the way, with its rule that:
Elements tagged with the SOAP mustUnderstand attribute with a value of "1" MUST be presumed to somehow modify the semantics of their parent or peer elements.
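In an envelope, that looks like this. The envelope namespace is SOAP 1.1's real one; the Priority header block and its ext namespace are invented for illustration:

```xml
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Header>
    <!-- a receiver that does not understand this header block must fault -->
    <ext:Priority xmlns:ext="http://example.com/ext"
                  soap:mustUnderstand="1">high</ext:Priority>
  </soap:Header>
  <soap:Body>
    <!-- payload -->
  </soap:Body>
</soap:Envelope>
```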
MustUnderstand has been integrated into a more comprehensive management system, the Markup Compatibility and Extensibility (MCE) mechanism of OOXML. (See my Safe Plurality: Can it be done using OOXML's Markup Compatibility and Extensions mechanism?)
MustUnderstand works at the namespace level. Rather than merely providing information to allow an application or validator to reject a document because of unknown elements, MCE allows the application to first check between different variants (in different namespaces or not) of the data and to pick the best.
MCE also allows the document to specify a simple policy for unknown elements: do you ignore them entirely, or ignore the tags but process their contents? (This is a similar consideration to XSD's lax, strict and skip validation modes.)
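The shape of the mechanism, sketched: the mc namespace is MCE's real namespace, while the v2 vocabulary and the chart elements are invented. An MCE-aware consumer resolves the alternatives before the application proper ever sees them:

```xml
<doc xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
     xmlns:v2="http://example.com/v2"
     mc:Ignorable="v2">
  <mc:AlternateContent>
    <!-- take this branch only if the consumer understands the v2 namespace -->
    <mc:Choice Requires="v2">
      <v2:fancyChart/>
    </mc:Choice>
    <!-- otherwise fall back to the old markup -->
    <mc:Fallback>
      <oldStyleChart/>
    </mc:Fallback>
  </mc:AlternateContent>
</doc>
```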
Moving compatibility into a generic vocabulary of its own, suitable for a preprocessor (I wrote a prototype in XSLT in two days; it is not so hard) is a great innovation. And it is certainly suitable for use in other systems.
Dialects, Profiles and Feature Sets
However, even MCE seems not appropriate in some cases. It does not extend to data values such as enumerated values, for example: the application has to sniff the data to figure out which alternative is best to use.
What do you do when you have adjustments to the semantics of a standard, but you don't want a new namespace or new elements or attributes? What happens when different elements have different status, for example being deprecated, or 'compatibility' or 'transitional' but still available? What happens when several different editions of a standard are made over time or between organizations, each time with dialect alterations? What happens when implementers take ambiguous or incomplete specifications and implement them differently, creating in effect different dialects?
The approach that has been encouraged by XSD's notorious complex type derivation system has been to have a base schema which kitchen sinks all the various schemas and then to derive individual schemas for each dialect: for example, a schema for the new pure version and a version allowing the deprecated elements. The intent of XSD is that using the type derivation mechanisms, it is possible to take two derived schemas and identify exactly which elements are common (in the sense of being the same particles in the base content model) between them. Don't ask me how this really helps anything.
The most common answer for marking up dialects is just to give the dialect a name; the document specifies which dialect it uses, and software has to muddle on as best it can using that information.
A more sophisticated approach in these cases would be to have a feature-sets mechanism. Rather like SAX Features, this involves adding markup at the beginning of a document or branch giving details of the dialect used. Feature sets are most commonly used for specifying non-standard aspects of a document (e.g. ODF's application compatibility settings), but they can be used for standard aspects as well.
For example, take the case of OOXML's boolean values. The original ECMA 376 specification allowed the values "yes" and "no" in many cases. For some reason, the OOXML BRM decided that this was not neat, because the XSD datatypes only allow the values 0, 1, true and false. So IS29500 requires the XSD datatypes. Obviously I think this was a bad choice at the BRM, because it effectively meant that probably no existing OOXML document could conform to the OOXML standard. I have no idea what was going on in people's heads.
How to fix this? Well, obviously software will ignore the difference, and accept both. The standard should similarly be fixed to reflect and guide this transitional reality. But with feature sets, the header of the document could say
<feature name="ooxml-which-boolean" value="yes-no-only" /> for example.
This is obviously not so much of a problem when the lexical space of the new values in the new dialect is distinct from the old, but where there is overlap, it can cause problems. For example, consider the case where an attribute is added to an element so that each element can specify which notation (datatype) has been used for its contents. If the lexical spaces of the old and new notations overlap, an application made for the old vocabulary (i.e. before the attribute was added) will not recognize the attribute and may allow the data value, even though it is in the wrong notation.
Whether this is acceptable or not again comes down to the reliability-requirements of the data. Data for a blog stylesheet, or data for a life-support machine.
I don't know of any standard vocabulary for XML feature sets.
Conformance by transformation
Another kind of conformance is on the cards. Here, the document is first transformed into a set format, and then the transformed document is validated. This is useful when your validation is to ensure that required information items are present, but you are not concerned with superficialities or details of dialect.
For example, ISO DSRL supports namespace, element, attribute and value renaming. And I see that the GRDDL language allows you to attach a stylesheet to a document to extract RDF triples, which could in turn be validated to check for required patterns.
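The renaming half of such a preprocessor can be sketched in plain XSLT 1.0: an identity transform plus one template that renames the dialect element (reusing the facetious bigboybygum example from earlier) before validation proceeds:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- identity: copy everything through unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- rename the dialect element to the canonical name -->
  <xsl:template match="bigboybygum">
    <td><xsl:apply-templates select="@*|node()"/></td>
  </xsl:template>
</xsl:stylesheet>
```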
Conformance in the floating world
This article has looked at some trends and challenges for document validation.
The challenges come in two classes: first, raw capabilities for lifecycle support for standards; second, coping with transitions from technologies defined by implementation to technologies defined by standards with the necessary agility.
(There are many other minor challenges: validating sliced and diced documents distributed as parts in a ZIP archive, for example.)
Of course, we can only know in hindsight which mechanisms a standard will have needed to support it. And in the absence of hindsight, the trend is still to downplay special features for versioning and dialect issues. I have to confess that it is not at all clear to me which of the various mechanisms above are effective, or even whether they necessarily exclude each other.
It is not impossible that a very large, complex schema that is expected to evolve over the next decades might actually turn out to need MCE with different namespaces plus a major/minor version number system plus a feature set system plus a DSRL renaming system plus a profiling system! Let's hope not: it is difficult to get standards makers to adopt even one, let alone multiple ones. This may be a rational response to the intuition that each individual solution will not go far enough to be useful, I suppose. Standards-making stakeholders working on the XML-in-ZIP specifications will be increasingly grappling with this kind of issue, just as a function of the maturity of XML.