Standards that must serve a dynamic market are a problem. We all like nice stable standards, or at least the idea of them, but building our standards processes around the notion that we get everything right and complete the first time is folly: it may be a worthy goal, but in many cases even the most perfect initial standard will immediately come under evolutionary pressure.
Isn't this the problem that XML Namespaces is supposed to address? Yes, to an extent: XML Namespaces lets us separate a document cleanly into different vocabularies, each targeted at a specific part of it: a namespace for paragraphs and document parts, a namespace for maths objects, a namespace for metadata, and so on.
But XML Namespaces provide only a medium-sized grain: they help neither when we want to implement some of a namespace but not all of it, nor when we want to supersede the namespace with a whole new vocabulary. Namespace URIs almost always span different generations of the schemas for a vocabulary: the XSLT 2.0 namespace is the same as the XSLT 1.0 namespace, for example, and it will not be surprising if XHTML keeps its namespace through multiple versions.
During the OOXML standardization proceedings, the ISO participants felt that one particular sub-technology, Markup Compatibility and Extensibility (MCE), was potentially so useful to other standards that it was brought out into its own part. It is now IS 29500:2009 Part 3: you can download it in its ECMA form here; it has only about 15 pages of substantive text.
The particular issue that MCE addresses is this: what is an application supposed to do when it finds markup it wasn't programmed to accept? This could be extension elements in some foreign namespace, but it could also be elements from a known namespace: the case where a document was made against a newer version of the standard than the application supports.
The approach taken is very practical and, I think, user-oriented: an application that doesn't understand some new kind of markup should fail only if that new markup is essential to the document. Otherwise it can use various other strategies, the most straightforward of which is simply to ignore the new markup.
And, to complement this, the document is allowed to carry alternative versions of the same content in different namespaces, with the application choosing the version it is happiest with. In particular, this allows one technology to be superseded by another without much pain: indeed, it also allows applications with legacy requirements to keep their own legacy native format. As with MIME media types and mail readers, the user selects the format they want.
This is a kind of having your cake and eating it too, you might think; the smart thing that gives it a hope of working is that MCE also provides some attributes, PreserveElements and PreserveAttributes, which let you (the standards writer or the extended-document developer) list the elements and attributes that do not need to be stripped when modifying some markup.
I think standards developers who are facing the cat-herding issue of multiple implementations and the need for all sorts of extensions should seriously consider the MCE approach.
However, that does not necessarily mean adopting IS 29500:2009 Part 3. Even though MCE is couched in terms of a generic markup pre-processor (and, indeed, it would be a good thing to adopt by reference into ISO DSDL), which would make it a good thing if everyone used the same namespace, in practice it will frequently be just a streaming pre-process bundled in with reading and serialization. To be more blunt, some groups may find it preferable to adopt MCE using their own namespace, thereby leaving themselves the option of taking it in their own direction in the future, if their needs dictate: OOXML took this route with its "SMIL-like" animation system, and ODF did the same with its dialect of SVG.
So how does MCE work? For a start, it doesn't preclude an implementation from merely refusing to open any document with unexpected markup. While we are used to Draconian error handling for well-formedness and schema validity, it is simply not a feasible approach for many kinds of data, in particular for user-facing distributable documents. Aunt Maude will not be impressed if her word processing document could not be opened merely because her MaudeOffice application added a single immaterial but non-standard metadata element in its own namespace.
There are three basic aspects of MCE:

- Compatibility attributes
- Alternative content
- Namespace subsumption

Each is relatively straightforward.
MCE works by means of a handful of standard attributes, typically placed on the root element (or on the element at a namespace branch).
By default, an application is supposed to signal an error if foreign-namespace markup is found. The Ignorable attribute specifies a list of namespaces which needn't cause an error: the allowed extensions.
The MustUnderstand attribute specifies a list of namespaces which the application must grok if it is to cope with the information in the markup. This addresses the problem that some extensions are decorative, harmless and banal, while others are complete, possibly important, chunks of information without which the document would be corrupt.
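As a minimal sketch of how these attributes look in a document (the mce namespace URI and the Ignorable attribute are from IS 29500 Part 3; the document and extension namespaces, and their element names, are invented for illustration):

```xml
<doc xmlns="http://example.com/standard-doc"
     xmlns:mce="http://schemas.openxmlformats.org/markup-compatibility/2006"
     xmlns:ext="http://example.com/2009/fancy-extensions"
     mce:Ignorable="ext">
  <para>Text that any conforming consumer understands.</para>
  <!-- The ext prefix is listed in mce:Ignorable, so a consumer that
       does not know that namespace may skip this element rather
       than refusing to open the document -->
  <ext:sparkle colour="gold"/>
</doc>
```

Had the root declared mce:MustUnderstand="ext" instead, an application without ext support would be obliged to reject the document.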
So some markup must be understood, some can be ignored, and the rest causes an error. What should be round-tripped? There are three attributes, ProcessContent, PreserveElements and PreserveAttributes, which allow a detailed specification, by name within a namespace, of the treatment for ignorable markup. The standard says that these are suggestions rather than requirements: MCE does not seem to impose any kind of requirement that a strictly conforming minimal implementation be able to round-trip extensions.
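Continuing the invented extension namespace from the sketch above, ProcessContent lets the document say that an ignorable wrapper element may be discarded while its content is still processed:

```xml
<doc xmlns="http://example.com/standard-doc"
     xmlns:mce="http://schemas.openxmlformats.org/markup-compatibility/2006"
     xmlns:ext="http://example.com/2009/fancy-extensions"
     mce:Ignorable="ext"
     mce:ProcessContent="ext:highlight">
  <!-- A consumer ignorant of ext may drop the ext:highlight wrapper,
       but should still process the standard para inside it -->
  <ext:highlight>
    <para>Important paragraph.</para>
  </ext:highlight>
  <!-- Adding mce:PreserveElements="ext:note" to the root would ask
       editing applications not to strip ext:note when they modify
       and re-save the document -->
</doc>
```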
The compatibility markup above sets the scene for a really neat feature: alternative content. This is really close to my heart: it is exactly the kind of plurality support that I have been banging on about for more than a decade. I don't know that it is a complete solution, but it is a simple one that looks like it meets the 80/20 point.
The heart of the matter is an element, AlternateContent. It is on this node in particular, and on its immediate children, that the various attributes mentioned before would go. AlternateContent is a container: it contains various Choice elements and optionally ends with a Fallback element. As you would expect, AlternateContent is a kind of switch statement. The application selects the particular Choice (or, failing that, the Fallback) by matching the namespaces it understands against those each Choice requires, to choose the best possible alternative.
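A sketch of the switch in action (the mce machinery and the Requires attribute on Choice are as in IS 29500 Part 3; the two drawing namespaces are invented for illustration):

```xml
<mce:AlternateContent
    xmlns:mce="http://schemas.openxmlformats.org/markup-compatibility/2006"
    xmlns:v2="http://example.com/2009/drawing-v2"
    xmlns:v1="http://example.com/2005/drawing-v1">
  <!-- Preferred: taken by applications that understand the v2 namespace -->
  <mce:Choice Requires="v2">
    <v2:shape kind="star" points="5"/>
  </mce:Choice>
  <!-- Next best: the older vocabulary -->
  <mce:Choice Requires="v1">
    <v1:star/>
  </mce:Choice>
  <!-- Last resort for everyone else -->
  <mce:Fallback>
    <text>[A five-pointed star]</text>
  </mce:Fallback>
</mce:AlternateContent>
```

The consumer walks the Choice elements in order and takes the first one whose Requires namespaces it fully understands, dropping through to the Fallback if none match.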
To understand why this might be useful, consider MathML. David Carlisle, the long-term W3C MathML editor, once sent me some interesting email commenting that MathML was developed as an exchange format rather than a native format: in other words, in the full expectation that a particular application might have its own format that could contain all sorts of application-specific information, or suit that application's data structures. Mathematica, for example, has executable math, while a Web browser doesn't.
Rather than telling Wolfram Research "Take this out of MathML", or allowing extensions into MathML which would quite possibly be confusing and complex even if they could be shoe-horned in, a live-and-let-live approach is taken: keep private semantics in a private native format, but let all common semantics be exchangeable in the standard exchange format.
AlternateContent tags make these alternatives into first-class objects.
An interesting wrinkle: what happens when a namespace has been obsoleted, or when a new namespace with some new markup has been adopted? In MCE, this is just a matter of having an AlternateContent section with an alternative for each generation.
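Sketching that with an invented pair of namespace generations (only the mce markup is from the spec):

```xml
<mce:AlternateContent
    xmlns:mce="http://schemas.openxmlformats.org/markup-compatibility/2006"
    xmlns:new="http://example.com/2010/metadata"
    xmlns:old="http://example.com/2004/metadata">
  <!-- Applications built against the new namespace take this branch -->
  <mce:Choice Requires="new">
    <new:creator role="author">Maude</new:creator>
  </mce:Choice>
  <!-- Everyone else gets the obsoleted vocabulary -->
  <mce:Fallback>
    <old:author>Maude</old:author>
  </mce:Fallback>
</mce:AlternateContent>
```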
This is a rather brute-force method. I think it could be usefully augmented by ISO DSDL, to allow token remapping (namespace URIs, element names, attribute names, enumerations, etc.).
A tool for re-proprietarization?
The eager-minded should be thinking by now: Doesn't MCE make a conformance hole big enough to drive a truck through?
If a producing application is free to add extensions willy-nilly and then require that consuming applications understand them, don't you get documents that are "standards-conformant" but actually interoperable only between a particular vendor's applications?
To call a spade a spade, couldn't Microsoft use MCE to re-proprietarize OOXML, or any other standard that adopted MCE, all the while claiming standards conformance?
In the absence of some simple extra rules, the answer could be "yes": I think it would be easy to make up some examples, in particular with a null fallback case. I wouldn't go paranoid and see MCE as a Trojan Horse for proprietary extensions, though.
So how do we use MCE safely to prevent this, even if we consider it far-fetched?
The answer is surprisingly simple, as far as I can see.
- MCE attributes are only allowed on AlternateContent elements and their immediate children.
- All foreign markup must be in an AlternateContent section: that is, inside a Choice.
- Each AlternateContent must have at least one Fallback section, or another Choice, with standard elements and no foreign markup.
In other words, you can use any crazy markup extensions you like, as long as there is at least one alternative content that uses a standard form. Safe Plurality.
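For instance (the vendor namespace and its elements are invented; the table markup stands in for whatever the host standard defines), the extension below is "safe" because the Fallback carries a purely standard rendition:

```xml
<mce:AlternateContent
    xmlns:mce="http://schemas.openxmlformats.org/markup-compatibility/2006"
    xmlns:acme="http://example.com/acme-office/private">
  <mce:Choice Requires="acme">
    <!-- Vendor-private form, as rich as the vendor likes -->
    <acme:smart-table source="sales.db" query="Q3"/>
  </mce:Choice>
  <mce:Fallback>
    <!-- Standard form: any conforming application can read this -->
    <table><tr><td>Q3 sales: 1,234</td></tr></table>
  </mce:Fallback>
</mce:AlternateContent>
```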
(Toe-dippers might even go as far as restricting where AlternateContent elements can appear, so that they sit only on clear namespace boundaries. That would allow alternative formats but not give any help for minor variations or dialect support.)
A correspondent asks: "What about where there is an active attempt to embrace and extend, where the foreign markup is marked Ignorable but really it should be MustUnderstand?"
If this were considered a problem, then an even stricter restriction can be made: there should be at least one alternative with no foreign markup at all. (But at least some non-foreign markup!)
This could be tested by a Schematron rule such as the following (untested; note that plain XPath needs namespace-uri() here):

  <sch:rule context="mce:Choice//*[not(namespace-uri()=' ...standard namespaces ... ')]">
    <sch:let name="theAlternatives" value="ancestor::mce:AlternateContent[1]"/>
    <sch:assert test="($theAlternatives/mce:Choice | $theAlternatives/mce:Fallback)/*
        [namespace-uri()=' ...standard namespaces ... ']">
      Every AlternateContent should have at least one choice that uses
      standard elements.
    </sch:assert>
    <sch:assert test="($theAlternatives/mce:Choice | $theAlternatives/mce:Fallback)
        [not(.//*[not(namespace-uri()=' ...standard namespaces ... ')])]">
      There should be at least one Choice or Fallback which only uses
      standard elements.
    </sch:assert>
  </sch:rule>
(However, if you were indeed this suspicious, you would have put some conformance wording in the standard too. But there will always be loopholes for dirty tricks: for example, in ODF 1.0 there seems to be no clause that prevents an application from storing its text in a Caesar cipher inside the XML.)