The old SGML idea of DTDs was primarily a gatekeeper function: it was (incoming) validation rather than (outgoing) verification. The idea was that by requiring validation, invalid documents (bad data) would not propagate unchecked through a system. More than that, the location in the production process where the invalidity occurred would be clear: the recipient of the invalid document can send it back to the person or process that caused the problem.
This is a nice model, and it gave a tacit software engineering discipline that made SGML successful for many large projects. Even now, gateway functions are useful in Web Services systems. However, the idea that on the WWW the recipient can send back documents for re-work is obviously bogus.
For the last few years, I have been commenting that our current validation languages are not rich enough to cope with the current demands of the large consumer schemas like ODF and OOXML. The sky isn't falling, so there is no need to panic; but there is a need to get workable fixes out as soon as possible.
The trouble is that the organizing principle of most schema languages (XSD, RELAX NG) is the namespace. But we have no schema languages that treat namespaces as first class objects, or allow parameterization of them. (Both RELAX NG and XSD 1.1 do allow the use of attributes on top-level elements to select document variants, I should point out, but while this is a great feature, I don't think this goes far enough.)
Both OOXML and ODF have had substantial discussions relating to versioning and extensions (the two are intertwined): ODF has gone with a head-in-the-sand hack-something-later approach; OOXML has a very good mechanism (MCE) which addresses the ability to add shiny new extensions in new namespaces well, but does not address changes within a namespace.
The folly of the ODF approach can be seen because it bit OOXML: OOXML did not start off with a workable version mechanism inside namespaces, so SC34 ultimately decided it was safer (for preventing data corruption on load in legacy applications) to put almost everything into new namespaces. The hack-something-later approach is hubristic: it is asking for trouble. Personally, I have never been a fan of OOXML changing its namespace: I think it muddies the waters for implementers.
How can Schematron help?
I think Schematron can play a part, because a Schematron schema is not hooked to a single namespace the way XSD and RELAX NG are.
With Schematron it is possible to express rules that:
- Report which flavour of OOXML or ODF a document has (such as ODF, ODF 1.0, ODF 1.0 with extensions, ODF 1.1, ODF 1.1 with extensions, ODF 1.2, ODF 1.2 with extensions, ECMA 376 first ed., IS29500:2008 Strict, IS29500:2008 Transitional, IS29500:2008 as amended 2010 Strict, IS29500:2008 as amended 2010 Transitional)
- Report whether the document is standard but has flaws that might require users to select some fix, due to legacy issues.
- Report which feature sets the document has. (In the case of OOXML and ODF, these are largely similar to which namespaces are found, but a more useful module system could be worked up.)
- Report on policy issues and profiles: for example I could imagine that Brazil might make a profile of ODF that disallows extensions and macros; or the EU could report when OOXML documents use Transitional rather than Strict.
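As a rough sketch of the flavour and policy reports in this list, the following schema looks only at the root element of a document part. It is an illustration under stated assumptions rather than a complete flavour detector: the pattern id, the w-trans and w-strict prefixes and the role value are hypothetical names chosen here, and it handles only ODF root elements and WordprocessingML main document parts. Note that the Transitional namespace is shared with ECMA 376 first ed., so the namespace alone cannot tell those two apart.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt">
  <!-- ODF office namespace -->
  <sch:ns prefix="office" uri="urn:oasis:names:tc:opendocument:xmlns:office:1.0"/>
  <!-- WordprocessingML: Transitional (also ECMA 376 first ed.) and Strict -->
  <sch:ns prefix="w-trans" uri="http://schemas.openxmlformats.org/wordprocessingml/2006/main"/>
  <sch:ns prefix="w-strict" uri="http://purl.oclc.org/ooxml/wordprocessingml/main"/>

  <!-- Hypothetical pattern id: reports flavour from the root element only -->
  <sch:pattern id="flavour">
    <sch:rule context="/office:*">
      <sch:report test="@office:version">
        ODF document declaring version <sch:value-of select="@office:version"/>.
      </sch:report>
      <!-- Assumption: a missing attribute suggests a pre-1.2 document,
           since ODF 1.2 requires office:version on root elements -->
      <sch:report test="not(@office:version)">
        ODF document with no office:version attribute: probably earlier than ODF 1.2.
      </sch:report>
    </sch:rule>

    <sch:rule context="/w-strict:document">
      <sch:report test="true()">
        WordprocessingML document in the Strict namespace.
      </sch:report>
    </sch:rule>

    <!-- Policy-style report: flag Transitional where Strict is preferred -->
    <sch:rule context="/w-trans:document">
      <sch:report test="true()" role="policy">
        WordprocessingML document in the Transitional (or ECMA 376 first ed.)
        namespace rather than Strict.
      </sch:report>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Because these are reports rather than assertions, a document is not rejected for being one flavour or another; the schema simply says what it found, and a gateway or procurement check can then decide what to do with that.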
But Schematron can also be used for implementation-specific issues:
- Report which vendor-specific extensions or subsets or profiles or flaws the document has (such as OpenOffice 3.4 ODF, Office 2007 OOXML).
- Report whether, when using MCE, there are open standard versions of each fragment as well as versions in non-standard extensions.
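The MCE check can be approximated structurally. The sketch below only asserts that every mc:AlternateContent block carries an mc:Fallback branch; it makes no attempt to verify that the fallback content is itself in standard namespaces, which would need a list of those namespaces. The pattern id is a hypothetical name.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt">
  <!-- Markup Compatibility and Extensibility (MCE) namespace -->
  <sch:ns prefix="mc" uri="http://schemas.openxmlformats.org/markup-compatibility/2006"/>

  <!-- Hypothetical pattern id -->
  <sch:pattern id="mce-fallbacks">
    <sch:rule context="mc:AlternateContent">
      <!-- A consumer that understands none of the mc:Choice branches
           is left with nothing to use if there is no mc:Fallback -->
      <sch:assert test="mc:Fallback">
        This mc:AlternateContent block has no mc:Fallback: applications that
        do not understand the extension namespaces may drop its content.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```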
For example, I found a month ago that when saving OOXML with AbiWord, the XML contained a wrapper element that belonged to the Word 2003 XML format: because it was in a different namespace, systems reading the document would strip the element and its contents out. The result: all pages blank. Now it is too simplistic to say "Oh, this is invalidity that is AbiWord's fault", to which their developers will say "It is only a beta; the next will be better", and so on.
The point I would like to make is that there will never be a time when everyone has caught up, and models of interoperability based entirely on the promise that sooner or later everyone will catch up will just lead to disappointment. Now, of course, to an extent the ODF approach has been to try to lower the bar at feature level compared to Office (though ODF does have some features OOXML does not), but even there we will have a moving target: ODF NG, for example.
So I wonder if it would be useful to have some kind of Open Source Schematron schema where we could collect tests and diagnostics for the various flavours. Developers could refer to it when creating their document loaders; it could be part of validation and gateway services; it would help gurus and archivists determine which flavour was used; it would help procurement people check that the appropriate standard was being used.
For example, taking that AbiWord beta bug again, I would imagine a schema like this:
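<sch:pattern>
  <!-- The office2003xl, ecma-ooxml-xl and mce prefixes are assumed to be
       bound by sch:ns declarations elsewhere in the schema. Further
       namespaces can be added to each value with the | union operator. -->
  <sch:p>Variable has-office-2003-namespace is true if there is any
  element in an Office 2003 namespace that is not in an MCE section.</sch:p>
  <sch:p>Variable has-ecma-ooxml-namespace is true if there is any
  element in an ECMA 376 1st ed. namespace that is not in an MCE section.</sch:p>
  <sch:let name="has-office-2003-namespace"
    value="boolean( //office2003xl:*[not(ancestor::mce:*)] )" />
  <sch:let name="has-ecma-ooxml-namespace"
    value="boolean( //ecma-ooxml-xl:*[not(ancestor::mce:*)] )" />
  <sch:rule context="/">
    <sch:report
      test=" $has-office-2003-namespace and $has-ecma-ooxml-namespace "
      diagnostics="a1 a2" >
      This document mixes different incompatible versions (Office 2003 + Ecma 376):
      text and other items may disappear when opened.
    </sch:report>
  </sch:rule>
</sch:pattern>
<sch:diagnostics>
  <sch:diagnostic id="a1" role="consumer">
    This issue is known to occur with a 2010 beta version of AbiWord OOXML loader.
    If that is the case, and if that system is still available, re-open the document
    and save it to, say, ODF.
  </sch:diagnostic>
  <sch:diagnostic id="a2" role="technical">
    This issue may be resolved by editing the XML contents to strip out the elements
    in the namespace with the string "2003" (keep their subelements.)
  </sch:diagnostic>
</sch:diagnostics>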
Now this is very far from what is conventionally thought of as the role of schemas. But that conventional view leaves us fumbling when real-life messiness intrudes on our nice, neat ideas.