W3C: Please put XSD 1.1 on hold and address the deeper issues

The subset of XSD that we need for more reliable databinding etc looks like RELAX NG

By Rick Jelliffe
May 13, 2009 | Comments: 5

Here is a letter I have mailed to the W3C Technical Architecture Group (TAG) and to the W3C XML Schemas Working Group, regarding the XML Schemas 1.1 proposed recommendation.

I would like to register with the W3C TAG and the W3C XML Schema WG that, on having considered the XSD 1.1 draft, I think it is exactly the wrong direction for the WG and W3C to be taking. That is, while each individual decision may be well-founded, and each change justifiable and beneficial, the total effect will not help get us out of the mess that XML Schemas has created, but mire us further in it.

I see this as highly analogous to the situation with the SGML 5-year review at ISO in the early 1990s. Many small solutions to individual problems had been made, and many whiz-bang new ideas added, and there were many worthy new things on the cards.

But the fundamental problem was that SGML was too big. The approach was of course to slim it down to XML, and to reintroduce many of the cast-off features and ideas (DTDs, modules) into layers on top of XML (schemas, namespaces).

(A further parallel may indeed be that a change in forum was necessary in order to get this change: in a certain sense the original developers of SGML were "part of the problem" not "part of the solution." Not because of malice or ineptitude; quite the reverse. The dynamics, personalities and goals of the working group were only capable of change in the direction of neatness and expansion. Indeed, I know that many on the W3C Schema WG are acutely aware of these issues, but perhaps the stars have never been aligned to address this. Since the W3C TAG itself has such a rich representation from the XML Schema WG, I hope that they may be conduits for fresh thinking from the TAG and not conduits for rationalizations from the Schema WG.)

Comments on the problem

That XML Schemas is in a crisis and has failed to meet some of its basic goals can be seen from the W3C work on XML Schema Patterns for Databinding. That two such comprehensive lists of patterns (basic and advanced) were necessary is a sign of bad layering.

Indeed, when we consider the original requirements document for XML Schemas, http://www.w3.org/TR/NOTE-xml-schema-req, its shortcomings become more manifest. For example, of the Usage Scenarios listed there, XML Schemas has not been successful for

4) Traditional document authoring/editing governed by schema constraints.
(DTDs and RELAX NG have made large inroads in this area: for example, OASIS ODF. I note that even for the XML Schemas for ISO OOXML [DIS29500], which had been written to use a very conservative subset of XML Schemas, it turned out that Xerces would not accept schemas allowed by Microsoft's validators, both of which are well-regarded and mature implementations.)

5) Use schema to help query formulation and optimization.
(The current draft has to change its type model to fit XQuery)

6) Open and uniform transfer of data between applications, including databases
(See the databinding comments above.)

Furthermore, even in the online application scenarios 1, 2, 3, and 7, the heavyweight processing that XML Schemas requires and the complexity of its concepts have meant that it is rarely actually used for validation, even as it is so inadequate for databinding.

So if it is not congenial for validation, and it is not a success for reliable databinding, is it at least good for documentation? In fact, the verbosity of XML Schemas makes it utterly unusable for presenting to humans to understand a document's structure. In this regard, I note that the recent HTML 5 drafts have reverted to something akin to RELAX NG Compact Syntax (which looks like DTD content models and has a standard mapping to the XML form.)
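To make the verbosity point concrete, here is a small sketch in Python that writes one content model twice, once as an XSD fragment and once in RELAX NG Compact Syntax, and compares their sizes. The `address` vocabulary is invented for this illustration and is not taken from any real schema; the two forms are meant to be roughly equivalent, not formally identical.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary, invented for this illustration only.
XSD = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="address">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="street" type="xs:string" maxOccurs="unbounded"/>
        <xs:element name="city" type="xs:string"/>
        <xs:element name="postcode" type="xs:string" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

# Roughly the same content model in RELAX NG Compact Syntax.
RNC = """\
element address {
  element street { text }+,
  element city { text },
  element postcode { text }?
}
"""


def nonblank_lines(text):
    """Return the lines of text that contain something other than whitespace."""
    return [line for line in text.splitlines() if line.strip()]


# Sanity check only: confirm the XSD fragment is at least well-formed XML.
ET.fromstring(XSD)

xsd_lines = len(nonblank_lines(XSD))
rnc_lines = len(nonblank_lines(RNC))
print(f"XSD: {xsd_lines} lines, {len(XSD)} characters")
print(f"RNC: {rnc_lines} lines, {len(RNC)} characters")
```

The counts depend on formatting choices, so they are only indicative, but the compact form reads much like a DTD content model while the XSD form buries the same three children in nested wrappers.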

Further, if XML Schemas is not usable for documentation, is it useful for generating useful validation messages for humans? The answer is clearly that the messages produced by implementations of XML Schemas are not much use, particularly the obscure structural messages. As someone who has both implemented most of XML Schemas (a converter to Schematron) and who has customized the messages from various schema processors, I don't see how some of the messages can be made human-friendly, since they relate to obscure rules in XML Schemas.

And if XML Schemas is not good for validation, does it redeem itself by winning over implementers with a good standard? It is no secret that the XML Schemas Structures standard is the very model of an impenetrable, guru-inducing standard. But, having worked in the W3C XML Schema WG at the time of the first release, and deeply respecting the editors and working group members, I believe this is not a fixable fault with the documentation, but a reflection of the brain-numbing technology.

I have two personal anecdotes about this. In 2001 I had a contract from Manning Press to write a book on XML Schemas, in particular explaining the standard. After three months of full-time work on this, I abandoned the project and repaid my advances at my own loss, because I decided that trying to make a silk purse out of a sow's ear would be either impossible or irresponsible. The second anecdote is that when making our implementation of XML Schemas (a project initially funded by JSTOR which is making its leisurely way towards open source) we twice had programmers threaten to resign because working on the XML Schemas implementation was too unpleasant. One of these programmers was subsequently headhunted by Microsoft and the other is currently working on his PhD in Computer Science, so they are not idiots or defeatists; and we have a history of high retention rates.

Looking further at the original requirements, we see the following purported design principles:

1. more expressive than XML DTDs;
2. expressed in XML;
3. self-describing;
4. usable by a wide variety of applications that employ XML;
5. straightforwardly usable on the Internet;
6. optimized for interoperability;
7. simple enough to implement with modest design and runtime resources;
8. coordinated with relevant W3C specs (XML Information Set, Links,
Namespaces, Pointers, Style and Syntax, as well as DOM, HTML, and
RDF Schema).

I contend that it is apparent that the changes proposed for XML Schemas 1.1 do nothing to address the shortfalls in meeting these goals that have been a bugbear since XML Schemas 1.0. In particular, it fails goals 4 to 8:
4. see the databinding and related comments above
5. there is nothing straightforward about XSD, and it is too verbose to download
6. see the databinding and related comments above: it is manifestly a disaster for interoperability
7. XSD is manifestly not simple to implement
8. the PSVI (post-schema validation infoset) represents a fundamental break with the basic relevant XML Specs. Indeed, it might be said that XML Schemas are not schemas for documents, but schemas for databases that have an XML serialization. The two are not the same.

So, allowing for argument that XML Schemas may be so deficient in these areas and so complex, can it justify itself as allowing very sophisticated document constraints? Clearly the answer is no, certainly for Part 1 Structures. The rival language to XML Schemas (i.e. OASIS/ISO RELAX NG) is far more powerful, and the alternative (which is also a complement) for non-grammar/non-datatype constraints and assertions (i.e. ISO Schematron) is far more powerful.
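As an illustration of the kind of non-grammar constraint meant here, the sketch below checks a co-occurrence rule ("a `field` of type `range` must carry both `min` and `max`") in the context, test and message style that Schematron rules use. It is plain Python with the standard library, not Schematron itself, and the `form`/`field` vocabulary, the `RULES` table and the `check` helper are all invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document, invented for this illustration.
DOC = """\
<form>
  <field name="age" type="range" min="0" max="120"/>
  <field name="score" type="range" min="0"/>
  <field name="nickname" type="text"/>
</form>
"""

# Each rule is (context path, test function, message), loosely analogous to
# a Schematron rule context, an assert test, and the assertion text.
RULES = [
    (
        ".//field[@type='range']",
        lambda el: "min" in el.attrib and "max" in el.attrib,
        "a range field must have both min and max",
    ),
]


def check(xml_text, rules):
    """Return a human-readable message for every element that fails a rule."""
    root = ET.fromstring(xml_text)
    failures = []
    for context, test, message in rules:
        for el in root.findall(context):
            if not test(el):
                failures.append(f"{el.get('name')}: {message}")
    return failures


failures = check(DOC, RULES)
for failure in failures:
    print(failure)
```

The point of the design is that the message is written by the schema author in the vocabulary of the document, rather than generated from the internals of a grammar.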

XML Schemas has a very poor bang-per-buck ratio. There are many significant classes of document structures it is incapable of being useful for: for example, SVG, XSLT. Indeed, it may be argued that these kinds of tricky structures are exactly the kinds of structures most calling out for validation.

Finally, if the language is not very good for structural constraints, is it at least good for document evolution? The answer here again is no. Experience with large schemas has shown that the XML Schemas complex type derivation facilities are quite bogus: the type extension mechanism introduces not only an extra concept, but causes a fragile base-class-like problem for maintenance. And the type derivation by restriction mechanism does not simplify declarations.

I do have many other specific issues as well, which I won't bore readers with: they can be summarized by the comment that XML Schemas 1.1 may address the kinds of problems that you might want to validate in 1999, but not the kinds of problems found in XML as practised in 2009: for example, foreign code lists, and the abandonment of large XML documents in favour of either XML-in-ZIP or XML-on-filesystem collections of smaller documents linked by URL and other IDs.

I should acknowledge that there are indeed many successful uses of XML Schemas. I see no evidence that these successful uses are because of any particular excellence in XML Schemas that would not be possible in other schema languages.

A proposed solution

I therefore ask the TAG to instruct, influence or otherwise encourage the XML Schema Working Group to put XSD 1.1 on hold and instead to work on a radical relayering into a two-layer model. Some of the XSD 1.1 changes would make their way into the basic layer, some would make their way into the advanced layer which would be equivalent to the proposed XSD 1.1.

In concrete terms, I propose this:

1) A radically simpler schema language, compatible as much as possible with the current XSD 1.0 syntax, be created. It should have the following properties:

i) It should follow ISO RELAX NG in all relevant design decisions, and be trivially translatable to and from RELAX NG.
ii) In doing so, it should remove as many as possible of the patterns identified as problematic for databinding.
iii) It should have no concept of structural type derivation: no extension or restriction of complex types. It need not support any simple type derivation or facets, though it would support the built-in derived types of XSD.
iv) It should have no obscure rules such as UPA that are not required by RELAX NG.
v) It should have no constraints or requirements for streamable implementation.

2) A secondary layer which adds:
i) Complex type derivation
ii) UPA, naming, and other obscure rules
iii) Features that are problematic for databinding, or that are needed to allow streaming validation, would be allowed

The bottom line is that the new simpler language would not be type-based, nor would it require 1-unambiguous schemas. Both those things, which are currently presented as core to the mechanics of XML Schemas, would become additional assertions to be used or checked by the full language and its processors.
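To show why a validator need not require 1-unambiguous content models, here is a toy sketch in Python of derivative-based matching, the general technique associated with RELAX NG validation. The content model `(a, b) | (a, c)` is a textbook case that is not 1-unambiguous, since on seeing `a` a processor cannot tell which branch it is in without lookahead. The class names and the `matches` helper are invented for this illustration, and the sketch handles only element names, sequence and choice; it is not any real validator's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Empty:
    """Matches the empty sequence of children."""


@dataclass(frozen=True)
class NotAllowed:
    """Matches nothing at all."""


@dataclass(frozen=True)
class El:
    """Matches exactly one child element with the given name."""
    name: str


@dataclass(frozen=True)
class Seq:
    left: object
    right: object


@dataclass(frozen=True)
class Choice:
    left: object
    right: object


def nullable(p):
    """True if pattern p can match the empty sequence."""
    if isinstance(p, Empty):
        return True
    if isinstance(p, Seq):
        return nullable(p.left) and nullable(p.right)
    if isinstance(p, Choice):
        return nullable(p.left) or nullable(p.right)
    return False  # El and NotAllowed


def deriv(p, name):
    """Return what remains of p after consuming one child called name."""
    if isinstance(p, El):
        return Empty() if p.name == name else NotAllowed()
    if isinstance(p, Choice):
        # Both branches are followed at once, so no lookahead is needed.
        return Choice(deriv(p.left, name), deriv(p.right, name))
    if isinstance(p, Seq):
        first = Seq(deriv(p.left, name), p.right)
        if nullable(p.left):
            return Choice(first, deriv(p.right, name))
        return first
    return NotAllowed()  # Empty and NotAllowed consume nothing


def matches(p, names):
    """True if the sequence of child element names is valid against p."""
    for name in names:
        p = deriv(p, name)
    return nullable(p)


# (a, b) | (a, c): both branches start with the same element name.
MODEL = Choice(Seq(El("a"), El("b")), Seq(El("a"), El("c")))

for children in (["a", "b"], ["a", "c"], ["a"], ["a", "d"]):
    print(children, matches(MODEL, children))
```

Because both branches of the choice are carried forward together, the matcher never has to commit early, which is the commitment a 1-unambiguous restriction exists to make possible.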

There are many details and issues, of course, but I believe this is more straightforward than may be thought. In any case, it is necessary to bring XML Schemas to its full potential for being useful on the web, rather than the hindrance and snare it currently is. There is a misapprehension, in particular, that RELAX NG cannot be used for databinding; in fact, the Java API for ODF was created by a databinding tool for RELAX NG, so this is hardly true.

Cheers
Rick Jelliffe

Editor, ISO/IEC 19757-3:2006 Information technology -- Document Schema Definition Language (DSDL) -- Part 3: Rule-based validation -- Schematron

Invited expert, ISO/IEC SC34 WG1 Schema languages
Invited expert, ISO/IEC SC34 WG4 Office Open XML
Formerly Australian delegate, ISO/IEC WG 8 (e.g. SC34)
Formerly member (for Academia Sinica), W3C XML Schema WG
Formerly invited expert, W3C I18n SIG
Formerly invited expert, W3C XML WG (e.g. SIG)

Author, The XML & SGML Cookbook, Recipes for Structured Information Management,
Charles Goldfarb series, Prentice Hall, 1998.



5 Comments

Spot on.

Re your proposed solution: What's wrong with simply using Relax NG, instead of developing yet another schema language?

What's missing, IMO, is wider support for Relax NG (if JustSystems would consider implementing Relax NG in XMetaL, it'd solve a lot of my current problems and make me a very happy man), but other than that, it already does everything I need from a schema language.

Best,

/Ari

Ari: We already have two syntaxes for RELAX NG, both of which serve their purposes well. The idea of a RELAXSD (RELAX NG in XSD syntax) is that this would allow migration and provide an impetus for the XSD WG to move full XSD towards RELAX NG rather than gratuitously away from it, as they are currently doing. People who have already discovered RELAX NG would certainly not need it.

I have written a converter from XSD to RELAX NG, and the difficulties tend to be in the mechanics of schema and component assembly, not the content models.

XSD - type derivation - arcana ~= RELAX NG - expressivity

remembering that parts of RELAX NG were put in place to cope with XSD features (such as substitution groups.)

Totally agree,
XSD is holding back XML. I am very sad that Docbook is declining and I see the lack of tools (XMLMind XMLEditor isn't free) as the cause.
XSD brings more problems than solutions.
Bruno

Rick,

One struggles to see what compelling show stopper needs there are for schema v1.1 in the context of eBusiness exchanges and XML structures. Frankly work such as NIEM.gov had already restricted 182 rules of "do not" WRT schema 1.0 syntax - in efforts to make more consistent and reliable information exchanges.

I also submitted this white paper on XSD schema v1.1 to the W3C - but clearly they are marching to a different drum.

http://www.oasis-open.org/committees/download.php/29164/White%20Paper%20on%20CAM%20and%20XSD.pdf

Appreciate your stance on all this.

Bruno: +1

David (who is exactly the kind of participating activist we need more of): Thanks. I will be asking more for the intercession of St Jude Thaddeus, the patron saint of hopeless causes!

I didn't have any expectation my request would go anywhere, but it was not a frivolous or vexatious request. It would be a shame if there was a door unlocked but never opened because no-one knocked IYSWIM.

I think it is useful for setting expectations, if people realize that when the XSD WG (and the TAG) makes noises supporting simplicity, they are saying "yes" in the sense that one might comfort a child spooked by imaginary spiders. However, XSD has the worst implementation record of any IT standard I know of: the spiders are not imaginary.

Yet it seems that all the failures of XSD are translated in the minds of its boosters as paradoxical signs of its *success* (by which is meant widespread adoption: "it is so good people even use it where it is no good!"): when you have this mindset in play, no objective evidence of problems can worm through into the committee work.
