Ten years ago
In 2001 we had an interesting exchange about schema languages on the XML-DEV mailing list. I had written "Are we losing out because of grammars?". James Clark responded:
It seems self-evident to me that both grammars and path-based rule systems have their place. Some problems can be solved most conveniently with just grammars, some most conveniently with just path-based rule systems and some most conveniently with a combination. Can't we just leave it at that? What is the point of this crusade against grammars?
So I say to you all: go back in your caves and come out with *one* schema facility that lets me write grammars when I want to and xpath expressions when I want to, and has an elegantly unified syntax. Then declare victory.
Tim was keen that this needed to be a single language to be successful. Dave Megginson played the layer-fairy:
Ideally, then, we'd want a layer to capture what most grammatical schemas have in common (i.e. a generic schema Namespace), and then allow the differences to be layered on top (i.e. the XML structure Namespace, the entity-relationship Namespace, etc.).
This was one of the seeds for the development of ISO DSDL, the family of schema standards which now includes both Clark and Murata's RELAX NG and my Schematron.
James was teasing when he mentioned my supposed crusade, because it was a rather provocative term I had used the month before:
Grammars are quite handy for some structures; but they are dogs at others. Grammars with nice set-operation properties can be invented; but that won't make them any more usable than DTDs (though certainly more powerful) or XML Schemas. Having technology that can be grasped and used by the non-elite is a more worthy and difficult goal than science or elegance alone, to me (not that the latter is remotely unworthy nor difficult, of course): the rest of the XML community seems on a pro-grammar crusade that perpetuates the disconnect between concept and expression. One thinks about things in some terms, then tries to fit those thoughts in a grammar.
I was looking back at this exchange recently, wondering what I think of it now. Not in order to re-start any old pissing competition, but because it is still a question of real technological and business impact. (And for criticism of XSD specifically, see here.)
I guess my views haven't really changed that much, though I think I can express them slightly more clearly now.
- First, it still seems to me that the primary problem with grammars (let's keep on calling them that for a little while) is their disconnection from the user's or application's conceptions about the data. The direct result is that they have very little explanatory power: the proof of the pudding being in the eating, we can see that more than two decades in which grammar-based schemas have been in the ascendancy have not resulted in a golden age where everyone validates as a matter of course, but in one where validation is a tacked-on afterthought, something that people do as a last resort. I don't know that there could be a more damning result (though of course the picture is slightly more complicated).
- The second reason against grammars, which I have raised before, is that they promote issues of minor importance to the first rank: for example, it is much more difficult to specify that some element is in no fixed order in relation to some other element than it is to require orderings which have no foundation in business requirements.
- A third reason is that I think grammars may enforce a category error about the data structures in a great chunk of XML documents. I don't want for a moment to doubt that some data is trees, nor that some data is dumps of relational tables for that matter. And while I do think the dumbing down of XML into trees rather than graphs is both a terrible flaw and a brilliant heuristic for adopting grammars, the category error I see is that many XML documents, and indeed many XML design decisions and motivating use-cases, are based on a data structure which is a tree (or directed rooted graph) only as a side effect.
The basic data structure of XML is a series of runs of text, each super- and sub-annotated hierarchically, with the hierarchies ultimately united under a single root. The guiding word here is markup: the concept that we have some span of data of interest, that we want to annotate it with some element (within the constraints of well-formedness), and that adding our own annotations to the data should not thereby lessen other annotations, especially the primary markup of the document.
Take the following example,
In this example, the tree view has "world" disconnected from "hello". But my point is that they are in fact part of a single run "Hello world" in which the "world" happens to have been annotated a few times: the conventional tree view disconnects data that is fundamentally connected and does not let us distinguish between markup that does not break the text run and markup which does (for example, a new paragraph or an inline footnote).
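A minimal sketch of the contrast between the two views, in Python with the standard library's ElementTree. The stand-in document, and the element names `p`, `b` and `i`, are my assumption for illustration, not necessarily the markup discussed above:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in markup: "world" is annotated by two inline elements.
doc = ET.fromstring("<p>Hello <b><i>world</i></b></p>")

# Tree view: "Hello " belongs to <p>, while "world" sits two levels down.
tree_view = [(el.tag, el.text) for el in doc.iter()]

# Text-run view: concatenate the character data in document order,
# ignoring the element boundaries that merely annotate the run.
text_run = "".join(doc.itertext())

print(tree_view)
print(text_run)
```

Note that nothing in either view says whether `b` or `i` is an annotation on the run or a break in it; that distinction has to come from somewhere outside the tree.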
For example, if you have an XHTML document with a table, and I want to annotate a particular couple of tr table rows with an element (in my own namespace) that says for example suspect-data, why should the document be invalid? When I ask in a query, give me the rows for this table, why must I write some special query function rather than being able to rely on the information from the schema language?
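The "special query function" might look something like the following sketch, again in Python with ElementTree. The annotation namespace URI, the sample table and the helper name `rows` are all hypothetical; the point is only that the knowledge "foreign-namespace wrappers are transparent" lives in hand-written code rather than in the schema:

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
MY = "http://example.com/annotations"  # hypothetical annotation namespace

doc = ET.fromstring(f"""
<table xmlns="{XHTML}" xmlns:my="{MY}">
  <tr><td>1</td></tr>
  <my:suspect-data>
    <tr><td>2</td></tr>
    <tr><td>3</td></tr>
  </my:suspect-data>
  <tr><td>4</td></tr>
</table>
""")

def rows(table):
    """Return the table's rows, looking through foreign-namespace wrappers."""
    found = []
    for child in table:
        if child.tag == f"{{{XHTML}}}tr":
            found.append(child)
        elif not child.tag.startswith(f"{{{XHTML}}}"):
            # An annotation from another namespace: treat it as transparent.
            found.extend(rows(child))
    return found

# The naive query only sees the rows that are direct children of the table.
direct = doc.findall(f"{{{XHTML}}}tr")

# The special-purpose query sees every row, annotated or not.
cells = [row.findtext(f"{{{XHTML}}}td") for row in rows(doc)]

print(len(direct))
print(cells)
```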
The grammar-based schema languages, it seems to me, fight directly against this approach. Of course, there are workarounds: we can move the label into an attribute, or perhaps XSD skip could be used if we thought of it, or we can use Namespace Validation Dispatching Language (NVDL) schemas to split out the objects of interest. These are excellent workarounds unless you consider the requirement to be fundamental to the problem being solved by XML, in which case they become hacks. [Note: I worked on NVDL, so I am not disputing that it is valuable and necessary and well-designed for solving its problem: it is the problem itself that I am commenting on.]
But once we decide that the data is a simple tree, we make schema languages that fit this most elegantly and powerfully (e.g. RELAX NG) or least elegantly and least powerfully (e.g. XML Schemas), according to our taste.
- And the fourth reason I can give is fairly new in its formulation: it is that what I have been calling 'grammars' here is really only one kind of grammar (formalized generative grammars coming from Zellig Harris through his student Noam Chomsky and through a lot of solid academic work in theoretical computer science) that leads us to be interested in one set of questions, but there are other kinds of grammars (in particular operator grammar, again coming from Zellig Harris) which naturally lead us to be interested in another set of questions. And I would say that these questions are just as interesting, practically and theoretically; I cannot say how tractable they are.
Reading recently of various characterizations of Harris' work, I have been struck by how many of the practical issues I and others in the life-is-bigger-than-grammars crew have been raising can be traced to being practical implications of Harris' theory. I don't see any harm in this kind of retrofitting; far from co-opting an unrelated theory, it is very exciting to me to read formal material on grammars which seems to be directly related to the issues that are of interest to Schematron.
Let's look at four characteristics of Harris' approach of Operator Grammar. (This goes over, to some extent, some ideas from last week's blog. I apologize for being at the start of understanding Operator Grammars rather than being at the end, but I guess a blog is useful for recording ideas as they are being worked out.)
- Equiprobability Basis: Harris starts off from the basis that any words are equally likely to be found, in which case the rules of the grammar are seen as constraints which filter out possibilities. Contrast this with generative grammars where you start with a production which generates the language by following its rules. Is this anything different from the old issue of schema openness? A Generative Grammar-based schema language (which is I suppose a better name for what I have been calling grammar-based schema languages) needs to be extended with wildcards, NVDL, and so on in order to allow some kinds of openness; an Operator Grammar-based schema language (if we allow Schematron to be such for the sake of argument) starts off allowing anything and requires special constraints to close itself off.
- Partial Order Constraint: Each of the constraints specifies some partial order between nodes in the document: this is not directly a positional order but a logical or co-occurrence order (the existence of one element entails that some other element should be present). We could even say that the partial order constraint is one in terms of the information content rather than the syntax, as far as the XML schema is concerned. The need to be able to represent or explain the semantics of some schema usage construction rather than just the syntax has of course been one of the long-running themes of Schematron.
- Likelihood Constraint: This constraint points out that when there is some required information, it may be marked up in several different ways: it may be an attribute, it may be a child element, it may be on a "structured attribute" that is the property of some child of some ancestor element, it may be at the other end of a link, it may be located by index to some shared string table. The generative-grammar-based schema languages entirely ignore this kind of information, yet it seems to me to be information that continually crops up; information that we would expect a schema language to be able to represent. (Of course, there are workarounds/hacks in place: RELAX NG's inclusion of attributes in content models is elegant, XSD's KEY/KEYREF is at least clear, while XSD 1.1's alternative type selection is chaotic and unnecessarily baroque.) [Note: I will need to tease this out more: Harris has a linearization step that perhaps covers this functionality.]
- Reduction Constraint: A really interesting part of Operator Grammars is the insight that information needed to make sense of the sentence (in our case, to validate the XML) may not be found in the sentence itself, nor the grammar, but may be known from other information in previous (or future!) sentences, or indeed from information outside the current sentences, from the discourse. In Schematron, we can collect and assign information to variables that can be used in later assertions, and we can access information outside of the current document, in the WWW. The UBL Code-list methodology, for example, shows that this kind of capability is not an armchair-theoretic construct, but a real-world one.
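To make the flavour of these characteristics concrete, here is a rough sketch in plain Python (not Schematron itself, and not a claim about how any Schematron implementation works). The `invoice` vocabulary, the rule messages and the inline `CURRENCY_CODES` set are all hypothetical. The validator starts off accepting anything and only reports what a rule objects to; the three rules illustrate a co-occurrence constraint, information that may be marked up in more than one way, and information drawn from outside the document:

```python
import xml.etree.ElementTree as ET

# Hypothetical external code list; in practice this might be fetched from
# a separate document rather than written inline.
CURRENCY_CODES = {"AUD", "EUR", "USD"}

def validate(xml_text, rules):
    """Open by default: anything is allowed unless a rule objects."""
    doc = ET.fromstring(xml_text)
    return [msg for test, msg in rules if not test(doc)]

rules = [
    # Co-occurrence: a <discount> entails a <reason>, in any position.
    (lambda d: d.find("discount") is None or d.find("reason") is not None,
     "a discount requires a reason"),
    # Same information, several markups: currency as attribute or child.
    (lambda d: d.get("currency") is not None or d.find("currency") is not None,
     "a currency must be given as an attribute or a child element"),
    # Information from outside the document: the code must be in the list.
    (lambda d: (d.get("currency") or d.findtext("currency")) in CURRENCY_CODES,
     "the currency code must appear in the external code list"),
]

ok = validate('<invoice currency="AUD"><note/><reason/><discount/></invoice>', rules)
bad = validate('<invoice><currency>XXX</currency><discount/></invoice>', rules)

print(ok)
print(bad)
```

The first document passes even though `note` is mentioned nowhere in the rules and `reason` comes before `discount`: nothing closes the vocabulary or fixes the order unless a rule says so.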
So it would be great if we could get to the stage where the next paper giving a Taxonomy of Schema Languages for XML did not simply leave Schematron out (surely its omission is evidence of a weakness of the taxonomy?) but instead had a top-level divide between Operator and Generative Grammars. (Of course, the question of which class of formal Generative Grammars Schematron fits in, for example, when accompanied by a traditional schema language as a data guide, is still interesting.)
It seems quite likely to me that Operator Grammar may provide insights into how to develop Schematron further, just as the Generative Grammars have helped guide the development of the grammar-based schema languages from DTDs to RELAX NG.