Here is a test: when you hear the terms "layering" and "pipelines", are they abstract gibberish bearing no real relation to the way you develop? Or are they core parts of your toolkit, perhaps the core parts?
I have repeatedly found that when XML people talk of "docheads" and "dataheads", what they are actually referring to is people from a pipelining background, as distinct from people from an object, relational or 3-tier background. Pipelining is of course highly associated with UNIX: shell pipes, the OS API pipes, even Dennis Ritchie's STREAMS mechanism.
Over the years I have repeatedly come face to face with perplexing people who talk of "XML processing" and yet, on examination, are actually talking about everything except XML processing: for example, processing XML data into high level objects/components, or shredding it into a relational database.
But pipelines with XML-in and XML-out are a really useful design approach: a design approach that has proved itself in widely used network software over many years; a design approach that encourages small, targeted components.
This post looks at how Schematron and parts of DSDL can be implemented in a pipeline. In order to explain the design of the latest release of Schematron, I thought it would be useful to show how the Schematron design has changed over the last decade to involve multiple stages.
Schema languages' functional operation
Each schema language has a slightly different operating model.
- ISO SGML DTDs provided information necessary to parse the SGML document. So the result of validation was the production of a valid document (with all the markup gaps such as missing tags filled in). Validation is a function intertwined with parsing that produces a document information set (e.g. ESIS).
- W3C XML DTDs are not needed for parsing, except for the particular case of providing declarations for entities and for providing default values of attributes. But XML DTD validation is basically a function that returns a boolean result: valid or invalid.
- W3C XML Schemas do even more. The result of validation is a Post Schema Validation Infoset (PSVI), which has no standard serialization, but which annotates the XML document with outcomes: whether each element was assessed, whether it validated successfully, what its type is, what a default value is, and so on. Validation is a function that returns a complex PSVI, one outcome of which can be a simple boolean valid/invalid.
- OASIS/ISO RELAX NG took the approach that all that validation needed to do was provide the boolean function. In particular, issues such as providing default values, or typing, or non-XML-able properties were off the table. It is not that these other things are somehow improper, it is just that they can be provided by extensions or subsequent processes using the schemas, rather than complicating the core operation of a validator.
- ISO Schematron takes the approach of validation as transformation. The output is some rich information constructed from looking at the document set.
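To make "validation as transformation" concrete, here is a minimal Schematron sketch; the dog/leg/bone vocabulary is invented for illustration, but the element names are ISO Schematron's:

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:pattern>
    <!-- A rule fires on each node matching its context -->
    <sch:rule context="dog">
      <!-- assert: complain when the test is false -->
      <sch:assert test="count(leg) = 4">A dog should have four legs.</sch:assert>
      <!-- report: speak up when the test is true -->
      <sch:report test="bone">This dog has a bone.</sch:report>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Running this schema against a document does not merely answer valid/invalid: it produces a new document of assertion failures and reports, which can feed the next stage of a pipeline.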
Validation as transformation
Schematron started off when I was doing some R&D at Academia Sinica Computing Center in Taiwan at the start of 1999. I wrote a little article, Using XSLT as a Validation Language, which was picked up by InterChange magazine.
Compiling a schema to XSLT
Francis Norton picked up on this, and wrote a program to convert from the DCD schema language into XSLT. By this stage I had been working almost full-time for three years studying documents, schemas and schema languages, first for my book The XML & SGML Cookbook: Recipes for Structured Information and then at ASCC, representing them at the W3C XML Schema Working Group. So I thought: if we start off with the constraint that it has to be readily implementable on XSLT, can we make a schema language that addresses many of the difficult problems that have cropped up?
So the architecture of the first Schematron system was a compiler (initially OmniMark, then XSLT) that first generated XSLT; that generated XSLT was then run on the document to produce the result. This is the basic model that Schematron has followed. Indeed, XSLT has a special mechanism (a namespace alias, conventionally used with the axsl prefix) specifically to allow XSLT scripts that generate XSLT: symbolic programming and application generators used to be common techniques.
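The trick rests on xsl:namespace-alias, which lets a stylesheet emit literal result elements that end up in the XSLT namespace. A simplified sketch of the compiler's shape (the rule-to-template mapping shown here is illustrative, not the actual skeleton code):

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:axsl="http://www.w3.org/1999/XSL/TransformAlias"
    xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <!-- axsl elements in the output are rewritten into real xsl elements -->
  <xsl:namespace-alias stylesheet-prefix="axsl" result-prefix="xsl"/>

  <!-- Each Schematron rule context becomes a template in the generated validator -->
  <xsl:template match="sch:rule">
    <axsl:template match="{@context}">
      <xsl:apply-templates/>
    </axsl:template>
  </xsl:template>
</xsl:stylesheet>
```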
The skeleton and meta-stylesheet design
The early Schematron implementations merely forked the code and added different output code. Very early in Schematron's life, Oliver Becker or his students rearranged Schematron to use an XSLT API: all processing was moved out to named templates, which simplified customization because you only needed to provide templates overriding the particular built-in named templates that interested you. This system is called a meta-stylesheet system. The basic and default Schematron functionality and utilities are termed the skeleton.
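A meta-stylesheet, then, just imports the skeleton and overrides the hooks it cares about. A hedged sketch, assuming a skeleton file named skeleton1-5.xsl and a hook template named process-assert (the exact file names, template names and parameters vary between skeleton versions):

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Pull in the default Schematron behaviour -->
  <xsl:import href="skeleton1-5.xsl"/>

  <!-- Override just the assertion-failure hook; parameter names are illustrative -->
  <xsl:template name="process-assert">
    <xsl:param name="test"/>
    <xsl:text>FAILED: </xsl:text>
    <xsl:value-of select="$test"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Everything not overridden falls through to the skeleton's built-in named templates.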
In order to facilitate testing different XSLT engines (quality varied enormously) I made a simple XML output format, used for screamathon testing. This format gained some traction, was developed further, and is now standardized as ISO SVRL (Schematron Validation Reporting Language). It turns out that producing XML output is really useful for integrating Schematron into tool chains, and it seems to have taken over from the meta-stylesheet system.
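An SVRL report is itself just a flat XML document, roughly of this shape (the dog example and the location path are invented; the element and attribute names are SVRL's):

```xml
<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <!-- One fired-rule per rule whose context matched -->
  <svrl:fired-rule context="dog"/>
  <!-- One failed-assert per assertion that did not hold -->
  <svrl:failed-assert test="count(leg) = 4" location="/kennel[1]/dog[2]">
    <svrl:text>A dog should have four legs.</svrl:text>
  </svrl:failed-assert>
</svrl:schematron-output>
```

Because the report is plain XML, a downstream stage can filter it, render it, or merge it with other reports with ordinary XSLT.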
Inclusions and abstract patterns
As Schematron developed, the Schematron 1.5 implementation was the version that many people settled on. In Schematron 1.6 I introduced the idea of abstract patterns: these finally allowed me to represent patterns in an XPath-independent fashion, and solved the core problem that had been so difficult when making the book: how to formally describe things (such as tables) which have several different markup forms.
The easiest way to implement this, and simple inclusions as well, was a macro pre-processor. I took this route rather than vastly complicating the XSLT skeleton. (These kinds of things are so much easier in XSLT2.) However, adding the extra stage seems to complicate life for implementers, and I believe that many people stuck with the two-stage process even if they used the Schematron 1.6 skeleton.
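In ISO Schematron syntax, an abstract pattern states the constraint once against $-prefixed parameters, and the pre-processing stage textually substitutes the parameters at each instantiation. A sketch of the tables case (the element names and paths are illustrative):

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <!-- The table constraint, written once, in markup-neutral terms -->
  <sch:pattern abstract="true" id="table">
    <sch:rule context="$table">
      <sch:assert test="$row">A table must contain at least one row.</sch:assert>
    </sch:rule>
  </sch:pattern>

  <!-- Instantiated for HTML-style table markup ... -->
  <sch:pattern is-a="table" id="html-table">
    <sch:param name="table" value="table"/>
    <sch:param name="row" value="tr"/>
  </sch:pattern>

  <!-- ... and for CALS-style table markup -->
  <sch:pattern is-a="table" id="cals-table">
    <sch:param name="table" value="table"/>
    <sch:param name="row" value="tgroup/tbody/row"/>
  </sch:pattern>
</sch:schema>
```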
When ISO Schematron was released, it was organized in such a way as to allow easy moves to new versions of XSLT, such as EXSLT, XSLT2, XPath2, XQuery. Actually, XSLT1 is the default, but even non-XML languages (such as RDF's Squish or SQL) could be used as query languages.
ISO Schematron formed part of ISO DSDL (Document Schema Definition Languages), which is an ongoing project to standardize small, targeted schema languages. The core languages are now mature standards with good implementations (RELAX NG, Schematron, NVDL); there is a crop of newly standardized support languages with immature implementations (DSRL, DTLL, CREPDL); and there may be more added in the future.
Several of these supporting languages are suitable for implementation in XSLT, as part of the pipeline.
Most importantly, in ISO DSDL the SC34 Working Group (WG1) decided against developing any home-made validation framework. This was because we wanted to support, not compete with, the XProc effort at W3C, of which Norm Walsh has been a leading light.
I am not sure that XProc can be implemented in XSLT2, though I suspect it can. So what we have with Schematron is the possibility of an all-XSLT implementation of most parts of ISO DSDL. The attractions are performance, the use of a mature implementation, and most importantly, the ubiquity of XSLT engines. (Of course, the fact that they are mostly XSLT1 rather than XSLT2 throws a spanner in the works in the short term: one day I expect Microsoft will wake up and realize how dumb they have been pitting Linq against XSLT2 as if they were competitors. I believe GNOME's open source libxslt is still XSLT1, though it is being upgraded by Steve Ball, for example.)
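As a sketch of where this leads, the multi-stage Schematron compilation can be expressed as an XProc pipeline of chained XSLT steps; the stylesheet file names below are those shipped with the ISO Schematron skeleton distribution, though they vary between releases:

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <!-- Stage 1: resolve sch:include -->
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="iso_dsdl_include.xsl"/>
    </p:input>
  </p:xslt>

  <!-- Stage 2: expand abstract patterns -->
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="iso_abstract_expand.xsl"/>
    </p:input>
  </p:xslt>

  <!-- Stage 3: compile the schema into a validating XSLT that emits SVRL -->
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="iso_svrl_for_xslt1.xsl"/>
    </p:input>
  </p:xslt>
</p:pipeline>
```

The output of the final step is itself an XSLT stylesheet, which a further step (or a plain XSLT engine) runs against the instance document to produce the SVRL report.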
You can see where this is heading. In the earlier part of the decade, my programmer at Topologi, Eddie Robertsson, came up with an extractor program that allowed Schematron fragments to be embedded in W3C XML Schemas or RELAX NG schemas. We released this as open source, and it has proved fairly popular (and, indeed, XSD 1.1 will probably have a simplified version of assertions added, since it is such a trivial and useful layer to add).
Compiling schemas to Schematron
Ken Holman has released as open source some XSLT scripts that implement the UBL code-list validation methodology in Schematron.
Over 2008, I wrote a long series of blogs on a project I was doing with sponsorship from JSTOR, converting XML Schemas into Schematron. This means that I have joined the harried ranks of XSD implementers, I suppose, which perhaps makes me officially part of the problem rather than the solution. I also have implemented an XSD to RELAX NG converter in XSLT (Murata-san generously fixed it up) which I hope to present in some future blogs. All these use XSLT.
- Macro processing the XSD
- Validating your own derived XSD simple types
- Validating IDREFs
- Validating special complex content types
- Progressive validation for complex content models
- Partial order validation for following sibling elements
- Required pairs in a sequence
- Post processing the Schematron Schema
- Friendlier schemas
- Some better diagnostics
- Identity constraints
Pipelines on the instance side
Pipelines on the input side are possible too, of course. Standards such as W3C's XInclude, ISO OOXML's Markup Compatibility and Extensibility, and ISO DSDL's DSRL can be slotted in.
However, it turns out that simple pipelines on the input side are probably not as useful as you might think. This is because documents call other documents: for example, using the XSLT document() function, or via other kinds of links.
So pipelines on the input side really need to be attached to the "entity resolver" of the XML processor in order to be effective.