Converting Schematron to XML Schemas, part 2

By Rick Jelliffe
December 3, 2008

Today I was reading Erik Wilde's 2004 paper Metaschema Layering for XML, which demonstrates its point by using Schematron to make a profile of XSD for special uses. It struck me that I have not written anything about converting Schematron schemas to XML Schemas in the 12 months since my earlier article on the topic.

That blog item was about how to structure Schematron schemas so that they expressed canned constraints about elements declaratively, which would then make conversion easy. This is about a more ad hoc (and fallible) approach.

In a previous series in this blog, I wrote about the reverse, Converting XML Schemas to Schematron. That series reported on real code being developed, but here I just want to sketch out a possible approach.

Let's imagine an architecture first: the familiar old two-process pipeline, where the first stage gleans various kinds of information from the Schematron schema, and the second stage uses that information to generate the XML Schema.

The basic approach I am suggesting here is just brute force and ignorance (BFI) pattern matching. More satisfactory results might be obtainable by feeding things through a system that uses higher-order logic, of course.

We basically make a catalog of interesting templates (Schematron already uses the term pattern, so I won't re-use it; I don't mean XSLT templates though...) for which we can find matches in the rule contexts and assertion tests.

Let's start.

Open or closed schemas

By trawling through the XPaths, we can extract a list of all elements and attributes in the schema. At worst, we can give all these global declarations with wildcarded content models.

If a pattern has a final rule with just the wildcard "*" (and no previous rules use wildcards) and a report element with a test of true() then the Schematron schema is closed. This is quite important to know, in judging whether the XSD will be complete: if we don't know the schema is closed, then we need to validate with the XSD in "lax" mode and the wildcarded contents need to be validated "lax" as well.
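As an illustration only, a closed Schematron schema of the kind just described might look like the following sketch. The element names book, section and title are made up for the example. Within a pattern a node fires only the first rule whose context matches it, so the final wildcard rule catches every element that no earlier rule has claimed.

<sch:pattern>
<sch:rule context="book">
<sch:assert test="title">A book should have a title
</sch:assert>
</sch:rule>
<sch:rule context="section">
<sch:assert test="title">A section should have a title
</sch:assert>
</sch:rule>
<sch:rule context="title">
<sch:assert test="normalize-space(.)">A title should not be empty
</sch:assert>
</sch:rule>
<!-- Catch-all: any element not matched above is reported -->
<sch:rule context="*">
<sch:report test="true()">Unexpected element <sch:name/>
</sch:report>
</sch:rule>
</sch:pattern>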

When patterns like the following are found, then the general contents of an element can be discovered.


<sch:rule context="X">
<sch:assert test="Y">X should have Y because blah
</sch:assert>
<sch:assert test="count(Y) = 1">X should have one Y because blah
</sch:assert>

</sch:rule>

It is fairly obvious how this information can be used to generate declarations in XML schemas. The loosest content model would just be a lax wildcard to allow anything, in the absence of other information. If XSD 1.1 is being used, then if the features in the draft make it through, it may indeed be possible to attach many of the Schematron assertions directly as XSD assertions, of course!
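A rough sketch of what the second stage might emit for X in that loosest case follows. It assumes the xs prefix is bound to the XML Schema namespace. The xs:assert line is only meaningful to a processor that implements XSD 1.1 assertions, and an XSD 1.0 generator would simply leave it out.

<xs:element name="X">
<xs:complexType>
<xs:sequence>
<!-- No closed child list was found, so allow anything -->
<xs:any processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:anyAttribute processContents="lax"/>
<!-- XSD 1.1 only: the Schematron test carried over unchanged -->
<xs:assert test="count(Y) = 1"/>
</xs:complexType>
</xs:element>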

Open or closed content models

The important thing is to find closed child lists such as the following.

<sch:rule context="X">

<sch:assert test="count(Y) + count(Z) = count(*)">X should have Ys and Zs only because blah
</sch:assert>

</sch:rule>

They allow a repeating choice group to be generated.
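For the closed child list of Y and Z under X, the generated declaration might look something like this sketch. It assumes that Y and Z have been given global declarations elsewhere in the generated schema, and that the xs prefix is bound to the XML Schema namespace.

<xs:element name="X">
<xs:complexType>
<!-- Any number of Y and Z children, in any order, and nothing else -->
<xs:choice minOccurs="0" maxOccurs="unbounded">
<xs:element ref="Y"/>
<xs:element ref="Z"/>
</xs:choice>
</xs:complexType>
</xs:element>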

Where there is a closed child list and each particle in that list also has a single occurrence indicator, such as the following


<sch:rule context="X">

<sch:assert test="count(Y) = 1">X should have one Y because blah
</sch:assert>
<sch:assert test="count(Z) = 1">X should have one Z because blah
</sch:assert>
<sch:assert test="count(Y) + count(Z) = count(*)">X should have Ys and Zs only because blah
</sch:assert>

</sch:rule>


we have enough information to use an XSD <all> element, since each child occurs exactly once and nothing constrains their order.
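Since the assertions fix the number of each child but say nothing about order, one plausible output for this rule is an unordered group, as in the sketch below. Again this assumes global declarations for Y and Z and an xs prefix bound to the XML Schema namespace.

<xs:element name="X">
<xs:complexType>
<!-- Exactly one Y and one Z, in either order, and nothing else -->
<xs:all>
<xs:element ref="Y"/>
<xs:element ref="Z"/>
</xs:all>
</xs:complexType>
</xs:element>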

It is at this stage that we see the main limitation: the less complete the Schematron schema is, the more chance that the XML Schema will not have enough information to be workable. I think that is a reasonable and intuitive limitation, given the nature of Schematron as an open-by-default schema language, and given that people may be using a Schematron schema specifically because they needed something that XSD was less good at. I expect that RELAX NG would be much easier in this regard, because of its different approach to wildcarding which could allow more incomplete schemas.

Ditto

I think you should get the idea, and the rest is just details: the kind of Schematron rules and assertions generated by the XSD-to-Schematron converter would be a good source of templates. If an assertion uses a numeric test on some attribute value, we can use that for generating a simple type for it. Tests for enumerated values are obviously easily translated as well.
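As a hypothetical example of the enumeration case, suppose the first stage finds an assertion like this one (the attribute name status and its values are invented for the illustration):

<sch:rule context="X">
<sch:assert test="@status = 'draft' or @status = 'final'">The status should be draft or final
</sch:assert>
</sch:rule>

The second stage could then emit something along these lines, where the type name statusType is also made up. Note that the assertion as written does not require the attribute to be present, so the attribute is left optional.

<xs:simpleType name="statusType">
<xs:restriction base="xs:string">
<xs:enumeration value="draft"/>
<xs:enumeration value="final"/>
</xs:restriction>
</xs:simpleType>

<!-- Placed inside the complex type generated for X -->
<xs:attribute name="status" type="statusType"/>

A numeric range test could be handled in a similar way with the minInclusive and maxInclusive facets, though the match between XPath number conversion and the XSD numeric lexical forms is only approximate.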

Which leads us inexorably to the million dollar question: why on earth would anyone want to do this kind of conversion? In general, I am a fan of on-ramps and off-ramps: to allow people to move between technologies with minimal disruption (of course, there always is a large aspect of "you've made your bed, now sleep in it" about any technical choice, because systems are hard to change, but that does not mean technologies can afford to actively prevent substitution.)

The main use I would see is where someone has decided to use Schematron, gone ahead and suddenly someone sticks their benighted head up and says their systems need XSD (because of data binding or so on.)

Consistency checking

I can see another use for the kind of information extracted from a Schematron schema by that first stage: where that schema is used to profile (subset) a larger schema. It can be used to provide a measure of consistency checking.

For a trivial example, the element and attribute names used in the Schematron XPaths can be checked against those declared in the XML Schema. Where the Schematron schema has the rule context "/" and an assertion like "book" or "book or section" or "count(book)=1" then the XML schema must have a corresponding global declaration for book and section.
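To make that concrete, given a rule such as

<sch:rule context="/">
<sch:assert test="book or section">The document element should be a book or a section
</sch:assert>
</sch:rule>

the consistency check would expect to find global declarations like the following in the XML Schema. The type names bookType and sectionType are placeholders invented for this sketch.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="book" type="bookType"/>
<xs:element name="section" type="sectionType"/>
<!-- type definitions omitted -->
</xs:schema>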

Where the Schematron schema has a rule context like "X" and an assertion like "count(A) + count(B) + count(C) = count(*)" then the XML Schema content model for X (or its complex type) must contain declarations for A, B and C, and any other elements must have minOccurs=0 on them or some ancestor. (The interactions are a little more complicated than this: consider a content model (A, B, (C, D)?) where if you get rid of the D you must also not allow C. In this case, the effective content model is (A, B), so strictly there is no need for a declaration of C in the XML Schema.)

