The Grammar of Schematron

By Rick Jelliffe
September 15, 2009 | Comments: 2

Can we make a schema language that handles both Schematron and RELAX NG at the same time? Can we convert or implement Schematron (or certain kinds of patterns, rules and assertions) in RELAX NG, as a way to get streaming performance? Here is a sketch.

RELAX NG Compact Syntax

Let's start with RELAX NG, using the compact syntax. This is quite like the syntax of DTDs. However, RELAX NG differs from DTDs in three main visual ways.

  • The first is that you can have local declarations, so that instead of a content model just being element a { b, c, d} it can contain element a { element b { text }, c, d} and so on.
  • The second is that when there is a bare name in a content model (like c and d above) it refers to a production not an element. So a = element a { b, c, d} b = element b {text} and so on. In effect, it is like the DTD parameter entity mechanism gone mad: everything is a parameter entity! (Or, to put it another way, the parameter entity mechanism shows that the DTDs are a simple kind of tree grammar, not the oddity they may be if you expect a string grammar.)
  • The third is that content models can include attributes within them.

Extending RELAX NG

So let's take RELAX NG and extend it. Wherever a { } group is currently allowed, and also after any pattern, we will allow one or more and { pattern } clauses. Each of these new patterns is a new branch grammar evaluated in parallel from the current position.

For example, in HTML, the element a cannot have another a anywhere in its contents. James Clark showed how this can be implemented using a second RELAX NG schema just for that constraint. With our extension, it can be done in one schema.

anchor = element a { text | inline }* and { no-anchors }*
no-anchors = element * - a { no-anchors }*
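The extended grammar above has no implementation, so here is a minimal Python sketch of the same constraint, checked in one forward pass over parser events. The function name nested_anchor_found is made up for illustration, and the standard library's iterparse stands in for a real streaming validator.

```python
import io
import xml.etree.ElementTree as ET

def nested_anchor_found(xml_text: str) -> bool:
    """Return True if any <a> element has an <a> descendant (one forward pass)."""
    open_anchors = 0  # number of <a> elements currently open
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag != "a":
            continue
        if event == "start":
            if open_anchors > 0:
                return True  # an <a> started while another <a> is still open
            open_anchors += 1
        else:
            open_anchors -= 1
    return False
```

The counter plays the role of the parallel no-anchors branch: it runs alongside whatever else is being checked and needs no look-back.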

And let's make another simplifying extension too, "skip" meaning: any content but there is no need to validate it. I am not sure whether RELAX NG currently has this functionality: I expect some kind reader will let me know.

So an example of this would be table = element table { element tr { skip+ }+ } meaning that the element table has one or more row elements, and each row element has one or more of anything, we don't care. One of the early confusions in SGML days was whether the ANY keyword meant 'skip' (i.e. don't check) or 'lax' (i.e. check if you know them) or wildcard (i.e. any element as long as it has a definition.)

Forward-only Schematron

Now when we say 'Schematron' let's mean ISO Schematron using XPath 1.0. And we will limit ourselves to schemas whose XPaths can be rewritten using forward axes only. So if there was a schema

<rule id="tr" context="tr[parent::table]">
  <assert test="td | th ">
  A table row should have cells or header cells.
  </assert>
</rule>

It would be rewritten as

<rule id="tr" context="table/tr">
  <assert test="td | th ">
  A table row should have cells or header cells.
  </assert>
</rule>

(Note that this rule does not catch tr elements in other positions.)
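As a rough check that the two contexts select the same rows, here is a small Python sketch. The sample document and both function names are invented for illustration; the first function mimics tr[parent::table] by looking back at the parent, and the second mimics table/tr by only walking downward.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two rows inside a table, one stray row elsewhere.
DOC = "<doc><table><tr/><tr/></table><list><tr/></list></doc>"

def rows_by_parent_check(root):
    """tr[parent::table]: visit every tr, then look back at its parent."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    return [tr for tr in root.iter("tr")
            if tr in parent_of and parent_of[tr].tag == "table"]

def rows_by_forward_path(root):
    """table/tr: from each table, step down to its tr children."""
    return [tr for table in root.iter("table") for tr in table.findall("tr")]

root = ET.fromstring(DOC)
same = rows_by_parent_check(root) == rows_by_forward_path(root)
```

On this sample both functions return the two rows under table and skip the stray one, so same is True. Only the second form can be evaluated without remembering where a row came from.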

There are various academic papers about rewriting XPaths to only use forward paths, and mechanisms to do it. Let's ignore the intricacies and limitations, especially of predicates with tests in them.

To RELAX NG

So here is the output from converting the Schematron rule above into a RELAX NG grammar, so that each location step in the context path becomes a production. The main trick is knowing which permutation of skips and element particles is appropriate for each level.

   table = element table { tr | skip }*
   tr = element tr { skip*, ( td | th ), ( skip )* }
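To suggest what streaming evaluation of this rule might look like, here is a Python sketch that checks the table/tr assertion in a single forward pass. It is not a RELAX NG validator; the function name rows_without_cells and the stack layout are assumptions made for this example.

```python
import io
import xml.etree.ElementTree as ET

def rows_without_cells(xml_text: str) -> int:
    """Count table/tr elements with no td or th child, in one forward pass."""
    failures = 0
    stack = []  # one [tag, has_cell] entry per currently open element
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag in ("td", "th") and stack:
                stack[-1][1] = True  # record the cell on its parent
            stack.append([elem.tag, False])
        else:
            tag, has_cell = stack.pop()
            in_context = tag == "tr" and bool(stack) and stack[-1][0] == "table"
            if in_context and not has_cell:
                failures += 1
            elem.clear()  # drop the finished element's content
    return failures
```

Each stack entry corresponds to one production being matched. Rows outside a table are ignored, just as the skip branch ignores them in the grammar.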

Multiple assertions

Let's add another assertion.

<rule id="tr"  context="table/tr">
<assert test="td | th ">
A table row should have cells or header cells.
</assert>
<assert test="@colspec">
A table row should have a colspec attribute.
</assert>
</rule>

   table = element table { tr | skip }*
   tr = element tr { skip*, 
             ( element td { skip* } | element th { skip* } ), 
             ( skip )* } 
             and { attribute colspec } 

The assertion does not say there has to be an element tr, which is why the content model does not require it.

Since the second RELAX NG pattern only contains a position-independent pattern in this case, it can be combined.


table = element table { tr | skip }*
tr = element tr { skip*,
( element td { skip* } | element th { skip* } ),
( skip )*,
attribute colspec }
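One way to picture the and { } branches is as independent predicates over the same element, all of which must hold. The Python sketch below assumes that reading. The branch names and the failed_branches helper are illustrative only and are not part of either schema language.

```python
import xml.etree.ElementTree as ET

# Each branch is an independent predicate over the same tr element.
BRANCHES = {
    "has-cell": lambda tr: any(child.tag in ("td", "th") for child in tr),
    "has-colspec": lambda tr: "colspec" in tr.attrib,
}

def failed_branches(tr):
    """Return the names of the branches that the given row does not satisfy."""
    return [name for name, holds in BRANCHES.items() if not holds(tr)]
```

Because the attribute test does not depend on where the cells sit among the children, the two predicates can be evaluated in either order, which is what lets the second branch be folded into the first content model.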

Multiple rules

Let's add another rule.

<rule id="tr"  context="table/tr">
<assert test="td | th ">
A table row should have cells or header cells.
</assert>
</rule>
<rule id="div" context="div">
<assert test="table[@class='shonky']">
Every division should have at least one table
with a class of "shonky".
</assert>
</rule>

We get

   div = element div { skip*, table, skip* } 
   table = element table { tr | skip }*   
                    and { attribute class { "shonky" } }
   tr = element tr { skip*, 
             ( element td { skip* } | element th { skip* } ), 
             ( skip )* } 
             and { attribute colspec } 

Note that we would have to be careful here: we need RELAX NG rules that require at least one table[@class='shonky'] but do not disallow other tables.
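The intended semantics of the div rule can be pinned down with a small Python sketch. The function name div_is_valid is invented for this example; it mirrors the Schematron assertion rather than any particular grammar.

```python
import xml.etree.ElementTree as ET

def div_is_valid(div) -> bool:
    """True if at least one child table has class="shonky"; other tables are allowed."""
    return any(table.get("class") == "shonky" for table in div.findall("table"))
```

A div holding one plain table and one shonky table passes, while a div holding only plain tables fails. A converted grammar has to preserve exactly that distinction.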

There would be many other transformations possible. For example:

<rule id="dt"  context="dt">
<assert test="following-sibling::dd">
A definition term should be followed by definition data.
</assert>
</rule>

would be converted to something like

  general = ( dl  |  element * { general } )* 
  dl  = element dl { skip* }, element dt {*}
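The following-sibling test can also be evaluated going forward only, by deferring the verdict until the parent closes. Here is a Python sketch under that assumption; the function name unmatched_terms and the per-parent counter are illustrative choices, not something taken from Schematron or RELAX NG.

```python
import io
import xml.etree.ElementTree as ET

def unmatched_terms(xml_text: str) -> int:
    """Count dt elements with no later dd sibling, using one forward pass."""
    failures = 0
    pending = [0]  # per open parent: dt children still waiting for a dd
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag == "dt":
                pending[-1] += 1
            elif elem.tag == "dd":
                pending[-1] = 0  # every earlier dt sibling is now satisfied
            pending.append(0)    # fresh counter for this element's children
        else:
            failures += pending.pop()  # dt children left waiting at parent close
    return failures + pending[0]       # a root-level dt has no siblings at all
```

The counter is the only state carried per open element, which suggests why this kind of assertion fits a grammar-style, streaming implementation.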

Multiple patterns

Each Schematron pattern could be a super-RELAX NG grammar. (Or, indeed, they could be merged.) And in some cases the multiple patterns could be factored out into separate complete grammars, perhaps not even requiring the extensions I mention here.

Now we could keep on going and adding various kinds of extension, such as allowing the and { } to include a full grammar, so that absolute XPaths and external documents could be used. And supporting phases. The tricky thing is how to support variables (and current()) and other axes.

But the idea of this sketch is to show that in fact a lot of Schematron can be implemented directly in a mildly enhanced version of RELAX NG without (I think) explosions before it all runs out of steam. (At some point, guards or expressions or caterpillar grammars are probably needed.)



2 Comments

Thanks for your ideas, but it would be more helpful to include references to existing approaches, e.g. Combining RELAX NG and Schematron (Robertsson 2004), and RELAX NG, XSD, Schematron (D'Arcus 2006).

Yes, this particular blog explores some new ideas: it is a "sketch" and not a tutorial.
