HTML 5 (see last week's blog The Bold and the Beautiful: two new drafts for HTML 5 for some other thoughts) provides a good test of how powerful, in practical terms, different schema languages are for the important class of generic "rich text" narrative documents that HTML now epitomizes.
And last week a colleague asked me for a collection of typical constraints that Schematron is used for, to test an implementation.
So let's look at the assertions in the draft of HTML 5: The Markup Language, which collects constraints about the markup: the kinds of things that are susceptible to schema testing.
Most of the draft is taken up by section 6, which is a listing of all the elements with a standard form for constraints of various kinds: Content Model, Attribute Model, Permitted Contexts, and so on. Let's have a look at the assertions in particular and see how they fit in with Schematron.
Since, on my understanding, the Assertions were actually largely created during an exercise that created RELAX NG schemas and Schematron, it should be no surprise that RELAX NG can handle all the content models and Schematron can handle all the assertions. But it is interesting to classify the kinds of assertions, to get an idea of the kinds of constraints that are, in practice, important. (I don't know whether the designers of HTML limited themselves to Schematron assertions or to the subset in drafts of XSD 1.1.)
I've omitted repeated assertions, where exactly the same constraint or the same kind of constraint has been specified. Here they are, categorized in various ways.
Downward axis exclusion constraints
These constraints can be expressed using conventional schema languages such as RELAX NG and XSD. In SGML DTDs, some of them could be expressed using the inclusion exception or exclusion exception mechanisms.
- The interactive element "a" must not appear as a descendant of the "a" element.
- The sectioning element "footer" must not appear as a descendant of the "header" element.
- When a "datalist" element is the first child of a "datagrid" element, it must not have following siblings.
- The element "img" with the attribute "usemap" must not appear as a descendant of the "a" element.
So if you can use a grammar, why not? Because of combinatorial explosion. Every one of these rules requires a parallel set of content models starting from the ancestor. The XPaths to do these in Schematron are trivial.
In the first case, that html:a//html:a is not allowed, you would need to define local declarations for all inline elements that could appear in a, for example.
Using the assertion reconstructs the idea of SGML exclusion exceptions, which allows the content models to state the broad case simply, and then to note any particular variants which are exceptions to the rule.
Now, actually, it is often possible to express this kind of constraint with conventional grammar languages by using two grammars: one for expressing the normal structure and another which just models the exception. James Clark has demonstrated this with his RELAX NG schemas for HTML. It is smart but a little clunky because combining the exceptions into the same schema might itself have a (smaller) combinatorial explosion.
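To give a feel for how little Schematron the exclusions listed above need, here is a rough sketch of them as one pattern. It assumes ISO Schematron with the default XPath 1.0 binding and unprefixed element names (for namespaced XHTML you would add an ns declaration and html: prefixes); the pattern id is made up for illustration.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern id="exclusions">
    <!-- One rule per ancestor; the content models themselves stay untouched -->
    <rule context="a">
      <assert test="not(.//a)">
        The interactive element "a" must not appear as a descendant of the "a" element.
      </assert>
      <assert test="not(.//img[@usemap])">
        The element "img" with the attribute "usemap" must not appear as a descendant of the "a" element.
      </assert>
    </rule>
    <rule context="header">
      <assert test="not(.//footer)">
        The sectioning element "footer" must not appear as a descendant of the "header" element.
      </assert>
    </rule>
    <!-- Matches a datalist that is the first child element of a datagrid -->
    <rule context="datagrid/datalist[not(preceding-sibling::*)]">
      <assert test="not(following-sibling::*)">
        When a "datalist" element is the first child of a "datagrid" element, it must not have following siblings.
      </assert>
    </rule>
  </pattern>
</schema>
```

Each exclusion is a single negated path hung off the ancestor, which is why no parallel set of content models is needed.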
Downward axis requirement constraints
A variant on the downward axis constraints is one that makes a requirement. The first two examples below are clearly constraints that could be built into a content model, at the cost of complicating it. The XPaths to do these in Schematron are trivial.
- The "area" element must have a "map" ancestor.
- A "bdo" element must have a "dir" attribute.
- The "header" element must have at least one "h1"-"h6" descendant.
The third example above is a different kind of constraint. It is impossible to express this kind of constraint using conventional regular grammars: it is not a regular constraint, because the path taken in one (descendant) content model alters the paths available in all subsequent content models.
In an XPath for Schematron, this is of course not complicated:
<assert test=".//h1 or .//h2 or .//h3 or .//h4 or .//h5 or .//h6">
The "header" element must have at least one "h1"-"h6" descendant.
</assert>
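The first two requirements in the list are just as short. The following is a minimal sketch, again assuming ISO Schematron, XPath 1.0 and unprefixed element names; the pattern id is hypothetical.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern id="requirements">
    <rule context="area">
      <!-- True when any ancestor of the area is a map -->
      <assert test="ancestor::map">
        The "area" element must have a "map" ancestor.
      </assert>
    </rule>
    <rule context="bdo">
      <!-- True when the attribute is present, whatever its value -->
      <assert test="@dir">
        A "bdo" element must have a "dir" attribute.
      </assert>
    </rule>
  </pattern>
</schema>
```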
Complex value constraints
These constraints are not possible with conventional grammars. However, RELAX NG does allow content models where the presence of a constant text value in an attribute or element forces a particular path. XSD 1.1 has an assertion mechanism that should be able to handle some of these too.
- The value of the "min" attribute must be less than or equal to the value of the "value" attribute.
- The value of the "value" attribute must be greater than or equal to zero when the "min" attribute is absent.
- The "select" element cannot have more than one selected "option" descendant unless the "multiple" attribute is specified.
- The internal character encoding declaration must be the first child of the "head" element.
This last example is, however, trickier because it requires parsing a value to find a certain substring. Schematron is capable of this. I expect XSD 1.1 assertions are also capable of it; however, I believe RELAX NG would not be useful for this kind of constraint.
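Here is one way the four value constraints above might look in Schematron. Treat it as a sketch under several assumptions of mine: that the "min" and "value" attributes belong to a meter element, that the values are plain numbers that XPath 1.0 number() can read, and that the internal encoding declaration is a meta element with either a charset attribute or a content attribute containing the substring "charset=". None of those details comes from the list itself, and the pattern id is made up.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern id="value-constraints">
    <!-- Assumption: min and value are attributes of meter -->
    <rule context="meter[@min][@value]">
      <assert test="number(@min) &lt;= number(@value)">
        The value of the "min" attribute must be less than or equal to the value of the "value" attribute.
      </assert>
    </rule>
    <rule context="meter[@value][not(@min)]">
      <assert test="number(@value) &gt;= 0">
        The value of the "value" attribute must be greater than or equal to zero when the "min" attribute is absent.
      </assert>
    </rule>
    <rule context="select[not(@multiple)]">
      <assert test="count(.//option[@selected]) &lt;= 1">
        The "select" element cannot have more than one selected "option" descendant unless the "multiple" attribute is specified.
      </assert>
    </rule>
    <!-- Assumption: the declaration is found by a charset attribute or a substring match -->
    <rule context="head/meta[@charset or contains(@content, 'charset=')]">
      <assert test="not(preceding-sibling::*)">
        The internal character encoding declaration must be the first child of the "head" element.
      </assert>
    </rule>
  </pattern>
</schema>
```

A non-numeric attribute value turns into NaN under number(), so both comparisons come out false and the assertion fails, which seems the right outcome here. The substring test is case-sensitive as written.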
The particular reference constraints here are fairly straightforward. I expect that XSD KEYREF checking could cope with these.
- Any "button" descendant of a "label" element with a "for" attribute must have an ID value that matches that "for" attribute.
- The "list" attribute of the "input" element must refer to a "datalist" element or to a "select" element.
In Schematron, you would use something like the following:
<rule context="button[ancestor::label[@for]]">
  <assert test="ancestor::label[@for]/@for = current()/@id">
    Any "button" descendant of a "label" element with a "for" attribute must have an ID value that matches that "for" attribute.
  </assert>
</rule>
<assert test="//datalist[@id = current()] or //select[@id = current()]">
The "list" attribute of the "input" element must refer to a "datalist" element or to a "select" element.
</assert>
Reverse axis constraints
Sometimes the constraint works backwards, as in this:
- The "img" element with the "ismap" attribute set must have an "a" ancestor with the "href" attribute.
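In Schematron the backwards direction costs nothing extra, because the ancestor axis is as easy to use as the descendant one. A minimal sketch, assuming ISO Schematron, XPath 1.0 and unprefixed element names (the pattern id is hypothetical):

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern id="reverse-axis">
    <!-- Only img elements that carry ismap are checked -->
    <rule context="img[@ismap]">
      <assert test="ancestor::a[@href]">
        The "img" element with the "ismap" attribute set must have an "a" ancestor with the "href" attribute.
      </assert>
    </rule>
  </pattern>
</schema>
```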
HTML 5: The Markup Language also lists the permitted contexts for each element. These are usually the grouped common content models, like this one (I have elided):
common.elem.flow = p | hr | pre | ul | ol | dl | div | h1 | h2 | h3 | h4 | h5 | h6 | address | blockquote | map | section | nav | article | aside | header | footer | dialog | figure | table | form | fieldset | datagrid | menu | details
This kind of constraint can be modelled in Schematron using abstract rules.
<pattern id="element-flow">
  <rule abstract="true" id="child.of.common.element.flow">
    <assert test="parent::p or parent::hr or parent::pre or parent::ul or parent::ol or ... ">
      The <name/> element is part of the element flow. It can be used inside p or hr or pre or ul or ol or dl or div or h1 or h2 ...
    </assert>
  </rule>
  <!-- each concrete rule in this pattern then includes: -->
  <extends rule="child.of.common.element.flow" />
</pattern>
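To show where the extends element sits, here is a small complete sketch. The choice of table and form as the concrete contexts is mine, purely for illustration, and the list of parents is deliberately cut down to three so the example stays short; it is not the real permitted-context list. It assumes ISO Schematron, where an abstract rule is identified by its id attribute.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern id="element-flow-sketch">
    <!-- Abstract rule: has no context of its own, only reusable assertions -->
    <rule abstract="true" id="child.of.flow.sketch">
      <assert test="parent::div or parent::section or parent::article">
        The <name/> element is part of the element flow. It can be used inside div or section or article.
      </assert>
    </rule>
    <!-- Concrete rules supply the context and pull in the shared assertion -->
    <rule context="table">
      <extends rule="child.of.flow.sketch"/>
    </rule>
    <rule context="form">
      <extends rule="child.of.flow.sketch"/>
    </rule>
  </pattern>
</schema>
```

The name element in the message is filled in with the name of whichever concrete context element failed, so one message serves every element in the group.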
Content models sequence
One very interesting part of the draft HTML 5 content models is how rarely sequence is actually used. Sequence is one thing that is usually easier to express with content models than with XPaths in Schematron. In my opinion, this ease leads to the situation where sequence is used even where it cannot be related to any business requirement. But sometimes it is of course what you need.
Looking at the use of sequence in HTML 5, a few things stick out. First, in all the cases where sequence is important, it seems to me that the constraint is one of:
- A certain element must appear at the start of the contents. Examples of this are the legend element, which must or may appear at the start of figure. (The legend may alternatively appear as the last element.) Also the source and even the ...
- The complex structures of a table, definition lists and ruby text.
For example, look at the content model for figure:
figure = (legend, (text & common.elem.flow*)) | ((text & common.elem.flow*), legend?) & figure.attrs
Here is the corresponding Schematron schema for the legend constraint, which I think is clearer than the content model, because it can explain more of the rationale for the sequence constraint.
<rule context="figure">
  <assert test="count(legend) &lt;= 1">
    A figure cannot have more than one legend.
  </assert>
  <assert test="not(legend) or *[1][self::legend] or *[last()][self::legend]">
    The legend for a figure must either be the first (title)
    or last (caption) element.
  </assert>
</rule>