The top three mistakes in Schematron

By Rick Jelliffe
June 7, 2009

After almost a decade of Schematron schemas, here are the three errors I see most often.

(Stretch the window wide for the least ugly code samples, please.)

Error messages not assertions

The most common error comes from people whose mental model of Schematron is not as a schema language but as a validation language: they want error messages. So they will have rules like the following:

<rule context="html">
  <assert test="head">ERROR! No head found</assert>
</rule>

What would be better? An assertion is a positive, natural-language statement saying what is expected in a context and, preferably, why it is expected. It should err on the side of being given in terms of the markup's underlying model (if I may use "model" as a flame-proof euphemism for the S-word, semantics).

Why? Because a schema is like a recipe, something that can be shown to stakeholders and allow discussion and understanding of a document type (or partial document type) without getting bogged down in details. Indeed, the details may change: an attribute may be promoted to an element without changing the semantics, and the assertion text does not necessarily need to reflect housekeeping issues. Furthermore, error messages are notoriously ad hoc, scenario-dependent and fragile.

Here is what I think is better:


<rule context="html">
  <assert test="head">The html element should contain a head
  element, which contains frontmatter and metadata for the page.</assert>
</rule>

Or, even better for a rich environment:

<rule context="html"  
   id="elements-html"
   see="http://www.w3.org/TR/1999/REC-html401-19991224/struct/global.html"
   icon="elements.gif">
  <assert test="head"  
   id="elements-html-head" 
   see="http://www.w3.org/TR/1999/REC-html401-19991224/struct/global.html#h-7.4.1"
   diagnostics="hint-add-missing-1"
   flag="non-WAI" 
   icon="element-element.gif"
   role="structural">
    The <span class="element">html</span> element 
    should contain a <span class="element">head</span> element, 
     which contains declarative frontmatter 
     and metadata for the page.</assert>
</rule>

...
<diagnostic id="hint-add-missing-1" role="hint" icon="hint.gif">
Add the missing element. It should be the first child of <name />.
</diagnostic>

That is a lot of markup! Of course you don't have to use all that, but I wanted to show that Schematron is a completely web-enabled schema language: it has constructs to tie directly into larger web and semantic web information. XSD, in contrast, is not a web-enabled schema language: it provides slots in which people can (but don't) invent and place their own annotations.

Let's look at this extra markup.

For a start we see that instead of an error message, we have a diagnostic element linked (by ID reference) from the assertion. This diagnostic lets you be very specific in your advice. The name element allows dynamic construction of the diagnostic message, so the same diagnostic can be reused by multiple assertions (and an assertion can have multiple diagnostics). We see that the diagnostic also has an icon attribute, to allow first-class user agent presentation of the messages, which is highly useful for synoptic scanning of large numbers of messages. Plus it has a role attribute, to allow classification of the diagnostic: this mirrors the reality that real error systems frequently classify errors as 'fatal', 'error', 'warning', 'note' and so on.
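
To make the reuse concrete, here is a minimal sketch of one diagnostic shared by two assertions, and of one assertion that lists two diagnostics. The head/title rule, the diagnostic ids and the diagnostic wording are illustrative assumptions, not part of the schema above; the use of the name element inside a diagnostic simply follows the example above.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="html">
      <!-- Refers to the shared diagnostic by its id -->
      <assert test="head" diagnostics="hint-add-missing">
        The html element should contain a head element, which contains
        frontmatter and metadata for the page.</assert>
    </rule>
    <rule context="head">
      <!-- Hypothetical second assertion: reuses the same diagnostic and
           adds another; the ids are separated by whitespace -->
      <assert test="title" diagnostics="hint-add-missing hint-check-spec">
        The head element should contain a title element, which names
        the page for readers and tools.</assert>
    </rule>
  </pattern>
  <diagnostics>
    <!-- name is replaced by the name of the context node, so this one
         message reads correctly for both html and head -->
    <diagnostic id="hint-add-missing">
      Add the missing element as a child of <name/>.</diagnostic>
    <diagnostic id="hint-check-spec">
      Check the HTML 4 recommendation for the required content of <name/>.</diagnostic>
  </diagnostics>
</schema>
```

Because the diagnostic text is built from the context node, the advice stays specific even though it is written only once.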

In the rule, we see similar rich annotations. The see attribute is a URL to human-readable documentation, in this case the description of the head element in the HTML 4 recommendation. The id attribute allows specific reference to this rule for linking information in. And, again, there is the icon attribute for richer visuals. (I regard the provision of non-text indicators as a basic issue of accessibility, by the way: non-, pre- and post-literate users of systems need to be supported at the foundation level. I also see it as an issue of internationalization: anyone who has lived for any time in a country which uses a script or language they cannot read knows how important hints can be. Just yesterday I was trying to use Japanese eBay, for example, and its text-centric operation defeated me: my patience ran out working through long lists of options.)

In the assertion we also have id, see, role, and icon attributes, which allow detailed integration with the larger web through incoming and outgoing links.

It also has a flag attribute: Schematron not only allows simple binary validation (i.e., valid|invalid) and detailed reporting of validation results, but it also allows incidental results to be flagged. In this case, if a document does not have a html/head element, the non-WAI flag will be raised. (Other failed assertions may also raise the same flag.) If the role attribute allows us to classify the function or importance of the assertion (rule, pattern or diagnostic), then the flag attribute allows us to classify or allocate the assertion against some external profile or grouping of interest.
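
As a rough sketch of how flag and role divide the work, the two assertions below sit in different rules and carry different roles, yet raise the same flag. The img rule, its wording and the role value "accessibility" are assumptions made up for illustration; only the non-WAI flag and the "structural" role come from the example above.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="html">
      <!-- role classifies the assertion; flag ties it to an external profile -->
      <assert test="head" flag="non-WAI" role="structural">
        The html element should contain a head element, which contains
        frontmatter and metadata for the page.</assert>
    </rule>
    <rule context="img">
      <!-- Hypothetical assertion: a different role, but the same flag -->
      <assert test="@alt" flag="non-WAI" role="accessibility">
        An img element should have an alt attribute, which gives a text
        equivalent for readers who cannot see the image.</assert>
    </rule>
  </pattern>
</schema>
```

If either assertion fails, the non-WAI flag is raised, so a consumer interested only in that profile does not need to know which individual assertion was responsible.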


Documentation not assertion

The second common error I see is a schema like the following:


<rule context="html">
  <assert test="count(head) = 1">When you count the number
  of head elements in an html element, all using no namespaces,
  the resulting numerical number should equal integer 1
  and nothing else. Note: no explicit casting is needed
  because of XPath's built-in typecasting rules.</assert>
</rule>

This kind of error comes from people whose mental model of the assertion is as code documentation, explaining in detail the operation of the XPath. The assertion text is useful only for programmers.

But that puts the cart before the horse. The assertion text states the requirement; the XPaths in assert/@test then try to implement it as far as they can, but they may be lossy, and you may need more than one assertion to zero in on a single constraint. Indeed, it is possible to express untested and untestable constraints.
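
For comparison, here is one possible requirement-first rewrite of the rule above, split into two assertions that together close in on the single constraint "exactly one head". The wording of the second assertion is my own illustration rather than anything taken from a published schema.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="html">
      <!-- States the requirement and the reason; the XPath stays out of the text -->
      <assert test="head">
        The html element should contain a head element, which contains
        frontmatter and metadata for the page.</assert>
      <!-- Second half of the same constraint, reported separately -->
      <assert test="count(head) &lt;= 1">
        A page has a single set of metadata, so the html element should
        not contain more than one head element.</assert>
    </rule>
  </pattern>
</schema>
```

Neither text mentions counting, casting or namespaces: a stakeholder can read them without knowing XPath, and the tests can be reworked later without touching the statements.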


Over-elaborate context patterns

The third error is more syntactical.

<rule context="//html">
  <assert test="./head">The html element should contain a head 
  element,  which contains frontmatter and metadata for the page.
  </assert>
</rule>

There is no need for the context pattern to start with // because that is built into the semantics of Schematron. Similarly, there is no need for the ./ because an assertion test is always evaluated in the context of a node matched by the parent rule's @context.
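
A small sketch of what the plain pattern already does, assuming the default XSLT-style treatment of @context as a match pattern: the first rule fires for an html element at any depth, while a leading single slash is what you would write if you really did mean only the document element. The two rules are placed in separate patterns because, within one pattern, a node is only tested by the first rule whose context it matches. The assertion text in the second pattern is illustrative.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <!-- Matches every html element, wherever it appears; no // needed -->
    <rule context="html">
      <!-- head is already relative to the matched html element; no ./ needed -->
      <assert test="head">
        The html element should contain a head element, which contains
        frontmatter and metadata for the page.</assert>
    </rule>
  </pattern>
  <pattern>
    <!-- Matches html only when it is the document element -->
    <rule context="/html">
      <assert test="body or frameset">
        The top-level html element should contain the page content,
        held in either a body or a frameset element.</assert>
    </rule>
  </pattern>
</schema>
```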

This is of course a trivial mistake, and one entirely forgivable in newbies. But keeping the XPaths simple is important for maintenance, and these extraneous prefixes unnerve me because they set up an expectation that the writer was not very aware of either the semantics of Schematron or the terseness possible in XPath. (Of course, what unnerves me is only a weak indication of an error, but I hope you see my point!)

Surely there are more serious errors than this trivial one? Err, well, not really. I just don't see people making mistakes in Schematron schemas. Because they are not forced to make constraints for things that have little value, nor impeded by derivation complexity from making small declarations for even little constraints that are important, Schematron has been remarkably easy to maintain at the public level. The mailing list rarely gets "how do I do X?" questions, because people know that a complex constraint is just a matter of learning more about XPath. Of course, knowing the patterns or templates for how to express different kinds of constraints is important (Dr Nic's ZVON site was an early demonstration of this); and indeed there is a feedback effect: once you know that you can express something fairly readily in Schematron, it becomes the kind of thing you want to express with a schema.

For example, once you know that you can reference local external code-lists easily, rather than building code-lists into the schema directly, it becomes the kind of thing you do in your schemas. In XSD, you cannot refer to external code lists (unless they are in XSD notation and you import or include them), so your schemas just don't do it. Again, Schematron starts with the basis that we are in a web of semantically-rich information (readily accessible by XPath) and our schema, validation and reporting may need to make use of information by dynamic reference rather than just cut-and-paste.
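
As a hedged sketch of what such a reference can look like, the rule below checks an attribute against a code list kept in a separate file, using the document() function available under the default XSLT query binding. The price element, the currency attribute, the file name currency-codes.xml and its codes/code structure are all hypothetical names chosen for illustration.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="price">
      <!-- Assumes a local file shaped like:
           <codes><code>AUD</code><code>JPY</code></codes>
           The comparison is true if @currency equals any code element -->
      <assert test="@currency = document('currency-codes.xml')/codes/code">
        The currency of a price should be one of the codes in the shared
        currency code list, so that prices can be compared and converted.</assert>
    </rule>
  </pattern>
</schema>
```

The code list can then be maintained, versioned and published on its own, and the schema does not change when a code is added.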

