Validating Operator Grammars in Schematron

By Rick Jelliffe
July 21, 2010

Harris' Operator Grammars is a linguistic theory that we can adopt (or be inspired by, or perhaps dumb down) for validation purposes.

We don't need more information than is in the Wikipedia summary: we want to suggest Schematron features that can handle dependencies, likelihood and reduction to some extent meaningful for XML documents containing data (let's say it is business or scientific data rather than linguistic data). Where the grammars talk of words, for XML validation in Schematron we talk instead about nodes in an XPath Data Model information set.

Dependency

In Operator Grammar, we can first identify operator nodes. Operators are nodes which require argument nodes. The most likely markup to act as an operator is, of course, a container element. But links and id references are also operators.

So here is an example:

   <sch:pattern id="operators">
      <sch:title>Operators and Dependencies</sch:title>

      <sch:rule context="/">
         <sch:assert test="html">The root element should be 'html'.</sch:assert>
      </sch:rule>

      <sch:rule context="/html">
         <sch:assert test="head">The 'html' element should have a child 'head'.</sch:assert>
         <sch:assert test="body">The 'html' element should have a child 'body'.</sch:assert>
      </sch:rule>

      <!-- The operator @lang reaches its argument through the parent axis -->
      <sch:rule context="@lang">
         <sch:assert test="parent::html or parent::body">The lang attribute should only
         appear on the html or body elements.</sch:assert>
      </sch:rule>
      ...
      <!-- A rule context is a pattern, so alternatives are joined with | rather than "or" -->
      <sch:rule context="br | hr" role="not-operators">
         <sch:assert test="true()">The elements br and hr have no dependencies</sch:assert>
      </sch:rule>

      <sch:rule context="*">
         <sch:assert test="false()">No other elements are allowed apart from the
         HTML operator and argument elements</sch:assert>
      </sch:rule>
   </sch:pattern>

Note: The purpose of this is informative. I have not run the code so there could be syntax errors.

In this simple example, we go through all the operators using one rule each. Note that sometimes the simple child relationship is enough, but I have used the parent relationship to get one argument for the operator @lang.

In the example, the second-last rule eliminates nodes that are arguments. (Nodes are matched one by one against each rule in a pattern, from top to bottom, until a rule's context matches; later rules in that pattern are then skipped for that node.) This lets the last rule report any remaining nodes (elements) with unexpected names.
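
Links and id references were named above as operators too. Here is a minimal sketch of such a rule, assuming HTML-style fragment links (an a element whose href starts with '#') whose argument is whichever element carries the matching id. The pattern id and the message wording are made up for illustration, and like the example above this has not been run.

```xml
<sch:pattern id="link-operators"
             xmlns:sch="http://purl.oclc.org/dsdl/schematron">
   <sch:title>Links as Operators</sch:title>

   <!-- Operator: an internal link. Argument: the element it points to. -->
   <sch:rule context="a[starts-with(@href, '#')]">
      <sch:assert test="//*[@id = substring-after(current()/@href, '#')]">
         The link '<sch:value-of select="@href"/>' should point to an
         element with a matching id attribute.
      </sch:assert>
   </sch:rule>
</sch:pattern>
```

It is kept as a separate pattern so that its rule does not interfere with the top-to-bottom rule matching in the pattern above.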

[Update: I see that Zellig Harris himself later called this partial order constraints rather than dependency.]

Likelihood

We can put aside notions of probability when treating likelihood, and just say that the likelihood is the strategy for locating an argument.

For example, take this:


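<sch:rule context="table/tr/th">
   <!-- Column number of this heading: its preceding th siblings plus one.
        Rows that themselves hold headings are left out of the comparison. -->
   <sch:assert test=
      "count(../../tr[not(th)]) = count(../../tr[count(td) >= count(current()/preceding-sibling::th) + 1])">
   Every heading in a table has a corresponding cell in every row.
   </sch:assert>
</sch:rule>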

That is how we would do things in Schematron using XSLT, and it is a perfectly good idiom.

However, for the operator grammar, we want to actually locate each corresponding cell. So we might invert and revise the assertion, and call the cell the operator and the heading the argument.

<sch:rule context="table/tr/td">
   <!-- Pick the heading in the same column: preceding td siblings plus one -->
   <sch:assert test=
      "../../tr[1]/th[count(current()/preceding-sibling::td) + 1]">
   Every cell in every row should have a corresponding header.
   </sch:assert>
</sch:rule>

In the XSLT2 binding for Schematron, we could make this a first-class object by defining a custom XSLT2 function (which Schematron allows you to do) that provides a nice name for the accessor function.

<sch:rule context="table/tr/td">
   <sch:assert test=
      "my:find-heading(.)">
   Every cell in every row should have a corresponding header.
   </sch:assert>
</sch:rule>
...
<xsl:function name="my:find-heading">
      ...
</xsl:function>

This approach allows you to have a series of accessor functions: the heading might be in ../../tr/th or, if not, in ../../tbody/tr/th (using tbody containers) or, if not, in the previous table in the same section: ancestor::table[1]/preceding-sibling::table[1]/tr/th.
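
A sketch of how that series of look-ups might be written as one accessor. The assumptions are mine: the xslt2 query binding, an invented namespace URI for the my prefix, the parameter and variable names, and elements in no namespace. It has not been run.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            xmlns:my="http://example.org/my-functions"
            queryBinding="xslt2">

   <!-- Hypothetical namespace for the accessor functions -->
   <sch:ns prefix="my" uri="http://example.org/my-functions"/>

   <!-- Returns the first heading found, trying each likely place in turn;
        returns the empty sequence (false in a test) if none is found. -->
   <xsl:function name="my:find-heading" as="element(th)?">
      <xsl:param name="cell" as="element(td)"/>
      <xsl:variable name="column"
                    select="count($cell/preceding-sibling::td) + 1"/>
      <xsl:variable name="table" select="$cell/ancestor::table[1]"/>
      <xsl:sequence select="($table/tr/th[$column],
                             $table/tbody/tr/th[$column],
                             $table/preceding-sibling::table[1]/tr/th[$column])[1]"/>
   </xsl:function>

   <sch:pattern id="likelihood">
      <sch:rule context="table/tr/td | table/tbody/tr/td">
         <sch:assert test="my:find-heading(.)">
            Every cell in every row should have a corresponding header.
         </sch:assert>
      </sch:rule>
   </sch:pattern>
</sch:schema>
```

The comma operator keeps the candidates in the order written, so the trailing [1] picks the heading from the first place that has one. The order of the three paths is the likelihood.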

This covers the case of a flattenable schema, where an element may contain its contents directly or it may refer to them: XSD (W3C XML Schemas) is an example of this, where you can have types put locally (embedded) or globally (referenced). The dependency is that an element must have a type, but the likelihood is that the type will be a global reference or an inline declaration.
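
As a rough sketch of that XSD case: one assertion states the dependency, and a second tries to locate a referenced type. It assumes the xs prefix is declared with sch:ns for http://www.w3.org/2001/XMLSchema, that unprefixed type names refer to global types in the same schema document, and that a missing type should be reported (which is stricter than XSD itself). The pattern id is invented and the code has not been run.

```xml
<sch:pattern id="element-types"
             xmlns:sch="http://purl.oclc.org/dsdl/schematron">

   <sch:rule context="xs:element[not(@ref)]">
      <!-- Dependency: the element declaration needs a type,
           either referenced (global) or embedded (local) -->
      <sch:assert test="@type or xs:complexType or xs:simpleType">
         An element declaration should have a type.
      </sch:assert>

      <!-- Likelihood: an unprefixed reference should be found among
           the global types; prefixed names are not checked here -->
      <sch:assert test="not(@type) or contains(@type, ':')
                        or /xs:schema/xs:complexType[@name = current()/@type]
                        or /xs:schema/xs:simpleType[@name = current()/@type]">
         The type '<sch:value-of select="@type"/>' should be declared
         globally in this schema.
      </sch:assert>
   </sch:rule>
</sch:pattern>
```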

In some situations, Schematron's abstract pattern facility might be applicable too: the dependency declarations being represented by an abstract pattern, and the likelihoods being represented as parameters. You can see examples of parameterized abstract patterns used like this in Example 8.2 in Dave Pawson's online book ISO Schematron Tutorial.
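
A small sketch of that idea, reusing the table example. The pattern ids and parameter names are mine, and this has not been run.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">

   <!-- Dependency: stated once, with no look-up strategy -->
   <sch:pattern abstract="true" id="cell-needs-heading">
      <sch:rule context="$cell">
         <sch:assert test="$heading">
            Every cell in every row should have a corresponding header.
         </sch:assert>
      </sch:rule>
   </sch:pattern>

   <!-- Likelihood 1: the heading is in the first row of the table -->
   <sch:pattern is-a="cell-needs-heading" id="heading-in-table">
      <sch:param name="cell" value="table/tr/td"/>
      <sch:param name="heading"
         value="../../tr[1]/th[count(current()/preceding-sibling::td) + 1]"/>
   </sch:pattern>

   <!-- Likelihood 2: the rows are wrapped in a tbody container -->
   <sch:pattern is-a="cell-needs-heading" id="heading-in-tbody">
      <sch:param name="cell" value="table/tbody/tr/td"/>
      <sch:param name="heading"
         value="../../tr[1]/th[count(current()/preceding-sibling::td) + 1]"/>
   </sch:pattern>
</sch:schema>
```

Each instantiation is an independent pattern, so this suits alternative document structures rather than a single try-this-then-that sequence for the same node. For the latter, the accessor function approach is probably the better fit.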

Reduction

So this leaves reductions. To help figure out what might be an implementation strategy, we of course need to figure out what the XML analogs of Harris' reductions might be: the use case.

I can think of a couple. The first is attribute or element implication, where a default value stands in for an element or attribute that is not present. DTDs and XSD provide this for attributes only. RELAX NG and standard Schematron do not do infoset augmentation, however.

The second is XML Schema's nillable types: this strange animal is a typical XML Schemas-ism of providing something with as much complexity and as little power as possible. (It was supposed to help database data transfers, but seems hopeless. XML Schemas 1.1 added some more things to allow dynamic context to determine validity, but still did not manage to specify nillable as just another syntax for this.)

I can see four ways to approach reduction in Schematron:

  • Add an extra term in assertions, allowing some dependency to be missing. Seems a terrible approach.
  • Post-process the SVRL output (of the validation) to remove svrl:failed-assert elements relating to missing dependencies.
  • Mark such svrl:failed-assert elements in the SVRL output with extra information that the missing dependency is allowed, using the sch:assert/@role attribute.
  • Process the document to imply the missing data, then validate this.
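
The second and third approaches can be combined in a small sketch: mark the reducible assertions with a role in the schema, then filter the SVRL. The role value "reducible" is an invented convention, not part of Schematron or SVRL, and the stylesheet has not been run.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:svrl="http://purl.oclc.org/dsdl/svrl">

   <!-- Assumes the schema marks dependencies that may be missing, e.g.
        <sch:assert test="@units" role="reducible">...</sch:assert>
        so that the role is carried over to svrl:failed-assert/@role. -->

   <!-- Identity template: copy the SVRL report unchanged -->
   <xsl:template match="@* | node()">
      <xsl:copy>
         <xsl:apply-templates select="@* | node()"/>
      </xsl:copy>
   </xsl:template>

   <!-- Drop failures for dependencies that are allowed to be missing -->
   <xsl:template match="svrl:failed-assert[@role = 'reducible']"/>
</xsl:stylesheet>
```

Leaving out the last template gives the third approach: the failures stay in the report, and the consumer decides what to do with the marked ones.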

And, in fact, Schematron can be used for this last method, which might be the most satisfactory in some situations. The draft of the proposed new version of ISO Schematron allows you to declare a set of properties that link to the assertion or rule. If the assertion fails, these properties are evaluated (for any dynamic values from the data) and the result can augment the SVRL. The default values can be specified using properties. The SVRL result of validation, augmented with these properties, could be folded back into the original data and this data revalidated; repeat until no more properties are generated.
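
A sketch of what such a declaration might look like. It assumes the properties syntax as it was eventually published in the 2016 edition of ISO Schematron, which may differ from the draft discussed here; the property id and the default value are invented, and how a given implementation reports properties in SVRL may vary.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">

   <sch:pattern id="defaults">
      <sch:rule context="html">
         <!-- If the dependency is missing, the linked property
              carries the value that could be implied -->
         <sch:assert test="@lang" properties="default-lang">
            The 'html' element should have a lang attribute.
         </sch:assert>
      </sch:rule>
   </sch:pattern>

   <sch:properties>
      <!-- Hypothetical default value for the missing attribute -->
      <sch:property id="default-lang">en</sch:property>
   </sch:properties>
</sch:schema>
```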

Update: I think the previous section does not approach the concept of reduction well, and that it should not (or need not) be related to infoset augmentation or other kinds of macro expansion. Instead, I think reduction as it applies to XML validation would be more concerned with issues of naming and retaining information that is not necessarily in a regular context: it relates to what happens when the information required is not immediately on hand. In fact the information required for validation may even be outside the current document...

To state it from a slightly different angle: what kinds of questions would a validation system that took its basic categories from Harris' Operator Grammars encourage you to ask?

  • Dependency: can we express basic dependencies between nodes (in an abstract way that is independent of actual look-up strategies)?
  • Likelihood: can we express the sequence of places to look for resolving these dependencies?
  • Reduction: can we keep track of information in previous contexts, and even information from outside the current document, in case information important for validation is not part of the document (or of the schema)?

In all three cases, it seems to me that the answer for XML Schemas (and RELAX NG etc.) is no and the answer for Schematron is yes.

It is very hard to track down material on Zellig Harris' theory on the WWW (there seems to be something else also called Operator Grammars): but the possibility is intriguing that there may be some existing formal theory that could be useful for characterizing Schematron.

