Validation using tries and feature sets

An approach for making Schematron usable for coarse-sieve validation of very large XML documents

By Rick Jelliffe
June 8, 2009

A critical decision in designing a schema language is whether validation should be an operation performable in a single pass (or some bounded number of passes) over the document (as with an unambiguous grammar), by a number of parallel passes or with backtracking (as with ambiguous grammars), or whether it can utilize random access (as with Schematron's XPaths).

Sometimes a smarter algorithm comes along that busts the conventional wisdom: RELAX NG demonstrated that some ambiguous grammars could be implemented in an efficient one-pass algorithm without parallelism or backtracking (using the rewriting or derivatives algorithm, which had fallen between the cracks).

One-pass algorithms have the attraction that, for XML validation, they are typically implemented directly on the SAX event stream, supposedly allowing high efficiency and certainly allowing the validation of documents larger than physical or virtual process memory. Random-access methods are typically implemented over an in-memory tree structure, which for languages with bounded process space, such as Java, imposes a strict upper limit on the size of documents that can be validated.

There are, of course, wrinkles. XSD, for example, extracts various key and ID reference information items, then validates them against the extracted ID lists, in effect creating a second pass over a subset of the document. Schematron can potentially be implemented in a single-pass streaming fashion by using STX, the streamable dialect of XSLT. And the rather crummy assertion mechanism in the XSD 1.1 drafts uses only a backward- or downward-looking subset of XPaths, which can be evaluated in a single-pass fashion when the end-tag event for an element is reached, but which would probably require storage of the child nodes of an element, and thus be the same in storage terms as a random-access validator for pathologically shallow, long documents.

However, there is another design or implementation option for validation, which is to generate a trie for the document and then validate that trie. Because our schema languages attempt to validate more than a trie can represent, we must in fact also extract a feature set from our schemas: this is true whether the schema is grammar-based or Schematron-style. This gives us weak validation (a note I wrote on this a decade ago, and here I am still banging on about it), but it may also allow performance or implementation characteristics that are useful in some situations, in particular efficient single-pass streaming validation of very large documents. (Knuth's treatment of tries is very clear.)

Simple trie and feature set

Let us give an example, and use that weak validation note as the example: it is 1999 HTML, but let's process it through an imaginary TagSoup or Tidy first to give us XHTML. We first take a pass through the document and extract each unique full XPath; we store these as we go in a trie (pronounced 'tree') data structure:

(I apologize to listeners of the speech synthesized version: these next fragments
don't read well.)

html		/ whitespace()
		/ head 	/ meta 	/ @http-equiv
						/ @content
						/ @name 	
				/ title		/ text()
		/ body	/ @lang
				/ p		/ @align
						/ img			/ @src
									/ @alt
									/ @width
									/ @height
						/ whitespace()
						/ text()
						/ br
						/ a			/ @href
									/ text()
				/ h1		/ @align
						/ text()
				/ h2		/ text()
				/pre		/ text()
				/ul		/ whitespace()
						/ li			/ text()
									/ tt		/ text()
				/ hr

In the above, I am representing the trie data structure as a set of XPaths with tabbing to represent common parents. You can see that this is a very terse and small representation of a document; it would only blow out in the unusual case where the large document contained mostly unique element or attribute names.
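
For concreteness, here is a minimal sketch in Python (my illustration only, not any particular library's API) of how such a trie of unique paths might be accumulated from a SAX stream, using nested dictionaries as trie nodes:

import xml.sax

class PathTrieBuilder(xml.sax.ContentHandler):
    def __init__(self):
        self.trie = {}               # nested dicts: name -> trie of children
        self.stack = [self.trie]     # trie nodes along the current path

    def startElement(self, name, attrs):
        node = self.stack[-1].setdefault(name, {})
        for attr in attrs.getNames():
            node.setdefault("@" + attr, {})
        self.stack.append(node)

    def endElement(self, name):
        self.stack.pop()

    def characters(self, content):
        # note whether the current element holds text or only whitespace
        key = "whitespace()" if content.isspace() else "text()"
        self.stack[-1].setdefault(key, {})

def build_trie(filename):
    handler = PathTrieBuilder()
    xml.sax.parse(filename, handler)
    return handler.trie

Because setdefault only ever adds a node for a path that has not been seen before, the memory used is bounded by the number of distinct paths, not by the size of the document.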

Now let us derive our feature set from the HTML schema: the schema or DTD is trimmed to only those features determinable in the trie. In the case of an XSD, this will involve a series of strength-reducing steps, such as flattening the schema to remove any sequence or alternative information and replacing cardinality with 0 or 1. So the content model (a, b+, (c | d)[2-10]) becomes (a & b? & c? & d?). This obviously works best for either the (x | y | z)* content models of rich text or the <all> group structures of database rows, but it works equally well for ambiguous and unambiguous grammars.

So, in effect, our HTML schema becomes something more like the following (feature set):


element html { head? & body? & frameset? }
element head {
    element meta {
        attribute http-equiv? & attribute content? & attribute name }? &
    element title { text } &
    element script { text }?
}
...

Now we validate the trie against that weaker schema.

This tells us a variety of information about the document as a whole: are there any elements or attributes with unexpected names or with incorrect parents? Are there any elements or attributes that are required but always missing?
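
Continuing the Python sketch above, the feature set can be held as a simple table, and those two checks become a walk over the trie (the table entries here are a hypothetical fragment of the weakened XHTML schema, purely for illustration):

FEATURE_SET = {
    "html": {"allowed": {"head", "body", "frameset", "whitespace()"}, "required": set()},
    "head": {"allowed": {"meta", "title", "script", "whitespace()"},  "required": {"title"}},
    "meta": {"allowed": {"@http-equiv", "@content", "@name"},         "required": {"@content"}},
    # ... one entry per element in the feature set
}

def validate_trie(trie, parent=None):
    errors = []
    parent_rules = FEATURE_SET.get(parent)
    for name, children in trie.items():
        # unexpected name, or an expected name under the wrong parent
        if parent_rules and name not in parent_rules["allowed"]:
            errors.append("unexpected %s inside %s" % (name, parent))
        rules = FEATURE_SET.get(name)
        if rules:
            # required children or attributes that never occur under this name
            for missing in rules["required"] - set(children):
                errors.append("%s never has required %s" % (name, missing))
        errors.extend(validate_trie(children, name))
    return errors

Note that, because the trie merges every occurrence of an element into one node, the second check can only report children that are always missing, which is exactly the weakening described above.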


More elaborate trie

This is a good starting set, and it should be obvious that the trie can be enhanced with cardinality and position outlier information for better validation. For example, let the decoration [n,m] mean that the lowest number of times the element was found as a child of that parent was n, and the highest was m.


html [1,1]      / whitespace()
                / head [1,1]    / meta [3,3]    / @http-equiv
                                                / @content
                                                / @name
                                / title [1,1]   / text()
                / body [1,1]    / @lang
                                / p [21,21]     / @align
                                                / img [0,1]     / @src
                                                                / @alt
                                                                / @width
                                                                / @height
                                                / whitespace()
                                                / text()
                                                / br [1,1]
                                                / a [0,1]       / @href
                                                                / text()
                                / h1 [1,1]      / @align
                                                / text()
                                / h2 [5,5]      / text()
                                / pre [11,11]   / text()
                                / ul [3,3]      / whitespace()
                                                / li [2,5]      / text()
                                                                / tt [0,2]      / text()
                                / hr [1,1]

Now this would allow better validation from the same feature set: it can check whether the maximum and minimum occurrences of an element in context match the simple optionality requirements. Similar decorations can be added for positional boundaries too.
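
As a sketch of how the [n,m] decorations might be gathered (again in Python, and again my own illustration): count each occurrence of a parent's children separately, then fold those counts into running minimum and maximum figures on the trie nodes.

import xml.sax
from collections import Counter

def new_node():
    return {"min": None, "max": 0, "occurrences": 0, "children": {}}

class CountingTrieBuilder(xml.sax.ContentHandler):
    def startDocument(self):
        self.root = new_node()
        self.stack = [(self.root, Counter())]

    def startElement(self, name, attrs):
        parent, parent_counts = self.stack[-1]
        parent_counts[name] += 1
        node = parent["children"].setdefault(name, new_node())
        counts = Counter()
        for attr in attrs.getNames():
            node["children"].setdefault("@" + attr, new_node())
            counts["@" + attr] += 1
        self.stack.append((node, counts))

    def endElement(self, name):
        self.fold(*self.stack.pop())

    def endDocument(self):
        self.fold(*self.stack.pop())    # fold the document element itself

    def fold(self, node, counts):
        # fold this occurrence's child counts into the [min, max] decorations
        for child_name, child in node["children"].items():
            n = counts[child_name]
            child["max"] = max(child["max"], n)
            if child["min"] is None:
                # first fold for this child: any earlier occurrence of the
                # parent lacked it entirely, so the minimum is already 0
                child["min"] = 0 if node["occurrences"] > 0 else n
            else:
                child["min"] = min(child["min"], n)
        node["occurrences"] += 1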


Performance

What are the costs of this? Obviously, building new paths in the trie will cost something, but after that the cost is just tracing through existing paths. Isn't this actually exactly the same as having a grammar, though?

In effect, the trie is a custom schema for the specific document, and could be executed as an FSM operating on the SAX stream (vacuously). In XSD terms, the feature set can be viewed as a schema from which the real schema can be derived by restriction (every document that is valid against the real schema will be valid against the feature set schema). And so validation of the trie can be viewed as merely checking that it is also derived by restriction from the feature set schema.

A grammar-based validator will build a stack as it goes too, so I don't expect any difference in order between the performance of using the trie and just validating with the feature-set grammar directly (or, indeed, with the full real schema). So the kinds of speed-ups possible from this would seem to be only of the smaller kind, where less work is done at each input event by deferring decision-making about validity until after the pass over the document is completed.

Schematron angle

However, where this approach piques my interest is that it can also be used for Schematron schemas. Take the schema


  <pattern>
      <title>Element rules</title>

      <rule context="html">
         <assert test="count(body) + count(head) + count(frameset) = count(*)">
         The html element can only contain head, body or frameset elements.
         </assert>

      </rule>

      <rule context="meta">
         <assert test="@content">
         The meta element should have a content attribute </assert>
         <assert test="@name or @http-equiv">
         The meta element should have a name attribute or else an
         http-equiv attribute</assert>
         <assert test="not(*)">
         The meta element should not have any elements in it.
         The meta element should have content</assert>
      </rule>
      ...
  </pattern>

We can derive a weaker version of the Schematron schema, if we work out the exact transformations of the XPaths, to make a feature set schema. Strength reduction is a little less satisfactory here, because the Schematron schema may well be an open schema rather than a closed one. The brute-force method is merely to recognize the idiom of each assertion and to remove any assertions that either have an inappropriate idiom or do not have an idiom we can recognize.

  <pattern>
      <title>Element rules (simplified)</title>
 
      <rule context="html">
         <assert test="count(body) + count(head) + count(frameset) = count(*)">
         The html element can only contain head, body or frameset elements.
         </assert>
      
      </rule>
     
     <rule context="meta">
         <assert test="@content">
        The meta element should have a content attribute </assert> 
         <assert test="not(*)">
         The meta element should not have any elements in it.
         The meta element should have content</assert>
      </rule>
      ...
  </pattern>

In fact, we can then test this directly, merely by emitting our trie as XML and validating that with normal Schematron. (The same goes for the grammar-based schemas too!)
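
A sketch of that emission step, continuing the Python illustration: each element node of the trie becomes an element that occurs exactly once, and each attribute node becomes an attribute with a dummy value, so that tests such as count(body) + count(head) + count(frameset) = count(*) and @content can be evaluated over the emitted document by any ordinary Schematron implementation.

import xml.etree.ElementTree as ET

def trie_to_xml(name, children):
    elem = ET.Element(name)
    for child_name, grandchildren in children.items():
        if child_name.startswith("@"):
            elem.set(child_name[1:], "present")          # dummy value
        elif child_name not in ("text()", "whitespace()"):
            elem.append(trie_to_xml(child_name, grandchildren))
    return elem

# e.g., with the trie from build_trie() above:
#   root_name, root_children = next(iter(trie.items()))
#   ET.ElementTree(trie_to_xml(root_name, root_children)).write("trie.xml")

Of course, only assertions whose idiom survives the merging of occurrences (presence, absence, allowed names) remain meaningful against such a document; that is what the strength reduction is for.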

What it does give us, however, is an algorithm by which Schematron can run, in effect, in a single streaming pass over even the largest document, for certain of its constraints.

By now, I expect the question has formed in your mind: wouldn't the strength-reduced Schematron schema (the feature set schema) be so impoverished as to be almost useless, given that Schematron schemas are so often used for more intricate constraints, notably value-checking constraints?

I would agree, but also point out that one reason that Schematron is sometimes not used is precisely because of its memory requirements.

The trie/feature-set approach could be used on the document first, as a coarse sieve; then, if the document passes, full Schematron validation might be used. Or, rather than validating the full document against RELAX NG and then against Schematron, there may be performance savings in running strength-reduced validation of both the RELAX NG and the Schematron schemas over the same trie.

Of course, the natural next step would be to augment the trie with information that suits the idioms of Schematron. For example, it would be possible to parse each simple content string and see whether it is, lexically, one of the XPath lexical types: number, boolean, string or empty. Or to memoize the boundary strings for each element or attribute (on the utterly hopeful assumption that these may well be bad values; obviously the mileage for some kinds of constraint simplification will vary enormously).
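
For the first of those decorations, a tiny sketch of the classification (the category names follow the XPath view above; whether 'true'/'false' should count as boolean is a matter of taste, and the function name is mine):

def xpath_lexical_type(s):
    if s.strip() == "":
        return "empty"
    if s in ("true", "false"):
        return "boolean"
    try:
        float(s)
        return "number"
    except ValueError:
        return "string"

The trie node for an element or attribute would then carry the set of categories seen across all its occurrences.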

Indeed, by looking at the value, you could also see which XSD datatypes it corresponded to: imagine a bitset for each element with one bit per built-in XSD type, so that when the instance has an attribute X with a value of 10, the trie augmenter would set the xs:byte bit, and when it found a value of 300 it would set the xs:short bit; when validating the trie for datatypes, it would see whether the corresponding bit were set. If any bits belonging to base types were set, these would be out-of-range values if the schema had specified a derived-by-restriction simple type.
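
Here is a sketch of that bitset idea, restricted to the built-in integer types for brevity (the ranges are the XSD ones; the bit layout and function names are just my illustration):

XSD_INT_TYPES = [                    # narrowest first, each a restriction of the next
    ("xs:byte",  -2**7,  2**7  - 1),
    ("xs:short", -2**15, 2**15 - 1),
    ("xs:int",   -2**31, 2**31 - 1),
    ("xs:long",  -2**63, 2**63 - 1),
]
BIT = {name: 1 << i for i, (name, _, _) in enumerate(XSD_INT_TYPES)}

def classify(value):
    # bit of the narrowest integer type that holds the value, or 0 if none does
    try:
        n = int(value)
    except ValueError:
        return 0
    for name, lo, hi in XSD_INT_TYPES:
        if lo <= n <= hi:
            return BIT[name]
    return 0

def augment(bits, value):
    return bits | classify(value)

def valid_against(bits, declared):
    # any bit set for a type broader than the declared one means some
    # observed value was out of the declared type's range
    allowed = 0
    for name, _, _ in XSD_INT_TYPES:
        allowed |= BIT[name]
        if name == declared:
            break
    return (bits & ~allowed) == 0

# e.g. bits = augment(augment(0, "10"), "300")
#      valid_against(bits, "xs:byte")   ->  False (300 is out of range)
#      valid_against(bits, "xs:short")  ->  True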

The validation of the trie could be offloaded to another processor. Or it could be used to optimize subsequent finer-grained validation, along the lines I presented in Thinking about optimizing a custom Schematron validator which trims the schema of unneeded context tests.

Or, if validating from a SAX stream, for example, the trie for each element or attribute could include sequence numbers of the first and last occurrence of the event, which could help some implementations that did not handle the //x[@id=current()/@ref] idiom efficiently.

Related Material

Regular readers will see this as related to other material I have talked about: progressive validation, feasible validation, partial ordering (Hook), usage schema, and path validation languages.

Using weaker schemas is of course one of the great reliefs when implementing schema converters, such as the XSD to Schematron converter.

In my mind, it is especially connected to Logical inferences from Schematron schemas. (Note also MSM's Logic Grammars and XML Schemas)



Update: I see there is another, better name for this kind of thing in the academic literature: a "data guide".
