Thinking about optimizing a custom Schematron validator

Using the document paths to prune the schema

By Rick Jelliffe
May 20, 2009

One of my thought projects is creating a custom Schematron validator that didn't use XSLT, but was based on an XPath implementation designed to be optimal for Schematron. In most cases, the performance of my ASCC/ validator is completely good enough, especially when used with Michael Kay's SAXON 9, but in the scheme of things, the faster the better.

And sometimes it is not speed but memory: see Henri Sivonen's comments Lowering Memory Requirements by replacing Schematron (about James Clark's Jing implementation of Schematron 1.5, however I am sure the same thing could be true of my SAXON/XSLT implementation.) Schematron using XSLT works on a random access memory model which in its simplest or worse case implementation needs the whole document in memory. There is of course a lot of work around trying to optimize XPath: apart from SAXON, I have seen no evidence of this work being directed in Xalan's direction. In the last decade, academics were largely just factory workers for making ideas to pitch to venture capitalists; I hope a benefit of the recession will be more focus on implementing their research ideas to augment FOSS software.

It is not just total speed, but the speed to get the first result that seems important to some people. This was the basis of my Optimizing time-performance of streaming Schematron note. In that note I commented that if you validate as you build the document tree, there are some constraints that can be checked when you arrive at a start-tag (e.g. x/parent::y), some that can be checked when you arrive at an end-tag (e.g. count(*)=1) and others that potentially require random access to the whole document. Because of the way Schematron is structured, it is possible to analyse the various XPaths and divide a schema into an upward-looking schema, a downward-looking schema and a document-end schema.

But I think there are some other characteristics of Schematron that might make some other kinds of optimizations or heuristics useful.

First, Schematron XPaths are often concerned with "does a path exist?"

Second, Schematron XPaths are often concerned with "Is this count correct?"

Third, Schematron XPaths are often concerned with "Is this value found elsewhere in the document?" (Such as IDs)

In all these cases, the extreme condition is where the nodes looked for have no occurrence. So one possible optimization that springs to mind is to test each XPath (or each consecutive step of steps factoring out paths in predicates) to see if the path exists at all in the document, in a preliminary streaming pass, and pruning the pattern accordingly.

That is a bit abstract. Here is an example.

We start with the Schematron rule:

<rule context="pets/dog[@type="fluffy" ]/collar">
<assert test="//owner/name = current()/nametag/owner"
The owner on a fluffy dog's nametag should be one
of the registered owners.

We check (in a preliminary pass) to check

  1. Is there at least one occurrence of //pets/dog/@type/../collar?

  2. Is there at least one occurrence of //pets/dog/@type/../collar/nametag/owner?

  3. Is there at least one occurrence of //owners/name?

Then use this information to mark the schema. If 1. is false then there is never any point in testing the assertions. If 1. is false and is the last rule in the pattern, there is never any point in finding the matching context nodes so that subsequent context are correctly found.

If 1. is true and 2. and/or 3. is false, then the assertion test can be replaced with false() since it will fail each time.

Obviously this kind of optimization would work best on paths that are singletons, and on documents that were relatively incomplete. It would be useless where the document did have lots of name-tagged dogs and owners. It is the kind of dynamic analysis is what you might do if you were making a preliminary pass (e.g. in order to extract IDs) in the first place, as medium-low hanging fruit. (Other optimizations spring from it, of course: having found the first occurrence, memo-ize the position so that the full search by the second path can restart from that location. And values used in predicates against current() can be treated as a kind of ID and stored in an index.)

I am very interested in gathering ideas for this kind of thing: I find the academic papers are often hard and sometimes don't deliver quite what they promise (e.g. you read a promising paper on rewriting reverse axes XPaths as forward Axes XPaths, but discover it requires a new pass for each reversal) but I'd love to have some of the new ideas and possibilities explained more in programmer-friendly terms.

You might also be interested in:

News Topics

Recommended for You

Got a Question?