Another insane Schematron patent?

This time it is IBM

By Rick Jelliffe
March 31, 2010

I don't know whether to be pleased or furious about this.

I have previously blogged on several instances of patents being granted on what seem like straightforward implementation ideas for ISO Standards. And I have drawn attention to the common problem that statements are made in patent applications that are either untrue or neglect unpleasant facts.

Here is another one: Method and System for Validating XML Document assigned to IBM, the world's greatest monopolist as far as patents are concerned. What it attempts to patent is where you have a Schematron schema, but you are only interested in certain rules or assertions: as a use-case, it is not novel but something that every interactive editor using schemas has to work through.

One intended application seems to be something like an interactive editor or form, where you select some element and then validate whether it is valid against some particular targeted set of rules in the schema. If that is indeed all it is, the people involved should be ashamed of themselves: why don't they get an honest job making up genuinely new ideas?

But the guts of the patent seem to be little more than common subexpression evaluation or caching. If one rule has a context of X/Y/Z/2 and another rule has X/Y/Z/2 then you store X/Y/Z in a context registry for use by both. It is a nice technique for Schematron in this situation. But caching intermediate results is patentable? Really?
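To make the caching idea concrete, here is a minimal sketch in plain XSLT 1.0. It is not taken from the patent or from any Schematron implementation. The child names A and B and the two assertions are made up for illustration. The shared prefix X/Y/Z is bound once to a variable and both "rules" start from it.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- The shared context prefix, bound once. This variable plays the
       role of an entry in a "context registry". -->
  <xsl:variable name="shared" select="/X/Y/Z"/>

  <xsl:template match="/">
    <!-- First rule: context X/Y/Z/A (A is a hypothetical name) -->
    <xsl:for-each select="$shared/A">
      <xsl:if test="not(@id)">
        <xsl:text>A should have an id&#10;</xsl:text>
      </xsl:if>
    </xsl:for-each>
    <!-- Second rule: context X/Y/Z/B (B is a hypothetical name) -->
    <xsl:for-each select="$shared/B">
      <xsl:if test="not(normalize-space())">
        <xsl:text>B should not be empty&#10;</xsl:text>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Whether the processor really evaluates the variable only once is up to the processor, but nothing in the language stops it, which is rather the point.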

Now, let me immediately temper my furious comment above: if IBM has contributed this patent to their Common Patent Pool for RAND RF use, and if that was the purpose all along, then they and everyone involved are to be commended. The US (and Chinese, in this case) patent systems would still stink for allowing software patents in general and patents that sabotage the International Standards system in particular, but I know that many patents are made defensively.

But it is strange to me that, after first having pioneered the idea that a Schema can be divided up into pieces that are independent of each other and providing first-class language constructs for it (i.e., Schematron patterns) for a separation of concerns, and that schema validation can be divided up into different selections of things of interest and providing first-class language constructs for that (i.e., Schematron phases), that someone gets a patent on just validating smaller pieces and finer selections: it is just repeating functionality that is built into the larger language. The user could, indeed, just write their schema so that the patterns and phases were at the level of granularity of their interest...
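As an illustration of that last point, here is a small sketch of an ISO Schematron schema in which each pattern holds a single rule, so that a phase can select validation at roughly rule-level granularity. The element names (book, title, author) and all the ids are invented for the example.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">

  <!-- A phase names the patterns that are active when it is selected. -->
  <sch:phase id="titles-only">
    <sch:active pattern="p-title"/>
  </sch:phase>
  <sch:phase id="full">
    <sch:active pattern="p-title"/>
    <sch:active pattern="p-author"/>
  </sch:phase>

  <!-- One rule per pattern: the pattern is the unit of selection. -->
  <sch:pattern id="p-title">
    <sch:rule context="book">
      <sch:assert test="title">A book should have a title.</sch:assert>
    </sch:rule>
  </sch:pattern>

  <sch:pattern id="p-author">
    <sch:rule context="book">
      <sch:assert test="author">A book should have an author.</sch:assert>
    </sch:rule>
  </sch:pattern>

</sch:schema>
```

Validating with the phase titles-only checks only the title assertion; the phase full checks both. How the phase is chosen at run time depends on the implementation.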

Crap

Let's look at some statements in the Detailed Description section:

[0022]The XML document validation process with Schematron is divided into two procedures. The first procedure is transforming a Schematron rule document into an intermediate document, i.e. Validator.xslt document 30 by executing Schematron.xslt document 20 in an XSLT engine 60. The second procedure is executing the Validator.xslt document 30 in the XSLT engine 60 to perform validation for XML document 40 to generate a final validation report 50.

That is OK, as far as an all-XSLT implementation goes. As I have frequently commented, there is no requirement that Schematron be implemented using XSLT, as long as the implementation has equivalent libraries; I was glad to see that the patent at least does not assume that all Schematron implementations must be XSLT-based like the skeleton.

[0023]A performance issue will be caused by the two rounds of XSLT transformation. A Schematron schema is transformed into an equivalent XSLT document, and is commonly implemented as a meta-style sheet, called skeleton. This skeleton is applied to the Schematron schema and the resulting XSLT is in turn applied to the XML instance document. Because there are two rounds of transformations based on XSLT in the validation process, the performance often becomes a critical issue especially in some environments requiring real-time processing, such as registry and repository with a large number of concurrent operations by end users and applications.

This seems to be based on the idea that you cannot cache a compiled schema? That is such rubbish that it is hard not to see it as deliberately misleading.

[0024]Such an XSLT based validation method lacks shareable rule context. The contexts of each rule are not shareable, so that many nodes are traversed more than once in the validation process. It is also another critical issue for performance.

But every aspect of how an XSLT stylesheet runs is implementation-dependent. The point of having a functional language for XSLT was that it would allow all sorts of optimizations, such as parallelized functions. You only need to look at Michael Kay's discussions of optimization to understand what a rich area this is. It is not intrinsic to XSLT that templates in different modes (the Schematron skeleton implementation converts Schematron patterns to XSLT modes and Schematron rules to XSLT templates) will be traversed in separate passes of the data: that is merely one implementation strategy.

I don't know that XSLT 1.0 makes any statements that one mode will be evaluated before another. That is an implementation decision. The results of transforming a document in one mode may certainly appear in the output result tree after the results of transforming the document in a different mode, but the point of having a side-effect-free functional language like XSLT is that the evaluation of different branches can be done in any order: it is inherently parallelizable.

[0025]It is difficult to achieve fail-fast validation with Schematron. Fail-fast refers to a lightweight form of fault tolerance, where an application or system service terminates itself immediately upon encountering an error. Schematron validation based on XSLT transformation makes it difficult to achieve fail-fast implementation due to the nature of XSLT.

And yet Schematron does indeed have fail-fast capabilities: for example the schematron-terminator stylesheet, which uses Ken Holman's system of just adding terminate="yes" to the output XSLT.
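For readers who have not seen the technique, here is a minimal XSLT 1.0 sketch of what a fail-fast validator could look like. It is not the actual output of the schematron-terminator stylesheet, only an illustration of the terminate="yes" idea. The book and title names are invented.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="book">
    <xsl:if test="not(title)">
      <!-- terminate="yes" makes the processor stop the transformation
           as soon as this message is issued. -->
      <xsl:message terminate="yes">Assertion failed: a book should have a title.</xsl:message>
    </xsl:if>
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Suppress the built-in copying of text nodes to the output. -->
  <xsl:template match="text()"/>

</xsl:stylesheet>
```

The first book without a title ends the run; nothing after it in the document is validated.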

That Schematron could be used with a streaming implementation such as STX that would allow even faster fails should be obvious: indeed, the ISO standard even reserved stx as a keyword to allow this. Michael Kay has worked on various implementation strategies to figure out how to get streamable XSLT (indeed, there may be some explicit guidance constructs in XSLT 2.1). And there have been several papers and even open source projects on techniques for XPath rewriting to allow streaming.

I even put out a discussion paper in 2002 Optimising time Performance of Streaming Schematron discussing an implementation strategy that would allow fast failing on, e.g. a SAX stream. (This technique could be used, for example, before tree-based Schematron processing: indeed, assertions that were tested using this heuristic could be removed from the tree-validating Schematron schema.)

[0026]Such an XSLT based method has matching problems generated by XSLT. Such problems usually exist in XSLT based implementations. For example, when in the same pattern, some rule context scopes overlap with each other, it will cause more than one rule to be satisfied and get triggered. Using an XSLT based implementation, each rule is represented as a template; but for XSLT 1.0, if multiple templates are matched at the same time, only one with the highest priority will be called, with the others being ignored. XSLT 2.0 has the feature to do "match-next", but still cannot completely solve the problem. This defect makes a gap between the Schematron specification and XSLT capability.

This is either completely false, or completely correct, or completely confusing, depending on whatever "overlap" is supposed to mean. Giving them the benefit of the doubt, they may be pointing out that to go from Schematron's lexical evaluation order for rule contexts to XSLT's order-independence the skeleton implementation uses the XSLT priority attribute. It is a simple, effective and complete solution. So why is it somehow considered a problem? Do they misunderstand the semantics of Schematron? If not, why the statement? Life is full of mysteries.

The only overlap of the kind spoken of occurs between Schematron patterns, which can be evaluated in parallel and are entirely independent. Any given node in an instance could cause multiple rules to be fired, though only one rule per pattern.
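Here is a sketch of the priority technique, assuming a single pattern with two rules whose contexts are x/y and then y. The mode name M1, the priority numbers and the assertion texts are illustrative only and are not copied from the skeleton's real output.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Entry point: run the pattern (mode M1) over the whole document. -->
  <xsl:template match="/">
    <xsl:apply-templates select="/" mode="M1"/>
  </xsl:template>

  <!-- Lexically first rule (context x/y): given the higher priority,
       so it is the one chosen for any y that is a child of x. -->
  <xsl:template match="x/y" mode="M1" priority="1001">
    <xsl:if test="not(@id)">
      <xsl:text>y inside x should have an id&#10;</xsl:text>
    </xsl:if>
    <xsl:apply-templates mode="M1"/>
  </xsl:template>

  <!-- Second rule (context y): lower priority, so it only fires for
       y elements that the first rule did not claim. -->
  <xsl:template match="y" mode="M1" priority="1000">
    <xsl:if test="not(normalize-space())">
      <xsl:text>y should not be empty&#10;</xsl:text>
    </xsl:if>
    <xsl:apply-templates mode="M1"/>
  </xsl:template>

  <!-- Suppress the built-in copying of text nodes in this mode. -->
  <xsl:template match="text()" mode="M1"/>

</xsl:stylesheet>
```

A second pattern would get its own mode (say M2) and its own set of templates, so a node can fire one rule in M1 and another in M2 without either interfering with the other.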

[0027]Such an XSLT based method makes it difficult to support partial validation with fine grained assertions in a Schematron document. Using an XSLT approach, the smallest unit of the rule container to be selected in a Schematron document is a "phase" element, where users or applications could not select a finer grained unit, such as a rule or an assertion, for validation. It may cause a problem when there is a requirement to validate XML documents with only a subset of a phase, for example where a user selected rules or assertions, and rules or assertions for a specific version or a section of a standards specification such as WS-I BP, etc., let alone the other advanced features for more flexible validation are used.

And again this is completely false.

An XSLT-based method could support what the ISO Standard calls "elaborated rule context expressions" (s3.9): a single rule context expression that explicitly excludes items selected by lexically previous rule contexts in the same pattern.

So, for example, if the Schematron schema had one rule with a context x/y followed by a rule context y (i.e. any y that was not a child of x) the XSLT could be done in a single template: for example with XSLT 1 <xsl:template match="y[not(parent::x)]"> or XSLT2 <xsl:template match="y except x/y"> (err is that right?)

Or, to get rule-level granularity (or assertion-level even), you could just customize the XSLT so that assertions on rules prior to the ones of interest in a pattern were turned off. Indeed, I know of one large publishing company that uses Schematron and does have rule or assertion-level checking. (I think it is more straightforward to filter out unintended results from the output SVRL reports, in general and for dynamic viewing of a report, but turning on and off assertions is fine.)
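Filtering the report can be sketched as a small XSLT 1.0 stylesheet run over the SVRL output. This assumes that the assertions in the schema carry id attributes and that the implementation copies them onto the SVRL elements. The parameter name wanted-id and its default value are made up.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:svrl="http://purl.oclc.org/dsdl/svrl">

  <!-- Hypothetical parameter: the id of the assertion of interest. -->
  <xsl:param name="wanted-id" select="'a-title'"/>

  <!-- Identity copy for everything that is not filtered below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Keep a failed assertion or successful report only when its id
       is the one asked for; drop all the others. -->
  <xsl:template match="svrl:failed-assert | svrl:successful-report">
    <xsl:if test="@id = $wanted-id">
      <xsl:copy-of select="."/>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

The same shape works for filtering on the role attribute instead of id.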

Now it is true that the ISO Standard does provide phases and does not specify mechanisms for turning on and off individual assertions: but that is to do with the formal notion of validity not because the idea is somehow novel.

When I edited the ISO Standard, I included the definition of elaborated rule context expression explicitly to show that it was possible to have atomic rules that gave order independence, and which could be evaluated independently of other rules. The idea that you could in fact turn a Schematron schema into many small XSLTs, at pattern, rule or even assert granularity is not novel or new, but a property of Schematron that I knew about right from when I first made it.

(Indeed, the role attribute was provided in order to allow manipulation of schemas, documents, implementations, targets and results at this kind of granularity, cross-cutting the hierarchy of rules.)

What does the Schematron Standard say?

Out of interest, here is what the International Standard for Schematron (ISO/IEC IS 19757-3) says about order and evaluation:
6.5 Order and side-effects

The order in which elements are validated is implementation-dependent, without altering the validity of the instance.

The order in which patterns are used is implementation-dependent, without altering the validity of the instance.

The order in which assertions are tested is implementation-dependent, without altering the validity of the instance.

The only elements for which order is significant are the rule and let elements.

A rule element acts as an if-then-else statement within each pattern. An implementation may make order non-significant by converting rules context expressions to elaborated rule context expressions.

NOTE 12:  The behaviour of the rule element allows constraints that would require a complex context expression to be factored into simpler expressions in different rules.

An let element may use lexically previous variables within the same rule or global variables.

NOTE 13:  A wide variety of implementation strategies are therefore possible.

All queries shall act as pure functions. Queries shall not alter the instance in any way visible to other queries. This part of ISO/IEC 19757 does not specify any outcome augmentation of the instance being validated.

