ODF Plugfest

+ Alex Brown's probe + Parallelizing Schematron

By Rick Jelliffe
June 16, 2009 | Comments: 3

I am looking forward to seeing the report from the ODF Plugfest 2009. The Dutch government is doing everyone a great service in organizing this.

Actually, I am more looking forward to seeing the results of next year's plugfest, when we should actually see whether competition is increasing implementation quality. We should get some good bug reports this year: good fodder for marketing insinuendo, I know, but necessary for the market/bazaar to operate.

Looking over the site, two things stood out. The first was their working definition of interoperability for the purposes of the workshop:

  1. Content equivalence - no text or data is lost
  2. Structural equivalence - headers, footers, tables, are preserved as headers, footers, etc.
  3. Dynamic equivalence - style names are preserved, live field names remain live
  4. Presentation equivalence - page size, margins, font sizes and styles, etc., preserved

This is quite similar to my Classes of fidelity for documents (raw, exchange, industrial, facsimile) except for the final class: my "facsimile" class would have word, line and page breaks preserved, while engine-unspecific and media-dependent aspects such as paper size belong to industrial. But in a sense, I only suggest the "facsimile" class in order to dismiss it as not being relevant to ODF/OOXML documents. So the ODF Plugfest's functional groupings make sense.

The other thing that stood out was that Alex Brown has released an open source Schematron validation library, which I was not expecting. The library is called Probatron and, looking at the website, Alex is planning a .NET version and a high-performance version to go along with the initial Java version. (Michael Kay of course has led the way with his dual-platform availability of SAXON.)

He has a blog entry, ODF Forensics, on his Office-o-tron ODF validator, which uses Norm Walsh's open source XProc processor. I see he writes that it uses Jing (RELAX NG) rather than Schematron for validation.

The latest release of the Schematron skeleton for XSLT2 code (see Schematron 2009) has an experimental feature, multi-document patterns, which I added to support XML-in-ZIP formats better.
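
To see why cross-part constraints need something like this: in an XML-in-ZIP package, one part can reference names declared in another. Here is a minimal sketch in Python (standard library only; the element and attribute names are simplified stand-ins for the real, namespaced ODF vocabulary) that reports style names used in content.xml but never declared in styles.xml:

```python
import zipfile
import xml.etree.ElementTree as ET

def undeclared_styles(package):
    """Report style names used in content.xml but absent from styles.xml.

    The tag and attribute names here are illustrative only; real ODF
    uses namespaced names such as style:name and text:style-name.
    """
    with zipfile.ZipFile(package) as z:
        content = ET.fromstring(z.read("content.xml"))
        styles = ET.fromstring(z.read("styles.xml"))
    # Names declared in the styles part ...
    declared = {e.get("name") for e in styles.iter("style")}
    # ... versus names referenced anywhere in the content part.
    used = {e.get("style-name") for e in content.iter() if e.get("style-name")}
    return used - declared
```

A pattern scoped to a single document cannot express that check, which is exactly the gap multi-document patterns are meant to fill.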

While online validation using Schematron certainly works with small documents and for high-value documents (indeed, my company Topologi has a servlet product for this), there seem to have been scaling issues: large documents take too much memory to move and load at the server, or the wait while the document loads into a DOM (or similar) means that validation messages come late. I have written on a few possibilities for optimizing Schematron for faster response over the last few years, so it will be interesting to see what Alex has in mind for his Probatron-HP.
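
One family of optimizations is to stream: fire each assertion as soon as its element finishes parsing, rather than waiting for the whole document to land in a DOM. A standard-library sketch, where the rule checked (every item element must carry an id) is an invented stand-in for a Schematron assertion:

```python
import xml.etree.ElementTree as ET

def stream_validate(source):
    """Yield failure messages as soon as each element is fully parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item" and elem.get("id") is None:
            yield "item without id"
        elem.clear()  # drop already-checked children to keep memory flat
```

Of course, assertions whose context looks forward (following siblings, or forward-looking XPath axes generally) cannot be fired this eagerly; that restriction is part of what makes streaming Schematron an interesting design problem.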

(Actually, recently I have been working through various issues on the question: to what extent can Schematron be parallelized? Of course, there is a very simple high-level parallelization available: because patterns don't interact, they can be performed on separate threads/machines or interleaved. Indeed, if you had a map-reduce multi-node system and were interested in that kind of eager parallelism, each rule (and indeed, each different assertion) could be farmed out to a separate processor, with Schematron's lexical priority for rules within a pattern applied after validation (or after context matching) to exclude spurious rules. The ISO Schematron standard explicitly avoids dependencies on the order in which nodes are visited or the order in which the document is validated, precisely to make parallelization easy.)
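
The pattern-level parallelism described in that paragraph can be modelled in a few lines. This is a toy (standard library only; the "patterns" are invented lists of context/test pairs rather than real XPath-based Schematron rules), but it shows the key property: patterns share nothing, so each can run on its own thread and the results can be merged in any order:

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET

# Invented mini-patterns standing in for sch:pattern elements:
# each is a list of (context path, test predicate, failure message).
PATTERNS = {
    "structure": [(".//heading", lambda e: e.text is not None,
                   "heading must have text")],
    "links": [(".//link", lambda e: "href" in e.attrib,
               "link must carry an href")],
}

def check_pattern(root, checks):
    """Run one pattern's assertions; return its failure messages."""
    return [message
            for path, test, message in checks
            for node in root.findall(path)
            if not test(node)]

def validate_parallel(xml_text, patterns=PATTERNS, workers=4):
    """Run each pattern on its own thread; patterns never interact."""
    root = ET.fromstring(xml_text)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(check_pattern, root, checks)
                   for name, checks in patterns.items()}
    return {name: f.result() for name, f in futures.items()}
```

In a real engine each worker would compile and run one sch:pattern (or, more eagerly, one rule or even one assertion), and the merge step would apply the lexical-priority filtering mentioned above.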

Rick hi

You shouldn't have been *too* surprised by this - this is the FOSS release I mentioned in Okinawa :-)

The "high performance" version has a completely custom in-memory model. It will offer some modest (but maybe useful) flexibility of trading-off memory/performance compared to the XSLT implementations.

As to scaling: yes, this is a problem. I've had one client needing to do parallel Schematron processing of incoming data (large data sets), and we used a big Sun box with 12 CPUs and 24GB of RAM as our "solution".

I'd like to use Schematron for the Office-o-tron validator too, but am wondering about the heavyweight server needed. Hmm ...

Alex: Ah, this is that! Great.

Can I ask, what kinds of constraints does this large dataset-client need to test?

Rick hi

I seem to remember a lot of the tests were of the "compare everything to everything" type.

One in particular sticks in my mind: the internal links in this data set were held in external link documents for each "product", so in order to check the integrity of a product one needed to load the link document and every referenced document into memory before letting rip with XPath ...

- Alex.
