Entities and streaming processing

By Rick Jelliffe
December 19, 2008 | Comments: 2

It occurs to me that I have not seen any description on the web of how XML entities/inclusions can increase the power of streaming processing. It was common knowledge in the old SGML days, when documents were often (indeed typically) greater in size than physical or virtual RAM, but I think it may be underappreciated now. (I say entities, but it could equally be XML fragments referenced with XInclude elements and an XInclude-enabled XML processor.)

Streaming processing uses technology like SAX, where the document to be processed is presented as a stream of events of some kind, and at any one stage only the minimum of data is kept in memory. There are also several streaming transformation technologies available, such as STX (a streaming version of XSLT with a non-random-access dialect of XPath, but which lets users retain arbitrary data) and XStream (a term-rewriting language that automatically keeps only the minimal amount of data needed to complete the calculation).

Honourable mention should go to the OmniMark language, which was streaming but had a two-pass technology (I expect it is out of patent now) called referents: you could divert output to a referent, and add references to that referent anywhere, earlier or later than the point where the referent value was created. The processor had a built-in second pass over the intermediate data which would insert the values of the referents. This was implemented with co-routines (which have made a comeback in Lua) to minimize memory requirements.
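
To make the basic streaming model concrete, here is a minimal SAX sketch in Java (my own illustration; the class name and the input path are arbitrary): the parser pushes events at the handler, and however large the document, the only state retained is a single counter.

    import java.io.File;

    import javax.xml.parsers.SAXParserFactory;

    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // A minimal streaming pass: the parser pushes events at the handler,
    // and the only state retained is one counter, however big the input is.
    public class ElementCounter extends DefaultHandler {
        private long count = 0;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            count++;                               // the only data we keep
        }

        @Override
        public void endDocument() {
            System.out.println("Elements seen: " + count);
        }

        public static void main(String[] args) throws Exception {
            // args[0] is the path to the (possibly huge) document
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new File(args[0]), new ElementCounter());
        }
    }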

Back to entities...

With streaming processing, it is possible to write two or more output entities in parallel while processing a document, with one including an entity reference to the other.

Doing this shifts the work of combining the fragments onto the XML processor that eventually reads the result.

This technique is useful when you know, by the time some data first occurs in the input, that you will need it in a certain place in the output, so that it can be harvested as it streams past. In other words, it is not useful when you have to make decisions in one place based on data in another place.

The classic use case would be to extract a table of contents and an index at the same time as processing a (literature) book. No decision is required about whether you will need a table of contents or an index: they are there because of the book design, not the data. Rather than taking three passes or requiring random access, three different documents are written out in parallel.
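
By way of illustration, here is a rough sketch in Java of the table-of-contents half of that, with made-up names (book.xml, main.xml, toc.xml, chapter and title elements): chapter titles are harvested into toc.xml in the same pass that writes main.xml, and main.xml declares and references toc.xml so that the parser which eventually reads it (configured to resolve external general entities) does the combining.

    import java.io.File;
    import java.io.PrintWriter;

    import javax.xml.parsers.SAXParserFactory;

    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // One streaming pass writes two entities in parallel: main.xml (the hub)
    // and toc.xml (the harvested table of contents). main.xml declares toc.xml
    // as an external entity and references it, so the parser that later reads
    // main.xml does the combining.
    public class TocSplitter extends DefaultHandler {
        private final PrintWriter mainOut;
        private final PrintWriter tocOut;
        private final StringBuilder title = new StringBuilder();
        private boolean inTitle = false;

        TocSplitter(PrintWriter mainOut, PrintWriter tocOut) {
            this.mainOut = mainOut;
            this.tocOut = tocOut;
        }

        @Override
        public void startDocument() {
            mainOut.println("<!DOCTYPE book [");
            mainOut.println("  <!ENTITY toc SYSTEM \"toc.xml\">");
            mainOut.println("]>");
            mainOut.println("<book>");
            mainOut.println("&toc;");      // resolved when main.xml is parsed
            tocOut.println("<toc>");
        }

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            if ("chapter".equals(qName)) mainOut.println("  <chapter>");
            if ("title".equals(qName)) { inTitle = true; title.setLength(0); }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inTitle) title.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if ("title".equals(qName)) {
                inTitle = false;
                tocOut.println("  <item>" + title + "</item>");     // harvested
                mainOut.println("    <title>" + title + "</title>");
            }
            if ("chapter".equals(qName)) mainOut.println("  </chapter>");
            // A real version would copy the rest of the chapter content through,
            // and escape text properly; omitted to keep the sketch short.
        }

        @Override
        public void endDocument() {
            tocOut.println("</toc>");
            mainOut.println("</book>");
        }

        public static void main(String[] args) throws Exception {
            try (PrintWriter mainOut = new PrintWriter("main.xml");
                 PrintWriter tocOut = new PrintWriter("toc.xml")) {
                SAXParserFactory.newInstance().newSAXParser()
                        .parse(new File("book.xml"), new TocSplitter(mainOut, tocOut));
            }
        }
    }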

Now XSLT 2.0 has an interesting mechanism (xsl:result-document) for writing out multiple documents. However, IIRC XSLT implementations tend to work on a random-access model, even though I presume various optimizations are possible.
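
For instance, here is a minimal sketch of xsl:result-document in use, run through Saxon from Java (the embedded stylesheet and the book.xml, body.xml and toc.xml names are just for illustration).

    import java.io.File;
    import java.io.StringReader;

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    // XSLT 2.0's xsl:result-document writes a second result document (the TOC)
    // alongside the principal result, here run through Saxon.
    public class MultiOutput {

        private static final String STYLESHEET =
              "<xsl:stylesheet version='2.0'"
            + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "  <xsl:template match='/book'>"
            + "    <xsl:result-document href='toc.xml'>"
            + "      <toc>"
            + "        <xsl:for-each select='chapter'>"
            + "          <item><xsl:value-of select='title'/></item>"
            + "        </xsl:for-each>"
            + "      </toc>"
            + "    </xsl:result-document>"
            + "    <body><xsl:copy-of select='chapter'/></body>"
            + "  </xsl:template>"
            + "</xsl:stylesheet>";

        public static void main(String[] args) throws Exception {
            // Saxon's factory is needed explicitly: the default JAXP factory
            // is usually an XSLT 1.0 processor.
            TransformerFactory factory = new net.sf.saxon.TransformerFactoryImpl();
            Transformer t = factory.newTransformer(
                    new StreamSource(new StringReader(STYLESHEET)));
            // toc.xml is written relative to the principal output, body.xml.
            t.transform(new StreamSource(new File("book.xml")),
                        new StreamResult(new File("body.xml")));
        }
    }

Saxon writes toc.xml alongside the principal output, body.xml; but the source document is still typically built as an in-memory tree first, which is the random-access model I mean.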

Using entities in this manner can also be useful for reducing memory requirements, by letting you write out fragments as you go. For example, suppose you have a 1 Gig XML document you want to process, say an encyclopedia (I hear Wikipedia is at least 3.5 Gig). Your processor can write out each entry to a separate file and generate the corresponding entity reference. This would be useful in languages that buffer up the output and don't provide any mechanism to flush the contents before the buffer is closed: rather than occupying Gigs of buffer space, only a small amount is needed.
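
Here is a sketch of that idea, again with made-up names (encyclopedia.xml, entry elements, frag0.xml and so on); the per-entry "processing" is just a copy-through, to keep it short. Each entry is flushed to its own file as soon as its end tag is seen, and the hub, written at the end, holds only the entity declarations and references.

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.PrintWriter;
    import java.util.ArrayList;
    import java.util.List;

    import javax.xml.parsers.SAXParserFactory;

    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    // Streams a huge document, copying each top-level <entry> out to its own
    // file as soon as it is seen, then writes a small hub document containing
    // only the entity declarations and references. Only the list of fragment
    // names is ever held in memory.
    public class Chunker extends DefaultHandler {
        private final List<String> fragments = new ArrayList<String>();
        private PrintWriter frag;   // the fragment currently being written, if any
        private int depth = 0;

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) throws SAXException {
            depth++;
            if (depth == 2 && "entry".equals(qName)) {
                String name = "frag" + fragments.size();
                fragments.add(name);
                try {
                    frag = new PrintWriter(name + ".xml");
                } catch (FileNotFoundException e) {
                    throw new SAXException(e);
                }
            }
            if (frag != null) frag.print("<" + qName + ">");  // attributes and escaping omitted
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (frag != null) frag.print(new String(ch, start, length));
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (frag != null) frag.print("</" + qName + ">");
            if (depth == 2 && "entry".equals(qName)) {
                frag.close();       // flushed to disk now, not buffered until the end
                frag = null;
            }
            depth--;
        }

        @Override
        public void endDocument() throws SAXException {
            try (PrintWriter hub = new PrintWriter("hub.xml")) {
                hub.println("<!DOCTYPE encyclopedia [");
                for (String f : fragments)
                    hub.println("  <!ENTITY " + f + " SYSTEM \"" + f + ".xml\">");
                hub.println("]>");
                hub.println("<encyclopedia>");
                for (String f : fragments)
                    hub.println("  &" + f + ";");
                hub.println("</encyclopedia>");
            } catch (FileNotFoundException e) {
                throw new SAXException(e);
            }
        }

        public static void main(String[] args) throws Exception {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new File("encyclopedia.xml"), new Chunker());
        }
    }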

Similarly, entities can be used when a document is very large and random-access processing is required. Chunk the input document into fragments referenced from a hub document, then process the document by loading, processing and closing each fragment in turn. For some kinds of transformations, every input fragment may even correspond directly to an output fragment, which may further reduce memory requirements.
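
Assuming the encyclopedia has been chunked as above into frag0.xml, frag1.xml and so on, the random-access work can then be done one small DOM at a time (process() here is just a hypothetical stand-in for whatever per-fragment transformation is needed).

    import java.io.File;

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;

    import org.w3c.dom.Document;

    // Random-access processing of a chunked document, one small DOM at a time.
    public class FragmentByFragment {

        public static void main(String[] args) throws Exception {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            File[] fragments = new File(".").listFiles(
                    (dir, name) -> name.startsWith("frag") && name.endsWith(".xml"));
            if (fragments == null) return;
            for (File fragment : fragments) {
                Document doc = builder.parse(fragment); // random access, but only within this fragment
                process(doc);                           // hypothetical per-fragment transformation
                // doc goes out of scope here, so its memory can be reclaimed
                // before the next fragment is loaded
            }
        }

        // Stand-in for whatever per-fragment work is actually needed.
        static void process(Document doc) {
            System.out.println("Processed entry rooted at <"
                    + doc.getDocumentElement().getTagName() + ">");
        }
    }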

I decided to write this after talking to an XSLT programmer this week who reported that they had a large (1 Gig) document and, in order to process it, had to switch from Xerces to Saxon. Being smarter with chunking (whether using entities, XInclude or some other similar mechanism) is an approach that may be worth considering in this kind of case.

(When I expressed surprise that such large documents were being processed, the programmer reported that they were using Solaris, which I think has a larger process space than 32-bit Windows, so it was less surprising after all. Speaking of Solaris, I recently tried a more recent version of it: I tried one last year, but it didn't support my 1440x900 monitor. The more recent version supports it well: a bit too bright, but it is a very nice and smooth OS.)



2 Comments

There is no way of testing my hypothesis, but I suspect that if Omnimark had figured out a business model that worked, XSLT would never have been invented.

It would be like, when you needed to eat a bowl of soup, inventing a spoon rather than going to the silverware drawer to get one.

Omnimark is now owned by Stilo and is priced so that only the most successful publishing firms can afford it.

However, Sam Wilcott, the inventor of Omnimark, is now employed by Mulberry Technologies, and I've heard he's working on another streaming XML processing language.

We wait with bated breath, Sam.

A few years ago, I ran into trouble transforming very large documents using XSLT. Since I used the Cocoon framework, I could write a transformer that processes the XML in chunks. The combination of this 'multifragment' transformer and STX behaves a bit like the map-reduce pattern used in functional languages. The multifragment transformer has been released as open source, and my presentation about it can be found at http://cocoongt.hippo12.castaserver.com/cocoongt/nico-verwer-performance.pdf .
