The indexed XML website as a commodity

Syndication gone mad?

By Rick Jelliffe
October 14, 2009

Reviewing a few long-term, continuing multi-publishing projects I have been involved with recently, I am struck that several are morphing in a particular direction. The projects might have started out publishing paper or webpages, and moved on to publishing high-level XML, but increasingly the commodity that needs to be packaged and distributed (for re-skinning and re-use by third parties) is the whole indexed dataset: in effect, the website (without the implication of HTML pages).

Books, in this view, become a particular kind of dumb browser with brilliant styling capabilities.

For the project that is most advanced in this area, we are considering moving over to distributing it primarily as Lucene indexes of the XML files, with sample XSLTs for XML-to-HTML conversion, rather than as XML or rendered formats. We already distribute several projects on CD-ROMs that run the Jetty webserver and Lucene, so the conceptual break is moving this distribution to the web as well: the client doesn't GET a webpage, they get a whole website (this is for B2B, not B2C). Lucene is mature, programmer-friendly (in a way that documents on file systems or URLs, and AJAX, may not be) and available on multiple platforms.
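
To give a flavour of what that looks like (a minimal sketch only, assuming a recent Lucene release; the field names and paths are mine for illustration, not the project's), each XML record can be indexed with a few searchable fields plus the raw XML stored alongside, so a consumer gets both query access and the original document back for their own transforms:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class IndexSketch {
        public static void main(String[] args) throws Exception {
            String rawXml = "<product id='p1234'><title>Widget</title></product>";
            try (FSDirectory dir = FSDirectory.open(Paths.get("dataset-index"));
                 IndexWriter writer = new IndexWriter(dir,
                         new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                // Exact-match key, not analyzed.
                doc.add(new StringField("id", "p1234", Field.Store.YES));
                // Analyzed field for full-text search.
                doc.add(new TextField("title", "Widget", Field.Store.YES));
                // The record itself, stored verbatim for re-skinning.
                doc.add(new StoredField("xml", rawXml));
                writer.addDocument(doc);
            }
        }
    }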

Why would we want to do this? Given the kind of information involved, the client wants to encourage dispersal of the information and value-adding. Live mash-ups and web APIs are nice, but are they reliable enough for a value-adding business without SLAs?

In this particular case, we saw that the client actually has three different customized versions of the website themselves, plus two different PDA versions, plus there are at least two other government agencies who rejig the website on their own sites, not to mention several dozen paper-style publications drawn from the data. And they have development and deployment versions of the website (on several servers), plus they want an in-house version for allowing some kinds of what-if pre-publication trials. So it starts to look like their business requirement is to be able to treat the skinnable website in total as a commodity, not just the raw data (the system inputs) or the rendered forms (the outputs).

When we look at what we can do to make it easy for programmers to customize and value-add the dataset, we see that the alternatives of providing raw XML or scrapable HTML (or RDF) merely present programmers with new hurdles. Hence the thought that by providing a whole sample website (i.e. the Lucene-indexed data plus the simple exemplar XSLTs) we can bootstrap the developers of new systems, as in the sketch below. And it comes at no extra cost, since the indexed data has to be made anyway as part of the multi-publishing.
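
And the bootstrap really is small. A third-party developer who receives the indexed data and the exemplar XSLTs needs only something like the following to pull a record out and re-skin it. Again a sketch only: the "id" and "xml" field names and the sample.xsl stylesheet are illustrative, and it assumes Lucene 9.5 or later for the storedFields() accessor.

    import java.io.StringReader;
    import java.io.StringWriter;
    import java.nio.file.Paths;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class ReskinSketch {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader =
                     DirectoryReader.open(FSDirectory.open(Paths.get("dataset-index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // Look up one record by its key.
                TopDocs hits = searcher.search(new TermQuery(new Term("id", "p1234")), 1);
                Document doc = searcher.storedFields().document(hits.scoreDocs[0].doc);
                // Re-skin the stored XML with an exemplar stylesheet.
                Transformer t = TransformerFactory.newInstance()
                        .newTransformer(new StreamSource("sample.xsl"));
                StringWriter html = new StringWriter();
                t.transform(new StreamSource(new StringReader(doc.get("xml"))),
                            new StreamResult(html));
                System.out.println(html);
            }
        }
    }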

So each month we would distribute a new version of the Lucene-indexed data. An XML-in-ZIP archive is, in effect, just a file system; by providing the information as XML-in-Lucene index files instead, we are not building in any application semantics, nor precluding alternative XML-in-ZIP archives, but we are saying that the need for more database-y access by multiple criteria is useful and sensible.
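
What does "database-y" buy you in practice? Combining criteria across fields in a single query, which a ZIP of files cannot do without the consumer building their own index first. A sketch, with field names invented for illustration:

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class MultiCriteriaSketch {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader =
                     DirectoryReader.open(FSDirectory.open(Paths.get("dataset-index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // Both criteria must hold: a conjunction across two fields.
                BooleanQuery q = new BooleanQuery.Builder()
                        .add(new TermQuery(new Term("category", "analgesic")),
                             BooleanClause.Occur.MUST)
                        .add(new TermQuery(new Term("status", "current")),
                             BooleanClause.Occur.MUST)
                        .build();
                TopDocs hits = searcher.search(q, 20);
                System.out.println(hits.scoreDocs.length + " matching records");
            }
        }
    }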

Another reason for this is that the information is timely and expires regularly: the client's feeling is that we need to reduce barriers to updates. One area I have to look at over the next few weeks is whether we need to integrate a dynamic channel for fast notifications outside the general monthly publishing cycle: for example, if a product mentioned is determined to be unsafe, we want to be able to push that information (perhaps not skinned) to online users of the re-skinned, value-added websites.
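
On the consumer side, such a push could be as simple as replacing the affected record in the local index, keyed by its identifier, rather than waiting for the next monthly drop. A sketch, using the same invented field names as above:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class PushUpdateSketch {
        public static void main(String[] args) throws Exception {
            String noticeXml = "<product id='p1234' status='unsafe'/>";
            try (FSDirectory dir = FSDirectory.open(Paths.get("dataset-index"));
                 IndexWriter writer = new IndexWriter(dir,
                         new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new StringField("id", "p1234", Field.Store.YES));
                doc.add(new StringField("status", "unsafe", Field.Store.YES));
                doc.add(new StoredField("xml", noticeXml));
                // Atomic delete-then-add of the record keyed by its id.
                writer.updateDocument(new Term("id", "p1234"), doc);
            }
        }
    }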

Of course, distributing datasets in the native formats of some DBMS is not new. And there have long been publishers of specialist databases and datasets. But the morphing from seeing the deliverable commodity as raw data or rendered data, to one where the indexed form is what is being distributed, is intriguing. For example, I am not sure why someone would use the Fast Infoset (Binary XML) kinds of standards where mature, open-source, programming-language-neutral Lucene index files are also applicable.



1 Comment

Rick,

I agree with your observations about the need to be able to redistribute documents with integrated indexes to support search.

We have been using the eXist-db.org release 1.4, which includes Lucene and allows highly customizable search ranking. The integration of XForms and flexible URL rewriting makes this the ideal tool for publishing large collections of documents.

We are moving away from XSLT since XQuery seems to have much more mature indexing integration.

Keep up the great blogs.
