XML databases have long been something of a niche category in the database world, trying with varying degrees of success to provide the ease and accessibility for semi-structured content that is a hallmark of SQL databases, while at the same time offering much of the sophisticated processing that XPath enables for stand-alone documents. The need is certainly there – a significant amount of the world's "data" does not fall neatly into Ted Codd's relational table structures without significant shredding – yet XML databases have had a hard road to acceptance, in great part because each one offered its own (typically very distinct) mechanism for getting at that data.
It is not surprising, then, that as XQuery – the W3C XML query language standard released in February 2007 – has gained acceptance, so too has interest in XML databases that support the standard. On the commercial side, one of the best known (and most solidly entrenched) is the MarkLogic XML Server.
MarkLogic was founded in 2001 by Chris Lindblad and Paul Pedersen in order to explore building a database that could in fact combine the speed of working with relational databases with the ease and power of working with XML. They recognized that document content was poorly served by a conventional database, in great part because the structure of a document was too complex for most databases to decompose readily – meaning that while such documents could be indexed and their text searched, there was almost no way of getting at the inherent structure of the documents themselves. Lindblad, former chief architect for Ultraseek, managed to put together a fast-seek XML search architecture, and MarkLogic was born.
Dave Kellogg, the CEO of MarkLogic, came aboard in 2003 after serving as Senior Vice President at Business Objects, driven similarly by the realization that the potential market for content databases was huge and mostly untapped. He was attracted to MarkLogic by the sophistication that he saw in the MarkLogic XML Server, and has been a major driving force in the evolution of both MarkLogic and XML databases in general since then. Kellogg also set up one of the first CEO blogs at http://marklogic.blogspot.com/, and to this day blogs prolifically about the state of the industry and of programming in general.
MarkLogic also made significant news earlier this year after hiring DocBook and XML guru Norman Walsh away from Sun and XForms evangelist Micah Dubinko from Yahoo, along with the release of the <a href="http://www.marklogic.com/product/introducing-marklogic-server-4-0.html">4.0 version of the MarkLogic server.</a> The company has, over the years, built up an impressive roster of clients, from Congressional Quarterly and book cataloger RR Bowker to Harvard Business School Press, McGraw Hill, Oxford University Press and O'Reilly Media, among many others, and the MarkLogic Server has gained both adherents and kudos for its speed and scalability.
Putting in the Key
I had a chance recently to "get under the hood" and spend some time evaluating the MarkLogic Server 4.0 release. I was impressed by the product; it was both feature rich and satisfyingly fast, though there were a few facets of the server that I felt could have used some improvement. Overall, however, it is easy to see why MarkLogic holds the place in the XML database space that it does.
One of the key measures that I use when evaluating XML databases is the ease of installation. XML databases are complex pieces of work, and the ease with which you can set the database up is a good indicator of the degree to which the database itself will be easy to work with. The initial installation went well – MarkLogic provides binaries for Microsoft Windows Server (though it appears to run fine under Vista, which I've long since ceased to take for granted), Red Hat Linux 3 & 4, and Solaris 8 through 10. Be prepared to allocate some space on your hard drive, however – the initial binaries took up a rather stunning 400+ MB.
Once installed, MarkLogic provides system links to its admin screen as well as to an XQuery user testing area. Similarly, setting up an admin account was easy enough. However, past that stage things became ... complicated. The administration section provides top-level access to Groups, Databases, Hosts, Forests, Mimetypes, and Security. It took a while (and some digging in the documentation) to work out the relationships: Groups are containers for collections of application servers (such as HTTP and WebDAV servers); Databases hold Forests and manage indexing, fragmentation control and the like on those Forests; Forests are in turn collections of Stands, each of which is a set of documents and fragments; and Hosts handle the management of Forests. Got that? No? Neither did I.
There are Help panels throughout the administration section, but frankly, without an understanding of the core concepts, the help can be confusing at best and downright cryptic at worst. Additional documentation can be downloaded (as PDFs, oddly enough) from MarkLogic's documentation site, though it shares the same failing as the online help.
Going to the "Use MarkLogic Server" link from the initial installation didn't help much either. This section contains a three-frame window that lets you load in XQuery scripts (or write your own) and evaluate them in a display window. Other than some resize issues, this worked reasonably well, though I found that some of the scripts didn't evaluate properly (this may have been an error on my part, however). Still, all of this pointed to a key point that I feel the MarkLogic usability people should pay serious attention to:
Every piece of documentation for MarkLogic Server should be in the database itself, in XHTML format, and there absolutely should be a portal home page on the local server – one that lets you access the XQuery function sets and the What's New notes, contains a live walk-through of the server complete with examples, and pitches those examples at the level of someone who has never worked with an XML database before. You can set up triggers or cron jobs to periodically update this content, but there has to be something more than a bare-bones "test bed".
I had one additional complaint about the testbed, one that I feel is reflected throughout the application: the ability to preview XML content should not be limited to one browser (i.e., Internet Explorer). Previewing output content is a common enough requirement, especially for people learning the technology; actually formatting the XML for web-page output and displaying it within an AJAX-based page would have gone a long way toward helping people learn how to work with the server, not just in the testbed application but throughout the product.
After spending several hours reading through documentation, I finally found the Getting Started section, and managed to get enough out of it to start playing with XQuery through a web interface. To do so, I had to create both an HTTP AppServer and a WebDAV AppServer, the first to "web-enable" my queries, the second to provide a means to pull XML content into the server. This was actually one of the more useful features I found, because it meant that I could set up multiple applications, each running on a different network port.
Again, I have a couple of suggestions for the MarkLogic development team. In setting up the HTTP App Server in particular, it should be possible to designate a default XQuery script from within the AppServer administration screen – say, for instance, if index.xq were your designated default, then http://localhost:8002/index.xq?a=b would become http://localhost:8002?a=b. This not only provides a modicum of security, it also makes the app's URLs look more RESTlike.
Additionally, a nice touch would be the option of establishing "RESTful" URLs that could be processed through that same default. This way, you could use a URL such as http://localhost:8002/books/thefirstbook/edit, with the string /books/thefirstbook/edit being passed in as a URL "property" to the aforementioned index.xq script. Such clean URIs are much more RESTlike than normal query-string usage, something that is especially important when dealing with XML data. It also makes it easier to deploy paged content – e.g., http://localhost:8002/books?page=3 would return a listing of all "book" items on the third page, by whatever paging mechanism the index.xq function offers.
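Absent such a built-in facility, a hand-rolled dispatcher is possible today. The sketch below is purely illustrative – the request accessors (xdmp:get-request-path, xdmp:get-request-field) are MarkLogic built-ins, while the routing scheme and the local:list-books function are my own invention:

```xquery
xquery version "1.0-ml";
(: index.xq – a minimal hand-rolled RESTful dispatcher.
   The routing logic and local function are hypothetical. :)
declare function local:list-books($page as xs:integer) {
  <books page="{$page}"/>   (: stand-in for a real paged listing :)
};

let $path  := xdmp:get-request-path()   (: e.g. /books?page=3 gives /books :)
let $steps := fn:tokenize($path, "/")[. ne ""]
return
  if ($steps[1] eq "books" and fn:count($steps) eq 1)
  then local:list-books(xs:integer((xdmp:get-request-field("page"), "1")[1]))
  else <not-handled path="{$path}"/>
```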
Checking Out the Engine
A year and a half after the XQuery specification was finalized, XQuery support is de rigueur in an XML database, and MarkLogic definitely exceeds expectations here. The core XQuery function set is fast and, in my initial tests anyway, seemed to work flawlessly.
However, one of the real advantages of the XQuery specification lies in its extension mechanism – you can create additional libraries, either out of XQuery modules or via external libraries – and it is here that MarkLogic 4.0 stands out. It not only supports the full XQuery 1.0 specification, but also provides a 1.0-ml extension set that is rather stunning in its breadth, along with a 0.9-ml set that provides backwards compatibility for existing MarkLogic applications built on the older 3.x series.
The first piece of this augmented XQuery set is the introduction of transactions (note that this capability is also being discussed as part of the XQuery 1.1 specification). A transaction is a block of operations that are first applied in a non-destructive manner; if the transaction completes successfully, the changes are committed to the database, otherwise the transaction is rolled back. Transactional support is essential for any number of operations – e-commerce being an obvious one, but almost anything with potential state-changing properties should be considered for transactional handling.
This of course necessitates having both a tracking mechanism (a timestamp) and a way of updating content. MLServer has both – an ability to update data content via namespaced methods and a timestamp that's applied to each transaction to provide for versioning. Additionally, MLServer includes locking capability, making it possible to lock content when multiple users are requesting the same portion of the database at the same time. This in turn is tied into try/catch capability that can be invoked from within XQuery in order to handle exceptions.
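A minimal sketch of this in practice, assuming MarkLogic's try/catch extension and the xdmp:document-insert built-in; the document URI, its contents and the error handling are illustrative only:

```xquery
xquery version "1.0-ml";
(: The insert runs inside the statement's transaction; if it raises
   an error, the whole statement is rolled back and we land in catch. :)
try {
  xdmp:document-insert("/orders/order-1001.xml",
    <order id="1001"><item sku="B-42" qty="2"/></order>),
  "committed"
} catch ($e) {
  fn:concat("rolled back: ", $e//error:message)
}
```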
MLServer also includes a sophisticated evaluation method, as well as "requests" – function invocations that let you run XQuery scripts synchronously as data requests, with or without updates. XQuery is a functional language, and as such the ability to pass functions (or queries) as operands to other functions is a fairly critical part of higher-order programming. The eval() capability makes this possible.
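As a taste of what that looks like – xdmp:eval is a real MarkLogic built-in that evaluates a string of XQuery as a separate request; the query string here is trivially illustrative:

```xquery
xquery version "1.0-ml";
(: Evaluate a dynamically constructed query string :)
let $query := 'for $i in (1 to 3) return $i * $i'
return xdmp:eval($query)   (: yields the sequence 1, 4, 9 :)
```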
The update facility, as it stands, is implemented as a set of extension functions, with support both for updating whole documents in situ and for updating only a specific element, attribute or other XML node. In an interview earlier this year, Jason Hunter confirmed that they were also watching the XQuery Update Facility (XQUF), and would roll out a compliant update module once the specification was finalized.
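A node-level update looks something like the following sketch; xdmp:node-replace is one of MarkLogic's own (pre-XQUF) update extensions, while the document URI and structure are assumptions made for illustration:

```xquery
xquery version "1.0-ml";
(: Replace a single element in place rather than
   rewriting the whole document :)
xdmp:node-replace(
  fn:doc("/books/thefirstbook.xml")/book/title,
  <title>The First Book, Revised</title>)
```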
A second aspect of MLServer that is both powerful and (fairly) easy to use is the combination of alerts and triggers. An alert is a notification that the system raises whenever the state of the database changes in accordance with a given XQuery filter. In essence, whenever an update is performed on a database with alerts present, a "reverse query" is run against the dataset, and if that query returns true, some predefined action is launched.
Sometimes these alerts are relatively simple – for instance, whenever a new entry is created in a particular collection, a message might be added to an administrator's user queue indicating this event. On the other hand, alerts can also be used to perform fairly complex analyses. Suppose that in a nuclear power reactor, the temperature and coolant level are logged every five minutes. If the temperature reaches a certain level and the coolant level is dropping dramatically, an alert might log an emergency message, send an SMS message to the administrator's phone, and invoke the program to deploy the damping rods to cool down the core (okay, this is an extreme example, but it illustrates the principle).
Triggers are an indispensable part of most relational databases; alerts and actions will likely play the same role for XML databases.
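The primitive underneath alerting can be exercised directly: cts:reverse-query (a real 4.0 function) matches an incoming document against stored rule documents that each contain a serialized query. The sample document, the rule collection name and the reactor scenario are all assumptions for illustration:

```xquery
xquery version "1.0-ml";
(: Which stored alert rules would fire for this new reading?
   Each document in the "alert-rules" collection is assumed to
   contain a serialized cts:query. :)
let $incoming :=
  <reading><temp>412</temp><coolant>low</coolant></reading>
return
  cts:search(fn:collection("alert-rules"),
             cts:reverse-query($incoming))
```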
Testing Out the Suspension and Chassis
A major point of Dave Kellogg's keynote address at the MarkLogic 2008 conference in San Francisco was the importance of content in databases. Relational databases do not hold content. They hold linear collections of properties. As a consequence, complex structure can only be built via inference within a relational database – you have to explicitly build that structure via SQL, and even there what you can build is highly limited by the degree to which you have extended SQL to cover such structure.
An XML document, on the other hand, has a hierarchical structure that's implicit to the document, something that is preserved (albeit through a lot of sleight of hand in the background) within an XML database. This flexible structure is what differentiates XML databases from their relational counterparts, and the complexities (and speed) inherent in making this happen have only recently reached a level where XML databases have become competitive.
One consequence of this is that documents can be effectively passed through a content pipeline, each stage of which serves as a "black box" that takes in zero or more documents on one end and produces zero or more documents on the other. Apache Cocoon is one (albeit primitive) example of such a pipeline, and XProc (discussed below) is another; the MarkLogic analog is the Content Processing Framework (or CPF).
CPF works on the assumption that all documents are modular – that is to say, a document can be made up of one or more external documents that are linked together using the W3C XInclude mechanism. XInclude is the logical successor to document entities from the older SGML framework – an XML element indicating that a linked document should be incorporated into an existing one – but it has only been sporadically implemented because of the complexities involved in assembling such distributed content.
An important consequence of XQuery is that such "included" documents could be computationally generated rather than coming from a flat text file. Even in a purely static browser, this capability opens up the ability to inline your web components as <xinclude> elements (or XHTML elements that include XInclude attributes). In an AJAX-enabled application, XInclude makes it possible to dynamically parameterize and bind these components. Thus, the rising acceptance of both server-side and client-side XInclude looks to be an encouraging trend.
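Concretely, a modular document assembled this way might look like the following; the chapter structure and file paths are illustrative, while the XInclude namespace and elements are as the W3C specification defines them:

```xml
<!-- A chapter assembled from external parts via XInclude -->
<chapter xmlns:xi="http://www.w3.org/2001/XInclude">
  <title>Getting Started</title>
  <xi:include href="/parts/install.xml"/>
  <xi:include href="/parts/first-query.xml">
    <xi:fallback><para>Section unavailable.</para></xi:fallback>
  </xi:include>
</chapter>
```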
As a corollary to XInclude, the XPointer specification is also supported within the CPF, and here's where things get interesting. XPointer was one of the earliest post-XML specifications, but it has all but disappeared from usage precisely because it placed expectations on web servers that were frankly unrealistic until now – the ability to retrieve specific fragments of content, either text or XPath nodes, via a query-string command line.
XPointer is a selection mechanism – it indicates to the server that a given fragment is important. In the context of a link (such as an <a href>), XPointer devolves to the hash-marked id of the indicated document, to which the user agent is then expected to scroll. In XInclude, on the other hand, the XPointer-identified content itself is retrieved and incorporated into the existing document.
While this is a fairly difficult task for a web server to do on its own, it should in fact be the bread and butter of an XML database, which is optimized for exactly this sort of task. This is another area where MarkLogic Server shines: it supports the XPointer specification, including not only the simple id and element() schemes, but also the xpath() scheme. This means that you can retrieve a collection of nodes or fragments from a document or a collection of documents through XPointer notation, rather than going through the construction of a formal XQuery script.
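For example, an XInclude that pulls a fragment rather than a whole document might look like this; the target document and path are assumptions, while the xpointer attribute and xpath() scheme are as described above:

```xml
<!-- Include only the title of the first book in a catalog -->
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
            href="/books/catalog.xml"
            xpointer="xpath(/catalog/book[1]/title)"/>
```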
All of this is accomplished by a pipeline in which XIncludes and XPointers within the source document are parsed, expanded and processed. The pipeline can handle a number of other things as well – it can be used to validate documents, to run XQuery "filters" on them and to expand embedded tags.
One of the more compelling uses of the CPF is in an area called entity enrichment. Enrichment can be thought of as a semantic process – it scans a document for lexically "interesting" terms, compares these words or phrases against its own map of terms, and then wraps semantic information around each term, looking for additional context within the sample to determine which of potentially several usages is implied.
For instance, I happen to live in Victoria, British Columbia. Victoria here could refer to a number of distinct semantic entities – the aforementioned city, the British queen for whom the city was named, a company that sells exotic lingerie using buxom models, or the wife of a famous soccer star. Enrichment would find the word Victoria, then apply a series of filters to determine which confirming terms were present: "British Columbia" or "Canada" would be a good indication of the Canadian city (while "Australia" would suggest Victoria, Australia), "Disraeli" would indicate the queen, "secrets" would confirm the lingerie company, and "soccer" would likely indicate Victoria Beckham.
MarkLogic has incorporated a sophisticated entity enrichment engine into the server, accessible as XQuery extensions. Entity enrichment can also be performed as part of the general CPF pipeline, such that once a document has been transformed into its HTML form, enrichment can be used to add semantic information as attributes before passing it on to be published. In addition, administrators can also choose to use external enrichment providers through a web API interface; the enrichment may very well be better (enrichment is becoming a major business model, and web-service enrichment is evolving at an astonishing rate), though the out-of-process latency hit may make this option less attractive.
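The effect of enrichment on markup is easy to picture. In this before/after sketch, the wrapper element and attribute names are my own invention, not MarkLogic's actual output vocabulary:

```xml
<!-- before enrichment -->
<p>I happen to live in Victoria, British Columbia.</p>

<!-- after enrichment (illustrative vocabulary) -->
<p>I happen to live in
  <entity type="city" region="CA-BC">Victoria</entity>,
  <entity type="province">British Columbia</entity>.</p>
```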
In addition to enrichment, another typical pipeline operation is document conversion. This capability is very useful if you have a large number of Microsoft Word documents or Excel spreadsheets that you want to convert into XML. The pipeline also includes modules for performing Tidy operations that convert HTML documents to XHTML, zip and unzip operations that make it possible to work on zipped archival content, and DocBook and CSS operations.
The one disappointment I have with the pipeline is that there is no native XSLT support, either XSLT 1.0 or XSLT 2.0. This isn't necessarily a major limitation – you can of course create a web service that sends an intermediate-state document to an external transformation server and then passes the resulting document back into the pipeline – but as there are places where inline XSLT can prove quite useful, this lack can sometimes mean more cumbersome XQuery scripts or fairly expensive pipelines, time-wise.
I am encouraged by the hiring earlier this year of Norman Walsh. Although known principally as the guru behind DocBook, most of Norm's recent work has been in the area of the XML Processing Language, also known as XProc, a W3C working draft moving toward finalization sometime in 2009. Thus it is likely that subsequent versions of MarkLogic will include XProc as either an adjunct or a foundation to a new CPF.
Automatic Transmission, Power Windows, Power Brakes
XQuery is a significant breakthrough in that it allows for functional extensions to the core set, and MarkLogic Server hasn't stinted there. One of the most notable additions is an extensive set of administration methods that can be used to do everything from creating and deleting application servers to building databases to establishing groups and setting up alerts and triggers. In essence, from XQuery you could effectively build entire administrative applications for your own use.
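A small sketch of scripted administration, following my reading of the 4.0 Admin API (the module path and function names are as documented, as I understand them; the database name is illustrative):

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: Create a new database entirely from XQuery :)
let $config := admin:get-configuration()
let $config := admin:database-create($config, "review-db",
                   xdmp:database("Security"), xdmp:database("Schemas"))
return admin:save-configuration($config)
```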
This is already being deployed to good effect within MarkLogic's administrative module itself. It is possible to configure a huge number of capabilities within the web-based module, mostly because these all rely upon the administration extensions to do a lot of the heavy lifting. Personally, I think a simplified secondary interface would be a good idea here – one that lets you do the 10% of common work that you typically need when configuring an XML server, with the much more extensive administration system available as an alternative when you need to get deep into the guts of the system.
The debug module, similarly, provides tools for developers working with XQuery. The language itself can sometimes be maddeningly complex, and being able to see what is happening at any stage of an XQuery execution can prove the difference between going home at 5pm and going home at 1am. By being able to attach or detach debugging tracers to the queries you can both trace and log the state of your applications at any given time.
The search module (cts) includes a number of query-related functions. For instance, one set of functions lets you build composite queries programmatically, rather than writing XQuery outright – a real benefit when building GUIs. This makes queries composable: you can essentially build a query model that is executed as a pipeline of subsequent subqueries, something that translates nicely to visual metaphors. Another set handles a number of the entity-related functions, letting you register specific search patterns or tune the fitness and quality parameters of an entity search.
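Composition in practice looks something like the following; cts:search, cts:and-query, cts:word-query and cts:element-word-query are real cts functions, while the element name and collection are assumptions about the sample data:

```xquery
xquery version "1.0-ml";
(: Build a query from smaller, reusable parts :)
let $who   := cts:word-query("Victoria")
let $where := cts:element-word-query(xs:QName("region"), "Canada")
return cts:search(fn:collection("articles"),
                  cts:and-query(($who, $where)))
```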
Yet another cts-related set of tools ties into the larger geospatial focus of the server, a deliberate bid on MarkLogic's part to court the booming geospatial and infrastructure-related businesses and organizations that have emerged in the last few years. Beyond letting you perform geographical searches (find a point or region within another point or region, determine the political entities associated with a given region, and so on), the MarkLogic XQuery extensions also let you parse and create GML, KML (Google Earth's geospatial format), GeoRSS and MCGM content.
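A hedged sketch of a geospatial search, to give the flavor: cts:point and cts:circle are real constructors (point takes latitude then longitude; circle takes a radius and a center), and a geospatial element query matches documents whose coordinate element falls within the region. The element name and data layout are assumptions:

```xquery
xquery version "1.0-ml";
(: Find documents whose <coords> element lies within
   roughly 50 miles of downtown Victoria, BC :)
let $victoria := cts:point(48.43, -123.37)
return cts:search(fn:collection("places"),
    cts:element-geospatial-query(xs:QName("coords"),
                                 cts:circle(50, $victoria)))
```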
MarkLogic includes a number of other functions as well, including an extensive math library (XQuery's built-in library is singularly lacking in anything beyond very basic math functions), a module for sending email from an XQuery command (via an MTA that must also be on the system and configured appropriately within the Groups section of the Admin interface), and functions for initiating HTTP GET, POST, PUT, HEAD and DELETE calls from the server to other URLs.
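The HTTP functions make the server a client in its own right; xdmp:http-get is a real built-in, and my understanding is that it returns response metadata (status, headers) followed by the body. The URL here is illustrative:

```xquery
xquery version "1.0-ml";
(: Fetch a remote XML resource from within a query :)
let $response := xdmp:http-get("http://www.example.com/feed.xml")
return $response[2]   (: [1] is the status/header metadata,
                         [2] the retrieved document :)
```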
The one module that I believe MarkLogic could seriously benefit from is a way of making calls to SQL databases directly, perhaps via an ODBC driver layer. While such calls are nowhere near as efficient as working with internal data stores, SQL data still represents the overwhelming majority of all data sources, and the ability to communicate with external data servers should be an integral one for MarkLogic. On the other hand, the XML ContentBase Connector (or XCC) is a shim between MarkLogic Server and either .NET or Java, making it possible to invoke XQuery calls from those language frameworks directly rather than via URIs.
MarkLogic has easily established itself as the market leader in the XML database space, with good reason. The MarkLogic engine handles scaling well (and includes an extensive API for scaling across multiple systems), is perhaps the fastest of the XML databases, and provides an extensive, standards-compliant XQuery implementation. While there are a few weak spots – the documentation could be better integrated into the system itself, a better "introductory" UI could help immensely, and the lack of an XSLT transformation capability or an explicit ODBC SQL bridge is disappointing but hardly crippling – none of these significantly detracts from the fact that MarkLogic Server 4.0 is an impressive product, one that could easily make its way into a company's content management strategy, especially given the strong support that they have given their clients.
The server comes with two licenses, a general commercial license, and a community license that is considerably more constrained in terms of available space but is available for free (under a proprietary license). Both versions are available at http://www.marklogic.com/product/download-software.html. The Community version is a good way to experiment with the MarkLogic system and with XQuery in general, and MarkLogic has a strong, active development community working with the system.
It's worth noting that the developers at MarkLogic are also consistent about dog-fooding their own technology, often with amusing results. For instance, Jason Hunter was playing with the server when the idea struck him to read the various public mailing lists on the web into a MarkLogic server, convert them into an internal XML format, and then use the XQuery interface to make this data searchable in a broad number of ways. The result of this effort became MarkMail (http://www.markmail.org), a system that has become one of the more useful research tools on the web.
Overall, MarkLogic is definitely worth a look if you're in the market for an XML Database, and in this day and age, if you aren't you should be.