XProc: Drupal, XML Pipelines and RESTful Services

By Kurt Cagle
March 11, 2009 | Comments: 21

Why would you want to use XML as a programming language?

Anyone who has used languages such as XSLT should have a pretty fair idea of the complexities involved in treating XML as a programming language itself - it's verbose, forces thinking into a declarative model that can be at odds with the C-based languages currently used by most programmers, can be difficult to read, and as a syntax it doesn't always fit well with the requirements of establishing parameter signatures and related structures.

For this reason, languages such as XProc, the XML Pipeline Language, must have many people scratching their heads. At first blush, it is in fact a programming language - it has many of the same lexical structures (declarations, parameters, encapsulation, control structures, exception handling and so forth) that other programming languages have, and overall, the amount of work necessary to put together an XProc "program" would seem to outweigh the benefits when processing single documents.

However, working on some revisions to a RESTful Services prototype (part of a larger series of articles I'm working on about RESTful Services, to be published soon), I began to see a place where XProc is not just a viable alternative, but may in fact be the best solution. Oddly enough, it has to do with Drupal.

Despite its dependency upon PHP5 (a useful language, but one that tends to encourage some truly dreadful programming habits), the Drupal architecture itself is perhaps one of the best I've ever seen, in great part because it has implicitly taken many of the tenets of RESTful programming to heart.

Specifically, you can think of Drupal as a database of document resources, each of which can be accessed via a RESTful URL. At its core this mapping assumes that each internal document (which Drupal refers to as a node) can be accessed directly via its ID -

http://www.mydrupalsite.com/node/101

would access the node with a node ID of 101. Additionally, it's possible to create aliases to these nodes. For instance, if node 101 is a blog entry named "XProc: XML Pipelines and RESTful Services" then (with the appropriate module enabled), it could be referenced as follows:

blog/xproc__xml_pipelines_and_restful_services

It is important to note, however, that what gets returned from the URL is not necessarily the internal document (which is in fact simply a text field, possibly containing HTML markup, within a database table). Instead, the page is composited - it runs through a series of filters that add navigational elements, sidebar content, footers, and so forth, and the content for the node is then run through additional filters that turn natural line breaks into div or paragraph boundaries, expand macros, insert image references, and so on.

In other words, every time a page gets displayed (if it isn't already cached) it is run through a pipeline of operations. The same is true for input operations - Drupal takes the incoming fields from a given edit page (which was in turn generated through yet another pipeline based upon which input controls were added to the nodes themselves) and then uses a pipeline of operations appropriate to the given content type to store this information in the database.

In PHP, designing the individual modules for performing these types of operations can be fairly harrowing, as it requires keeping track of a number of global state variables while at the same time ensuring that your output can in fact be used by the next module in the sequence.

Moreover, the order of the modules is largely dependent upon the order in which you loaded the module packages from a given community site (though there are community modules that can help with that particular problem). There are also some significant dependencies upon the "themes" that you use to act as scaffolding for the final output, which is one of the reasons why Drupal themes have a reputation for a certain "sameness" in their structure.

If you happen to be an XML aficionado, your thoughts may turn to XSLT for this type of problem. The challenge you face with such an XSLT is that, like any aggregating or transformative technology, the more items that need to be transformed, the more complex the transformation becomes - to the extent that it is often better to have multiple smaller transformations acting upon an XML stream rather than one larger, over-arching transformation (or one transformation that is actually intended to create other transformations that then act on the data, a process which, while possible, can be fraught with error).

Things get even more complex if your transformation both handles multiple input streams via parameterization and produces multiple output streams via the <xsl:result-document> element, something that, in distributed architectures, is becoming ever more commonplace.

XQuery runs into a similar problem. You can in fact define modules within XQuery and (with a number of implementations) actually evaluate functions within those modules inline, but the ability to add and remove functionality from scripts automatically can prove rather taxing (and can typically make your operations considerably slower and more cumbersome).

On the other hand, suppose that you had a meta-language that let you perform such operations in the abstract, as XML definitions. If, for every module operation (every pipe in the pipeline), you had a corresponding XML pipe definition, then so long as you ensured that the inputs and outputs of the pipes in question contained the proper content (the hot water from the furnace was going to the hot water pipe), the pipelines should handle the actual processing just fine. This is the essence behind XProc.
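
By way of illustration, here is a minimal sketch of such a pipe definition, written against the XProc namespace (the theme.xsl stylesheet is a hypothetical stand-in). A p:pipeline implicitly declares source and result ports, so the single step below reads the pipeline's input and feeds the pipeline's output:

    <p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
      <!-- run the incoming document through a (hypothetical) stylesheet -->
      <p:xslt>
        <p:input port="stylesheet">
          <p:document href="theme.xsl"/>
        </p:input>
      </p:xslt>
    </p:pipeline>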

XProc starts off with a number of primitives (see Table 1).

Table 1. Required XProc Primitives
Command Description
add-attribute Adds a single attribute (name and value) to any elements in the incoming document that match an XPath.
add-xml-base Adds an xml:base attribute that establishes the base URI against which relative references within a document are resolved.
compare Compares two documents for equality
count Returns the number of documents in the input sequence
delete Deletes the items in the incoming stream that match a given XPath
directory-list Returns a document showing all of the files and directories for a given URI
error Generates an error based upon the incoming document
escape-markup Converts an incoming document into serialized markup
filter Returns a portion of a document based upon a given XPath expression
http-request Makes an HTTP request to retrieve content from the web.
identity Makes a verbatim copy of the incoming stream
insert Inserts one document into another at the position(s) indicated in an XPath expression
label-elements Generates a label for each element that matches an XPath, stored in the element's attribute list
load Loads an XML document from an external resource
make-absolute-uris Converts relative URIs into absolute ones based upon the xml:base.
namespace-rename Renames a namespace URI to a different URI
pack Merges two documents in pairwise fashion (useful for merging linear table data)
parameters Exposes one or more parameters in an XProc pipeline.
rename Renames elements or attributes in a document based on an XPath
replace Replaces given elements with new documents based on an XPath.
set-attributes Sets attributes (names and values) on the matched elements.
sink Accepts a sequence of documents and ignores them. Useful for terminating operations.
split-sequence Splits one sequence into two others
store Stores a serialized version of a document to a URL.
string-replace Replaces text strings in a matching element with a target string.
unescape-markup Parses XML markup into a document
unwrap For a matched element, removes the element and connects the children to the element's parent.
wrap Wraps matched nodes in a new wrapper element
wrap-sequence Wraps a sequence of documents in a containing element.
xinclude Processes XInclude statements in a document.
xslt Transforms the input using a supplied stylesheet
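
To give a feel for how these primitives chain together, here is a small sketch (the element names and attribute values are invented for illustration). Each step's primary output becomes the next step's primary input, exactly as the pipe metaphor suggests:

    <p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
      <!-- drop anything flagged as a draft -->
      <p:delete match="*[@status = 'draft']"/>
      <!-- stamp the surviving sections so later pipes can recognize them -->
      <p:add-attribute match="section"
                       attribute-name="rendered"
                       attribute-value="true"/>
    </p:pipeline>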

In addition to these core "required" pipes, there's a second set of "optional but recommended" pipes:

Table 2. Optional but Recommended XProc Primitives
Command Description
exec Runs an external command within the environment.
hash Generates a hash string, or "digital signature" of a given document.
uuid Generates a Universally Unique Identifier and inserts it into a given document.
validate-with-relaxng Performs validation of a document based upon a supplied RELAX NG schema.
validate-with-schematron Performs validation of a document based upon a supplied Schematron document.
validate-with-xml-schema Performs validation of a document based upon a supplied XSD.
www-form-urldecode Decodes an incoming x-www-form-urlencoded string into parameters.
www-form-urlencode Encodes a set of parameters as an x-www-form-urlencoded string and injects it into the document.
xquery Executes an XQuery script using the incoming document as its source.
xsl-formatter Accepts an XSL 1.1 document and renders output from it (by default, a PDF file).
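
As a sketch of how one of these optional pipes might be used - say, validating incoming content before it ever reaches the rendering stages (node.xsd is a hypothetical schema; by default a validation failure raises a dynamic error that a surrounding try/catch could intercept):

    <p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
      <!-- reject any document that doesn't conform to the schema -->
      <p:validate-with-xml-schema>
        <p:input port="schema">
          <p:document href="node.xsd"/>
        </p:input>
      </p:validate-with-xml-schema>
    </p:pipeline>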

In and of themselves, such a set of pipes can be likened to a function library with clearly defined inputs and outputs (note that unlike most functions, however, pipes can in fact have multiple output streams). You can in fact make perfectly serviceable pipelines from them. Beyond these, however, is a set of additional "command" pipes that provide capabilities more typical of computer language "command" keywords such as for, if and group (Table 3):

Table 3. Command Pipes
Command Description
pipeline Creates a new named pipe from a pipeline of existing pipes.
for-each Iterates over a sequence of documents, useful when you have an extant pipeline that only works on single documents.
viewport Taking a document and an XPath match expression, viewport applies its sub-pipeline to each matched node, replaces that node with the result, and then returns the document in its entirety.
choose Selects one pipeline among several based upon an XPath predicate on each alternative, using when and otherwise branches. Akin to the xsl:choose statement.
group A group is a convenience wrapper for a sequence of commands in a pipeline, used primarily for organization.
try Executes an operation (typically contained in a group), and if the operation throws an exception, invokes a contained "catch" element. This is the primary error catching mechanism for XProc.
input Either creates an abstract input stream for a given pipeline, or identifies the specific source of that input. Inputs use ports to identify the expected incoming streams. Inputs can either be parametric or document oriented.
output Either creates an abstract output for a given pipeline, or identifies the specific target of that output. Outputs use ports to identify the outgoing streams.
variable A variable is a specific computed result or assignment. Variables can only be set from within the XProc document itself, though they can be shadowed (a variable can be used in different scopes, with different values in those scopes).
option An option is analogous to an XSLT parameter - it is a variable that can be set by a user, or set to a default value by the XProc engine directly.
with-option Whenever a compound step is executed, the with-option element makes it possible to pass an option into the pipeline. This is analogous to the XSLT with-param command.
declare-step The declare-step statement is used to create the signature for a given atomic step - its inputs and outputs. For atomic steps (i.e., those which are defined by some external agency, such as xquery or xslt), the declaration is typically used by a processor to signal the signature for the external operation.
library A collection of declare-steps and pipelines.
import This command loads in a pipeline or pipeline library.
pipe A pipe connects an input on one step to an output port on another step.
inline Provides an inline document, such as a data block or a transformation that's kept local to the XProc library.
data Reads an arbitrary (possibly non-XML) resource from a URL and wraps it for use in the pipeline. This is used primarily for inline data fields within img elements or similar resources.

Most of the command pipes provide mechanisms for working with (and as importantly, naming) pipelines (sequences of pipes). Put another way, with the command pipes, you have the ability to define your own pipes made out of sequences of more primitive pipes - pipelines that can be stored as separate, distinct modules and then loaded in when necessary.
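
A sketch of what such a module might look like (the ex: namespace, the scrub step and the library file name are all invented). A library declares a new pipe out of two primitive ones, and any pipeline that imports the library can then use it as though it were built in:

    <p:library xmlns:p="http://www.w3.org/ns/xproc"
               xmlns:ex="http://example.com/pipes" version="1.0">
      <!-- a custom pipe assembled from two primitive pipes -->
      <p:declare-step type="ex:scrub">
        <p:input port="source"/>
        <p:output port="result"/>
        <p:delete match="script"/>
        <p:delete match="@style"/>
      </p:declare-step>
    </p:library>

    <!-- in a separate pipeline document, load the module and use the new pipe -->
    <p:pipeline xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:ex="http://example.com/pipes" version="1.0">
      <p:import href="scrub-library.xpl"/>
      <ex:scrub/>
    </p:pipeline>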

Again, none of this seems like a radical departure from ordinary programming - you have the ability to create and load modules in most computer languages, and modular programming has been the foundation of computer science for decades.

Yet there are some very important differences here. Consider again the example of Drupal, which employs a similar pipelining architecture. A typical pipeline for producing an output for a given URL looks something like this:

  1. For the given document, retrieve the associated theme template document.
  2. For each region in the theme, retrieve the block information (subordinate "widgets") associated with that region.
  3. Retrieve the sequence of documents associated with each block and render them according to layout information bound to that block.
  4. Once each block has been rendered, walk back up the tree for the primary block.
  5. Retrieve the body and fields of the document associated with the URL.
  6. Apply a sequence of filters (again, a pipeline) upon the body to convert the internal representation into an appropriate output format.
  7. Insert this content into the page output.
  8. Send the contents to the client.

Note, this is a deliberate simplification, meant for illustrative purposes only.
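
In XProc terms, a drastically simplified sketch of that rendering pipeline might look like the following (the stylesheet names and the region element are hypothetical stand-ins for real theme machinery):

    <p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
      <!-- steps 1-2: expand the theme template's region references -->
      <p:xinclude/>
      <!-- steps 3-4: render each region's blocks in place -->
      <p:viewport match="region">
        <p:xslt>
          <p:input port="stylesheet">
            <p:document href="render-block.xsl"/>
          </p:input>
        </p:xslt>
      </p:viewport>
      <!-- steps 5-7: merge in the node body and apply the output filters -->
      <p:xslt>
        <p:input port="stylesheet">
          <p:document href="body-filters.xsl"/>
        </p:input>
      </p:xslt>
    </p:pipeline>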

For Drupal, the process of creating themes can be fairly complex, typically requiring hard-coding specific configurations in PHP and employing mixed PHP and HTML code. It is remarkably easy, from personal experience, to bollix such a theme. On the other hand, if the source file were an XML configuration template, not only could you validate the document prior to running it, but with an XML foundation it would become much easier to build tools to visualize what the template would look like (and to build the templates in the first place).

Similarly, the process of rendering widgets becomes a matter of XSLT transformations acting on specific data that may be generated via an XQuery command, one that can be parameterized and defined as a distinct step in an XProc pipeline. Indeed, the whole view mechanism that gives Drupal so much of its power has direct analogs in XML technologies - filters become XQuery WHERE clauses, paging is a simple XPath function (subsequence()), sorting is handled by the ORDER BY expression, arguments are parsed from the URL and passed in via XProc options, and so forth, while the final rendering can be handled either via simple XSLT transformations that can be autogenerated or via more complex XSLTs that can be loaded in and again wrapped as named pipes.
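
A rough sketch of that view analogy, with the query inlined in the pipeline (the node, type and created names, and the ten-entries-per-page window, are invented for illustration):

    <p:pipeline xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step" version="1.0">
      <p:xquery>
        <p:input port="query">
          <p:inline>
            <c:query>
              (: filter, sort, then page - first page of ten entries :)
              let $hits :=
                for $n in //node
                where $n/@type = 'blog'
                order by $n/created descending
                return $n
              return subsequence($hits, 1, 10)
            </c:query>
          </p:inline>
        </p:input>
      </p:xquery>
    </p:pipeline>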

The process of applying filters to the body of a Drupal node is essentially the same as applying a pipeline sequence to a given input document (or sequence of documents). Note additionally that such a process is in turn potentially recursive. The initial thematic document would have associated sub-elements that would be expanded out in turn until no more expansion was necessary at any phase of the pipeline, most likely by the simple expedient of testing to see whether any element or attribute still retained an intermediate namespace.

So far, this would seem to indicate that it is at least possible to create a Drupal-like pipeline CMS using something like an XML database and XProc. The next question would be whether it would in fact be beneficial to do so. The answer, in my opinion, is that there would be major advantages to using this kind of an architecture, for a number of reasons.

First, XML is queryable at a level that simple text isn't. Drupal has a module called CCK, which makes it possible to add additional fields of content to its internal nodes, but the addition of such fields can come at a considerable processing cost. XML databases utilizing XQuery and XSLT would be able to deal with documents of considerably higher degrees of complexity (think HL7 or XBRL, for instance, which may have dozens or even hundreds of properties at various levels of folding). This also holds for the W3C's XQuery full-text search capabilities, which are coming online in the next year.

One of the central challenges that Drupal faces as well is the issue of versioning and module addition. Creating a module in Drupal is non-trivial. Besides needing a fairly solid understanding of PHP, you also need to be aware of the various permutations of components, quasi-global variables, and frequently changing versions that, when they fail, usually fail catastrophically (the page refuses to render, and you're left staring at a blank page). XProc modules, on the other hand, are implicitly abstract, making it much easier to query such a module and get everything from pipe configurations to parameters to internal documentation - and because you can try/catch XProc expressions, it becomes far harder to end up with the dreaded white screen, and in many cases you can even provide decent diagnostics about where errors occurred in the rendering process when they do happen (likely with considerably less frequency).
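
A minimal sketch of that try/catch safety net (theme.xsl and the fallback content are hypothetical): if the transform throws, the catch branch substitutes a diagnostic page rather than letting the whole rendering die:

    <p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
      <p:try>
        <p:group>
          <p:output port="result"/>
          <!-- the step that might blow up -->
          <p:xslt>
            <p:input port="stylesheet">
              <p:document href="theme.xsl"/>
            </p:input>
          </p:xslt>
        </p:group>
        <p:catch>
          <p:output port="result"/>
          <!-- fall back to a simple page instead of a blank screen -->
          <p:identity>
            <p:input port="source">
              <p:inline>
                <html><body><h1>Rendering failed</h1></body></html>
              </p:inline>
            </p:input>
          </p:identity>
        </p:catch>
      </p:try>
    </p:pipeline>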

Indeed, going from an XProc document to an SVG rendering of that pipeline is quite doable, and with a little creativity at the JavaScript level, most contemporary web browsers that support SVG could host a graphical user interface specifically for building pipelines ... to the extent that such pipeline development might look more like working with Visio than like programming.

Another critical distinction between a Drupal pipeline and an XML one is that XML in general handles linkages far better. If you wrap a declare-step around an http-request, for instance, then you have the potential to create an integrated service that could load resources from external sources (such as Atom feeds) and then treat them in exactly the same manner as internal XML resources (Drupal usually has to store this information in the database, at least temporarily, creating some complexity for working with services).
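
A sketch of such a wrapper (the ex:fetch-feed type name and href option are invented). The step builds a c:request document for the http-request pipe, and whatever the feed returns flows out of the result port like any other XML source:

    <p:declare-step type="ex:fetch-feed"
                    xmlns:p="http://www.w3.org/ns/xproc"
                    xmlns:c="http://www.w3.org/ns/xproc-step"
                    xmlns:ex="http://example.com/pipes"
                    version="1.0">
      <p:output port="result"/>
      <p:option name="href" required="true"/>
      <!-- build a GET request pointed at the feed URL -->
      <p:add-attribute match="/c:request" attribute-name="href">
        <p:input port="source">
          <p:inline><c:request method="get"/></p:inline>
        </p:input>
        <p:with-option name="attribute-value" select="$href"/>
      </p:add-attribute>
      <!-- execute it; the Atom feed arrives as the step's result -->
      <p:http-request/>
    </p:declare-step>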

This support extends to modules themselves. It's not hard to envision an auto-update facility that would, at a periodic cron interval, retrieve a listing of all available module libraries from another server (possibly available in a distributed fashion), then automatically update local modules by storing them IN the database. Adding new modules would involve displaying module summaries from external repositories with associated documentation, along with the option to load and incorporate the modules. (Moreover, because the module summaries would likely end up including the associated signatures for pipelines, it becomes possible to model virtual pipelines without actually having to download the full code for those modules - or even to test drive a virtual pipeline using remote services invocations without wiping out existing pipelines).

Beyond this, there's one other area that I see as one of the biggest benefits of this approach - if your XML database supports transactions, it becomes possible to roll back a pipeline (or even an entire site) if you do in fact do something that puts the pipelines into an incomplete state. In my experience, Drupal sites have a tendency to become unstable over time as you add or remove modules, especially experimental ones, often leaving you in the position of having to support partial functionality until you either find the problem or rebuild from scratch (all too often the latter).

This is not to say that if you use XProc, you also need to recreate all of Drupal. The Drupal application tends to carry fairly heavy overhead in order to support this kind of architecture, which again contributes to some of the performance problems that Drupal applications are prone to. The overhead of an XProc-based solution may be far smaller, in part because you may not need the whole infrastructure for your particular application, and in part because it may be possible to pre-compile XProc code for much higher performance.

Right now most of this is conceptual. XProc implementations are coming online - MarkLogic will likely support it in a near-future revision (Norman Walsh, now of MarkLogic, is the editor of the specification), Alex Milowski (another XProc alumnus) has been developing an XProc implementation for the eXist XML database, and XProc is part of EMC/Documentum Dynamic Delivery Services (through Vojtech Toman, yet another XProc alum), where it can be run in conjunction with both the xDB8 and the newly released xDB9 XML databases.

Other pieces of this puzzle, including a growing awareness of the benefits of RESTful services, are similarly emerging, as is the general acceptance of URL rewrites within XML databases, making it possible to bind dispatch services written in XQuery to specific URLs, HTTP methods and presentation faces - the interfaces of such services. While not all XProc pipeline engines will be written in XQuery, it is fairly likely that most of them will be callable from XQuery.

Yet in all of this, it's worth looking closely at Drupal's lead in this area. For all of the weak points that Drupal has (and it has its share), there are a number of things that it has done right that are worth emulating on the XProc/XML database side. A pipe or step in a pipeline is a component, something that does a specific task, and components raise the very real possibility of community development.

Not all such components will be trivial. For instance, consider just a smattering of what kind of pipes could be developed:

  • a pipe that will do macro substitution of your own configurable markup,
  • a pipe that will generate Google Earth KML from XQuery results against your database,
  • a pipe that lets you read through sales data and generate SVG charts based upon parameters you set,
  • a pipe that will generate an XForms package from a schema,
  • a pipe that converts an XBRL document, including embedded links to documentation, into an XSL-FO document which can in turn be reproduced as a PDF,
  • a pipe that can send out Twitter notifications when a particular document is processed,
  • a pipe that can create form emails from an HL7 medical encounter document reminding a patient what medications to take, what tests to perform and when their next appointment is,
  • a pipe that will convert HTML content into VoiceXML, and another that will then render the VoiceXML through a text-to-speech filter,
  • a pipe that performs text enrichment through a third-party service and then passes the enriched document to the next pipe in the chain,
  • and so on...

Moreover, because XProc is itself an XML abstraction, this raises the possibility that the same XProc pipeline could be used with multiple XML database environments, as the abstraction of the signature could also provide for selection of the right XQuery extensions or similar code for a given database.

Whether this would lead to the same kind of community support that Drupal now has remains to be seen, of course, but it is not hard to envision both a strong community following, especially in the realm of publishing, blogging and general content management, and commercial opportunities (XBRL, HL7, S1000D, HR/XML, DITA integration - really, anywhere you're dealing with enterprise-grade XML). This also makes bridge applications between non-XML data formats, such as JSON, EDI, email formats and so forth, much easier to build without having to resort to complex Java programming, with all of the messy creation of readers and writers that can make such code so problematic - and not just in a one-off fashion, but in a way that makes these capabilities available to customers or community members as appropriate.

Much of this is still extrapolation, though I don't doubt that this type of capability will come about sooner rather than later - as XML databases combined with XQuery increasingly become full-fledged web environments, as XProc goes from conceptual specification to full implementations, and as the need for sophisticated XML processing continues as part of a greater drive towards transparency in all forms of business, government and society, RESTful Services and XML pipeline architectures seem to be the best way to get there.

Kurt Cagle is an online editor for O'Reilly Media. Please feel free to subscribe to his news feed or follow him on Twitter.



21 Comments

Kurt,

You know how much I'm a fan of XML technologies, and XProc is no exception, but here you write a piece telling us that "if you use XML all the way through your architecture, then it will make sense to use XML technologies". It feels completely circular.

Indeed, if you were to have an XML database that you can query using XQuery, transform with XSLT and now even create processing pipelines for with XProc... then you'd be in luck, because XML would be so much easier.

You are right on one aspect nonetheless: XProc was one of the missing pieces in the XML world needed to allow developers to work almost entirely in XML all the way through. Perhaps that's one of the reasons why JSON eventually took over from XML on the web. The pieces to process JSON were already there and mature, making it more immediately attractive.

The heaviness Drupal may have is probably due to its architecture which, as brilliant as it looks, is certainly monolithic. You could consider making it better using XML technologies as you explain, or... not.

Sylvain,

As far as I can tell, even after more than fifteen years (and a couple of rather disastrous attempts), JavaScript is still not the standard operating language on servers, and for all the presence that it has on the client, JSON barely registers as a server-level communication layer (and likely never will).

My argument is actually something of an inversion to what you're claiming I'm saying here - if you can provide a common structured metalanguage internally and work with it exclusively, you significantly reduce the impedance level involved in translating between alternative types of formats (XML to JSON to EDI to CSV to whatever) for the bulk of your processing efforts. Should you need to produce or consume JSON, say, then you make sure that at a public-facing endpoint in such a pipeline you have a filter that performs such a mapping.

Could you do that with JSON? Well, sort of. Any time you need to convert that JSON into markup, you're adding to the impedance - last I saw, most web pages didn't render JSON structures. You have no validation facilities with JSON.

You have no consistent transformation facilities with JSON. The SQL-to-markup layer requires a level of semantic awareness that tends to be a performance bottleneck, a development bottleneck and a significant source of errors - you have none of those problems in properly designed XML-based stacks.

You have no way of embedding multiple contexts in JSON, whereas such facilities have been part of XML from the beginning via namespaces.

You could argue that you don't need these capabilities, and quite honestly, for a fair number of applications, you probably don't. Most of those are low-hanging-fruit types of applications. However, it's been my experience working with both (and despite my reputation I do actually work with both) that to get beyond that first fifty percent of applications, you end up with structures and constructs that look more and more like XML, but without the decade-plus of standardization and consistency brought to the development of XML.

So I guess I'd have to say in response to your argument that IF you want to take advantage of those things that XML does better than JSON (transformations, queryability, encapsulation of multiple contexts, serialization to markup, internationalization, intrinsic linking conventions, globally established schemas and so forth), then I think that pipeline architectures are definitely the way to go, and XProc provides the framework for such an architecture.

Moreover, if you're looking for JSON streams for mashups, then this architecture provides that too (XQuery is perfectly capable of producing JSON, and with a bit more work, of parsing it). This is where the faces component above comes in. I can readily say "hey, I want the stream coming from this URL to be json rather than XML" and the pipe for doing the JSON serialization rather than a final XML one would be used instead.

I would argue using the same 'anything' throughout an architecture would always have compelling benefits, and agree that if this 'anything' is XML it would be an even better thing ... but pragmatically, consolidation is a difficult trick to pull off, and I will confidently predict that the moment it might occur it would then quickly diverge. It's the way our business works. I don't think we should convince ourselves that using the same thing everywhere is remotely realistic; yes, you and I will be making 'all XML' architectures, but most developers won't be.

On the point you make about JSON: if you are doing pure serialization, XML does look a bit heavy-handed to say the least, and using something like JSON makes sense... but the moment the data you are working with starts looking like a document, JSON starts looking like the wrong option.

To put it another way, if you know exactly how to solve your problem and you know that there are no new requirements, then why not choose something perfect for the job (like JSON)... but when requirements are volatile (as they often are), then going down a general-purpose route makes sense, and that is the compromise that XML represents (IMHO). I like JSON, because it's fast ... and easy to generate marshaling code for, specifically on the client end (aka JavaScript), but I don't make it a 'pillar' of any solution until that part of the code calms down; which means I lose out on speed of development at the beginning of a project. As a disclosure: I have yet to actually use JSON because requirements change too much with the stuff I work on, but I will continue to consider using it as an optimization.

Thinking of XProc as a magic bullet to solve architectural problems is wrong ... people like Kurt (and myself) will ab/use XML technologies to do our own bidding; what I find interesting is how 'teachable' a technology is, and the jury is out on XProc ... but to give some indication, comparing the 'teachability' of XSLT versus XQuery, the former is probably a magnitude easier than the latter ... I wonder how XProc will plot - in between, maybe? And what is XProc 'versus'? You are making the case that it might be Drupal or PHP itself. I don't know.

I would suggest that a negative space can be helpful in adoption ... for example, how popular would REST be if we had no SOAP? I am not proposing for a moment to make bad faux technologies, but it is interesting to note that adoption never occurs in a vacuum - we are always going from something to something else ... just wondering where people are coming from (outside of XML) to XProc and how to tell that story to them.

I suspect once we have some XProc applications (working on it...) then we can start really getting excited about XProc.

nitpicking ps: I like your usage of 'pipes' in the article, but calling them 'steps' will be less confusing for the poor souls who read the spec ;)

Kurt,

I see more clearly what you meant now. I then have to agree with Jim when he says that, as long-time convinced XML technologists, you guys may be bending the world slightly too much toward what feels like it should/could be the best solution.

I will be honest, I don't know the XProc spec nor its potential all that well, so I'm quite ready to trust your judgment when you announce that it's the most adapted choice when you need the powers lying within XML technologies.

But to me the real issue today isn't so much about solutions as about data and its organisation. I will be careful not to fall into the Semantic Web world here, but a reality that hit me not too long ago is how much the so-called open APIs offered by services like Facebook, last.fm and the like are just a way of hiding their data in a manner that makes reusing it almost useless.

Few use Atom, for instance, and even those that do often do it poorly (Wikipedia is one sad example). The data is usually poorly structured and therefore almost worthless.

Then you might wonder: what good is there in having all those cool technologies like XProc if you have no well-structured data to work on?

Of course, I assume you are considering those XML technologies for internal business architectures where there is more end-to-end control. In that field, XML technologies probably have more of a chance to thrive.

Sylvain,

I want you to consider a few interesting "data"-points then - one of the initiatives recently announced by the new Federal CIO, Vivek Kundra, was data.gov, which is intended to be a central clearing-house for ALL of the data that the US government produces ... petabytes worth of it - and it's very likely that almost all of that will be coming in the form of XML content.

Realistically, one of the reasons that I'm pushing all of this is not to take advantage of social networking sites. Frankly, they have a STRONG motive to keep any but the most trivial of information contained within their particular sandboxes, "open API" or not.

Instead, there are three major sources of new data coming online in the next year - financial information in the form of XBRL financial reports, electronic medical health records as HL7v3, and the data coming from the US Federal Govt (and likely other governments once they have a chance to assess what the US is doing).

Frankly, the reason I'm so big on XML right now is that I expect the pipes are going to start filling up VERY quickly - people will need to have the ability to process it, to manipulate it, to produce it and to filter it. This means that we need to be thinking about architectures based upon the tools that do exist now, not just what existed five or ten years ago - and what exists today in that regard is XProc.

What doesn't exist yet is a sense of best practices, of what the most appropriate format for building XML architectures with it is. That's a lot of why I'm trying to drive the debate here - XProc may not necessarily factor significantly in the type of work that you're doing at the moment, but it has a huge impact upon the clients that I advise. These are LARGE data producers and consumers, and for them XML is their lifeblood.

I think this article provides a good introduction to a possible application of XProc.

I've read the XProc spec a couple of times and I've found it far more difficult to understand than either XSLT or XQuery. I'm not saying that the design of XProc is bad, but rather that the specification document could probably be clearer.

In this article, listing XProc commands and steps in their own tables helps to clarify and summarize XProc functionality. It would be nice to see more articles like this one that a PHB could read and understand the value proposition of XProc.

Several of my clients could use XProc now to replace existing procedurally written transform streams, and to simplify more complicated ones I expect to write in the future.

XProc in eXist will be great.

But I also need XProc for Python. I'm toying with the idea of using lxml to write an XProc processor for CPython. I know there's at least one other person 'on the net' with a similar idea...

We will be covering XProc in more detail relatively soon. There are a number of XProc initiatives underway at the moment, and I'd prefer to actually show some examples in the relatively near future.

I hadn't thought about XProc in Python, but I could definitely see where it would be useful. Please keep me informed about that.

Good introduction to XProc, Kurt. Just one clarification: as you mention, EMC has implemented XProc in our Dynamic Delivery Services (DDS) product. You say that DDS "will likely be linked with their upcoming database", but that's already the case. DDS was launched last March as the development and runtime services for xDB, our native XML database. We've just launched version 9 of xDB, which is maybe what you were thinking of.

An excellent article, and an interesting read all in all. Thanks! One small thing regarding XQuery and XProc: I don't see the xquery step in the list of optional XProc steps. You seem to have missed that one somehow.

I am developing the XProc engine at EMC and I feel I should share more details about our implementation. You are right that the engine is part of EMC/Documentum Dynamic Delivery Services - XProc is one of the core technologies in the product and we are really happy with the possibilities it provides us. DDS is all about publishing XML content and interfacing with end-users (we use XForms heavily) and XProc just makes many things so much easier (to develop/to maintain/to customize). For me, there is no better technology to process XForms submissions than XProc...

Our XProc engine is not directly dependent on DDS nor EMC/Documentum xDB (our native XML database); it is a standalone component that can be run separately. Obviously, we deploy it on top of xDB to get all the nice features like XQuery, indexes support, security, and transaction control (so yes: we can rollback a pipeline).

Although proprietary at the moment, our XProc engine will soon be released to the general public, free for developer use, as part of a wider suite of other XML-related tools. One of these tools will also be a graphical, in-browser XProc designer.

(One small correction: It is not Jeroen van Amsterdam, but Jeroen van Rotterdam, but he is not really involved with the XML Processing Model WG nor with our XProc implementation; I am.)

Oh, a couple of other corrections - it's Jeroen Van Rotterdam (not Amsterdam), and Vojtech Toman is EMC's representative on the XProc working group.

An XProc tutorial courtesy of Roger Costello: http://www.xfront.com/xproc/

Jerry, Vojtech and Jeroen,

Much thanks for the clarifications. I had actually made the correction in Jeroen's name in a later draft but a glitch put an earlier draft up (this also included the lack of the XQuery step, which I use all the time myself). The joy of being both writer and editor is that when you DO mess up there's no one to look over your shoulder and say "Are you sure that's right?"

Corrections have been made in the original article.

Hi Kurt,

Nice article on XProc, thanks!

With respect to Brad's comment about clarity of the spec, it's always a challenge to find the right balance between tutorial material and concise, normative prose.

Perhaps the current spec is weighted more towards normative prose than is ideal for an introduction to the spec. Hopefully folks (myself included!) will be able to write some more tutorial material to help bridge the gap.

Hi Kurt, Thanks and appreciation for the XProc coverage, it's an area I am *trying to* follow. We explored and prototyped a "homebrew" serialized-XSLT pipelining approach using classic ASP as the transform engine and URLs as the calling interface. Even in this tiny-scale demo the coolness, modularity and coding efficiency of pipelining are obvious, and we've got a toolset that matches a large number of basic use-cases directly (the user just writes a URL referencing their data and ordering up a series of transforms). The modular pieces read many MS Office formats, SQL rowsets and HTML screen scrapes, and transform the results into various XHTML snippet views, SVG pictures, JavaScript fillers, and related useful forms for the browser.

Hello,
IMHO a separate language for XML pipelines does not make sense, since every single feature can be implemented in XSLT:

http://www.gerixsoft.com/blog/xslt/xml-pipeline-xslt

Andriy,

It can be implemented in XSLT 2.0 certainly - I've developed a few prototype XProc pieces that were built exactly this way. The benefit of the XProc approach is that XProc can be implemented in a number of different languages - my latest is in XQuery, and while there are some second-order functional issues that pop up (an eval function is VERY useful there), it's pretty straightforward to build such a pipeline in XQuery. However, it could also be built in Java or PHP or Python or Ruby as well, and I've heard of people working on implementations in ALL of those languages.

I personally love writing XSLT2 code, and have been doing it since well before the spec itself was finalized in 2007, but I made the realization years ago that the number of people who are proficient enough with XSLT2 code to make it work to its full potential numbered in the high dozens in the US (that may have changed to the low hundreds by now, but it's still tiny compared to the number of Java or even PHP programmers).

There are things which XSLT2 is the absolute best tool for - I think that as a transformation language it's at its best, and there are definitely places where I'd prefer going out to an XSLT rather than build a treewalker in XQuery - but there are other places where it can be a kludge. XProc opens up the possibility of working in both worlds, and can very easily be tied into external processes as well.

You write:

> ... the complexities involved in treating XML as a
> programming language itself - ... forces thinking into a
> declarative model ...

Why do you think using XML implies a declarative model?

If you define declarative constructs in XML syntax, you have a declarative language.

If you define imperative constructs in XML syntax, you have an imperative language.


XSLT implies a declarative model; it is a declarative language. You may still use xsl:call-template, but that will be like using goto.

Thank you! Great podcast as usual!

Kurt, is there any recent news on the availability of XProc libraries for use in PHP? Likewise, anything hopeful about XSLT 2.0 for PHP? I have use cases for which an all-Java requirement is not yet a good fit.
