XProc and SMIL: Orchestrating Pipelines

By Philip Fennell
September 14, 2009 | Comments: 3

Although the W3C's XML Pipeline Language (XProc) hasn't even left the stable yet, people are already looking beyond what it was originally designed for. Previous threads on the XProc mailing list discussed the topics of parallel step execution, process orchestration and comparisons with the Business Process Execution Language (BPEL). On the subject of 'Sequential steps, parallelism, and side-effects', the recommendation states:

In the simple, and we believe overwhelmingly common case, inputs flow into the pipeline, through the pipeline from one step to the next, and results are produced at the end. The order of the steps is constrained by the input/output connections between them. Implementations are free to execute them in a purely sequential fashion or in parallel, as they see fit. The results are the same in either case.

However, some people feel that a more explicit 'these steps must be executed in parallel' is required, and it is easy to sympathise with that view when you too need that functionality. Last year I was involved in the aggregation of results from several term extraction services; we chose Apache Cocoon to handle the pipeline processing of requests and transforms. To make our service more efficient we added a parallel request transformer component that would explicitly make concurrent requests and merge any results returned within a set period of time. The act of making explicit concurrent requests improved the performance of the service considerably.

XProc was originally designed to solve the problem of how to describe the joining together of multiple XML processing steps. For anyone not familiar with XProc, I suggest you take a look at Dave Pawson's An introduction to XProc. It's a very good jump-off point for exploring the possibilities of this promising technology.

So, the question is, how do you extend XProc to handle new features like explicit concurrency? Firstly, XProc is very extensible: you can write extension steps or extension attributes (in their own namespace) for an implementation. An EXProc community has already formed, following in the footsteps of EXPath and EXSLT; its job is to co-ordinate extensions across implementations.
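
To make the extension attribute route concrete, here is a minimal sketch. The ex namespace URI and the ex:concurrent attribute are invented purely for illustration; no implementation is known to define them, and a processor that does not recognise an extension attribute behaves as if it were not there.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:ex="http://example.org/ns/xproc-extensions"
    version="1.0">
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- ex:concurrent is a hypothetical extension attribute. It lives in
       its own (non-XProc) namespace, so a processor that does not
       implement it simply runs p:identity as usual. -->
  <p:identity ex:concurrent="true"/>
</p:declare-step>
```

The drawback of this approach is that the hint is scattered across the steps themselves, which is part of what makes a separate timesheet attractive.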

It occurred to me, as a result of those mailing-list threads, that there is another rather intriguing possibility. If you want to dictate the active duration of a pipeline step, i.e. when it starts, how long it runs for, if it should repeat, how often it should repeat and under what circumstances it should start and stop (step orchestration), then you could use the W3C's Synchronized Multimedia Integration Language (SMIL) to describe these types of behaviours. To ensure a clean separation between the descriptions of the pipeline and its orchestration you would want to use SMIL Timesheets.

Now, I'll admit that there isn't much within an XProc pipeline that you would choose to animate, but the Timing and Synchronization features may prove useful in this case. SMIL Timesheets came into being a couple of years ago with the intention of separating the description of animation behaviours from content. They can be used with SVG, XHTML or any other XML language that would require animation, timing and sequencing. The idea is that you define your sequential (seq) and parallel (par) time containers and within these you identify, using CSS selectors, items within the host page that the timing and animation behaviours apply to. Personally, I would have liked to have seen XPath selectors allowed too, but we won't worry about that for now. Going back to my previous concurrent HTTP requests example, you could define such a behaviour using SMIL and XProc like this:

1 <?xml version="1.0" encoding="UTF-8"?>
2 <p:declare-step
3     xmlns:p="http://www.w3.org/ns/xproc"
4     xmlns:terms="http://example.org/service/terms/"
5     name="term-aggregation">
7   <p:input port="source"/>
8   <p:output port="result"/>
10   <p:pipeinfo>
11     <smil:timesheet xmlns:smil="http://www.w3.org/ns/SMIL30">
12       <smil:par>
13         <smil:item select="#service1" begin="aggregate-terms.begin" dur="11s"/>
14         <smil:item select="#service2" begin="aggregate-terms.begin" dur="11s"/>
15         <smil:item select="#service3" begin="aggregate-terms.begin" dur="11s"/>
16       </smil:par>
17     </smil:timesheet>
18   </p:pipeinfo>
20   <p:declare-step type="terms:get">
21     <p:documentation>
22       Makes an HTTP request to a term extraction service.
23     </p:documentation>
24     <p:input port="source"/>
25     <p:output port="result"/>
26     <p:option name="href"/>
28     <!-- ==================================================================
29        Omitted so as not to distract!
30     =================================================================== -->
32     <p:identity/>
33   </p:declare-step>
35   <terms:get xml:id="service1" name="OpenCalais"
36       href="http://opencalais.com/"/>
37   <terms:get xml:id="service2" name="MetaCarta"
38       href="http://www.metacarta.com/"/>
39   <terms:get xml:id="service3" name="Yahoo"
40       href="http://search.yahooapis.com/"/>
42   <p:wrap-sequence xml:id="aggregate-terms" name="aggregate"
43       wrapper="terms:group">
44     <p:input port="source">
45       <p:pipe step="OpenCalais" port="result"/>
46       <p:pipe step="MetaCarta" port="result"/>
47       <p:pipe step="Yahoo" port="result"/>
48     </p:input>
49   </p:wrap-sequence>
50 </p:declare-step>

In the above example, the first thing to note is the mechanism XProc uses for annotating a pipeline with implementation-specific information. The p:pipeinfo element, at line 10, contains the SMIL Timesheet declaration, and this would be ignored by any processor that could not understand such mark-up.

The pipeline itself is relatively straightforward and illustrates some of the more interesting features of XProc. The terms:get step (line 20) is declared with a type attribute to indicate that it is a definition only and must be called explicitly for it to be executed; in this case it is called three times (lines 35, 37 and 39). The step, whose contents have been hidden in order to keep the example short, would use the p:http-request step to submit a request to a term extraction service, with an href option supplying the URI of the service it is calling. In a normal XProc processor the results of the three terms:get steps are used as a sequence of inputs to the step with the ID 'aggregate-terms' (line 42) by having their result ports connected to the source port of this step (lines 45-47). The step itself is a standard XProc step for wrapping a sequence of documents in a container element.
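
For the curious, the omitted body of terms:get might look something like the sketch below. This is only one possible shape, not the actual implementation: it assumes a bare GET request that does not send the source document anywhere, and marking the href option as required is an addition of this sketch. It is written as a drop-in for the nested declaration, so it relies on the enclosing pipeline for anything else.

```xml
<p:declare-step type="terms:get"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:c="http://www.w3.org/ns/xproc-step"
    xmlns:terms="http://example.org/service/terms/">
  <p:input port="source"/>
  <p:output port="result"/>
  <p:option name="href" required="true"/>

  <!-- Assumption: a simple GET; the document on the source port is
       not sent to the service in this sketch. -->
  <p:identity>
    <p:input port="source">
      <p:inline><c:request method="GET"/></p:inline>
    </p:input>
  </p:identity>

  <!-- Copy the href option onto the c:request as its target URI. -->
  <p:add-attribute match="/c:request" attribute-name="href">
    <p:with-option name="attribute-value" select="$href"/>
  </p:add-attribute>

  <!-- Perform the request described by the c:request document; its
       result becomes the result of terms:get. -->
  <p:http-request/>
</p:declare-step>
```

A real term extraction call would more likely POST the source document in a c:body, but that adds wrapping steps that do not matter for the orchestration question.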

In this simple example a normal XProc processor would be free to execute the steps as it sees fit, but here we want the terms:get steps to be executed concurrently and to wait only 11 seconds for a response. In lines 13-15, the timesheet identifies three items using ID Selectors that are to be executed according to their parent time container, in this case smil:par, which implies parallel execution. That covers the synchronization; as for timing, the identified items are set to begin when the aggregate-terms step begins. In addition to using time-based values, SMIL allows event values too. So begin="aggregate-terms.begin" implies that the identified item must begin when the target's begin event is dispatched. This is a powerful feature of SMIL and can be used to great effect when linking behaviours together.
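
By way of contrast, the sequential time container would look like the sketch below. It reuses the same namespace URI and step IDs as the pipeline above and assumes the same hypothetical SMIL-aware processor: each item starts only when the previous one has finished its 11 seconds.

```xml
<smil:timesheet xmlns:smil="http://www.w3.org/ns/SMIL30">
  <!-- seq: children are activated one after another, in document order -->
  <smil:seq>
    <smil:item select="#service1" dur="11s"/>
    <smil:item select="#service2" dur="11s"/>
    <smil:item select="#service3" dur="11s"/>
  </smil:seq>
</smil:timesheet>
```

Swapping smil:seq for smil:par is the only change needed to move between the two behaviours, which is the appeal of keeping orchestration out of the pipeline itself.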

Due to the semantics of the dur attribute, this example would enforce a duration of 11 seconds regardless of how quickly all three services responded. SMIL 3.0 also has a max attribute that defines 'the maximum value of the active duration'. However, this attribute, along with its opposite, min, appears to be missing from the current draft of the Timesheets recommendation. This is strange, as the max attribute would be ideal for the job of defining time-outs.
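
If max were available, the time-out version of the timesheet might be written as in the sketch below. Bear in mind that this borrows max from SMIL 3.0 timing and is therefore not valid against the current Timesheets draft; it only illustrates the intent.

```xml
<smil:timesheet xmlns:smil="http://www.w3.org/ns/SMIL30">
  <smil:par>
    <!-- max (hypothetical here) caps the active duration at 11 seconds
         instead of forcing it to be exactly 11 seconds, as dur does -->
    <smil:item select="#service1" begin="aggregate-terms.begin" max="11s"/>
    <smil:item select="#service2" begin="aggregate-terms.begin" max="11s"/>
    <smil:item select="#service3" begin="aggregate-terms.begin" max="11s"/>
  </smil:par>
</smil:timesheet>
```

With this reading, a service that answers in two seconds would be finished in two seconds, and only a slow service would be cut off at the 11 second mark.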

To summarise what is going on here: the timesheet instructs the processor that the three steps that get the extracted terms must begin when the step that wraps their results in a container element begins, that they must run in parallel, and that their duration should be 11 seconds. This is a very simple example, and SMIL has some very deep and powerful features that could lead to some interesting and potentially complex pipeline orchestrations.

The idea of using Timesheets in this way is novel and requires some further exploration, which is why I've started a Google Code project that will attempt to explore this in a rather original way too. XProc+Time will make use of SVG, with its built-in support for SMIL, and maybe even SMIL itself, to mimic how a SMIL-enabled XProc processor might actually work. So, no final implementation, but rather a series of what I hope will be a wide range of use-cases for process step orchestration using XProc and SMIL.

Nice post - and an interesting demonstration of the extensibility/customizability potential of XProc.

Just a small remark - you say that: "... a normal XProc processor would be free to execute the [terms:get] steps as it sees fit...". This is not entirely correct, because the steps are declared as having a single (primary) input port and a single (primary) output port. Because of that, and because you don't provide explicit bindings for their input ports, the terms:get steps will get connected in a sequence automatically. Therefore a 'normal' XProc processor would always execute the steps in sequence (and the SMIL-enabled one would have to 'break' the implicit connections to enable parallel execution).

Changing the pipeline as follows would give XProc processors freedom to execute the terms:get steps in any order (or in parallel):




The example in my previous post is broken (I forgot to do XML escaping). I hope the code below is correct. Basically, I put p:sink steps in between the terms:get steps to break up the (implicit) linear sequence of the steps.

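<terms:get xml:id="service1" name="OpenCalais"
    href="http://opencalais.com/">
  <p:input port="source">...</p:input>
</terms:get>
<p:sink/>

<terms:get xml:id="service2" name="MetaCarta"
    href="http://www.metacarta.com/">
  <p:input port="source">...</p:input>
</terms:get>
<p:sink/>

<terms:get xml:id="service3" name="Yahoo"
    href="http://search.yahooapis.com/">
  <p:input port="source">...</p:input>
</terms:get>
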

This is one of the more interesting and challenging aspects of XProc, the whole business of input and output ports and how and when they get bound.

In my next post on this subject I'll be looking at the implicit timing and synchronization of a pipeline's steps, so thank you for your feedback.
