ETL and Publishing

By Rick Jelliffe
June 5, 2011 | Comments: 3

For a few years I have been trying to come up with a good definition of publishing workflows as an architectural pattern. The two key distinctive features, I think, are that publishing workflows are one-way flows rather than two-way flows (e.g., database/middleware CRUD and triggers), and that there is some kind of snapshotting going on. An edition is published: many individual items get a common status, version or milestone at about the same time.

When you have a publishing workflow, you can use publishing technology, such as XML with a pipeline/functional/event bent. When you don't have a publishing workflow, you may be better off using databases or objects: either quasi-XML systems such as XQuery or, more likely, no XML at all. This is the kind of issue that an introductory course on Document Engineering might cover, of course.

Wikipedia's page on ETL is highly relevant to XML/XSLT developers and explainers, even those involved with more publishy flows than those of the typical datawarehousey/enterprisey Extract/Transform/Load scenario. (One sentence in the Best Practices section of the Wikipedia page caught my eye: "Use file-based ETL processing where possible." Nice to see the recognition that files still have their uses!)

The Wikipedia section Real Life ETL Cycle is a pretty good prototype for the steps a large XSLT system might need, even if the source was not a DBMS and the destination was not a warehouse, but just plain old XML:

The typical real-life ETL cycle consists of the following execution steps:

  1. Cycle initiation

  2. Build reference data

  3. Extract (from sources)

  4. Validate

  5. Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)

  6. Stage (load into staging tables, if used)

  7. Audit reports (for example, on compliance with business rules. Also, in case of failure, helps to diagnose/repair)

  8. Publish (to target tables)

  9. Archive

  10. Clean up

So it seems to me that an ETL system, operating to periodically load data to a data warehouse, is just a publishing system, albeit one with lots of specific details and requirements.
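To make the correspondence concrete, here is a minimal sketch of the ten quoted steps applied to plain XML files rather than a DBMS and a warehouse. It is only an illustration under stated assumptions: the names `Run` and `run_cycle`, the `id` and `edition` attributes, the `country` element and the tiny reference table are all hypothetical, invented for this example and not taken from Wikipedia or any ETL product. Note the two publishing features in miniature: data flows one way only, and every item that survives gets the same edition stamp.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Run:
    """State carried through one one-way cycle (hypothetical structure)."""
    edition: str                                    # common stamp for the snapshot
    reference: dict = field(default_factory=dict)   # lookup data built up front
    staged: list = field(default_factory=list)      # transformed, not yet published
    audit: list = field(default_factory=list)       # lines for the audit report
    published: list = field(default_factory=list)   # serialized output documents
    archive: list = field(default_factory=list)     # untouched copies of the inputs


def run_cycle(sources, edition):
    run = Run(edition=edition)                      # 1. cycle initiation
    run.reference = {"au": "Australia"}             # 2. build reference data (assumed table)
    items = [ET.fromstring(s) for s in sources]     # 3. extract (from sources)

    for item in items:
        if item.get("id") is None:                  # 4. validate
            run.audit.append("rejected: item without id")
            continue
        country = item.find("country")              # 5. transform (apply a business rule)
        if country is not None:
            country.text = run.reference.get(country.text, country.text)
        item.set("edition", run.edition)            # snapshot: same stamp on every item
        run.staged.append(item)                     # 6. stage

    # 7. audit report
    run.audit.append(f"staged {len(run.staged)} of {len(items)} items")

    # 8. publish (here: serialize to strings instead of loading target tables)
    run.published = [ET.tostring(i, encoding="unicode") for i in run.staged]

    run.archive = list(sources)                     # 9. archive the raw inputs
    run.staged.clear()                              # 10. clean up working state
    return run


if __name__ == "__main__":
    docs = [
        '<item id="a"><country>au</country></item>',
        '<item><country>nz</country></item>',       # no id: rejected at validation
    ]
    result = run_cycle(docs, edition="2011-06")
    for line in result.audit:
        print(line)
    for doc in result.published:
        print(doc)
```

In a real XSLT system each numbered comment would more likely be a separate pipeline stage or stylesheet than a line in one function; the single function is just to keep the sketch short.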

[Update: I liked this introduction from IBM: EII, EAI and ETL: What, Why and How!. Notable: no mention of web technologies.]

Maybe I missed something...

Is this different from what Ralph Kimball said in The Data Warehouse Toolkit book in 1996 - Chapter 1?

On ETL, I hope I haven't written anything original! The italicized text is, obviously, quotation. (If you are saying that my choice of quote fairly characterizes ETL, then thank you; or, at least, thank you to whoever made the Wikipedia entry.)

What I thought might be interesting for my readers (who I take to be techies involved in developing publishing systems, and not enterprise people at all) was that ETL literature has some well-elaborated procedures for doing similar flows to those found in publishing.

Publishing flows often would neither start with data in a DBMS nor end with data in a warehouse. But while the specifics, the jargon and the characteristic gotchas of ETL might be very different from those of industrial publishing, my point is that publishing types might find a good amount of relevance in being aware of ETL ideas.

Nice post. I like the way you show similarities between ETL steps and publishing workflows.

One change for me has been to move away from XSLT and toward XQuery typeswitch transforms. I have found XQuery typeswitch to be much easier to write and maintain, as well as easier for my students.
