XML's Dialect Problem

Diversity is not the problem; it is the requirement

By Rick Jelliffe
March 16, 2012

Many musicians will be aware of record producer Brian Eno's Oblique Strategies. When stumped, he draws one of these cards at random, each bearing some pithy suggestion such as "Change ambiguities to specifics", "Decorate, decorate", or "Magnify the most difficult details". It brings industrial process improvement to art, in a way.

When I go through the people I have met whom I consider quite effective at what they do, I think they often have their own characteristic oblique strategies: I'd say Tim Bray's might be "what is the simplest thing that could just work?", Charles Goldfarb's might be "how can we express this as a contract?", Allette's Nick Carr's might be "start with something achievable", Murata Makoto's might be "the devil is in the details, so how can we have fewer details?", and so on.

While I am not effective in their league, one of my favorite and compulsive strategies is "diversity is not the problem, it is the requirement". That is, rather than making variation go away by fiat, perhaps the better approach is to tame it so that the intrinsic qualities of each variant get to live or die on their merits: the result of supporting plurality may be eventual consolidation, or it may be continued plurality, of course. Charles Goldfarb's version was that standards need to enable rather than disable.

When I apply this strategy to XML, I think an interesting possibility emerges: the underlying cause of many of the common, often poorly articulated problems that new developers actually have with XML, such as


  • "it is too complex",

  • "what is the difference between attributes and elements?",

  • "we couldn't find a way to do it without breaking the XML Schema",

  • "we needed a schema Guru but none was available so we had to go to X" (where X is JSON, CSV, database, objects, etc),

  • "we to fix so many things in tandem, a change in the schema needs to propagate down a long pipeline: I though XML was supposed to free us from this tight coupling?",

  • "we have to maintain so much parallel code for essentially the same thing" and even

  • Tim Bray's "The world does not need two ways to say 'This paragraph is in 12-point Arial with 1.2em leading and ragged-right justification'"

is that XML standards and technologies do not provide an adequate layer for coping with dialects.

It may be that many more of us have dialect problems than we thought.

Dialects

By dialect I have a very specific meaning: expressed in XPath terminology, one document type is a dialect of another document type if any specific access of parent, child, attribute, value, name, next-sibling and so on (i.e., any of the XPath axes) on a node in a document of the first type can also be performed on the corresponding node of a document of the other type, using some generic access rules and name mappings.

(Two schemas could be quite similar without being dialects, if going from one to the other requires a convoluted and custom transformation.)

For example, if one document has <man name="fred"> and another has <man><name>fred</name></man>, then we can say they are both dialects of the same information language, where our generic access rule is: if you cannot find an attribute with the intended name, look for a child element with that name.

Extending this with name mappings: if one document has <man name="fred"> and another has <person><who>fred</who></person>, then we can say they are both dialects of the same information language, where our generic access rule is: if you cannot find an attribute with the intended name, look for a child element with the mapped name, given the mappings man-person and name-who.

A program could be written to read the same information from any XML dialect of the same language, requiring only knowledge of what the rules of that dialect are (plus what the name or type mappings are, plus what the schemas are, for maximum findability). That is much less effort and information than a complete transformation with individually elaborated rules.
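
To make this concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The get_property and find_node helpers and the mapping tables are hypothetical illustrations of such generic access rules, not any standard API:

import xml.etree.ElementTree as ET

# Hypothetical name mappings between two dialects of the same "language".
ELEMENT_MAP = {"man": "person"}
PROPERTY_MAP = {"name": "who"}

def get_property(node, name):
    # Generic access rule: try an attribute first, then a child element,
    # then retry both under the mapped name used by the other dialect.
    for candidate in (name, PROPERTY_MAP.get(name)):
        if candidate is None:
            continue
        if candidate in node.attrib:           # <man name="fred">
            return node.attrib[candidate]
        child = node.find(candidate)           # <person><who>fred</who></person>
        if child is not None:
            return (child.text or "").strip()
    return None

def find_node(root, name):
    # Find an element by its own name or by its mapped name.
    for candidate in (name, ELEMENT_MAP.get(name)):
        if candidate is not None:
            found = root.find(".//" + candidate)
            if found is not None:
                return found
    return None

doc1 = ET.fromstring('<doc><man name="fred"/></doc>')
doc2 = ET.fromstring('<doc><person><who>fred</who></person></doc>')

for doc in (doc1, doc2):
    print(get_property(find_node(doc, "man"), "name"))   # prints "fred" both times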

Because our systems do not support dialects, trivial syntax issues become a big deal. But do we only have two choices: babel or newspeak? Is there another approach that might increase the effectiveness of muddling through?

You can imagine the kinds of rules: a rule such as "if there is a sequence of elements with the same name as a mapped item element, you can imply the wrapper element", given the mappings l-list and li-item, could see that the lists are the same in

<x/><l><li>fred</li><li>ginger</li></l><y/>

and
<x/><item>fred</item><item>ginger</item><y/>
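
Here is one way such a rule might look, sketched in Python with xml.etree.ElementTree; the list_items helper is hypothetical, the l/li and item names are treated as mapped equivalents, and a <doc> wrapper is added only to make the fragments well-formed:

import xml.etree.ElementTree as ET

ITEM_NAMES = ("li", "item")   # mapped equivalents for the "item" concept

def list_items(parent, wrapper_name="l"):
    # Generic rule: if the wrapper element is present, take its children;
    # otherwise treat a run of sibling item elements as an implied list.
    wrapper = parent.find(wrapper_name)
    if wrapper is not None:
        return [e.text for e in wrapper if e.tag in ITEM_NAMES]
    return [e.text for e in parent if e.tag in ITEM_NAMES]

doc1 = ET.fromstring("<doc><x/><l><li>fred</li><li>ginger</li></l><y/></doc>")
doc2 = ET.fromstring("<doc><x/><item>fred</item><item>ginger</item><y/></doc>")

print(list_items(doc1))   # ['fred', 'ginger']
print(list_items(doc2))   # ['fred', 'ginger']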


(If I recall correctly, the original version of ISO DSSSL, which eventually morphed into W3C XSLT, was expressed in terms quite like these: it would have generic operations for "wrap" and "unwrap" and "promote to attribute" and so on.)

Adaptors

How can we build systems that are robust in the face of dialects? The first thing that comes to mind is the adaptor, in the design-pattern sense. In object-oriented languages, it is common to access an object's member fields through getter/setter functions rather than directly, so that the interface is independent of the implementation. You want to program according to your intention, so that your code is lucid, not just according to the accidental names and syntax required by that object.

XML itself is commonly held up as very useful as an adaptor: by picking an XML format that you can convert in and out of, you can turn n:m problems into n:1:m, where n and m are the distant data producers and consumers. When the XML itself is the problem, with multiple XML schemas, then very commonly developers invert this, so that object-oriented classes are used as the adaptor.

Let's look at a naive implementation of adaptors: you are using XSLT and want to get an attribute value; rather than accessing //person/@name, you access find-node("person")/find-related-node("name") (or find-related-node(find-node("person"), "name")), where you have defined those functions yourself.
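
The find-node and find-related-node above are imagined XSLT functions; here is the same adaptor idea sketched in Python, where an intention-revealing getter hides which dialect the document uses (the PersonAdaptor class is, of course, only an illustration):

import xml.etree.ElementTree as ET

class PersonAdaptor:
    # Adaptor in the design-pattern sense: the getter reveals the intention
    # ("give me the person's name") and hides the dialect's accidental syntax.
    def __init__(self, doc):
        self.node = doc.find(".//person")

    @property
    def name(self):
        # Equivalent in spirit to find-related-node(find-node("person"), "name")
        return self.node.get("name") or self.node.findtext("name")

for source in ('<doc><person name="fred"/></doc>',
               '<doc><person><name>fred</name></person></doc>'):
    print(PersonAdaptor(ET.fromstring(source)).name)   # prints "fred" both times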

This approach could head in the direction of information retrieval being a form of search rather than navigation, I suppose. But if you consider how default values for properties are handled in, say, SVG, where you have to do a lookup of ancestors to find the inherited values, then it is not so strange.

What is good markup?

Having come up with a definition of dialect in an XML context, we can more rationally answer some important questions that come up in schema design, standards-making and content analysis.

Let's go back to OOXML and ODF: during the kerfuffle, I was repeatedly challenged with questions like this: "OOXML is a pile of shit: why won't you say it is bad markup?". The reason was that, in my view, there is no real difference between one dialect and another, or at least, that if we get our technology right there should be no difference.

For example, ODF uses mixed content, while OOXML marks up each text segment in a small element. Which is right? Mixed content is terser for direct authoring and reading, and more congenial for people raised on HTML, but those were not the drivers. To me, it is a trivial syntactical difference that should make no difference; however, our XML APIs such as XPath enshrine these differences. When someone chooses one approach, they in effect impose a barrier that makes other approaches more difficult.
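
To see how small the difference can be made to look, here is a sketch in Python using simplified, hypothetical markup rather than the real ODF and OOXML vocabularies: a dialect-blind accessor gets the same paragraph text from both forms:

import xml.etree.ElementTree as ET

# Hypothetical markup: the first paragraph uses mixed content, the second
# wraps every text segment in a run element, in the style of OOXML.
mixed = ET.fromstring('<p>Hello <b>brave</b> world</p>')
runs  = ET.fromstring('<p><r>Hello </r><r><b>brave</b></r><r> world</r></p>')

def paragraph_text(p):
    # The concatenated text is the same however the segments are wrapped.
    return "".join(p.itertext())

print(paragraph_text(mixed))   # Hello brave world
print(paragraph_text(runs))    # Hello brave world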

Another example: ODF tends to use attribute values to set properties, while OOXML tends to use a properties sub-element, with sub-sub-elements for each individual property. Think of HTML's meta element, for example. Which is better, attributes or terminal sub-elements? Well, again, why should we care? The thing that makes the question important is the lack of support for dialects in our technology (and in our thinking).

(I remember ODF Editor Patrick Durusau commenting to me that he thought the ODF/OOXML issue would not be resolved by harmonization but by mapping. Maybe 75% of ODF expresses the same information as 75% of OOXML, so they probably are not dialects in my sense; by the time you add version control, fallback formats, ZIP access and politics I wouldn't vouch for it, but maybe.)

Variant? Problem! Dialect? Meh!

Now, of course, not everyone is working in the same industry as me (currently, legal publishing), where we can have scores of legacy schemas and multiple transformation formats and millions of documents for essentially the same content type, for historical reasons, business acquisitions and so on. But the issue does come up regularly even in new operations: for example, the NZ government had a problem where it wanted to adopt whole-of-government schemas, yet the health system wanted to adopt the international standard HL7, the two being incompatible at the schema level.

So if we assume we have programs that use APIs that are parameterized by generic accessors, we have a different basis for evaluating whether one schema is better than another, how compatible schemas are, and possibly how good they are.

In particular, rather than everyone having to adopt the same schema for the same content type, all that is necessary is for people to revise (or create) each schema so that they are dialects (in the sense above) of the same language. That "language" is close to being the superset information model.

If they are all dialects (in the sense above) then perhaps we can have our cake (not disrupting existing systems too much) and can eat it (with simpler uniform access). It remains to be seen: the proof of the pudding is in the mixed culinary metaphor.

Actually, I suppose there are two business opportunities, if your documents of different schemas are actually dialects (or can be turned into dialects with small shims or slight schema revisions): you could go the route where you keep the dialects and their systems but get easier interoperability, or you could decide to converge to a single superset schema. I think the latter is what often happens.

I wrote above that because our systems do not support dialects, trivial syntax issues become a big deal. This becomes an issue for content architects to consider when developing schemas (yes, Content Architect is now my job description, so I may as well use it, sigh): does doing something in a schema elevate a trivial syntax change into being a big deal? For example, if in my schema I decide to fix an order on information that has no intrinsic order, aren't I elevating that syntax into being a big deal? If we move to a dialect approach, do we need to loosen the individual schemas so that each dialect copes with the structures of the other dialects?


Moving to a two-layer schema approach is perhaps what applies here. The bottom layer is the dialect schemas, using e.g. RELAX NG or XSD. These can be accessed using generic accessors as above: if the information in each document is also complete enough, they may even be interconvertible using the dialect accessors and mapping information. But then we remove any constraints that are not expressible as a dialect issue and make them into a second layer, for which Schematron is very useful.


Digression


Now there are obviously many things that get in the way. One of them is information broadening and loosening. For example, take two forms for addresses: (1) with a content model of (line1, (line2, line3?)?) and (2) with a content model of (unit?, street?, city?, state?, country?):

(1)
<address>
<line1>Sebastopol</line1>
<line2>CA</line2>
</address>

and
(2)
<address>
<city>Sebastopol</city>
<state>CA</state>
</address>

So we might say that (2) is a dialect of (1), because a generic rule could be made to map city to line, with the position as a rank. However, we cannot use generic rules to go from (1) to (2), so (1) is not a dialect of (2).
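
Here is a sketch in Python of that one-directional rule; the line accessor and the field ordering are hypothetical, and the documents are the two address forms above:

import xml.etree.ElementTree as ET

addr1 = ET.fromstring("<address><line1>Sebastopol</line1><line2>CA</line2></address>")
addr2 = ET.fromstring("<address><city>Sebastopol</city><state>CA</state></address>")

FIELD_ORDER = ("unit", "street", "city", "state", "country")   # content model of (2)

def line(address, n):
    # (1)-style access that also works on (2)-documents: whatever named
    # fields are present are ranked, in order, into line1, line2, ...
    children = list(address)
    if children and children[0].tag.startswith("line"):
        present = children                                     # already form (1)
    else:
        present = [c for c in children if c.tag in FIELD_ORDER]
    return present[n - 1].text if n <= len(present) else None

for a in (addr1, addr2):
    print(line(a, 1), line(a, 2))   # Sebastopol CA, both times

# The reverse is not generic: given only <line1>Sebastopol</line1>, no rule
# can tell whether that line is a unit, a street or a city.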


By the way, Schematron provides a feature called abstract patterns, which lets you express schema constraints independently of the particular syntax used in a dialect.

