Now I have seen everything!

Context-free XML

By Rick Jelliffe
September 29, 2009 | Comments: 2

The classic example of a content-sensitive language is  n number of a followed by n number of b followed by n number of c where n is greater than or equal to 1. This is the kind of pattern that cannot be represented using a regular-grammar schema language (RELAX NG, XSD, DTD) well: though you can simulate it with { a, ( b, c) | ( a, (b,b, c,c,) | (a, (b,b,b, c,c,c) | ... ))} for a finite n at the cost of an explosion.

I always thought this was a kind of theoretical construct that you would never see in a real-life XML document. (Of course, there are lots of non-regular constraints around, but not this exact, calssic one.)

Today, I have actually seen one!

It comes from XML data dumped directly from an old ADABAS database. The database has a maximum text field size, so longer strings are allowed to overflow into new a second tag of the same name. Or just be repeated to embed an array in there, it seems.

So the data looks like this:


Of course, if I wanted to write a schema for this, I'd use Schematron:

<assert test="
>There should be the same number of
prefix print strings as ids.</assert>

<assert test="
>There should be the same number of
postfix print strings as ids.</assert>



Note that this gives the position-independent constraint! For positions, you could either use a grammar ( a+, b+, c+) or extra Schematron items. In this case we could be a bit clever if we wanted to lose diagnostic specificness, along the lines of testing:


You might also be interested in:


It frightens me that I may actually have an application for this. It has to do with using XML as a way to carry a loadable-data format that is essentially a reverse-polish arrangement of elements.

There is a context-free constraint that, in XML, has to be context-sensitive.

The context-free grammar boils down to

exp ::= term | exp un-op | exp exp bin-op

To avoid nesting each of the forms in an element, I will use a simple sequence of term elements mixed with un-op elements and bin-op elements.

The trick is to express the constraint that the exp grammar captures on the (exp-) well-formedness of the XML sequence (thereby excluding the well-formed XML sequences that are not exp sequences).

This is clearly XML abuse on my part, because the natural way to do this is with a recursive descent into elements where an exp is a choice of a term, unary, or binary, and those elements have their operands as content elements.

But this is a loader format, and the abuse is practically very important. (It is also harder to produce than to consume, but that is tolerable for a loader format.)

Implementing this is easy. It is providing a schema that reflects the constraint that seem to be exquisite pain.

For those wondering what inspired my reaction here, there is a blog post that describes the situation:

News Topics

Recommended for You

Got a Question?