Now I have seen everything!

Context-free XML

By Rick Jelliffe
September 29, 2009 | Comments: 2

The classic example of a content-sensitive language is  n number of a followed by n number of b followed by n number of c where n is greater than or equal to 1. This is the kind of pattern that cannot be represented using a regular-grammar schema language (RELAX NG, XSD, DTD) well: though you can simulate it with { a, ( b, c) | ( a, (b,b, c,c,) | (a, (b,b,b, c,c,c) | ... ))} for a finite n at the cost of an explosion.

I always thought this was a kind of theoretical construct that you would never see in a real-life XML document. (Of course, there are lots of non-regular constraints around, but not this exact, calssic one.)

Today, I have actually seen one!

It comes from XML data dumped directly from an old ADABAS database. The database has a maximum text field size, so longer strings are allowed to overflow into new a second tag of the same name. Or just be repeated to embed an array in there, it seems.

So the data looks like this:

<RSTRN-ANS-RESTRICTION-DATA>
<RSTRN-ANS-RESTRICTION-ID>1139</RSTRN-ANS-RESTRICTION-ID>
<RSTRN-ANS-RESTRICTION-ID>1198</RSTRN-ANS-RESTRICTION-ID>
<RSTRN-ANS-TEXT-PRE-QUAL-PRINT>N</RSTRN-ANS-TEXT-PRE-QUAL-PRINT>
<RSTRN-ANS-TEXT-PRE-QUAL-PRINT>N</RSTRN-ANS-TEXT-PRE-QUAL-PRINT>
<RSTRN-ANS-TEXT-POST-QUAL-PRINT>N</RSTRN-ANS-TEXT-POST-QUAL-PRINT>
<RSTRN-ANS-TEXT-POST-QUAL-PRINT>N</RSTRN-ANS-TEXT-POST-QUAL-PRINT>
<RSTRN-ANS-TEXT-PREFIX />
<RSTRN-ANS-TEXT-PREFIX />
<RSTRN-ANS-TEXT-ARROW-FLAG />
<RSTRN-ANS-TEXT-ARROW-FLAG />
<RSTRN-ANS-QUAL-ARROW-FLAG />
<RSTRN-ANS-QUAL-ARROW-FLAG />
<RSTRN-ANS-TEXT-SUFFIX />
<RSTRN-ANS-TEXT-SUFFIX />
</RSTRN-ANS-RESTRICTION-DATA>

Of course, if I wanted to write a schema for this, I'd use Schematron:


<rule test="RSTRN-ANS-RESTRICTION-DATA">
<assert test="
count(RSTRN-ANS-RESTRICTION-ID)
= count(RSTRN-ANS-TEXT-PRE-QUAL-PRINT)"
>There should be the same number of
prefix print strings as ids.</assert>

<assert test="
count(RSTRN-ANS-RESTRICTION-ID)
= count(RSTRN-ANS-TEXT-POST-QUAL-PRINT)"
>There should be the same number of
postfix print strings as ids.</assert>

...

</rule>

Note that this gives the position-independent constraint! For positions, you could either use a grammar ( a+, b+, c+) or extra Schematron items. In this case we could be a bit clever if we wanted to lose diagnostic specificness, along the lines of testing:

    a[not(preceding-sibling::b)]
[following-sibling::b[not(preceding-sibling::c)]
[not(following-sibling::a)]
[following-sibling::c
[no(following-sibling::b)]]]


You might also be interested in:

2 Comments

It frightens me that I may actually have an application for this. It has to do with using XML as a way to carry a loadable-data format that is essentially a reverse-polish arrangement of elements.

There is a context-free constraint that, in XML, has to be context-sensitive.

The context-free grammar boils down to

exp ::= term | exp un-op | exp exp bin-op

To avoid nesting each of the forms in an element, I will use a simple sequence of term elements mixed with un-op elements and bin-op elements.

The trick is to express the constraint that the exp grammar captures on the (exp-) well-formedness of the XML sequence (thereby excluding the well-formed XML sequences that are not exp sequences).

This is clearly XML abuse on my part, because the natural way to do this is with a recursive descent into elements where an exp is a choice of a term, unary, or binary, and those elements have their operands as content elements.

But this is a loader format, and the abuse is practically very important. (It is also harder to produce than to consume, but that is tolerable for a loader format.)

Implementing this is easy. It is providing a schema that reflects the constraint that seem to be exquisite pain.

For those wondering what inspired my reaction here, there is a blog post that describes the situation: http://miser-theory.info/astraendo/pn/2009/11/friday-miser-note-interchanging-code.asp

News Topics

Recommended for You

Got a Question?