Jottings on parsers for SGML-family document languages: SGML, HTML, XML #5

Collapsing bubbles

By Rick Jelliffe
September 9, 2009

This is #5 in a series of 5. 1) You say automata and I say automata, 2) Stateless semicoroutines may be convenient, 3) Putting it all together, 4) Some links to research, 5) Collapsing bubbles

My intent in the previous parts of this series was to show that SGML-family languages could be characterized in various ways: that there was method (characterizability into some language-theoretic classes) to the madness (the pioneering text of IS8879).

I think some readers have gotten the take-away message that, with all the bubbles in the seven-layer model, I am just proving how complicated it is!

Two comments: first, I guess it depends on how much the bubbles need to do. If they are relatively simple, then having several of them doesn't mean much.

The second is the distinct possibility that in practice having so many bubbles is completely bogus and exactly the kind of thing you don't want to do!

The thing that has become much clearer to me over these last few blogs is that while knowing the larger class of grammars is interesting, it is actually the small operational details that don't affect the major class which are important for making an SGML-family parser easier to analyse and implement. Details such as whether the matches are greedy, whether there is lookahead, whether there is a convenient way to express a run of unordered single-appearance terms (such as attributes), and whether the matches can simply attempt the first choice with no back-tracking where there are potentially multiple choices.

Indeed, we have seen that the theorizing in the last few decades has centred on enumerating academically interesting subclasses along these lines, to try to cope with real phenomena: natural languages and "anti-theoretical" computer languages such as C, FORTRAN and the SGML family.

In particular, I want to bring out that if we get these details right, we can actually collapse the multiple layers. What are regarded as feedback loops in those diagrams can, at least to a certain extent, be minimized or eradicated.

For example, did you know that it is possible to have a state machine recognizer (i.e. one that does not build a parse tree) that can cope with DTDs with tag omission, and which parses characters directly, not tokens? We collapse the delimiter, token and validation stages into a single humungous regular grammar, with no stack involved at all.

Converting a DTD with tag omission to a regular grammar

Let's assume we have a recognizer system which is greedy (if it is matching something like text, it tries to match all of it; if it is matching a * or +, it tries to match as much as possible) and prioritized (it always tries to use the first term that matches where several are possible), with no backtracking.
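By way of illustration, here is a minimal sketch of those two properties in Python. The helper names are mine, purely illustrative: a matcher takes a string and a position, and returns the new position, or None on failure.

    def ordered_choice(alternatives, s, pos):
        """Prioritized choice: commit to the first alternative that
        matches at pos, never revisiting the later ones."""
        for match in alternatives:
            end = match(s, pos)
            if end is not None:
                return end
        return None

    def greedy_star(match, s, pos):
        """Greedy repetition (*): consume as many repetitions as
        possible, then commit; nothing is ever given back."""
        while True:
            end = match(s, pos)
            if end is None or end == pos:   # stop on failure or no progress
                return pos
            pos = end

    def literal(tok):
        """Matcher for a fixed token such as "<b>"."""
        return lambda s, pos: pos + len(tok) if s.startswith(tok, pos) else None

    print(greedy_star(literal("x"), "xxxy", 0))   # -> 3: all the x's, committed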

Let's simplify things for the sake of argument and forget attributes and whitespace (they can be coped with) and entity niceties. [Update: in SGML DTDs, the - - means that neither the start tag nor the end tag may be omitted from the markup. An o means that it may be possible to omit them, in some common situations. But the method used for figuring out which tags to imply is not as sophisticated as some people hope, and it introduces the chance of undetected markup errors, so the more important the structure is, the safer it is to have explicit markup. XML considers all elements so important that none of them can have omitted or short markup.]

<!ELEMENT  a - - ( b, c)>
<!ELEMENT  b o o ( d?, e*)>
<!ELEMENT  d o o ( #PCDATA )>
<!ELEMENT  e o o ( #PCDATA )>
can be converted to this
"<a>" 
     "<b>"? "<d>"? [^<]* "</d>"? ("<e>"? [^<]* "</e>"?)* "</b>"? 
     "<c>" ... 
"</a>"

The thing is that there is no recursion in this content model, so we can just substitute out the rules. It may explode a little, but let's say memory is cheap enough for now.
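To make this concrete, here is a minimal sketch in Python, using the standard re module as a stand-in for the state machine. Two hedges: re is a backtracking engine, where a real implementation of this scheme would compile to a DFA, and I have invented some content for c, since it was elided ("...") above.

    import re

    # The regular grammar generated from the DTD above: omissible tags
    # become optional terms, text is [^<]*, and no stack is involved.
    pattern = re.compile(
        r"<a>"
        r"(?:<b>)?(?:<d>)?[^<]*(?:</d>)?"
        r"(?:(?:<e>)?[^<]*(?:</e>)?)*"
        r"(?:</b>)?"
        r"<c>[^<]*</c>"    # assumed content for c (the "..." above)
        r"</a>"
    )

    for doc in ("<a>hello<c>x</c></a>",                       # b and d tags omitted
                "<a><b><d>hi</d><e>ho</e></b><c>x</c></a>"):  # fully tagged
        print(doc, "->", bool(pattern.fullmatch(doc)))        # both print True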

However, what happens when there is recursion? In theory we need to switch from a regular language (which can be parsed using a state machine) to a context-free language (which needs a stack). But consider this: we may have paragraphs inside paragraphs, such as p//p, but how often (especially in documents) do we get triples or quadruples of nesting, p//p//p//p for example?

So in that case we can merely use inline substitution for the desired number of levels. Let's say

<!ELEMENT  a - - ( b, c)>
<!ELEMENT  b o o ( d?, e*, b?)>
<!ELEMENT  d o o ( #PCDATA )>
<!ELEMENT  e o o ( #PCDATA )>
can be converted to this
"<a>" 
     "<b>"? "<d>"? [^<]* "</d>"? ("<e>"? [^<]* "</e>"?)* 
            ("<b>"? "<d>"? [^<]* "</d>"? ("<e>"? [^<]* "</e>"?)*  
                  ("<b>"? "<d>"? [^<]* "</d>"? ("<e>"? [^<]* "</e>"/?)* 
                  "</b>"?)? 
            "</b>"?)? 
     "</b>"?  
     "<c>" ... 
"</a>"

If we get more than b/b/b, it will be an error. But that may be a sign of bad markup too.

Note that this only works because of our greediness and prioritization. But it gives us our markup minimization up and down element levels for free, it seems to me, if we order the terms in the grammar correctly. We let the regular grammar handle the tag omission.
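A sketch of that bounded inlining in Python (b_pattern is a hypothetical helper of mine; the regular-expression encoding follows the hand expansion above):

    import re

    def b_pattern(depth: int) -> str:
        """The content model (d?, e*, b?) as a regular expression, with
        the recursive b? inlined to depth further levels; anything
        nested more deeply becomes a parse error."""
        nested = "(?:" + b_pattern(depth - 1) + ")?" if depth > 0 else ""
        return (r"(?:<b>)?(?:<d>)?[^<]*(?:</d>)?"   # optional b and d start tags, text
                r"(?:(?:<e>)?[^<]*(?:</e>)?)*"      # a run of e's
                + nested                            # the inlined recursive b?
                + r"(?:</b>)?")                     # optional b end tag

    # b_pattern(2) corresponds to the three-level expansion written out above.
    grammar = re.compile("<a>" + b_pattern(2) + "<c>")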

So what about the explosion here?

Another option would be to allow nesting hints:


<!ELEMENT a - - ( b, c)>
<!ELEMENT b o o ( d?, e*, b?)><?max-nesting b 2 ?>
<!ELEMENT d o o ( #PCDATA )>
<!ELEMENT e o o ( #PCDATA )>

If we take this as our model, then exclusion exceptions are handled at the generated grammar level

<!ELEMENT  a - - ( b, c)>
<!ELEMENT  b o o ( d?, e*, b?) -  e >
<!ELEMENT  d o o ( #PCDATA )>
<!ELEMENT  e o o ( #PCDATA )>
can be converted to this
"<a>" 
     "<b>"? "<d>"? [^<]* "</d>"? ("<e>"? [^<]* "</e>"?)* 
            ("<b>"? "<d>"? [^<]* "</d>"?  
                  "<b>"? "<d>"? [^<]* "</d>"?  
                  "</b>"?)? 
            "</b>"? )?
     "</b>"?  
     "<c>" ... 
"</a>"

However, the exclusion would not survive crossing the boundary of a nesting non-terminal (a subelement with no tag omission, i.e. - -), so either there need to be special grammars or all productions must be inlined once an exclusion is found.
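A sketch of how the exclusion might thread through the inlining, reusing the hypothetical b_pattern generator above and following the expansion just shown (where e drops out of the nested levels):

    def b_pattern_x(depth: int, excluded=frozenset()) -> str:
        """Like b_pattern, but dropping any term named in excluded;
        here the exclusion is switched on for the inlined levels."""
        e_term = "" if "e" in excluded else r"(?:(?:<e>)?[^<]*(?:</e>)?)*"
        nested = ""
        if depth > 0:
            # once in force, the exclusion propagates to every inlined level
            nested = "(?:" + b_pattern_x(depth - 1, excluded | {"e"}) + ")?"
        return r"(?:<b>)?(?:<d>)?[^<]*(?:</d>)?" + e_term + nested + r"(?:</b>)?"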

I haven't thought about inclusion exceptions; I don't expect they are technically difficult, but I do expect they would explode. The regular grammar could also support short reference recognition, and even <!USEMAP>, by the way: each short reference is just another branch, each one probably causing an explosion. Memory is cheap, but it still may not be that cheap!

If our grammar system allowed permuted choices to cope with attributes and xs:all, then that would cut down on another source of explosions. But the stable door is already open.

Anyway, my purpose here is just to show that even though SGML-family languages can be thought of as a layered system, a lot of the grammatical complexity can be evaporated by using grammars with appropriate properties. Indeed, it may be that the central issue in implementing SGML-family languages, without having to jump through hoops, is not so much figuring out the class of grammar(s) involved as getting the appropriate properties, such as greediness.

Needing the stack for less

Of course, the point of the earlier entries of this series was that it is prudent to start off with a stack to prevent ad hoc coding messiness apart from any technical necessities to support entities and so on.

And even though we can build a recognizer using a regular grammar, it wouldn't provide the information set; you need a proper parser for that. For example, when I made an SGML mode for the Topologi editor, which just has colored (and circled) text, I used a regular grammar, parameterized to jump in at possible line ends. I used the subset that just looks at delimiters, but I could have made a line parser that used the DTD and minimization, as above. The information from a recognizer is certainly enough for that: you can say "this is a short reference delimiter here" or "this is an entity reference" without having to expand them.

Which means that for parsing we are back to either going down a Chomsky level to context-free languages (i.e. getting a stack) or having multiple layers, as in the bubble diagram (which may need to be more than state machines anyway). Of course, by the time you get to XML with no DTD, you need a stack for WF checking in any case.

Nevertheless, the possibility of inlining some kinds of non-recursive productions of a grammar may allow some useful implementation tricks.

As a side note, when an element declaration for SGML has - -, both the start and end tags are required. It provides a boundary for tag omission. To reduce explosions in some cases, though not reliably, we can allow those elements to be stacked (and implemented as re-usable functions or table rows, for space saving), i.e. treated as a single production of the grammar.

Term rewriting

Another approach that is worth considering is that we can still have a Conway pipeline, but with outputs that are the same kind of information as the inputs. The bubbles I had in my layer diagrams all have different kinds of information being extracted from the input: octet to character, character to delimiter, delimiter to token, token to information item.

But a notional stage could instead be a term rewriting stage of some kind. In fact, many of the stages need this. When transcoding octets to characters, it is not enough merely to use the simple code transformation: best practice is to use a particular Unicode Normalization Form. For example, for accented Latin letters, you want to use the single combined character where it is available, rather than the base letter followed by combining accent characters.
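For example, using Python's standard unicodedata module:

    import unicodedata

    decomposed = "e\u0301"                            # 'e' + combining acute accent
    composed = unicodedata.normalize("NFC", decomposed)
    print(composed == "\u00e9")                       # True: the single character é
    print(len(decomposed), len(composed))             # 2 1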

I have given a few examples before from other SGML-family languages: full SGML's Record End movement rules and the mooted tree fiddles of HTML 5. Attribute value defaulting and #CURRENT processing are obviously other examples, operating on the infoset.

I think it is easy to see how such stages can simplify later stages. To follow on from the example above, we have almost the same grammar as before, but with no end-tag omission.

<!ELEMENT  a - - ( b, c)>
<!ELEMENT  b o - ( d?, e*, b?)>
<!ELEMENT  d o - ( #PCDATA )>
<!ELEMENT  e o - ( #PCDATA )>
but a rewrite stage
  "<a><d>"      becomes   "<a><b><d>"
  "<a><e>"      becomes   "<a><b><e>"
  "<a>" lookahead "[^<]"      becomes   "<a><b><d>"
  "<b>" lookahead "[^<]"      becomes   "<b><d>" 
  "</d>" lookahead "[^<]"      becomes   "</d><d>" 
  "</e>" lookahead "[^<]"      becomes   "</e><e>"
allows us to simplify the DTD to something like
<!ELEMENT  a - - ( b, c)>
<!ELEMENT  b o - ( d?, e*, b?)>  <!-- can only omit top level b start-tag not subsequent -->
<!ELEMENT  d - - ( #PCDATA )>
<!ELEMENT  e - - ( #PCDATA )>

and thereby the grammar to something like this (please excuse trivial errors here):

"<a>" 
     "<b>" "<d>" [^<]* "</d>"? ("<e>" [^<]* "</e>")* 
          ( "<b>" "<d>" [^<]* "</d>"? ("<e>" [^<]* "</e>")*  
                  ("<b>"? "<d>" [^<]* "</d>"? ("<e>" [^<]* "</e>")* 
                  "</b>"?)? 
            "</b>" )?
     "</b>"  
     "<c>" ... 
"</a>"

I think this shows that one of the points of multiple layers is to refactor grammars in a way that reduces the complexity of each grammar (e.g. the number of transitions in an implementing FSM). In some cases it would be useful; in others it may be fiddling while Rome burns, I guess.

And, of course, this pre-processing stage itself does not have to be merely a list of substitutions (regular). It could be a loop where the term rewriting pushes to the input as well as to the output. For example, to imply the omitted end tag of b after an e, you would have

"</e></b>" is passed through "</e></b>"
"</e>" "<[^eb/]*>" becomes "</e></b>" and push $2 back on input

which may be better done with a peek (lookahead)


"</e></b>" is passed through "</e></b>"
"</e>" lookahead "<[^eb/]*>" becomes "</e></b>"

This looping can, in effect, take you up the stack, one level at a time.
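A sketch of that pushback loop in Python (a deque as the input buffer; the rule encoded is the one above, and the names are mine):

    import re
    from collections import deque

    def imply_close(tokens):
        """On "</e>" followed by a tag that is not <e...>, <b...> or an
        end tag, emit an implied "</b>"; the lookahead token stays on
        the input, so it is examined again on the next iteration."""
        inp, out = deque(tokens), []
        while inp:
            tok = inp.popleft()
            out.append(tok)
            if tok == "</e>" and inp and re.fullmatch(r"<[^eb/][^>]*>", inp[0]):
                out.append("</b>")
        return out

    print("".join(imply_close(["</e>", "<c>", "x", "</c>"])))
    # -> </e></b><c>x</c>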


Limits to pair-wise tag rewriting


However, as an optimisation it can only be applied in some situations: in particular, where every possible occurrence of the pair of tags, on all the possible paths through the grammar, allows this substitution, taking minimization into consideration. For example, if c in our example had the same content model, so that

<!ELEMENT  a - - ( b, c?)>
<!ELEMENT  b o - ( d?, e*, b?)>  
<!ELEMENT  c o - ( d?, e*, b?)>  
<!ELEMENT  d - o ( #PCDATA )>
<!ELEMENT  e - o ( #PCDATA )>

then the pair </d></a> could be </d></b></a> or it could be </d></c></a>: we don't know where we are up to in the content model. (The issue of how close a model you can get just taking pairs rather than a full content model also came up during development of the XSD-to-Schematron converter, which is highly pair-based: x/y, x/following-sibling::*[1][name() = y], x/*[1][name() = y], x/*[last()][name() = y] and so on. Usually it is pretty good.)

Of course, where you have only one object of pushback, you don't actually need a stack on the input or output. You just need a variable to store it in, and to access that for the next match. Same difference.

If your aim was just to provide an output in which every start tag had some kind of end tag, then you could achieve your goal (for many DTDs) by introducing a kind of wildcard tag. For example, let's say we can use </> to mean an end tag. (In SGML terms, this is an empty end-tag, one of the SHORTTAG minimizations.)

So now with that grammar still as


<!ELEMENT a - - ( b, c?)>
<!ELEMENT b o - ( d?, e*, b?)>
<!ELEMENT c o - ( d?, e*, b?)>
<!ELEMENT d - o ( #PCDATA )>
<!ELEMENT e - o ( #PCDATA )>

we can use the rewrite rules

"</d></a>" becomes "</d></></a>"

And, in our grammar, all the "</b>" terms become ("</b>" | "</>").

We could use similar logic and even introduce a </*> tag, which would signify "there are some close tags missing, but I don't know how many!" That might be better information than none. Possibly. I'd probably be worried if I ever saw it: the old hack-alert goes off.

On the fly addition of rules

One of the areas of feedback in our bubble diagram is the idea that when you parse a DTD, you then need to feed the rules back to the parser/validator so that it can continue to parse the document. But we could rearrange this, Conway-pipeline style, into a feed-forward arrangement: for instance, the DTD parser emits the DTD information and then just passes the subsequent input tokens forward. (Obviously you would look at passing forward the whole stream, for efficiency.)
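A sketch of the feed-forward arrangement as a Python generator stage (the token shapes are hypothetical; the point is that the grammar flows downstream with the data instead of looping back):

    def dtd_stage(tokens):
        """Consume the DTD declarations at the front of the stream, then
        emit the accumulated rules once and pass the rest forward."""
        rules = {}
        it = iter(tokens)
        for tok in it:
            if isinstance(tok, tuple) and tok[0] == "ELEMENT":
                _, name, model = tok
                rules[name] = model          # e.g. ("ELEMENT", "b", "(d?, e*, b?)")
            else:
                yield ("RULES", rules)       # the grammar, fed forward once
                yield tok
                break
        yield from it                        # the remainder of the stream

    stream = [("ELEMENT", "a", "(b, c)"), ("ELEMENT", "b", "(d?, e*, b?)"),
              "<a>", "<b>"]
    for item in dtd_stage(stream):
        print(item)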

An alternative to rearrangement is to have a class of extensible grammar that allows its own rules to be added. This is another way of coping with DTDs from within a single bubble.

I was interested to read that there is at least one system (Katahdin) based on the Rats! parser that allows on-the-fly addition of syntax rules, so I would expect that Rats! could cope with reading the DTD and generating the rules for parsing the instance, without needing any pre-compiled DTDs (or schemas). SGML-family documents are often very long (megabytes or gigabytes) rather than very deep, so that would be the first constraint to investigate about Rats!, it seems to me.

Are SGML-family documents trees?

In one sense, of course, full-SGML documents have never been trees. The ID feature means they are not, unless you squint so that containment is the only kind of relationship you allow yourself to see. But of course SGML documents are primarily trees, at least as far as their information set goes.

So the changes mooted in HTML 5 w.r.t. interleaved inline format elements should cause a wailing and gnashing of teeth. I think they are a bad idea, just because their justification (that this is what HTML systems actually do) is bogus.

But I would say this too: the central premise of SGML is not that documents are primarily trees. It is first descriptive markup and second rigorous markup. The inline elements are those grey areas where you have semi-descriptive markup (where labeling which rhetorical or semantic structure the formatting corresponds to is not interesting or useful), but they are certainly not specific format instructions. And the need for rigorous markup, where the parsing rules are explicit, is clearly exactly what the HTML 5 spec is trying to achieve.

I see the change in rules mooted for HTML as moving from defining HTML as a shonky application of SGML to defining it as a transformation of a near-SGML syntax into the SGML information set. If you look at the contentious transformation of the inline formatting items, you see that they are transformed into a tree structure, not an overlapping markup structure.

So while I don't think it is a good idea, I don't think it is a bad idea!

SGML as a centre of gravity no more?

SGML is a kitchen sink containing babies and bathwater, to elaborately mix metaphors. HTML threw out some babies, XML threw out others. Enabling these kinds of family members, which retain as much commonality as possible, is indeed the distinctive feature of full SGML; witness the SGML Declaration: full SGML is so complex because it is highly parameterized, to allow the selection of features for particular family members.

Where SGML went wrong, I think, is that the only way this is sustainable in the long run is if the base language keeps up with external realities. XML was a shock that the SGML standard coped with: it was tweaked as necessary. HTML has never been coped with, in large part because it was being done by different committees without adequate liaison (and probably positive disinterest on both sides, for reasons of time alone).

So we are increasingly in the situation where we can expect HTML to diverge from SGML for what are ultimately institutional reasons. In effect, SGML has split into four camps (HTML, XML, wikis and domain-specific languages) which have no interest in making agreements with each other on meta-syntactical issues, beyond at most a common vocabulary of information items.

The markup community came together in the 1980s, but it has split apart with the advent of the WWW. G.K. Chesterton said something like: tradition is democracy that doesn't exclude you just because you happen to be dead. The same is true of standards: if a standard is an agreement, what happens when the stakeholders move to disagreement? Without stakeholder agreement the standard cannot be evolved (new sections to IS8879 are not likely!), and without evolution the different stakeholders have no centre of gravity pulling them towards a common ideal.

So in its first decade, SGML acted as the kind of lexical standard it was intended to be. In its second decade, it acted more as a centre of gravity. But now it is the *ML Information Set that has replaced SGML as the centre of gravity, the mechanism being that many of the non-XML syntaxes have an alternative XML form (XML now being the de facto SGML default concrete syntax).

