Jotting on parsers for SGML-family document languages: SGML, HTML, XML

You say automata and I say automata

By Rick Jelliffe
August 30, 2009 | Comments: 4

This is #1in a series of 4. 1) You say automata and I say automata, 2) Stateless semicoroutines may be convenient, 3) Putting it all together, 4) Some links to research


Years ago, when I first started looking at SGML and how parsers for it might be written, I became confused: SGML (1986) didn't fit into the kind of grammars or automata I had been taught at university. In a way, this was not surprising: the kinds of automata that are probably* appropriate were not invented/formalized by theoretical computer scientists until for example 1988 (EPDA), 1990s (adaptive grammars), 1994 (2-SA), and as late as 2007 (unambiguous boolean grammars) and MVPLs, though other aspects were certainly floating around prior, but at a fairly rarefied altitude (indeed, 2-SA still has no Wikipedia entry.)

When people talk about HTML or XML 'abandoning' their SGML roots, I think they mean a few different things.

There are those who are merely being practical or adventurous: no-one wants to pay for the ISO standard, so it is an impractical thing to want to use; furthermore, one of the points of an enabling standard like SGML is that it causes less work than doing it yourself and simplifying, because you can explain your specific language in more general terms: however, HTML has so many exceptions that couching it in explicit SGML terms will be more work not less, and HTML is so much more well-known than SGML, SGML is no longer the enabling technology: the baby bird has flown the nest and turned into a tribe of monkeys. They have a good point.

But at a deeper level, I think there is a group of users, or developers, who really want HTML and XML to change to a different type of grammar, and preferably one neatly in the Chomsky hierarchy: some kind of pushdown automaton. One of the problems with all the SGML-family has been that developers may come to them expecting nice pushdown automata, but then finding they don't shoe-horn in well as simple PDAs and FSMs get disaffected, confused about how to apply their favourite tricks.

But it is no use me sitting here complaining that people are saying "drop SGML" without even knowing what it is they are dropping. I've been repairing the SGML entry at Wikipedia recently (and Tim Bray and the XML-DEV-ers have been doing good work putting lipstick on the XML entry) so I my mind has been on this recently. Also, I have been impressed by the HTML 5 WG's efforts w.r.t. syntax, and I wanted to sort out what I think about syntax better. So I thought I'd make some little diagrams roughly scoping a basic machine for SGML family parsers.

* I say Probably appropriate because working SGML parsers tend to look along these lines to one extent or another, not because of any hard theoretical insight. I suppose SGML could also even just be strictly a PDA (see the large dialect implemented by TEI for example) but only implementable in practice by a kind of EPDA or 2-SA... Theoreticians welcome. Update: Quinn Jackson has a paper that looks relevant: Efficient formalism-only parsing of XML/HTML using the §-calculus. I am struck on glancing through Jackson's paper on PDA-T how much the thought yeah, that seems in the ballpark comes to mind. I see that Jackson makes a similar point to mine: That is, φ-expressions and predicates are so useful in practice that it is tempting to use them when developing grammars even if they are not formally necessary. Eliminating them if they are not strictly needed is somewhat of an art.(footnote 15)

Stacks!

These are not intended to be complete (or proper automata) but to give a better idea and help discussion: there are lots of people better qualified than me to do this, but they have not as far as I know. Developers intending to write parsers might find it interesting, though I am not proposing this as the actual architecture: there are many ways to kill a cat.

My model is that there is a parser with access to various maps: of entities (whether these are local or external, streams or files, it doesn't matter), of grammars (the element and attribute declarations in the DTD), of delimiters (the delimiter strings and their functions), and of defaults (for filling in the values of attributes, however it could be generalized for various aspects of repair and type annotation as well.) Lets skip over how these maps were loaded—for an HTML parser they would be hardcoded; for a full SGML parser they would be loaded by the prior stages of the SGML Declaration and the DTD. (There are also some other minor dynamic structures needed, which I haven't shown: a list of ID attributes to check uniqueness, and a map of the current values of attributes with the #CURRENT keyword for example.)

Then the parser has access to three stacks of stacks. The first is the input: this holds characters. The second is the context stack: this holds references to items in the maps. The third is the output: this holds information set items—the parsed document.

First, lets see the parser at the start of parsing a document. We just have the initial entity. (Apologies for the uglies here: I couldn't figure out how to drive Word.)

graphic of parser at start

And here is the document being parsed.

graphic of parser n progress

Finally we have the end state: if the document was valid (or well-formed) the input stack and the context stack will be empty, and the output stack will have some information items.

graphic of parser at end


The input stack is mainly a stack of entities, with characters being popped as they are read, and new entities pushed when they are referenced. Characters from numeric character references could be pushed as well. HTML can get away with a single item stack here, because all its references will expand to characters less than the markup they replace. SGML & XML allow different parsing rules for different entities (e.g., CDATA entities) and this could be signalled by an otherwise unused character as the first character in the entity.

The context stack mainly holds a map reference to the current element being parsed. I have omitted any details relating to how the grammar is used for validation. In the case of SGML, there are also some other uses for this stack: in particular to push the particular delimiter map that should be used in some element to get wiki-like short-references. For XML this can be just a simple stack, minimally just enough to check well-formedness tag matching. For HTML this can be just a simple stack too.

For a variety of features, such as SGML's complex newline handling or default value insertion, it may be useful to have the output as a stack (of stacks of information set items) as well. DTD-less XML certainly doesn't need this.

The idea of the stacks of stacks is not new: old-timers will recall that it is basic data structure of the OmniMark text programming language (under the name of shelves and with some extra like name-ability, so that it could function as a map as well). OmniMark's developers Sam Wilmott and Norbert Winklareth are exactly the kind of guys who would have known all about EPDA and so on at the time, I'd bet—Sam had a paper featuring derivatives in 1992, quite ahead of the pack.

[Updated] Finally, there is another generic augmentation we can make that can be used to make implementation of many features more straightforward: we take all the data structures and put them in maps, one map for each kind. I hesitated in putting this in at first, because I thought it obscured the basic model I was proposing, but it is simple enough to understand.

image0015.png

This brings us much closer to a real system. Now we can have multiple inputs, for example data from command-line parameters as well as the initial document. And we can have multiple outputs, for example to standard error. And we can have multiple contexts, for example to store other variables, help with XPath tracing, and so on. Now that we have maps of maps in the top row, we can use this for namespaces: entities have two namespaces (parameter entities and general entities) for example. Thr use of namespaces for the grammars and defaulting fits of course into an XSD way of doing validation by namespace. And having name-able delimiter maps fits in with SGML's USEMAP facility better.

This is, of course, no-where near completeness, but I wanted to show that a very few generic structures can provide a lot of the framework, but this framework might not be the kind of thing that naive developers would expect. Readers may catch that the same framework (with different objects) could be pipelined before this for byte-to-character conversion and pipelined afterwards for separate validation. Indeed, the more that the general model is developed, the more that is starts to look like the kind of thing that you might use for streaming XSLT to. (But this just falls out naturally as a consequence of the links to TAG grammars, I suppose.)

Transitions

The second thing I think might be useful is that for SGML and particularly for HTML, the way that the grammars (at delimiter level and at tag level) work is interesting.

Lets take some grammar: A, B, C where these may be elements or tags or delimiters. Here is a diagram of the FSM

Now lets add extra transitions in red, showing all the states that are indirectly reachable

In HTML and SGML, many of these transitions may have to be coped with, in the name of markup minimization or error recovery.

But there is a particular rub here, just taking one of the red transitions is not enough: the parser also has to fill in a missing feasible path. For example, if a parser gets C, it may have to act as if it had gone through A and B. For example, at the delimiter level, if the HTML parser sees <b<i> it has to act as if it saw <b><i>. And if the document was <html>hello world the HTML parser will act as if it received <html><body><p>hello world</p></body></html>

The kinds of recovery actions involved could be just to push some information item to the output stream. But they could involve pushing and popping the context stack, or even pushing some markup onto the input stack.


[Update] Why talk about SGML-family?


Surely SGML is dead? Well, XML is SGML and XML isn't dead. And HTML has been notionally SGML and it isn't dead. And there still are large SGML documents and systems, so SGML is not dead: though certainly Grandpa has hung up his saddle.

But full SGML, especially as extended in 1998 with WebSGML to cope with HTML and XML better, plus XML and HTML (not to mention Wikis) together cover a very broad range of markup language possibilities. It is entirely possible that future changes in HTML or XML could already be covered by SGML: people pre-judge that change necessarily implies divergence but it is case-by-case. So I hope having an introductory model for a machine that could cope with all of them might be useful or interesting to readers.


You might also be interested in:

4 Comments

That is a fantastic article, Rick. Thanks!

Rick, as you so thoughtfully articulate, the complexity of markup to deal with semi-structured information has become quite overwhelming! Perhaps its time to consider a new fundamental encoding model that can effective deal with complex analytic markup. Are you interested in leaning more about such a model?

Sandy: "the complexity of markup to deal with semi-structured information has become quite overwhelming!" I don't know how you get that conclusion: I was noting that our ability to understand SGML-family syntaxes has become less overwhelming due to advances in theory.

There is a lot of merit to the idea of having an integrated stack for development. But mandated vertical platforms are almost the opposite of how I see a healthy computing environment: supporting organic plurality, markets, libraries and bazaars, and distributed craftsmen is more my vision.

Very nice. I could use this. Not sure so many stacks required. (What worries me a little more is that this comes close to the I4i-patent separation of metadata from the stream sort of thing. I suspect that is work-aroundable.)

I have an application for a parser model like this. I will definitely keep my eye on you [;

More seriously,

News Topics

Recommended for You

Got a Question?