Jotting on parsers for SGML-family document languages: SGML, HTML, XML #4

Some links to research

By Rick Jelliffe
September 6, 2009 | Comments: 2

This is #4 in a series of 4. 1) You say automata and I say automata, 2) Stateless semicoroutines may be convenient, 3) Putting it all together, 4) Some links to research


These jottings are obviously not intended to be science, nor to be comprehensive. That is way beyond my capabilities.

But it would be remiss of me not to at least provide interested readers with a pointer to the three most active researchers in the area of grammars and SGML/XML: Anne Brüggemann-Klein, Derick Wood and Murata Makoto (Makoto Murata).

Blaming the victim?

They have researched the classes of languages that a document validated by a DTD fits into, and various issues that come out of that. Kudos to them for not taking the mismatches between full SGML and theory as necessarily full SGML's problem: contrast this with the attitude in this 1998 presentation by Dick Grune, which I think was typical:

SGML was clearly not designed to interface seamlessly with existing computer science parsing techniques. Most of these techniques were already very well known in the beginning of the 80s, when SGML was designed. The syntax issues of SGML in ISO 8879 are expressed in completely novel terminology and the parsing requirements do not match known algorithms and techniques.

The results were not favourable: few computer scientists have worked on parsing SGML, although many software designers and programmers have. Most computer scientists found parsing SGML a less than attractive challenge, and some were daunted by its alien terminology.

The situation has deteriorated with the advent of HTML, which had a syntax defined more or less by what Netscape or MicroSoft could get away with. Unsurprisingly, this did not lead to a strong basis and stable software.

The developers of XML got the point, and did it right, both from the parsing point of view, the terminology and the SGML compatibility.

I understand the pragmatic value in XML, obviously. Standards are best if they fit in with technology. But there is no reason to think that theory always drives practice, nor that it should, nor that it has. With all respect to Dr. Grune, the flaw I see in his comment above (which I of course largely agree with) is an underlying assumption that then-existing techniques, no matter how well known, were necessarily up to the job: full SGML was ultimately complex because its designers wanted to do a complex thing, not for gratuitous reasons. XML managed to be simpler only by jettisoning a whole lot of use-cases: wiki-like markup, gone; HTML-like markup, gone; #CURRENT, gone. Not by fitting in with theoretical classes entirely.

I think by the time it reached the geekosphere, the prejudice that SGML would never fit in with any theory had taken over from what academics like Dr Grune had actually been saying. (A good example of this is Dan Connolly's comment in the excellently named post SGML is an ugly solution to an elegant problem, where he says "These linguistics problems were solved in the '60s and '70s".)

To be fair to Grune, he does later mention:

The alternative to grammar conversion is to develop the proper theory of SGML grammar features. This is not at all impossible, or even extraordinarily difficult, but results have been slow in coming.

(I cannot help thinking that if the problem was not extraordinarily difficult, then it is no wonder that theoretical computer science marginalized itself so successfully during the 1990s: if the real world becomes too demeaning to try to explain, what else could be expected? I think we should be excused if we feel a little frustrated at general academia's unwillingness or inability to grab the bull by the horns, with notable exceptions like the people mentioned here.)

Research

Murata-sensei is especially well-known as the driving force behind the RELAX NG schema language, which came out of his research interests (at Fuji Xerox, then IBM, and now in academia) in forest automata and set operations on documents.

Paul Prescod has a light introduction to forest automata; I would explain them like this: if you look at that chain of recognizers (the diagram at the top of #3 in this series) you will see that the earlier recognizers are interested in structures in linear sequences (of octets, characters, delimiters) while the later stages are interested in structures in tree data (elements), so string grammars are appropriate for the early stages while forest grammars are suitable for the later stages. For a more detailed look, see this 2001 draft paper. (Is a complete version available somewhere?)
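To make that division concrete, here is a toy sketch in the style of a local tree grammar, which is roughly what a DTD gives you. The vocabulary and content models are invented purely for illustration: the sequence of each element's children is checked as a string against a regular expression (a string-grammar job), while the check as a whole recurses over the tree (a forest-grammar job).

```python
import re

# Toy "forest grammar" in the DTD / local-tree-grammar style: for each
# element name, a regular expression over the names of its children.
# Vocabulary and models invented for illustration only.
CONTENT_MODELS = {
    "table": r"(tr)+",
    "tr":    r"(td)*",
    "td":    r"",        # character data only, no child elements
}

def valid(element):
    """element is a (name, children) pair; children is a list of elements."""
    name, children = element
    model = CONTENT_MODELS.get(name)
    if model is None:
        return False
    # String level: the sequence of child names must match the content model.
    # (Concatenating names is unambiguous for this toy vocabulary.)
    if not re.fullmatch(model, "".join(child_name for child_name, _ in children)):
        return False
    # Tree level: recurse into each subtree.
    return all(valid(child) for child in children)

tree = ("table", [("tr", [("td", []), ("td", [])])])
print(valid(tree))  # True
```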

Prof. Derick Wood has a great page of his research papers on the subject at the Hong Kong University of Science and Technology. The pages are in PostScript rather than PDF, but it is easy to find a free PS to PDF conversion service on the WWW. I recommend 1994's Standard Generalized Markup Language: Mathematical and Philosophical Issues as a good place to start.

I find it helps, when reading these kinds of material, to understand that very often the analysis that academics make is done with a tacit multi-level model, where character and lexical issues are not really the subjects of interest. When they say SGML, they may really mean DTDs in particular, not the whole shebang.

For example, the Wood paper deals with figuring out which class of grammar DTD content models belong to (e.g. do inclusions and exceptions increase the expressive power of the language?) but notes that as of 1994 characterizing tag omission was still an area of research. So you wouldn't expect to find details of how Record End handling fits in, for example. This is frustrating if you want a broad characterization of the whole SGML system, but systematic focus is not necessarily a flaw in an academic!

Some things are easy: Wood mentions that exclusion exceptions can simply be handled as errors. Other more complex approaches are possible, for example R.W. Matzen's Dynamic Content Models. (Matzen and colleagues also worked on a formal model for parsing SGML: "This article defines a formal language model for SGML; systems of finite automata from systems of regular expressions." I have not sighted this, but the idea that you need systems of the things may well point to a multi-layer approach.)

I want to leave this topic now and skip the last decade of research: I don't have a grip on what it has produced.

One thing I'd like to see would be a nice little paper explaining how to derive all possible fill-ins given any pair of start- or end-tags. For example, if you find <table><td>, there must have been a fill-in <tr>. Most of the material on omitted-tag implication is written in dynamic terms, as if you have to look up the grammar at run-time. You still need to look up through the actual element stack, I suppose, but a precomputed table would be feasible for the sizes of the kinds of languages that would benefit from tag implication.
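For what it is worth, here is a sketch of what such a precomputed table might look like, over a cut-down HTML-ish vocabulary I have invented for illustration; deriving the table's contents from the grammar itself is exactly the part I would like to see written up.

```python
# A precomputed "fill-in" table for a toy HTML-like vocabulary (invented for
# illustration): given the element currently open and the start-tag actually
# seen, list the start-tags that must be implied in between.
FILL_INS = {
    ("table", "td"): ["tr"],    # <table><td> implies <tr>
    ("table", "th"): ["tr"],
    ("html",  "p"):  ["body"],  # in this toy vocabulary, html contains just a body
}

def imply(open_element, seen_start_tag):
    """Return the start-tags to insert before seen_start_tag, if any."""
    return FILL_INS.get((open_element, seen_start_tag), [])

print(imply("table", "td"))   # ['tr']
print(imply("tr", "td"))      # [] -- nothing to fill in
```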

But I'll especially mention this 2007 paper, A Robust Class of Context-Sensitive Languages, because it describes an automaton that is very much like what I have been describing, and its motivating case is the SGML-family languages:

We define a new class of languages defined by multi-stack automata that forms a robust subclass of context-sensitive languages, with decidable emptiness and closure under boolean operations. This class, called multi-stack visibly pushdown languages (MVPLs), is defined using multi-stack pushdown automata with two restrictions: (a) the pushdown automaton is visible, i.e. the input letter determines the operation on the stacks, and (b) any computation of the machine can be split into k stages, where in each stage, there is at most one stack that is popped.

Yet another class of languages? (We have already been told that SGML can be understood using LALR(1), Knuth's Attribute Grammars, and Caterpillar Automata with a Pushdown Stack.)
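For concreteness, here is a minimal single-stack sketch of the "visibly" idea as I read it, applied to nothing more ambitious than well-formedness checking: the kind of token seen dictates the stack operation, with start-tags always pushing and end-tags always popping, so the stack behaviour is visible in the input itself. (The multi-stack restrictions that give MVPLs their power are well beyond this sketch, as are attributes and the rest of real markup.)

```python
import re

# Visibly pushdown checking of well-formedness (single stack, sketch only):
# the token kind dictates the stack operation -- start-tag pushes, end-tag
# pops, character data leaves the stack alone.
TOKEN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)>|([^<]+)")

def well_formed(markup):
    stack = []
    for closing, name, text in TOKEN.findall(markup):
        if text:                      # character data: internal action
            continue
        if not closing:               # start-tag: push
            stack.append(name)
        else:                         # end-tag: pop and check it matches
            if not stack or stack.pop() != name:
                return False
    return not stack                  # everything opened was closed

print(well_formed("<a><b>text</b></a>"))   # True
print(well_formed("<a><b>text</a></b>"))   # False
```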

And even apart from the issue of which grammars can cope with SGML, there are the issues of finding grammars that describe it with clarity, or that can be implemented without explosions (see Cameron's 1992 Permutation Phrase Grammars for an example: attributes are unordered, as is xsd:all).
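As a toy illustration of the permutation-phrase idea (names invented), the check is that each required item appears exactly once, in any order, without expanding all the orderings the way a naive grammar rewrite would.

```python
# Permutation-phrase style check (sketch only): every required item must
# appear exactly once, in any order; optional items may appear at most once.
def matches_permutation(items, required, optional=frozenset()):
    seen = set()
    for item in items:
        if item in seen:                              # duplicates not allowed
            return False
        if item not in required and item not in optional:
            return False                              # unknown item
        seen.add(item)
    return required <= seen                           # all required items seen

# e.g. attribute specifications on an element, in any order:
print(matches_permutation(["title", "href"], required={"href"}, optional={"title"}))  # True
print(matches_permutation(["title"], required={"href"}, optional={"title"}))          # False
```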

We seem finally to be getting to the stage of having several credible candidates for a language class that can cope with SGML-family systems. And I think the point for implementers and standards developers is the ballpark that the various classes of languages provide: once multiple formalisms agree that to parse SGML-family languages (like HTML) you need multiple stacks, or stacks of stacks, or stacks you can push below and above the top item of, then it is safe for a standard to speak in terms of multiple stacks. Talking about multiple stacks might be thought of as prescribing a particular implementation, unless it happens to be the minimum needed for any implementation; in that case, if a standard omits to say that it needs multiple stacks of some arrangement, it merely obscures matters for the developer.

It will be interesting to see how the new HTML 5 spec develops: will it take into consideration modern language theory? That is their business.


2 Comments

Re: "One thing I'd like to see would be a nice little paper explaining how to derive all possible fill-ins given any pair of start or end tags."

It turns out that there is always at most one tag (a start-tag or an end-tag) that can be inferred at any point in the parse, and it's uniquely determined by the position in the current content model.

Tag inference is most easily implemented as a form of error recovery: if the next token (tag or #PCDATA) isn't valid at the current point in the current content model, then: if there's a contextually required token, insert a start-tag for that token; else if the current content model has been satisfied, insert an end-tag for the current element; otherwise it's an error.

OMITTAG (along with DATATAG) is the main reason why SGML requires content models to be 1-unambiguous. Without that restriction the definition of "contextually required token" is meaningless.

Joe: Yes, it may be most easily implemented as a form of error recovery, but every time we step outside fitting into some formal class, we run the risk of missing some insight into other implementation options. It may be fun to wallow in spaghetti, but we don't need to boast about it!

(Readers, Joe himself has been the poster child for this, being the person who most brought the idea of validation by derivatives to the attention of the XML technical community on this page.)
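For concreteness, here is roughly how I read Joe's recipe, against a toy content-model interface invented for illustration; real content models need a proper automaton, which is where the 1-unambiguity restriction he mentions earns its keep.

```python
class RepeatModel:
    """Toy content model "NAME repeated at least MIN times", e.g. tr+.

    Invented purely for illustration of the decision rule below; a real
    implementation would track a position in a content-model automaton."""
    def __init__(self, name, minimum):
        self.name, self.minimum, self.seen = name, minimum, 0
    def accepts(self, token):
        return token == self.name
    def consume(self, token):       # a full parser would call this on "consume"
        self.seen += 1
    def required_next(self):
        return self.name if self.seen < self.minimum else None
    def satisfied(self):
        return self.seen >= self.minimum

def handle(token, open_element, model):
    """One step of tag inference, following the recipe in the comment above."""
    if model.accepts(token):
        return ("consume", token)
    required = model.required_next()
    if required is not None:
        return ("imply-start-tag", required)    # insert <required>, then retry token
    if model.satisfied():
        return ("imply-end-tag", open_element)  # insert </open_element>, then retry
    return ("error", token)

# Inside a fresh <table> (model tr+), a <td> start-tag forces an implied <tr>:
print(handle("td", "table", RepeatModel("tr", 1)))  # ('imply-start-tag', 'tr')
```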

So I think we have to see the SGML family as being a kind of term-rewriting system at the token level. If we make this step, then the non-SGML good and bad ideas in the new HTML5 spec start to fit better. They have various rules for tag implication, but also for tag movement (expressed in DOM terms, though):

<table><X><tr> becomes <X><table><tr> where X is an inline formatting element like b or i. (I think this is not a bad idea, as a form of error recovery that keeps the tree structure.)

<X> text <Y> text </X> text </Y> becomes <X> text <Y> text </Y></X><Y> text </Y> where X and Y are inline formatting elements like b and i. (I think this is a bad idea, since only some HTML systems acted this way, based on the Netscape bug.)

SGML's odd system of moving record ends between markup is perhaps another example of this. Something that may be nascent or a workaround in full SGML (or be dismissible as merely an exception) may have to be regarded more seriously when we look at the whole SGML family.

I guess the next step would be to try to identify, for each of the bubbles in the process, exactly which class of grammar is needed, if we want a characterization that stays entirely within formal classes (and assuming the purpose is clarity, so a system of the clearest formalisms at each level rather than just "It is level 1" or whatever).
