Jotting on parsers for SGML-family document languages: SGML, HTML, XML #3

Putting it together more

By Rick Jelliffe
September 5, 2009

This is #3 in a series of 4. 1) You say automata and I say automata, 2) Stateless semicoroutines may be convenient, 3) Putting it all together, 4) Some links to research


The previous two items in this series You say automata and I say automata and Stateless semicoroutines may be convenient are not terse, so I thought I'd partly summarize.

1) Full SGML, XML and HTML are a family of markup languages.

2) SGML was developed a little ahead of both the formal theory needed to classify it, and the kinds of programming language features that would have made it easy to implement. This somewhat unfairly gave full SGML a reputation for gratuitous complexity; however some of the features that remain in XML and HTML continue to present difficulties to developers who use old languages such as C++ and Java, or know only old automata classes.

3) A convenient (but not the only) way to view SGML-family parsers may be as pipeline of recognizers with some kind of feedback. Each family member may have a different number of stages. It may be better for a developer to start with expectations of these kinds of structures to prevent later ad hoc refactoring. For example:


parse1.PNG

4) A convenient (but not the only) way to implement the recognizers may be as having a stateless semicoroutine accessing external state. This way the recognizers form something akin to Conway's classic process chain. It may be better for a developer to start with expectations of these kinds of structures to prevent later ad hoc refactoring. So object languages that support semicoroutines may be appropriate: Scala, Lua, C#, not Java?


recogoniser.PNG

5) A convenient (but not the only) way to view the state part of each recognizer may be as having various super stacks and super maps, an intuition reinforced by various modern formal automata classes. It may be better for a developer to start with expectations of these kinds of structures to prevent later ad hoc refactoring.


multistack.PNG

6) And finally that while considering the only the allowed grammars is adequate for most uses XML and the Desperate Perl Hacker, full SGML and HTML both have considerable extra complexity with features and strategies to cope with omitted markup and error recovery. To shift the complexity into regularity, table-driven approaches may be good for some of the recognizers even if semicoroutine style yield() is available.


transitions.PNG

Document Driver

In the top diagram, I have a process called "Document Driver" that I would like to explain. It is usually missed as an extra stage, or folded out into a special case, but it is no more than another grammar in the general scheme of things.

To explain, here is the top-level C++ code from James Clark's C++ SP SGML/XML parser (Parser.cxx).

Event *Parser::nextEvent()
{
  while (eventQueueEmpty()) {
    switch (phase()) {
    case noPhase:
      return 0;
    case initPhase:
      doInit();
      break;
    case prologPhase:
      doProlog();
      break;
    case declSubsetPhase:
      doDeclSubset();
      break;
    case instanceStartPhase:
      doInstanceStart();
      break;
    case contentPhase:
      doContent();
      break;
    }
  }
  return eventQueueGet();
}

I see a grammar here, or part of one ((init, prolog, declSubset, instanceStart, content).) Roughly, for full SGML the top-level grammar is some thing like this:

    document ::=  sgml_declaration? 
		( base doctype,  doctype declarations? )?
                link_process_definition? 
		instance 
    instance  ::=  instance_start 
		document* 
		instance_end 

When we look at XML we see it is the subset

    document ::=  sgml_declaration? 
		( base doctype,  doctype declarations? )?
		link_process_definition? 
		instance 
    instance  ::=  instance_start 
		document*  
		instance_end 
And for HTML just
    document ::=  sgml_declaration?  
		( base doctype,  doctype declarations?)?
		link_process_definition? 
		instance 
    instance  ::=  instance_start 
			document*  
			instance_end 

but I don't see that there are less stages than in the top diagram. The family resemblance remains.

And just by way of explanation of the items:


  • SGML has a kind of declaration called Link Process Definitions (LPD), which allow you to add extra attributes to elements. It is a kind of stylesheet mechanism intended for formatting not a hypertext thing: it has the advantage that all the information is delivered as normal attributes information items rather than an external mechanism: link maps can have names so you can invoke multiple names, they can specify the top-level element where they apply, and they have a mechanism for treating the initial element in a sequence differently to the subsequent ones. The user can override the map selection in the document too. I gather that the LPD is one of the few recognizers in SGML that could be implemented as a post-process in a simple one-way pipeline.

  • SGML allows documents to contain other sub-documents. This has been overtaken in XML by namespaces, which is more lightweight and useful for attribute and interleaved vocabularies.

It is useful to note that while full SGML often is accused of having too many built-in optional features, almost all of them, when removed from the other members of the family (HTML and XML) immediately need to be reinvented. No LPD, but we got CSS. No subdocuments, but we got namespaces. No external entities or PIs (in HTML) but we got server-side includes. No DTD but we got xml;id, xml:base, XML Schemas and so on. Now of course, many of these are improvements or slightly retargetted in functionality or developed specifically as replacements.

But that so many of the features are reimplemented means, I think, that the criticism of SGML's supposed complexity must come down to a criticism of the perceived or real lack of what Conway called separability, even after we allow that the text could be untangled better in the light of modern automata theory etc. —some diagrams even, for those of use who are graphical rather than formulaical!— or that some features have been overtaken by events.

In fact, it is notable that several features of SGML are actually less used after they have been severed from the base and made more cleanly implementable by post-processes. There are multiple reasons, I am sure, but I suspect that one of them is the issue of unreliability: you don't use a feature if you will not be pretty sure that the other end will be set up to use it.

I think there is a fundamental confusion by standards developers, particularly at the W3C. Everything to do with information set and link construction needs to be standardized and reliable: the application developer needs to be concerned with the information not its assembly. So reinventing all these wheels but not refactoring them back into the base with a fixed execution order is just window dressing that complexifies the developer's tasks. Which is hardly the point of XML,is it?

I am not saying we need to go back to full SGML! But XML the simplification that XML needs is to deal with all the ad hoc optional features (@xml:*, @xmlns) by folding them back into the base. Simplification by adding functionality? You betcha.

Packaging manifest as the top-level grammar?

The processor chain at the top-level could be extended further: add an extra recognizer at the far right to whose grammar represents the files/parts/entities/resources that can be found in an archive (or the set of entities referenced in the document.)

In fact, this is exactly what ISO 8879 does:


[1] SGML document =
SGML document entity,
( SGML subdocument entity |
SGML text entity |
character data entity |
specific character data entity |
non-SGML data entity )*

Now by now you may be saying Rick, are you really saying that SGML can only be described by some kind of seven-level grammar? Zut alors! Not really: I am saying however that in order to get a clean separation of concerns, things like a seven-layer system may be a helpful model, and that (as in the Carroll RDF paper I referenced earlier) trying to model our SGML family languages in terms of traditional Conway-style recognizer chains or trying to figure out what insights can be drawn from modern automata/grammar/language theory should be part of our discipline (even though we may all be beginners or amateurs at it.)


You might also be interested in:

News Topics

Recommended for You

Got a Question?