Jotting on parsers for SGML-family document languages: SGML, HTML, XML #2

Stateless semicoroutines may be convenient

By Rick Jelliffe
September 4, 2009

This is #2 in a series of 4. 1) You say automata and I say automata, 2) Stateless semicoroutines may be convenient, 3) Putting it all together, 4) Some links to research


In the previous blog entry on this topic, I looked at some issues relating to characterizing SGML-family markup languages using formal automata classes, and at the kind of program structures we would therefore expect to see in an SGML-family parser. In this entry, I want to look at another way of categorizing parsers, which is by the class of computation required, with some comments about how to understand SGML-family recognition using the conventional lexer/parser concept.

Pipelines

Melvin Conway has made many seminal, well-known contributions to software engineering: the characterization of parsing as a (possibly multi-pipe) pipeline and the concept of coroutines, both in 1963's Design of a separable transition-diagram compiler; his questioning of workforce linearity (which led to Brooks's Law) and Conway's Law, both in 1968's How Do Committees Invent?; the invention of the MUMPS environment; and so on. I think he is one of the heroes of modularity. His website has some tantalizing slides on A New Application Model.

If the name doesn't ring a bell, here is Conway's Law:

Any organization that designs a system (defined more broadly here than just information systems) will inevitably produce a design whose structure is a copy of the organization's communication structure

The rest of this scribble is about coroutines, with a digression on Conway's linearity idea down at the bottom too.

Can SGML-family languages be parsed by coroutines?

Today I was reading a good 2001 paper, CoParsing of RDF & XML, in which Jeremy Carroll uses a multiple-stage pipeline approach.

You could characterize XML as an attempt to make a subset of SGML that can be parsed using a three-level or two-level grammar (for validated and merely well-formed XML respectively). The tokenizer of XML can be a state machine; the parser that establishes well-formedness is a stack machine; the validator is another stack machine. Carroll tacks on a lexer, a parser and a graph constructor to do the RDF recognition: multiple levels forming a pipeline.
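To make the two-level picture concrete, here is a minimal sketch (nothing like a real XML parser: no attributes, comments or empty-element tags) of a finite-state tokenizer feeding a stack-machine well-formedness checker. The names and the regular expression are my own illustrations.

    # A toy two-level pipeline: a finite-state tokenizer feeding a
    # stack-machine well-formedness checker. Not a real XML parser.
    import re

    def tokenize(text):
        """Finite-state pass: split markup from character data."""
        for match in re.finditer(r"<[^>]*>|[^<]+", text):
            tok = match.group()
            if tok.startswith("</"):
                yield ("end", tok[2:-1].strip())
            elif tok.startswith("<"):
                yield ("start", tok[1:-1].strip())
            else:
                yield ("text", tok)

    def check_wellformed(tokens):
        """Stack-machine pass: start and end tags must nest properly."""
        stack = []
        for kind, value in tokens:
            if kind == "start":
                stack.append(value)
            elif kind == "end":
                if not stack or stack.pop() != value:
                    raise ValueError("mismatched tag: " + value)
            yield (kind, value)
        if stack:
            raise ValueError("unclosed tags: " + ", ".join(stack))

    for event in check_wellformed(tokenize("<a><b>hi</b></a>")):
        print(event)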

Now certainly HTML can be treated as this kind of three-level grammar, with its fixed and open DTD.

But even with XML, it is not quite so neat. For a start, the lexer needs to be a stack machine too, in order to cope with entities. And the validator takes its grammar from the document itself, so it really needs to be some kind of adaptive grammar (see the previous blog for references).
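To illustrate why the lexing level already needs a stack, here is a toy sketch: an input reader that expands general entity references from a hypothetical in-memory entity table, pushing each replacement text onto a stack of input sources. It ignores real-world details such as parameter entities, recursion checks and catalogs.

    # Why the lexer is a stack machine: entity replacement text may itself
    # contain entity references, so inputs nest. Entity table is invented.
    ENTITIES = {"copy": "(c)", "legal": "Copyright &copy; Example"}

    def chars(text, entities=ENTITIES):
        """Yield characters, expanding &name; references via a stack of inputs."""
        stack = [iter(text)]               # stack of open input sources
        while stack:
            try:
                c = next(stack[-1])
            except StopIteration:
                stack.pop()                # entity end: pop back to the referencing level
                continue
            if c == "&":                   # entity reference: read the name ...
                name = ""
                for c in stack[-1]:
                    if c == ";":
                        break
                    name += c
                stack.append(iter(entities[name]))   # ... and push its replacement text
            else:
                yield c

    print("".join(chars("Note: &legal;.")))   # -> Note: Copyright (c) Example.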

But full SGML cannot be characterized using Conway's coroutine approach if we try to segment it using the same cut points (lexer, parser, validator) that we expect from two-level grammars and conventional wisdom.

Here is Conway's foundation point from his 1963 paper defining coroutines:

That property of the design which makes it amenable to many segment configurations is its separability. A program organization is separable if it is broken up into processing modules which communicate with each other according to the following restrictions: (1) the only communication between modules is in the form of discrete items of information; (2) the flow of each of these items is along fixed, one-way paths; (3) the entire program can be laid out so that the input is at the left extreme, the output is at the right extreme, and everywhere in between all information items flowing between modules have a component of motion to the right.

Under these conditions each module may be made into a coroutine; that is, it may be coded as an autonomous program which communicates with adjacent modules as if they were input or output subroutines. ...

The coroutine notion can greatly simplify the conception of a program when its modules do not communicate with each other synchronously. ...

If we segment a full SGML parser along these lines, we see, to our disappointment or schadenfreude or in my case perky interest, that the 'lines of motion' of information items (a term still in use, perhaps thanks to John Cowan's erudition on these matters!) are not all left-to-right. A short reference delimiter token gets parsed rightward (in our Conway layout) into an entity reference; this data goes leftward to push a new entity on the entity stack; the entity data goes right and may be parsed into a start-tag token, which may then cause a new short reference delimiter recognition map to be required (another leftward motion of information).

So full SGML parsing is a kind of turbulent stream, with local cycles, rather than a neat flow. And it is certainly not a coroutine in the classic sense of an autonomous unit with only a simple suspend()/resume() or yield() operation. For full SGML parsing characterized as coroutines, the coroutines need a resume() that takes an argument, to feed back configuration information.
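Python's generators happen to be semicoroutines whose resumption can take an argument (send()), so a toy sketch of the feedback idea looks like this; the 'delimiter map' here is just an invented set of characters standing in for a short-reference map, not anything from the SGML standard.

    # A sketch of "resume() with an argument": the caller feeds a new
    # delimiter map back into the suspended tokenizer via send().
    def tokenizer(text):
        delimiters = {"<", ">"}            # initial delimiter map
        for c in text:
            kind = "delim" if c in delimiters else "data"
            new_map = yield (kind, c)      # suspend; caller may send a new map
            if new_map is not None:
                delimiters = new_map

    tok = tokenizer("<a>~b~</a>")
    event = next(tok)                      # prime the semicoroutine
    while True:
        print(event)
        try:
            # after seeing ">", feed back an extra delimiter, as a stand-in
            # for a short-reference map change triggered further right
            if event == ("delim", ">"):
                event = tok.send({"<", ">", "~"})
            else:
                event = tok.send(None)
        except StopIteration:
            break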

Put like this, the issue with characterizing full SGML parsing in detail may become this: how do we segment the process into coroutines that have the least amount of feedback? And I think this explains the position of the mysterious object in full SGML systems, the entity manager.

Entity Manager as Lexer

The entity manager is very familiar to XML and HTML people under the name URI resolver: you give it a URI and it may return a (representation of a resource which presents itself as a) stream of characters. In the case of HTML, the complexity of URI resolution is handled at the server side. For full SGML and XML documents, local management issues mean that it is often convenient to handle resolution at the local end, which is where XML Catalogs and SGML Catalogs fit in.

The entity manager is a central segmentation point for full SGML, because it has this simple interface (the system or public identifier). Issues relating to entity stacking, character transcoding (from storage character set to document character set to system character set) and newline conditioning can be separated into it. From this perspective, I recall James Clark's comment (do I remember this right?) about the usefulness of having a peek() function on entities, not just a read() or a push() mechanism, simplifying the code. The more that the full process can be characterized with a left-to-right motion, the simpler things are for implementers.
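As a rough sketch of that segmentation point, here is a toy entity manager: an in-memory catalog stands in for real system/public identifier resolution, entity stacking is handled internally, and peek() is exposed alongside read(). The class and method names are just my illustrations, not any real API.

    # A toy entity manager: resolve identifiers via a catalog, stack open
    # entities, and let the parser peek() without consuming.
    class EntityManager:
        def __init__(self, catalog):
            self.catalog = catalog          # maps public/system identifiers to text
            self.stack = []                 # stack of [name, text, position]

        def open(self, identifier):
            text = self.catalog[identifier]          # transcoding and newline
            self.stack.append([identifier, text, 0]) # conditioning would go here

        def peek(self):
            """Look at the next character without consuming it."""
            while self.stack:
                name, text, pos = self.stack[-1]
                if pos < len(text):
                    return text[pos]
                self.stack.pop()            # entity exhausted: pop to the referrer
            return None                     # end of all input

        def read(self):
            c = self.peek()
            if c is not None:
                self.stack[-1][2] += 1
            return c

    em = EntityManager({"-//example//intro": "<p>hello</p>"})
    em.open("-//example//intro")
    print(em.peek(), em.read())             # both print '<'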

The presence of this right-to-left motion, however, may also explain why developers who attempted to make an SGML parser by starting off with a simple pipeline rapidly found that their design was compromised and spaghettified. Starting off with the wrong class of design meant that various SGML features could only be implemented as exceptions, rather than fitting in neatly.

While I am highly sympathetic to the idea that it is a good property of a standard if it can be incrementally implemented, and that a good standard should make it clear what kind of theoretical base would be needed for this (e.g. RELAX NG and regular tree grammars), I don't see it as an all-encompassing excuse when implementations fail. I'd say that a persistent problem with pre-XML SGML was our lack of ability to clearly explain what this base was.

And, as I mentioned in the previous blog item, this in turn can partly be explained by the state of progress in formal automata theory at that time: the formal tools were inadequate. Attempts were made in the mid 1990s, with the GROVE (Graph Representation Of property ValuEs) work, to make an infoset for full SGML to help explain it in a black-box kind of way, but this would not address the underlying gap between what the standard said, what theory could say, and what pragmatic information a developer would need to implement full SGML.

I think one trouble developers had was that they were looking for a two-level grammar, but they looked for it in the wrong place. Conventional programming languages, and the various lexical analysers and compiler-compilers for them, have the character as their atomic unit. The lexer then detects tokens, which the parser then parses.

However, when you look at what markup languages do, the basic lexical unit is not actually the character: it is, in effect, the octet (byte). If you come at SGML-family markup languages looking for a top-level two-level lexer/parser pair, the token the lexer produces is not the delimiter or tag: it is the character. In XML's case, for example, it is the Unicode character: Unicode has allowed a great simplification of implementation and a great increase in capabilities in this area. The SGML family of languages are functions that go from octets to information items, not from characters to information items.

This is not just theoretical, and it took a while to establish how it all should work: Gavin Nicol's seminal 1994 The Multi-Lingual World Wide Web (which in turn describes my ERCS as seminal; we are awash!) was key here. The concrete question was "When you have a numeric character reference in an SGML-family document, what character set should it use?".
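A small sketch of that octets-in, characters-out view, assuming a UTF-8 storage encoding for the entity: the byte stream is decoded to document characters first, and only then are numeric character references interpreted as numbers in the document character set (ISO 10646/Unicode for XML). The helper and sample text are invented for illustration.

    # Octets in, characters out: decode the storage encoding first, then
    # resolve numeric character references against the document character set.
    import codecs, re

    octets = "caf\u00e9 &#x3B1; &#233;".encode("utf-8")    # storage: UTF-8 bytes

    decoder = codecs.getincrementaldecoder("utf-8")()
    characters = decoder.decode(octets, final=True)        # octets -> characters

    def expand_ncr(text):
        """Replace &#...; references with Unicode characters."""
        return re.sub(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);",
                      lambda m: chr(int(m.group(1), 16)) if m.group(1)
                      else chr(int(m.group(2))),
                      text)

    print(expand_ncr(characters))     # prints: café α é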

So in SGML-family languages, if we want to look at them and find a lexer and a parser, we can actually find them: in SGML terms, they are called the Entity Manager and the Parser. People who came to full SGML looking for a lexer/parser pipeline, but who took the character as the lexical unit, found nothing but confusion: a turbulent stream inside the parser, and then all this entity stuff outside the parser that somehow needed to be fitted in. To a certain extent, we couldn't see the wood for the trees.

Actors?

So if detailed SGML-family parsing (and certainly full SGML parsing) is spaghetti when thought of in subroutine terms, and still too complex when thought of in coroutine terms, what is left?

Carl Hewitt's 1973 Actor model is surely closer to fitting the bill here. (The Wikipedia page on the Actor model mentions XML, but in the context of message sending to construct documents, not in parser design.) But full actors in Hewitt's original sense are much more managed and capable than what we need. We need something between actors and coroutines.

(Note that models such as the deterministic dataflow model of the Kahn process network ultimately relate to a safe way of sequencing interoperating coroutines or actors, rather than to a different computation class, if I understand it properly. Similarly with CSP.)

I mentioned earlier that I think what we need to be looking for is some system of coroutines that allows parameterized suspend/resume: for example, for yield() to return a value giving the next parse state or delimiter map.

This could be done by an object system that allowed methods to be coroutines, and where a suspended method does not prevent other method calls. So rather than the caller saying char = tokeniser.resume(delimiterMap); as with parameterized coroutines, it is just the same to split it in two: tokeniser.setDelimiterMap(delimiterMap); char = tokeniser.resume();
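Here is a toy version of that split, using a generator as the suspended method on a hypothetical Tokeniser object: the delimiter map lives on the object rather than being passed to resume(), and the coroutine re-reads the object-held map at each resumption.

    # The "split in two" version: configuration is set on the object,
    # not passed as an argument to resume().
    class Tokeniser:
        def __init__(self, text):
            self.delimiter_map = {"<", ">"}
            self._gen = self._run(text)

        def set_delimiter_map(self, delimiters):
            self.delimiter_map = set(delimiters)

        def resume(self):
            try:
                return next(self._gen)      # resume the suspended method
            except StopIteration:
                return None

        def _run(self, text):
            for c in text:
                # consult the (possibly updated) map at every resumption
                kind = "delim" if c in self.delimiter_map else "data"
                yield (kind, c)

    tok = Tokeniser("<a>~b~")
    print(tok.resume(), tok.resume(), tok.resume())   # ('delim','<') ('data','a') ('delim','>')
    tok.set_delimiter_map({"<", ">", "~"})            # feedback, with no argument to resume()
    print(tok.resume())                               # now ('delim', '~')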

So it seems to me that modern languages like Scala and Lua would be pretty good for this kind of organization.

I was looking through James Clark's SP code recently: it is the last and greatest of the SGML parsers. Of course it is larger than it could be, because James prefers to write as much of a system himself as he can, for lots of good reasons. And the kinds of library objects now available, such as URI resolvers, would make a difference to a modern implementation of full SGML (not that I am expecting such an eccentric thing!) His code is object-based C++, and better for using objects, but I think it suffers from not having a more actorly/coroutiney arrangement at its top level. That may well be easier to comprehend, an important consideration if you want the code to speak for itself without documentation.

Stateless Semicoroutines

We might adapt Conway's Law and say: any programmer who designs a system (defined more broadly here than just information systems) will inevitably produce a design whose structure is a copy of the programming language's communication structure. And then we might try to be smart and say, well, if we know that SGML-family parsers might be more tractable to implement if they had a design along the lines I have been mentioning, then we should choose a language that supports that computational model.

So what would this computational model look like? I am heading in the direction of what I am calling stateless semicoroutines. I don't know an established term for them: it is similar to what Microsoft calls a fiber: The only state information maintained for a fiber is its stack, a subset of its registers, and the fiber data provided during fiber creation.

A semicoroutine is a restricted form of coroutine which always suspends/yields to its caller, rather than, for example, allowing some external scheduler to determine which coroutine runs next (you will therefore tend to get a Kahn process network for free, if I understand it right). A stateless semicoroutine is a semicoroutine in which all state information, apart from that relating to program flow (position in the program and stack information), is held externally: let's say this is an object system and the state data is held by the method's object. The state data may be accessible by other semicoroutines in other objects.

The semicoroutines are connected to form a Conway-style pipeline; however, one object is capable of altering the state of another object, ahead of or behind it, without causing the system of semicoroutines to crash. By making the semicoroutines stateless, we force them to consult the state data of the encapsulating object, external to the semicoroutine, each time they need to use it. Obviously the coroutines would still have to be written and chunked knowing which state data is volatile.
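A minimal sketch of such a pipeline, again with invented names: the delimiter map is held in a shared, external state object, the tokenizing stage consults it afresh for each character, and a stage further to the right mutates it between resumptions, as a stand-in for the short-reference feedback described earlier.

    # Two stateless semicoroutines in a pipeline sharing external state:
    # the downstream stage changes the upstream stage's configuration.
    class ParseState:
        def __init__(self):
            self.delimiters = {"<", ">"}    # externally held, mutable configuration

    def tokenize(text, state):
        for c in text:                      # only program-flow state lives here
            kind = "delim" if c in state.delimiters else "data"
            yield (kind, c)

    def parse(tokens, state):
        for kind, c in tokens:
            if (kind, c) == ("delim", ">"):
                # feed configuration back "leftward" by mutating the shared
                # state that the tokenizer consults on its next resumption
                state.delimiters = {"<", ">", "~"}
            yield (kind, c)

    state = ParseState()
    for event in parse(tokenize("<a>~b~</a>", state), state):
        print(event)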


Now of course there is no fixed difference between data and instruction. Indeed, in a sense semicoroutines are exactly states. So by stateless I do not necessarily mean that in the code for the semicoroutine there are no variables which persist between suspension and resumption, especially in relation to keeping track of the state and partially consumed values of inputs: the kinds of things that could be implemented as execution paths rather than variables. (You can see that the provision of a peek() function would also reduce statefulness: at its most developed this might even be a full KPN system of FIFO buffers between processes, and indeed you might see the material for such a thing in the input and output stacks of stacks I mentioned in the first blog.)
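For what it is worth, here is a toy peekable FIFO channel of the kind such a KPN-ish arrangement might use between stages; the class and its methods are purely illustrative.

    # A peekable FIFO channel between pipeline stages: the consumer can
    # look ahead without consuming, so it carries less partial state itself.
    from collections import deque

    class Channel:
        def __init__(self):
            self._q = deque()
        def put(self, item):
            self._q.append(item)
        def peek(self):
            return self._q[0] if self._q else None
        def get(self):
            return self._q.popleft() if self._q else None

    chan = Channel()
    for c in "<a>":
        chan.put(c)
    print(chan.peek(), chan.get(), chan.get())   # '<', '<', 'a'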

A stateless semicoroutine in this sense would be directly implementable by a continuation, it seems to me. You don't want the kind of configuration data such as entity type, element content type, delimiter maps and so on to be part of the closure, but you do want it still accessible from inside the continuation and public to other processes.

[Update: Sean McGrath has a blog entry SGML and Justified Complexity.]

[Update: Sam Wilmott gave a presentation at the Extreme Markup 2003 conference, What programming language designers should do to help markup processing, which particularly identified coroutines. I see that in his 2002 paper Sam also said SGML has some things to teach us..., but SGML isn't the solution.]

Digression on Linearity

It is interesting to consider Conway's linearity argument (a project of 100 workers will not produce 100 times the output of a project of 10 workers for things like software development) in the light of what Eric Raymond dubbed Linus's Law: given enough eyeballs, all bugs are shallow. Raymond rephrases this as 'Debugging is parallelizable'. However, folding solutions back into code is certainly not well parallelizable, and this has been a real stumbling block for very large FOSS and semi-open-source projects like Sun's Java and Apache Xerces.

Digression on the Digression

It is notable that the Linux project is often described as being organized in a way that responds to this: Linus Torvalds being described as a gatekeeper with a 'funnel'.

We may view Linus's method as a way to create an efficient market in "egoboo" [Ed: the enhancement of one's reputation among other fans]: to connect the selfishness of individual hackers as firmly as possible to difficult ends that can only be achieved by sustained cooperation.

Raymond comes up with the unobjectionable:

Provided the development coordinator has a medium at least as good as the Internet, and knows how to lead without coercion, many heads are inevitably better than one.

My criticism of Sun's Java effort and Apache Xerces a few years ago was based on there not being adequate resources for folding back reported issues and bugs: the failure to make even easy fixes to the HTML parser in particular utterly crippled Java on the desktop, one of the most mind-bogglingly stupid strategic decisions I know of. Many heads may be better than one, but if there are limited resources for folding fixes back into the codebase, then it seems to me that you will get higher quality fixes, but not necessarily any more fixes integrated in absolute terms as the number of eyeballs increases.

I think Java's shortcomings on the desktop validate Conway's Law quite well, though there are certainly multiple factors at work. Software design may follow organization structures, but organization structures directly reflect strategic and business decisions when made by a single company.

This is a reason why a plurality of standards organizations, such as consortia, is useful: to go beyond the organization structures of a single company. It is also why review organizations (which is where ISO/IEC JTC1 has been heading in the last decade) are useful: they can ideally complement the consortia by making up for whatever weaknesses they have (very often this is the lack of an international perspective, which is not just internationalization but also issues such as translatability, technical fit against national capabilities/maturity, and involvement of industrial grassroots).

I see this as a core question for JTC1 SCs to ask themselves: how do we bring on board expertise to complement the consortia or NBs which are championing the standards, to provide adequate and useful review? Increasingly, I think effective committee work will involve the ability of a member to reach out, perhaps down to the national body's mirror committees and to the national communities, to find subject matter experts who can help.

By the way, I have a need at the moment for a mathematician or computer scientist who is really on top of predicate logic, for example. In the current review of ISO Schematron, there have been questions from Japan about whether the formal schema semantics as currently specified are idiomatic or should be replaced by natural language (or pseudo-code). I'd love more input into this, if there is anyone with these skills.

