The Bold and the Beautiful: two new drafts for HTML 5

By Rick Jelliffe
May 11, 2009 | Comments: 12

Two new drafts out at W3C from the HTML 5 effort: HTML 5: The Markup Language (hat-tip Micah) and HTML 5: A vocabulary and associated APIs for HTML and XHTML (hat-tip Jeni.)

The first one is a model of the kind of standards-writing we need: I'd recommend any standards editor look at it as a model of a good solution to the problem they are trying to solve. It uses standard notations or makes simple objective statements that can be trivially implemented. In particular, see how easy it would be to implement its Assertions statements in Schematron: they are singular and objective. (I presume they spring from Henri Sivonen's validation work.)

The second one is much larger, and is where many of the fiddles of historical HTML applications go. So it is not surprising if it is a bit less crystalline than the markup language spec. Its contents are pretty good, though, which excuses a lot in a standard I suppose.

If I were to find fault with it, I think it has the XML Schemas Part 1 problem of laboriously spelling out every step in natural language text: this disguises patterns in the constraints that diagrams or schemas or tables expose, which increases the burden on the reader. (Furthermore, artificial languages can be more readily converted to code automatically.) These are engineering problems, and engineering has evolved a large set of diagramming techniques that should be used. You can link back to plain language descriptions, but it is dangerous to use language where less ambiguous notations are possible.

For example, why on earth don't they specify the parsing using either formal grammars or state diagrams or state tables? It is great that they actually do talk of state, but just using lists provides no certainty about missing transitions, for example.
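
To make the point about missing transitions concrete, here is a minimal sketch in Python of what a state table buys you. The state names and input classes are hypothetical and heavily simplified; they are not taken from the HTML 5 draft. Once the transitions are data rather than prose, finding the gaps is a mechanical check.

```python
from itertools import product

# Hypothetical, simplified tokenizer states and input classes.
# These names are illustrative only, not the draft's actual states.
STATES = ["data", "tag_open", "tag_name"]
INPUTS = ["<", ">", "letter", "eof"]

# Transition table: (state, input class) -> next state.
TABLE = {
    ("data", "<"): "tag_open",
    ("data", ">"): "data",
    ("data", "letter"): "data",
    ("data", "eof"): "data",
    ("tag_open", "<"): "data",
    ("tag_open", ">"): "data",
    ("tag_open", "letter"): "tag_name",
    ("tag_name", ">"): "data",
    ("tag_name", "letter"): "tag_name",
    ("tag_name", "eof"): "data",
}


def missing_transitions(states, inputs, table):
    """Return every (state, input) cell that the table leaves undefined."""
    return [cell for cell in product(states, inputs) if cell not in table]


print(missing_transitions(STATES, INPUTS, TABLE))
```

With a prose list of rules, the two undefined cells in this toy table would have to be found by careful reading; with a table they fall out of a single loop.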

At least there is some levity. Here is the text for when in in-body insertion mode, whatever that is:

An end tag whose tag name is "sarcasm"

Take a deep breath, then act as described in the "any other end tag" entry below.

From the standards perspective, I think this may be a good approach for other specifications to follow: for the documents, a rigorous "minimum manual" approach using standard schema languages (or statements which are clearly trivially implementable in such) in particular RELAX NG, Schematron, XSD datatypes and EBNF. Then a separate specification giving semantics for a class of applications. It is a continual tension in both the ODF and the OOXML standardization efforts, so I am glad to see the HTML 5 editorial approach. From his comments, I think Murata Makoto is even more strong on this than me.

If you look at how difficult it is to draft standard text using required status terms like "shall" or "should", and how using other terms opens the door for abuse and malarky, I often think that we should just ban natural language from standards. Of course that is too much: I think Schematron's approach, where you back up natural language assertions with executable tests, is a much more practical approach.
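
As a rough illustration of that idea (and only an illustration: this is plain Python, not Schematron, and the two rules are made up rather than quoted from any specification), each natural language assertion can be paired with a small executable test:

```python
import xml.etree.ElementTree as ET

# Each rule pairs a plain-language assertion with an executable test,
# in the spirit of Schematron. The rules are illustrative assumptions.
RULES = [
    ("An img element should have an alt attribute.",
     lambda root: all("alt" in img.attrib for img in root.iter("img"))),
    ("A document should contain exactly one title element.",
     lambda root: len(list(root.iter("title"))) == 1),
]


def failed_assertions(xml_text, rules=RULES):
    """Return the natural-language text of each assertion that fails."""
    root = ET.fromstring(xml_text)
    return [text for text, test in rules if not test(root)]


doc = "<html><head><title>t</title></head><body><img src='a.png'/></body></html>"
for message in failed_assertions(doc):
    print(message)
```

The reader of the standard sees the sentence; the implementer gets the test; and the two cannot silently drift apart because they live in the same rule.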


W3C is also providing a good document: HTML 5 differences from HTML 4

s2.1 I think they still get it wrong by looking at the transport layer (e.g. the MIME header) to find the character encoding. APIs don't feed this, and it only works by accident.

s2.2 I don't see what the need for gratuitously departing from SGML and XML is, allowing <!doctype html> rather than <!DOCTYPE html>. Just make the lowercase version an optionally reportable, recoverable error and life would not be any different.

I see finally ruby text is making it into HTML. Only a decade late.

Other welcome changes include more widgets (e.g. menu, canvas, time) and some simple page-oriented features (header and footer). I was particularly pleased to see that <hr> has been given a semantic (or at least, a rhetorical function), being a paragraph-level thematic break: a step in the right direction.

Jeni Tennison was impressed with the microdata section: it seems to point out something obvious (if you label data, you can use it for stuff) but perhaps it is there for better direction.

All in all, HTML 5 looks really exciting. They have started to simplify the grammar, which is good; I would prefer that they went further (e.g. Editor's Concrete Syntax) while still staying friendlier than XML.

Memory Lane

By the way, did you know there is an ISO HTML too? It was a profile of HTML 4 designed to allow HTML to be used in certain government situations and to provide a view of the technology more from the SGML angle: it was not an alternative to W3C HTML 4 but a service to users who needed HTML 4. The specification is online here (with corrigenda here) and a user guide here. A Japanese translation is available too.

One of the problems the user guide addressed was the lack of structural elements in HTML. It suggested using DIV1, DIV2, etc. elements at authoring time but stripping them out for delivery as HTML 4. So the extra structural elements in HTML 5 are interesting: <section>, <article> and <figure> will make HTML far more useful as a structured format and for round-tripping of structure information through HTML.

The reason we didn't use a schema or parser language to describe the parsing rules of HTML5 is that as far as I am aware there are no schema or parser languages that are capable of accurately describing the parsing rules sufficiently precisely. There are some really irregular requirements, and things get really hairy when you get near scripts and document.write().

By and large I tried to be consistent for the rest of the spec and use tables and such like where possible, but frankly HTML (as it has to be implemented by browsers) really has a lot of inconsistencies -- when the spec says the same thing three times, it's usually because it says it differently each time. I try to abstract out the commonalities into common algorithms that are reused (e.g. the microsyntax parsers in the infrastructure section early on).

Please do feel free to take part in the discussions or send feedback -- see the "status of this document" section for details on how to take part.

Ian: Thanks for that, and for your hard work editing! I was suggesting using more formal/concise notations/diagrams for states (which are used) rather than some kind of schema or grammar.

This issue traces way back to IS8879 SGML, which had parser modes and various other modes which would have been far easier to understand and work with if put in some state-and-transition diagram or table.

Of course, a large diagram is difficult, and having error and recovery transitions also complicates things. But would a diagram kill? One thing that proved quite popular in XSD Datatypes Rec was the type tree diagram, which acts like a visual TOC and contains links.

(I was particularly interested in the WG's work on figuring out a parser model. Over the years, I have been working as an intellectual hobby on a unified streaming parser model that would cope with SGML, HTML, XML, etc., including transcoding and validation. The most recent version of this is a pipeline of parse events where each stage has a pipeline of four processes: parse, recognize, check, fix. I'll have to check how it stands up to HTML 5's rules.)

There's this and this showing (an old version of) the tokeniser and tree-constructor state transition graphs. But I'm not entirely sure how helpful those diagrams are (or how they could be made more so)...

Philip: Yes, that kind of state transition diagram (if it had labeled edges) is indeed exactly the kind of thing I would find useful. If I were implementing HTML, it would be the first thing I would make, to help get a grip on what is required.

Chrome and Firefox and Open Office don't open that SVG. Do you have it in PNG conveniently?

I'd like to see a version of this article that the rest of us can understand. 95% of the people who *use* HTML won't understand this. Can you tell me what is "bold and beautiful" about HTML that will help someone who is not a grammar/syntax/semantics expert, but rather someone who wants to understand this new standard early?


Rick: The SVG works fine for me in Firefox 3.0. But it's pretty large, and the top left corner is blank - maybe you need to scroll and/or zoom to see the content? :-) Converted to PNG here anyway.

This one does label the edges, though it's an extremely incomplete approximation of what the spec says. (I generated the graphs by implementing the parser algorithm as a data structure in OCaml, and then extracting the transitions and passing them off to Graphviz. The automatic layout algorithms aren't great, but they give roughly the right idea.)
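
The OCaml code itself is not shown in this thread. Purely as an illustration of the last step (handing extracted transitions to Graphviz), a sketch in Python might look like the following. The transition list is hypothetical and hand-written, not extracted from the real parser algorithm.

```python
# Hypothetical transitions as (source state, edge label, target state).
# These are illustrative and were not extracted from the spec.
TRANSITIONS = [
    ("initial", "doctype", "before html"),
    ("before html", "<html>", "before head"),
    ("before head", "<head>", "in head"),
    ("in head", "</head>", "after head"),
    ("after head", "<body>", "in body"),
]


def quote(text):
    """Wrap text in double quotes for DOT, escaping embedded quotes."""
    return '"' + text.replace('"', '\\"') + '"'


def to_dot(transitions, name="parser"):
    """Render labelled transitions as Graphviz DOT source text."""
    lines = [f"digraph {name} {{"]
    for source, label, target in transitions:
        lines.append(f"  {quote(source)} -> {quote(target)} [label={quote(label)}];")
    lines.append("}")
    return "\n".join(lines)


print(to_dot(TRANSITIONS))
```

The printed text can be saved to a file and rendered with the Graphviz `dot` tool, for example `dot -Tpng parser.dot -o parser.png`.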

Elizabeth: Why are the specs bold and beautiful? For a start because HTML 5 has been a soap opera. But perhaps because I think one of the specs is quite bold and the other is a little beauty (the markup spec).

My blog articles do tend to be deeply technical and my readership is often people involved in standards creation. Entries often deal with review issues for drafts of standards, and as an editor of a standard myself I am very interested in editorial/organizational innovations. For example of the readership, the Ian Hickson who wrote a comment earlier is the editor of one of the draft HTML 5 standards I linked to.

I would expect and hope 99.99% of people who use HTML would not be interested in this. A standard is not a tutorial.

If you are finding it difficult to understand the current drafts, then perhaps like me you would find it better to have text replaced or augmented by diagrams more.

Philip: Ha, a trick diagram :-)

Thanks for the PNG: yes the SVG does work, I didn't see the scrollbars.

Again, that diagram is exactly the kind of thing I would like to see more of in standards. (Of course, it would be better refactored into, say, three diagrams by making bubbles for the head and body states and transitions. And things like initial and final states should have indications. But those are just details.)

Diagrams have tremendous synoptic benefits. And that can only help QA.

At my work, we have a saying that if you cannot produce a top-level diagram of some kind showing the project you are working on, it is actually out of control. With a state and transition diagram, you are forced to QA "does every state have incoming and outgoing transitions" where the result is visually obvious: with text, mistakes may not be obvious at all.
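
The same QA question can also be asked mechanically once the transitions exist as data. A minimal sketch in Python, using hypothetical state names and edges deliberately chosen to contain one mistake of each kind:

```python
def unconnected_states(states, transitions, initial, final):
    """Find states with no way in or no way out.

    The initial state may lack an incoming edge, and final states
    may lack an outgoing edge; neither counts as a mistake.
    """
    has_incoming = {target for _, target in transitions}
    has_outgoing = {source for source, _ in transitions}
    no_way_in = [s for s in states if s != initial and s not in has_incoming]
    no_way_out = [s for s in states if s not in final and s not in has_outgoing]
    return no_way_in, no_way_out


# Hypothetical states and (source, target) edges, for illustration only.
STATES = ["start", "in head", "in body", "in table", "in frameset", "done"]
EDGES = [
    ("start", "in head"),
    ("in head", "in body"),
    ("in body", "in table"),
    ("in body", "done"),
    ("in frameset", "done"),
]

no_way_in, no_way_out = unconnected_states(STATES, EDGES, "start", {"done"})
print("unreachable:", no_way_in)
print("dead ends:", no_way_out)
```

In a drawn diagram both problems show up as a bubble with arrows on only one side; in running text neither would stand out.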

My plan is to include information diagrams such as Philip's once the spec is past last call. I didn't want to include them earlier than that because updating diagrams is tedious and so they'd always be out of date.

Ian: Great!

Eventually, I would like to see a kind of Standards Writers Toolbench which has vocabularies for a wide range of kinds of diagrams: for example, ER, state-transition, grammar, Venn and so on. (The kind of thing we would have used the pic utility for in UNIX days.) This would make it easy for standards teams to choose the best forms for communication.

I don't think it is fanciful, indeed it is mainstream in Word's SmartArt (I wrote a blog recently about mooted equivs for OpenOffice). We make our SC34 WG1 standard using a higher level language, and I thought that XSD structures does too (actually, I don't remember why I think that.)

Auto-generating text or diagrams from schemas is not unknown either (though I don't think it is that much use: RELAX NG productions are terser and work on large content models) but the OOXML experience is that large quantities of autogenerated text cause pushback by daunting incoming readers (though familiarized readers are positive.)

Ian says: "The reason we didn't use a schema or parser language to describe the parsing rules of HTML5 is that as far as I am aware there are no schema or parser languages that are capable of accurately describing the parsing rules sufficiently precisely."

Why is HTML5 like that? Why not create a language which can be expressed mathematically? Is there a history of this decision somewhere?


Charlie: Short answer: I think there are lots of reasons, some good, some bad...

Long answer: A lot of formal parser theory is based on Noam Chomsky's work, in which (and this is second hand, I have never read Chomsky, only summaries of him, so correction is welcome) he posited a series of simpler formal grammars as straw men to show that they didn't explain the way that grammar actually works: both the constructs that people and languages use and the constructs people and languages never use. So it is a respectable view, at least academically, that the more "natural" an artificial grammar is, the less well it can be described by simple formal models.

This was a problem that went all the way back to SGML: by the time you added tag omission and implication, validation, short references, inclusion/exclusion exceptions, asynchronous entities, whitespace attribution and movement, SGML declarations, variant features, and so on, you ended up with something quite hard to pin down using simple formal grammars.

HTML inherited this a little (though I would say that SGML provided a gravitational centre for HTML that prevented it from spinning out even further from orbit into ad hocery) but the 'need' to propagate implementation bugs of previous versions (either by the HTML standard or by the decision of browser-makers) has made things quite complicated.

XML of course went the other way: make a dialect of SGML that was entirely and trivially mappable to simple grammar constructs. Good for implementers but, as XHTML has shown, it does not provide enough affordances (or, at least, 'give') for casual and SOHO use.

So HTML 5 needs to tighten up the grammar a little, but not to the extent of mirroring XML.

Conventionally we split parsing up into three passes: tokenizing, parsing, and validation. Tokenizing is how you convert delimiters and names into tokens; parsing is how you build the tree from these tokens; validation is whether the tree fits some plan. The difference between SGML and XML was that in XML each stage is distinct (apart from entity expansion) while in SGML they are coroutines, where the result of validation can impact the delimiter recognition. HTML is a little more like SGML here (due to the R?CDATA content of some elements, for example.)
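
For readers who have not met the three-pass split before, here is a minimal sketch in Python of the XML-like case where the stages are fully distinct. The toy markup language (lowercase tag names, no attributes, well-formed input assumed) and the content plan are invented for illustration.

```python
import re

# Toy markup only: lowercase element names, no attributes, and the
# input is assumed to be well-formed. All of this is illustrative.
TOKEN = re.compile(r"<(/?)([a-z]+)>|([^<]+)")


def tokenize(text):
    """Pass 1: turn delimiters and names into (kind, value) tokens."""
    for close, name, chars in TOKEN.findall(text):
        if name:
            yield ("end" if close else "start", name)
        else:
            yield ("text", chars)


def parse(tokens):
    """Pass 2: build a tree of [name, children] nodes from the tokens."""
    root = ["#root", []]
    stack = [root]
    for kind, value in tokens:
        if kind == "start":
            node = [value, []]
            stack[-1][1].append(node)
            stack.append(node)
        elif kind == "end":
            stack.pop()  # assumes the end tag matches the open element
        else:
            stack[-1][1].append(value)
    return root


def validate(node, allowed):
    """Pass 3: check each element's child elements against a plan."""
    name, children = node
    errors = []
    for child in children:
        if isinstance(child, list):
            if child[0] not in allowed.get(name, set()):
                errors.append(f"{child[0]} not allowed in {name}")
            errors.extend(validate(child, allowed))
    return errors


ALLOWED = {"#root": {"doc"}, "doc": {"p"}, "p": {"em"}}
tree = parse(tokenize("<doc><p>Hi <em>there</em></p><em>oops</em></doc>"))
print(validate(tree, ALLOWED))
```

Information flows strictly one way here: the tokenizer never needs to know what the validator thinks. In the SGML style described above, the later stages feed back into delimiter recognition, which is exactly what a simple pipeline like this cannot express.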

Many of us have been working for years on figuring out formalisms for parsing SGML/HTML systems, in the sense of simple analytical structures that can allow both straightforward objective specification and easy implementation.

In the case of HTML 5, you can see that they have a rather complex state or stack machine at the heart of their parser rules. What makes things more complex is the need to define error and recovery states: this makes the model much busier (though not more complex, in the case of a state machine). But state and stack machines are simple and well understood, so you can certainly regard them as expressible "mathematically".

For validation, both RELAX NG and Schematron have mathematical characterizations in their standard, so schemas written using them can be regarded as "mathematical". But XML validation runs out where the markup runs out: you cannot validate CSS and JavaScript using generic XML tools, for example.

I certainly would imagine that more of HTML5 could be validated by Schematron, but Schematron was not designed for completeness but for directness, so there may be other formalisms (perhaps yet unready) that need to be adopted to cover the full range of constraints.
