Four months ago I blogged on the two HTML 5 drafts then out in The Bold and the Beautiful: two new drafts for HTML 5 and The Assertions in HTML 5.
Since then there have been new drafts, with a new Working Draft on 25th August 2009 of HTML 5: A vocabulary and associated APIs for HTML and XHTML. I've had a little look over it, and here are some random thoughts.
First, a positive thing. Among the tsunami of new elements are some which look like partly addressing something I have been calling for since at least 2001: more rhetorical elements. Though not in the kind of generic way (superior and inferior labelling) that I suggested, but in some more flat hard-coded elements: aside, details, dialog, header, footer, more oriented to screen and page than abstract information. Here is an aside: does it work for you?
Second is that draft HTML 5 is definitely now defined in terms of an automaton rather than a grammar. In fact, more than a simple stack machine, it is a transformation: Instead of a grammar, we get exact instructions for adding and shifting nodes in an output tree. The little document HTML 5: The Markup Language is clearly the minor document, though useful. The target readership of the draft is not document writers, but parser writers.
I don't know that this is a bad thing at all. In fact, from my angle one of the key priorities of SGML is the overarching need for what ISO 8879 calls rigorous markup (i.e. where the rules of parsing are clearly defined, explicit and public): SGML being more honored in the breach than the observance? Rigorous, descriptive markup which could at least cope with standard WebSGML documents in superset...not bad! Of course, the trouble with the misnested markup is that it springs out of the attitude that the tags turn effects on and off: that needs to be resisted in the text. Indeed, to be fair, misnested tags are supported by specifying a transformation to a nested form: some tags may be misnested but the information set is the traditional tree of nesting elements. If they are going to have misnested tags, that is the way to do it, I guess.
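To make the idea of "a transformation to a nested form" concrete, here is a minimal Python sketch built on the standard library's html.parser. It is not the algorithm from the HTML 5 draft, which is considerably more involved and only re-opens certain formatting elements; the class name Renester and the function renest are hypothetical, attributes are dropped, and text is copied through without re-escaping. It only shows the basic move: when an end tag arrives out of order, close the elements opened after it, close the target, then re-open the interrupted ones.

```python
from html.parser import HTMLParser


class Renester(HTMLParser):
    """Rewrite misnested tags into properly nested markup (simplified sketch)."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open element names
        self.out = []        # pieces of the re-nested output

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored to keep the sketch short.
        self.open_tags.append(tag)
        self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag with nothing to close: drop it
        # Close everything opened after `tag`, innermost first.
        interrupted = []
        while self.open_tags[-1] != tag:
            inner = self.open_tags.pop()
            self.out.append(f"</{inner}>")
            interrupted.append(inner)
        # Close the element the end tag actually names.
        self.open_tags.pop()
        self.out.append(f"</{tag}>")
        # Re-open the interrupted elements, outermost first.
        for inner in reversed(interrupted):
            self.open_tags.append(inner)
            self.out.append(f"<{inner}>")

    def handle_data(self, data):
        self.out.append(data)


def renest(markup):
    """Return `markup` with misnested end tags rewritten as a strict tree."""
    parser = Renester()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(renest("<p>hello <b>bold <i>new </b> world.</i></p>"))
# prints: <p>hello <b>bold <i>new </i></b><i> world.</i></p>
```

The input tags overlap, but the output is an ordinary tree: the italic run is split into two i elements, one inside the b and one after it.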
Third is that the big HTML 5 draft gathers together so many of the previously separate specs, such as those relating to the DOM, into a unified whole. Say goodbye to separation of concerns!
Fourth is that there is a strong backwards-compatibility imperative at work: almost to the extent that there is no silly thing that any browser ever allowed in the past that does not go into the mix. This is a perfectly respectable attitude, and a sure way towards kitchen-sinkism. Rather than steer HTML in a particular direction, we should learn to love wallowing in the craziness. The mantra here is that the spec should document reality, not present a rosy fake view. What HTML parser writers have done and are doing is the driver.
I don't see why moderation isn't called for. For example, on asynchronous markup (HTML parsers allow
<p>hello <b>bold <i>new </b> world.</i></p> where the word "world" will be in italic) I can understand that the imperative to reflect reality would want to mention this. But would it kill anyone to say "This is bad practice, even though this is allowed"?
The smaller HTML 5 draft mentions this under the name misnested tags, but nowhere is this deprecated. [Update: I was wrong on this one: the misnested tags are parse errors, that a validator should report. See the comments at the bottom for details.] I guess this is what they mean by abandoning SGML. More on this later.
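For readers who want to see what "a parse error that a validator should report" might look like in practice, here is a rough Python sketch using the standard library's html.parser. It is an assumption-laden toy, not a validator: the names NestingChecker, find_misnested and VOID are hypothetical, the list of void elements is deliberately partial, and elements whose end tags are optional in HTML (p, li and so on) would produce false reports.

```python
from html.parser import HTMLParser

# Partial, illustrative list of elements that never take an end tag.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class NestingChecker(HTMLParser):
    """Collect end tags that do not close the innermost open element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open element names
        self.errors = []     # (line, column, message) tuples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # XML-style <br/> opens and closes at once: nothing to track

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()  # properly nested
            return
        line, col = self.getpos()
        if tag in self.open_tags:
            crossed = self.open_tags[-1]
            self.errors.append(
                (line, col, f"</{tag}> closes across open <{crossed}>"))
            # Drop the innermost matching element so checking can continue.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]
        else:
            self.errors.append((line, col, f"stray </{tag}>"))


def find_misnested(markup):
    """Return a list of (line, column, message) nesting errors."""
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.errors


for line, col, message in find_misnested(
        "<p>hello <b>bold <i>new </b> world.</i></p>"):
    print(f"line {line}, column {col}: {message}")
```

Run on the sample markup, this reports a single problem, at the </b> that closes while the i element is still open. A browser would carry on regardless, which is exactly the split between what is rendered and what is conforming.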
Fifth, I have never seen a spec which is so explicit and cheerful about willful violations of other standards. Some are trivial: this is a draft, so we should not expect completeness or consistency between the parts. For example, I see that HTML 5 has finally stimulated a separate document of Web Addresses in HTML 5 (edited by Dan C. and Michael S.-M.) though the big draft continues to use "URL" rather than "Web Address". The draft says
1.5.2 Compliance with other specifications
This section is non-normative.
This specification interacts with and relies on a wide variety of other specifications. In certain circumstances, unfortunately, the desire to be compatible with legacy content has led to this specification violating the requirements of these other specifications. Whenever this has occurred, the transgressions have been noted as "willful violations".
Now sometimes standards cannot be complied with: when they are broken and unworkable. For example, the idea that you can rely on the character encoding in a MIME header is broken: we don't have seamless end-to-end out-of-band transmission of this, so the only place it can go is inside the document, which is what XML does.
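As a small illustration of the encoding travelling inside the document, the following Python sketch parses a byte string with the standard library's xml.etree.ElementTree. The sample document and the helper name read_text are made up for the example; the point is only that the parser picks the encoding up from the XML declaration, with no MIME header or other out-of-band information involved.

```python
import xml.etree.ElementTree as ET


def read_text(doc_bytes):
    """Parse raw bytes as XML and return the root element's text."""
    return ET.fromstring(doc_bytes).text


# "café" encoded as ISO-8859-1: the é is the single byte 0xE9.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xe9</p>'

# The encoding is read from the declaration carried in the document itself.
print(read_text(declared))
```

With the declaration removed, the same bytes are no longer acceptable: XML defaults to UTF-8 when nothing else is declared, and a lone 0xE9 byte is not valid UTF-8, so a conforming parser has to reject the document rather than guess.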
But deciding to not follow some standard in order to be compatible with reality is difficult ground for a specification: if you don't follow others' standards, how can you expect anyone to follow yours? Becoming an alternative and contradictory source of the definition of a technology is the very thing that reputable standards bodies work hard to avoid: not just because of turf wars but also because it confuses users, and it goes against order and due process.
Again, what HTML parser writers have done is the driver.
Sixth. So, can we say then that HTML 5 is the OOXML and ODF demons come to roost: standards which are explicitly aimed at canonizing the particular feature sets of the championing vendors? Yes definitely: the documentary impulse has trumped the aspirational. But that is not the whole story, because OOXML, ODF and HTML 5 all have processes with non-vendors involved significantly, though whether it is enough in each case requires vigilance, transparency and pressure. The key here is that a balance of interests is needed. If it has got as far as Formal Objection to 'One Vendor, One Veto' then I think it is time for regulators to step in, to require balance: HTML is important enough that to me it falls into the class of market dominating interface technologies that need legally imposed requirements for balance, balance of interests being the thing that makes openness real. Fat chance.
But things like this make it look like the result is muddling through, all von Bismarck and sausages of course. Setting the early stages of a standard is the name of the game when it comes to domination, look at XML and XSD: details are easy to change, the thrust of the technology is not.
Actually moving closer to SGML?
Now I mentioned I had something else to say about abandoning SGML. I have written before that I think HTML is popular enough that it does not need to be couched in terms of SGML explicitly, especially since the SGML spec is not available for free on the WWW. And also written that SGML in its second decade had become a centre of gravity rather than being where the action was. And now we are in SGML's third decade, it is really just the SGML information set that is the centre of gravity, one step more removed.
What is interesting about HTML 5 is that they may think they are moving further away from SGML, and indeed they are doing that in theory by not deprecating misnested tags, but look at other things: they are introducing CDATA marked sections to cope with SVG better, for example! And they are adopting XML-style /> empty element syntax: HTML browsers accepted these, so again this just reflects reality, but people only do it because of XML: the modern incarnation of SGML.
I'd like to see HTML 5 allow processing instructions too, to cope with the server-side HTML. Now that is what I would call supporting reality! I suppose it shows that the vendor/developers who are driving HTML must be the browser vendors in particular. This client-side focus is lop-sided.
But seriously, what is the point of keeping this kind of rubbish? (I mean that kind of transformation in HTML, not that kind of silly shoehorning to SGML!) At least in the sense of not marking it as obsolescent and deprecated, even if it may be prudent for backwards-facing browsers to continue to support it? There needs to be a way of letting some things wither on the vine, the only thing that grows uncontrollably is a cancer, etc etc.