Draft HTML 5: no longer a markup language but a machine?

By Rick Jelliffe
September 10, 2009 | Comments: 21

Four months ago I blogged on the two HTML 5 drafts then out, in The Bold and the Beautiful: two new drafts for HTML 5 and The Assertions in HTML 5.

Since then there have been new drafts, including a new Working Draft of 25th August 2009 of HTML 5: A vocabulary and associated APIs for HTML and XHTML. I've had a little look over it, and here are some random thoughts.

First, a positive thing. Among the tsunami of new elements are some which look like they partly address something I have been calling for since at least 2001: more rhetorical elements. It is not done in the kind of generic way (superior and inferior labelling) that I suggested, but with flatter, hard-coded elements: aside, details, dialog, header, footer, which are more oriented to screen and page than to abstract information. Here is an aside: does it work for you?
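For instance, a sketch of my own (the element names are from the draft, the content is invented) of how a few of these might read in a page:

    <article>
      <header>
        <h1>Draft HTML 5: markup or machine?</h1>
      </header>
      <p>The main argument goes here.</p>
      <aside>
        <p>An aside: a remark tangential to the main flow.</p>
      </aside>
      <footer>
        <p>Posted by the author, September 2009.</p>
      </footer>
    </article>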

Second is that draft HTML 5 is definitely now defined in terms of an automaton rather than a grammar. In fact, it is more than a simple stack machine: it is a transformation. Instead of a grammar, we get exact instructions for adding and shifting nodes in an output tree. The little document HTML 5: The Markup Language is clearly the minor document, though useful. The target readership of the draft is not document writers, but parser writers.
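To give a flavour of what I mean by a transformation (my own illustration, not an example from the draft), the tree-construction rules quietly insert nodes the source never wrote:

    <!-- source as typed -->
    <table><tr><td>cell</td></tr></table>

    <!-- the tree the parser builds, re-serialized: a tbody element has been
         invented and wrapped around the row -->
    <table><tbody><tr><td>cell</td></tr></tbody></table>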

I don't know that this is a bad thing at all. In fact, from my angle, one of the key priorities of SGML is the overarching need for what ISO 8879 calls rigorous markup (i.e. where the rules of parsing are clearly defined, explicit and public): SGML being more honored in the breach than the observance? Rigorous, descriptive markup which could at least cope with standard WebSGML documents as a superset... not bad! Of course, the trouble with misnested markup is that it springs out of the attitude that tags turn effects on and off: that needs to be resisted in the text. Indeed, to be fair, the way that misnested tags are supported is by specifying a transformation to a nested form: some tags may be misnested, but the information set is still the traditional tree of nested elements. If they are going to have misnested tags, that is the way to do it, I guess.

Third is that the big HTML 5 draft gathers together so many of the previously separate specs, such as those relating to the DOM, into a unified whole. Say goodbye to separation of concerns!

Fourth is that there is a strong backwards-compatibility imperative at work: almost to the extent that there is no silly thing that any browser ever allowed in the past that does not go into the mix. This is a perfectly respectable attitude, and a sure way towards kitchen-sinkism. Rather than steer HTML in a particular direction, we should learn to love wallowing in the craziness. The mantra here is that the spec should document reality, not present a rosy fake view. What HTML parser writers have done and are doing is the driver.

I don't see why moderation isn't called for. For example, on asynchronous markup (HTML parsers allow <p>hello <b>bold <i>new </b> world.</i></p> where "world" will be in italic) I can understand that the imperative to reflect reality would want to mention this. But would it kill anyone to say "This is bad practice, even though it is allowed"? The smaller HTML 5 draft mentions this under the name misnested tags, but nowhere is this deprecated. [Update: I was wrong on this one: misnested tags are parse errors, which a validator should report. See the comments at the bottom for details.] I guess this is what they mean by abandoning SGML. More on this later.
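For the record, here is roughly what the transformation does with that example, as I read the tree-construction rules (the second snippet is the parser's output tree, re-serialized):

    <!-- source as typed -->
    <p>hello <b>bold <i>new </b> world.</i></p>

    <!-- tree the parser builds: the open i element is closed inside the b
         when the </b> is seen, and a fresh i element is opened for the
         remaining text, which is why "world." still comes out italic -->
    <p>hello <b>bold <i>new </i></b><i> world.</i></p>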

Fifth, I have never seen a spec which is so explicit and cheerful about willful violations of other standards. Some are trivial: this is a draft, so we should not expect completeness or consistency between the parts. For example, I see that HTML 5 has finally stimulated a separate document of Web Addresses in HTML 5 (edited by Dan C. and Michael S.-M.) though the big draft continues to use "URL" rather than "Web Address". The draft says

1.5.2 Compliance with other specifications

This section is non-normative.

This specification interacts with and relies on a wide variety of other specifications. In certain circumstances, unfortunately, the desire to be compatible with legacy content has led to this specification violating the requirements of these other specifications. Whenever this has occurred, the transgressions have been noted as "willful violations".

Now sometimes standards cannot be complied with: when they are broken and unworkable. For example, the idea that you can rely on the character encoding in a MIME header is broken: we don't have seamless end-to-end out-of-band transmission of this, so the only place it can go is inside the document, which is what XML does.
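That is, the declaration has to travel in-band, inside the document itself. A small illustration (my own, not an example from either spec):

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- XML: the encoding is declared on the document's own first line -->

    <!DOCTYPE html>
    <meta charset="UTF-8">
    <!-- draft HTML 5: a meta declaration near the top of the document -->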

But deciding to not follow some standard in order to be compatible with reality is difficult ground for a specification: if you don't follow others' standards, how can you expect anyone to follow yours? Becoming an alternative and contradictory source of the definition of a technology is the very thing that reputable standards bodies work hard to avoid: not just because of turf wars but also because it confuses users, and it goes against order and due process.

Again, what HTML parser writers have done is the driver.

Sixth. So, can we say then that HTML 5 is the OOXML and ODF demons come home to roost: standards which are explicitly aimed at canonizing the particular feature sets of the championing vendors? Yes, definitely: the documentary impulse has trumped the aspirational. But that is not the whole story, because OOXML, ODF and HTML 5 all have processes with non-vendors significantly involved, though whether that is enough in each case requires vigilance, transparency and pressure. The key here is that a balance of interests is needed. If it has got as far as a Formal Objection to 'One Vendor, One Veto', then I think it is time for regulators to step in, to require balance: HTML is important enough that to me it falls into the class of market-dominating interface technologies that need legally imposed requirements for balance, balance of interests being the thing that makes openness real. Fat chance.

But things like this make it look like the result is muddling through, all von Bismarck and sausages of course. Setting the direction in the early stages of a standard is the name of the game when it comes to domination; look at XML and XSD: details are easy to change, the thrust of the technology is not.

Actually moving closer to SGML?

Now I mentioned I had something else to say about abandoning SGML. I have written before that I think HTML is popular enough that it does not need to be couched in terms of SGML explicitly, especially since the SGML spec is not available for free on the WWW. I have also written that SGML in its second decade had become a centre of gravity rather than being where the action was. Now that we are in SGML's third decade, it is really just the SGML information set that is the centre of gravity, one step more removed.

What is interesting about HTML 5 is that they may think they are moving further away from SGML, and indeed they are doing that in theory by not deprecating misnested tags, but look at other things: they are introducing CDATA marked sections to cope with SVG better, for example! And they are adopting XML-style /> empty element syntax: HTML browsers accepted these, so again this just reflects reality, but people only do it because of XML: the modern incarnation of SGML.
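For example (my own illustration of the two syntaxes just mentioned):

    <!-- CDATA marked section: in the draft, allowed only inside foreign
         content such as SVG or MathML -->
    <svg xmlns="http://www.w3.org/2000/svg">
      <desc><![CDATA[ Raw text where < and & need no escaping ]]></desc>
      <circle cx="5" cy="5" r="4"/>
    </svg>

    <!-- XML-style trailing slash, tolerated on void elements -->
    <br/>
    <img src="logo.png" alt="logo"/>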

I'd like to see HTML 5 allow processing instructions too, to cope with the server-side HTML. Now that is what I would call supporting reality! I suppose it shows that the vendor/developers who are driving HTML must be the browser vendors in particular. This client-side focus is lop-sided.
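What I have in mind is the kind of thing server-side templates do every day; a PHP page, purely as an illustration:

    <p>Today is <?php echo date("Y-m-d"); ?>, and this paragraph was
    assembled on the server before any browser ever saw it.</p>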

But seriously, what is the point of keeping this kind of rubbish? (I mean that kind of transformation in HTML, not that kind of silly shoehorning into SGML!) At least in the sense of not marking it as obsolescent and deprecated, even if it may be prudent for backwards-facing browsers to continue to support it? There needs to be a way of letting some things wither on the vine; the only thing that grows uncontrollably is a cancer, etc. etc.



21 Comments

A couple of comments. (Apologies for any formatting mess-ups, since there's no preview function I can see.)

> Of course, the trouble with misnested markup is that it springs out of the
> attitude that tags turn effects on and off: that needs to be resisted in the
> text. Indeed, to be fair, the way that misnested tags are supported is by
> specifying a transformation to a nested form: some tags may be misnested, but
> the information set is still the traditional tree of nested elements. If they
> are going to have misnested tags, that is the way to do it, I guess.
> ...
> I don't see why moderation isn't called for. For example, on asynchronous
> markup (HTML parsers allow <p>hello <b>bold <i>new </b> world.</i></p> where
> "world" will be in italic) I can understand that the imperative to reflect
> reality would want to mention this. But would it kill anyone to say "This is
> bad practice, even though it is allowed"? The smaller HTML 5 draft mentions
> this under the name misnested tags, but nowhere is this deprecated

HTML 5 doesn't deprecate anything. HTML 4, XHTML 1.0, etc. took the tack of having separate "Transitional" and "Strict" versions. In practice, most authors just ignored the existence of Strict.

HTML 5 rectifies this by having two different types of requirements: authoring requirements and implementation requirements. Implementations are required to handle misnested tags, but authors are prohibited from using them. Malformed markup like <b><i></b></i> will fail to validate, just as ever.

> The history of HTML 5 is of course that it was imposed on W3C to some extent,
> as a rebellion driven by the minor vendors

Well, only if "minor vendors" means "everyone except IE". The original composition of the WHATWG was Mozilla, Apple, and Opera -- i.e., every significant browser vendor except Microsoft. The key thing is that HTML 5 has been driven by browser vendors who actually want to ship new features soon, and not standards purists who care more about adhering to ideals than making something that anyone can realistically ship support for.

> But deciding to not follow some standard in order to be compatible with
> reality is difficult ground for a specification: if you don't follow others'
> standards, how can you expect anyone to follow yours? Becoming an alternative
> and contradictory source of the definition of a technology is the very thing
> that reputable standards bodies work hard to avoid: not just because of turf
> wars but also because it confuses users, and it goes against order and due
> process.

HTML 5 normally violates standards only when the non-standard behavior is so entrenched that the major browser vendors would refuse to ship anything that adheres to the standard. They normally do this because they know from experience that if they break a significant number of sites, most users will simply refuse to upgrade to the new browser version. At that point, you really have no choice but to give up and write an exception to the other standards.

> And they are adopting XML-style /> empty element syntax

No, they're not. The HTML 5 standard specifies that in the HTML serialization, the slash in /> is conforming (if used on an element that would normally have no closing tag) but ignored. This matches browser behavior. <div/> is non-conforming, and browsers must treat it as <div> rather than <div></div>. Of course, there's a whole XML serialization you can use, if you want things like />.
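A quick sketch of that distinction, as I read the draft:

    <br/>        <!-- conforming: br is a void element, so the slash is allowed but ignored -->
    <div/>text   <!-- non-conforming: parsed as an open <div> containing "text", not an empty div -->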

Aryeh: You write

HTML 5 rectifies this by having two different types of requirements: authoring requirements and implementation requirements. Implementations are required to handle misnested tags, but authors are prohibited from using them. Malformed markup like <b><i></b></i> will fail to validate, just as ever.

Where does it say this? I can't see where there is any prohibition. (I would welcome such a thing.) But if it is prohibited, how come you say

HTML 5 doesn't deprecate anything.

Which is it? Prohibition or no deprecation?

HTML 4, XHTML 1.0, etc. took the tack of having separate "Transitional" and "Strict" versions. In practice, most authors just ignored the existence of Strict.

Isn't Strict and Transitional more to do with the schema — the elements allowed — than with the syntax — misnested tags? That most authors ignore Strict may mean that for a chunk of the population, the option of doing it in CSS is overkill or unfeasible. For example, I have no control over the stylesheets on this blog. If I want to get effects, I basically have to hardcode them. I have to do it even to set headings to be bold, for example, using explicit <b> tags.

That people need to use workarounds with low-level tags says nothing about the advisability of non-nested tags.

Well, only if "minor vendors" means "everyone except IE".

I see the latest at MarketShare.com has IE at 2/3 of the market and Firefox around 2/9. So IE has 35 to 65 times the market share of the #3, #4, & #5. And 3x the share of the #2. And twice as many as all its competitors combined. Minor seems an appropriate word. It isn't intended as an insult, but it is a fair characterization I think. It is a dominated market.

> Where does it say this? I miss where there is any prohibition. (I would
> welcome such a thing.)

The rules for writing HTML documents are in section 9.1:

http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#writing

The rule against misnesting isn't stated explicitly, but it follows from the text:

"The contents of the element must be placed between just after the start tag (which might be implied, in certain cases) and just before the end tag (which again, might be implied in certain cases). The exact allowed contents of each individual element depends on the content model of that element, as described earlier in this specification."

The various content models are all defined in terms of the DOM, so they consist of zero or more elements. Therefore proper nesting is required: every element below the root must be part of some other element's contents.

You can confirm that this is the intended interpretation by inputting some misnested HTML (like <b><i>Foo</b></i>) into http://validator.nu/ and observing the errors it raises.

> But if it is prohibited, how come you say
>> HTML 5 doesn't deprecate anything.
> Which is it? Prohibition or no deprecation?

By definition, something that's deprecated is still allowed, just discouraged. Misnested tags are not deprecated in HTML5; they're simply prohibited. User agents are still required to handle them, for pragmatic reasons, but any document that contains them doesn't conform to the HTML5 standard. It will not validate in an HTML5 validator, for instance.

> Isn't Strict and Transitional more to do with the schema — the elements
> allowed — than with the syntax — misnested tags?

Yes. The same principle applies, though. HTML5 doesn't deprecate anything. It says that elements like <font> are prohibited as well -- but user agents must still support them. This approach has allowed HTML5 to be far more aggressive than any previous HTML spec in defining UA behavior -- it can do so without allowing authors to use such markup.

> That most authors ignore Strict may mean that for a chunk of the population,
> the option of doing it in CSS is overkill or unfeasible. For example, I have
> no control over the stylesheets on this blog. If I want to get effects, I
> basically have to hardcode them. I have to do it even to set headings to be
> bold, for example, using explicit <b> tags.

<b> is still part of HTML5 (as it's part of XHTML1.1 and XHTML2, IIRC). If the software you're using doesn't let you use CSS, then it might be impossible to achieve a particular visual effect while outputting a conforming HTML5 document. (This page is XHTML1-as-text/html, so it's not a conforming HTML5 document anyway.)

> I see the latest at MarketShare.com has IE at 2/3 of the market and Firefox
> around 2/9. So IE has 35 to 65 times the market share of the #3, #4, & #5.
> And 3x the share of the #2. And twice as many as all its competitors
> combined. Minor seems an appropriate word. It isn't intended as an insult,
> but it is a fair characterization I think. It is a dominated market.

I would not call a 20%+ market share "minor", personally. Firefox is a large minority. Safari/Chrome/Opera are smaller minorities, but big enough that a lot of web designers/developers care about at least one of them. The minor browsers, to me as a web developer, would be things like Konqueror, iCab, or lynx.

Aryeh: No, a kind of implied requirement is no requirement at all.

For example, the requirement that the contents of an element go between its start and end tag seems carefully worded to omit any sense of nesting. That the validator gives a particular message does not resolve anything: the text is the specification. That the nodes get shifted around to make the DOM reflect the supposed intent says nothing about conformance either.

Now what you are saying might be something that will make its way back into the spec. But it is not there. These things need to be explicit, not imagined or arguable.

You're right; however, the parsing section does detail which parsing steps are errors and which are not. If an error is encountered while parsing, this is called out (a "parse error"). This has an effect on conformance, as stated by the specification.

Anne: Thanks for the comment.

The misnested tags are not marked as parse errors in the current draft, are they? (I hope you will be patient with me, since it is a big spec.)

Is there any chance they could be? Who would it harm?

They are marked as parsing errors as part of the parsing algorithm.

Trust Anne above me on this; I haven't tried to read the HTML parsing section yet. It looks like this is what you're looking for:

http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#adoptionAgency

If you work through all the definitions, that ends up saying that if you run into an end tag for a/b/big/.../u, and there is a matching unclosed start tag, which is in scope, then "If the element is not the current node, this is a parse error." This indicates that if a parser reaches this state, the page is non-conforming. If you follow the "parse error" link, you'll find that a validator is required to report an error in this case.

Aryeh: Trust and verify!

I'd like to see HTML 5 allow processing instructions too, to cope with the server-side HTML. Now that is what I would call supporting reality! I suppose it shows that the vendor/developers who are driving HTML must be the browser vendors in particular. This client-side focus is lop-sided.

I'm not sure what you mean by "supporting reality" here. Nobody implements PIs, to my knowledge. None of the browsers have expressed interest in implementing them (usually a good proxy for users, as they get bug reports on things they're lacking), and you're the first person I've ever seen ask for them (I'm sure you're not the only one, but you're certainly a very small minority).

That's precisely the sort of thing that HTML5 is trying to *avoid* putting in the language - minority features with little to no popular support. If nobody's trying to hack around a lack in browsers, is it really a lack? Does it really need to be addressed?

Tab: No-one implements PIs?

ASP has them, using <% ... %> syntax.
JSP has them, using <% ... %> syntax.
MLSP has them, using the <? ... ?> syntax.
PHP has them, using <? ... ?> syntax and allowing <% ... %> syntax.
PSP has them, using the <% ... %> syntax.

And so on.

And why mention browsers? Didn't the phrase "server-side" give anything away? Sorry if I am not being clear here. But HTML is also the thing at the other end.

The browser-side implementation of PIs would presumably be simple. Ignore the markup and use the normal rules. The model would be that PIs are to be consumed by the sender.
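As far as I can tell from the tokenizer in the draft, something PI-shaped already disappears harmlessly (it becomes a bogus comment), so ignoring it is close to what happens now. For example:

    <p>Hello <?server-widget mode="fancy"?> world.</p>
    <!-- a parser following the draft turns the <?...?> construct into a
         comment node, so nothing is rendered in its place -->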

Are you saying that PHP, ASP, etc. actually use HTML parsers in their interpreter? I think that's false.

Anne: So by what magic do the bytes in the ASP, JSP, PHP, etc. files get inside the computer, if they don't use a parser? It may not be a full parser, but of course they use a parser. You are kidding, right?

My point is that if they do not use an HTML parser they are as relevant to the HTML specification as Perl regular expressions trying to extract something out of an HTML document.

Anne: Yes, but that is a circular argument, it seems to me.

That they have to make up home made partial systems: how could they have any choice if the spec ignores their requirements?

It seems to me that if the HTML 5 WG is saying "We justify ourselves because we reflect vendors and reality", but then ignores 100% of the supply-side vendors (I mean the specific markup features they actually use in reality), then the HTML 5 WG is not living up to its own goals.

"Formal Objection by One Vendor, One Veto" is misleading as that is not at all what the URL you are pointing to says. It is an email about a Formal Objection _to_ "One Vendor, One Veto".

As for your comments on SGML. Those things are disallowed, but clients are still required to process them so they can interoperate with broken Web content, such as this page :-)

Anne: I will change the text to match: just sloppy transcription put in "by" instead of "to", I don't think it makes any difference to the point. (And I don't necessarily endorse the point of that post anyway, but it clearly is notable: especially since the issue of balance of interests, and the need for regulatory involvement in trying to get it and defending the public against attempts to cartelize standards bodies, have been a long-standing issue in this blog.)

You mention "Those things are disallowed". Where?

As part of the parsing algorithm. That is the section named "Parsing HTML documents".

Anne: I see it now:

9.2.5.10 The "in body" insertion mode ... An end tag whose tag name is one of: "a", "b", "big", "code", "em", "font", "i", "nobr", "s", "small", "strike", "strong", "tt", "u" ...

1. Let the formatting element be the last element in the list of active formatting elements that:

   * is between the end of the list and the last scope marker in the list, if any, or the start of the list otherwise, and
   * has the same tag name as the token.

   If there is no such node, or, if that node is also in the stack of open elements but the element is not in scope, then this is a parse error; ignore the token, and abort these steps.

   Otherwise, if there is such a node, but that node is not in the stack of open elements, then this is a parse error; remove the element from the list, and abort these steps.

   Otherwise, there is a formatting element and that element is in the stack and is in scope. If the element is not the current node, this is a parse error. In any case, proceed with the algorithm as written in the following steps.
...

Glad to be wrong!

But another confusion came to me when reading. Section 9.2 on parse errors has a note saying that it was confusing that validators were based on SGML rules and so reported parse errors that HTML browsers did not report. (I don't know how the "decades of productivity" lost were calculated: I presume there was some credible study including the benefits of strict validation as well as the costs, not just some emotive ranting. I expect it has just been left out of the references by mistake.)

Which errors are they? It cannot be misnested tags, because as you say these are still a parse error, so a validator would still report errors that an existing HTML application would not report. (And, in any case, these would be errors against the HTML 4 DTD's rules rather than against SGML: if the desire was merely to make validators shut up about unclosed or unopened formatting tags, they could have been marked with - O in the DTD as end-tag omissible.)

All browsers that implement application/xhtml+xml allow the use of XML PIs instead of link tags to apply CSS stylesheets.

I like to use small XHTML 1.1 pages with XML PIs to apply both CSS and XSLT. A link in the head points to an Atom source file, which is pulled through an XSLT transformation, becoming the body of the XHTML document.
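The prolog of such a page looks something like this (the file names are invented for illustration):

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/css" href="page.css"?>
    <?xml-stylesheet type="text/xsl" href="atom-to-xhtml.xsl"?>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head><title>Shell page</title></head>
      <body><p>The Atom feed supplies the real content.</p></body>
    </html>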

But I can't, because using XML PIs for XSLT transformation is currently only implemented in Safari, Chrome, Opera and Firefox, and not in IE.

I believe Microsoft has been hearing the grumbles about their failure to support application/xhtml+xml, and XML PIs, for years. If I want to do cross-browser XSLT, I have to use something like sarissa.js to cope with IE's failure to support XML PIs. If there weren't demand to use XSLT in browsers despite this failure, libraries like Sarissa wouldn't exist.

So, aside from IE, XML PIs were definitely implemented in enough browsers to show popular demand, and yes, Web Developers have been hacking around a lack of this feature for years -- while wasting our breath trying to get the application/xhtml+xml message across to Microsoft.

Eric: Yes, if Microsoft does not support Clark's stylesheet PIs, I hope they will. It is rather late.

But supporting PIs in HTML is what I am a little more interested in.

I guess it comes down to whether "<?foo ... ?>" gets output as the literal text "<?foo ... ?>".

The other issue is that PIs are used in places where entity references are allowed, such as in attribute values (in SGML this is no conceptual problem: a reference to an entity containing a PI).

And furthermore, some systems allow nesting of these constructs, where they are really argument grouping or macro expansion parameters. There is nothing like that in SGML at all. But since they are all nested in data, there does not need to be any analog in SGML for them to be put into HTML.
