Replacing BNF with RELAX NG in standards?

"A" / "E" "BNF" or 'A' | 'E' 'BNF'

By Rick Jelliffe
July 5, 2009 | Comments: 3

ISO/IEC 14977:1996 - Information technology - Syntactic metalanguage - Extended BNF is the official ISO standard for describing grammars with BNF. (The standard is available free from the ISO public site.)

IS 14977 is showing its age, particularly in that it only supports ASCII characters, so standards people have switched over to 2008's Augmented BNF (RFC 5234, which obsoletes 2005's RFC 4234, which obsoletes 1997's RFC 2234, which obsoletes parts of 1977's RFC 733 and 1982's RFC 822). ABNF is no good if you need to specify sequences of whitespace (it has a collapse semantic) because it is targeted at tokenized, free-format notations. One of ABNF's distinguishing features is that it uses "/" to specify alternatives, rather than the "|" that EBNF uses (as do DTDs and RNC); ABNF only allows double quotes for delimiting strings while EBNF allowed single or double quotes (like XML attribute value delimiters).
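To make the notational difference concrete, here is a minimal Python sketch that rewrites ABNF's "/" alternation as "|" wherever it appears outside a double-quoted string. The function name and the sample rule are made up for illustration, and the sketch deliberately ignores everything else that differs between the two notations (comments, concatenation, rule terminators, numeric values).

```python
def abnf_alternation_to_bar(rule: str) -> str:
    """Replace "/" with "|" outside double-quoted ABNF strings.

    Assumption: the rule contains no comments or prose values, so a
    double quote always opens or closes a string literal.
    """
    out = []
    in_string = False
    for ch in rule:
        if ch == '"':
            in_string = not in_string  # ABNF strings have no escapes
            out.append(ch)
        elif ch == "/" and not in_string:
            out.append("|")
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical rule: a separator that may itself be a slash.
print(abnf_alternation_to_bar('sep = "/" / "-" / "."'))
```

The quoted "/" survives untouched, which is the only subtle part: the operator and the literal share a character, so a blind search-and-replace would corrupt the rule.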

There has been a little more interest in standardizing non-XML syntaxes recently. The WikiCreole pages are an agreement on a common form for Wikis (it seems that the markup for italics was the biggest real problem.) JSON arguably doesn't need standardizing, because it is explicitly based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition - December 1999.

James Clark mentioned M (a language for creating Domain Specific Languages) earlier in the year. And, indeed, his SP and NSGMLS were SGML processors that allowed DSL parsers to be created, using the SGML facilities for parsing embedded notations.

Just this week, preparing for a training job, I have been looking at the OmniMark 4GL again, which has a pipeline of communicating co-routines, often configured first as a tokenizer that creates tags from strings in the input text, then a grammar (XML/SGML) parser that allows processing. Think a nicer version of Perl at the front, and a different version of XSLT at the back, where they both can set modes for each other.

So this brings up another approach for standards-makers rather than using ABNF. This is to specify the grammar for your notation using a familiar XML schema language (such as RELAX NG Compact Syntax) and then give parsing rules for converting from the non-XML form to the XML form. (Indeed, this is how RELAX NG Compact Syntax itself is specified.)
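As a rough sketch of what this could look like, take a made-up notation: dates written as YYYY-MM-DD. The grammar of the XML form is given in RELAX NG Compact Syntax (held here only as a string, not validated by this code), and the "parsing rule" is a small Python function that converts the text form into that XML form. The schema, the element names and the function name are all assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical RELAX NG Compact grammar for the XML form of the notation.
# It is shown for reference only; this sketch does not run a validator.
DATE_RNC = """
element date {
  element year  { xsd:integer },
  element month { xsd:integer },
  element day   { xsd:integer }
}
"""

# Parsing rule: four digits, hyphen, two digits, hyphen, two digits.
_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def date_to_xml(text: str) -> ET.Element:
    """Convert the non-XML form into the XML form the schema describes."""
    match = _DATE.fullmatch(text)
    if match is None:
        raise ValueError(f"not in YYYY-MM-DD form: {text!r}")
    date = ET.Element("date")
    for name, value in zip(("year", "month", "day"), match.groups()):
        ET.SubElement(date, name).text = value
    return date


print(ET.tostring(date_to_xml("2009-07-05"), encoding="unicode"))
# <date><year>2009</year><month>07</month><day>05</day></date>
```

Once the text is in XML form, any off-the-shelf RELAX NG validator can check it against the schema, and the schema can carry constraints (datatypes, ranges via a further layer) that the parsing rule does not need to know about.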

The trouble with ABNF and EBNF is that they do not have the kind of ubiquitous, free tools around to support them that XML has. When you cannot test a grammar, there is every chance that you will make a mistake. (A trap RDF fell into originally, IIRC.) And even if the standards-makers get it right, the users have to check by eye, which is not the most reliable method. Bringing the notation into the XML eco-system has obvious advantages for low-hanging fruit.

Imagine, for example, if XPath had been defined as a parser that converts into a standard XML format. We could be using Schematron to validate XPaths!
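No such standard XML format for XPath exists, so the following Python sketch invents one purely for illustration. It handles only a tiny subset (absolute paths made of child steps with simple name tests, such as /book/chapter/title), and the element and attribute names (path, step, axis, name) are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Simplified name test: ASCII names only, no prefixes or wildcards.
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def simple_path_to_xml(path: str) -> ET.Element:
    """Convert an absolute child-axis path into a made-up XML form."""
    if not path.startswith("/"):
        raise ValueError("only absolute paths are handled in this sketch")
    root = ET.Element("path", {"absolute": "true"})
    for name in path[1:].split("/"):
        if _NAME.fullmatch(name) is None:
            raise ValueError(f"unsupported step: {name!r}")
        ET.SubElement(root, "step", {"axis": "child", "name": name})
    return root


xml_form = simple_path_to_xml("/book/chapter/title")
# The step names are now ordinary attribute values that a rule can test.
print([step.get("name") for step in xml_form.findall("step")])
# ['book', 'chapter', 'title']
```

With the path in this form, a Schematron rule could assert things about it in the usual way, for example that every step name is one that the target vocabulary actually declares.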

It seems natural for humans to partition off common changes of domain into changes of notation: C-style syntax won over LISP-style syntax. HTML has CSS and JavaScript and dates and RGB etc. SGML gave support for this, but in a way that was not layered enough to be sustainable over time (non-layerable technologies rarely last, falling apart under their own farinaceous weight.)

Many languages are two-layer: there is a lexical analyser to produce tokens, then a grammar language that works on these tokens. Rather than using ABNF, I think there is a current sweet spot for using XML and XML schema languages (such as the ISO DSDL suite of RELAX NG, CRDL, and Schematron, or even XSD) for specifying the underlying grammar.
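The two layers can be sketched in Python, assuming a toy notation of integers separated by arithmetic operators. The first layer turns text into an XML token stream; the second checks the token sequence against a grammar. In the approach described here the second layer would be a RELAX NG schema run through a real validator; the regular expression over tag names below is only a stand-in so that the sketch stays within the standard library. All names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Layer 1: lexical rules. Each token is a number or an operator,
# optionally preceded by whitespace.
_TOKEN = re.compile(r"\s*(?:(?P<num>[0-9]+)|(?P<op>[-+*/]))")


def tokenize_to_xml(text: str) -> ET.Element:
    """Turn the input text into <tokens> with <num> and <op> children."""
    tokens = ET.Element("tokens")
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        kind = match.lastgroup  # "num" or "op"
        ET.SubElement(tokens, kind).text = match.group(kind)
        pos = match.end()
    return tokens


# Layer 2: grammar over tokens. In RELAX NG Compact Syntax this might read
#   element tokens { num, (op, num)* }
# Here a regex over the tag names stands in for the schema validator.
_GRAMMAR = re.compile(r"num( op num)*")


def matches_grammar(tokens: ET.Element) -> bool:
    kinds = " ".join(child.tag for child in tokens)
    return _GRAMMAR.fullmatch(kinds) is not None


print(matches_grammar(tokenize_to_xml("1 + 2 * 3")))  # True
print(matches_grammar(tokenize_to_xml("1 + + 2")))    # False
```

The point of the split is that the tokenizer knows nothing about which sequences are legal, and the grammar knows nothing about characters, so each layer can be specified and tested on its own.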

(Hat-tip David Carver for spurring this.)

Take a look at these:

Not a wealth, I would admit, but just in case you'd missed them.


(Since this article ranks high on a google search with "relaxng abnf", it's worth adding a comment this much after the article was published...)

I wrote the specs for ABNF -- kudos for the cite all the way back to RFC 733.

One technical point should be clarified:

"ABNF is no good if you need to specify sequences whitespace (it has a collapse semantic) because it is target at tokenized, free-format notations."

ABNF has some built-in constructs that do, indeed, collapse white space. These are extremely useful for many of the syntax specifications that use ABNF.

However these are not inherent in ABNF use. They are specific ABNF "library" rules and you can choose not to use them.

Oh what the heck. Might as well clarify:

"ABNF only allows double quotes for delimiting strings while EBNF allowed single or double quotes (like XML attribute value delimiters.)"

ABNF also supports specifying characters and strings numerically, in any of the usual number bases.


@Gareth: That gives a checker for EBNF and a generator for ABNF. Surely there is more than that!

@Dave: Thanks for this. ABNF is a very useful tool to have available.

I don't see any use of the term "library" in RFC 5234: which features are you referring to?

Also, the numeric codes don't identify characters but code points, in the current encoding, it seems to me. (This is like SGML/HTML in 1995, so it is a symptom of its age. Unicode-based, encoding-independent numeric references would be much better nowadays, I think.)
