I see the upcoming Scala 2.8.0 comes with a packrat parser.
Packrat parsing is an implementation technique in which a backtracking parser memoizes its intermediate results to avoid re-parsing the same input with the same rule.
For example, say I have this grammar (I will say expressions):
X ::= a, ((Y, e) | (b?, Y, f))
Y ::= c, d
(It has a local ambiguity at the c.)
And I want to parse this input string: a c d f.
The packrat parser will try to parse the string against a, (Y, e), which fails when the parser gets to the e. A conventional non-backtracking parser would fail at this stage: it might report that the grammar is the wrong class for such a parser. (In the SGML/XML tradition this is called ambiguity, which overloads the formal computer-science notion; non-deterministic would be a better word.) A conventional backtracking parser would go back to just after the a, then start again to parse c d f against the other branch, (b?, Y, f). The packrat parser remembers that it has already found a Y at that position, so it does not need to parse the c d again and can resume at the f.
This way a packrat parser can parse in linear time, though the tradeoff is the memory used for all this memoization. Packrat parsers can handle more than Regular Grammars: they can be used for Parsing Expression Grammars.
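To make the memoization concrete, here is a minimal packrat-style sketch in Python rather than Scala (the names here are my own illustration, not the Scala 2.8.0 API). Results for Y are cached by input position, so when the first alternative fails at the e, the second alternative reuses the cached Y instead of re-reading c d.

```python
# Packrat-style (memoizing) recursive-descent parser for the grammar:
#   X ::= a, ((Y, e) | (b?, Y, f))     Y ::= c, d
# Results for Y are cached by input position, so after the first
# alternative fails at `e`, the second alternative reuses the cached
# parse of Y instead of re-reading `c d`.

def make_parser(tokens):
    memo = {}          # (rule name, start position) -> end position or None
    calls = {"Y": 0}   # counts real (non-memoized) parses of Y

    def lit(t, pos):
        # match one literal token; return the next position, or None on failure
        return pos + 1 if pos < len(tokens) and tokens[pos] == t else None

    def Y(pos):
        key = ("Y", pos)
        if key in memo:
            return memo[key]       # memo hit: no re-parsing
        calls["Y"] += 1
        p = lit("c", pos)
        p = lit("d", p) if p is not None else None
        memo[key] = p
        return p

    def X(pos):
        p = lit("a", pos)
        if p is None:
            return None
        # first alternative: Y, e
        q = Y(p)
        q = lit("e", q) if q is not None else None
        if q is not None:
            return q
        # second alternative: b?, Y, f  -- Y at this position is cached
        r = lit("b", p)
        r = Y(p if r is None else r)
        return lit("f", r) if r is not None else None

    return X, calls

X, calls = make_parser(["a", "c", "d", "f"])
print(X(0))        # 4: all four tokens consumed
print(calls["Y"])  # 1: the c d was parsed only once
```

A plain backtracking parser would report two parses of Y here; the memo table cuts that to one, which is the whole point of the technique.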
Derivatives

You could contrast packrat parsers with the implementation technique used in most RELAX NG parsers: derivatives. As I understand it, a parser implemented using derivatives rewrites the grammar, after accepting each input token, into a new grammar that represents what may appear next. The immediate next step of this rewritten grammar is unambiguous. Here is a rough example which I hope is in the ballpark.
In our example above, when the parser finds the a token, it accepts it and derives the expressions
X ::= (Y, (e | f)) | (b, Y, f)
Y ::= c, d
When it finds the c token, it accepts it and derives the expressions
X ::= Y, (e | f)
Y ::= d
When it finds the d token, it accepts it and derives the expression
X ::= e | f
And then it can consume the f token and accept the input.
The derivatives parser does no backtracking or memoization: the work it does is rewriting the grammar. It can handle ambiguous regular grammars. I do not know of any material that compares the performance of this dynamic rewriting with pre-compiled rewriting. (I haven't thought through whether the latter would be subject to combinatorial explosion: it might be useful only for some grammar patterns.)
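The rewriting steps above can be sketched as a toy Brzozowski-style derivative matcher in Python. This is my own illustration, not the algorithm as actually implemented in RELAX NG parsers; all the constructor names are assumptions. The smart constructors seq and alt do the simplification that keeps each rewritten grammar small.

```python
# Toy Brzozowski-style derivative matcher for the example grammar:
#   X ::= a, ((Y, e) | (b?, Y, f))     Y ::= c, d
# Accepting a token rewrites the grammar into the grammar of
# whatever may legally come next.

EMPTY = ("empty",)   # matches nothing at all
EPS = ("eps",)       # matches the empty string

def tok(t):
    return ("tok", t)

def opt(g):
    return ("opt", g)

def seq(a, b):
    # smart constructor: simplify as we rewrite
    if EMPTY in (a, b):
        return EMPTY
    if a == EPS:
        return b
    if b == EPS:
        return a
    return ("seq", a, b)

def alt(a, b):
    if a == EMPTY:
        return b
    if b == EMPTY:
        return a
    return ("alt", a, b)

def nullable(g):
    # can g match the empty string?
    tag = g[0]
    if tag in ("eps", "opt"):
        return True
    if tag in ("empty", "tok"):
        return False
    if tag == "seq":
        return nullable(g[1]) and nullable(g[2])
    return nullable(g[1]) or nullable(g[2])   # alt

def deriv(g, t):
    # the grammar remaining after g consumes token t
    tag = g[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "tok":
        return EPS if g[1] == t else EMPTY
    if tag == "opt":
        return deriv(g[1], t)
    if tag == "alt":
        return alt(deriv(g[1], t), deriv(g[2], t))
    # seq: derive the head; if the head is nullable, also derive the tail
    d = seq(deriv(g[1], t), g[2])
    return alt(d, deriv(g[2], t)) if nullable(g[1]) else d

Y = seq(tok("c"), tok("d"))
X = seq(tok("a"), alt(seq(Y, tok("e")),
                      seq(opt(tok("b")), seq(Y, tok("f")))))

def matches(g, tokens):
    for t in tokens:
        g = deriv(g, t)
    return nullable(g)

print(matches(X, ["a", "c", "d", "f"]))   # True
print(matches(X, ["c", "d", "f"]))        # False: this grammar requires the a
```

After deriving with respect to a, the grammar held in g is exactly (Y, e) | (b?, Y, f), which is the rewritten X shown in the walkthrough above.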
(It is interesting that James Clark's 2002 Jing validator combines derivatives with limited memoization to handle a couple of particular cases. His Jing parser works at the event level, e.g. start tag and end tag, rather than on a DOM per se.)