What is a Packrat Parser? What are Brzozowski Derivatives?

By Rick Jelliffe
March 10, 2010 | Comments: 2

I see the upcoming Scala 2.8.0 comes with a packrat parser.

Packrat parsing is an implementation technique where you have a parser with memo-ization to avoid re-parsing.

For example, say I have this grammar (I will say expressions):

X ::=  a, (Y, e) | (b?, Y, f)
Y ::=  c, d

(It has a local ambiguity at the c.)

And I want to parse this input string: a c d f.

The packrat parser will try to parse the string against a, (Y, e) which fails when the parser gets to the e. A conventional non-backtracking parser at this stage would fail: it might report that the grammar was the wrong class for such a parser (in the SGML/XML tradition, this is called ambiguity which amplifies the formal computer science notion, but it would be better to say non-deterministic.) A conventional backtracking parser would go back to the a, then start again to parse c d f on the next branch (b?, Y, f). The packrat parser remembers it has already found a Y, and so it does not need to parse the c d again, and can start parsing from the f.

This way a packrat parser can parse in linear time, though the tradeoff is in memory usage for all this memo-ization. Packrat parsers can be used on more than Regular Grammars: they can be used for Parsing Expression Grammars.


You could contrast packrat parsers with the implementation technique used in most RELAX NG parsers: derivatives. As I understand it, a parser implemented using derivatives rewrites the grammar after parsing an input token to a new grammar which represents what may appear next. The immediate step of this grammar is unambiguous. Here is a rough example which I hope is in the ballpark.

In our example above, when the parser find the a token it accepts it and derives the expressions

X ::=  (Y, (e | f)) | (b, Y, f)
Y ::=  c, d

When if finds the c token it accepts it and derives the expressions

X ::=  Y, (e | f)  
Y ::=   d

When if finds the d token it accepts it and derives the expression

X ::=   e | f

And then it can consume the f token and accepts the input.

The derivatives parser does no backtracking or memo-ization: the work it does is rewriting the grammar. It can handle ambiguous regular grammars. I do not know of any material that compares the performance of dynamic re-writing rather than pre-compiled re-writing. (I haven't thought through whether the latter would be subject to combinatorial explosions: it might be useful for some grammar patterns only.)

(It is interesting that James Clark's 2002 Jing validator combines derivatives with limited memo-ization to handle a couple of particular cases. His Jing parser works at the event level, e.g. start tag, end tag, rather than on a DOM per se.)

You might also be interested in:


While not exactly a packrat parser, I developed a caching XML parser. It wraps a .NET xmlreader and buffers and caches everything. Extension methods give access to the XML already parsed. It is in public open beta at www.xponentsoftware.com/cax.aspx

Bill: I like the idea of CAX. Your XML editor has a nice feature set too: XPath with arbitrarily large documents in constant memory is pretty cool.

News Topics

Recommended for You

Got a Question?