DRMacIver's Notebook

Published: 2018-08-31

I’m a big fan of the Brzozowski derivative, introduced in “Derivatives of regular expressions” by Janusz A. Brzozowski.

The basic idea is that given some language \(L\) over an alphabet \(A\), and some string \(u\) over \(L\), you can define the derivative language \(\partial(L, u) = \{v: uv \in L\}\). We can extend this further (and it will be useful to do so below). If \(M\) is some other language, we can define \(\partial(L, M) = \{v: \exists u \in M, uv \in L\}\). I’m not currently sure if the derivative of a regular language by a regular langauge is regular in general. It is in the case we’ll see later, and I suspect it is in general.

This seems like a pretty trivial observation until you realise the following three things:

\(u \in L\) if and only if \(\epsilon \in \partial(L, u)\)
\(uv \in L\) if and only if \(v \in \partial(L, u)\)
For most common representations of languages, it’s actually pretty easy to calculate a representation of their derivative.

Putting these together, you can use the Brzozowski derivative to calculate a deterministic (not necessarily finite!) automaton for almost any language that you can easily represent. You label states with descriptions of languages, a state is accepting if it matches the empty string, and transitions to the states labelled by the derivatives.

Regular-expression derivatives reexamined by Owens et al. has some nice practical details of doing this in the context of functional programming.

To see this in action, consider the standard regular expression operators. These satisfy the following identifies:

\(\partial(A | B, u) = \partial(A, u) | \partial(B, u)\)
\(\partial(AB, u) = \partial(A, u)B | \nu(A) \partial(B, u)\), where \(\nu(A) = \epsilon\) if \(\epsilon \in A\) or \(\emptyset\) otherwise (i.e. the derivative can skip over \(A\) if and only if \(A\) contains the empty string)
\(\partial(A^*, u) = \partial(A, u) A^*\)

A result proved in Brzozowski’s original paper (apparently. I can’t currently seem to access it, and am going off thecite in “Regular-expression derivatives reexamined) is that a small number of reasonable normalisation rules over the representation of the language is enough to ensure that you only get finitely many states in the state machine generated by partial derivatives of regular expressions. It’s certainly true that you only get finitely many if you have full equivalence for the regular languages labelling the states - the derivative automaton is actually the minimal automaton representing a language.

There are two very nice things about this representation of the language’s automaton though:

It can be done lazily. This means that even when your deterministic automaton has exponentially (or infinitely!) many states, you only ever need to explore the states that you walk when matching strings.
It is very easy to extend with new operators.

An example of (2) is that regular expressions reexamined actually does it for extended regular expressions with intersection and negation, because might as well right? It’s no harder than doing it with the normal ones, even though adding these to your regular expression language can cause exponential blowup in the size of the automata compiled from your regex.

But there are even more interesting ones if you’re prepared to go for more esoteric operations!

Have you heard of the Levenshtein automaton? The set of strings within some finite edit distance of another string is a regular language and you can define a nice automaton matching it. But in fact, a stronger result is true: For any regular language \(L\) and natural number \(n\), the set \(E(L, n) = \{u: \exists v \in L, d(u, v) \leq n\}\) is a regular language. Why?

Well, we can calculate its derivative! The derivative of \(E\) is \(\partial(E(L, n), u) = E(\partial(L, u), n) | E(L, n - 1) | E(\partial(L, \cdot), n - 1) | \partial(E(\partial(L, \cdot), n - 1), u)\). That is, at each character we can either:

Continue matching the original language (cost 0).
Insert a new character in front of something in the original language (cost 1)
Replace a character in the original language with \(u\) (cost 1)
Drop a character from the original language and try again (cost 1)

In the course of doing this we apply the following rewrite rules:

\(E(L, 0) = L\)
\(E(\emptyset, n) = \emptyset\)

As long as the number of reachable representations for the original languages is finite, so is the number of reachable states in our Levenshtein construction: Every state is labelled by a set of languages of the form \(E(\partial(L, U), k)\) where \(U\) is a language defined by \(u_1 \ldots u_m\) with each \(u_i\) either a single character or a \(\cdot\), and \(m + k \leq n\). There are only finitely many such labels as long as there are only finitely many derivatives of \(L\), although in principle there may be exponentially many. Because of the laziness of our construction that often won’t matter - you can still determine membership for a string of length \(k\) with only \(O(k)\) state traversals (though calculating those states could in principle require up to \(O(nm)\) work, where \(m\) is the number of states in the original automaton).

You can also use this to determine the minimum edit distance between two regular languages, because you can test whether \(E(L, n) \cap L' = \emptyset\) by calculating and walking the generated DFA for the left hand side, so this gives you a decision procedure for \(d(L, L') \leq n\).

Is this a practical algorithm? Not sure. I’ve played with it a little bit, but I’ve not really put it to the test, but I think it’s an interesting example of the flexibility of the Brzozowski derivative, and it was at least mildly surprising to me that the edit ball of a regular language is itself regular.