DRMacIver's Notebook

Automatic Boltzmann Sampling for Context Free Grammars

Automatic Boltzmann Sampling for Context Free Grammars

A Boltzmann sampler of parameter \(x\) for some class of combinatorial objects (e.g. strings in some language) is a random sampler that picks an object of size \(n\) with probability proportional to \(x^n\).

Ages ago I figured out how to do Boltzmann Sampling for arbitrary regular languages, by computing the deterministic finite automaton for the language and doing some linear algebra.

Its major limitation was basically that it required computing the whole deterministic finite automaton, which may be exponentially large. Implicitly this also meant that it doesn't work on anything other than regular languages, because you have to have finitely many states.

Anyway, shower thought for this morning is that I can lift both these restrictions and automatically derive a Boltzmann sampler for any context free language, as long as \(|x| < \frac{1}{|A|}\) (where \(x\) is the Boltzmann parameter and \(A\) is the alphabet of the language). This condition isn't too onerous and is necessary to guarantee that a Boltzmann sampler exists for any language over that alphabet, though some languages may have Boltzmann samplers with larger parameter values.

The technique is actually very simple: Much simpler than the previous one in some ways, although partly because the complexity lives in black boxes.

It works by breaking it down into two parts:

  1. Pick the size of the target string according to the right distribution.
  2. Sample uniformly among strings of that size.

So all we need to do is figure out how to do these things.

In order to do the first, let \(c(n)\) be the number of strings in the language of length \(n\). Necessarily \(c(n) \leq |A|^n\) because there are at most \(|A|^n\) strings of length \(n\), let alone in the language.

The Boltzmann sampler distribution is that we pick a string of length \(n\) with probability proportional to \(c(n) x^n\).

We can thus pick the length as follows: Pick a real number \(z\) uniformly at random in the region \([0, B(x)]\), where \(B(x) = \sum c(n) x^n\) and choose the first \(n\) such that \(\sum\limits_{k=0}^n c(k) x^k \geq z\).

Unfortunately we don't know this infinite sum. You can calculate it relatively easy for some classes of language, but if the grammar for your language is ambiguous that gets harder.

However! We don't actually need to do that at all, because the restriction on the parameter choice allows us to create increasingly good upper bounds on this sum as follows. Because we know that \(c(n) \leq |A|^n\) we know that \(\sum\limits_{k = n}^\infty c(k) x^k \leq \frac{(|A|x)^n}{1 - |A| x}\). Thus we can define sequences \(L_n = \sum\limits_{k \leq n} c(k) x^k \) and \(U_n = \frac{(|A| x)^{n + 1}}{1 - |A|x} + L_n\). These have the property that \(L_n \leq B(x) \leq U_n\), the \(L_n\) are monotonically increasing, the \(U_n\) are monotonically decreasing, and they both converge to \(B(x)\).

We can now sample the size as follows:

  1. Maintain a value \(m\) which we have evaluated \(L_m, U_m\) up to.
  2. Pick a real number \(z\) uniformly in \([0, U_m]\).
  3. If \(z \leq L_m\) then as before find and return the first \(n\) such that \(L_n \leq z\). Necessarily \(n \leq m\).
  4. Otherwise, find \(m' > m\) such that either \(z \leq L_{m'}\) or \(U_{m'} < z\).
    1. Set \(m = m'\).
    2. If \(z \leq m'\) then pick \(n\) as in (3) and stop.
    3. Go back to (2), drawing a new value of \(z\) based on our new \(m\).

(This is a relatively well known technique for values of "relatively well known" that mean "if you work in a very niche field you've definitely heard about it" but I think I independently reinvented it a while back because I don't work in that very niche field so hadn't heard about it)

So that's how we pick the size. Now we are left with two problems:

  1. How do we calculate \(c(n)\)?
  2. How do we pick uniformly among values of size \(n\)?

The answer to both is that we use the things I've been writing about recently!

First, note that we can turn any context-free grammar into a deterministic infinite automaton by using language derivatives, as per the Parsing with Derivatives approach: States are labelled by a derivative of the initial grammar, and are created lazily on demand as you traverse the automaton.

You can now from this automaton use the same techniques I described in Indexing a DFA in shortlex order, which work equally well on an infinite automaton (they only ever require looking a finite depth away from a given state). This means we can easily calculate \(c(n)\) using dynamic programming as before, and we can sample uniformly from the strings of length \(c(n)\) using the same technique we used to index: Pick a random integer \(i\) in the range \([0, c(n) - 1]\) and then walk the DFA the i'th string of length \(n\) in lexicographic order.

One potential issue with the conversion to a DFA that this neatly avoids is that you can run into cases with context free languages where it is undecidable if the grammar contains any strings. Fortunately, for random sampling this doesn't matter! Because long strings are increasingly unlikely, there is (probabilistically speaking) no observable difference between a state that has no strings starting from it and a state that only has ridiculously long strings.

How practical is this? I don't know exactly, but my suspicion is that for a lot of simple grammars it will work very well. I may attempt to put together an implementation soon.

PS. It's likely that the parameter restriction can be lifted by making more precise estimates of the tail sums using the automaton, but I haven't quite figured out how yet. For my purposes it's generally a relatively mild restriction so I'm not that worried about it.