DRMacIver's Notebook
Indexing a DFA in shortlex order
Indexing a DFA in shortlex order
Bad news feelings-readers, it’s another technical post. Having feelings will resume shortly (probably in tomorrow’s newsletter if nothing else)
I asked the other day about how to find the shortlex-predecessor of a string in a regular language represented by a DFA. I eventually came up with a solution that wasn’t too bad. This morning in the shower I came up with a better one (disclaimer: I haven’t written code for this or tested it, so it’s probably subtly wrong and has off by one errors, and it may perform poorly, but the basic idea is sound in theory).
The basic idea is this: It’s well known that using breadth first search we can enumerate the regular language in shortlex-ascending order as \(x_0, \ldots, x_n, \ldots\). Finding the predecessor of \(s\) given this enumeration is as simple as looking up \(s = x_n\) and returning \(x_{n - 1}\). Unfortunately, \(n\) may be exponentially large in \(|s|\) so doing the enumeration is horribly impractical.
But! It turns out, you don’t have to, and this solution can be made workable (given a good bigint library to represent the indices) as follows:
Maintain a dynamic programming table \(C(i, k)\) which counts the number of accepted strings of length \(k\) starting from state \(i\) in the DFA. Also maintain a dynamic programming table \(R(i, k) = \sum\limits_{j = 0}^k C(i, j)\) (i.e. the number of strings of length at most \(k\) accepted from state \(i\)).
What this table lets you do is essentially skip over the bits of the breadth first search that you know won’t lead to where you’re going, and count how many strings you skipped by doing so.
Now, you look up \(x_n\) by first comparing values of \(R(0, k)\) to find \(k\) with \(R(0, k) < n \leq R(0, k + 1)\). We know \(k = |x_n|\). Now, walk the DFA, using \(C(i, k)\) to determine which node to transition to (using essentially the same idea - this is easier to explain in code, but I haven’t written the code yet!).
Now, to find \(x_n = s\) we just count the number of strings in the langugage which are shortlex smaller than \(s\). This can be established relatively easily: First count \(R(0, |s| - 1)\), all strings that are shorter than \(s\), and then work out the number of strings of length \(|s|\) that are lexicographically smaller than \(s\) by adding up sum of the strings of length \(k - i\) by walking the DFA and after reading \(i\) characters count the number of strings of length \(|s| - i - 1\) that you would get from following each \(c < s_i\).
To be honest, I think this explanation was all clear as mud, sorry. I think either you’ll get the idea immediately from the initial seed of “Use dynamic programming to get the indices” or the explanation won’t clarify it, and I should have written the code instead of trying to explain this in prose. Alternatively, this is a sign that I really need to get much better at explaining this sort of thing in prose.