DRMacIver's Notebook

Getting Test-Case Reduction Unstuck with Automaton Inference

Published: 2020-07-05

A question I get asked a lot by friends is why I don’t talk about my PhD much. The answer is generally very simple: It’s because almost nobody actually wants to know about my PhD, and it’s not worth the half hour of bringing them up to speed while they politely pretend that they do.

But, on the other hand, (a small handful of) people do generally seem interested when I talk about my technical work on Twitter, and this notebook is where I get to write about whatever the hell I want, and part of the plan to sort out my PhD is to do some more non-paper research writing.

So today’s post is about my research. Sorry for everyone who is here mainly for feelings and social commentary.

Context: I work on a thing called test-case reduction. Test-case reduction is basically “just” constrained optimisation of a total order (the reduction order), over some set (the test cases), where the constraint is just a black box predicate you’ve been handed (the interestingness test), and you’ve been handed a single witness of the set (the initial test case).

For reasons that are explained in our paper about Hypothesis I’m particularly interested in the case where the test cases are all byte strings (we present it as bit strings in the paper, but same deal really) where the order is the shortlex order. i.e. \(s \preceq t\) if \(|s| < |t|\) or \(|s| = |t|\) and \(s \leq t\) in the lexicographic order.

There is an idea that I’ve had in the back of my head for a while:

Given some interestingness test \(p\) and some starting test case \(s\), the set \(\{t: p(t) \wedge t \preceq s\}\) is finite and thus a regular language.
If we had a deterministic finite automaton for that language, we could just read off the shortlex-minimal string matched by that automaton using a breadth-first search, giving us the global minimum interesting test case.
We can infer (an approximation to) that automaton using the L* algorithm (or see my PyCon UK talk on the subject for a less technical account).

This suggests a great approach to test-case reduction: We set up the L* algorithm, make sure it gets the correct result on our initial test case, read off the minimal matching string of the automaton, check if that’s correct and if not pdate the automaton, and iterate that until the automaton-minimal string is also interesting.

This idea is genuinely very good except for two small problems.

It’s prohibitively slow.
It doesn’t work.

It’s prohibitively slow because the complexity of L* is weird and depends a lot on the language being learned, but in practice you’re lucky if it’s only \(O(n^2)\) when you use it this way. It doesn’t work because although the language is regular it doesn’t have a particularly nice structure a lot of the time and the algorithm isn’t always able to pick up on important details as a result. For example, L* does not easily learn the constraint that two bytes can be equal but it doesn’t matter what value they take beyond that.

This is all very unfortunate because it’s a lovely idea. Fortunately, as I discovered recently, there is actually a context in which L* can potentially be a significant boon to test-case reduction after all: You can use it offline to learn new reduction passes (a reduction pass is basically just a simple test-case reducer, but effective test-case reducers tend to be assembled out of multiple reduction passes) that are not prohibitively slow and correspond to established deficiencies in your existing reducer.

We can find deficiencies in a reducer by basically looking for ways it gets “stuck”. This is based on an idea called test case normalization which is essentially that the goal of a reducer should be that its output does not depend on the initial test case, only the interestingness test.

If you have some interestingess test that demonstrable does not normalize on an initial corpus of test cases (in my case these are generated by Hypothesis) you can learn a set of new reduction passes that normalizes that corpus. You do this by picking the two smallest examples, adding a reduction pass that turns the larger of the two into the smaller, and rerunning until all of the corpus is normalized (which takes at most one pass per corpus element).

Suppose \(s \preceq t\) are the two smallest elements of that corpus after reduction. There is a trivial reduction pass that just replaces the string \(t\) with the string \(s\) wherever it finds it. This is actually quite good, because after reduction (assuming the initial reducer is doing an OKish job of normalizing) you’re much more likely to see \(t\) than chance would suggest. You can improve these odds further by removing a common prefix and suffix from each of \(s\) and \(t\).

But you can also generalise this further. Suppose we’ve extracted those common prefixes, say \(u\) and \(v\). We now use L* to learn the predicate \(f(x) = |uxv| \leq |t| \wedge p(uxv)\), making sure that it learns \(s\) correctly. i.e. we learn a smallish language that captures just enough information to turn the central bit of \(t\) into the central bit of \(s\).

We now use this language to define a new reduction pass which finds any substrings that match this DFA and replace them with the smallest element of it. By making sure L* has learned enough about \(s\) and \(t\) to correctly predict them, this ensures that the new pass can transform \(t\) into \(s\).

At a high level what this is doing is essentially learning patterns where your existing reducer gets “stuck” and learn just enough about the linguistic structure of the test case format to work around them.

Does this work?

Well, maybe.

There are roughly two ways this can go wrong:

The DFA may still be impractical to learn.
The learned passes may not generalise well enough to be useful, so this ends up with a very large number of passes.

Initial experiments on Hypothesis are pretty promising as to whether the DFAs can be learned, but the Hypothesis test-case format is very forgiving. I’ll be surprised if this works well enough out of the box on human-readable test-case formats and I expect it will need more tinkering. In particular when the reduced test-cases in the corpus are far apart, I think one needs some intermediate steps to learn smaller transformations first (I have some ideas for how to use A* to do this. I do know algorithms that aren’t just a letter and an asterisk, I promise).

The number of learned passes is potentially more of a problem, but conveniently I was already working on an approach to make it practical to run test-case reduction with a large number of mostly useless passes.