DRMacIver's Notebook

Reducing Weird Tests

Reducing Weird Tests

Hillel recently wrote about cross-branch testing - comparing implementations of the same API in two different branches as a form of differential testing to see whether a change introduced any functional differences. As part of his example he ended up with what was functionally the following interestingness test:

def is_interesting(x, y, z):
    out = 0
    for i in range(10):
        out = out * x + abs(y * z - i ** 2)
        x, y, z = y + 1, z, x
    return abs(out) % 100 == 9

He mentions in the article that this gives nondeterministic results for Hypothesis:

Running this gives us an erroneous input:

Falsifying example: test_f(a=5, b=6, c=5,)

This is nondeterministic: I’ve also gotten (0, 4, 0) and (-114, -114, -26354) as error cases.

(0, 4, 0) is the (probably globally) correct answer of all of these according to Hypothesis's reduction logic, but Hypothesis's test-case reduction is unable to reliably transform all of the failing examples to it.

I reacted to this as I always do to such things: "Oh, I wonder if I can make the test-case reducer better". But the answer is no, no I cannot, and the reason is that this test contains what is basically a hash function (not a very good hash function, but good enough), and as a result the test cases are pretty much uniformly distributed throughout the set of test cases - if you generate test cases where the values are between -10 and 10, then of the 9261 possible test cases, 129 are interesting, i.e. about 1.4% which is about what you'd expect if the computation of abs(out) % 100 were roughly randomly distributed.

For such functions we end up with the slightly awkward situation where:

  1. It's not that hard to find test cases (1% isn't that infrequent! We can just exhaustively enumerate to find the globally minimal test case).
  2. It's hard enough to find test cases that most test-cases are locally minimal.

The reason for (2) is that if reductions are not any more likely to be interesting than any other test case (which seems roughly true for hillel's example and is true by construction for truly random examples), if you try \(N\) reductions you only expect \(0.01 N\) reductions to work. Suppose for a typical triple of integers we have, say, 10 reductions we can perform, and those reductions are no more likely than chance to work, then 90% of interesting test cases are locally minimal, which means that we'll typically end the reduction process within a few steps.

In order to explore this, lets consider the following toy model: Our space of interesting test cases are triples of integers between 0 and 10, and we have a local reducer that makes the following transformations:

def reduce_tuple(t):
    # First try sorting the tuple
    yield tuple(sorted(t))

    for i in range(len(t) - 1):
        if t[i + 1] < t[i]:
            reduced = list(t)
            reduced[i], reduced[i + 1] = reduced[i + 1], reduced[i]
            yield tuple(reduced)

    # Now try lowering each component
    for i in range(len(t)):
        if t[i] > 0:
            reduced = list(t)
            reduced[i] -= 1
            yield tuple(reduced)
            # If we can raise the next component then try that
            if i + 1 < len(t) and t[i + 1] < 10:
                reduced[i + 1] += 1
                yield tuple(reduced)

(if this doesn't make sense to you don't worry about it too much, it's just the first local reducer that came to mind for triples of integers)

Suppose we're reducing subject to an interesting test where the maximal value (10, 10, 10) is always interesting and every other test case is interesting with probability p. What is the probability that this local reducer finds the globally optimal test case?

Based on some fairly crude simulations, the local reducer starts finding the globally optimal test-case more than half the time around p = 0.93, which has the amusing property that the local reducer is reliably worse (in expectation) than the reducer that just always tries (0, 0, 0) and gives up if that doesn't work (which finds the globally minimal test case with probability p). Some equally crude simulation suggests that an approach that just enumerates the smallest test cases for a while returns at least as good a result in the same number of calls to the interestingness test in at least 80% of cases, regardless of the value of p.

I'm not sure if there's any sort of general lesson to derive here - most interestingness tests are not quite this adversarial - but I do feel like there is often something like this going on, where the reduction problem isn't that hard despite being in a local minimum, and I wonder if something loosely inspired by this that estimates the density of interesting test cases below the current interesting test case, but I'm not sure if that's entirely useful.

If one did have some sort of easy way to sample from test cases that are strictly smaller than the current interesting test case, it might be worth running that for longer than it would be intuitively obviously useful as a sort of crude but perhaps surprisingly effective reducer? It's certainly more robust to this sort of problem than any kind of local transformation based reducer would be.

This also suggests that possibly test-case reducers should try a bit more bounded exploration of the small test cases than they currently do, but I think this is probably misleading due to the relatively small number of test cases in this example - for most formats of interest the number of test cases of size n is exponential in n, so any amount of enumeration other than trying the globally minimal test case (which is usually worth doing) is probably not worth it.