Reducing Weird Tests
Hillel recently wrote about cross-branch testing - comparing implementations of the same API in two different branches as a form of differential testing to see whether a change introduced any functional differences. As part of his example he ended up with what was functionally the following interestingness test:
def is_interesting(x, y, z):
    out = 0
    for i in range(10):
        out = out * x + abs(y * z - i ** 2)
        x, y, z = y + 1, z, x
    return abs(out) % 100 == 9
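For orientation, the kind of Hypothesis test this is standing in for might look something like the following sketch (the strategies, bounds, and test name here are illustrative; Hillel’s actual test compares two branches rather than calling is_interesting directly):

from hypothesis import given, strategies as st

@given(st.integers(-10, 10), st.integers(-10, 10), st.integers(-10, 10))
def test_f(a, b, c):
    # Hypothesis reports any (a, b, c) that makes this assertion fail
    # as a "falsifying example".
    assert not is_interesting(a, b, c)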
He mentions in the article that this gives nondeterministic results for Hypothesis:
Running this gives us an erroneous input:

Falsifying example: test_f(a=5, b=6, c=5,)

This is nondeterministic: I’ve also gotten (0, 4, 0) and (-114, -114, -26354) as error cases.

(0, 4, 0) is the (probably globally) correct answer of all of these according to Hypothesis’s reduction logic, but Hypothesis’s test-case reduction is unable to reliably transform all of the failing examples to it.
I reacted to this as I always do to such things: “Oh, I wonder if I can make the test-case reducer better”. But the answer is no, no I cannot, and the reason is that this test contains what is basically a hash function (not a very good hash function, but good enough), and as a result the interesting test cases are pretty much uniformly distributed throughout the set of test cases. If you generate test cases where the values are between -10 and 10, then of the 9261 possible test cases, 129 are interesting, i.e. about 1.4%, which is about what you’d expect if the computation of abs(out) % 100 were roughly randomly distributed.
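Those numbers are easy to reproduce by exhaustive enumeration; a quick sketch (assuming the is_interesting above and the -10 to 10 range just mentioned):

from itertools import product

# Count how many of the 21 ** 3 == 9261 triples with components in
# [-10, 10] satisfy the interestingness test.
triples = list(product(range(-10, 11), repeat=3))
interesting = [t for t in triples if is_interesting(*t)]
print(len(triples), len(interesting), len(interesting) / len(triples))
# The figures above report 129 interesting triples out of 9261, about 1.4%.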
For such functions we end up with the slightly awkward situation where:

1. It’s not that hard to find test cases (1% isn’t that infrequent! We can just exhaustively enumerate to find the globally minimal test case).
2. It’s hard enough to find test cases that most interesting test cases are locally minimal.
The reason for (2) is that if reductions are no more likely to be interesting than any other test case (which seems roughly true for Hillel’s example, and is true by construction for truly random examples), then if you try \(N\) reductions you only expect about \(0.01 N\) of them to work.
Suppose that for a typical triple of integers we have, say, 10 reductions we can perform, and those reductions are no more likely than chance to work. Then about 90% of interesting test cases are locally minimal (the chance that none of the 10 reductions works is \(0.99^{10} \approx 0.9\)), which means that we’ll typically end the reduction process within a few steps.
In order to explore this, let’s consider the following toy model: our space of test cases is the set of triples of integers between 0 and 10, and we have a local reducer that makes the following transformations:
def reduce_tuple(t):
    # First try sorting the tuple
    yield tuple(sorted(t))
    for i in range(len(t) - 1):
        if t[i + 1] < t[i]:
            reduced = list(t)
            reduced[i], reduced[i + 1] = reduced[i + 1], reduced[i]
            yield tuple(reduced)
    # Now try lowering each component
    for i in range(len(t)):
        if t[i] > 0:
            reduced = list(t)
            reduced[i] -= 1
            yield tuple(reduced)
            # If we can raise the next component then try that
            if i + 1 < len(t) and t[i + 1] < 10:
                reduced[i + 1] += 1
                yield tuple(reduced)
(if this doesn’t make sense to you don’t worry about it too much, it’s just the first local reducer that came to mind for triples of integers)
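To make its behaviour concrete, here is what it proposes for one example triple (the candidates below are worked out by hand from the definition above):

# Candidate reductions proposed for (2, 1, 0), in order:
#   (0, 1, 2)  - the sorted tuple
#   (1, 2, 0)  - swap the out-of-order pair at positions 0 and 1
#   (2, 0, 1)  - swap the out-of-order pair at positions 1 and 2
#   (1, 1, 0)  - lower component 0
#   (1, 2, 0)  - lower component 0 and raise component 1
#   (2, 0, 0)  - lower component 1
#   (2, 0, 1)  - lower component 1 and raise component 2
print(list(reduce_tuple((2, 1, 0))))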
Suppose we’re reducing subject to an interestingness test where the maximal value (10, 10, 10) is always interesting and every other test case is interesting with probability \(p\). What is the probability that this local reducer finds the globally optimal test case?
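A crude simulation along these lines might look like the following sketch (this assumes a greedy take-the-first-improvement reduction loop and memoises each test case’s randomly assigned interestingness within a run; neither detail is spelled out above):

import random

def estimate_local_reducer_success(p, trials=1000, seed=0):
    # Estimate how often the greedy local reducer, started from
    # (10, 10, 10), ends at the global minimum (0, 0, 0).
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        # Fix each test case's interestingness once per trial so the
        # reducer sees a consistent (if random) interestingness test.
        cache = {(10, 10, 10): True}

        def interesting(t):
            if t not in cache:
                cache[t] = rng.random() < p
            return cache[t]

        current = (10, 10, 10)
        improved = True
        while improved:
            improved = False
            for candidate in reduce_tuple(current):
                if candidate != current and interesting(candidate):
                    current = candidate
                    improved = True
                    break
        if current == (0, 0, 0):
            successes += 1
    return successes / trials

Sweeping \(p\) over a grid with a function like this is the kind of crude simulation the next paragraph refers to.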
Based on some fairly crude simulations, the local reducer starts finding the globally optimal test case more than half the time around \(p = 0.93\), which has the amusing property that the local reducer is reliably worse (in expectation) than the reducer that just always tries (0, 0, 0) and gives up if that doesn’t work (which finds the globally minimal test case with probability \(p\)). Some equally crude simulation suggests that an approach that just enumerates the smallest test cases for a while returns at least as good a result in the same number of calls to the interestingness test in at least 80% of cases, regardless of the value of \(p\).
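For reference, the enumerate-the-smallest-test-cases-for-a-while baseline could be something as simple as the following sketch (ordering by component sum is just one plausible notion of “smallest”, not necessarily the one used in the simulation):

from itertools import product

def enumerate_smallest(interesting, budget):
    # Try candidate triples in increasing order of component sum
    # (ties broken lexicographically), spending at most `budget`
    # calls on the interestingness test.
    candidates = sorted(product(range(11), repeat=3),
                        key=lambda t: (sum(t), t))
    for t in candidates[:budget]:
        if interesting(t):
            return t
    return None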
I’m not sure if there’s any sort of general lesson to derive here - most interestingness tests are not quite this adversarial - but I do feel like there is often something like this going on, where the reduction problem isn’t that hard despite the current test case being a local minimum. I wonder whether something loosely inspired by this, which estimates the density of interesting test cases below the current interesting test case, might help, but I’m not sure if that’s entirely useful.
If one did have some sort of easy way to sample from test cases that are strictly smaller than the current interesting test case, it might be worth running that for longer than would intuitively seem useful, as a sort of crude but perhaps surprisingly effective reducer. It’s certainly more robust to this sort of problem than any kind of local-transformation-based reducer would be.
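For the integer-triple toy model such a sampler is easy to write down; a sketch (the uniform box-sampling strategy here is just an illustration, not a claim about what a real reducer should do):

import random

def sample_below_reducer(start, interesting, budget, rng=random):
    # Repeatedly sample a triple componentwise at or below the best known
    # interesting test case and move to it whenever it is interesting.
    best = start
    for _ in range(budget):
        candidate = tuple(rng.randrange(0, b + 1) for b in best)
        if candidate != best and interesting(candidate):
            best = candidate
    return best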
This also suggests that possibly test-case reducers should try a bit more bounded exploration of the small test cases than they currently do, but I think this is probably misleading due to the relatively small number of test cases in this example - for most formats of interest the number of test cases of size \(n\) is exponential in \(n\), so any amount of enumeration other than trying the globally minimal test case (which is usually worth doing) is probably not worth it.