DRMacIver's Notebook

Adaptive parallel test-case reduction

Published: 2023-07-27

There’s a very simple algorithm I call “adaptive parallel deletion”, which I think of as completely standard and thus often forget that literally nobody but me uses. I haven’t written it down anywhere, so here’s me finally getting around to doing that.

This algorithm is a component of test-case reduction that replaces the classic delta debugging algorithm - that it is, it takes some sequence and produces a subsequence of it satisfying some condition, such that if you remove any one element from its output the condition is no longer satisfied.

The core features of adaptive parallel deletion are:

When there are large reductions that can be made, it will usually run in O(log(n)).
When there are no reductions that can be made, it will run in O(n), but with the work scaling to as much parallelism as you can give it.
It will scale fairly gracefully in between the two.

Here it is:

def single_pass_delete(test_case, condition):
    """Returns a version of `test_case` that satisfies `condition` and
    is "one-minimal" in the sense that you cannot delete any single
    member of it and have it still satisfy the condition."""
    test_case = list(test_case)

    def can_delete(j, k):
        return condition(test_case[:j] + test_case[k:])

    i = 0
    while i < len(test_case):
        try:
            # Scan in parallel for the first location where we can delete
            # a single element of the list. If no more elements are deletable
            # this will run with as much parallelism as we can give it.
            i = find_first(range(i, len(test_case), lambda j: can_delete(j, j + 1)
        except NotFound:
            break

        del test_case[i]
        # Now that we've found an area where there's at least one deletable
        # element, we hope that it's part of a large chunk of deletable elements.
        # This finds an integer k so that we can delete the next k elements
        # and not the next k + 1 elements, running in O(log(k)). This hopefully
        # lets us delete a reasonable large chunk at a time.
        k = find_integer(lambda k: i + k <= len(test_case) and can_delete(i, i + k))
        del test_case[i:i+k]

        # Because we chose k such that k + 1 elements are not deletable, we know
        # that the element left at i (if any, we might have deleted the whole list)
        # is now not deletable, so we increment i to skip past it on the next
        # iteration. This is purely a performance optimisation, but in the worst
        # case might cause us to make twice as many calls to `condition` as we
        # need to if we skip it.
        i += 1

    return test_case

Where the functions this uses are find_integer that I’ve previously written about here which is just an exponential probe and binary search (so runs in O(log(k)) when it returns k) and find_first is defined as follows:

import os
from itertools import islice


class NotFound(Exception):
    pass


def find_first(target, predict):
    """Returns the first element of `target` that satisfies `predicate`,
    or raises `NotFound` if none do."""

    if not target:
        raise NotFound()

    parallelism = os.cpu_count()

    if parallelism <= 1:
        for x in target:
            if f(x):
                return x
        raise NotFound()
    else:
        with ThreadPoolExecutor(max_workers=parallelism) as executor:
            it = iter(target)
            chunk_size = 1
            while True:
                chunk = list(islice(it, chunk_size))
                if not chunk:
                    raise NotFound()
                for x, b in zip(chunk, executor.map(f, chunk)):
                    if b:
                        return x
                chunk_size *= 2

In particular find_first is designed so that when it does not find an answer it scales up rapidly to be embarrassingly parallel, but when it finds an answer early on it doesn’t do much redundant work (you still have the overhead of the thread pool and such, but it doesn’t actually run much extra code).