DRMacIver's Notebook

The Lazy Collection Trick

The opening to a recent pull request:

Hey, psst, buddy, would you like to see some really weird code? Then boy do I have the PR for you.

A thing I’m working on right now is Hypothesis’s memory consumption, as part of some work on driving Csmith with Hypothesis.

Unfortunately it turns out that the following facts are in tension:

- Driving Csmith requires very large test cases, so Hypothesis ends up handling a lot of raw data per run.
- Hypothesis builds a fairly detailed structured representation of each test case (essentially a parse tree), with metadata attached throughout.

The result is that Hypothesis ends up spending literal gigabytes on representing parse trees (not gigabytes per parse tree, but closer to that per tree than I’d like).

How do we make these two facts play well together?

Well, most of any given parse tree is never used. Test case reduction is a fairly local process and only cares about small parts of the tree at a time.

In the linked pull request, we are representing the test case as a list of blocks. A block is just an interval in the underlying byte stream, together with a bunch of metadata about it. Most of the time we only need one or two blocks; sometimes we need all of them; sometimes we need none at all.
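
To make that concrete, here is roughly what a block might look like as a plain, eagerly-constructed object. This is an illustrative sketch, not Hypothesis’s actual class; the field names (and the `all_zero` metadata flag in particular) are just stand-ins for “an interval plus some metadata”:

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One interval of the underlying byte stream, plus metadata about it."""

    index: int      # position of this block within the test case
    start: int      # offset of the block's first byte in the buffer
    end: int        # offset one past the block's last byte
    all_zero: bool  # illustrative metadata: is every byte in the block zero?

    @property
    def bounds(self):
        return (self.start, self.end)
```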

How do we solve this without being massively intrusive to calling code?

Easy! We implement the collection lazily. We store only a compact representation of the block endpoints; from that and the underlying buffer we can construct the Block object at any given index, building it on first access.
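
Here is a minimal sketch of the idea, reusing the illustrative Block class above. It is not the implementation from the PR, just the shape of the trick: keep only a compact list of end offsets, and materialise Block objects when (and only when) someone asks for them.

```python
class Blocks:
    """A lazy, list-like collection of Block objects.

    Eagerly stores only a compact list of end offsets; the Block at each
    index is built from those offsets and the buffer on first access,
    then cached.
    """

    def __init__(self, endpoints, buffer):
        # endpoints[i] is the end offset of block i, so block i spans the
        # bytes from endpoints[i - 1] (or 0 for the first block) up to but
        # not including endpoints[i].
        self.endpoints = endpoints
        self.buffer = buffer
        self._blocks = [None] * len(endpoints)  # filled in lazily

    def __len__(self):
        return len(self.endpoints)

    def __getitem__(self, i):
        if i < 0:  # support negative indexing like a normal list
            i += len(self)
        block = self._blocks[i]
        if block is None:
            start = self.endpoints[i - 1] if i > 0 else 0
            end = self.endpoints[i]
            block = Block(
                index=i,
                start=start,
                end=end,
                all_zero=not any(self.buffer[start:end]),
            )
            self._blocks[i] = block
        return block

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]
```

Code that only touches one or two blocks now pays only for those blocks, while code that genuinely needs all of them can still iterate over the collection as before:

```python
blocks = Blocks(endpoints=[2, 5, 8], buffer=b"\x00\x00abc\x01\x02\x03")
assert len(blocks) == 3
assert blocks[0].all_zero          # only block 0 has been built so far
assert blocks[1].bounds == (2, 5)  # block 1 gets built here, on demand
```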

That’s the basic idea of the lazy collection trick. There are lots of extra details in the implementing PR.

The result is a huge memory saving for Hypothesis’s access patterns, and I’ll probably end up using this trick in a few other places inside Hypothesis too.