DRMacIver's Notebook

Writing good programming abstractions

Published: 2024-01-13

I was talking to Nelson about programming abstractions the other day, and how you identify a good abstraction, and I hit on a heuristic that once I suggested it seemed obviously correct to me but that is very at odds with how I think people normally think about introducing abstractions in their codebase.I don’t mean anything fancy by abstraction. Implementing a new class or a new function for example.

It’s this: When you introduce an abstraction, it has to be an improvement at each call site. That is, even if you never reuse it, it should be make your code better.

Another way of thinking about this: If you implemented this, and then you deleted all but one of the call sites, would you be tempted to inline the abstraction? If yes, you shouldn’t do it. If you were in a fresh codebase and wanted to do something similar to your previous uses of the abstraction, would you inclined to copy the previous abstraction into your codebase in order to make use of it? That’s a sign that it’s probably a pretty good abstraction.

In contrast, I think most people’s intuitions about abstraction are around reducing code duplication. You introduce a function when you see duplicate code, because duplicate code is bad and you want to reduce it. I think this intuition is almost entirely wrong. It leads you to create bad abstractions, sometimes where you didn’t even need one at all, and it also means you miss important abstractions that you could have introduced even without any duplication.

Now, it’s not that duplication can’t be the prompt for abstraction. In the conversation Nelson and I were having, I’d introduced a class that managed a bunch of state for gathering statistics about some work in progress and doing cleanup when particular conditions were met. Originally I wrote that logic and state inline in the class managing those tasks. Then I needed it again, in a slightly different way with slightly different constraints, so I extracted a class for the common state and logic. This is a perfectly reasonable example of abstraction in the classic sense of reducing duplication.

But also, even before I’d written the second call site, the code got much cleaner, because doing this made me think about the actual interface I wanted, and I replaced a bunch of raw calls to heapq and set management with a nice API that declared what I was actually doing. This was a big improvement.

The repetition served the following purposes:

It was a prompt to think “Can I abstract this?”
By the fact that I had needed it twice, I had a better idea of how to abstract it.
It justified the work of making the abstraction.

The third is an important clarification of the heuristic: It’s not that the abstraction has to be justifiable without the duplication. There are two things to consider when making a change: How much of an improvement it is, and how much effort it is to implement. An abstraction that is a lot of work for only a mild improvement might not be worth it for one call site but clearly worth it if you’re going to use it in multiple places.

I think the best way to think about abstraction is that its purpose is to make programming less annoying. Writing the same code over and over again is annoying, and abstraction can reduce that annoyance, but it only reduces the annoyance if the abstracted code is actually less annoying than repeating yourself.

This judgement of annoyance does of course require developing a degree of taste, which makes it somewhat uninformative if you’re just starting out. Here are some ways code can be annoying that’s worth factoring out:

If you’ve got multiple pieces of state that are always used together, it can be worth putting them together into an object if you can figure out a clean API for that object.
Similarly, if there are a lot of intermediate variables that aren’t used outside of calculation, it might be worth extracting a function to reduce their visibility - it means you don’t have to keep track of where they’re used, and it’s harder to make stupid typos where you accidentally refer to a different variable where you mean.
If a function is really hard to follow, it can be worth extracting parts of it into smaller functions. e.g. recently I refactored a function to just extract the bodies of various branches of an if statement into functions. Those functions will never be called from anywhere else, but the control flow of the code was impossible to follow in a way that had previously masked a bug.
If code is irritating to get right, it can be worth thinking about whether there is an abstraction that makes it easier to get right. For example, it can be useful to factor out error handling into context managers in Python.
Sometimes there is a sort of logical “abstract algorithm” to code that is worth factoring out into a higher order function. e.g. I’ve talked about find_integer before, which I end up reusing all over the place.

In the other direction, it’s probably better not to introduce an abstraction if:

It requires introducing a function with a lot of arguments.
It significantly increases the number of lines of code when you first introduce it at one call site.
The abstraction is significantly complicated by different needs from different call sites.

In general, I think it’s worth erring on the side of not abstracting. If you’re not feeling the pain of not abstracting, it’s probably fine.

For example, here’s some code I often write variations ofI work on test-case reduction, so this sort of thing comes up a lot. I’m aware that almost nobody else regularly needs to write code that looks like this. Also yes I’m aware of the quadratic nature of this code. It’s fine, I promise. It’s not accidentally quadratic, it’s intrinsically quadratic because the can_delete checks are already O(n) and have much higher constant factors.:

i = 0
while i < len(ls):
    if can_delete(ls, i):
        del ls[i]
    else:
        i += 1

And this code is a bit fiddly and I could definitely do something along the lines of:

class DeletableListIterator:
    def __init__(self, ls):
        self.ls = ls
        self.i = 0

    def __iter__(self):
        while self.i < len(self.ls):
            yield self.i
            i += 1

    def delete(self):
        del self.ls[self.i]
        self.i -= 1

And then I could use this as:

list_iterator = DeletableListIterator(ls)
for i in list_iterator:
    if can_delete(ls, i):
        list_iterator.delete()

And it’s not really an improvement. I find this if anything slightly harder to follow than the original, though there’s not much in it. But more importantly, sometimes I want to write code like the following:

i = 0
while i < len(ls):
    if can_delete(ls, i, i + k):
        del ls[i:i+k]
    else:
        i += 1

And now I need to fiddle with the implementation to support this too.

There is almost certainly an abstraction that would make support this well - e.g. from what I know of C++’s iterator protocol it seems probably actually quite good for this.

But also… this code is fine without it? It’s a bit weird, and slightly ugly, but it’s not actually hard to get right, it’s not a lot longer than it needs to be, and once you’re familiar with the pattern it’s pretty straightforward to work with.

None of this is to discourage you from creating abstractions in general. I’m a big fan of well factored code. But it’s easy to abstract too much, or abstract the wrong things, and I think paying attention to what’s irritating about working on your codebase and fixing that is a good way to prioritise.