DRMacIver's Notebook

Understanding through calculation

I’m in the process of trying to derust my maths skills, so I’m going to try posting about mathematics every day for June and see if that helps.

One of the things I’ve been noticing with a lot of what I’ve been reading (which has mostly been about probability) is that there are a lot of points where you want to prove some result, and the way you do that is just a bunch of algebra, and I tend to get… not exactly lost, but unsatisfied. I can understand the calculation, and I could even recreate it, but I don’t feel like I understand why the result is true more at the end than I did at the beginning.

I pointed at part of this in algebra and insight, where a result that was very easy to prove through calculation became much easier to understand by reformulating the problem with an insight. Crucially, that insight didn’t actually reduce the amount of calculation, but it replaced a completely opaque calculation with one that was more about confirming the insight actually worked.

Thing is… I don’t think you can get away without doing any uninformative calculations. Calculations are important, and sometimes what you need is an actual concrete result, and you need to calculate to do that. I think probably you can mine most calculations for insight, but I don’t think there’s any particularly deep reason that \(3 \times 106 = 318\), that’s just what you get when you run the calculation. (This isn’t entirely true, as a whole bunch of mental arithmetic comes down to breaking numbers down and seeing how they fit together. For example, this one is clearly true because \(3 \times 106 = 3 \times 100 + 3 \times 6 = 300 + 18 = 318\). So it’s not that the insights don’t exist, it’s more like the claim doesn’t feel like it needs justifying.) The fact that \(3 \times 106 = 318\) is more or less allowed to stand on its own. It still needs checking, and if I’d told you, for example, that \(3 \times 106 = 315\), you should immediately have a suspicious reaction to that, so the true answer in some ways needs less insight than the false answer. I’m not sure what point I’m making, and I think I’m sort of disagreeing with the main thrust of my argument at this point. I think what I mean to say is that this sort of calculation is one you might reasonably feel comfortable leaving insightless.

But I ran into an interesting example in the opposite direction this morning, where the key to understanding something that I’d been struggling with because I lacked insight was just shut up and calculate, and after doing the calculation the result was obvious.

Here’s the problem:

Suppose you’ve got some quantity \(\mu = \sum\limits_{n \geq 0} \mu_n\), and you have unbiased (but not necessarily independent) estimators \(X_i\) for the summands, i.e. \(E(X_i) = \mu_i\). How do you construct an unbiased estimator for \(\mu\)?

Here’s an easy version that I learned from Alex Lew’s PhD thesis. (You’re going to be hearing about Alex a lot in these posts, I suspect, as a nontrivial amount of what I’m trying to do right now is to mainline his entire body of work until I understand all the bits of it I care about.)

First, take some random nonnegative integer valued variable \(N\), with \(P(N=n) = q(n)\), such that \(q(n) > 0\) for all \(n \geq 0\). Sample a value \(n\) from it (independently of the \(X_i\)), sample \(x_n\) from \(X_n\), and return \(\frac{x_n}{q(n)}\).

Why does this work? Well, it’s part of a general phenomenon of things that work, called importance sampling, but we can do the calculation directly:

\[\begin{align*} E(\frac{X_N}{q(N)}) & = E(\sum\limits_{n} P(N = n) \frac{X_n}{q(n)}) \\ & = E(\sum\limits_{n} q(n) \frac{X_n}{q(n)}) \\ & = E(\sum\limits_{n} X_n) \\ & = \sum\limits_{n} E(X_n) \\ & = \sum\limits_{n} \mu_n \\ & = \mu \\ \end{align*}\]

Basically, dividing by \(q(n)\) “cancels out” the frequency with which we sample \(n\), and so as a result we get exactly the expectation we want. I’m not sure how it looks to you, but for me this calculation very much ends with me feeling like I get the result.
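If you want to watch that cancellation happen numerically, here’s a quick Python sketch. The concrete choices are purely illustrative (they’re mine, not anything from above): \(Y\) is uniform on \([0, 1]\), \(X_n\) is the product of \(n\) independent draws of \(Y\) (so \(\mu_n = E(X_n) = 2^{-n}\) and \(\mu = 2\)), and \(N\) is geometric on \(\{0, 1, 2, \ldots\}\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical concrete setup (my choice, not from the post): Y ~ Uniform(0, 1)
# and X_n is the product of n independent draws of Y, so E(X_n) = (1/2)^n = mu_n
# and mu = sum_n (1/2)^n = 2.
def sample_X(n):
    return np.prod(rng.uniform(0.0, 1.0, size=n))  # empty product = 1 when n = 0

# N geometric on {0, 1, 2, ...}: q(n) = P(N = n) = p * (1 - p)^n > 0 for all n >= 0.
p = 0.5

def sample_N():
    return rng.geometric(p) - 1  # numpy's geometric is supported on {1, 2, ...}

def q(n):
    return p * (1.0 - p) ** n

def single_term_estimate():
    n = sample_N()
    return sample_X(n) / q(n)

estimates = [single_term_estimate() for _ in range(100_000)]
print(np.mean(estimates))  # should be close to mu = 2
```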

Something Alex pointed out to me is that in the case we’re interested in, you more or less have to evaluate the \(X_i\) in sequence (specifically, they’re all unbiased estimators for \(Y^n\) for some random variable \(Y\), and you calculate \(X_n\) from \(n\) independent draws of \(Y\)). This makes this method quite wasteful, because it ignores all of the information you get from the first \(n - 1\) coefficients.

Instead, he pointed out that the following method works, which he described as a “Horvitz-Thompson like estimator”:

Draw \(n\) from \(N\), sample \(x_i\) from \(X_i\) for each \(i = 0, \ldots, n\), and return \(\sum\limits_{i = 0}^n \frac{x_i}{P(N \geq i)}\).
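Stated as code, with the same illustrative \(X_i\) and geometric \(N\) as in the sketch above (so that \(P(N \geq i) = (1 - p)^i\)), it looks like this. One caveat: in the real use case the \(x_i\) would be computed from a shared sequence of draws of \(Y\), whereas this sketch just samples them independently; that difference doesn’t matter for unbiasedness.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical setup as the earlier sketch: E(X_i) = (1/2)^i, so mu = 2.
def sample_X(i):
    return np.prod(rng.uniform(0.0, 1.0, size=i))

# N geometric on {0, 1, 2, ...}: P(N = n) = p * (1 - p)^n, hence P(N >= i) = (1 - p)^i.
p = 0.5

def sample_N():
    return rng.geometric(p) - 1

def tail_prob(i):
    return (1.0 - p) ** i

def horvitz_thompson_like_estimate():
    n = sample_N()
    return sum(sample_X(i) / tail_prob(i) for i in range(n + 1))

estimates = [horvitz_thompson_like_estimate() for _ in range(100_000)]
print(np.mean(estimates))  # should again be close to mu = 2
```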

I’ve spent about a week stumped about why this works. I haven’t been thinking about it constantly, but when I have I’ve been trying to gain insight into it. In particular this is apparently understandable as an application of Alex’s RAVI framework (which for me is still very much at the stage where I stare at the calculations and Don’t Quite Get It; all of the pieces make sense, but I haven’t got it to cohere into something I can actually use yet), and I was very much failing to get how that worked here.

But this morning, I realised that if you frame it the right way (which Alex had already more or less said and I’d failed to realise) you can just shut up and calculate and it’s easy then.

Here’s how it works: First, draw some random finite set of integers \(S\), according to whatever distribution you like (but, crucially, independently of the \(X_i\)). For our use case it will be \(S = \{0, \ldots, N\}\), but the result doesn’t depend on that.

Now, construct the random variable \(T = \sum\limits_{i \in S} c_i X_i\) for some coefficients \(c_i\) that we’ll choose later. Now:

\[\begin{align*} E(T) &= E(\sum\limits_{i \in S} c_i X_i) \\ &= E(\sum\limits_{i \geq 0} 1_{i \in S} c_i X_i) \\ &= \sum\limits_{i \geq 0} E(1_{i \in S} c_i X_i) \\ &= \sum\limits_{i \geq 0} c_i E(1_{i \in S}) E(X_i) \\ &= \sum\limits_{i \geq 0} c_i P(i \in S) \mu_i \\ \end{align*}\]

So, in order to get \(E(T) = \mu\), we just have to pick \(c_i = \frac{1}{P(i \in S)}\) (which requires \(P(i \in S) > 0\) for every \(i\)). When we substitute in the choice \(S = \{0, \ldots, N\}\), we get \(P(i \in S) = P(N \geq i)\), and this gives us exactly our desired estimator.

Again, I don’t know if you’ll feel the same way, but for me this calculation is extremely clear. I feel like I now get why this estimator works, and it works precisely because the calculation says it does. I think some of this is because I worked out the details for myself, but I do also think that in some sense the calculation is both clear and the only way that you can possibly get a result like this.

I do feel like there’s still some need for insight there. For example, I think this would have been much less clear (although it would have still worked just fine) if I’d done the calculation with the specific instance \(S = \{0, \ldots, N\}\) instead of the general set-based random variable.
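To make that generality concrete, here’s one more sketch with a deliberately different (and equally made-up) random set: \(S = \{0, \ldots, \max(N_1, N_2)\}\) for two independent geometric draws, so that \(P(i \in S) = 1 - (1 - (1 - p)^i)^2\). The same recipe \(c_i = \frac{1}{P(i \in S)}\) goes through unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical setup as before: E(X_i) = (1/2)^i, so mu = 2.
def sample_X(i):
    return np.prod(rng.uniform(0.0, 1.0, size=i))

# A deliberately different random set S, chosen independently of the X_i:
# S = {0, ..., max(N1, N2)} for two independent geometrics on {0, 1, ...},
# so P(i in S) = P(max(N1, N2) >= i) = 1 - (1 - (1 - p)^i)^2 > 0.
p = 0.5

def sample_S():
    n1 = rng.geometric(p) - 1
    n2 = rng.geometric(p) - 1
    return range(max(n1, n2) + 1)

def inclusion_prob(i):
    return 1.0 - (1.0 - (1.0 - p) ** i) ** 2

def general_set_estimate():
    # T = sum over i in S of X_i / P(i in S)
    return sum(sample_X(i) / inclusion_prob(i) for i in sample_S())

estimates = [general_set_estimate() for _ in range(100_000)]
print(np.mean(estimates))  # should be close to mu = 2
```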

There’s maybe also a little bit of a question of how you figure out that such an estimator is reasonable in the first place. Guess and check is a perfectly valid method, but it does feel a little bit unsatisfying. On the other hand, I think given linearity of expectation and the law of total expectation, it’s pretty reasonable to expect that something like this would work, so I’m not sure.

Either way, I think probably part of derusting my mathematics (actually I’m not sure this even counts as derusting; I feel like calculation has always been the part of mathematics I was worst at) is going to be getting comfortable with gaining understanding through calculation, and figuring out when and how that works.