DRMacIver's Notebook

Probably enough probability for you

I found myself writing a new newsletter issue in which it would be really useful for readers to have a basic working knowledge of probability. I started trying to explain it in the middle of the issue, and this seemed like a bad idea, so instead here is an out of band explanation where I’ll attempt to provide a decent working explanation of probability for people who are not already familiar with it.

Advance warning: In this piece I will be expressing strong philosophical positions, not all of which will be perfectly aligned with other people’s strong philosophical positions on the subject. Obviously mine are correct and everyone else is a fool, and I shall show them all.

…which is to say, treat this as a useful way of thinking about it rather than necessarily a 100% definitive account that everyone else would agree with. My view on this is a somewhat pragmatic hybrid of two major schools of probability, which I think corresponds pretty well to what people actually do in practice, but I take that hybrid as central rather than treating one of them as correct and also grudgingly borrowing ideas from the other.

Another thing to note is that I am being very sloppy with, or brushing over, many of the details. This is because what I am trying to convey is a rough sense of probability sufficient to get a handle on what it’s for and how to use it. If you want to fill in the details, that’s more work, but it’s also mostly not necessary.

What is a probability?

Probability is two separate things, which are often conflated:

  1. The skilled practice of putting a number on some source of uncertainty, in a way that usefully improves your ability to work with that uncertainty. These quantified uncertainties are called probabilities.
  2. A body of mathematical theory, based on certain idealised models of uncertainty that are only found in pure forms in mathematical objects, that nevertheless usefully describes and informs that skilled practice.

Most explanations of probability start with the second, and end up a bit confused on the first. In particular they try to explain what probability is, when in reality probability is not a single thing, and is best thought of as something you do rather than something that exists.

In particular you end up with a lot of arguments about whether probability is a subjective representation of uncertainty or a feature of the world itself, which is about as sensible a thing to argue about as whether numbers represent weights or lengths. Probability is a tool of description, and as such its use contains both subjective and objective elements, informed by the thing being described, the entity describing it, and the specific details of the process that produces the description (e.g. how much work is put in).

Instead it is better to think of probability in a pluralist way: There are multiple notions of probability, each corresponding to a given instance of trying to quantify uncertainty, and the mathematical theory suggests various rules that a notion of probability should ideally attempt to follow, which may or may not be followed in practice.

Quantifying uncertainty

Here are several motivating questions, each of which invites different notions of probability:

  1. If you roll two dice, how likely is the total value shown on the dice to be greater than or equal to 10?
  2. How confident are you that there are currently more than 67 million people in the UK?
  3. How likely is it to rain tomorrow?

A notion of probability associated with these questions is an attempt to answer them with a number. Conventionally, these numbers are between 0 and 1, with 0 representing “I am completely certain that this will not happen” and 1 representing “I am completely certain that this will happen”, although often it’s more natural to report them as a percentage between 0% (0) and 100% (1) instead.

Slightly more formally, a notion of probability for some set of possibilities (rolls of a dice, range of plausible populations of the UK, weather over the course of the day) is any method of taking subsets of those possibilities (e.g. the subset where the dice roll total is greater than or equal to 10, the subset where there are more than 67 million, the subset where it rains at least once in the course of the day), and assigning a number between 0 and 1 (or, if you prefer, 0% and 100%) to that event. Intuitively that number represents some idea of how “likely” that event is, but that is not part of the notion of probability, it is part of what we are using the notion of probability to try to do.
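
To make this concrete, here is a minimal sketch (in Python, with names I’ve made up purely for illustration) of one possible notion of probability for the two-dice question: it takes an event, described as a subset of the possible outcomes, and returns a number between 0 and 1 by weighting every outcome equally.

```python
from itertools import product

# All 36 outcomes of rolling two (ideal) six-sided dice, weighted equally.
OUTCOMES = list(product(range(1, 7), repeat=2))

def probability(event):
    """One notion of probability: take a subset of the possibilities
    (given here as a predicate picking out the outcomes in the event)
    and return a number between 0 and 1."""
    return sum(1 for outcome in OUTCOMES if event(outcome)) / len(OUTCOMES)

# The event "the total shown is greater than or equal to 10".
print(probability(lambda roll: sum(roll) >= 10))  # 6/36, i.e. about 0.167
```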

There are a couple reasons that I am emphasising this pluralism about probabilities, but the first and biggest one is that these questions each have a slightly different type of uncertainty to them.

In the first instance, the dice are in some sense “truly random”. There is an element of actual randomness in the world, and your uncertainty about it derives purely from that randomness (arguably you could perfectly predict the dice rolls if you had sufficiently good information and predictive power, but when I talk about dice or coin flips in this post please assume I mean ideal dice that are powered by some underlying source of genuine randomness like radioactive decay where you cannot predict it even in principle). This is sometimes called aleatoric uncertainty, meaning uncertainty deriving from pure randomness, but I’ll avoid using the term much.

In the second instance there is an objectively correct answer - there either are or aren’t more than 67 million people in the UK right now - and you could in principle find it out, but your knowledge is imperfect. You can e.g. google and get a population estimate of 67.22 million which suggests pretty strongly the answer is yes, but where do those numbers come from? How recent are they? Do they take into account Brexit and COVID departures? etc. Taking into account all of these considerations might make you less certain, but either way that uncertainty lives in your head, not in the world, where there is a true answer to the question that you can in principle find out. This sort of uncertainty is often called epistemic uncertainty, meaning uncertainty purely in your knowledge, and again I won’t use this term too much.

The third example however combines the two. Weather forecasting is an imprecise science, but it’s increasingly pretty good, especially with satellite imagery. Additionally, there are questions of general knowledge - e.g. if you have historical records you could answer with the fraction of times it’s rained on that date - but at some point you start being affected by true underlying randomness and there is a limit to how well you can do regardless of how much knowledge you gain.

This discussion of these three types of things hopefully indicates two important features:

  1. These questions do not uniquely define a notion of probability each. A notion of probability is a way of answering the question, but there are many different ways of answering the question.
  2. Some ways of answering the question produce better answers than others.

For example, suppose I answered 0 to the dice question. Perfectly allowed, there’s no rule against it. I could genuinely be 100% certain that two dice when rolled and added together will never total 10 or more. Of course, I will then look pretty stupid when the dice roll a 5 and a 6. This is in some sense a bad notion of probability, even if it’s a perfectly valid one.

I could take my best guess on the population thing (I’m like 95% certain that there are more than 67 million people in the UK right now. The 67.22 million figure is from mid-2020, and between COVID and Brexit I think there’s a decent chance we’ve popped below the 67 million mark. 95% feels a bit low, but 99% feels too high). A “better” way of doing it would be to go out and exhaustively count every single person and report 0% if the answer is no and 100% if the answer is yes. Or we could run the survey that resulted in that 67.22 million again and see what it says and do the same, which is a bit lower effort than a perfect accounting but probably good enough.

Similarly on the weather forecasting, we could make a complete guess (“IDK, 50%?”), we could use historical data, we could use satellite data, we could use complicated modelling… the more effort we put in to weather forecasting, the better we should hope our notion of probability that we get from those efforts is.

But what does “better” mean?

Well calibrated probabilities

The only way to be obviously, definitively wrong is to assign a probability of 0 or 1 (0% or 100%). If I say something is impossible and it happens, I was wrong; there is no doubt about that.

But it feels like this is too low a bar to clear, right? If I say something is a one in a million chance and it happens anyway then, well, maybe I just got unlucky, but if my one in a million chances happen nine times out of 10, this starts to look pretty fishy.

A distinction I think it is useful to make here is that it is not that my one in a million chance estimate is wrong in these cases, it’s that it’s bad. I have not expressed total certainty in this outcome, so you could make a legitimate case that I am still technically correct in my opinion, but also in a practical action guiding sense my level of confidence is garbage and regardless of whether I am technically correct you shouldn’t listen to me.

The problem there is not that I made a one in a million prediction, or even that something I predicted as one in a million happened. Things that you should predict as one in a million chances happen all the time, because there are millions of such things. If I claim that a million different things are each likely to happen with a one in a million chance, it’s pretty reasonable that one of them happens.
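
As a rough sanity check of that claim (assuming, purely for the sake of the arithmetic, that the million things are independent): the chance that at least one of them happens is \(1 - (1 - 10^{-6})^{10^6} \approx 1 - e^{-1} \approx 63\%\), so seeing one of your one in a million predictions come true should not be surprising at all.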

Say, for example, that I claim there’s a one in a million chance of winning the lottery (it’s actually much lower than this given a better notion of probability, but there’s nothing stopping me saying this), and 50 million people play the lottery and someone wins. I would have said that each of their individual chances was one in a million, and the fact that this person in particular won is not evidence against that, because it could have been any one of the millions of others.

In fact, the problem my prediction would have here is that far too few people would win for it to feel well calibrated. If 50 million people play, that suggests about 50 people should win, and the fact that only one person wins suggests that my one in a million claim was too high, and that a better notion of probability would have assigned each player a much lower chance of winning.

This leads to the idea of the calibration of a notion of probability: My notion of probability is well calibrated on a series of questions if, when I assign a certain level of confidence to an outcome, the outcome happens about that often.

This is a little subtle, so let me expand with some examples.

Let’s start with a series of (ideal) coin flips. Suppose, with each coin flip, you ask me to assign a probability that the coin will turn up heads, and I say 60%. You flip a coin, it turns up tails. This is not particularly strong evidence that my answer of 60% is badly calibrated - I said it was more likely than not to turn up heads, but I didn’t claim tails was very unlikely. If I’d said 99.999% and it had turned up tails, that would have been much stronger evidence that my notion was badly calibrated.

If, on the other hand, you keep flipping the coin, I keep saying 60%, and after 10000 coin flips it’s turned up heads 4897 times, that’s pretty good evidence that my estimate of 60% is off - certainly an estimate of 50% would be better calibrated. Currently the number that would be best calibrated would be 48.97%, but if I kept flipping the coin another 10000 times and got 5041 heads this time, the new total would be 9938, or 49.69%, at which point 50% looks better calibrated than 48.97%.
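
If you wanted to check this sort of thing numerically rather than by flipping real coins, a minimal simulation sketch (assuming a fair coin and using only Python’s standard library) might look like this:

```python
import random

def observed_frequency(n_flips, seed=0):
    """Simulate n_flips of a fair coin and return the fraction of heads,
    which is what any constant claimed probability should be compared to."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

claimed = 0.6
print(f"claimed {claimed:.2%}, observed {observed_frequency(10_000):.2%}")
# The observed frequency comes out close to 50%, so a constant claim of 60%
# is badly calibrated, while a constant claim of 50% would be well calibrated.
```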

In some sense, for this very simple process, there is an objectively correct well calibrated answer, which is that you should always be estimating 50%. This is because the uncertainty in it is “all in the randomness”, and also because we’ve supposed by hypothesis that the coin is ideal.

Over a more complicated series of questions, the picture is less clear, in particular because the probability might not be the same every time. If you asked me each day whether or not it’s going to rain and I answered 60% each time, you could measure the calibration in the same way, but suppose some days I answered 10% and some days I answered 90%: how would you determine how well calibrated I was?

One answer to this is what are called scoring rules, which attempt to score how well you do over a series of questions. The way these work is that when you make a prediction, you are rewarded for having assigned a high probability if the event happened, or a low probability if it didn’t. For example, a common choice of scoring rule is the Brier score, where if you chose probability \(p\) you receive a penalty of \((1 - p)^2\) if the event happens, or \(p^2\) if it does not, with lower scores being better. In the extreme case where \(p = 1\) you get a penalty of \(0\) if it happens and \(1\) if it doesn’t (vice versa if you chose \(p = 0\)), while if you chose \(p = \frac{1}{2}\) you get a penalty of \(\frac{1}{4}\) either way.

(Some scoring rules may not work exactly like this, in that they aren’t necessarily an independent sum of your score on each question, but the general idea remains the same)

There is a whole body of theory about scoring rules and what makes a scoring rule good or bad, but the most important part is that some scoring rules are proper, which means that the strategy that gets you the best score is to report a perfectly calibrated set of probabilities; different scoring rules will reward how close you are to being well calibrated in different ways, though. For example, the Brier score is fairly forgiving if you occasionally say that something is a one in a million chance and it happens, while the logarithmic scoring rule will penalise that comparatively harshly.
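
As a sketch of how these scores get computed in practice (using the penalty convention above, where lower is better; the predictions in the example are entirely made up):

```python
import math

def brier_score(predictions):
    """Mean squared penalty: (1 - p)^2 if the event happened, p^2 if not.
    Lower is better, and it is minimised in expectation by well calibrated
    probabilities (it is a proper scoring rule)."""
    return sum((1 - p) ** 2 if happened else p ** 2
               for p, happened in predictions) / len(predictions)

def log_score(predictions):
    """Mean log penalty: -log(p) if the event happened, -log(1 - p) if not.
    Also proper, but it punishes confident misses far more harshly."""
    return sum(-math.log(p) if happened else -math.log(1 - p)
               for p, happened in predictions) / len(predictions)

# (probability given, did the event happen?) - made-up predictions, including
# one "one in a million" claim that happened anyway.
predictions = [(0.9, True), (0.6, True), (0.2, False), (0.000001, True)]
print(brier_score(predictions))  # the one in a million miss costs at most 1 here
print(log_score(predictions))    # ...but its -log penalty is about 13.8, dominating the average
```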

All of which is to say, as well as different notions of probability, there are also different ways of measuring how good a given notion of probability is.

There is one particularly useful notion of practical calibration that I’ll talk about in a bit, but before I do I want to really hammer home something about calibrated probabilities, partly because even though I thoroughly understand it I continue to find it intuitively weird.

Why are there multiple calibrated probabilities?

Let’s go back to the question of predicting rainfall.

Suppose, each day, you asked me if it would rain tomorrow somewhere in the UK, and each day I said 46.6% likely. This is the number some brief googling gives me for 2020, so this number is pretty well calibrated.

This is what you can think of as the “zero effort well-calibrated” notion of probability. It’s well calibrated, but I’ve made no attempt to take into account any information available to me.

Now suppose I pay attention to if it’s raining today. Say I claim that if it’s raining today it’s 60% likely to rain tomorrow, and if it’s not raining today it’s 35% likely to rain tomorrow (I have no idea if this is more accurate, I’d have to look at more data than I could find in 30 seconds of Googling, but something like it will be true - it’s very unlikely that whether it rains tomorrow is completely independent of whether it rains today. Let’s pretend it’s right for the sake of simplicity).

The difference between these two notions of probability is not that one is better calibrated than the other, it’s that one incorporates more work and information. That makes it more useful for guiding action (although still probably not quite useful enough), because it gives us a better sense of whether it’s going to rain on any given day.

(“Is it going to rain anywhere in the UK?” is not necessarily particularly useful for action, mind, but you can imagine the same thing with more local forecasts).

An important thing to note here is that sometimes the day-dependent prediction gives a higher chance of rain than the constant one, and sometimes a lower chance. If it had always produced a lower (or always a higher) chance of rain, they could not both have been well calibrated, because over a long series of predictions calibration means that each has to match the actual fraction of rainy days on average, and if one was consistently higher it would imply a higher number of rainy days over time than the other, so they can’t both be right.
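
Here is a small simulation sketch of that situation, using the made-up numbers above and assuming (purely for illustration) that whether it rains tomorrow depends only on whether it rained today:

```python
import random
from collections import defaultdict

# Made-up numbers from the text: chance of rain tomorrow given today's weather.
P_RAIN = {"rain": 0.60, "dry": 0.35}

def simulate(days=200_000, seed=0):
    """Simulate the weather day by day, then check how often it actually
    rained among the days given each predicted probability."""
    rng = random.Random(seed)
    today = "dry"
    outcomes = defaultdict(list)  # predicted probability -> outcomes observed
    rainy_days = 0
    for _ in range(days):
        p = P_RAIN[today]
        rained = rng.random() < p
        outcomes[p].append(rained)
        rainy_days += rained
        today = "rain" if rained else "dry"
    print(f"overall fraction of rainy days: {rainy_days / days:.3f}")
    for p, results in sorted(outcomes.items()):
        print(f"predicted {p:.2f}: rained {sum(results) / len(results):.3f} of the time")

simulate()
# The overall rain rate comes out around 0.47, so a constant forecast of about
# 47% is well calibrated, and within each bucket the observed frequency is close
# to 0.35 or 0.60, so the day-dependent forecast is well calibrated too.
```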

The logic of probability

As well as calibration, one way you can judge how good probabilities are is whether they make sense when compared to each other. Perfectly calibrated probabilities will always make sense in this way (under various technical assumptions), so checking for consistency is one way of seeing that your probabilities are definitely not perfectly calibrated, and perhaps of improving them if that seems worth the effort.

For example, one obvious way your probabilities could be nonsensical is if you give an event that logically must be more likely than another a lower probability. The question “Will it rain in the next week?” cannot reasonably have a lower probability than “Will it rain tomorrow?”, because every time it rains tomorrow it rains in the next week.

This will show up as bad calibration on a series of questions that asks both at the same time, because over a large number of predictions the number of times it rained in the next week must be at least as high as the number of times it rained tomorrow, so if the probability given for the former was lower than that for the latter, one or other of the probabilities would end up wrong.

(There are various details you would have to arrange for this to reliably show up in practice, but hopefully you can at least agree that such a prediction is silly).

The most important rule probabilities should satisfy is this: If two events are disjoint, that is they cannot happen together, then the probability of at least one of them happening is the sum of the probabilities of either of them happening. For example, the probability you assign to a die roll being five or six should be the sum of the probability you assign to it rolling five plus the probability you assign to it rolling six.

From this, we can derive most other useful rules about probability. For example, the first rule I mentioned can be expressed as follows: if whenever A happens, B also happens, then the probability of B is the probability of A plus the probability of B happening without A. The latter probability must be at least 0, so the probability of B must be the probability of A plus a non-negative quantity, and thus at least that of A.

Another rule is that, because the probability we assign to certainty is 1, if we assign probability \(p\) to some event then we should assign probability \(1 - p\) to it not happening.
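
You can check these rules numerically with a tiny sketch; here is one for a single (ideal) die roll, with a hypothetical helper function I’ve made up for the purpose:

```python
from fractions import Fraction

def die_probability(event):
    """A notion of probability for one ideal six-sided die: weight each
    face equally and measure the fraction of faces in the event."""
    return Fraction(sum(1 for face in range(1, 7) if event(face)), 6)

p_five = die_probability(lambda face: face == 5)
p_six = die_probability(lambda face: face == 6)

# Disjoint events add: P(five or six) == P(five) + P(six).
assert die_probability(lambda face: face in (5, 6)) == p_five + p_six

# Complements: P(not five) == 1 - P(five).
assert die_probability(lambda face: face != 5) == 1 - p_five

# If A always implies B then P(A) <= P(B): rolling a six implies rolling five or more.
assert p_six <= die_probability(lambda face: face >= 5)
```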

I won’t go into these logical rules of probability too much here, but it’s worth noting that understanding them is often one of the most beneficial parts of learning probability theory, because they tend to result in fairly general rules of reasoning that you can use even when you’re not using probability explicitly.

For example, we’ve got the conjunction fallacy, which people often fall prey to:

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.

Which is more probable?

  1. Linda is a bank teller.
  2. Linda is a bank teller and is active in the feminist movement.

The correct answer is (1), because if (2) is true then (1) has to be true as well, but people often get this wrong.

(Although as a side note, I’m a little suspicious of this claim, and there are a number of critiques of it).

Learning to spot problems in your reasoning like this tends to be one of the reasons people are very into probability theory.

From probability to expected values

Often when we make decisions based on probabilities, we’re interested not just in the probabilities of events but some number that depends on them. For example, when we’re planning a project, we will typically want to know how much it’s going to cost.

For a given notion of probability, we can define the expected value of some score as the sum, over the possible outcomes, of the score for that outcome multiplied by the probability assigned to it.

So for example, if we were to assign a probability of \(\frac{1}{6}\) to each possible face of a die when rolling it (which is, presumably, perfectly calibrated if the die is fair), the expected value of the roll is \(\frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3\frac{1}{2}\).

The important thing about expected values is that when you have a well calibrated notion of probability, and take many actions, then the average actual value of those actions will be pretty close to the average expected value.

This is a little hard to justify in general and this post has gone on long enough as it is, but to see how it works in a specific case, let’s look at the dice:

Suppose our probability of \(\frac{1}{6}\) for each face is well calibrated. Then over a large number of rolls, \(N\), typically about \(\frac{N}{6}\) of the rolls will have each value. Therefore the average is about \(\frac{N}{6} \times (1 + 2 + 3 + 4 + 5 + 6) \times \frac{1}{N} = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3\frac{1}{2}\), as desired.
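
If you’d rather see that convergence happen than take the algebra on trust, here is a quick simulation sketch (a fair die, standard library only):

```python
import random

def average_roll(n_rolls, seed=0):
    """Roll an ideal, fair six-sided die n_rolls times and return the average
    value shown, to compare against the expected value of 3.5."""
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls

expected = sum(range(1, 7)) / 6  # (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5
print(expected, average_roll(100_000))
# With a large number of rolls the observed average comes out very close to 3.5.
```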

This argument will work for any similar situation where you’re repeating exactly the same sort of action over and over again, but the claim is true in general (with some mild technical caveats), it’s just harder to justify and I’ll ask you to take it on faith.

Practical calibration through betting

This notion of expected values leads to one of the easiest indicators that you’re well calibrated: you can make money from betting on your probabilities.

Suppose I offer you a bet: If an event happens, I’ll give you £1, otherwise you give me £1. Should you take the bet or not?

Well, as a one off, it doesn’t matter very much, but if you’re presented with many such bets it starts to add up, and being able to make the bets well will lead to profit.

Now, suppose you have a well calibrated notion of probability for these questions, and you give the event a probability of \(p\). How much do you make off this?

Well, the expected value of your bet is \(p - (1 - p) = 2p - 1\), because if the event happens (probability \(p\)) you make £1, and if it doesn’t (probability \(1 - p\) because your probabilities are well calibrated) you lose £1 (or “make” -£1).

What this means is that if you think the event is more likely than not (i.e. \(p > \frac{1}{2}\)) you would expect to make money, and so should take the bet, and otherwise you expect to lose (or not gain if \(p = \frac{1}{2}\)).

More generally, if you make £n when the event happens or pay £m when it doesn’t, you expect to earn \(np - m(1 - p) = p(n + m) - m\), so the expected value of the bet is positive when you assign \(p > \frac{m}{m + n}\).
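
A small sketch of that arithmetic (the function names and stakes here are mine, purely for illustration):

```python
def bet_expected_value(p, win=1.0, lose=1.0):
    """Expected value in pounds of a bet paying £win if the event happens
    (to which you assign a well calibrated probability p) and costing
    £lose if it doesn't."""
    return p * win - (1 - p) * lose

def break_even_probability(win=1.0, lose=1.0):
    """The threshold m / (m + n): above this probability the bet has
    positive expected value."""
    return lose / (lose + win)

print(bet_expected_value(0.6))           # 2 * 0.6 - 1 = 0.2, so worth taking
print(bet_expected_value(0.3, win=5.0))  # 0.3 * 5 - 0.7 = 0.8: long odds can still be worth it
print(break_even_probability(win=5.0))   # 1 / 6, i.e. about 0.167
```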

Why should you bet on events with a positive expected value? Well, because of the property that over many bets you expect the average value of your bets to be roughly the average value of their expected values. As a result, if you make a lot of positive expected value bets, the average value of your bets is positive, and thus the total value of your bets is positive and you’ve made money.

Now, I should add to this that actual practical gambling is much more complicated than this, and involves a lot of subjects like bankroll management. The above logic really only works as long as you are gambling amounts that are small compared to how much money you have in reserve: You only get to make many bets if you don’t go bankrupt first!

This also isn’t really about gambling, and I probably don’t recommend gambling as a general choice (particularly against professional bookies, who almost certainly have a better notion of probability than you do and can make sure the odds you’re being offered are not worth it), but the same logic holds for general strategic choices too: try many safe things that have, in some sense, a positive expected value for you, and you’ll probably do pretty well.

Is all of this worth it?

One of the big philosophical debates around probability is between “All uncertainty can usefully be represented as probabilities” and “Oh yeah, how certain are you of that?”.

My view on the subject is this: You can always assign a probability to any question, but your probabilities will often not be very good, and whether it is worth the effort to make them better is highly variable. In particular on questions that are very complicated and involve a lot of unknown unknowns, it’s very unclear that thinking of them in terms of probabilities improves your thinking on the subject much, or that “Making your probabilities better” is a good way to think about learning more.

There are a lot of arguments that an ideal agent represents all uncertainty as probabilities. This may well be, I don’t have a very strong opinion on the behaviour of ideal agents, but I think idealising often elides the fact that these ideal agents have suspiciously godlike powers of thought and observation. With actual real practical agents, quantifying your uncertainty that way may not be the best use of your time.

However, there are situations in which probability really shines. In particular when you are dealing with many similar questions which involve a mix of genuine randomness and uncertain knowledge, it is often very helpful to be able to do at least a rough and ready probability estimate of various things in order to get a feel for how to reason about it, and for some jobs and tasks that are at the cutting edge of working with uncertainty (e.g. finance, logistics, forecasting), thinking in terms of probabilities is a vital skill.

Even if you don’t regularly encounter such situations, learning the basics of probability is often a useful thing to do, because it gives you tools for reasoning about the uncertainties of daily life, and generally gets you comfortable with working with uncertainty in a way that is hard to get elsewhere.

Is it worth it to think of everything in terms of probability? Almost certainly (about 99.5%) not, but often it’s worth learning how to think of everything in terms of probability so that you can then make the sensible choice not to.