DRMacIver's Notebook

Berkson's paradox is everywhere

Everyone likes to tell you that correlation is not causation. They're right, but they're also boring. Also all of their examples are wrong.

What they tell you much less often is the most common reason why correlation does not imply causation: most of the correlations you see in day to day life are, in some sense, fake. They are not present in the underlying reality, they're an artifact of how you're looking at it. This is Berkson's paradox.

My go-to example of this (I don't actually know if it's true, but it's a nice illustration) is that professional tennis players tend to be either tall or fast - tall tennis players tend to be slower, fast tennis players tend to be shorter.

You could come up with some complicated biological explanation about why these two traits might be negatively correlated, but it would be wrong, because they're not negatively correlated in the general population, or at least not to the same degree. The reason is much simpler than that: short, slow people will rarely play tennis professionally.

As a result, even if height and speed are entirely independent of each other (or even slightly positively correlated!), they will become negatively correlated when you look only at professional tennis players, because a professional is more likely to have one of the two traits than both.
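Here is a minimal simulation sketch of that selection effect. Everything in it is an assumption made for illustration, not real tennis data: height and speed are standard normal scores, `rho` is their correlation in the general population, and the hypothetical rule for turning professional is that the two scores must add up to more than `threshold`.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=100_000, threshold=2.0, rho=0.0, seed=0):
    """Compare the height/speed correlation before and after selection.

    Illustrative assumptions: both traits are standard normal scores with
    population correlation `rho`, and someone becomes a professional only
    if height + speed exceeds `threshold`.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(n):
        height = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        # Build speed so that its correlation with height is rho.
        speed = rho * height + (1 - rho ** 2) ** 0.5 * noise
        population.append((height, speed))

    # The selection step: short, slow people never make it into this list.
    pros = [(h, s) for h, s in population if h + s > threshold]

    pop_r = pearson(*zip(*population))
    pro_r = pearson(*zip(*pros))
    return pop_r, pro_r, len(pros)


for rho in (0.0, 0.2):
    pop_r, pro_r, n_pros = simulate(rho=rho)
    print(f"rho={rho}: population r={pop_r:+.2f}, "
          f"professionals r={pro_r:+.2f} (n={n_pros})")
```

Under these assumptions the correlation among the selected "professionals" comes out clearly negative, both when the traits are independent and when they are mildly positively correlated in the population. Changing `threshold` changes how strong the effect is, which is the sense in which the correlation belongs to the selection process rather than to the traits themselves.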

This may seem like a weird niche edge case, but once you start noticing it, it's everywhere.

For example, you know how popular things are often kinda bad and the really good things are niche? Is that true, or is it just that you don't notice things that are bad and unpopular?

You know how people who disagree with you seem kinda shouty? Do you notice people who disagree with you and are not shouty?

This repeats over and over again: The patterns we see in day to day life come as much from the process that causes us to observe the data as they do from any underlying causal mechanism.

Does this matter? I think it does, for two reasons:

  1. As with "correlation does not imply causation" in general, we need to be careful about letting the inferences we draw suggest actions unless we understand what the actual causal mechanism is.
  2. Because these correlations come from the process rather than reality, if the process changes then many of our old inferences become invalid.

(2) is particularly important when we experience culture shifts, because the entire generating process will have changed, so many things we thought were true become invalid.