Introduction

When we reason qualitatively, we often rely on intuition. However, intuition comes loaded with meta-biases that cloud our judgment; one of these biases comes into play when we think about “local” and “global” statements.

What is a local or a global statement? One way of distinguishing between them is by how many conditions must hold before we can make the statement - so this terminology is relative. A local statement is one that requires many more conditions (a higher degree of specificity) than a global statement. For example, any statement about a certain group of people over the past few years is local relative to a statement about all species that have ever existed on Earth.

So the crux here is whether you are talking about the whole or a subset.

Enter Simpson’s paradox (not really a paradox, though): a trend that holds in every subgroup can vanish or reverse when the subgroups are combined. The bias here is that we tend to think that if a statement is globally true, it must be locally true as well (or vice versa).

Examples and intuition

The first example

The most common example of this is the following (called Simpson’s inversion). Think about this: if batsman A’s average (runs per innings) is higher than batsman B’s average in each of 2 seasons separately, will A have a higher average than B overall?

Don’t blame yourself if you thought, on first instinct, that the answer is yes. The answer is in fact no, and this has played out historically in this very context. No points if you guessed no just because I would not be writing this post otherwise.

Before we resolve this apparent paradox, I want to point out why we are prone to this fallacy in reasoning in the first place. The key word here is average. Whenever we look at averages, we miss out on cumulative (aggregate) behaviour and instead focus on efficiency. Keep this in mind when we go through the example below (we’ll also look at this mathematically). Note that this simplification is adjacent to the one made in the representativeness heuristic, in which we discard the statistical probability of an event belonging to a group (which reflects relative frequency) and instead just look at how close the event is to a representative element of the group. This explanation will make much more sense after the example, but it’s a good guiding principle to keep in mind.

Here’s a concrete example as promised:

Let’s say A played 10 innings and B played 100 innings in the first season, and that A played 100 innings and B played 10 innings in the second season.

In the first season, A had an average of 20 runs per innings, and B had an average of 30 runs per innings. In the second season, A had an average of 80 runs per innings, and B had an average of 90 runs per innings.

So in all, A made \(20 \times 10 + 80 \times 100 = 8200\) runs and B made \(30 \times 100 + 90 \times 10 = 3900\) runs, in the same total number of innings (110 each). So A’s overall average (\(8200/110 \approx 74.5\)) is more than double B’s (\(3900/110 \approx 35.5\)), even though B’s average was higher in each season separately.
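If you like to verify such things mechanically, here is a quick Python sketch of the numbers above (the variable names are mine):

```python
# Innings counts and per-innings averages for the two seasons, from the example.
innings_a, innings_b = [10, 100], [100, 10]
avg_a, avg_b = [20, 80], [30, 90]

# Within each season, B's average beats A's.
for season in range(2):
    assert avg_b[season] > avg_a[season]

# Overall averages: total runs divided by total innings.
runs_a = sum(a * n for a, n in zip(avg_a, innings_a))  # 20*10 + 80*100 = 8200
runs_b = sum(a * n for a, n in zip(avg_b, innings_b))  # 30*100 + 90*10 = 3900
assert runs_a / sum(innings_a) > runs_b / sum(innings_b)  # A wins overall
```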

Wait, but how is that possible?

If you look closely, you can see a pattern. A played the bulk of its innings in the high-scoring season and capitalized on it, while B played most of its innings in the low-scoring season.

So in this case, the difference comes down to exploiting the opportunity when you have it. We will formalize this theme mathematically too. But before that, let’s see some real-life examples of how people make decisions based on this.

  • Pricing products for customers: let’s say you have two customers, one big and one small, and you are competing against other companies on pricing. Based solely on what will happen in the immediate future (ignoring higher-order consequences), would you rather give a discounted price to the big customer or the small one? Is it possible to give discounts to both and yet make more money than a competitor who gives no discounts at all?
  • Gerrymandering: if you could define the boundaries of electoral districts, and you knew the composition of voters, could you win the majority of districts with, say, fewer total votes than your opposing party? Hint: can you ensure that the districts you lose are packed with the other party’s voters, while the districts you win are won by a small edge? (A numeric sketch follows below.)
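Here is a minimal sketch of the gerrymandering hint in Python (all vote counts are hypothetical): a party with only 42% of the overall vote wins 3 of 5 equal-sized districts because its losses are heavily packed and its wins are narrow.

```python
# Hypothetical: 5 districts of 100 voters each; our party holds 210 of 500 votes.
# Narrow wins where it matters (51-49), heavy "packed" losses elsewhere.
our_votes = [51, 51, 51, 28, 29]  # our party's votes in each district

districts_won = sum(v > 50 for v in our_votes)
vote_share = sum(our_votes) / 500
print(f"vote share: {vote_share:.0%}, districts won: {districts_won} of 5")
# vote share: 42%, districts won: 3 of 5
```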

If you thought about these, you probably realized that there is a difference between the average outcome and the cumulative outcome (which also depends on how many times the outcome is realized). This is why local estimates are not always good proxies for global estimates.

The second example

Let’s take another guiding example that will lead us to a slightly different intuition (this time, dealing with why global estimates are not good proxies for local estimates).

On average, treatment A shows a lower survival rate than treatment B. Based on this information alone, will you have a better chance of survival if you choose B over A?

This seemingly obvious question should make you more suspicious than the previous example did. After all, why would anyone not choose treatment B?

The idea is that, once again, not looking at the complete picture leads to trouble. To see what I mean, consider the following scenario. Treatment A is used mostly by patients whose disease is at an advanced stage, rather than by those who have just been diagnosed, so the unconditional mortality rate among its patients is quite high. Meanwhile, treatment B is used mostly by patients who have just been diagnosed and did not want to undergo treatment A, with its potentially more serious side-effects, so the unconditional mortality rate among its patients is much lower. However, within each group considered separately, treatment A is more effective, and treatment B does not stand a chance in the more severe cases. So a good choice here is to identify which case you are in instead of making blanket generalizations.

So the unconditional statistics of treatment A are mostly dictated by who undergoes it, rather than by its actual efficacy.
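To make this concrete, here is a small Python sketch with hypothetical patient counts. Treatment A has the better survival rate within each stage of the disease, yet the worse unconditional rate, purely because of who receives it.

```python
# Hypothetical (patients, survivors) by disease stage at the start of treatment.
# A is given mostly to late-stage patients, B mostly to early-stage patients.
a = {"late": (90, 27), "early": (10, 9)}   # A survival: 30% late, 90% early
b = {"late": (10, 2), "early": (90, 72)}   # B survival: 20% late, 80% early

for stage in ("late", "early"):
    rate_a = a[stage][1] / a[stage][0]
    rate_b = b[stage][1] / b[stage][0]
    assert rate_a > rate_b  # A wins within each stage

overall_a = sum(s for _, s in a.values()) / sum(n for n, _ in a.values())  # 0.36
overall_b = sum(s for _, s in b.values()) / sum(n for n, _ in b.values())  # 0.74
assert overall_b > overall_a  # yet B wins unconditionally
```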

Making decisions based on purely global considerations is generally worse than looking at both the local and global aspects (i.e., doing more cross-sectional analysis). A couple of real-life examples:

  • Analyzing customer satisfaction data: let’s say a company’s global customer satisfaction score is higher for Product A than for Product B. However, looking into the statistics more deeply, B has better satisfaction scores among big customers, while A performs better for small customers. A global decision that prioritizes A for all customers would fail and potentially lead to losses, with big customers moving to competitors (this is amplified by the prevalence of Pareto distributions wherever wealth is involved). A better decision would be to prioritize B for big customers and A for small customers. Note how this also hints at why voting systems have fatal flaws when you vote on aggregates instead of using more fine-grained methods for different aspects (think monetary policy, public policy, health policy).
  • Promotion decisions in jobs: let’s say a company is deciding on the number of people to promote, and they divide people into a historically underpromoted class and a historically overpromoted class based on overall statistics. On looking more deeply, however, it turns out that the underpromoted class is more represented in a department where promotions are rare, and the overpromoted class in a department where promotions are common. So instead of blindly promoting the underpromoted class, risking the careers of employees who are not ready for such a change (while not promoting employees who are), the better strategy would be to fix the representation problem across departments.

Formalization and some math

Now that we’ve understood that one major issue behind this paradox is that local statistics are not good proxies for global statistics (and vice versa), it makes sense to quantify this.

We’ll be looking at it in the easiest setting to analyze mathematically - a partition of the universe into multiple sets, a binary choice within each of these sets (for which the expected value as well as the number of allowed realizations of the outcome are specified), and a probability of being in each set.

Let’s say we have a partition of the universe into \(n\) sets, \(U = \bigcup_{i=1}^n S_i\), and in the \(i\)-th set we can make one of two decisions: the first has expected value \(u_i\) and you must realize the outcome \(k_i\) times, and the second has expected value \(v_i\) and you must realize the outcome \(l_i\) times. The probability of the future lying in the \(i\)-th set is \(p_i\) (we could also model the problem by making a decision for each set exactly once instead of assigning these probabilities).

Simpson’s paradox in the first example can be looked at like this: we can have \(u_i < v_i\) for all \(i\), but end up having \(\frac{\sum_{i=1}^n u_i k_i}{\sum_{i=1}^n k_i} > \frac{\sum_{i=1}^n v_i l_i}{\sum_{i=1}^n l_i}\).
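Plugging in the cricket numbers, with A’s averages and innings as \((u_i), (k_i)\) and B’s as \((v_i), (l_i)\):

\[
\frac{20 \cdot 10 + 80 \cdot 100}{10 + 100} = \frac{8200}{110} \approx 74.5 > \frac{30 \cdot 100 + 90 \cdot 10}{100 + 10} = \frac{3900}{110} \approx 35.5,
\]

even though \(20 < 30\) and \(80 < 90\).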

One question that can be asked in this setting (and the question we will be thinking about for the rest of this post) is: how do you maximize (or minimize, equivalently) the total value of the bet? What if you need to maximize the expected value of the bet? What if you need to do something in between?

The independent choice case is easy (i.e., when you know which set the future lies in before making the decision):

  • If you want to maximize the expected value per realization, you should pick the outcome with the higher expected value for each set.
  • If you want to maximize the total payoff, you should pick the outcome with the higher total payoff for each set. (As the sketch below shows, these two rules can disagree within the same set.)
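Here is a minimal Python sketch (hypothetical numbers) of one set where the two objectives pick different outcomes:

```python
# One set, two choices: (expected value per realization, number of realizations).
u, k = 30, 10   # first choice: higher value per realization, few realizations
v, l = 20, 100  # second choice: lower value per realization, many realizations

best_per_realization = "first" if u > v else "second"  # picks the first
best_total = "first" if u * k > v * l else "second"    # picks the second
print(best_per_realization, best_total)                # first second
```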

This roughly corresponds to the second example above. The very fact that we knew which set the “future” was in (no temporal aspect is necessary here; you can also treat it as having extra cross-sectional information about the situation at hand) helped us make better decisions.

The independent cases were obvious: being greedy was optimal. The mathematically non-obvious scenario comes when we have dependencies, and it is hard to tell whether being greedy is optimal or not.

Let’s look at one example with dependencies. For now, let’s assume that we must either make the first choice in all sets or the second choice in all sets (i.e., we know nothing about the future except the payoffs and probabilities of outcomes conditioned on the future being in each set). Also, for the sake of simplicity, let’s assume that all the sets are equally likely.

Then we need to choose between these two payoffs: \(\sum_{i=1}^n u_i k_i\) and \(\sum_{i=1}^n v_i l_i\). This alone shows that it is important to look at the cumulative outcome instead of the average outcome.

What if there was more freedom available to us? For example, let’s say we were allowed to choose the \(k_i\)-s, but under certain constraints. Suppose, for the sake of simplicity, that these are frequency constraints: the multiset of realization counts is fixed (and hence so is their sum). In other words, something like allocating 1 realization to 4 sets, 2 realizations to 2 sets, 3 realizations to 1 set, 4 realizations to 3 sets, and so on.

So before choosing between the first choice and the second choice, we can fix how many realizations each set gets (i.e., choose the \(k_i\) according to the values of the \(u_j\)-s), and then pick the better of the two choices.

In other words, it comes down to maximizing \(\sum_{i=1}^n u_i k_i\), where \((k_i)\) varies over permutations of a given array (in more mathematical terms, the multiset of the \(k_i\) is constrained to be a certain given multiset).

The greedy choice here is to assign the largest number of realizations to the outcome with the largest expected value. And this works too - the reason is the rearrangement inequality.
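Before the formal statement, here is a quick brute-force sanity check of the greedy claim in Python (the numbers are hypothetical):

```python
from itertools import permutations

u = [20, 80, 50]    # expected value per realization in each set
ks = [10, 100, 40]  # the allowed multiset of realization counts

# Brute force: try every assignment of counts to sets.
best = max(sum(ui * ki for ui, ki in zip(u, perm)) for perm in permutations(ks))

# Greedy: pair the largest count with the largest expected value
# by sorting both sequences the same way.
greedy = sum(ui * ki for ui, ki in zip(sorted(u), sorted(ks)))

assert greedy == best  # both give 10200
```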

The rearrangement inequality is a pretty powerful inequality that formalizes exactly what we have said here already. Here’s the formal statement:

For every choice of real numbers \(x_{1}\leq \cdots \leq x_{n}\) and \(y_{1}\leq \cdots \leq y_{n}\) and every permutation \(\sigma\) of the numbers \(1, 2, \ldots, n\), we have \(x_{1}y_{n} + \cdots + x_{n}y_{1} \leq x_{1}y_{\sigma (1)} + \cdots + x_{n}y_{\sigma (n)} \leq x_{1}y_{1}+\cdots +x_{n}y_{n}\).

So if two sequences are sorted in the same way, their dot product is the maximum possible, and if they are sorted the opposite ways, their dot product is the smallest possible.

The proof is simple - for the right side, if we bubble-sort \(y\) (i.e., repeatedly swap out-of-order adjacent elements), the value stays the same or becomes larger at every swap. The left side is analogous (or follows by applying the right side with \(y\) replaced by \(-y\)).
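The key swap step, spelled out: if \(x_i \leq x_j\) and \(y_a \leq y_b\), then

\[
x_i y_a + x_j y_b - (x_i y_b + x_j y_a) = (x_j - x_i)(y_b - y_a) \geq 0,
\]

so moving the larger \(y\)-value next to the larger \(x\)-value never decreases the sum.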

This is actually pretty powerful - trivial applications of this inequality can prove, for example, the AM-GM inequality and the Cauchy-Schwarz inequality, the former being a weaker inequality but harder to prove than the latter. In this way, it is a building block of the most basic yet powerful ways of comparing numbers/outcomes.

As an aside, by varying \(\sigma\) over all permutations and taking the mean, we can also say that \(n\) times the product of the means of the two sequences lies between these bounds. This points to the fact that making decisions at a much finer granularity is likely to be much better than making decisions from just looking at the big picture. (A reader might note that this holds only if we are able to account for different outcomes at a very low level, and it does not directly apply to cases where irreducibility/emergent behaviour is a problem and the computation itself is expensive enough to distort the overall utility of the decision.)
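To see why: averaging the middle expression over all \(n!\) permutations pairs each \(x_i\) with each \(y_j\) equally often, so

\[
\frac{1}{n!} \sum_{\sigma} \sum_{i=1}^{n} x_i y_{\sigma(i)} = \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{j=1}^{n} y_j\right) = n\,\bar{x}\,\bar{y}.
\]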

Take a moment to try to connect these situations and conclusions with the examples we mentioned in the previous section.

As far as understanding Simpson’s paradox at a causal level goes, the theory of probabilistic graphical models provides conditions on probabilistic dependencies between events that detect “causal” structures allowing such counterintuitive behaviours. I won’t go into much detail on this topic here; I bring it up to illustrate the point that Simpson’s paradox is not really a paradox, but something that can be reasoned about.