This post was originally written on Codeforces; relevant discussion can be found here.
Recently someone asked me to explain how to solve a couple of problems which went like this: “Find the expected time before XYZ happens”. Note that the time to completion here is a random variable, and in probability theory, such random variables are called “stopping times” for obvious reasons. It turned out that these problems were solvable using something called martingales, which are random processes with nice invariants and a ton of properties.
I think a lot of people don’t get the intuition behind these topics and why they’re so useful. So I hope the following intuitive introduction helps people develop a deeper understanding. Note that in the spirit of clarity, I will only be dealing with finite sets and finite-time processes (unless absolutely necessary, and in such cases, the interpretation should be obvious). I will do this because people tend to get lost in the measure-theoretic aspects of the rigor that is needed and skimp on the intuition that they should be developing instead. Also, note that sometimes the explanations will have more material than is strictly necessary, but the idea is to have a more informed intuition rather than presenting a terse-yet-complete exposition that leaves people without any idea about how to play around with the setup.
People already familiar with the initial concepts can skip to the interesting sections, but I would still recommend reading the explanations as a whole in case your understanding is a bit rusty. The concept of “minimal” sets is used for a lot of the mental modelling in the post, so maybe you’d still want to read the relevant parts of the blog where it is introduced.
I would like to thank Everule and rivalq for suggesting that I write this post, and them as well as meme and adamant for proofreading and discussing the content to ensure completeness and clarity.
BONUS: Here’s an interesting paper that uses martingales to solve stuff.
Table of contents
- Probability, sigma algebras, and random variables
- Expectation of a random variable and conditional probability
- Conditional expectation with respect to a sigma algebra
- Martingales
- Stopping times
- Some math problems
- Some competitive programming problems
Probability, sigma algebras, and random variables
Let’s look at the simplified definition of a probability space first. We say simplified here because there are some technicalities that need to be sorted out for infinite sets, which we’re not considering here anyway.
A probability space is a triple \((\Omega, \mathcal{F}, P)\) consisting of
- \(\Omega\): the (non-empty) ground set, also known as the sample space. It is helpful to think of it as the set of all possible outcomes of an experiment.
- \(\mathcal{F}\): the \(\sigma\)-algebra. This is a collection of subsets of \(\Omega\). Note that this is not always the power set of \(\Omega\). It is helpful to think of its elements (the chosen subsets) as “events”. For example, when rolling a die, one possible event can be “getting an even number”. For this experiment, we have \(\Omega = \{1, 2, 3, 4, 5, 6\}\) and this event is \(\{2, 4, 6\}\).
- \(P\): the probability function, a function from \(\mathcal{F}\) to \([0, 1]\). This is just a weight that we assign to every event that we have chosen.
For this function to have some nice properties and for it to make sense as some measure of “chance”, we add some more constraints to these functions:
- \(\Omega\) is an event (i.e., \(\Omega \in \mathcal{F}\)).
- If \(A \in \mathcal{F}\), then \(\Omega \setminus A \in \mathcal{F}\). That is, if something happening is an event, then it not happening is also an event.
- If \(A \in \mathcal{F}\) and \(B \in \mathcal{F}\), then \(A \cup B \in \mathcal{F}\). That is, if \(X\) happening is an event and \(Y\) happening is an event, then at least one of \(X\) and \(Y\) happening is also an event. Note that this combined with the previous constraint allows us to say that \(A \cap B\) is an event and so on due to De Morgan’s law.
- \(P(\Omega) = 1\), that is, the probability of the whole sample space is \(1\).
- For disjoint sets \(A\) and \(B\), \(P(A \cup B) = P(A) + P(B)\). This, along with the previous constraint, is sufficient to derive identities like \(P(\emptyset) = 0\), \(P(A) + P(\Omega \setminus A) = 1\) and so on.
Let’s now build some intuition about this definition (the following, especially the part about minimal sets, applies only to finite sets, but there are a lot of similarities in the conclusions when we try to generalize to other sets using measure-theoretic concepts).
Note that since \(\mathcal{F}\) is closed under intersections, if we repeat the following procedure multiple times: take two sets and replace them with their intersection, we will end up with some “minimal” non-empty sets that we can’t break further (this terminology is not standard at all, but we will use it a lot in what follows). Consider the set of all such possible minimal sets. Clearly, these sets form a partition of \(\Omega\), since they are disjoint (otherwise we would contradict minimality) and every element is in a minimal set (using the fact that \(\mathcal{F}\) is closed under complements and constructing a minimal set if needed, or by just intersecting all sets containing that element; there is at least one such set, namely \(\Omega\)). Now since \(\mathcal{F}\) is also closed under unions, it also equals the set of all possible unions of these minimal subsets. In other words, we have a partition of the elements of \(\Omega\) into some sets, and \(\mathcal{F}\) is the set formed by all possible unions of sets in that partition. We can now think of \(\sigma\)-algebras in terms of partitions! Also, note that the probabilities of these minimal events of the \(\sigma\)-algebra add up while taking unions.
In some sense, we can now construct an equivalent probability space \((\Omega’, \mathcal{F}’, P’)\) where \(\Omega’\) is the set of minimal events of \(\mathcal{F}\), \(\mathcal{F}’\) is the power set of \(\Omega’\), and \(P’\) is defined naturally from \(P\). This definition is not standard but will be useful later on. Note that if \(\mathcal{F}\) was the power set of \(\Omega\), then this equivalent probability space would have been the same as the original probability space. In this sense, all other \(\sigma\)-algebras are “coarser” than the power set, whose minimal sets are all singletons.
At this point, we should realize that \(\mathcal{F}\) is a measure of how much “information” we have about \(\Omega\) (we can’t distinguish between two elements in the same minimal set). This idea, along with the partition intuition, is a good way to deal with how you gain information over time in random processes, but we’ll come to that a bit later.
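To make this concrete, here’s a small Python sketch (the function name and the toy \(\sigma\)-algebra are mine, purely for illustration) that recovers the minimal sets of a finite \(\sigma\)-algebra by intersecting, for each element, all events that contain it:

```python
def minimal_sets(omega, F):
    """Recover the minimal (atomic) events of a finite sigma-algebra F on omega.

    The minimal set containing x is the intersection of all events containing x.
    """
    atoms = set()
    for x in omega:
        atom = set(omega)
        for event in F:
            if x in event:
                atom &= event
        atoms.add(frozenset(atom))
    return atoms

# A sigma-algebra on {1, 2, 3, 4} whose minimal sets are {1, 2}, {3}, {4}:
omega = {1, 2, 3, 4}
F = [set(), {1, 2}, {3}, {4}, {1, 2, 3}, {1, 2, 4}, {3, 4}, {1, 2, 3, 4}]
print(minimal_sets(omega, F))  # {frozenset({1, 2}), frozenset({3}), frozenset({4})}
```

Every event in \(\mathcal{F}\) is then a disjoint union of these minimal sets, which is exactly the partition picture described above.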
Let’s now turn to the formal definition of a random variable.
A random variable with respect to a probability space \((\Omega, \mathcal{F}, P)\) is a function \(X\) from \(\Omega\) to \(\mathbb{R}\) such that \(X^{-1}((-\infty, x]) \in \mathcal{F}\) for all \(x \in \mathbb{R}\). That is, the set of pre-images of real numbers \(\le x\) should be an element of \(\mathcal{F}\).
Note that since we are dealing with finite sets, we will use an alternative (but equivalent for finite sets) definition where we replace the interval \((-\infty, x]\) by \(\{x\}\).
When written like this, it might be a bit intimidating, but the underlying idea is quite simple. We have a sample space \(\Omega\), and we have another set \(\mathbb{R}\). Now we want to map elements of \(\Omega\) to \(\mathbb{R}\) in some way so that we can do some computations over elements of \(\Omega\). However, since in a probability space, \(\mathcal{F}\) tells us how much information we have (or are allowed to look at), we shouldn’t be able to distinguish between elements of \(\Omega\) that are inside the same minimal event of \(\mathcal{F}\). Since there are finitely many elements in \(\Omega\), and elements inside the same minimal event should be mapped to the same real number, there can be only finitely many intervals that \(X\) partitions \(\mathbb{R}\) into, and the condition in the definition is necessary and sufficient (due to closedness of \(\mathcal{F}\) under intersection and unions).
As a side-note, note that we can also define functions of a random variable at this point by simply composing the function \(X\) with a function from \(\mathbb{R}\) to \(\mathbb{R}\). Similarly, if we have two random variables on the same probability space, we can also define functions of them quite naturally. After all, we just need to specify their values on the minimal sets and we’re done.
Expectation of a random variable and conditional probability
Let’s try to investigate random variables a bit further. Let’s say we have a probability space \((\Omega, \mathcal{F}, P)\) and a random variable \(X\) on this probability space. How do we capture some sort of a mean value of \(X\) over the entire probability space? One possible answer is the following: for each minimal event, we consider the common value of the random variable at each element of that event, and add it to a weighted sum with weight equal to the probability of that minimal event. Note that minimal events form a partition so the sum of these weights is \(1\). In some sense, this is an average of the random variable over all minimal events, weighted by their importance.
But how do we write this up mathematically without referring to minimal events? Let’s take advantage of the condition in the definition of a random variable. Instead of summing over minimal events, we can sum over distinct values \(x\) of the random variable, and weight it by the probability of that variable being equal to \(x\). This probability is well-defined due to the condition in the definition of a random variable (because the event that the random variable equals \(x\) is in fact a disjoint union of minimal events and their probabilities add up). This definition is also consistent with our earlier intuition.
So we can define the expected value of a random variable \(X\) to be \(E[X] := \sum_{x \in \text{range}(X)} x \cdot P(X = x)\).
Note that by our earlier definition related to minimal events, we can also clearly see that for any random variables \(X, Y\) on the same probability space, we have \(E[X + Y] = E[X] + E[Y]\).
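As a quick sanity check, here’s a tiny Python sketch (the probabilities and values are made up) that computes the expectation both ways: as a weighted sum over minimal events, and as \(\sum_x x \cdot P(X = x)\):

```python
from fractions import Fraction as Fr

# Minimal events (as tuples of outcomes) with their probabilities, and a random
# variable X that is constant on each minimal event.
P = {('a', 'b'): Fr(1, 2), ('c',): Fr(1, 3), ('d',): Fr(1, 6)}
X = {('a', 'b'): 1, ('c',): 4, ('d',): 10}

# Expectation as a weighted sum over minimal events...
E1 = sum(P[A] * X[A] for A in P)

# ...equals the sum over distinct values x of x * P(X = x).
E2 = sum(x * sum(P[A] for A in P if X[A] == x) for x in set(X.values()))

print(E1, E2)  # 7/2 7/2
```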
Let’s now change gears and think a bit about how we can isolate an event. Suppose we have an event \(E\). Let’s say that we want to ignore everything else, and only look at the elements of \(\Omega\) in \(E\) (and consequently the minimal events of \(\mathcal{F}\) that form a partition of \(E\)). Can we make a probability space out of it? One easy way would be to just inherit the structure of \(\Omega\) and \(\mathcal{F}\) and set the probabilities of everything that uses elements outside of \(E\) to be \(0\), but we run into trouble if we simply decide to inherit the probability function: the sum of probabilities over the minimal events constituting \(E\) is not necessarily \(1\). One possible fix is to normalize by \(P(E)\), which is possible iff it is non-zero. It can easily be checked that the resulting probability space is indeed a valid probability space.
This is how we define the conditional probability space wrt an event \(E\) with \(P(E) > 0\). The sample space and the \(\sigma\)-algebra remain the same, but the probability function becomes \(P(A | E) := P’(A) = P(A \cap E) / P(E)\). This can be verified to be consistent with our earlier reasoning by considering minimal elements.
A cleaner way of looking at it is that we’re really only concerned with some subset of \(\Omega\) and \(\mathcal{F}\) that correspond to elements in \(E\), and we define a probability space with \(E\) as the ground set. But just to make sure that we can interoperate with the other events in \(\mathcal{F}\), we take the event, trim down the parts that are not in \(E\), and then use the probability (or relative weight) of that set wrt the probability (or weight) of \(E\) instead.
One corollary of this is the following: \(\sum_{E_i} P(A | E_i) P(E_i) = P(A)\) for a partition of \(\Omega\) into events \(E_i\) (the law of total probability). This can be seen to be true if we consider the minimal sets in \(A\). This identity is used while solving a lot of recurrences, and it is also what makes Bayes’ theorem work.
This also has an expected value variant: if we naturally restrict a random variable to each conditional probability space and let \(E[X | E_i]\) be the expected value of \(X\) wrt the conditional probability space for \(E_i\), then by summing over minimal sets, we get \(\sum_{E_i} E[X | E_i] P(E_i) = E[X]\). This variant is used in recurrences too, like finding the expected time before we toss two heads in a row with a fair coin (though the sample space is infinite there, these identities still hold).
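As an illustration, here’s a minimal sketch of that last computation (the state names are mine, and numpy is just one way to solve the resulting \(2 \times 2\) system): conditioning on the next toss from each “progress” state turns the identity above into two linear equations in the expected remaining times.

```python
import numpy as np

# States: s0 = no trailing head, s1 = one trailing head. With E0, E1 being the
# expected numbers of remaining tosses, conditioning on the next toss gives
#   E0 = 1 + E0/2 + E1/2   (tails -> back to s0, heads -> s1)
#   E1 = 1 + E0/2          (tails -> back to s0, heads -> done)
A = np.array([[0.5, -0.5],   # E0 - E0/2 - E1/2 = 1
              [-0.5, 1.0]])  # E1 - E0/2        = 1
b = np.array([1.0, 1.0])
E0, E1 = np.linalg.solve(A, b)
print(E0, E1)  # 6.0 4.0: six expected tosses before two heads in a row
```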
Conditional expectation with respect to a sigma algebra
Let’s continue with the partition and information analogy even further. Recall how we had defined an equivalent probability space. It corresponds to collapsing equivalence classes (determined by the partition) into a single element, and making a probability space on that as the sample space. What if we do that again, after defining another sparser/coarser probability space on the resulting probability space?
Of course, this leads to a coarser partition of the sample space in some sense, and if \(\mathcal{F}_1\) is the original \(\sigma\)-algebra, and \(\mathcal{F}_2\) is the new \(\sigma\)-algebra (both of them corresponding to the same \(\Omega\), after “unwrapping/unrolling” the second coarsening), then \(\mathcal{F}_2 \subseteq \mathcal{F}_1\).
Now suppose we had a random variable \(X\) on the original probability space. How do we reduce/adapt the random variable to this coarser probability space? This is where expected value comes into the picture. We will choose a definition for conditional expectation, and see that it satisfies a lot of good properties and is consistent with our conditional probability identities that we figured out in the previous section.
Note that \(X\) is constant on all minimal sets of \(\mathcal{F}_1\). Let’s define the conditional expectation of \(X\) with respect to the \(\sigma\)-algebra \(\mathcal{F}_2\) as the random variable \(E[X | \mathcal{F_2}] := X’\) such that on every element of a minimal set \(E_i\) of \(\mathcal{F}_2\), \(X’\) assumes the value \(E[X | E_i]\). This essentially means that we are trying to take a “conditional average” of the information about the values of \(X\) on the set \(E_i\). It is easy to verify that \(E[X’] = E[X]\), where the first expectation is taken over the coarser probability space, and the second is taken on the finer probability space. In a case of unfortunate naming, \(X’\) is called the conditional expectation of \(X\) wrt the \(\sigma\)-algebra \(\mathcal{F}_2\).
To make this a bit clearer, let’s take an example.
Example
Let’s say that we have a probability space \((\Omega, \mathcal{F}_1, P)\), a random variable \(X\) on this space, and another \(\sigma\)-algebra \(\mathcal{F}_2 \subseteq \mathcal{F}_1\) as follows:
- \(\Omega = \{1, 2, \dots, 10\}\)
- \(\mathcal{F}_1\) has minimal sets \(\{1, 2, 3\}, \{4\}, \{5, 6\}, \{7\}, \{8, 9, 10\}\).
- \(P(S)\) for these minimal sets \(S\) are \(1/8, 1/8, 1/4, 1/3, 1/6\) respectively.
- \(\mathcal{F}_2\) has minimal sets \(\{1, 2, 3, 5, 6\}, \{4\}, \{7, 8, 9, 10\}\).
- \(X\) takes the values \([1, 1, 1, 2, 3, 3, 3, 4, 4, 4]\) for \([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\) respectively.
Then, \(E[X | \mathcal{F}_2]\) is determined as follows:
- On the minimal event \(S = \{1, 2, 3, 5, 6\}\), the conditional expectation \(E[X | S] = (1/8 + 3/4) / (1/8 + 1/4) = 7/3\).
- On the minimal event \(S = \{4\}\), the conditional expectation \(E[X | S] = 2\).
- On the minimal event \(S = \{7, 8, 9, 10\}\), the conditional expectation \(E[X | S] = (3/3 + 4/6)/(1/3 + 1/6) = 10/3\).
So the random variable \(E[X | \mathcal{F}_2]\) takes the values \([7/3, 7/3, 7/3, 2, 7/3, 7/3, 10/3, 10/3, 10/3, 10/3]\) for \([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\) respectively.
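Here’s a short Python verification of this example (the data layout is my own; the numbers are exactly the ones above):

```python
from fractions import Fraction as Fr

# Minimal sets of F1 as (elements, probability, value of X on that set).
F1 = [((1, 2, 3),  Fr(1, 8), 1),
      ((4,),       Fr(1, 8), 2),
      ((5, 6),     Fr(1, 4), 3),
      ((7,),       Fr(1, 3), 3),
      ((8, 9, 10), Fr(1, 6), 4)]

# Minimal sets of the coarser sigma-algebra F2.
F2 = [{1, 2, 3, 5, 6}, {4}, {7, 8, 9, 10}]

for S in F2:
    # E[X | S]: probability-weighted average of X over minimal sets of F1 in S.
    num = sum(p * x for A, p, x in F1 if set(A) <= S)
    den = sum(p for A, p, _ in F1 if set(A) <= S)
    print(sorted(S), num / den)
# [1, 2, 3, 5, 6] 7/3
# [4] 2
# [7, 8, 9, 10] 10/3
```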
We note a couple of points here:
- In a way, we are trying to replace the random variable with its “best approximation” in some sense (for instance, in a variance-minimization sense: among random variables that are constant on the minimal sets of \(\mathcal{F}_2\), it minimizes the expected squared difference from \(X\)). Also note that the expectation of the random variable does not change when we apply this coarsening to \(X\).
- Note that for the trivial \(\sigma\)-algebra \(\mathcal{F}_2 = \{\emptyset, \Omega\}\), the random variable \(E[X | \mathcal{F_2}]\) is identically the constant function that equals \(E[X]\).
We can also chain a couple of coarsenings. Suppose \(\mathcal{F}_3 \subseteq \mathcal{F}_2 \subseteq \mathcal{F}_1\) and \(X\) is a random variable wrt the probability space \((\Omega, \mathcal{F}_1, P)\). Then we have \(E[E[X | \mathcal{F}_2] | \mathcal{F}_3] = E[X | \mathcal{F_3}]\), as can be verified easily by chaining conditional expectations (this is often called the tower property). Indeed, the fact that the expectation remains the same after coarsening is a direct consequence of this and the previous point.
A more commonly used term is conditional expectation with respect to another random variable. It is usually defined as this:
Suppose \(X\) and \(Y\) are random variables on the same probability space. For any element \(\omega\) of the sample space where \(Y(\omega) = y\), we define \(E[X | Y]\left(\omega\right) = E[X | E]\) where \(E\) is the event that \(Y = y\). That is, for every element where \(Y\) assumes a value of \(y\), we take the conditional expectation wrt the event that \(Y = y\) and set the value of the answer to that conditional expectation.
However, this is the same as saying this: let \(\mathcal{F}_Y\) be the \(\sigma\)-algebra where the minimal sets correspond to distinct values of \(Y\). Then \(E[X | Y]\) is identically equal to the random variable \(E[X | \mathcal{F}_Y]\). As a side-note, this constructed \(\sigma\)-algebra is called the \(\sigma\)-algebra generated by the random variable \(Y\).
In this sense, conditional expectation with respect to a \(\sigma\)-algebra is a cleaner and more sophisticated concept than conditional expectation with respect to a random variable, though both concepts are equally powerful here (in one direction, condition on the generated \(\sigma\)-algebra; in the other, any \(\sigma\)-algebra in our finite setting is generated by a random variable that takes a distinct value on each of its minimal sets).
Another way of thinking about this concept is to think of it as projecting down from a set of minimal events to a coarser set of minimal events. For each chunk of minimal events we’re collapsing, we replace it with a single minimal event by the “center of mass” of those events, where the “position” is the value of the random variable, and the “mass” equals the probability of those minimal events. This works out mathematically too, and is precise for this setup.
Just for the sake of completeness: the above definition captures what conditional expectation with respect to a \(\sigma\)-algebra does, but it breaks down when you try to generalize it to infinite sample spaces. So here’s the “official definition” that people use (which is quite counter-intuitive and takes some time to digest):
A conditional expectation of \(X\) given \(\mathcal{H}\), denoted as \(E[X | \mathcal{H}]\), is any random variable on \((\Omega, \mathcal{H}, P)\) which satisfies \(\int_H E[X | \mathcal{H}] \, dP = \int_H X \, dP\) for each \(H \in \mathcal{H}\).
This essentially says the same thing when you think about the possible \(H\), but in a more non-constructive way. It turns out that this random variable is not unique in the strictest sense, but \(P(Y \ne Z) = 0\) holds for any two candidates \(Y\) and \(Z\). Also, this definition makes it clear that \(X - E[X | \mathcal{H}]\) is orthogonal to the indicator function of every \(H \in \mathcal{H}\) (so we get a more precise version of “projecting down”).
Martingales
Let’s say there is a “process” of some sort, where at every step, we get more information. Let’s say our sample space for the whole experiment is \(\Omega\). Here, \(\Omega\) is almost always infinite, but the intuition we built carries forward here, even though the concepts like minimal sets might not. Let’s take a very simple example: we toss a coin at each step. The sample space for the whole experiment is the set of all infinite binary strings. Let’s think about the kind of information that we have available at various steps in the process. After step 0, all the information we have (for instance, the complete information about a random variable that has information about what has happened until now) corresponds to the trivial \(\sigma\)-algebra \(\{\emptyset, \Omega\}\). After step \(i\), we have information for the first \(i\) coin tosses, so the information corresponds to the \(\sigma\)-algebra with “minimal” sets being the sets of binary strings which are the same in the first \(i\) positions.
To understand why \(\sigma\)-algebras are necessary to capture information, perhaps the following analogy helps. Think in terms of minimal sets. Each time we get some more information, we might be able to distinguish between elements in minimal sets. To be able to do operations like finding the expectation of some random variable in this scenario, we need to be able to distinguish between these elements too. In some sense, you’re inspecting multiple parallel universes at the same time, and in each universe, you have a certain prefix of things that happened, for each of which you want to be able to analyze what happens later.
Note that we always get more information at every stage of such a stochastic process. So we define a filtration as follows (without any reference to a process, for the sake of generality): a sequence of \(\sigma\)-algebras \(\mathcal{F}_i\) such that \(\mathcal{F}_i \subseteq \mathcal{F}_j\) whenever \(i \le j\).
Now let’s get to what a process really is. At every step of a process, a new random variable shows up. But what is the common \(\sigma\)-algebra that all these random variables are defined with respect to? To handle this, we take a \(\sigma\)-algebra \(\mathcal{F}\) that captures all possible information. So we define a stochastic process as follows:
For a given probability space \((\Omega, \mathcal{F}, P)\), and an index set \(I\) with a total order \(\le\), a stochastic process \(X\) is a collection of random variables \(\{X(i) | i \in I\}\).
For our purposes, \(I\) will mostly be the non-negative integers, though \(I = \mathbb{R}\) is used extensively in a lot of contexts. \(I\) having a total order means that there is some concept of time associated with \(I\).
How do we now construct a “natural filtration” that captures the information that the first \(i\) steps of the process (i.e., \(\{X(j) | j \le i\}\)) give us? Remember how we said we can construct a \(\sigma\)-algebra from a random variable? We can also use the same construction to construct it from a finite set of random variables, by refining two partitions together while it is possible (this corresponds to the join of the two partitions in the refinement lattice, if you’re familiar with posets and lattices). So the filtration can be set up this way too, by refining the current \(\sigma\)-algebra with a new random variable each time. More formally, paraphrasing from Wikipedia,
Let \((X_n)_{n \in \mathbb{N}}\) be a stochastic process on the probability space \((\Omega, \mathcal{F}, P)\). Then \[\mathcal{F}_n := \sigma(X_k \mid k \le n)\] is a \(\sigma\)-algebra and \(\mathbb{F} = (\mathcal{F}_n)_{n \in \mathbb{N}}\) is a filtration. Here \(\sigma(X_k \mid k \le n)\) denotes the \(\sigma\)-algebra generated by the random variables \(X_1, X_2, \dots, X_n\). \(\mathbb{F}\) is a filtration, since by definition all \(\mathcal{F}_n\) are \(\sigma\)-algebras and \(\sigma(X_k \mid k \le n) \subseteq \sigma(X_k \mid k \le n + 1)\).
This is known as the natural filtration of \(\mathcal{F}\) with respect to \(X\).
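Here’s a tiny sketch of this for three coin tosses (the layout is mine): the minimal sets of \(\mathcal{F}_i\) group outcomes that agree on the first \(i\) tosses, and the partitions visibly refine as \(i\) grows.

```python
from itertools import product

omega = [''.join(s) for s in product('HT', repeat=3)]  # all outcomes of 3 tosses

for i in range(4):
    # Minimal sets of F_i: outcomes that share the same first i tosses.
    atoms = {}
    for w in omega:
        atoms.setdefault(w[:i], []).append(w)
    print(f"F_{i} has {len(atoms)} minimal sets:", list(atoms.values()))
```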
Finally, we are able to define what a martingale is. Note that if \(\sigma(X_i) \subseteq \mathcal{H}\) for some \(\sigma\)-algebra \(\mathcal{H}\), then \(E[X_i | \mathcal{H}] = X_i\) (think about the minimal sets and the definition). A martingale essentially says that conditional on the information we have until time \(i\), a random variable \(X_j\) for \(j \ge i\) looks like \(X_i\). More formally,
A martingale is a stochastic process \(X_1, X_2, \dots\) such that \(E[X_{n + 1} | \sigma(X_1, X_2, \dots, X_n)] = X_n\), and \(E[|X_n|] < \infty\).
Equivalently, the first condition can be written as \(E[X_{n + 1} - X_n | \sigma(X_1, X_2, \dots, X_n)] = 0\). Note that we talk about “almost sure” equality of random variables when dealing with more complicated probability spaces, but for our purposes, we will not concern ourselves with this detail.
Note that the one-step condition is equivalent to the seemingly stronger condition \(E[X_j \mid \sigma(X_1, \dots, X_i)] = X_i\) for all \(j \ge i\): one direction is trivial, and the other follows because \(X_j - X_i = (X_j - X_{j - 1}) + (X_{j - 1} - X_{j - 2}) + \dots + (X_{i+1} - X_i)\), and applying the nested coarsening (tower) property to each summand gives \(E[X_j - X_i | \sigma(X_1, \dots, X_i)] = 0\).
Why is a martingale so useful? Firstly, it gives us an invariant, and as is well-known, invariants make life easier when we try to prove results in math. Secondly, it originated in the theory of gambling, and for fair gambling games, the expected value of the profit at a later point in time should be equal to the profit at that point in time, which is precisely the martingale condition!
Let’s look at some constructions of martingales.
- Consider a random walk, where the initial position is \(X_0 = 0\), and \(X_i = X_{i-1} \pm 1\) each with probability \(p\) (for some \(p \le 1/2\)), and \(X_i = X_{i-1}\) with probability \(1 - 2p\). Then this is a martingale (directly from the definition).
- Consider a random walk denoting the total assets of a gambler who keeps reinvesting all his capital into the game. The game gives a payout of \(r\) percent if he wins (with probability \(p\)), a loss of \(r\) percent if he loses (with probability \(p\)), and no payout in case of a draw (with probability \(1 - 2p\)). Then his wealth is a martingale, as the quick simulation sketch below illustrates.
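Here’s a minimal Monte Carlo sketch of the second construction (the parameters \(p = 0.3\), \(r = 0.1\) and the starting wealth are arbitrary choices of mine): the empirical mean of the wealth stays at its initial value, exactly as the martingale property predicts.

```python
import random

def wealth_after(n, p=0.3, r=0.1, start=100.0):
    """One trajectory of the reinvesting gambler: wealth is multiplied by
    (1 + r) with probability p, by (1 - r) with probability p, else unchanged."""
    w = start
    for _ in range(n):
        u = random.random()
        if u < p:
            w *= 1 + r
        elif u < 2 * p:
            w *= 1 - r
    return w

random.seed(0)
trials = [wealth_after(50) for _ in range(100_000)]
print(sum(trials) / len(trials))  # ~100.0, i.e., E[X_n] = X_0
```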
There are many more examples of martingales, but I won’t go too much in depth about that, and would leave you to refer to some other resource for more examples.
Some more intuition about martingales: let’s see what happens to the conditional expectation in the definition of a martingale when we apply a convex function to a stochastic process (i.e., to all random variables in the stochastic process). Since probabilities are all positive, whenever we’re taking the conditional expectation, we are taking a convex combination of the values of the convex function of a random variable. By Jensen’s inequality, we get that rather than equality, \(\ge\) holds in the defining equation for martingales. Similarly, if we had a concave function, \(\le\) would hold in the defining equation. These types of processes (where \(\ge\) and \(\le\) hold instead of \(=\)) are called sub-martingales and super-martingales respectively. Note that not all stochastic processes are one of sub-/super-martingales. Note that \(\ge 0\) and \(\le 0\) means that the random variable on the LHS is almost surely non-negative and non-positive, respectively.
An example of a sub-martingale
Let’s say we have a martingale \(X\). Then we claim that \(X^2\) is a sub-martingale. Note that by Jensen’s inequality, we have \(\sum_{i = 1}^N \lambda_i f(x_i) \ge f \left(\sum_{i = 1}^{N} \lambda_i x_i\right)\), for convex \(f\) and non-negative \(\lambda_i\) that sum to \(1\). Applying this while coarsening the minimal sets of the finer \(\sigma\)-algebra (which is done while taking the conditional expectation) gives us that \(E[X_{N + 1}^2 \mid \sigma(X_{i}^2)_{i = 0}^{N}] \ge X_N^2\), which proves the claim. In fact, in the case when \(X_n\) is the gambler’s fortune when profit and loss (of equal magnitude) are equally likely, \(X_n^2 - n\) is a martingale. Note that with the most general definition of a martingale, you could do just fine with the earlier filtration (i.e., \(\sigma(X_i)\)). For that, all that is needed is for a process to be adapted to the filtration, but since we defined martingales on the natural filtration, we went ahead with the more obvious filtration.
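A quick empirical check of that last claim (the walk length and trial count are arbitrary): for the symmetric \(\pm 1\) walk we have \(E[X_n^2] = n\), so \(X_n^2\) drifts upward while \(X_n^2 - n\) keeps a constant (zero) mean.

```python
import random

random.seed(1)
n, trials = 40, 100_000
# Empirical mean of X_n^2 over many independent symmetric +-1 walks.
mean_sq = sum(sum(random.choice((-1, 1)) for _ in range(n)) ** 2
              for _ in range(trials)) / trials
print(mean_sq, mean_sq - n)  # ~40.0 and ~0.0
```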
I won’t go into a lot of detail about martingales (since they deserve a field of study of their own), but will just point out some nice results, that help to get a feel of how martingales behave:
Azuma’s inequality: this provides a bound on how “close” we will be to the initial value of the random variable given bounds on the movement of the martingale (or how much in the wrong direction we can go for a sub-/super-martingale). Let’s say \(\{X_k\}\) is a super-martingale (which a martingale also is), and let’s suppose \(|X_k - X_{k - 1}| \le c_k\) almost surely. Then we have \(P(X_N - X_0 \ge \varepsilon) \le \exp(-\varepsilon^2/(2 \sum_{k = 1}^N c_k^2))\). Since the negative of a super-martingale is a sub-martingale, we get for a sub-martingale \(X\), \(P(X_N - X_0 \le -\varepsilon) \le \exp(-\varepsilon^2/(2 \sum_{k = 1}^N c_k^2))\). Since a martingale is both a sub-martingale and a super-martingale, we have a probabilistic bound on how close \(X_N\) will be to \(X_0\), that is, \(P(|X_N - X_0| \ge \varepsilon) \le 2 \exp(-\varepsilon^2/(2 \sum_{k = 1}^N c_k^2))\). There is a stronger version of this inequality, but I think this is quite sufficient to see how well-behaved martingales are.
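Here’s a minimal empirical comparison with Azuma’s bound on the symmetric \(\pm 1\) walk, where \(c_k = 1\) (the parameters are mine); the observed tail probability sits comfortably under the bound.

```python
import math
import random

random.seed(2)
N, eps, trials = 100, 30, 100_000
# Count how often the walk strays at least eps from its start after N steps.
hits = sum(abs(sum(random.choice((-1, 1)) for _ in range(N))) >= eps
           for _ in range(trials))
bound = 2 * math.exp(-eps ** 2 / (2 * N))  # Azuma with c_k = 1 for all k
print(hits / trials, "<=", bound)  # ~0.004 <= ~0.022
```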
Doob’s martingale inequality: this is a Markov-style inequality on the probability that a sub-martingale will exceed a value \(x\). More formally, for a sub-martingale \(X\) and any \(x > 0\), we have \(P(\max_{1 \le i \le n} X_i \ge x) \le E[\max(X_n, 0)] / x\). This can also be used to show Markov’s inequality.
Doob’s martingale convergence theorem: this is a result that says that a super-martingale that is bounded from below will converge (almost surely) to a random variable with finite expectation. Note that if we bar the case when the variable diverges to infinity in some sense, the other possible way for it to not converge is to oscillate. This result roughly says that bounded martingales don’t oscillate.
Stopping times
Stopping times are quite interesting mathematical objects, and when combined with martingales, they give a lot of interesting results and are a very powerful tool.
Suppose we are in the gambler’s shoes, and we want to write an algorithm to finish the game, depending on certain outcomes in the process. Stopping times are a way to formalize the analysis of when such exit conditions are reached. The best we can do right now, given the current theory we have, is to define it as a random variable (that may or may not depend on the stochastic process). We also have the constraint that we can’t see the future, so the decision about when to stop must be taken with all the information at the current time.
To that end, let’s consider a random variable \(\tau\) (taking values in the index set, which here is the set of non-negative integers) on a probability space \((\Omega, \mathcal{F}, P)\), and let’s say we have a stochastic process \(X\) on the same probability space with an associated filtration \(\{\mathcal{F}_i\}\). We say that \(\tau\) is a stopping time if \(\{\tau = t\}\) is an event in \(\mathcal{F}_t\). That is, the decision to stop at time \(t\) must be taken with information no more than what we have at time \(t\). To also keep this consistent with real-number indexing, we modify the condition to \(\{\tau \le t\}\) being an event in \(\mathcal{F}_t\).
To understand why this ensures no forward-looking, consider the following. We bunch together elements of \(\Omega\) that have stopping time \(\le t\), and the condition is that we shouldn’t be able to distinguish them without information from after time \(t\). Similarly, it must not be the case that two elements of \(\Omega\) in the same minimal set of \(\mathcal{F}_t\) have different stopping times, unless both are \(> t\). In our coin flip example, the latter makes sense if you think about why you can’t assign a stopping time of \(4\) to 000010110… and a stopping time of \(6\) to 000011000…: the two strings agree in their first five positions, so they lie in the same minimal set of \(\mathcal{F}_4\), and the decision to stop at time \(4\) can’t tell them apart.
Another way of formalizing the same thing is to say that the stochastic process (denoting the stopping of a process) defined by \(Y_t = 0\) if \(\tau < t\) and \(1\) otherwise is an adapted process wrt the filtration. That is, \(\sigma(Y_t) \subseteq \mathcal{F}_t\) for all \(t\).
Since the stopping time is an index, the most natural thing at this point is to think of how to index the stochastic process with some function of stopping time. If we go back to the definition of a random variable, it should be clear that intuitively we need to assign a value to it at every element in \(\Omega\). In simple cases, this is trivial, since we can just compute the value of \(\tau\) at the minimal set which has this element (let’s say this value is \(t\)), and compute the value of \(X_t\) at the minimal set again. From here, it should be obvious that \(X_\tau\) is indeed a random variable on the original probability space, and it denotes the value of the stochastic process when the process is exited. Also note that by the same reasoning, \(X_{\min(\tau, t)}\) is a random variable whose induced \(\sigma\)-algebra is a subset of \(\mathcal{F}_t\). \(X_{\min(\tau, t)}\) corresponds to a stopped process, which means that we just arbitrarily stopped a process after a time limit. These are very important types of processes, and have been extensively studied.
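Here’s a small sketch of a stopped process (the barrier and horizons are mine): a symmetric walk frozen at the moment it first hits \(\pm 10\). Note how the mean of \(X_{\min(\tau, t)}\) stays at \(0\) for every horizon \(t\); this is a preview of the optional stopping theorem below.

```python
import random

def stopped_value(t, barrier=10):
    """X_{min(tau, t)} for a symmetric walk, tau = first time |X| hits barrier."""
    x = 0
    for _ in range(t):
        if abs(x) >= barrier:  # already stopped: the process stays frozen
            break
        x += random.choice((-1, 1))
    return x

random.seed(3)
for t in (5, 50, 500):
    vals = [stopped_value(t) for _ in range(20_000)]
    print(t, sum(vals) / len(vals))  # ~0.0 for every horizon t
```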
The cool part about martingales and stopping times together is that this setup has been studied a lot, and there are quite a few important results that I’ll list without proof, which can help us solve a lot of problems as well as build an understanding around stopping times and martingales:
Martingale central limit theorem: This is a generalization of the usual central limit theorem to martingales. The associated process is the usual martingale, but with differences between consecutive values being bounded by a fixed bound almost surely. We choose a stopping time as follows: at each step \(i + 1\), compute the conditional variance of \(X_{i + 1}\) wrt \(\mathcal{F}_i\), and keep adding it to a counter. Stop once this counter is more than a certain value \(\sigma^2\). (In a sense, we are waiting till we have collected a large enough variance/move). If the stopping time here is \(\tau_\sigma\), then the random variable \(X_{\tau_\sigma} / \sigma\) converges to a normally distributed variable with mean \(0\) and variance \(1\), as \(\sigma\) tends to \(\infty\). This also assumes that the sum of variances diverges.
Doob’s optional stopping theorem: This is a very important theorem, and will be quite useful to compute the expected values of stopping times later on. It talks about the fairness of a martingale when we stop after a random stopping time, in the sense that we developed above. The metric for fairness is that \(E[X_\tau] = E[X_0]\), and it turns out that this is true if any of the following three conditions holds (a simulation sketch follows the list):
- The stopping time \(\tau\) is bounded above.
- The martingale is bounded and the stopping time \(\tau\) is finite almost surely (i.e., \(P(\tau = \infty) = 0\)).
- The expected value of \(\tau\) is finite, and the difference between \(X_{i + 1}\) and \(X_i\) is bounded. Note that this can be used to compute the stopping time if we modify the martingale a bit, as we shall see later.
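Here’s the promised sketch, on the classic gambler’s ruin setup (the values \(a = 3\), \(b = 10\) are mine). The third condition applies, and optional stopping on \(X_n\) gives \(P(\text{hit } b) = a/b\), while optional stopping on the modified martingale \(X_n^2 - n\) gives \(E[\tau] = a(b - a)\):

```python
import random

def ruin_run(a=3, b=10):
    """Symmetric walk from a, stopped on hitting 0 or b; returns (endpoint, time)."""
    x, t = a, 0
    while 0 < x < b:
        x += random.choice((-1, 1))
        t += 1
    return x, t

random.seed(4)
runs = [ruin_run() for _ in range(100_000)]
p_hit_b = sum(x == 10 for x, _ in runs) / len(runs)
mean_tau = sum(t for _, t in runs) / len(runs)
# E[X_tau] = X_0 = a       =>  b * P(hit b) = a  =>  P(hit b) = 3/10.
# E[X_tau^2 - tau] = a^2   =>  E[tau] = a(b - a) =>  E[tau] = 21.
print(p_hit_b, mean_tau)  # ~0.3 ~21.0
```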
Some math problems
We’ll start out with some math problems that I have seen over the years (apologies since I don’t remember the sources for some of them).
Problem 1: Suppose there is a gambler who has \(n\) dollars and gambles infinitely many times. The payoff of the \(t^{th}\) round is \(X_t\), which is an integer bounded above and below by some fixed constants. We know that \(E[X_t|X_{t-1},\cdots,X_1]\geq c\) where \(c > 0\). Suppose the probability of the gambler going bankrupt is \(p(n)\). Do we have \(\lim_{n\rightarrow \infty}p(n)=0\)?
Solution sketch
Use Azuma’s inequality by constructing a sub-martingale equal to assets at time \(t\) minus \(ct\).
Problem 2 (folklore): Suppose there is an infinite grid, with a non-negative real number in each cell. Suppose that for every cell, the number in it is the mean of the numbers in its 4 neighbouring cells. Show that all the numbers must be equal.
Solution sketch
Construct a martingale, where the underlying process is the random walk with the random variable being the number on the current square. Now use the martingale convergence theorem, and you might need to use the Hewitt-Savage zero-one law depending on how you use the convergence.
Problem 3 (USA IMO 2018 Winter TST 1, P2): Find all functions \(f\colon \mathbb{Z}^2 \to [0, 1]\) such that for any integers \(x\) and \(y\), \[f(x, y) = \frac{f(x - 1, y) + f(x, y - 1)}{2}\]
Solution sketch
Try constructing a similar martingale as in the previous problem.
Problem 4 (2022 Putnam A4): Suppose that \(X_1, X_2, \ldots\) are real numbers between 0 and 1 that are chosen independently and uniformly at random. Let \(S=\sum_{i=1}^kX_i/2^i,\) where \(k\) is the least positive integer such that \(X_k<X_{k+1},\) or \(k=\infty\) if there is no such integer. Find the expected value of \(S\).
Solution sketch (due to ABCDE)
Try to make the partial sum \(S_n\) into a martingale. The easiest way is to subtract out its expected value, and then to check that it really ends up becoming a martingale. Now define the relevant stopping time naturally using the algorithm defined in the problem, and apply Doob’s optional stopping theorem.
Problem 5 (AGMC 2021 P1): In a dance party, initially there are \(20\) girls and \(22\) boys in the pool and infinitely many more girls and boys waiting outside. In each round, a participant from the pool is picked uniformly at random; if a girl is picked, then she invites a boy from the pool to dance, and both of them leave the party after the dance; while if a boy is picked, then he invites a girl and a boy from the waiting line, and the three of them dance together. All three of them stay after the dance. The party is over when there are only (two) boys left in the pool.
What is the probability that the party never ends?
Now the organizer of this party decides to reverse the rule, namely that if a girl is picked, then she invites a boy and a girl from the waiting line to dance and the three stay after the dance; while if a boy is picked, he invites a girl from the pool to dance and both leave after the dance. Still the party is over when there are only (two) boys left in the pool. What is the expected number of rounds until the party ends?
Solution sketch
The first part is easy using a recurrence. For the second part, note that the recurrence changes a bit. Let \(X_n\) be the total number of people in the pool. Is there a \(c\) such that \(X_n^2 + cn\) is a martingale? Now apply Doob’s optional stopping theorem and see what happens.
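To sanity-check whatever closed form optional stopping gives you, here’s a direct Monte Carlo sketch of the second part’s dynamics (the function name is mine):

```python
import random

def rounds_until_over(girls=20, boys=22):
    """Reversed rules: a picked girl brings in a boy and a girl (+2 people);
    a picked boy leaves with a girl from the pool (-2 people). The party ends
    when no girls remain (i.e., only the two surplus boys are left)."""
    t = 0
    while girls > 0:
        if random.random() < girls / (girls + boys):
            girls, boys = girls + 1, boys + 1
        else:
            girls, boys = girls - 1, boys - 1
        t += 1
    return t

random.seed(5)
sims = [rounds_until_over() for _ in range(20_000)]
print(sum(sims) / len(sims))  # compare against your optional stopping answer
```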
Problem 6 (folklore): Let a monkey type randomly on a typewriter with alphabet \(\Sigma\). For a given string \(w\in\Sigma^*\), let \(w_1,\dots,w_n\) be the nonempty strings that are simultaneously a prefix and suffix of \(w\). Then show that the expected number of keystrokes the monkey needs to make to type \(w\) is \(\sum_{i=1}^n|\Sigma|^{|w_i|}\).
Solution sketch
There are two ways to construct a martingale here. Look at the suffix of size \(\min(\text{number of steps}, |w|)\). Let’s construct a random variable as follows: for all current suffixes \(s_i\) that are also a prefix of \(w\), the random variable equals \(\sum_{s_i} |\Sigma|^{|s_i|}\). Look at what happens when we add a new character; is this a martingale? Can we do something to make it one? Now use Doob’s optional stopping theorem.
A more “natural” way of motivating this construction is through a gambling perspective. Let’s say at every time \(t\), a new gambler comes and pays \(1\) to play the game. He reinvests his bet (\(|\Sigma|\) times the previous bet) if the next character is part of a prefix that started back when he entered the game. If he loses, he exits the game and doesn’t do anything. Now this is a martingale, and the total money all gamblers paid to the casino is the expected value of the stopping time.
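Here’s a quick Monte Carlo check of the formula on a toy instance (the word and alphabet are my choices): for \(w = \texttt{abab}\) over \(\Sigma = \{\texttt{a}, \texttt{b}\}\), the nonempty prefix-suffixes are \(\texttt{ab}\) and \(\texttt{abab}\), so the claimed answer is \(2^2 + 2^4 = 20\).

```python
import random

def expected_keystrokes(w, sigma):
    """The claimed formula: sum of |sigma|^|u| over nonempty strings u that are
    simultaneously a prefix and a suffix of w (including w itself)."""
    return sum(len(sigma) ** k for k in range(1, len(w) + 1) if w[:k] == w[-k:])

def keystrokes_until(w, sigma):
    """Type uniformly random characters until w appears as a suffix."""
    typed = ''
    while not typed.endswith(w):
        typed += random.choice(sigma)
    return len(typed)

random.seed(6)
w, sigma = 'abab', 'ab'
print(expected_keystrokes(w, sigma))  # 20
print(sum(keystrokes_until(w, sigma) for _ in range(50_000)) / 50_000)  # ~20
```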
A different approach for the above problem is discussed here.
Problem 7 (Alon, Spencer): Let \(G=(V,E)\) be a graph with chromatic number \(\chi(G)=1000\). Let \(U \subseteq V\) be a random subset of \(V\) chosen uniformly from among all \(2^{|V|}\) subsets of \(V\). Let \(H=G[U]\) be the induced subgraph of \(G\) on \(U\). Prove that the probability that \(H\) has a chromatic number of \(\le 400\) is less than \(0.01\).
Solution sketch
Let \(J\) be the induced subgraph on \(V \setminus U\). Then clearly \(\chi(H) + \chi(J) \ge 1000\), else we can get a better coloring of \(G\) than the optimal one. Note that choosing \(U\) and \(V \setminus U\) is equally likely, so the expected value of \(\chi(H)\) is at least \(500\). Now we try to use Azuma’s inequality here. But how do we induce a martingale here?
Recall that if we start with a random variable on a probability space \((\Omega, \mathcal{F}, P)\) and have a filtration \(\mathcal{F}_i\) with everything being a subset of \(\mathcal{F}\), \(E[X | \mathcal{F}_i]\) is a martingale. Let’s take advantage of the fact that we are given the chromatic number of the graph yet again. Consider a coloring of the graph, and partition it into sets \(P_i\) demarcated by the color of their vertices. Now consider the filtration where we reveal \(U \cap P_i\) one by one (these sets are independent). Note that the change in the chromatic number can be at most \(1\). So by Azuma’s inequality, we get \(P(\chi(H) \le 400) < P(\chi(H) - e \le -100) \le \exp(-100^2/(2 \cdot 1000))\), where \(e\) is the expected chromatic number of \(H\).
Some competitive programming problems
Almost all the problems that I know about from a competitive programming perspective are discussed in this Chinese post and this editorial to 1479E that explains the main idea in a slightly different manner. I won’t repeat their discussion (the equations should be understandable if you understand the main approach in the math problems above), but I will just list out the problems that they link, as well as some problems that appeared later (in reverse chronological order).