As one can easily guess, this blog is about a mental model of thought. I chose to write about it since I feel that introspection about ways of thinking (and consequently about what “thinking before doing something” means) is greatly lacking among people, and that it is critical to making better decisions (and to realizing when there is no “better” decision).

I won’t bore you with the philosophical details, so I’ll approach it from a probabilistic perspective, which is closer to how I personally choose to think. A word of caution: I will sometimes oversimplify things in order to drive home the point, which might seem to contradict what I say in later parts of this post. The key here is context, and if you keep track of it, things will make more sense. To understand the probability concepts that I mention here in a bit more detail, I recommend reading my post on probability. If you don’t understand (or don’t want to go through) the math examples here, don’t worry - I’ll intersperse them with a general idea of what we are trying to do, so looking for those explanations should help. Whenever you encounter something you don’t already know, you should probably just make a note of it and go ahead (and read more about it later). If it is still not clear, you can let me know and I’ll try to clarify that part (for you and the other readers of this post). Even just looking the thing up on your favorite search engine/LLM will likely teach you a lot, if in a less pointed manner.

In online discourse, one finds many conflicts over beliefs, with people vehemently defending positions that may or may not have anything to do with objective truths (on the other hand, there are a few people who don’t believe that there are objective truths at all, but we shall do away with such cynicism for now, in favor of a more pragmatic view of thought).

From a probabilistic perspective, this means that the probabilities people assign to certain events don’t agree, and the opinions that come out of some sort of “representative”-finding algorithm (like taking the most probable outcome, or the one that has the largest impact) differ owing to this difference. In fact, I have good reason to believe that for many people, if not most, these probabilities are either 0 or 1 when they think about them consciously, and uncertainty plays no role - the continuum in between is just haphazardly internalized and updated every time they gain some knowledge from the world, whether it is an opinion or some other observable aspect of something that happened. This might be favorable for us from an “evolutionary” perspective - making some decision is better than not making a decision at all when inaction can wipe you out - but since there is a correlation between being consistently successful and making better decisions (correcting for luck, since for superlative amounts of success, it is often the situation that matters more), trying to make better decisions might just be the missing piece for most people.

In my mental model, there are multiple levels of uncertainty. I like to categorize them in the following ways:

  • Uncertainty in outcomes - we must realize that outcomes have probabilities.
  • Uncertainty in relative estimates of outcomes - we must realize that our probability and impact estimates of things that we can account for are uncertain.
  • Uncertainty in universe coverage - we must realize that the craziest things are always yet to come, so we inevitably suffer from a lack of knowledge about possible events.
  • Uncertainty in knowledge modeling - we must realize that no model of dealing with uncertainties (or causality) is ever going to be infallible.

For the sake of making things clear, I’ll elaborate on each of these points.

Uncertainty in outcomes

This is the place where most people trip up. Improper usage of words and phrases like “impossible”, “makes no sense”, “absurd”, “illogical”, “nonsensical” is often a symptom of not understanding uncertainty in outcomes.

In short, when you use these words, you are denying that a certain outcome is even possible, rather than acknowledging that it has a small but positive probability based on all the information you have so far. And ignoring such outcomes is never a good idea. If you’ve ever multiplied a very small number by a very large number, you will realize that the impact of a very low probability outcome can overpower the impact of any other outcome in expected value. For example, consider the event “USA starts suffering from hyperinflation within the next decade”. What would you do with your savings in such a case? Would you even care? At first glance, it has a very low probability if you ask any person who has not been living in a cave for the last 50 years. But the outcome would have a huge impact, and the contribution to the expected value of the US economy might not be as small as people think it is. So no smart decision maker will fail to acknowledge this possibility (precisely accounting for it is a different matter altogether).
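
To make the “very small number times very large number” point concrete, here is a minimal Python sketch with entirely made-up probabilities and impacts (they are illustrative assumptions, not estimates I am endorsing):

```python
# Toy expected-value calculation with made-up numbers (purely illustrative).
# Even a very unlikely outcome can contribute meaningfully to the expected
# value when its impact is orders of magnitude larger than everything else.

scenarios = {
    # outcome: (probability, fractional impact on savings)
    "business as usual": (0.975, -0.02),   # mild inflation erodes ~2%
    "deep recession":    (0.020, -0.30),   # a 30% hit
    "hyperinflation":    (0.005, -0.95),   # savings nearly wiped out
}

for name, (p, impact) in scenarios.items():
    print(f"{name:20s} contribution to expected change: {p * impact:+.5f}")

total = sum(p * impact for p, impact in scenarios.values())
print(f"{'total':20s} expected change: {total:+.5f}")
# Note how the hyperinflation term, despite its 0.5% probability, is of the
# same order as the far likelier recession term.
```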

Humans are complex, but still prone to these fallacies. Most people will get disgruntled with you if you tell them that there is a possibility that their parents are not who they think they are (and the wise choice is to not bring up such topics at all). The same will happen with some people if you tell them that it is possible that their friends have cheated on an exam, or that their country is on the wrong side of a sociopolitical conflict for some definition of wrong. On a forum where cartels or circlejerking are prevalent, you have a much higher chance of getting heavily downvoted or banned for saying anything that goes against the popular opinion, or for suggesting a possibility that goes against their interests. After correcting for social reasons (herd mentality and so on), this arises out of not being able, or not wanting, to assign probabilities to things you dislike, and simply trying to eliminate them mentally. This is how amateurs end up blowing up in financial markets too, because they don’t hedge against the possibility of being wrong.

A bit of a tangent - math goes further and distinguishes between probabilistic impossibility and mathematical impossibility. Probabilistic impossibility is when an event inherently has probability zero even though it is not ruled out (for instance, what is the probability of drawing exactly 0.5 when you sample uniformly from the interval \([0, 1]\)?). Mathematical impossibility is something that your chosen axioms don’t allow you to deduce without a contradiction. For example, 0 is not the same as 1 when equality is in the sense of integers. Practically speaking, mathematical impossibility and probabilistic impossibility coincide if we assume that everything is discrete and infinities are not involved, since you can just enumerate all cases.

Coming back to the original point, it is also important to note that “outcomes” as we talk about them here are not tied to the future at all. The idea is about what you can distinguish based on all the knowledge you have so far - so it has to do with incomplete knowledge, not with whether something lies in the past, present, or future. For example, consider a time before which there was no concept of fire. A person analyzing everything that has ever happened in the world would bunch together outcomes based on a tree diagram of some sort, and there would be no branching based on whether someone was burnt by fire or not. Nevertheless, using this, conditional on the probabilities of things remaining the same over time (as well as nothing new happening in the meantime that could change the person’s set of assumptions - i.e., no new knowledge), the person would be able to get an estimate of the probability distribution of what will happen given the state of the world, with some accuracy.

In short, we are interested in dealing with incomplete knowledge through the concept of uncertainty (“probability”), and model how we change our beliefs with new and more detailed knowledge.

Probability theory deals with it by thinking of the knowledge available at any instant in time in terms of sigma algebras and conditional probability. A sigma algebra is essentially a collection of sets of “sample points” (each set is called an event, to denote the fact that an event lets you isolate some group of those sample points) and their probabilities that satisfy the basic probability axioms (for the sake of intuition, I am not going to address the fact that this is technically not correct for some infinite sets of events). “Sample points” are the most granular outcomes possible, and they can be bunched into the smallest sets that can be distinguished from one another via certain kinds of observations (the events). Over time, as our knowledge grows, we are able to break these sets into even smaller sets, as we are able to distinguish more and more things due to events branching out. (For the mathematically inclined reader: a sigma algebra on a finite sample space is sort of the power set of these most granular sets, as my post on probability explains. However, for the purposes of this post, we shall just be looking at these most granular sets since it encodes all the information and is easier to think about).

Let’s take the example of a sequence of tosses of a fair coin (we have a closer-to-life example below, but understanding this one is important). Let’s say you have run an experiment a million times, and in each run you tossed a coin 100 times. You want to assign ids to all these runs based on the final sequence in a way that indistinguishable runs have the same id, so you start looking at the outcomes letter by letter. Let’s say you are two tosses in (i.e., you know the results of the first two tosses for each run of the experiment). Any two eventual sequences are indistinguishable if they have the same results on the first two tosses. So a person who has seen the first two tosses of all runs effectively has only 4 distinguishable sets of outcomes (this is the most granular information we can have - does the sequence start with HH, HT, TH or TT?), and we can assign at most 4 distinct ids to the outcomes. As we look at the results of the next toss for all runs, we are able to distinguish, within each previously granular set, between sequences based on the third toss, so there can be at most 8 distinct ids assignable to the outcomes. In summary, as you go on, you gather more and more knowledge about the specific outcome you got.
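
If it helps to see this as a computation, here is a small Python sketch of the same idea (the run count is scaled down from the million used above purely to keep it light):

```python
import random
from collections import defaultdict

# A small sketch of the coin-toss example. After observing only the first k
# tosses of every run, two runs are indistinguishable exactly when they share
# the same length-k prefix, so there are at most 2**k distinguishable groups.

random.seed(0)
n_runs, n_tosses = 10_000, 100
runs = ["".join(random.choice("HT") for _ in range(n_tosses)) for _ in range(n_runs)]

for k in range(6):
    groups = defaultdict(list)            # prefix observed so far -> run ids
    for run_id, run in enumerate(runs):
        groups[run[:k]].append(run_id)
    print(f"after {k} tosses observed: {len(groups):2d} distinguishable groups (max {2 ** k})")
```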

In the diagram below, after the first toss, the events H and T are distinguishable. After the second toss, HH, HT, TH, TT are distinguishable. Nodes in this tree have all the 100-length sequences of heads and tails with the prefix matching the label of the node. The labels of the edges are the probabilities that, given that we are on the left node of the edge, we get to the right node of the edge.

To think of it in terms of the sigma algebra, let’s look at the successive Venn diagrams below, where each color corresponds to a most-granular set (and we have truncated the experiment to three tosses instead of a hundred):

This Venn diagram illustrates how our knowledge grows over time, breaking down larger sets of possible outcomes into smaller, more specific events.

For a real life example, let’s say you’re investigating a criminal case. You have more and more information flowing in as you proceed with the investigation. In the very beginning, all you know is that the crime happened, and a set of suspects. As the investigation proceeds, you learn more about the culprit (from eyewitnesses, forensic reports and so on), and that updates your beliefs, making the set of suspects smaller. The sigma algebra framework is a bit more involved, since it also allows for all possible universes of outcomes - what if the first eyewitness said that the culprit was wearing a blue shirt instead of a red one? What if the blood type was B and not A? In general, since no part of the investigation is a hundred percent correct (for example, eyewitnesses can suffer from trauma that affects their memories), we are not really shrinking the set of possible culprits itself; rather, as we get more information, we are able to distinguish between the probabilities of certain sets of suspects containing the culprit - each time we get new information, there is a potential split in the most granular sets of suspects we can distinguish. This is done via conditional probabilities, which really just amounts to expanding the original sigma algebra by adding the event of an eyewitness identifying certain people, and zooming in to the refinement of the sigma algebra that this addition entails.

Okay, so that was a lot of words. For the mathematically inclined, I recommend reading my post on probability if you don’t know what a sigma algebra is, though you can probably still get by with the intuition we developed above. I’ll present this in a much more systematic manner below:

Let’s say that there are a total of 4 suspects \(\{A, B, C, D\}\) and one of them is surely the culprit, with \(A, B\) wearing a red shirt and \(C, D\) wearing a blue shirt. Additionally, we have \(A, C\) having blood type A+ and \(B, D\) having blood type B+. There is one eyewitness and a blood type report:

  • \(W_1\): has an estimated reliability of \(70\%\) and says that the culprit was wearing a red shirt.
  • \(W_2\) (the report): has an estimated reliability of \(95\%\) and says that the culprit had blood type A+.
  • \(W_1\) and \(W_2\) are “independent”.

Initially, before any eyewitness or report came up, our sample space was \(\{A, B, C, D\}\), and the distinguishable subsets (the sigma algebra) were \(\{\}\) and \(\{A, B, C, D\}\). Here, for example, \(\{A, B\}\) represents the event that the culprit is either \(A\) or \(B\). So the probability associated with \(\{\}\) is \(0\) and that with \(\{A, B, C, D\}\) is \(1\).

Let’s add the testimony of the eyewitness. To do so, we add the element \(E_1\) to the sigma algebra, which corresponds to the eyewitness being correct. Then we figure out all the possible distinguishable subsets. Since the testimony of the first witness can’t distinguish between \(A, B\) or \(C, D\), our sets can only be of this form:

  • \(\{\}\) - no one is the culprit. The probability of this is 0.
  • \(\{A, B, C, D\}\) - the culprit is one of \(A, B, C, D\). The probability of this is 1.
  • \(E_1\) - the eyewitness is right. That is, this subset is actually \(\{A, B\}\). The probability of this is \(0.7\).
  • \(E_1^c\) - the eyewitness is wrong. That is, this subset is actually \(\{C, D\}\). The probability of this is \(0.3\).

In other words, our most granular sets are \(\{A, B\}\) and \(\{C, D\}\).

Note that we still don’t know the probability of, say, \(A\) being the culprit. The best we can do at this point is to apply an arbitrary prior on \(\{A, B\}\) to decide who is the culprit, which is not how one should approach an investigation anyway. This is where distinguishability comes into the picture - in the generating process of the crime scene, we have no basis for saying whether \(A\) or \(B\) was more likely to be the culprit (or whether they were equally likely, for that matter). \(A\) might have had more of a tendency to commit the crime at hand across multiple runs of the universe, and \(B\) might just be a one year old kid who couldn’t have done any of it. The moment you assign a prior (like “\(A\) and \(B\) are equally likely to be the culprit”), you are claiming that the sigma algebra of the generating process looks a certain way (in terms of both the sets and the probabilities of those sets), by adding an event of the form “I got a report that says that one year olds are 90% less likely to commit a crime than an adult”. This is exactly what we did when we accounted for both the eyewitness and the report; it was an indirect prior.

Now let’s try adding the information from the report. We can now distinguish between \(\{A, C\}\) and \(\{B, D\}\) as well. A consequence of the sigma algebra axioms is that the intersection of two sets in the sigma algebra must be in the sigma algebra too. So \(\{A\}\) should be a distinguishable set (i.e., an event in the sigma algebra) too. By doing this for each pair of sets, we realize that our most granular sets are \(\{A\}\), \(\{B\}\), \(\{C\}\) and \(\{D\}\).

Since the report and the eyewitness are independent, we have \(P(\{A\}) = P(E_1 \cap E_2) = P(E_1) \cdot P(E_2) = 0.7 \cdot 0.95 = 0.665\), where \(E_2\) is the event that the report is right. The rest of the probabilities can be found analogously.

What we did here was track down the probabilities by refining the sigma algebra (which in the end was the power set of \(\{A, B, C, D\}\)) via evidence as we got more of it.
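
For completeness, here is a short Python script that carries out the same refinement mechanically; it assumes, as we did above, that the two pieces of evidence are independent:

```python
# A mechanical version of the refinement above. Each piece of evidence splits
# every currently-distinguishable set into an "evidence right" part and an
# "evidence wrong" part, and independence lets us multiply the reliabilities.

suspects = frozenset({"A", "B", "C", "D"})

# evidence: (set of suspects it points to if correct, estimated reliability)
w1 = (frozenset({"A", "B"}), 0.70)   # eyewitness: culprit wore a red shirt
w2 = (frozenset({"A", "C"}), 0.95)   # report: culprit has blood type A+

def refine(evidence):
    """Return the most granular sets and their probabilities after all evidence."""
    granular = {suspects: 1.0}
    for pointed_to, reliability in evidence:
        refined = {}
        for cell, p in granular.items():
            inside, outside = cell & pointed_to, cell - pointed_to
            if inside:
                refined[inside] = refined.get(inside, 0.0) + p * reliability
            if outside:
                refined[outside] = refined.get(outside, 0.0) + p * (1 - reliability)
        granular = refined
    return granular

for cell, p in sorted(refine([w1, w2]).items(), key=lambda kv: -kv[1]):
    print(sorted(cell), round(p, 4))
# Output: ['A'] 0.665, ['C'] 0.285, ['B'] 0.035, ['D'] 0.015
```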

An aside: note that handling contradictory witnesses is a more nuanced topic, but you must either make a choice or work out all possible sigma algebras and give them weights based on some priors (the easiest one is reliability, but it might be possible that reliabilities are not comparable across subsets of evidence). Contradictory witnesses essentially challenge the estimates of reliability that are being used, forcing us to update our basic beliefs as well.

Similarly, we assumed here that the witnesses (events) are going to be independent. In the general case, we require their joint distribution to be able to update more granular probabilities via Bayes’ theorem.

This technique is not restricted to this example. We can model a ton of stuff using these ideas or their generalizations, provided that we have reliable estimates for everything we need. We can also sometimes bound (probabilistically) some part of the effect our actions would have, given our beliefs about (or bounds on) the probabilities involved.

Uncertainty in relative estimates of outcomes

Merely acknowledging that there are probabilities associated with events only gets you halfway from amateur decision making to decision making at the highest level.

However, probabilistic thinkers often become overconfident in their estimates by ignoring certain uncertainties in their estimation methods themselves. One of those is the very fundamental uncertainty in the relative outcomes of two events.

When I say outcome here, I’m referring to both some metric of impact of an event, conditional on it happening, as well as the probability of the event happening.

Uncertainty in impact

Let’s consider the uncertainty in impact of an event, under the simplifying assumption that the probability of something happening is perfectly known. There are multiple ways people can disagree on this:

  • Horizon: “Shortsighted” and “farsighted” are two words that can be found in some form in almost every language - they exist precisely because there is a real distinction between them. People are fundamentally biased towards certain horizons, based on what they have grown accustomed to caring about. The impact of an event tomorrow is not the same as the impact of one happening a couple of years down the line. It would matter to you if your bank balance went to zero tomorrow, but you likely won’t care about the existence of humanity a million years from now. If you’re 70 years old, you will not be that concerned about the IT industry 10 years down the line, but an 18 year old choosing a CS degree would want to take into account whether the IT industry will still be a secure choice or not. A business owner would want to ensure that their investment into the business breaks even, so their horizon would be longer than that of an employee who has much less at stake. So if a person wants to make quick money as an employee, their best bet would be to chase hype, but someone who wants to make a safer long term bet on their business will want to correct for hype cycles, by choosing a business that makes (or should make) money regardless of hype in the market.

    All this is pretty justified. I care about one horizon, and you care about another. But uncertainty enters the picture when you think about when something will happen, which naturally makes the choice of horizon important for estimation.

  • Risk aversion: We can model some part of human behavior by assuming that people have different “utility” functions (for those in the know, I don’t like utility functions either, but please bear with me), and as a consequence, they demand premiums for taking on risk. Consider a person who does not like risk but is faced with the decision to buy a stock which has a chance of going up or down a hundred dollars with equal probability. For him, the happiness from gaining a hundred dollars will not offset the sadness from losing a hundred dollars, so he will refuse the bet. He would rather demand a risk premium - i.e., based on his perceived happiness from gaining or losing a hundred dollars, he will find the expected change in his happiness from holding the stock (which will be negative), and would like to buy the stock at a better price that offsets this value (a small sketch of this calculation appears right after this list). Similarly, anyone faced with any risk will have different utilities for different events. Someone living on an isolated private island with enough means of survival will not be as concerned with (and will assign lower magnitudes of negative impact to) the downfall of the global economy as the director of a bank, but will be more concerned about the possibility of tsunamis than the latter. Just to clarify - risk aversion is not bad. It arises out of taking your future into account (for example, the probability of bankruptcy, opportunity cost and so on). As a side-note, the effects we model using utilities can often also be modelled by fudging the probability estimates, and this method has its own pros and cons.

    This is not just about disagreements either. If you are trying to model people’s behavior, you need to be aware of these preferences. And people are not perfectly rational (for some mathematical definition of rational), so this brings uncertainty into your estimates (for example, chronic gamblers are risk-loving instead of risk-averse, but they might be rational in their own eyes while being irrational in terms of some math definition).
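
Here is the promised sketch of the risk-aversion calculation. The concave utility function (square root of wealth) and all the numbers are my own illustrative assumptions - the only point is that a “fair” bet can have negative expected utility:

```python
import math

# Illustrative only: a concave utility (square root of wealth) is one common
# way to model risk aversion; the function and the numbers are assumptions.

def utility(wealth):
    return math.sqrt(wealth)

wealth = 10_000
stake = 100   # the stock moves up or down by $100 with equal probability

expected_dollar_change = 0.5 * stake + 0.5 * (-stake)   # = 0, a "fair" bet
expected_utility_if_bought = 0.5 * utility(wealth + stake) + 0.5 * utility(wealth - stake)

print("expected dollar change      :", expected_dollar_change)
print("utility of standing pat     :", utility(wealth))
print("expected utility if bought  :", round(expected_utility_if_bought, 5))
# The expected utility of taking the fair bet is slightly below the utility of
# doing nothing, so this agent refuses the bet unless the price is improved
# enough (a risk premium) to close the gap.
```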

Note that we have added uncertainties in computing the relative impact of things from a people perspective. It is not just limited to humans - we often have large and complex systems which sometimes cut corners for “better performance” on average (in more sophisticated terms, we call this a tradeoff), and this varies from being completely fine to completely unjustified. For instance, the chess engine Stockfish has some memory unsafe code. No comments from my side on whether this decision was a net positive for whatever the project cares about (or on whether the discussion was even a good use of time). As we can see from the discussions on this link, the impact is what is being disagreed upon, given that we know that the code is unsafe. We don’t know if someone is running Stockfish on some mission-critical hardware just to pass time studying chess, or even whether the memory unsafe code is actually exploitable or not. Clearly there are prior beliefs that come into the picture here, which is an inherent source of uncertainty.

As an aside, the statements in this thread (and in other discussions online) along the lines of “prove to us that this is possible” are not about possibility, they are about certainty; often when people say this, they really just want proof that this will happen, by asking you to show that the probability of it happening is 1. By then, the damage is already done. The better choice from a probabilistic perspective is to identify when a certain adversarial thing might be possible (because that makes the utility drop suddenly due to a bump in the probability estimate) and arrange for things to happen in a way that protects against it, instead of waiting for the probability to go to 1 and realizing a larger hit. Instead of asking someone to prove that something is possible (where you’re in fact asking them to ascertain that there exists a way in which the loss can definitely be realized by someone), it is better to ask yourself - “how do I prove that this is impossible?” You need only one observation to disprove your claim of impossibility, and this highlights a fundamental asymmetry in knowledge (this is also why some smart people sometimes hide behind erroneous and vague probabilistic arguments - to avoid being seen as wrong in retrospect, since having been visibly right most of the time in the past is for some reason given too much weight in society). When someone asks for proof when presented with a possibility, it shows that they don’t understand that statements of the form \(p = 0\), when negated, don’t result in the statement \(p = 1\), but in the statement \(p \neq 0\) instead - another symptom of thinking that probabilities are just 0 or 1. In a way, thinking about probabilities this way allows you to act on every piece of information that you get - because incomplete knowledge is sometimes useful.

Uncertainty in probabilities

Let’s now consider the uncertainty in probabilities assigned by people.

We start by noting that “generator” probabilities (the probabilities with which the generating process produces a particular sample path for us) and “empirical” probabilities (the ones we estimate by applying statistical methods to the results of experiments) differ. For the sake of convenience, let’s assume that these are the same - this assumption is often wrong, for reasons we will see in the next section.

The first point about horizon still holds.

Another reason is the way people have built their belief systems over time. We have a tendency to mode-collapse onto whatever we have observed. We never observe samples from the joint distribution - only samples from the part of the distribution conditioned on the general state of the world immediately around us, both temporally and spatially. It is generally true that odds depend on the conditions you impose. For example, the odds of someone making at least a million dollars a year versus them making at least ten million dollars a year depend on what job that person does - conditioned on them being a tech employee or on them being the CEO of a Fortune 500 company, these odds are very different.

So to cover all cases, we need a model that tells us, given any set of observed things, the ratio of the conditional probabilities given those observations. This amounts to knowing the ratios of the joint probabilities over whatever set the queries will come from.

What worsens this is that we live on a single sample path, and finding properties of the world that are invariant under certain conditions requires that we have a ton of samples under the same conditions, with biases cleaned up and all. For finding things that are generally true now (and for the immediate future, perhaps), we need a lot of experience from all the people living right now. That is still not sufficient, since there are some things that are conditionally true for the whole world in different regimes - to truly get to the most complete joint distribution of things by eliminating this conditional dependence on our “world-line”, we would need multiple instances of our world, all of which preserve the time-invariant laws but have different sample paths (with time, in some sense, as a monovariant for everyone - lest nitpickers stop at this point and complain that general relativity has worked out really well, so measured time is never an invariant).

Finding things that are generally true requires us to have all history and the future (which is clearly not possible) for all possible sample paths (which is again not possible).

Regarding the stationarity of distributions (will my estimates based on the past hold for the future too?) - everyone has a different opinion on it. It is a popular idea among the younger generations at any point in history to think that the wisdom of the older generations is outdated - “the world has changed”. Meanwhile, people say history repeats itself. So what is the truth? Even if you do Bayesian updates of your beliefs (as we did a couple of sections ago), what is the learning rate (a high learning rate means you are giving a lot less weight to history than you would with a low learning rate)? Do you throw away every old belief that goes against your new observations, or do you give the new observations only a very small weight, preferring historical data? No one seems to have a good answer to this. Even estimating the degree of stationarity (in order to be able to make decisions that hold in a certain regime) is difficult, because how would you find the probability (let alone the persistence of that probability) of tail events (events much closer to the ends of the distribution) with an insufficient number of sample paths (critical decision making is not Gaussian, unfortunately, and the number of samples we require for any decent estimates is exponentially higher for the kinds of distributions we see in real life)? In any case, you would need to be able to handle possibly debilitating regime shifts, because you’re taking on the risk of assuming that things will remain the same for a long enough time (and not worrying about the time till the next regime). And if you add a binary variable to the set of your decision variables, the size of your joint distribution doubles right away - which means you need even more sample paths in the same time period.
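
To make the “learning rate” question concrete, here is a toy Python sketch; the update rule, the rates, and the data are all made-up assumptions, not a recommendation:

```python
# Toy illustration of the "learning rate" question: estimating the probability
# of an event from a stream of 0/1 observations, with and without forgetting.
# The data and the rates are made up; a regime shift happens near the end.

def equal_weight_estimate(observations):
    return sum(observations) / len(observations)

def forgetful_estimate(observations, learning_rate):
    estimate = 0.5                                   # uninformative starting guess
    for x in observations:
        estimate += learning_rate * (x - estimate)   # exponential moving average
    return estimate

history = [0] * 90 + [1] * 10    # rare for a long time, then suddenly common

print("all history, equal weights  :", equal_weight_estimate(history))
for lr in (0.01, 0.1, 0.5):
    print(f"exponential forgetting {lr:<4}:", round(forgetful_estimate(history, lr), 3))
# A high learning rate tracks the new regime quickly but is noisy; a low one is
# stable but slow to notice that the world changed. Neither is "right" in general.
```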

To make updates a bit faster with the same “reliability”, people impose priors like Occam’s razor in the form “find the simplest thing that models reasonably well”. Sometimes it turns out to work, and sometimes it does not, owing to the arbitrariness of such priors.

All of this tells us that there is some uncertainty principle involved (think: something like Heisenberg’s uncertainty principle). A form of this uncertainty is acknowledged by the exploration-exploitation tradeoff (do I explore to get a better and more robust model? or do I use what I have at hand and take on risk which would potentially give decent rewards based on what I have seen?). If you don’t have the generator of the random walk, you can’t make reliable estimates of anything about that random walk. In that sense, only the generator is all-knowing, and we can only sample it, not measure it.

In short, there are a ton of ways in which people can disagree on relative estimates of outcomes, and we must always agree that our estimates will be uncertain, and sometimes, plain wrong.

In fact, how do you even estimate the uncertainty of your probability estimates? As you go all the way down, you’ll realize that all these estimates are doing is giving you a false sense of security, and they can be very wrong. A higher order estimate being wrong implies that all the lower order estimates are deeply problematic, in ways that amplify the uselessness of each previous order of estimate, inevitably leading to disastrous consequences at some point. In a way, we are still led to making overconfident estimates, and it is a statistical observation in every field that I know of that the most disastrous consequences come from overconfident estimates far more often than from underconfident ones.

Uncertainty in universe coverage

Even if we have a model that is good enough at finding relative probabilities - which is already limited by our uncertainty principle about finding relative probabilities from historical data in a transient environment, and by psychological effects like tunnel vision and overconfidence - are we even sure that we are accounting for everything that can ever happen?

There is a multitude of quotes by famous people throughout history that constantly remind us of our own limitations. Here are a few of them, for the sake of really driving home the point:

  • “There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy.” - Shakespeare
  • “Anything that can happen, will happen.” - Murphy’s law
  • “What we know is a drop, what we don’t know is an ocean.” - Isaac Newton
  • “Chaos is the law of nature; Order is the dream of man.” - Henry Adams

The concept of black swans (high impact, rare, unforeseen events) comes into the picture when you think about uncertainty in this manner. The most destabilizing events for the Bayesian updater would be the ones which are seldom seen (perhaps never seen before) but have disproportionate impacts.

In the context of markets, experienced traders are very well aware of this - there is natural selection at play when people see “free” risk premiums in the market and blow up in “unforeseen events”. The ones who do not blow up are the ones who can deal with such uncertainty - on a long enough sample path, if you don’t size your maximum risk appropriately, the probability of blowing up is nearly one. The fact that there are people who get away without doing this does not mean that they are geniuses - they were just riding their luck and happened to get out of the market before they got ruined.
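
A back-of-the-envelope way to see the “nearly one” claim (the blowup probability below is a made-up number; the geometric decay is the point):

```python
# If your position sizing means that a 1-in-200 event in any given month wipes
# you out, the chance of never having blown up after n months is 0.995**n.

p_blowup_per_month = 1 / 200

for years in (1, 5, 10, 20, 40):
    months = 12 * years
    survival = (1 - p_blowup_per_month) ** months
    print(f"{years:2d} years: probability of never having blown up = {survival:.2f}")
```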

In adversarial conditions, sometimes it is just the tails that matter. There are industries where the winner takes all. For instance, many growth investors initially invest in a pool of companies, and discard the losers and bet more on the winners. This is explicitly a bet against mean reversion, and a bet on the tails that one of these companies will be heavily successful.

Another example: life insurance is something that people still buy. If there were no unforeseen things, the concept of insurance would never exist.

If the arguments that things become more predictable as we gain more knowledge and advance technologically were correct, insurance premiums should have gone down drastically after correcting for inflation/healthcare costs/legal costs in a market where insurance premiums are dictated by demand and supply. Interestingly, this has not been the case. Insurance companies do risk estimation in a way that balances demand and supply with their expected shortfall/risk capacity/losses. However, most of the pricing is dictated by tail events - they would not want to go bankrupt by selling too much insurance to people at cheap prices. In fact, health insurance premiums increased abruptly after the COVID-19 pandemic, and the increase is expected to be persistent. So it might be the case that insurance is more likely to be underpriced than overpriced at any point in time, in terms of the expected value (and not necessarily the probability, which is deceiving in such scenarios), and that as we get more exposure to black swans, these possibilities start getting factored into the universe of events that people use for pricing tail risk.

So what is the optimal way to deal with such risk? Clearly, traditional probabilistic thinking that assigns utilities and some margin of error completely fails here. The point is we can never price this risk. So the other way is to find ways that give you “optionality” in case these events happen, which you can use as an instrument against ruin in those cases (some go as far as saying that you can’t make good decisions without understanding option theory - the financial kind).

This is a much deeper (and way harder) topic to think about than the previously mentioned uncertainties. And for good reason. Almost anything that looks impressive is a tail event. There are a lot of famous examples of tail events that have impacted the way modern life works, and I leave it to the reader to inspect major moments in history to find some.

The problem is that modeling these requires an absurdly high sample complexity in order to fit the confidence intervals that statisticians like so much. This difficulty arises from the heavy tails often found in the real life distributions associated with so many things that matter.

Consider the problem of finding the global mean assets per capita. Let’s say you have a sample size of a thousand people. Note that the top 0.1% of people have assets orders of magnitude higher than the average person, so the presence of even one, two or three of them will dominate the sample mean. So intuitively you need a much larger sample size than usual (in a Gaussian world, a comparably extreme observation would come from something like the top 0.001%, and it would not dominate the variance of the sample mean on average).

Taking this to decision-making, you require a much higher number of samples to do reliable statistical analysis. Given the fact that we live on a single sample path, this makes things exceedingly hard to estimate.
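
A quick illustration of how badly heavy-tailed sample means behave compared to Gaussian ones (the distributions and parameters are assumptions chosen purely for illustration):

```python
import random
import statistics

# Sample means over repeated samples of 1000 "people": a heavy-tailed
# Pareto-like wealth distribution versus a Gaussian with a similar mean.
# The instability of the first row, sample to sample, is the point.

random.seed(7)

def pareto_sample(n, alpha=1.2):
    # Tail index 1.2: the mean exists (about 6) but is dominated by rare huge draws.
    return [random.paretovariate(alpha) for _ in range(n)]

def gaussian_sample(n, mu=6.0, sigma=2.0):
    return [random.gauss(mu, sigma) for _ in range(n)]

pareto_means = [statistics.mean(pareto_sample(1000)) for _ in range(20)]
gaussian_means = [statistics.mean(gaussian_sample(1000)) for _ in range(20)]

print("heavy-tailed sample means:", [round(m, 1) for m in pareto_means])
print("Gaussian sample means    :", [round(m, 1) for m in gaussian_means])
```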

The moral of the story is that estimating the impact of extremes is really hard, and you might be either off by orders of magnitude, or worse, end up subtracting two very large and uncertain numbers (that dominate the rest of the numbers), rendering all your estimates invalid. So you fall back to theory and make decisions that allow you to unconditionally avoid these risks, at least in part.

Uncertainty in knowledge modeling

So far, we’ve explored how insights from mathematical probability can help us make more informed decisions. But is probability the right framework for modeling reality (Einstein’s claim that God doesn’t play dice might still hold weight after all)? Human thought, after all, is rooted in what we’ve experienced or imagined, and we naturally struggle with concepts that lie beyond our comprehension. Building on the reasoning in the previous section, we should ask a broader, perhaps unanswerable question: How robust are the ways of thinking we’ve developed over time? Have we overlooked something fundamental in our models? Or, more provocatively, is the very act of modeling the right approach at all?

This is venturing into philosophy/metaphysics territory, so I will just present a bunch of crazy-sounding questions to think about:

  • What if there are no invariant laws at all, beyond the trivial tautological invariants embedded in the laws and what we can deduce using our system of mathematical logic? In this case, it would be hard to justify the success of any sort of modeling using what we know about the past, as well as the priors that we have built over time.
  • Are there any absolute monovariants (things that always increase or always decrease) in nature? This affects causality too, since causality is about establishing a consistent ordering between events, and this requires a monovariant that is permitted by the fundamental laws of the universe. Whether it’s clock time, entropy, or some other quantity, the monovariant provides the comparability needed to define which events are causes and which are effects. Note that this question itself is phrased in a manner that assumes that there is some monovariant - for a monovariant to exist, it must increase in tandem with something, which must already be a monovariant. There are many ontological models of time that try to present answers to similar questions.
  • Does it even make sense to use some things as predictors of other things? A very simple example is correlation vs. causation - historically, many pairs of things that were thought of as being in a cause-effect relationship were in due course realized to be just highly correlated due to having a common cause (a V-shaped dependency instead of a linear one). The question this example raises is - is causation actually real, or is it just correlations all the way down? This is what probabilistic ontologies question as well. At a deeper level, is there some mathematical framework that the “generator” of reality (which does not have to depend on causality in the traditional sense) uses (in more conventional terms, can the universe be perfectly modeled by such a framework) that makes causation intelligible? In other words, can we always reconcile our observations with their underlying generator through some form of logic? And are models based on uncertainty from incomplete knowledge sufficient to explain the projections we observe in reality? While probabilistic ontologies are often seen as mathematical models themselves, this question has deeper consequences, like asking whether there exists an ontology that perfectly explains reality.

You will naturally be prone to asking yourself such questions when you are constantly doing research at the edge of science, aspects of which we can’t seem to ever explain (for example, cosmology, particle physics and so on). Back when I was a student and in touch with some of the people working at the frontiers of these fields, I was pleasantly surprised to find that there are many people at the highest levels of science who think about such questions in their free time (even though it amounts to mere speculation given our limitations). There are fundamental questions at the very foundation of science that we can’t answer yet, and the validity of science is taken on the strength of its track record in modeling what we have been able to model so far.

As with any other tool, it is important to remember that any mental model you use will be a tool to model something, and will only be an approximation to reality that you are comfortable with. We are inherently limited by the ideas we can ever produce. We already see logic failing to predict many things like human behavior, and people call it irrational behavior. Some food for thought - if humans are just complex systems, where does irrationality come from? And how do you decide whether the system is irrational?

Anyway, this section is just a reminder that no matter how much information you ever have (even of the future), there is at least a small possibility that things will not work out that way, no matter how much logic you apply to it. The lesson is to realize that we know practically nothing (echoing Socrates for once), and to stay humble.

Asymmetries and biases

As a consequence of “irrational” thinking, there are always some asymmetries and biases present in the minds of people. They can generally be explained in terms of non-probabilistic approaches and remedied to some extent by probabilistic approaches.

  • Availability heuristic - people often assign higher probabilities to things that they can recall easily. For example, fatal plane crashes are shown on the news more than fatal car crashes, so it’s easier to think that cars are safer than planes in terms of the fatality rate. Similarly, more recent events can make people forget the long term, leading to inaccurate estimates due to amnesiac approaches to dealing with past information. The remedy for this is to use information that is as complete and comprehensive as possible for making estimates.
  • Anchoring bias - on the other hand, people also tend to assign disproportionate weight to their initial few experiences/pieces of information. A common sales tactic exploits this by marking up the prices of items during festive seasons online (as compared to the fair price of the commodity) and then showing a limited time discount. Even if the final price is higher than the fair price, the already high tendency of people to go shopping during festive seasons and make successful purchases is amplified by the anchoring bias, and leads to higher sales. The remedy for this is to engage in sufficient information updates, getting to better estimates by incorporating more information probabilistically.
  • Gambler’s fallacy - how often have you heard someone say “this coin has come up heads 5 times in a row, it will definitely be tails now”? Funny as it sounds, people have this inherent bias whenever they see a time series. It is worth thinking about how the sequences “HHHHH” and “HTHHT” have exactly the same probability, and from a probabilistic perspective, neither is more special than the other (except maybe in terms of information-theoretic ideas like entropy measures or the Kolmogorov complexity of these strings). The remedy is to reassess your assumptions about the dependence/independence of the experiments in question - had the gambler here realized that the next toss is independent of all the tosses that already happened, he would probably be much better off.
  • Representativeness heuristic - we often confuse distance-to-representative-of-group with probability-of-being-in-group. To see how this is wrong, consider two sets of people - A and B. A is much smaller than B (twenty individuals versus a thousand), but has the striking feature that everyone in it wears glasses. People will generally map a glasses-wearing person to A, as opposed to B, because the person is closer to a representative member of A than to one of B. Meanwhile, if we go by global statistics and assume that at least 50% of people in B wear glasses, then the chance of the person being in A is no more than 20/520 (which is less than 4%) - a quick numeric check follows after this list. What happened here? The unconditional probability of being in B was much higher than that of being in A, and it turned out to dominate the final conditional probability. Meanwhile, the distance metric completely ignored this fact. This is not just encountered when we talk about groups of people. Replace people by outcomes, and we have a common fallacy. The remedy is to always remember that unconditional probabilities can dominate conditional probabilities just like they did here.
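
Here is the quick numeric check promised in the last bullet (the 50% glasses rate for B is the same assumption as in the example):

```python
# Bayes' rule check for the glasses example in the bullet above.
size_A, size_B = 20, 1000
p_glasses_given_A = 1.0    # everyone in A wears glasses
p_glasses_given_B = 0.5    # assumed rate for B, as in the example

glasses_in_A = size_A * p_glasses_given_A    # 20 expected glasses-wearers in A
glasses_in_B = size_B * p_glasses_given_B    # 500 expected glasses-wearers in B
p_A_given_glasses = glasses_in_A / (glasses_in_A + glasses_in_B)

print(f"P(person is in A | wears glasses) = {p_A_given_glasses:.3f}")   # about 0.038
```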

Beyond these basic biases, there are more subtle second-order biases - ones that arise from the interaction of our judgments with deeper assumptions about probability and risk.

  • Overconfidence bias - people tend to consistently overestimate the expected value of their bets, either by overestimating win probabilities or by underestimating the downside (including ruin), or both. Gamblers are a clear example of this. In simple gambling games, the house always has an advantage, but we still see gamblers playing such games, hoping for a large win. Software engineers have a tendency to rebuild things from scratch on encountering inconveniences - not thinking clearly about the time, expertise and investment needed for such an endeavor - and more often than not they end up with a suboptimal product (in general, aversion toward using something external is called the “not invented here” syndrome). People also routinely underestimate the impact of traffic on travel time, and when leaving “on time”, arrive late more often than early. In other fields, people who are aware of the probabilistic nature of things also sometimes put too much faith in their probability estimates. The widespread practice of using Gaussians to model any error distribution (instead of just the ones that warrant it) is a great example.

Some very important consequences to think about:

  • People assume things are way more predictable than they actually are, and cover for extreme situations way less than they should. This has been seen all the time in financial markets, for example, and history repeats itself.
  • People put too much trust in what has already happened as a predictor of the future in order to sleep soundly at night. While it is true that you should learn from as much history as possible (definitely better than learning from only the last decade or so), remember that the craziest things are always going to happen in the future. The world is an adversarial place.
  • A common idea is that a decision was a good one if its outcome was good - “the ends justify the means”. Let’s say that in a lawless land, a person A was faced with the choice of either paying a dollar to the mafia, or playing Russian roulette to win twenty. A decided to play Russian roulette and won 20 dollars. Person B praised A for having made a profitable decision. What opinion would you form of person B? Though this example was a bit extreme, it hopefully shows you the dangers of retrospective reasoning. It would have been a completely different story had A seen the future unmistakably before making that decision - we can only make decisions that are the best according to the knowledge we have at the moment.
  • People also think that the more successful a person is, the better the decisions they have taken in life (i.e., that track records matter). And conversely, that if someone is not really successful, then they can’t make good decisions in life. This is a non-obvious extension of retrospective reasoning. Of course, people who take better decisions in life tend to be more successful, but as the magnitude of success increases, the prior of being successful is so muddled by tail events that no one could have controlled for, that the chances of the person having made better decisions either don’t change or drop (by virtue of better decisions guarding people from unlimited risks on both sides, as a side effect of protecting them from the downside - there’s no free lunch). And this is after accounting for the fact that sometimes success is just chance.

To get the last sentence across, consider the following folklore story (I might be misremembering some details; I have been told that this is a true story, but also to take it with a grain of salt):

It’s January, and you get physical mail from some offshore company saying that your country’s main financial index will be up at the end of this month. You ignore this as a prank, but realize from the news on the 31st of January that the writer was indeed correct. You dismiss this as sheer luck and go on with your day. The next day, you get mail from the same company, saying that by the end of the month, the index will be down from its current level. You ignore the mail again, and again the mailer turns out to have been right. This happens in March, April, and so on, all the way up to September.

The company is always right! Convinced of the future prediction capabilities of the company, you write back to the address on the letter which was constantly soliciting funding for their hedge fund from the masses. You get back a letter giving you their bank details, and you wire half of your life savings to it, to which you get a prompt acknowledgment. Then you wait for next month, and your money is gone. You are greatly distressed, and while talking to your neighbor on your porch, the neighbor tells you that he got the mail in January too, but the February prediction was wrong.

Immediately you realize what the scammers had done. They had started out with the postal addresses of some ten thousand households in affluent neighborhoods. In January, they sent “up” predictions to half of those people and “down” predictions to the other half. The next month, they chose the set of people who got the right predictions, and sent “up” predictions to half of them, and “down” predictions to the other half. They did this for as long as some people took to get convinced that the scammers could predict the market.

After nine months, around twenty people would have seen a fully correct sequence of market calls. At least one of them was bound to be fooled by this sequence of binary predictions. And there was our loser. For the price of around twenty thousand postage stamps, the scammers would have made a couple million bucks. There would have been people who were convinced for less, so there were potentially a lot more victims, and the scammers probably made a fortune.
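
The arithmetic of the scam, for the curious (numbers as in the story):

```python
# Halving the mailing list every month: who is left after nine correct calls,
# and how many letters it took to get there.

households = 10_000
letters_sent = 0
for month in range(9):
    letters_sent += households    # everyone still on the list gets a letter
    households //= 2              # keep only those who saw a correct call

print("people left with a perfect nine-month record:", households)   # 19
print("total letters (stamps) used:", letters_sent)                  # 19,960
```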

The moral of the story? Don’t get your probabilities mixed up.

Conclusion

Okay, now that you know about this framework, how do you use it?

Let’s go over what we discussed (though I would highly recommend re-reading the above post to extract the most value from the details).

  1. Acknowledge first-order uncertainty - the first step is to acknowledge that there is uncertainty in all our decisions. Probabilistic thinking provides a framework for doing this in a systematic and disciplined manner, allowing us to make better decisions. Being rigid about beliefs makes people slow and encodes less information into why they did what they did.
  2. Update your beliefs and be humble - new information is almost always helpful, as long as you don’t discard valuable old information too (the optimal tradeoff is debatable, though). Updating your beliefs may make you sound like a hypocrite, but on the other hand it shows that you’re being humble, flexible and respectful to what the world has in store for you. In this regard, the world generally reciprocates in the manner you treat it, no matter how cliched this might sound.
  3. Acknowledge higher-order uncertainty - if your outcomes were uncertain, are there really any grounds to believe that the probabilities you assign to outcomes are perfect? We need to acknowledge that our estimates of impacts, probabilities, or even our consideration of possible events will be wrong. While probabilistic models give us a better look at risk and uncertainty, they are generally based on historical data and priors, and assumptions that may not always hold, and it’s not hard to get tunnel-visioned into using them all the time. Always consider the possibility of rare, high-impact events that might not be foreseeable at all, and structure your actions in a way that prevents ruin in such scenarios. Sometimes it might just not be possible at all, though.
  4. Acknowledge biases and actively try to eliminate the bad ones - cognitive biases like the ones mentioned in the last section can distort your perception of probabilities and risk. Counteract these biases by seeking out more information and validating your assumptions. Probabilistic thinking will generally help you avoid such biases if you just sit down and think about it, so introspection does go a long way. Do remember that not all biases are bad, and biases can only be recognized against a standard “unbiased” thought-process, which will vary under different priors.

This post does not claim that necessary amounts of risk should not be taken (or that you should always be overly conservative in your estimates), as a cursory reading might seem to suggest. It advocates for a more nuanced way of thinking that lets you minimize bad decisions by virtue of being careful. To constantly update your beliefs about something, you need to explore it in the arena at some point (or at least someone else does, because second-hand experience for you is first-hand experience for someone, but at some point it will have to be you) - so to remain competitive, you should take on at least enough risk to let you track the world, and enough risk to let you achieve your goals (as Magnus Carlsen says, many good chess players could have been the best in the world had they been more confident). If this point is still not clear, I recommend reading this blog until the understanding settles in - it’s something that will change how you think about probabilities in the real world.

As always, feel free to let me know if you have any questions or comments!

Disclaimer

The information provided in this blog is for educational and informational purposes only. It is intended to offer insights into the concepts of probability, uncertainty, and cognitive biases, and to stimulate thoughtful discussion. The content does not constitute financial advice, investment advice, or any form of professional advice.

All information discussed in this blog was publicly available prior to the publication of this content.

The opinions expressed in this blog are those of the author and do not necessarily reflect the views of any organization or entity. Readers should conduct their own research and consult with a qualified financial advisor before making any financial or investment decisions. The author and the blog make no representations or warranties regarding the accuracy, completeness, or reliability of the information contained herein, and assume no liability for any actions taken based on the content of this blog.

The blog does not provide any trade secrets or proprietary information, and is not intended to guide specific trading strategies or financial outcomes. All trading and investment activities involve risk, and past performance is not indicative of future results.