If you’re here for the game, visit this.

Calibration

Everyone knows that overconfidence can lead you to make reckless, unnecessarily aggressive decisions, while underconfidence leads to not taking enough opportunities. This duality shows up everywhere in real life - from domains like investing and business, where real money is at stake, to more personal matters like career progression and navigating interpersonal relationships.

If we were always right about things, we could blindly believe in ourselves, and overconfidence would not exist. If we were completely clueless (for some definition of completely clueless), we would be better off asking a stochastic parrot to make our decisions for us.

But, of course, we have incomplete information about almost everything, as much as we would like to have perfect information. So the optimal play is to factor that in somehow, so that our decisions accurately reflect our beliefs about the payoffs of the different choices we could hypothetically make - in a probabilistic sense, for instance.

This is what we call calibration - aligning your confidence with reality. A well-calibrated person would not be oblivious to their own under- or overconfidence, and would theoretically self-correct to be optimally confident in their decisions. By the way, this is pretty rare in a general sense (if it is possible at all), and is subject to model error, if you’re into that sort of thing.

Quantifying calibration

That was a pretty high-level explanation. How do we actually check whether we are calibrated? Given how much of a black box human intuition seems to be, we should probably deal with it the same way we would approach any complex system - by designing tests that poke at some observable property of the system, in the hope of getting some interpretable information out of it.

Let’s talk about one such (trivia-based) test, which is quite popular (for instance, I was introduced to this sort of test by a test here). Note that it estimates only certain aspects of your “global” calibration (namely, how well you’re calibrated with respect to trivia), but it can be a lot of fun and also provides an opportunity to reflect on your confidence.

One of the defining properties of good calibration is consistency between your probability estimates and what actually happens. One of the most obvious tests, then, is to check whether the mean confidence I have across the problems of a test equals the fraction of problems I end up getting right.
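As a rough illustration of that overall check (the data here is made up), this sketch compares the average stated confidence with the observed accuracy over a run of questions:

```python
# Minimal sketch of the overall calibration check (hypothetical data).
# Each entry is (stated confidence that the answer is correct, whether it actually was).
results = [(0.9, True), (0.6, False), (0.8, True), (0.7, True), (0.5, False)]

mean_confidence = sum(conf for conf, _ in results) / len(results)
accuracy = sum(correct for _, correct in results) / len(results)

print(f"mean confidence: {mean_confidence:.2f}")  # 0.70 for this data
print(f"observed accuracy: {accuracy:.2f}")       # 0.60 for this data
# A large gap (mean confidence >> accuracy) hints at overconfidence, and vice versa.
```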

Another way to think about this is to look at each question in a test and classify it according to some algorithm (or just vibes) - in more concrete terms, you would specify an estimate of the probability (perhaps with some error bars) that you will answer the question correctly. For example, you may ask yourself “Does this look like a question I would get right about 70% of the time?” and then, to validate your calibration, do this over a large number of questions while noting down your answer and confidence for each one. Being well calibrated in the 70% range would mean that among the answers you marked as 70%, you are right around 70% of the time. It might be slightly hard to reason about because it feels so self-referential, but this very property is also what makes it a good measure of what it is trying to measure (locally, at least).
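Here is a sketch of that bucketed check (the binning and the data are arbitrary choices of mine, not anything the linked test prescribes): group answers by stated confidence and compare each bucket’s hit rate with the confidence it was given. Plotting hit rate against stated confidence is what is usually called a reliability diagram; deviations from the diagonal are miscalibration.

```python
from collections import defaultdict

# Hypothetical (stated confidence, was the answer correct) pairs from a long run of questions.
results = [(0.7, True), (0.7, False), (0.7, True), (0.9, True),
           (0.9, True), (0.5, False), (0.5, True), (0.7, True)]

# Group answers into buckets by stated confidence (here: rounded to the nearest 10%).
buckets = defaultdict(list)
for conf, correct in results:
    buckets[round(conf, 1)].append(correct)

# Being well calibrated means each bucket's hit rate is close to its stated confidence.
for conf in sorted(buckets):
    outcomes = buckets[conf]
    hit_rate = sum(outcomes) / len(outcomes)
    print(f"stated {conf:.0%}: answered {len(outcomes)}, got {hit_rate:.0%} right")
```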

So armed with these metrics, the test goes like this: for each trivia question, you give an answer and a probability estimate that your answer is correct. At the end, you compare this with a perfectly calibrated result. The test linked above does precisely this.
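One common way to summarize such a run into a single number (I don’t know whether the linked test uses exactly this) is the Brier score: the mean squared difference between your stated probability and the 0/1 outcome, where lower is better.

```python
# Brier score: mean squared error between stated probabilities and 0/1 outcomes.
# Hypothetical data; lower is better, and always answering "50%" scores exactly 0.25.
results = [(0.9, True), (0.6, False), (0.8, True), (0.7, True), (0.5, False)]

brier = sum((conf - float(correct)) ** 2 for conf, correct in results) / len(results)
print(f"Brier score: {brier:.3f}")  # 0.150 for this data
```

Note that the Brier score rewards knowledge as well as calibration, so it is not a pure calibration measure; the bucketed check above isolates calibration more directly.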

Modifying the original test

There were a few things that felt a bit off with the original test, which I took the liberty of changing in the new test:

  • Static question bank: Using the same questions over and over makes it easy to memorize them, which pushes a lot of estimates to 100% and defeats the purpose of the test by artificially inflating scores. I tried to rectify this by using a very large question bank (~4000 questions), courtesy of the Open Trivia Database - this would not have been possible without their dataset and friendly licensing terms! (Though I have to admit - I couldn’t vet such a large database as thoroughly as I would like to - there seem to be a few errors, but for the most part they should not affect the statistical accuracy of the results.)
  • No control over the distribution of questions: This was a given with the original test, so I added a bunch of categories and difficulties in order to zoom in on certain parts of the question set and check for calibration issues on different types of questions (a rough sketch of this kind of filtering appears after this list). This also helps avoid some pitfalls of the question database (apparently, questions on video games form a big chunk of it, but I did not undersample those because the very fact that there are a lot of them suggests that people like such questions). Note that it is also important to test yourself on things you have no idea about, so the default option is to test yourself on all questions.
  • Too many/few questions: Lots of people like to have a go at these kinds of tests just for the fun of it, so, to let more people reflect on how confident they are, I’ve added a shorter test. And for the people who want their results to have more statistical significance, I’ve also added a hundred-question test.
  • Only boolean answers: The original test only had true/false answers. Sometimes our judgment gets clouded by having more than two options (Arrow’s impossibility theorem deals with this in the form of ordinal preferences too). This is why I added a multiple-choice mode with more options, which has the added bonus of forcing you to stay consistent across alternatives while answering.
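To make the category/difficulty point above concrete, here is a rough sketch of how that kind of filtering might look over a local question bank. The field names and values are illustrative only - they are not the actual schema of the Open Trivia Database or of my test.

```python
import random

# Illustrative question bank; field names are made up for this sketch.
QUESTIONS = [
    {"category": "Video Games", "difficulty": "easy",   "question": "...", "answer": "..."},
    {"category": "History",     "difficulty": "hard",   "question": "...", "answer": "..."},
    {"category": "Science",     "difficulty": "medium", "question": "...", "answer": "..."},
    # ... ~4000 more entries
]

def sample_questions(bank, n, categories=None, difficulties=None):
    """Pick n random questions, optionally restricted to some categories/difficulties."""
    pool = [q for q in bank
            if (categories is None or q["category"] in categories)
            and (difficulties is None or q["difficulty"] in difficulties)]
    return random.sample(pool, min(n, len(pool)))

# E.g. a twenty-question quiz restricted to two categories; leaving the filters as
# None corresponds to the default of testing yourself on everything.
quiz = sample_questions(QUESTIONS, n=20, categories={"History", "Science"})
```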

Other aspects of calibration testing

One observation from real life is that even if people are good at assigning probabilities to things, they often overestimate how good they are at forecasting the impacts of those things - that is, thinking in probabilities vs. thinking in expected values. To check calibration with respect to this, it might be helpful to have a test that forces you to make decisions in the face of skewed outcomes. In general, an ideal test of calibration would be much broader than what we have at the moment, and would at the same time force the test-taker to estimate their uncertainty more precisely. However, an ideal test, if one exists, must also have a way of quantifying the impact of unknown unknowns, which might very well be impossible.
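As a toy illustration of the probabilities-versus-expected-values gap (the numbers are made up): a bet can be unlikely to pay off and still be worth taking, and assigning the right probability to it is not the same thing as acting on the right expected value.

```python
# A skewed bet (made-up numbers): 10% chance of winning 900, 90% chance of losing 50.
p_win, payoff_win, payoff_lose = 0.10, 900, -50

expected_value = p_win * payoff_win + (1 - p_win) * payoff_lose
print(f"P(win) = {p_win:.0%}, most likely outcome: {payoff_lose}")
print(f"expected value: {expected_value:+.0f}")  # +45: positive despite usually losing
# Being calibrated about the 10% is necessary but not sufficient; the decision also has
# to weigh the skewed payoffs (and your risk tolerance), not just the probability.
```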

This test also only quizzes people on trivia, which means it addresses calibration on epistemic uncertainty (i.e., uncertainty about something that could in principle be perfectly known to you) rather than aleatoric uncertainty (i.e., fundamental/intrinsic uncertainty, for instance about things that may occur in the future). Aleatoric uncertainty is much less well behaved, so tests for it look quite different from this one - for example, one narrow class of games that try to account for it make you internally model the entropy of your own decisions, as a test of whether you can reproduce an RNG in some sense (for instance, this). So it would be interesting to have essentially the same test, with minimal modifications, that can check your calibration on either kind of uncertainty. (You could say real life is such a test, but games are perhaps more fun and have more measurable causal outcomes.)
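To make the “reproduce an RNG” idea slightly more concrete, here is a toy predictor in the spirit of such games (my own sketch, not the game linked above): it tries to guess your next bit from the two bits you just typed, and the further it gets above 50%, the less random your sequence was.

```python
from collections import Counter

def predictability(bits):
    """Score how predictable a 0/1 string is from the previous two symbols.

    Returns the fraction of bits a simple bigram-context predictor guesses right;
    a genuinely random sequence should hover around 0.5 on long inputs.
    """
    counts = Counter()  # counts[(previous two bits, next bit)]
    correct = total = 0
    for i in range(2, len(bits)):
        context = bits[i - 2:i]
        # Guess the bit seen most often after this context so far (ties default to "0").
        guess = max("01", key=lambda b: counts[(context, b)])
        correct += (guess == bits[i])
        total += 1
        counts[(context, bits[i])] += 1
    return correct / total if total else 0.0

print(predictability("0101010101010101"))  # highly predictable, close to 1
print(predictability("0110100110010110"))  # less obviously so
```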

Some observations

There were some interesting observations when I gave this test to some people I know:

  • People who deal with uncertainty for a living are generally better at this test than others who don’t have as much daily exposure to uncertainty. This includes people running businesses, working in finance, and so on. I’m guessing that playing poker and similar games for a living also has a positive effect on certain parts of calibration.
  • Some confidence ranges show systematic idiosyncratic behaviour. For instance, in true/false tests, people are often poorly calibrated in the 70% range, and in the multiple-choice tests, people seem to have issues in the 60%-90% range. My suspicion is that reasoning about higher probabilities is hard, and that this is related to how we throw around estimates like 90%, 99%, 99.9% and so on all the time, while rarely stopping to think about medium-high probability estimates.

Improving your own calibration

There are a bunch of great resources that talk about improving your calibration better (and perhaps more practically) than I could, for instance here and here (among a large number of other posts on similar websites). I urge you to go and have a look at those resources. Similarly, there are a lot of books about probabilistic thinking and the like that are aimed at making you a better decision maker.

However, if you’re not paying attention, it’s pretty easy to let these tests lull you into a false sense of security, or to become overconfident about your calibration itself. Unknown unknowns vastly outnumber the known unknowns, which is why decision making always needs some amount of risk-aversion. And sometimes things are fundamentally incomparable (unless you have utilitarian beliefs, that is). As always, most ideas are tools for approximating the world and must not be treated as undeniable truths, and the same goes for any sort of estimation.