If you have been on the internet for a while, you probably know by now that LLMs (large language models) are a thing, and talking to them feels pretty close to talking to a human.
If you have been on the internet for a decade or so, you can probably guess what this blog is going to be about.
The task at hand was using the ever-handy yt-dlp to download an interesting YouTube video that I had come across in my old link archive.
The last time I had used yt-dlp was quite a while back, and these days, every time I see a man page that tells me that a command I don’t care about has more than 5 options, I stop reading. Naturally, this is the case for many tools for downloading stuff from the internet, so I just follow these steps:
Sadly, tldr (which is a great tool, by the way) didn’t have a page for yt-dlp. So I loaded up llamafile (the LLM tool of my choice), and anticipating that the LLM would probably give a command to just download the video without audio (a handy feature of yt-dlp that’s also a footgun at times), I gave Llama 3.1 8B the following prompt.
Okay, everything looks fine… but wait. That sample link must be familiar to anyone who didn’t start using the internet just yesterday.
If you weren’t already familiar with this, congratulations: you have just been rickrolled for the first time. Read the Wikipedia page if you don’t know what it’s about.
An obvious potential reason popped up in my head almost immediately. The vast popularity of the meme (including the fact that the video code appears 91k+ times on GitHub) should have made it by far the most shared video link on the internet. A grim reminder that an LLM is only as smart as its training data (you can also see this when you search for how to do things with tools that came out after the knowledge cutoffs of these LLMs, funnily including, at times, the APIs for the LLMs themselves).
But I digress.
Coming back, I tried using the largest model I had locally - the Gemma 2 27B. The reason I downloaded it was that people were saying that it performed on par with much larger models at the time of release.
The same outcome. I had heard stories about how Google’s Gemini models were/are pretty heavily censored, so at this point I was wondering whether they had done the same sort of thing for their open-source models. Or whether they use a separate layer for detecting “harmful” content instead of giving the LLM the raw input or serving the raw output from the LLM. Or maybe it’s because the models I am using are quantized, and according to a recent paper, current “unlearning” methods are fragile under quantization.
Anyway, moving forward, I tried another local model which was released more recently - the Qwen 2.5 14B.
Surprisingly, the model knew, without any prompting from me, that something was up with the link, but it didn’t diagnose the problem properly. (In fact, if you’re into rickrolling lore, you’ll recognize that there were indeed other uploads of the same video on YouTube, meant to stop people from memorizing the link and to force them to watch at least the first few seconds, with some sophisticated rickrolls redirecting legitimate-looking domain names to these links.) If you add a newline to the end of the input, it diagnoses the problem properly too.
At this point, I thought - okay, enough local model spam. Let’s try to see how closed models perform at this.
The first closed model was OpenAI’s no-login-required GPT-4o mini.
Suspiciously close to the Gemma output, and pretty much the same thing.
Thinking that they probably don’t care about people who aren’t using their services with an account (beyond the fact that all GPT-4o mini outputs are probably being used to train their models), I decided to actually try out GPT-4o.
This turned out to be a valid YouTube link, but it linked to a TED talk about body language instead of the promised montage of oddly satisfying clips. Had it only been a rickroll…
To rectify this, I tried to ask it for URLs instead of a single URL.
Okay, so it thought that I wanted some channels to watch. My bad I guess, since channel pages are also valid YouTube URLs, and they have a preview of their latest videos.
This time, I specifically asked it for some video links.
This is cheating, though - we are looking for how LLMs behave (and even if it can be argued that RAG capabilities are also just appendages on top of models, we should not compare apples to oranges here). In this spirit, I tried to get it to output something that it already remembered.
Okay, we get some interesting results. Four of the links work (the second-last one does not), and the titles of a couple of videos seem to have been changed by their uploaders over time. This might give an idea of when these videos were scraped, or what the internet refers to them as (in the training dataset, of course).
But can we still reverse-rickroll the LLM (in the sense of making it serve us a rickroll)? Back to the basics, I guess:
This result was kind of surprising to me, given that just before trying out the closed models, I had searched online to check whether someone had already done this (to decide if it was worthy of a blog post). It turned out that someone who had built their company’s AI assistant on GPT-4 got reports that their customers were being rickrolled. Surely OpenAI was going to take safety measures against this, right?
Well, they technically did (as can be seen through the amount of work we needed to do to coax out the link from the model), but as the paper linked earlier points out, unlearning is not an easy game. And LLMs need to be more interpretable if they are to comply with constraints that humans can reason about.
Anyway, I tried to do this with the new Claude Sonnet 3.5 too. At first, I was too lazy to actually log in, so I tried using a random website called hix.ai that I found on some random thread online:
It couldn’t be this convenient, right? I was not sure whether they were using some other model under the hood (though I originally went there to try out o1 and o1-mini a while ago, and from the quality of the results I was convinced that they were indeed using them), or whether there were differences between the chat interface and the API interface, so I just went to Claude directly.
And this is where the fun starts.
Well, this is the overly-censored Claude I know. I tried a bunch of things to get it to generate at least some random link, but to no avail. So the only thing that remained was to try the indirect approach.
Well, this gets us to some random URL (which is actually correctly labelled), but it’s still not a rickroll. However, I noticed something weird. It didn’t open the code region that it generally does when you ask it for code.
Can we make our query simpler and check if code-mode was more compliant with the underlying probability distribution?
Turns out that we can.
Just for fun, I asked the models to also come up with a bunch of other links. Here’s a summary of how they did:
Here are a few conclusions that I derived/reinforced from this whole exercise:
When we reason qualitatively, we often rely on our intuition. However, intuition is often loaded with certain meta-biases that cloud our judgment; one of these biases comes into play when we think about “local” and “global” statements.
What is a local or a global statement? One way of distinguishing between them is how many conditions we must guarantee hold before we can talk about the statement - so this terminology is relative. A local statement is a statement that uses many more conditions (a higher degree of specificity) than a global statement. For example, any statement about a certain group of people in the past few years is a local statement relative to a statement about all species that have ever existed on the earth.
So the crux here is whether you are talking about the whole or a subset.
Enter Simpson’s paradox (not really a paradox, though). This refers to a bias where we tend to think that if a statement is globally true, it must be locally true as well (or vice versa).
The most common example of this is the following (called Simpson’s inversion). Think about this: if a batsman A’s average (runs per innings) is more than batsman B’s average in 2 seasons separately, will A have a higher average than B overall?
Don’t blame yourself if you thought, on first instinct, that the answer is yes. The answer is in fact no and has been seen historically in this same context too. No points if you thought it was no just because I would not be writing this post otherwise.
Before we resolve this apparent paradox, I want to point out why we are prone to this fallacy in the first place. The key word here is average. Whenever we look at averages, we miss out on cumulative aggregate behaviour and instead focus on efficiency. Keep this in mind when we show an example below (we’ll also look at this mathematically). Note that this simplification is adjacent to the one made in the representativeness heuristic, in which we discard the statistical probability (which talks about the relative frequency) of an event being in a group and instead just look at how close it is to a representative element of the group. This explanation will make much more sense after the example, but it’s a good guiding principle to keep in mind.
Here’s a concrete example as promised:
Let’s say A played 10 innings and B played 100 innings in the first season, and that A played 100 innings and B played 10 innings in the second season.
In the first season, A had an average of 20 runs per innings, and B had an average of 30 runs per innings. In the second season, A had an average of 80 runs per innings, and B had an average of 90 runs per innings.
So in all, A made \(20 \times 10 + 80 \times 100 = 8200\) runs, and B made \(30 \times 100 + 90 \times 10 = 3900\) runs. So A made more than double the runs that B made, in the same number of innings.
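This inversion is easy to verify programmatically; here is a minimal sketch with the exact numbers above:

```python
# Verifying the batting-average inversion with the numbers above.
seasons = [
    # (A_innings, A_avg, B_innings, B_avg)
    (10, 20, 100, 30),   # season 1: B's average is higher
    (100, 80, 10, 90),   # season 2: B's average is higher again
]

a_runs = sum(inn * avg for inn, avg, _, _ in seasons)
b_runs = sum(inn * avg for _, _, inn, avg in seasons)
a_innings = sum(inn for inn, _, _, _ in seasons)
b_innings = sum(inn for _, _, inn, _ in seasons)

print(a_runs, b_runs)                          # 8200 3900
print(a_runs / a_innings, b_runs / b_innings)  # A's overall average wins
```

Despite losing on average in both seasons, A wins overall because the innings counts weight the per-season averages differently.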
Wait, but how is that possible?
If you look closely, you can see a pattern. Whenever it was possible for both players to score a lot, A performed better than B (and capitalized more). In the other case, B was better.
So in this case, the difference comes down to exploiting the opportunity when you had it. We will formalize this theme mathematically too. But before that, let’s see some real life examples of how people make decisions based on this.
If you thought about these, you probably realized that there is a difference between the average outcome and the cumulative outcome (which also depends on how many times you accessed the outcome). This is why local estimates are not always going to be good proxies for global estimates.
Let’s take another guiding example, that will lead us to a slightly different intuition (this time, dealing with why global estimates are not good proxies for local estimates).
On average, treatment A shows a lower survival rate than treatment B. Will you have a better chance of survival if you choose B over A, based on this information?
This seemingly obvious question should make you more suspicious than the previous example. After all, why would I not choose treatment B?
The idea is that not looking at the complete picture is (again) leading to issues here. To see what I mean, consider the following scenario. Treatment A is used more by patients who are far along in the disease than by those who have just been diagnosed, so the unconditional mortality rate for its patients is quite high. Meanwhile, treatment B is used more by patients who have just been diagnosed and didn’t want to undertake treatment A, which could have had more serious side effects. So the unconditional mortality rate is much lower for its patients. However, treatment A is more effective in each of these cases separately, and treatment B does not stand a chance in the more severe cases. So a good choice here would be to identify the case instead of making blanket generalizations.
So the unconditional statistics of treatment A are mostly dictated by the ones who undergo it, rather than the actual efficacy itself.
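Here is a minimal sketch of this scenario; the groups, patient counts, and survival numbers are all hypothetical, chosen only to exhibit the inversion:

```python
# Hypothetical counts: treatment A is chosen mostly by severe cases,
# B mostly by mild ones, yet A survives better within each group.
data = {
    #            (survivors, patients)
    ("severe", "A"): (50, 100),
    ("severe", "B"): (3, 10),
    ("mild",   "A"): (9, 10),
    ("mild",   "B"): (80, 100),
}

def rate(treatment, severity=None):
    """Survival rate of a treatment, optionally within one severity group."""
    rows = [(s, n) for (sev, t), (s, n) in data.items()
            if t == treatment and (severity is None or sev == severity)]
    surv = sum(s for s, _ in rows)
    total = sum(n for _, n in rows)
    return surv / total

for sev in ("severe", "mild"):
    print(sev, rate("A", sev), rate("B", sev))  # A wins in both groups
print("overall", rate("A"), rate("B"))          # but B wins overall
```

The unconditional rate for A is dragged down by the severe cases it mostly treats, exactly as described above.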
Making decisions based on completely global considerations is generally bad compared to looking at both the local and global aspects of it (i.e., doing more cross-sectional analysis). A couple of real life examples:
Now that we’ve understood that one major issue behind this paradox is that local statistics are not good proxies for global statistics (and vice versa), it makes sense to quantify this.
We’ll be looking at it in the easiest setting to analyze mathematically - a partition of the universe into multiple sets, a binary choice for each of these sets (for which the expected value as well as the number of allowed realizations of that outcome are specified), and the probability of being in any set.
Let’s say we have a partition of the universe into \(n\) sets \(U = \cup_{i=1}^n S_i\), and in the \(i\)-th set, we are able to make two decisions: the first has expected value \(u_i\) and you must realize the outcome \(k_i\) times, and the latter has expected value \(v_i\) and you must realize the outcome \(l_i\) times. The probability of the future lying in the \(i\)-th set is \(p_i\) (we can also model the problem by allowing making a decision for each set exactly once instead of assigning these probabilities).
Simpson’s paradox in the first example can be looked at like this: we can have \(u_i < v_i\) for all \(i\), but end up having \(\frac{\sum_{i=1}^n u_i k_i}{\sum_{i=1}^n k_i} > \frac{\sum_{i=1}^n v_i l_i}{\sum_{i=1}^n l_i}\).
One question that can be asked in this setting (and the question we will be thinking about for the rest of this post) is: how do you maximize (or minimize, equivalently) the total value of the bet? What if you need to maximize the expected value of the bet? What if you need to do something in between?
The independent choice case is easy (i.e., when you know the future is in a certain set before making the decision):
This roughly corresponds to the second example above. The very fact that we knew whether the “future” was in some set (no temporal aspect is necessary here, you can also treat it as having extra cross-sectional information about the situation at hand) was able to help us make better decisions.
The independent cases were obvious, where being greedy was optimal. The mathematically non-obvious scenario comes when we have dependencies, and it is hard to tell whether being greedy is optimal or not.
Let’s look at one example of when we have dependencies. For now, let’s assume that we must either make the first choice in all sets or the second choice in all sets (i.e., when we don’t know anything about the future except the payoffs and probabilities of outcomes conditioned on whether the future is in that set or not). Also for the sake of simplicity, let’s assume that all the sets are equally likely.
Then we need to choose between these two payoffs: \(\sum_{i=1}^n u_i k_i\) and \(\sum_{i=1}^n v_i k_i\). This part itself shows that it is important to look at the cumulative outcome instead of the average outcome.
What if there was more freedom available to us? For example, let’s say we were allowed to choose the \(k_i\)-s, but under certain constraints. Suppose for the sake of simplicity that the constraints are frequency constraints, so that the expected number of realizations is the same (or the sum is the same). In other words, you want something like allocating 1 realization to 4 sets, 2 realizations to 2 sets, 3 realizations to 1 set, 4 realizations to 3 sets, and so on.
So before choosing between the first choice and the second choice, we can fix how many realizations we want to get (i.e., choose \(k_i\) according to values of the \(u_j\)-s), and then choose the better of the first and second choices.
In other words, it comes down to maximizing \(\sum_{i=1}^n u_i k_i\), where \((k_i)\) varies over permutations of a given array (or, in more mathematical terms, the multiset of the \(k_i\) is constrained to be a certain given multiset).
The greedy choice here is to assign the largest number of realizations to the largest expected value event. And this works too - the reason is the rearrangement inequality.
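We can sanity-check the greedy choice against brute force on a tiny instance (the numbers below are arbitrary):

```python
from itertools import permutations

def payoff(u, k):
    """Total value when set i gets k[i] realizations of expected value u[i]."""
    return sum(ui * ki for ui, ki in zip(u, k))

u = [5.0, 1.0, 3.0]  # expected values per set (illustrative numbers)
k = [2, 7, 4]        # the multiset of allowed realization counts

# Greedy: sort both the same way, so the largest count is paired
# with the largest expected value (the rearrangement-inequality pairing).
greedy = payoff(sorted(u), sorted(k))

# Brute force: try every assignment of counts to sets.
best = max(payoff(u, perm) for perm in permutations(k))
print(greedy, best)  # 49.0 49.0 - greedy matches the optimum
```

On this instance both give \(5 \cdot 7 + 3 \cdot 4 + 1 \cdot 2 = 49\), and the rearrangement inequality guarantees this agreement in general.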
The rearrangement inequality is a pretty powerful inequality. It basically says the same thing as we have said here already. Here’s the formal statement:
For every choice of real numbers \(x_{1}\leq \cdots \leq x_{n}\) and \(y_{1}\leq \cdots \leq y_{n}\) and every permutation \(\sigma\) of the numbers \(1,2,\ldots n\) we have \(x_{1}y_{n} + \cdots + x_{n}y_{1} \leq x_{1}y_{\sigma (1)} + \cdots + x_{n}y_{\sigma (n)} \leq x_{1}y_{1}+\cdots +x_{n}y_{n}\).
So if two sequences are sorted in the same way, their dot product is the maximum possible, and if they are sorted the opposite ways, their dot product is the smallest possible.
The proof is simple - for the right side, if we try to bubble-sort \(y\) (i.e., swapping out-of-order adjacent elements), we will notice that at every swap, the value remains the same or becomes larger. The left side is analogous (or you can apply it to the case where \(y\) is replaced by \(-y\)).
This is actually pretty powerful - you can use trivial applications of this inequality to prove, for example, the AM-GM and Cauchy-Schwarz inequalities, the former being a weak inequality but harder to prove than the latter. In this way, it is a building block of the most basic yet powerful ways of comparing numbers/outcomes.
As an aside, by varying \(\sigma\) over all permutations and taking the mean, we can also say that \(n\) times the product of the means of the two sequences lies between these bounds. This points to the fact that making decisions at a much finer granularity is likely to be much better than making decisions by just looking at the big picture. (A reader might note that this holds only if we are able to account for different outcomes at a very low level, and does not directly apply to cases where irreducibility/emergent behaviour is a problem and the computation itself is expensive and can distort the overall utility of the decision.)
Take a moment to try to connect these situations and conclusions with the examples we mentioned in the previous section.
As far as trying to understand Simpson’s paradox on a causal level goes, the theory of probabilistic graphical models comes up with certain conditions on probabilistic dependencies between events to detect “causal” structures that allow for certain counterintuitive behaviours. I won’t go into too much detail on this topic here, but it is being brought up to illustrate the point that Simpson’s paradox is not really a paradox, but something that can be reasoned about.
As one can easily guess, this blog is about a mental model of thought. I chose to write about it since I feel that introspection about ways of thinking (and consequently what “thinking before doing something” means) is greatly lacking among people, and that it is critical for making better decisions (and for realizing when there is no “better” decision).
I won’t bore you with the philosophical details, so I’ll approach it from a probabilistic perspective, which is closer to what I personally choose to think in terms of. A word of caution: I will sometimes oversimplify things in order to drive home the point, but sometimes it might seem to contradict with what I say in the later parts of this post. The key here is context, and if you keep track of it, things will make more sense. To understand probability concepts that I mention here in a bit more detail, I recommend reading my post on probability. If you don’t understand/don’t want to go through the math examples here, don’t worry - I’ll intersperse it with a general idea of what we are trying to do, so looking for those explanations should help. Wherever you find something that’s not something you already know, you should probably just make a note of it and go ahead (and read more about it later). If it is still not clear, you can let me know and I’ll try to clarify that part (for you and the other readers of this post). Or even by just looking up the thing on your favorite search engine/LLM, you’ll likely learn a lot, even if in a less pointed manner.
In online discourse, one finds many conflicts over beliefs, with people vehemently defending their positions that may or may not have to do with objective truths (on the other hand, there are a few people who don’t believe that there are objective truths at all, but we shall do away with such cynicism for now, for a more pragmatic view of thought).
From a probabilistic perspective, this means that the probabilities people assign to certain events don’t agree, and their opinions that come out of some sort of “representative”-finding algorithm (like taking the most probable outcome, or the one that has the largest impact) are different owing to this difference. In fact, I have good reason to believe that for many people, if not most of them, these probabilities are either 0 or 1 when they think about it consciously and uncertainty plays no role - and that the continuity is just haphazardly internalized and is updated every time they gain some knowledge from the world, whether it is an opinion or some other observable aspect of something that happened. It might be favorable for us from an “evolutionary” perspective - making some decision is better than not making a decision at all when inaction can wipe you out - but since there is a correlation between being consistently successful and making better decisions (correcting for luck, since for superlative amounts of success, it is often the situation that matters more), trying to make better decisions might just be the missing part for most people.
In my mental model, there are multiple levels of uncertainty. I like to categorize them in the following ways:
For the sake of making things clear, I’ll elaborate each of these points.
This is the place where most people trip up. Improper usage of words and phrases like “impossible”, “makes no sense”, “absurd”, “illogical”, “nonsensical” is often a symptom of not understanding uncertainty in outcomes.
In short, when you use these words, you are denying that a certain outcome is even possible, rather than acknowledging that it has a small but positive probability based on all the information you have so far. And ignoring these outcomes is never a good idea. If you’ve ever multiplied a very small number by a very large number, you will realize that the impact of a very low-probability outcome can overpower the impact of any other outcome in expected value. For example, consider the event “USA starts suffering from hyperinflation within the next decade”. What would you do with your savings in such a case? Would you even care? At first glance, it has a very low probability if you ask any person who has not been living in a cave for the last 50 years. But the outcome will have a huge impact, and its contribution to the expected value of the US economy might not be as small as people think it is. So no smart decision maker will fail to acknowledge this possibility (precisely accounting for it is a different matter altogether).
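The “small number times large number” arithmetic is worth spelling out; a sketch with entirely made-up probabilities and payoffs:

```python
# Made-up numbers: a tail event with tiny probability but huge impact
# can dominate the expected value of the whole bet.
p_tail = 1e-4           # probability of the tail event (hypothetical)
loss_tail = -1_000_000  # impact if it happens
gain_normal = 50        # impact in the ordinary case

ev = p_tail * loss_tail + (1 - p_tail) * gain_normal
print(ev)  # negative: the rare outcome overpowers the common one
```

The tail term contributes \(-100\) against roughly \(+50\) from the common case, so the expected value flips sign even though the tail event is “practically impossible”.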
Humans are complex, but still prone to these fallacies. Most people will get disgruntled with you if you tell them that there is a possibility that their parents are not who they think they are (and the wise choice is to not bring up such topics at all). The same will happen with some people if you tell them that it is possible that their friends have cheated on an exam, or that their country is on the wrong side of a sociopolitical conflict for some definition of wrong. On a forum with cartels or circlejerking being prevalent, you have a much higher chance of either getting heavily downvoted or banned for saying anything that goes against the popular opinion, or suggesting a possibility that goes against their interests. After correcting for social reasons (herd mentality and so on), this arises out of not being able to/wanting to assign probabilities to things you dislike, and simply trying to eliminate them mentally. This is how amateurs end up blowing up in financial markets too, because they don’t hedge against the possibility of being wrong.
A bit of a tangent - math goes further and distinguishes between probabilistic impossibility and mathematical impossibility. Probabilistic impossibility is when an event inherently has probability zero (for instance, what is the probability of picking the integer 1 from the set of all integers if you choose them uniformly?). Mathematical impossibility is something that your chosen axioms don’t allow you to deduce without a contradiction. For example, 0 is not the same as 1 when equality is in the sense of integers. Practically speaking, mathematical impossibility and probabilistic impossibility are the same if we assume that everything is discrete and infinities are not involved, since you can just enumerate all cases.
Coming back to the original point, it is also important to note that “outcomes” as we talk about them here are not related to the future at all. The idea is what you can distinguish based on all the knowledge you have so far - so it has to do with incomplete knowledge, not with whether something lies in the past, present, or future. For example, consider a time before there was any concept of fire. A person analyzing everything that had ever happened in the world would bunch together outcomes based on a tree diagram of some sort, and there would be no branching based on whether someone was burnt by fire or not. Nevertheless, using this, conditional on the probabilities of things remaining the same over time (as well as nothing new happening in the meantime that could change the person’s set of assumptions - i.e., new knowledge), the person would be able to get an estimate of the probability distribution of what will happen given the state of the world, with some accuracy.
In short, we are interested in dealing with incomplete knowledge through the concept of uncertainty (“probability”), and model how we change our beliefs with new and more detailed knowledge.
Probability theory deals with it by thinking of the knowledge available at any instant in time in terms of sigma algebras and conditional probability. A sigma algebra is essentially a collection of sets of “sample points” (each set is called an event, to denote the fact that an event lets you isolate some group of those sample points) and their probabilities that satisfy the basic probability axioms (for the sake of intuition, I am not going to address the fact that this is technically not correct for some infinite sets of events). “Sample points” are the most granular outcomes possible, and they can be bunched into the smallest sets that can be distinguished from one another via certain kinds of observations (the events). Over time, as our knowledge grows, we are able to break these sets into even smaller sets, as we are able to distinguish more and more things due to events branching out. (For the mathematically inclined reader: a sigma algebra on a finite sample space is sort of the power set of these most granular sets, as my post on probability explains. However, for the purposes of this post, we shall just be looking at these most granular sets since it encodes all the information and is easier to think about).
Let’s take the example of a sequence of tosses of a fair coin (we have a closer-to-life example below, but understanding this one is important). Let’s say you have done an experiment a million times, and in that experiment you tossed a coin 100 times. You want to assign ids to all these runs based on the final sequence in a way that indistinguishable things have the same id, so you start looking at the outcomes letter by letter. Let’s say you are two tosses in (i.e., you know the results of the first two tosses for each run of the experiment). Any two eventual sequences are indistinguishable if they have the same results on the first two tosses. So a person who has seen the first two tosses for all experiments will have effectively only 4 distinguishable sets of outcomes (these are the most granular information we can have - does the sequence start with HH, HT, TH or TT?), and we can assign at most 4 distinct ids to the outcomes. As we look at the results of the next toss for all experiments, we are able to distinguish, within one previously granular set, between sequences based on the third toss, so there can be at most 8 distinct ids assignable to the outcomes. In summary, as you go on, you gather more and more knowledge about the specific outcome you got.
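The growth in distinguishable sets can be sketched directly; here the experiment is truncated to 4 tosses for brevity:

```python
from itertools import product

# All possible outcomes of a (truncated) 4-toss experiment.
outcomes = ["".join(seq) for seq in product("HT", repeat=4)]

def granular_sets(tosses_seen):
    """Group outcomes that are indistinguishable after seeing some tosses."""
    groups = {}
    for seq in outcomes:
        groups.setdefault(seq[:tosses_seen], []).append(seq)
    return groups

for t in range(4):
    print(t, len(granular_sets(t)))  # 1, 2, 4, 8 distinguishable sets
```

Each additional observed toss splits every granular set in two, which is exactly the branching described above.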
In the diagram below, after the first toss, the events H and T are distinguishable. After the second toss, HH, HT, TH, TT are distinguishable. Nodes in this tree have all the 100-length sequences of heads and tails with the prefix matching the label of the node. The labels of the edges are the probabilities that, given that we are on the left node of the edge, we get to the right node of the edge.
To think of it in terms of the sigma algebra, let’s look at the successive Venn diagrams below, where each color corresponds to a most-granular set (and we have truncated the experiment to three tosses instead of a hundred):
This Venn diagram illustrates how our knowledge grows over time, breaking down larger sets of possible outcomes into smaller, more specific events.
For a real life example, let’s say you’re investigating a criminal case. You have more and more information flowing in as you proceed with the investigation. In the very beginning, all you know is that the crime happened, and a set of suspects. As the investigation proceeds, you get to learn more information about the culprit (based on eyewitnesses, forensic reports and so on), and that updates your beliefs, making the set of suspects smaller. The sigma algebra framework is a bit more involved, since it also allows all possible universes of outcomes - what if the first eyewitness said that the culprit was wearing a blue shirt instead of a red one? What if the blood type was B and not A? In general, since any part of the investigation is not a hundred percent correct (for example, eyewitnesses can suffer from trauma that affects their memories), we are not really updating the set of possible culprits itself, but as we get more information, we are able to distinguish between the probabilities of certain sets of the suspects having the culprit - each time we get new information, there is a potential split in the most granular sets of suspects we can distinguish. This is done via conditional probabilities, which is really just expanding the original sigma algebra by adding the event of an eyewitness identifying certain people to the sigma algebra, and zooming in to the refinement of the sigma algebra that this addition entails.
Okay, so that was a lot of words. For the mathematically inclined, I recommend reading my post on probability if you don’t know what a sigma algebra is, though you can probably still get by with the intuition we developed above. I’ll present this in a much more systematic manner below:
Let’s say that there are a total of 4 suspects \(\{A, B, C, D\}\) and one of them is surely the culprit, with \(A, B\) wearing a red shirt and \(C, D\) wearing a blue shirt. Additionally, we have \(A, C\) having blood type A+ and \(B, D\) having blood type B+. There is one eyewitness and a blood type report:
Initially, before any eyewitness or report came up, our sample space was \(\{A, B, C, D\}\), and the distinguishable subsets (the sigma algebra) were \(\{\}\) and \(\{A, B, C, D\}\). Here, for example, \(\{A, B\}\) represents the event that the culprit is either \(A\) or \(B\). So the probability associated with \(\{\}\) is \(0\) and that with \(\{A, B, C, D\}\) is \(1\).
Let’s add the testimony of the eyewitness. To do so, we add the event \(E_1\) to the sigma algebra, which corresponds to the eyewitness being correct. Then we figure out all the possible distinguishable subsets. Since the testimony of the first witness can’t distinguish within \(A, B\) or within \(C, D\), our sets can only be of this form:
In other words, our most granular sets are \(\{A, B\}\) and \(\{C, D\}\).
Note that we still don’t know the probability of, say, \(A\) being the culprit. The best we can do at this point is to apply an arbitrary prior on \(\{A, B\}\) to decide who is the culprit, which is not how one should approach an investigation anyway. This is where the distinguishability comes into the picture - in the generating process of the crime scene, we have no basis for saying whether A or B was more likely to be the culprit (or whether they were equally likely, for that matter). A might have had more of a tendency to commit the crime at hand across multiple runs of the universe, and B might just be a one-year-old kid who can’t do anything by themselves. The moment you assign a prior (like A and B being equally likely to be the culprit), you are claiming that the sigma algebra of the generating process looks a certain way (in terms of both the sets as well as the probabilities of those sets), by adding an event of the form “I got a report that says that one-year-olds are 90% less likely to commit a crime than an adult”. This is exactly what we did when we accounted for both the eyewitness and the report; it was an indirect prior.
Now let’s try adding the information of the report. We can distinguish between \(\{A, C\}\) and \(\{B, D\}\) now as well. One of the axioms of a sigma algebra is that the intersection of two sets in the sigma algebra must be in the sigma algebra too. So \(\{A\}\) should be a distinguishable set (i.e., an event in the sigma algebra) too. By doing this for each pair of sets, we will realize that our most granular sets are \(\{A\}\), \(\{B\}\), \(\{C\}\) and \(\{D\}\).
Since the report and the eyewitness are independent, we have \(P(A) = P(E_1 \cap E_2) = P(E_1) \cdot P(E_2) = 0.7 \cdot 0.95\). The rest of the probabilities can be found analogously.
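The refinement above fits in a few lines of code. A minimal sketch of the example (the suspect labels and the reliabilities 0.7 and 0.95 are from the text; the variable names are mine):

```python
# Partitions induced by each piece of evidence, with the probability
# that the evidence is correct (0.7 and 0.95, as in the example).
eyewitness = ({"A", "B"}, 0.7)     # "the culprit wore a red shirt"
blood_report = ({"A", "C"}, 0.95)  # "the culprit has blood type A+"

suspects = {"A", "B", "C", "D"}

# Each atom of the refined sigma algebra is an intersection of one side
# of each partition; independence lets us multiply the probabilities.
posterior = {}
for s in suspects:
    p = 1.0
    for subset, reliability in (eyewitness, blood_report):
        p *= reliability if s in subset else 1 - reliability
    posterior[s] = p
```

Here \(P(A) = 0.7 \cdot 0.95 = 0.665\), and the four atom probabilities sum to 1, as they should.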
What we did here was track down the probabilities by refining the sigma algebra (which in the end was the power set of \(\{A, B, C, D\}\)) via evidence as we got more of it.
An aside: note that handling contradictory witnesses is a more nuanced topic, but you must either make a choice or work out all possible sigma algebras and give them weights based on some priors (the easiest one is reliability, but it might be possible that reliabilities are not comparable across subsets of evidence). Contradictory witnesses essentially challenge the estimates of reliability that are being used, forcing us to update our basic beliefs as well.
Similarly, we assumed here that the witnesses (events) are going to be independent. In the general case, we require their joint distribution to be able to update more granular probabilities via Bayes’ theorem.
This technique is not restricted to this example. We can model a ton of stuff using these ideas or their generalizations, provided that we have reliable estimates for everything we need. We can also sometimes bound (probabilistically) some part of the effect our actions would have, given our beliefs about (or bounds on) the probabilities involved.
Even just acknowledging that there are probabilities associated with certain events puts you halfway between amateur decision making and decision making at the highest level.
However, probabilistic thinkers often become overconfident in their estimates by ignoring certain uncertainties in their estimation methods themselves. One of those is the very fundamental uncertainty in the relative outcomes of two events.
When I say outcome here, I’m referring to both some metric of impact of an event, conditional on it happening, as well as the probability of the event happening.
Let’s consider the uncertainty in impact of an event, under the simplifying assumption that the probability of something happening is perfectly known. There are multiple ways people can disagree on this:
Horizon: “Shortsighted” and “farsighted” are two words that can be found in some form in almost every language - they exist precisely because there is a real distinction between them. People are fundamentally biased towards certain horizons, based on what their circumstances have trained them to care about. The impact of an event tomorrow is not the same as the impact of one happening a couple of years down the line. It would matter to you if your bank balance went to zero tomorrow, but you likely won’t care about the existence of humanity a million years from now. If you’re 70 years old, you will not be that concerned about the IT industry 10 years down the line, but an 18 year old choosing a CS degree would want to take into account whether the IT industry will still be a secure choice. A business owner would want to ensure that their investment in the business breaks even, so their horizon would be longer than that of an employee, who has much less at stake. So a person who wants to make quick money as an employee is best off chasing hype, but one who wants to make a safer long-term bet on their business will want to correct for hype cycles, by choosing a business that makes (or should make) money regardless of hype in the market.
All this is pretty justified. I care about one horizon, and you care about another. Uncertainty enters the picture when you think about when something will happen - the choice of horizon directly shapes the estimation itself.
Risk aversion: We can model some part of human behavior by assuming that people have different “utility” functions (for those in the know, I don’t like utility functions either, but please bear with me), and as a consequence, they demand premiums for taking on risk. Consider a person who does not like risk but is faced with the decision to buy a stock which has a chance of going up or down a hundred dollars with equal probability. For him, the happiness from gaining a hundred dollars will not offset the sadness from losing a hundred dollars, so he will refuse the bet. He would rather demand a risk premium - i.e., based on his perceived happiness from gaining or losing a hundred dollars, he will compute the expected utility gain from holding the stock (which will be negative), and would like to buy the stock at a better price that offsets this value. Similarly, anyone faced with any risk will have different utilities for different events. Someone living on an isolated private island with enough means of survival will not be as concerned with (and will assign lower magnitudes of negative impact to) the downfall of the global economy as the director of a bank, but will be more concerned about the possibility of tsunamis than the latter. Just to clarify - risk aversion is not bad. It arises out of taking your future into account (for example, the probability of bankruptcy, opportunity cost and so on). As a side note, the effects we model using utilities can often also be modelled by fudging the probability estimates, and this method has its own pros and cons.
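The “better price” is just the certainty equivalent of the bet. A toy sketch with a logarithmic utility (the choice of log utility and the $1000 starting wealth are my assumptions, purely for illustration):

```python
import math

def certainty_equivalent(wealth, outcomes, utility, inverse_utility):
    """Wealth level whose utility equals the bet's expected utility."""
    expected_utility = sum(p * utility(wealth + dx) for p, dx in outcomes)
    return inverse_utility(expected_utility)

wealth = 1000.0
bet = [(0.5, +100.0), (0.5, -100.0)]  # fair coin flip for $100

ce = certainty_equivalent(wealth, bet, math.log, math.exp)
risk_premium = wealth - ce  # what he'd give up to avoid the bet
```

With these numbers the certainty equivalent is about $995, so even a perfectly fair bet is worth roughly $5 less than its expected value to this person - that gap is the risk premium he demands.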
This is not just about disagreements either. If you are trying to model people’s behavior, you need to be aware of these preferences. And people are not perfectly rational (for some mathematical definition of being rational), so this brings uncertainty in your estimates (for example, chronic gamblers are risk-loving instead of risk-averse, but they might be rational in their own eyes while being irrational in terms of some math definition).
Note that the uncertainties we have discussed so far concern computing the relative impact of things from a people perspective. This is not limited to humans - we often have large and complex systems which sometimes cut corners for “better performance” on average (in more sophisticated terms, we call this a tradeoff), and this varies from being completely fine to completely unjustified. For instance, the chess engine Stockfish has some memory unsafe code. No comments from my side on whether this decision was a net positive for whatever the project cares about (or whether this discussion was even a good use of time). As we can see from the discussions on this link, it is the impact that is being disagreed upon, given that we know the code is unsafe. We don’t know if someone is running Stockfish on mission-critical hardware just to pass time studying chess, or even whether the memory unsafe code is actually exploitable. Clearly there are prior beliefs that come into the picture here, which is an inherent source of uncertainty.
As an aside, the statements in this thread (and other discussions online) along the lines of “prove to us that this is possible” are not about possibility, they are about certainty; often when people say this, what they really want is proof that this will happen - they are asking you to prove that the probability of it happening is 1. By then, the damage is already done. The better choice from a probabilistic perspective is to identify when a certain adversarial thing might become possible (because that makes the utility drop suddenly due to a bump in the probability estimate) and arrange for things to happen in a way that protects against it, instead of waiting for the probability to go to 1 and realizing a larger hit. Instead of asking someone to prove that something is possible (where you’re in fact asking them to ascertain that there exists a way in which the loss can definitely be realized by someone), it is better to ask yourself - “how do I prove that this is impossible?” You need only one observation to disprove a claim of impossibility, and this highlights a fundamental asymmetry in knowledge (this is also why some smart people hide behind erroneous and vague probabilistic arguments - to avoid being seen as wrong in retrospect, since having been visibly right most of the time in the past is, for some reason, given too much weight in society). When someone asks for proof upon being presented with a possibility, it shows that they don’t understand that negating a statement of the form \(p = 0\) gives the statement \(p \neq 0\), not \(p = 1\) - another symptom of thinking that probabilities are just 0 or 1. Thinking about probabilities this way allows you to act on every piece of information you get - because incomplete knowledge is sometimes useful too.
Let’s now consider the uncertainty in probabilities assigned by people.
We start by noting that “generator” probabilities (the probabilities with which the generating process produces a particular sample path for us) and “empirical” probabilities (the ones we estimate by applying statistical methods to the results of experiments) differ. For the sake of convenience, let’s assume that these are the same - this assumption is often wrong, for reasons we will see in the next section.
The first point about horizon still holds.
Another reason is the way in which people have built their belief systems over time. We have a tendency to mode-collapse into whatever we have observed. We never observe samples from the joint distribution - only samples from the part of the distribution conditioned on the general state of the world immediately around us, both temporally and spatially. It is generally true that odds depend on the conditions you impose. For example, the odds of someone making at least a million dollars a year versus them making at least ten million dollars a year depend on what job that person does - conditioned on them being a tech employee or on them being the CEO of a Fortune 500 company, these odds are very different.
So, to cover all cases, we need a model that tells us, given any set of observations, the ratio of the conditional probabilities given those observations. This implies knowing the ratio of the joint probabilities over whatever set the queries will come from.
What worsens this is that we live on a single sample path, and finding properties of the world that are invariant under certain conditions requires a ton of samples under the same conditions, with biases cleaned up and all. For finding things that are generally true now (and for the immediate future, perhaps), we need a lot of experience from all the people living right now. Even that is not sufficient, since some things are conditionally true for the whole world only in certain regimes - to truly get to the most complete joint distribution of things by eliminating this conditional dependence on our “world-line”, we would need multiple instances of our world, all of which preserve the time-invariant laws but have different sample paths (with time in some sense acting as a common monovariant for everyone - lest nitpickers stop at this point and complain that general relativity has worked out really well, so measured time is never truly an invariant).
Finding things that are generally true requires us to have all history and the future (which is clearly not possible) for all possible sample paths (which is again not possible).
Regarding the issue of stationarity of distributions (will my estimates based on the past hold for the future too?) - everyone has a different opinion on it. It is a popular idea among the younger generations at any point in history that the wisdom of the older generations is outdated - “the world has changed”. Meanwhile, people also say history repeats itself. So what is the truth? Even if you do Bayesian updates of your beliefs (as we did a couple of sections ago), what is the learning rate (a high learning rate means you give much less weight to history than a low one does)? Do you replace everything that goes against your observations, or do you give the new evidence a very small weight, preferring historical data over the future? No one seems to have a good answer to this. Even estimating the degree of stationarity (in order to make decisions that hold in a certain regime) is difficult, because how would you find the probability (let alone the persistence of that probability) of tail events (events much closer to the ends of the distribution) with an insufficient number of sample paths (critical decision making is not Gaussian, unfortunately, and the number of samples required for any decent estimate is exponentially higher for the distributions we see in real life)? In any case, you would need to be able to handle possibly debilitating regime shifts, because you’re taking on the risk of assuming that things will remain the same for a long enough time (without worrying about the time until the next regime). And if you add a binary variable to your set of decision variables, the size of your joint distribution doubles right away - meaning you need even more sample paths in the same time period.
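One crude way to make the “learning rate” concrete is exponential forgetting: geometrically discount old evidence before each Bayesian update. A sketch for a Bernoulli parameter with a Beta prior (the forgetting factor 0.95 and the regime-shift data are my choices, not from the text):

```python
def beta_mean_after(observations, forget=1.0):
    """Posterior mean of a Beta(1, 1) prior updated on 0/1 observations,
    with old pseudo-counts discounted by `forget` before each update."""
    a, b = 1.0, 1.0  # uniform prior pseudo-counts
    for x in observations:
        a, b = forget * a + x, forget * b + (1 - x)
    return a / (a + b)

# A regime shift: 200 successes, then 50 failures.
data = [1] * 200 + [0] * 50

plain = beta_mean_after(data)                  # weighs all history equally
adaptive = beta_mean_after(data, forget=0.95)  # effective window ~20 samples
```

After the regime flips, the plain posterior mean still sits near 0.8 while the forgetting version has dropped below 0.1 - but nothing tells us which forgetting factor is “right”, which is exactly the point.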
To make updates a bit faster with the same “reliability”, people impose priors like Occam’s razor in the form “find the simplest thing that models reasonably well”. Sometimes it turns out to work, and sometimes it does not, owing to the arbitrariness of such priors.
All of this tells us that there is some uncertainty principle involved (think: something like Heisenberg’s uncertainty principle). A form of this uncertainty is acknowledged by the exploration-exploitation tradeoff (do I explore to get a better and more robust model? or do I use what I have at hand and take on risk which would potentially give decent rewards based on what I have seen?). If you don’t have the generator of the random walk, you can’t make reliable estimates of anything about that random walk. In that sense, only the generator is all-knowing, and we can only sample it, not measure it.
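The exploration-exploitation tradeoff has a standard toy model: the multi-armed bandit. A minimal epsilon-greedy sketch (the arm probabilities and epsilon are arbitrary illustrative choices of mine):

```python
import random

random.seed(0)

true_p = [0.3, 0.7]   # hidden success rates of two arms (the "generator")
counts = [0, 0]
values = [0.0, 0.0]   # our running estimates of each arm's rate
epsilon = 0.1         # fraction of pulls spent exploring

for _ in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(len(true_p))                     # explore
    else:
        arm = max(range(len(true_p)), key=lambda i: values[i])  # exploit
    reward = 1 if random.random() < true_p[arm] else 0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
```

Epsilon is the price paid to keep the model of the arms current: too little exploration and you can lock onto the wrong arm forever; too much and you keep sampling what you already know.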
In short, there are a ton of ways in which people can disagree on relative estimates of outcomes, and we must always agree that our estimates will be uncertain, and sometimes, plain wrong.
In fact, how do you even estimate the uncertainty of your probability estimates? As you go all the way down, you’ll realize that all these estimates do is give you a false sense of security, and they can be very wrong. A higher-order estimate being wrong implies that all the lower-order estimates are deeply problematic, in ways that amplify the errors of each previous order of estimation, inevitably leading to disastrous consequences at some point. In a way, we are still led to making overconfident estimates, and it is a statistical observation in every field I know of that the most disastrous consequences come from overconfident estimates far more often than from underconfident ones.
Even if we have a model that is good enough at finding relative probabilities - already limited by our uncertainty principle about extracting relative probabilities from historical data in a transient environment, and by psychological effects like tunnel vision and overconfidence - are we even sure that we are accounting for everything that can ever happen?
There is a multitude of quotes by famous people throughout history, that constantly remind us of our own limitations. Here are a few of them for the sake of really driving home the point:
The concept of black swans (high-impact, rare, unforeseen events) comes into the picture when you think about uncertainty in this manner. The most destabilizing events for the Bayesian updater are the ones that are seldom (perhaps never) seen before but have disproportionate impacts.
In the context of markets, experienced traders are very well aware of this - there is natural selection at play when people see “free” risk premiums in the market and blow up in “unforeseen events”. The ones who survive are the ones who can deal with such uncertainty - on a long enough sample path, if you don’t size your maximum risk appropriately, the probability of blowing up is nearly one. The fact that there are people who get away without doing this does not mean that they are geniuses - they were just riding their luck and happened to get out of the market before they were ruined.
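The blow-up dynamic shows up in a toy simulation: an even-money bet with a 60% win rate, staking a fixed fraction of the bankroll each time. The Kelly-optimal fraction here is \(2 \cdot 0.6 - 1 = 0.2\); all the numbers are mine, purely for illustration:

```python
import math
import random

def final_log_wealth(fraction, n_bets=2000, win_prob=0.6, seed=42):
    """Log of the final bankroll (starting at 1) after repeatedly
    staking `fraction` of it on an even-money bet."""
    rng = random.Random(seed)
    log_wealth = 0.0
    for _ in range(n_bets):
        if rng.random() < win_prob:
            log_wealth += math.log(1 + fraction)
        else:
            log_wealth += math.log(1 - fraction)
    return log_wealth

kelly = final_log_wealth(0.2)      # sized to the edge
oversized = final_log_wealth(0.9)  # same edge, reckless sizing
```

The oversized bettor has a positive expected value on every single bet, yet their log wealth drifts towards minus infinity - it is the sizing, not the edge, that ruins them.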
In adversarial conditions, sometimes it is just the tails that matter. There are industries where the winner takes all. For instance, many growth investors initially invest in a pool of companies, and discard the losers and bet more on the winners. This is explicitly a bet against mean reversion, and a bet on the tails that one of these companies will be heavily successful.
Another example: life insurance is something that people still buy. If there were no unforeseen things, the concept of insurance would never exist.
If the arguments around things being more predictable as we gain more knowledge about the future and advance technologically are correct, insurance premiums should have gone down drastically after correcting for inflation/healthcare costs/legal costs in a market where insurance premiums are dictated by demand and supply. Interestingly, this has not been the case. Insurance companies do risk estimations in a way that balances demand-supply with their expected shortfall/risk capacity/losses. However, most of the pricing is dictated solely by tail events - they would not want to go bankrupt by selling too much insurance to people at cheap prices. In fact, health insurance premiums increased abruptly after the COVID-19 pandemic and the increase is expected to be persistent. So it might be the case that insurance is more likely to be underpriced than overpriced at any point in time, in terms of the expected value (and not necessarily the probability, which is deceiving in such scenarios), and that as we get more exposure to black swans, these possibilities start getting factored into the universe of events that people use for pricing tail risk.
So what is the optimal way to deal with such risk? Clearly, traditional probabilistic thinking that assigns utilities and some margin of error completely fails here. The point is that we can never price this risk. So the other way is to find strategies that give you “optionality” in case these events happen, which you can use as an instrument against ruin in those cases (some go as far as saying that you can’t make good decisions without understanding option theory - the financial kind).
This is a much deeper (and way harder) topic to think about than the previously mentioned uncertainties. And for good reason. Almost anything that looks impressive is a tail event. There are a lot of famous examples of tail events that have impacted the way modern life works, and I leave it to the reader to inspect major moments in history to find some.
The problem is that modeling these requires an absurdly high sample complexity, in order to fit confidence intervals that statisticians like so much. This difficulty arises from the heavy tails that are often found in real life distributions that are associated with so many things that matter.
Consider the problem of finding the global mean assets per capita. Let’s say you have a sample size of a thousand people. Note that the top 0.1% of people have orders of magnitude more assets than the average person, so the presence of even one or two of them will dominate the sample mean. So intuitively you need a much larger sample size than usual (in a Gaussian world, a similar impact could only come from something like the top 0.001%, which does not dominate the variance of the sample mean on average).
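This is easy to see in simulation: compare the share of a sample total contributed by the single largest observation under a heavy-tailed and a thin-tailed distribution (the Pareto tail index of 1.1 and the Gaussian parameters are illustrative choices of mine):

```python
import random

random.seed(0)
n = 1000

# Pareto(alpha=1.1) via inverse-transform sampling: heavy-tailed "wealth".
pareto = [(1 - random.random()) ** (-1 / 1.1) for _ in range(n)]
# A thin-tailed comparison: Gaussian "heights", clipped to stay positive.
gaussian = [max(random.gauss(100, 15), 1e-9) for _ in range(n)]

def max_share(xs):
    """Fraction of the total contributed by the single largest sample."""
    return max(xs) / sum(xs)
```

In the Gaussian sample, the largest observation contributes a negligible fraction of the total; in the Pareto sample, a single observation routinely dominates the sum - which is exactly why the sample mean is so unreliable there.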
Taking this to decision-making, you require a much higher number of samples to do reliable statistical analysis. Given the fact that we live on a single sample path, this makes things exceedingly hard to estimate.
The moral of the story is that estimating the impact of extremes is really hard, and you might be either off by orders of magnitude, or worse, end up subtracting two very large and uncertain numbers (that dominate the rest of the numbers), rendering all your estimates invalid. So you fall back to theory and make decisions that allow you to unconditionally avoid these risks, at least in part.
So far, we’ve explored how insights from mathematical probability can help us make more informed decisions. But is probability the right framework for modeling reality (Einstein’s claim that God doesn’t play dice might still hold weight after all)? Human thought, after all, is rooted in what we’ve experienced or imagined, and we naturally struggle with concepts that lie beyond our comprehension. Building on the reasoning in the previous section, we should ask a broader, perhaps unanswerable question: How robust are the ways of thinking we’ve developed over time? Have we overlooked something fundamental in our models? Or, more provocatively, is the very act of modeling the right approach at all?
This is venturing into philosophy/metaphysics territory, so I will just present a bunch of crazy-sounding questions to think about:
You will naturally be prone to asking yourself such questions when you are constantly doing research at the edge of science, aspects of which we can’t seem to ever explain (for example, cosmology, particle physics and so on) - back when I was a student and in touch with some of the people working at the frontiers of these fields, I was pleasantly surprised that many people at the highest level of science think about such questions in their free time (even though it amounts to mere mental speculation, given our limitations). There are fundamental questions at the very foundation of science that we can’t answer yet, and the validity of science is taken at face value, on the merits of how well it models what we have been able to observe.
As with any other tool, it is important to remember that any mental model you use will be a tool to model something, and will only be an approximation to reality that you are comfortable with. We are inherently limited by the ideas we can ever produce. We already see logic failing to predict many things like human behavior, and people call it irrational behavior. Some food for thought - if humans are just complex systems, where does irrationality come from? And how do you decide whether the system is irrational?
Anyway, this section is just a reminder that no matter how much information you ever have (even of the future), there is at least a small possibility that things will not work out that way, no matter how much logic you apply to it. The lesson is to realize that we know practically nothing (echoing Socrates for once), and to stay humble.
As a consequence of “irrational” thinking, there are always some asymmetries and biases present in the minds of people. They can generally be explained in terms of non-probabilistic approaches and remedied to some extent by probabilistic approaches.
Beyond these basic biases, there are more subtle second-order biases - ones that arise from the interaction of our judgments with deeper assumptions about probability and risk.
Some very important consequences to think about:
To get the last sentence across, consider the following folklore story (I might be misremembering the details - I have been told that this is a true story, but also to take it with a grain of salt):
It’s January, and you get physical mail from some offshore company saying that your country’s main financial index will go up by the end of this month. You ignore this as a prank, but realize from the news on 31st January that the writer was indeed correct. You dismiss this as sheer luck and go on with your day. The next day, you get mail from the same company, saying that by the end of the month, the index will go down from its current level. You ignore the mail again, and again, the mailer turns out to be right. This happens in March, April, and so on, all the way up to September.
The company is always right! Convinced of the future prediction capabilities of the company, you write back to the address on the letter which was constantly soliciting funding for their hedge fund from the masses. You get back a letter giving you their bank details, and you wire half of your life savings to it, to which you get a prompt acknowledgment. Then you wait for next month, and your money is gone. You are greatly distressed, and while talking to your neighbor on your porch, the neighbor tells you that he got the mail in January too, but the February prediction was wrong.
Immediately you realize what the scammers had done. They had started out with the postal addresses of some ten thousand households in affluent neighborhoods. In January, they sent “up” predictions to half of those people and “down” predictions to the other half. The next month, they chose the set of people who got the right predictions, and sent “up” predictions to half of them, and “down” predictions to the other half. They did this for as long as some people took to get convinced that the scammers could predict the market.
After nine months, around twenty people would have the fortuitous sequence of correct market movements. At least one of them was bound to be fooled by this sequence of binary predictions. And there was our loser. For the price of around twenty thousand postage stamps, they would have made a couple million bucks. There would have been people who would have been convinced for less, so there were potentially a lot more victims, and the scammers probably made a fortune.
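The arithmetic of the scam is easy to check (the ten thousand households and nine months of predictions are from the story above):

```python
recipients = 10_000  # initial mailing list
letters_sent = 0

# January through September: every month, mail everyone who has seen
# only correct predictions so far; half get "up", half get "down".
for _ in range(9):
    letters_sent += recipients
    recipients //= 2  # only the correctly-predicted half remains

# recipients is now 19 and letters_sent is 19960 - about twenty
# fully convinced households for about twenty thousand stamps.
```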
The moral of the story? Don’t get your probabilities mixed up.
Okay, now that you know about this framework, how do you use it?
Let’s go over what we discussed (though I would highly recommend re-reading the above post to extract the most value from the details).
This post does not claim that necessary amounts of risk should not be taken (or that you should always be overly conservative about your estimates), as a cursory reading might seem to suggest. It advocates for a more nuanced way of thinking that lets you minimize bad decisions by virtue of being careful. To constantly update your beliefs about something, you need to explore it in the arena at some point (or at least someone else does, because second-hand experience for you is first-hand experience for someone, but it will at some point have to be you) - so to remain competitive, you should take on at least enough risk that lets you track the world, and enough risk that lets you achieve your goals (as Magnus Carlsen says, many good chess players could have been the best in the world had they been more confident). If this point is still not clear, I recommend reading this blog until the understanding settles in - it’s something that will change how you think about probabilities in the real world.
As always, feel free to let me know if you have any questions or comments!
The information provided in this blog is for educational and informational purposes only. It is intended to offer insights into the concepts of probability, uncertainty, and cognitive biases, and to stimulate thoughtful discussion. The content does not constitute financial advice, investment advice, or any form of professional advice.
All information discussed in this blog was publicly available prior to the publication of this content.
The opinions expressed in this blog are those of the author and do not necessarily reflect the views of any organization or entity. Readers should conduct their own research and consult with a qualified financial advisor before making any financial or investment decisions. The author and the blog make no representations or warranties regarding the accuracy, completeness, or reliability of the information contained herein, and assume no liability for any actions taken based on the content of this blog.
The blog does not provide any trade secrets or proprietary information, and is not intended to guide specific trading strategies or financial outcomes. All trading and investment activities involve risk, and past performance is not indicative of future results.
This is a collection of my thoughts on AI and problem-solving, which have remained relatively constant over the past few years, and some comments on recent efforts in this direction.
Recently, DeepMind claimed that they had made significant progress on the IMO Grand Challenge: their automated systems AlphaProof and AlphaGeometry 2 (a buffed-up version of the original AlphaGeometry) now perform at the IMO silver-medal level.
Having monitored this field in the past on-and-off, this set off some flags in my head, so I read through the whole post here, as well as the Zulip discussion on the same here. This post is more or less an elaboration on my comment here, as well as some of my older comments on AI/problem-solving. This is still not the entirety of what I think about AI and computer solvability of MO problems.
I had a couple of gripes with the content of their blog posts and the overly sensational headlines, so I would like to dissect their approach, provide (heavily opinionated) commentary on it, and propose directions that I feel are worth pursuing, in my capacity as a retired math Olympiad contestant and someone who has fiddled around with automated theorem proving and proof assistants like Lean and Coq in the past (while also doing cutting-edge machine learning research for quite a while now).
Problem-solving is not something that the field of AI is a stranger to. In fact, most early approaches to AI were focused on reasoning tasks, that you can read about on the Wikipedia page here. Early approaches were more or less symbolic and grounded in logic (read: the Lisp era), while computational approaches started to take over after some time. The focus also shifted from reasoning to much more manageable goals (pattern recognition and natural language processing). That brings us to the present, where LLMs are all the rage and by the mere mention of LLMs, VCs will empty their pockets to invest in your firm (okay, not that much, but you get the point).
Given how talking to ChatGPT and friends feels like talking to a real human being, people have started claiming that these models have some inherent intelligence, by some definition of “intelligence”.
Well, technically speaking, in some approximate sense, these models (which are all just transformer models as of now) simply produce a likelihood-maximizing next token given the whole context so far, where the likelihood function is learned from data.
The actual mechanism is much more complex. However, here’s a brief (and crude and not necessarily with correct notation) overview of the general understanding of how the architecture works. The explanation is skimmable - it is only for understanding what the architecture does that grants it such strong pattern-recognition abilities for the domains it is used in.
We first compute vector embeddings that map tokens (which the text is split into) into a high-dimensional space where more “collinear” vectors are supposed to be closer in meaning. Then a part of the architecture (called an attention head) looks at these embeddings and computes certain properties of them. To do this, it computes three different types of vectors - queries, keys and values: each token’s query is matched (via a dot product) against every token’s key to decide how much attention to pay to it, and the head’s output is the corresponding weighted combination of the value vectors.
Using one attention head, we are able to capture one property, and using multiple of them (i.e., a multi-head attention architecture), we end up with embeddings that concern themselves with several properties (no matter how non-interpretable). Doing this over and over again captures the interactions of these properties (among other things), and the hypothesis is that in the end, we end up with highly informative, context-aware embeddings of each token.
Now using this architecture and some initial embedding (this can also be trained, as is usually done in GPTs), you can do a lot of things, depending on what you train this architecture on, for example. One way of training is to give the output as the embedding of the next word for each prefix. And at inference, just read off the last token. For more details, the open source GPT2 repository is a good place to start.
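To make the mechanism above concrete, here is a minimal single-head scaled dot-product attention sketch in NumPy - toy dimensions and random weights, purely to show the query/key/value mechanics; real models stack many such heads and layers, with learned weights.

```python
import numpy as np

# Minimal single-head scaled dot-product attention (toy sizes, random weights).
rng = np.random.default_rng(0)

T, d_model, d_head = 4, 8, 3               # 4 tokens, toy dimensions
X = rng.normal(size=(T, d_model))          # token embeddings

W_q = rng.normal(size=(d_model, d_head))   # query projection
W_k = rng.normal(size=(d_model, d_head))   # key projection
W_v = rng.normal(size=(d_model, d_head))   # value projection

Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_head)         # how much each token attends to each other
scores += np.triu(np.full((T, T), -np.inf), k=1)   # causal mask: no peeking ahead
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax

out = weights @ V                          # context-aware embedding per token
assert out.shape == (T, d_head)
```

The first token can only attend to itself (its weight row is `[1, 0, 0, 0]`), which is the causal structure used when training on next-token prediction.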
But does this imply intelligence? A lot of people say it does (AI optimists overwhelmingly fall in this category), and a lot of people are on the other side of the debate. Speaking for myself, I tend to land on the skeptical side more often than not. (Besides, what people call intelligence differs from person to person, which is especially important to note as language evolves and the same term means different things in different contexts, with some contexts dying over time. For example, the word “hallucination” implies that LLMs can reason normally but are merely prone to mistakes - the same phenomenon shows up in the usage of words like “context”, “intelligence” and “reasoning” in the AI context, where terms are constantly modified to match what models are capable of, shifting the original goalposts in easier directions - as an example, AI now and AI in the 80s had quite different goals and quite different definitions of intelligence.)
Why? I feel that reasoning is a crucial part of intelligence. Such reasoning must come from certain hard rules embedded into a system, albeit in perhaps complex ways - if you think of what the LLM learns as a realization of some kind of incomplete probabilistic logic, then usual logic is the limit where there is perfect certainty and things are completely specified. LLMs are famously bad at counting and arithmetic, for example, and that is simply one consequence of their failure to encode basic axioms (in any of their many equivalent forms). Similarly, these models quite literally learn correlations (as seen in the desired properties of word embeddings - similarity comes from the dot product) rather than causation.
The heavy dependence of these models on data (without much emphasis on hypotheticals) may make these models able to learn “common sense” based on the data distribution they have seen, but without letting them extrapolate in out-of-sample hypothetical situations reasonably.
Another reason, which might be a bit annoying to people who think the ends justify the means, is that I believe blackbox methods should not be trusted when it comes to reasoning, especially in adversarial/complex situations where the reasoning is lengthy and highly non-linear. The internet is a pretty easy dataset for machine learning (remember what I said earlier about pattern recognition being a much easier goal than reasoning?), because what humans say on the internet is easily predictable in the average case (given the full context of the text, for example), and the average case is what matters for these tasks (which are also very well-behaved in general, i.e., thin-tailed in some sense). In such a case, blackbox methods converge pretty fast to a reasonable minimum, because we are already overparameterizing the network so much that there is very little to be gained by making the model more reasonable and precise - all that remains is to train it on a huge amount of data to ensure that the model learns most behaviour it will ever see almost completely (this has led to criticism of LLMs and other large machine learning models as data-laundering machines, and to copyright lawsuits). There was also a lot of controversy here related to concerns over test-data leakage for AlphaCode 2.
Meanwhile, take any task (research, or even academic competitions like the IMO) that involves multi-hop complex reasoning at the level of making conjectures/generalizations, and LLMs currently fail to do well there - perhaps simply because encoding this reasoning systematically in LLMs is non-trivial (and no matter how much we hypothesize about the functionality of different parts of these models, what the model learns is whatever the learning algorithm finds to be the easiest optimizer of the loss function at hand). Approaches that reduce complexity by (indirectly) embedding something similar to the early approaches to AI should be way more successful, in particular because those approaches recognized the importance of explicitly incorporating logic into systems. Modern machine learning theory also recognizes this - regularizing models using priors is standard, and people design their models in domain-specific ways too. There are also specialized approaches that design models so they can be trained under certain logical constraints, but these methods are mostly very domain-specific, regardless of how effective they may be in their domain of applicability.

All hope is not completely lost, though. Even if naive LLMs are completely naive when it comes to hard problems, we can sometimes force them to use their context as scratch paper for “solving problems” in multiple steps (thinking step-by-step, or chain of thought in AI terms, and more recently, tree of thoughts). This is at times reminiscent of Lisp (homoiconicity of the inputs and outputs in natural language) and has the potential to make models based on language quite powerful. The “ability” to “think” step-by-step is itself learned (and only approximately) from the internet.
However, by the probabilistic logic argument again, we can argue that this is not going to be more than a hack that would not work in fault-intolerant applications unless LLMs become verifiable, since current LLMs can’t reliably stop generating output when they don’t have the answer to a question (something to think about if you know about the halting problem), let alone verify their claims.
All in all, my personal opinion is that LLMs should be thought of as what they are - language models and nothing else. People who think that intelligence is merely word processing either fail to capture the “general” part of AGI by ignoring logic, math and creativity, or are deliberately being disingenuous due to conflicts of interest. Of course, languages allow you to express your ideas (and historically they have evolved to succinctly express ideas that humans find useful), so it would be a bad idea to say that language is not necessary - in fact, my claim is that we need a better representation of knowledge and reasoning, which is itself talk in the domain of language and its neighborhood. However, we should be careful to remember that current models can’t really think, no matter how much their “proficiency” in language and “reasoning” based on their input data distributions seems like real intelligence (consider approximately-censored models, for example). A generic solution to AGI does not seem in sight for now, though small steps towards it are what will eventually guide us there, if it is ever possible.
From my experience as a past math Olympiad/competitive programming contestant, I have had an informal “mechanical-ness” ordering of the typical MO subjects for quite a while - the kind anyone who has been competing for a while quickly develops.
It is the most mechanical of them all - so mechanical that there are multiple coordinate systems developed to allow contestants to “bash” their way through the problems, as it is affectionately called. This fact has been known for at least a few centuries - Cartesian coordinates were developed in 1637 by Rene Descartes, and there was an informal understanding that all 2D geometry problems can be solved using it.
A decidability-friendly formalization of geometry came in 1959 in the form of Tarski’s axioms for Euclidean geometry. The good part is that there is an algorithm (via the Tarski–Seidenberg theorem) to prove/disprove any geometrical statement that can be written in Tarski’s formulation - the bad part is that the algorithm is computationally infeasible in practice.
There have been multiple independent approaches, mostly by Olympiad geometry enthusiasts, to make this much more viable. From a contest perspective, most (non-combinatorial) geometry problems can be proposed in the following form:
Given certain points and “curves” (more specifically, algebraic varieties), with a constant number of independent constraints, prove that a certain property is true.
In most practical cases, the number of independent constraints never exceeds 10. Informally speaking, independent constraints are those constraints which can’t be naively encoded in constructions of the diagram.
So the general approach is to use any of the multiple coordinate systems (Cartesian/complex/barycentric/trilinear/tripolar), determine the equation of each object, and go through the constraints one by one to convert each into an equation. The conclusion should then follow from the equations written down, and this can usually be done via a Computer Algebra System (CAS), or by hand, if you devote enough time (and practice) to it. This sort of approach is also taken by the Kimberling Encyclopedia of Triangle Centers, to maintain a list of properties of triangle centers that have been encountered in the wild. There are ways documented in the literature that make this explicit - for example, Wu’s method, which ends up being more general than is usually needed, and whose computational complexity suffers because of that generality.
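As a tiny worked example of the coordinate-bash recipe (using sympy as the CAS, with the triangle placed at the hypothetical coordinates A=(0,0), B=(1,0), C=(p,q)): we intersect two medians symbolically and verify that the intersection is the centroid (A+B+C)/3, which by the symmetry of that formula lies on all three medians.

```python
import sympy as sp

p, q = sp.symbols('p q', real=True)
A = sp.Matrix([0, 0]); B = sp.Matrix([1, 0]); C = sp.Matrix([p, q])

def line_through(P, Q):
    # coefficients (a, b, c) of the line a*x + b*y + c = 0 through P and Q
    a = Q[1] - P[1]
    b = P[0] - Q[0]
    c = -(a * P[0] + b * P[1])
    return (a, b, c)

# medians from A and B to the opposite midpoints
l1 = line_through(A, (B + C) / 2)
l2 = line_through(B, (C + A) / 2)

# intersect the two medians
x, y = sp.symbols('x y')
sol = sp.solve([l1[0]*x + l1[1]*y + l1[2],
                l2[0]*x + l2[1]*y + l2[2]], [x, y])
G = sp.Matrix([sol[x], sol[y]])

# the intersection is exactly the centroid, for every (p, q)
assert sp.simplify(G - (A + B + C) / 3) == sp.zeros(2, 1)
```

The proof is the final polynomial identity check - no synthetic insight required, which is exactly the point.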
This approach clearly lacks beauty, and is often frowned upon (but Olympiad graders must give credit to the solution if it is correct math, no matter how begrudgingly). However, it is very effective, and I have yet to find a pure Euclidean geometry problem (i.e., without hints of combinatorics, algebra or number theory) that can’t be approached using such a method.
Recently, an alternative approach to automated geometry problem-solving has come up. Instead of mindlessly bashing the problem, the idea is to try to get to a logical synthetic proof instead (i.e., one that does not involve heavy algebraic manipulations), much closer to what one would see in a high school geometry class.
One simple way is to combine Wu’s method with basic geometrical facts (the latter is something that has always been known by MO contestants, and the former has been around since the late 1970s) - as it turns out, it comes pretty close to the much more recent AlphaGeometry when performance under a time-limit is considered (in fact, the method will always come up with a solution).
The other approach along these lines is a neuro-symbolic one, which has also been investigated for quite a while before AlphaGeometry was a thing. Multiple surveys on this general method can be found in this GitHub repository. It is noteworthy that AlphaGeometry is also such an approach, and I’ll elaborate on the approach in its own section.
There are also different things that are not exactly related to solving geometry problems, like GeoGen, a system to auto-generate Olympiad geometry problems, and OkGeometry which helps in finding properties of dynamic configurations.
The second-easiest thing to approach mechanically, after geometry, is the field of algebraic inequalities. Very early on, the main arsenal of a math Olympiad contestant consisted of basic inequalities like AM-GM, power mean, Cauchy-Schwarz, Rearrangement, Holder, Jensen, Karamata, Schur and Muirhead. Contestants were expected to apply these inequalities in creative ways to prove a given inequality, which was either the main problem or a small part of it.
However, things started changing rather quickly as inequalities became more popular. Methods like SOS (sum of squares), Mixing Variables, uvw, Buffalo Way, and so on started becoming more and more popular, mechanizing the approach to proving inequalities. I know of at least 3 different (private) pieces of software that were independently developed by people on AoPS, in order to semi-automatically solve inequalities posted on the forum. It was so well-known that there are still collections like this that collect humorous proofs of inequalities via horribly tedious calculations.
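As a one-line taste of the SOS method (checked with sympy; the inequality here is the classic a² + b² + c² ≥ ab + bc + ca, chosen for brevity): exhibiting a sum-of-squares certificate reduces the proof to a polynomial identity that a CAS can verify mechanically.

```python
import sympy as sp

a, b, c = sp.symbols('a b c', real=True)

# Claim: a^2 + b^2 + c^2 >= ab + bc + ca for all reals.
lhs = a**2 + b**2 + c**2 - (a*b + b*c + c*a)

# SOS certificate: the difference is half a sum of three squares,
# hence non-negative, and the inequality follows immediately.
sos = sp.Rational(1, 2) * ((a - b)**2 + (b - c)**2 + (c - a)**2)

assert sp.expand(lhs - sos) == 0  # the identity holds, so the proof is done
```

Finding the certificate is the only creative step; checking it is pure mechanics, which is what made these methods so easy to semi-automate.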
We are now stepping into more perilous territory. Note that we are confining ourselves to math Olympiads only - Riemann’s hypothesis, Collatz conjecture and the like (which can also incidentally be written in the language of inequalities) are off-limits.
The reason behind this is that constructions and non-obvious lemmas become way more important now, compared to what we have seen so far. The theorems and lemmas involved also come from a much larger space than one would find in geometry. For example, you might want to apply a lemma on a certain kind of infinite set, whereas in geometry, any interesting set will generally be finite. So even if the set of rules might be smaller, the set of applicable instances might be infinitely larger. This is intuitively part of the reason why there is no analogue of the decidability theorem for Tarski geometry for anything involving the natural numbers - indeed, by the undecidability of first-order arithmetic (and of Hilbert’s tenth problem), there cannot be one.
Generally, the overlap between these subjects and combinatorics is also much higher than for geometry and combinatorics, and logic starts becoming more and more non-linear (i.e., the most natural path of deduction is not a chain, but more like a graph). Non-linearity of deduction is a harder thing for LLMs to reason about than linear deduction, presumably because of the context being linear. Proofs are generally written linearly - and while LLMs can be made to work with the topologically sorted (linearized) proof, it is definitely not the most natural way of learning how to prove things.
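The linearization just mentioned is nothing but a topological sort of the proof’s dependency DAG; here is a small sketch (the lemma names are made up for illustration):

```python
from collections import deque

# A toy proof DAG: each step lists the lemmas it depends on.
# Linearizing it is what turns graph-shaped reasoning into the
# chain-shaped text an LLM actually sees.
deps = {
    "lemma_A": [],
    "lemma_B": [],
    "lemma_C": ["lemma_A", "lemma_B"],   # C combines A and B
    "theorem": ["lemma_C", "lemma_B"],
}

def linearize(deps):
    """Kahn's algorithm: emit steps once all their dependencies are emitted."""
    indeg = {s: len(d) for s, d in deps.items()}
    users = {s: [] for s in deps}
    for s, d in deps.items():
        for prereq in d:
            users[prereq].append(s)
    order, queue = [], deque(s for s, k in indeg.items() if k == 0)
    while queue:
        s = queue.popleft()
        order.append(s)
        for u in users[s]:
            indeg[u] -= 1
            if indeg[u] == 0:
                queue.append(u)
    return order

print(linearize(deps))
```

The linear order is a valid reading of the proof, but it hides the fact that the theorem draws on lemma_B directly, not just through lemma_C - precisely the structure a chain-shaped context struggles to represent.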
Combinatorics in math Olympiads has the reputation of being the most chaotic subject, and is often referred to as the dumpyard of problems that can’t be categorized elsewhere. Judging from this reputation alone, it should be the hardest subject for AI to train for.
The following are some of the reasons that make MO combinatorics a relatively high bar for AI compared to the other subjects:
In a later section, I’ll point out approaches that I feel can help AI tackle combinatorics problems, based on these insights. No guarantees of course, combinatorics is hard.
There were two versions of AlphaCode - the first one with some discussion here (funnily this also had some discussion on math olympiad geometry problems vs AI, in which I was involved, which preceded AlphaGeometry by quite a bit and predicted that geometry would be the first to be conquered using AI), and the second one with some discussion (and a lot of controversy) here.
Quoting the article: “We pre-train our model on selected public GitHub code and fine-tune it on our relatively small competitive programming dataset. At evaluation time, we create a massive amount of C++ and Python programs for each problem, orders of magnitude larger than previous work. Then we filter, cluster, and rerank those solutions to a small set of 10 candidate programs that we submit for external assessment. This automated system replaces competitors’ trial-and-error process of debugging, compiling, passing tests, and eventually submitting.”
I’ll be blunt - I don’t like this approach. Code generation is as precise a discipline as math, especially so in competitive programming. As I said earlier, combinatorics problems and informatics olympiad problems are highly correlated, so it does not make sense to expect an approach that does not use logic to perform well at competitive programming either.
It works (at the level it works) because the simple problems on Codeforces are really just to encourage people to submit something and start doing competitive programming. A problem at the level of Div2A/B is something anyone who has elementary math aptitude (read: word problems) can solve immediately, and Div2C is something that someone who has additionally seen a lot of code can solve with enough effort. And Div2A/B is the level at which their AI solves problems semi-reliably. You can expect the limited capabilities of a non-fine-tuned LLM to match this performance. In fact, if you used ChatGPT back then, solving Div2A/B was within the reach of a participant who only knew how to prompt engineer (and had some basic common sense).
There were additional concerns about test-data leakage, which is a very serious issue, considering that LLMs are basically data-laundering machines, especially with small datasets. Their submission to the hardest problem of a contest, which they cherry-picked and displayed very proudly, was also highly suspected to be due to a test-data leak - the solution matches the editorial solution of that problem too closely to be a coincidence.
This led to a lot of discussion over the data source, and there was plenty of evidence (for example, in this thread) that Google had used training data more recent than the evaluation data. This was corroborated when some random Egyptian names showed up in the boilerplate code people use for their solutions, which the AI had copied as well - compare this bot submission and this human submission (the first submission on Codeforces/GitHub, chronologically, that had this function name). It is worth noting that there are more than a thousand public Codeforces submissions made by AlphaCode (through different accounts, which some helpful users on Codeforces fortunately aggregated under another comment in the parent post of the linked thread) - there might be quite a bit to learn from these submissions, especially since the list of attempts for each problem should be a good representative of what the model does towards the end of its search. Unfortunately, the AI did not participate in any live contests, so we might never know how it performs on real out-of-sample data.
This culminated in a member of the community (who is very well-respected in the ML community too) getting on a call with Google, in which they confirmed that the data was sourced from a third-party company called remotasks, and that some of their training data was more recent than their test data. Google’s safeguard against leakage in evaluation was checking that no evaluation problem shared its description, and that no identical solution worked, for any training dataset problem. While this protects against fine-tuning test-data leakage, it does not prevent the base model (Gemini) from having been trained on the editorial.
The thing I find disingenuous about their press (more on this later) is the over-sensationalization of everything they do, and the lies involved. They vehemently denied any test-data leak, which was something they could not afford to admit, else their AI project (which was in its initial stages back then) would have been completely doomed. They were struggling to put out their own AI at the time (a necessity for such companies to build the shiny new thing everyone is crazy about), so I would not put it past them to do anything that could save their standing in the field of AI. This desperation is also visible in how they take highly optimistic quotes from top competitive programmers (who probably don’t know how these models work) and possibly remove all skeptical comments.
We will see more examples of such disingenuous (but a bit less harmful) press later too, but we’ll try to stick to the technical parts a bit more.
So, how do we fix this anyway? I would argue that, if you go down into the details, doing this is probably going to take a very different approach to generating algorithms than in the math part.
After all, half of all of AlphaCode’s submissions were basically hardcoding the first test-case, showing true dedication to test-driven development, ignoring the problem statement at hand completely.
One issue with competitive programming problems is the lack of formal definitions in literature. For example, consider a basic data structure that is very common in the competitive programming community, called a segment tree. It was formalized as recently as 2020 (public documentation). Similar documentation exists for only a very small subset of data structures used in competitive programming.
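For reference, this is the structure being discussed - a minimal iterative segment tree (point update, range-sum query), with a randomized brute-force check of the invariant any formalization would need to state:

```python
import random

class SegTree:
    """Iterative segment tree over a list `a`: point update, sum over a[l:r)."""
    def __init__(self, data):
        self.n = n = len(data)
        self.t = [0] * n + list(data)
        for i in range(n - 1, 0, -1):
            self.t[i] = self.t[2 * i] + self.t[2 * i + 1]

    def update(self, i, value):          # a[i] = value
        i += self.n
        self.t[i] = value
        while i > 1:
            i //= 2
            self.t[i] = self.t[2 * i] + self.t[2 * i + 1]

    def query(self, l, r):               # sum of a[l:r]
        res, l, r = 0, l + self.n, r + self.n
        while l < r:
            if l & 1: res += self.t[l]; l += 1
            if r & 1: r -= 1; res += self.t[r]
            l //= 2; r //= 2
        return res

# randomized cross-check against the naive recomputation
a = [random.randint(-9, 9) for _ in range(100)]
st = SegTree(a)
for _ in range(1000):
    i, v = random.randrange(100), random.randint(-9, 9)
    a[i] = v; st.update(i, v)
    l = random.randrange(100); r = random.randrange(l + 1, 101)
    assert st.query(l, r) == sum(a[l:r])
```

The correctness claim a formalization must capture is precisely the assertion in the loop: every query agrees with the naive recomputation after any sequence of updates.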
Let’s say we have defined the mathematical properties of these structures. Then finding a framework to convert these into math problems (fill an algorithm in the blank, then prove its correctness) is the major problem that needs to be solved first (and in this sense, this is harder than just solving a math Olympiad problem). Fully automated problem-solving is undecidable due to the halting problem, so we will definitely need heuristics, and we will have to be satisfied with the possibility of not getting an answer. This is probably why people don’t think about it too much and instead plug their problem into a GPT and manually tinker with the solution, or keep prompting until they get tired.
To reiterate, LLMs work well on easier problems (but may not work well on harder problems due to improper logic) because there are way too many similar problems on the internet, and syntax of the programming languages fed into the LLM during training is uniform enough that the recall of the LLM turns out to be decent for repeated problems.
So what is the AlphaGeometry approach? This is not as much of a shitshow as AlphaCode, fortunately. But it is not as eye-catching either, in the sense of having fancy models generating stuff. The paper is here.
The main ideas are the following:
The conclusion is that out of the 30 problems, the DD+AR approach with human-designed heuristics was the best one that did not involve any transformer, and they were able to solve 18 problems with it. Training the transformer gave improvements - when it was trained only on auxiliary constructions, it solved 21 problems, when only pre-trained, it solved 23 problems, and after both, it solved 25 problems.
Notably, they found that asking GPT4 for hints on auxiliary constructions did not help (it only added one extra problem to the vanilla DD+AR approach and solved 15 problems).
This table summarizes their results:

| Method | Problems solved (of 30) |
| --- | --- |
| DD + AR (vanilla) | 14 |
| DD + AR + GPT-4 construction hints | 15 |
| DD + AR + human-designed heuristics | 18 |
| AlphaGeometry, transformer trained on auxiliary constructions only | 21 |
| AlphaGeometry, transformer pre-trained only | 23 |
| AlphaGeometry (full) | 25 |
Intuitively, the transformer helps by patching up the parts of the proof that a symbolic engine can’t complete; it knows the general shape of the proofs (by learning them almost completely, of course), and is fine-tuned to tell the engine what kinds of constructions are statistically the most likely to help, by extrapolating from training data. Perhaps this solidifies the reputation of Euclidean geometry (incidentally also my strongest MO subject, and one in which I was among the strongest human contestants in the world for quite a while before I retired) as the most mechanical among all the MO subjects.
But the story does not end here. As pointed out earlier, a new paper that considered Wu’s method as a helper for AlphaGeometry claims that Wu’s method alone solves 15 problems within a time limit of 5 minutes per problem on a CPU-only laptop (which casts some doubt on the AlphaGeometry paper’s claim that Wu’s method took them 48 hours to solve only 10 problems). The same paper also shows that DD+AR with Wu’s method solves 21 problems, matching AlphaGeometry’s score of 21 when the transformer was trained only on auxiliary constructions. These include 2 problems that were not solved by AlphaGeometry, so stacking the two methods gives an AI that solves 27 problems on IMO-AG-30. Adding those 2 problems to the DD+AR-with-human-heuristics approach gives a decent score of 23 problems, which is not that far off from AlphaGeometry, and still does not use transformers or any other machine learning.
Again, the AlphaGeometry publicity just focused on the LLM part, because of course they need their LLM funds.
One thing to note, by the way, is how capturing constructions here shows that we definitely need more structured ways to reason about problems automatically, or as in my comment that I linked in the beginning, “As a natural generalization, it would greatly help the AI guys if they identify certain patterns in human problem-solving — naturally because humans have very limited working memory, so our reasoning is optimized for low context and high efficiency. Patterns like simplification/specialization, coming up with conjectures, wishful thinking and other such common patterns that one can find in books like Polya’s How to Solve It should probably be captured during training these models. […] Of course, after we have all these models, the control flow is just the traversal of the graph where these models are all nodes.”
Some people pointed out the bitter lesson to me as opposing this view, in this context. I agree with the article, even if it is sometimes just used as an excuse to be lazy and throw compute at a problem. However, I argue that the argument in that blog is consistent with my comment. The ability to reason can come from higher abstractions that we may never be able to completely understand, that arise at a scale incomprehensible to our minds, and I completely agree with that (Inter-universal Teichmüller theory anyone?). We are not baking in any knowledge. We are only ensuring that our architectures (in the most general sense of the word, regardless of whether they are classical ML models or not) are able to capture fundamental solving strategies and reduce the complexity of the search space. Quoting from the article: “[Simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries] are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.” We never overlook sensible transformer architectures, so why overlook sensible and scalable architectures that reduce our search space complexity by adding provably complete blocks (that also perform computational search, which is also scalable)? In fact, LLMs as of now in the context of reasoning perhaps violate the ideas in that article - by simply scaling breadth-wise and not depth-wise (in the sense of having all their reasoning parroted off from the internet, and not seeming capable of original, directed and precise mathematical thought).
So AlphaGeometry was decent, even though we noted that we can do just fine without transformers too, at the current stage. How do we go to solving something that is non-geometry from here?
A disclaimer: AlphaProof does not seem to have a paper with the details published yet (and as a consequence, it is not reproducible), so the legitimacy of their results is hard to assess. A lot of details are either ambiguous or straight-up missing. I understand that this might be because they want to go ahead and be the first to solve the IMO Grand Challenge (and such secrecy is common in that race), but it leaves us with a lot less to go on, and potentially robs other people of the opportunity to claim the first silver-level performance (and, as Kevin Buzzard jokes in the Zulip chat, is it really a silver-medal performance if it took the AI 3 days? And, may I add, even with the amount of compute Google could throw at the problem?).
All the cynicism aside, what follows is based on the details mentioned in their blog post here.
Turns out, the AlphaProof architecture is more or less similar to the AlphaGeometry one. There are a few major differences:
They claim that this solved 3 problems (non-geo) in 3 days (unclear whether this was per problem or total time), and the geo problem was solved by AlphaGeometry 2 in 19 seconds. Meanwhile, the two combinatorics problems are unsolved.
More on this hopefully when there is more info/a paper.
The proof search strategy generally seems pretty naive and appears to lack depth, in the face of the tremendous branching factor that proofs generally have. Working on improving the search strategy should help reduce the compute involved.
I suspect that there is a good amount of imbalance in the training data. When I was browsing some of the manually written proofs at this repo of Lean solutions, I observed one thing - most combinatorics proofs were incomplete (in the Lean code, you can check this by searching for sorry)! And perhaps for good reason - combinatorics is generally harder to fit into a formal framework than the other three subjects (for example, if someone at the IMO implicitly assumes ZFC instead of ZF, what will the graders do? And what if the formulations need many more definitions than in the other three subjects?).
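For illustration, here is what such an incomplete formalization looks like in Lean (a made-up toy statement, not from the repo):

```lean
-- Toy pigeonhole-flavoured statement, deliberately left unproved.
-- `sorry` closes any goal without a proof, so the file compiles
-- (with a warning) - which is exactly what makes incomplete
-- solutions easy to grep for.
theorem exists_equal_pair (f : ℕ → Bool) :
    ∃ m n : ℕ, m ≠ n ∧ f m = f n := by
  sorry
```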
Maybe we could say the same for combinatorics problems, considering that even when the model ran RL training loops (during inference) on variations of the combinatorics problems, they still remained unsolved (despite P5 being relatively easy)?
Even if these problems are auto-formalized well, there is a high chance that the language model learns combinatorics problems pretty slowly, or is simply unable to solve problems (thereby not affecting the relevant weights in a meaningful way).
One way to fix this is formalizing combinatorics solutions by hand, but it would be a gigantic effort, not worth the manpower. This brings us to the other solution.
One (perhaps slightly far-fetched) reason can be that Lean might not be the perfect language for combinatorics, at the level of complexity we have. One possibility is that the model is simply unable to learn the structure, and that it requires the right level of abstraction to think at. In this respect, there has been research on proof assistants (like Cur, GitHub) that are somewhat Lispy in nature - they allow metaprogramming and building up to a domain-specific language rather easily. With the right abstractions in use, it might be feasible for the model to learn more high-level ideas (of course, a different part of the solver might be needed to lower the abstraction back down). This follows from common sense - if you have a large hierarchical database of theorems, then a complex problem might just be a couple of high-level theorems applied in a certain manner. Similarly, it frees us from low-level constraints, which are in some ways an extra imposition as bad as baking our own knowledge into the system.
Another reason might be, as I pointed out earlier, that our proof search mechanism is very naive and probably can't constrain the search space enough to prove theorems on its own. In AlphaGeometry, the novel idea was to fine-tune a model on the construction parts of the proofs. However, this does not seem to be the case for AlphaProof, and it does not seem very feasible here either.
The lack of context is also a significant issue - proofs can become arbitrarily deep (so we should probably use an architecture with long enough context windows, or a hybrid model that captures long-term implications via elimination rules directly instead of relying on the model). Word embeddings also do not carry the intended meaning as well as they do in natural language, and the tokens need to be worked on more carefully as well.
In my comment, I proposed different model nodes (which can be arranged in an arbitrary digraph-like architecture) corresponding to different basic building blocks of a potential search strategy. These act as helper nodes to the architecture, helping it find exogenous terms in proofs (and increasing the visibility of proofs that contain such terms), both during inference and during adversarial training:
Now that we have a description of all these nodes, all that remains is the information flow between them. Even the structure of this graph is not 100% clear. One possible model of information flow is a graph of the kind still visible in many approaches these days - KANs and liquid time-constant networks, to name a few - where the worker nodes each produce outputs to be consumed by the other nodes, the most important computation lives in the nodes themselves, and the transition probabilities come from some sort of learned policy model.
This list of nodes is nowhere near exhaustive, but for most combinatorics problems, this sort of approach generally works, for a human mind at least. Just as in the attention mechanism, it might be possible that these models learn completely different things than what we are expecting. But giving it extra degrees of freedom will always give the model enough breathing space to do more than it was designed for.
Of course, there must be ways to encode these nodes directly in a reinforcement learning setting - I hope there is more research on such results. The field of meta reinforcement learning comes to mind, and from my experience with graph learning, I am pretty sure that such approaches are worth trying.
Alluding to the earlier comment that hard Olympiad combinatorics is closer to research than the other three subjects, putting effort in this direction should help AI improve on much more generic benchmarks, even if the right unifying framework towards AGI does not seem obvious at the moment. In some sense, combinatorics is the thing to beat.
The other day I had a discussion with someone about how to implement FFT-based convolution/polynomial multiplication - they were having a hard time squeezing their library implementation into the time limit on this problem, and it soon turned into a discussion on how to optimize it as much as possible. It turned out that the bit-reversing part of their iterative implementation was taking a pretty large amount of time, so I suggested not using bit-reversal at all, as is done in a few libraries. Since not a lot of people turned out to be familiar with it, I decided to write a post on some ways of implementing FFT and deriving these ways from one another.
I won’t be going into implementing FFT using SIMD since the constant war (pun intended) of constant optimization over at this problem will probably teach you much more than what I can fit in a tutorial, and it deserves a post on its own anyway. Rather, this blog post will look at how you can think about FFT in ways that you might not have thought about before. Whenever it makes sense, I will link to some nice tutorials on FFT that could be helpful in understanding the ideas in greater detail.
Let’s start out with a crash course on what FFT aims to do (for a more detailed introduction, you might want to read this well-written tutorial).
Essentially, you have a sequence of numbers \(a_0, a_1, \dots, a_{n-1}\) and you make a polynomial \(P(x) = \sum_{i=0}^{n-1} a_i x^i\) out of it.
Let \(\omega = e^{2 \pi i / n}\) be a primitive \(n\)-th root of unity. Then FFT aims to map the sequence \(a_0, a_1, \dots, a_{n-1}\) to \(P(\omega^0), P(\omega^1), \dots, P(\omega^{n-1})\). Just to be clear, this mapping is called the discrete Fourier transform (DFT) of order \(n\), but an algorithm to do this efficiently is called the fast Fourier transform (FFT). The inverse of this transform is called the inverse DFT (IDFT), and the corresponding fast algorithm is called inverse FFT (IFFT).
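Written out directly from the definition, the DFT is just \(n\) polynomial evaluations - a naive \(O(n^2)\) reference version (a sketch for illustration; the rest of the post is about beating this complexity):

```python
import cmath

def dft_naive(a):
    # evaluate P(x) = sum a_j x^j at omega^0, ..., omega^{n-1}; O(n^2)
    n = len(a)
    omega = cmath.exp(2j * cmath.pi / n)
    return [sum(a[j] * omega ** (i * j) for j in range(n)) for i in range(n)]
```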
Now there are a couple of points that need to be clarified here: why do we evaluate at powers of a root of unity in particular, and why does the inverse transform exist at all?
For the first of these questions, we will take motivation from the polynomial multiplication perspective; there are many other reasons to use this transform across fields (and there are mathematical/philosophical reasons for a lot of this).
Note that for a polynomial of degree \(n\), its values at \(n + 1\) distinct points determine it completely. So having implementations of FFT and IFFT on hand allows us to run the following algorithm for multiplying two polynomials \(A(x) = \sum_{i=0}^n a_i x^i\) and \(B(x) = \sum_{i=0}^m b_i x^i\) of degrees \(n\) and \(m\): evaluate both at \(n + m + 1\) common points via FFT, multiply the values pointwise, and interpolate the product via IFFT.
Essentially, we are evaluating \(A\) and \(B\) at \(n + m + 1\) points, getting the values of their product, and interpolating \(C(x) = A(x)B(x)\) from its values on \(n + m + 1\) distinct points.
Why specifically are we evaluating the polynomials on the first few consecutive powers of a root of unity, and not just something simple like the first few positive integers? As you will find out later on, using these numbers will help us in making FFT efficient, as opposed to the naive quadratic polynomial multiplication algorithm that they teach in high school.
If we run the above algorithm on two polynomials of degree \(n - 1\) and do a DFT of order \(n\) without extending any of the polynomials to the right by zeros, the resulting interpolated polynomial \(C\) will be the remainder of dividing \(A(x)B(x)\) by \(x^n - 1\). This holds only with our choice of evaluation points. In general, if the evaluation points are \(x_i\) instead of \(\omega^i\), then the result will be the remainder of dividing \(A(x)B(x)\) by \(\prod_{i=0}^{n-1} (x-x_i)\). The proof is easy and left as an exercise to the reader.
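The claim can be sanity-checked without any FFT machinery: reducing \(A(x)B(x)\) modulo \(x^n - 1\) just folds coefficient \(k\) of the full product into position \(k \bmod n\), i.e., it is the cyclic convolution of the coefficient arrays. A minimal sketch:

```python
def cyclic_convolution(a, b):
    # coefficients of A(x) * B(x) mod x^n - 1, where n = len(a) = len(b)
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            out[(i + j) % n] += a[i] * b[j]
    return out
```

For example, \((1 + x)^2 = 1 + 2x + x^2 \equiv 2 + 2x \pmod{x^2 - 1}\), and indeed `cyclic_convolution([1, 1], [1, 1])` gives `[2, 2]`.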
And the remaining question: why does an inverse exist in the first place? That is due to the fact that all these powers of the root of unity we use are distinct, and the values of a polynomial \(P\) on \(\deg(P) + 1\) distinct points determine \(P\) uniquely. For a more mathematical proof, think of each equation \(P(x_i) = y_i\) as a linear equation in the coefficients of \(P\). Then we have \(\deg(P) + 1\) equations in \(\deg(P) + 1\) variables. So we only need to show that the value of the determinant of the following matrix is non-zero for distinct \(x_i\)-s:
\[\begin{bmatrix} 1 & x_0 & x_0^2 & \dots & x_0^n\\ 1 & x_1 & x_1^2 & \dots & x_1^n\\ 1 & x_2 & x_2^2 & \dots & x_2^n\\ \vdots & \vdots & \vdots & \ddots &\vdots \\ 1 & x_n & x_n^2 & \dots & x_n^n \end{bmatrix}\]
This is called a Vandermonde matrix and its determinant is given by the product of \((x_i - x_j)\) over all possible distinct \(i, j\) (proof in the link). Since all \(x_i\)-s are distinct, the determinant of this matrix is non-zero.
Now that we have the basics out of the way, we need to figure out how to apply the DFT of order \(n\) to a polynomial. For polynomial multiplication specifically, we don’t really need a DFT of arbitrary order - we can make do with an order that is a power of 2 and is at least \(n + m + 1\) - let’s say \(2^k\). This works out because even though our system of equations (in the coefficients of the polynomial) is over-constrained, the result of polynomial multiplication is a valid solution to it, so we still have exactly one solution. Another way of thinking about it: treat the interpolated polynomial as having degree \(2^k - 1\); then \(C(x) - A(x) B(x)\) is \(0\) at \(2^k\) points, so it is identically zero.
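For concreteness, the padded size can be computed like this (a tiny helper with names of my own choosing):

```python
def fft_size(deg_a, deg_b):
    # smallest power of two that can hold deg_a + deg_b + 1 coefficients
    need = deg_a + deg_b + 1
    return 1 << (need - 1).bit_length()
```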
Our implementation will be a function that takes an array \(A\) of length \(2^k\) and returns the DFT of order \(2^k\) when applied to it.
def fft(A, k):
    # assert len(A) == 2 ** k
    if k == 0:
        return A
    else:
        pass
The base case is trivial. We now look at what happens when \(k \ge 1\).
Let’s write \(A(x) = A_0(x^2) + x A_1(x^2)\) where \(A_0(y) = a_0 + a_2 y + \dots + a_{2^k-2} y^{2^{k-1}-1}\) and \(A_1(y) = a_1 + a_3 y + \dots + a_{2^k - 1} y^{2^{k-1}-1}\). Note that the \(2^{k-1}\)-th root of unity is the square of the \(2^k\)-th root of unity.
Let \(A_0’ = \mathtt{fft}(A_0, k-1)\) and \(A_1’ = \mathtt{fft}(A_1, k-1)\). We need to compute \(A’ = \mathtt{fft}(A, k)\).
Note that \(A’[i] = A(\omega^i) = A_0((\omega^2)^i) + \omega^i A_1((\omega^2)^i)\). If \(i < 2^{k-1}\), this gives \(A’[i] = A_0’[i] + \omega^i A_1’[i]\). For \(i \ge 2^{k-1}\), note that \(\omega^{i} = -\omega^{i - 2^{k-1}}\), so \(A’[i] = A_0’[i - 2^{k-1}] - \omega^{i - 2^{k-1}} A_1’[i - 2^{k-1}]\).
And that’s it - we have all the steps necessary for implementing FFT.
import cmath

def fft_internal(A, k, omega):
    # assert len(A) == 2 ** k
    if k != 0:
        half_len = len(A) // 2
        f_A_0, f_A_1 = fft_internal(A[0::2], k - 1, omega * omega), fft_internal(A[1::2], k - 1, omega * omega)
        power = 1
        for i in range(half_len):
            A[i] = f_A_0[i] + power * f_A_1[i]
            A[i + half_len] = f_A_0[i] - power * f_A_1[i]
            power *= omega
    return A

def fft(A):
    k = len(A).bit_length() - 1
    assert len(A) == (1 << k)
    omega = cmath.exp(2 * cmath.pi * 1j / len(A))
    return fft_internal(A, k, omega)
This implementation is not numerically stable as-is - if you apply it to any reasonably large array, you will find noticeable errors. There are ways to improve this, like precomputing all powers of the highest-order \(\omega\) up front using \(\exp(i \theta) = \cos \theta + i \sin \theta\), instead of accumulating them on the fly. Also, in competitive programming you would generally avoid floating point numbers altogether. You would rather work in a field modulo a prime that admits a \(2^k\)-th root of unity for a large \(k\), for instance \(998244353\) (in such cases, FFT is called NTT - number theoretic transform), and for computing polynomial products without any modulus, use a couple of such NTT-friendly primes and invoke the Chinese remainder theorem, keeping in mind bounds on the coefficients of the resulting polynomial.
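To make the NTT remark concrete, here is a minimal sketch of the same recursive algorithm over the field modulo \(998244353\) (which is \(119 \cdot 2^{23} + 1\) and has \(3\) as a primitive root, so \(3^{(p-1)/n} \bmod p\) is an \(n\)-th root of unity for any power of two \(n \le 2^{23}\)). This is my own illustrative version, not code from any particular library:

```python
MOD = 998244353  # 119 * 2^23 + 1, primitive root 3

def ntt(A, root):
    # same recursion as fft_internal above, but over integers mod MOD
    n = len(A)
    if n == 1:
        return A[:]
    even = ntt(A[0::2], root * root % MOD)
    odd = ntt(A[1::2], root * root % MOD)
    out = [0] * n
    power = 1
    for i in range(n // 2):
        t = power * odd[i] % MOD
        out[i] = (even[i] + t) % MOD
        out[i + n // 2] = (even[i] - t) % MOD
        power = power * root % MOD
    return out
```

The inverse works exactly as in the complex case: run the transform with the inverse root `pow(root, MOD - 2, MOD)` and multiply every entry by the modular inverse of the length.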
We will also not focus too much on code-golfing or implementation efficiency that comes from saving a couple of arithmetic operations here and there - it should be fairly simple to do these optimizations.
How do we compute the IFFT? The most straightforward way is to invert all the operations that we used in FFT - this leads to roughly the same amount of extra code, though.
Another way is to use the properties of \(\omega\).
Let’s try to apply FFT to the array \(P(\omega^0), \dots, P(\omega^{n-1})\), except that this time, in the FFT computation, we use \(\omega^{-1}\) instead of \(\omega\). The \(i\)-th value in the resulting sequence would be \(\sum_{j=0}^{n-1} P(\omega^j) \omega^{-ij}\). Note that this is a linear combination of the \(a_k\)-s. Let’s look at the coefficient of \(a_k\). It is \(\sum_{j=0}^{n-1} \omega^{j(k-i)}\). If \(k = i\), this simplifies to \(n\). Otherwise, \(\omega^{k-i}\) is an \(n\)-th root of unity that is not \(1\), since \((\omega^{k-i})^n = (\omega^n)^{k-i} = 1^{k-i} = 1\). So it satisfies \(x^n - 1 = 0\), and since \(x \ne 1\), we must have \(1 + x + \dots + x^{n-1} = 0\). In other words, we have \(\sum_{j = 0}^{n - 1} \omega^{j(k - i)} = 0\).
In other words, the resulting transformation maps \(P(\omega^0), \dots, P(\omega^{n-1})\) to \((na_0, \dots, na_{n-1})\). We are almost there - if we divide the resulting array by \(n\), then we would be done. In other words, the following is an implementation of IFFT:
def ifft(A):
    k = len(A).bit_length() - 1
    assert len(A) == (1 << k)
    omega = cmath.exp(-2 * cmath.pi * 1j / len(A))
    return [x / len(A) for x in fft_internal(A, k, omega)]
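The derivation above can be checked directly on a tiny example, independent of any of the implementations in this post:

```python
import cmath

# build the DFT of a = [1, 2, 3, 4] straight from the definition, then apply
# the same sum with omega^{-1} and divide by n; we should recover a
a = [1, 2, 3, 4]
n = len(a)
w = cmath.exp(2j * cmath.pi / n)
forward = [sum(a[j] * w ** (i * j) for j in range(n)) for i in range(n)]
back = [sum(forward[j] * w ** (-i * j) for j in range(n)) / n for i in range(n)]
assert all(abs(back[i] - a[i]) < 1e-9 for i in range(n))
```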
Let’s look at the time and space complexity of this algorithm. The recurrence is \(T(k) = O(2^k) + 2 T(k - 1)\), which solves to \(T(k) = O(k \cdot 2^k)\). In terms of \(n = 2^k\), the length of the array, the complexity is \(O(n \log n)\).
So our polynomial multiplication algorithm beats the naive \(O(n^2)\) multiplication in terms of asymptotic complexity.
What about the space complexity? We get \(S(k) = O(2^k) + S(k - 1)\) (not \(2 S(k - 1)\) because the first call can free up its memory before we go ahead with the second call), and this gives \(S(k) = O(2^k)\), and in terms of the length of the array, the space complexity is \(O(n)\).
Making it iterative can be done in multiple ways - the most common way is to do so via bit-reversal. I recommend reading this section to understand how to perform bit-reversal and modify FFT accordingly. This has the added benefit that the algorithm becomes in-place (i.e., no memory allocations at all). In the next section, we’ll look at how to get to an iterative FFT variant that does not require us to use bit-reversal.
The most naive way is to try and unroll the recursion by looking at the call stack. Doing this while ensuring that the indices are convenient enough to write a simple iterative algorithm corresponds to the bit-reversal algorithm.
However, we know that FFT with \(\omega\) is almost the same as IFFT with \(\omega^{-1}\) (up to the division by the length). And we have another way of writing IFFT that we mentioned earlier, which was to just invert the operations.
By inverting the operations, we would have a pre-process step where we invert the last loop, and then 2 calls to invert the FFTs we did on the two halves. So, IFFT looks like this:
import cmath

def ifft_internal_modifying(A, omega, start, stride):
    if stride == len(A):
        return
    # we will be modifying A[start::stride], since at any point, the array
    # passed to the fft_internal call is of this form
    half_len = len(A) // (2 * stride)
    # inverting the last loop
    power = 1
    l = [0] * (2 * half_len)
    for i in range(half_len):
        # corresponding element of original is A'[i] = A[start + stride * i]
        # we need to restore the elements at A'[2 * i] and A'[2 * i + 1]
        # using elements at A'[i] and A'[i + half_len]
        l[2 * i] = (A[start + stride * i] + A[start + stride * (i + half_len)]) / 2
        l[2 * i + 1] = (A[start + stride * i] - A[start + stride * (i + half_len)]) / (2 * power)
        power *= omega
    for i in range(half_len):
        A[start + stride * (2 * i)] = l[2 * i]
        A[start + stride * (2 * i + 1)] = l[2 * i + 1]
    omega2 = omega * omega
    ifft_internal_modifying(A, omega2, start, stride * 2)
    ifft_internal_modifying(A, omega2, start + stride, stride * 2)

def ifft(A):
    A = A.copy()
    k = len(A).bit_length() - 1
    assert len(A) == (1 << k)
    omega = cmath.exp(2 * cmath.pi * 1j / len(A))
    ifft_internal_modifying(A, omega, 0, 1)
    return A
Now note that this is a tail-recursive algorithm. Also note that the positions changed by the algorithm in the two recursive calls are completely disjoint, so we can convert it to an iterative algorithm naively - consider all the operations at a given depth and do them in a single loop.
So the final algorithm looks like this:
import cmath

def ifft_iterative_modifying(A, omega):
    double_stride = 1
    b = A.copy()
    while double_stride < len(A):
        stride = double_stride
        double_stride *= 2
        half_len = len(A) // double_stride
        power = 1
        for i in range(half_len):
            for start in range(stride):
                b[start + stride * (2 * i)] = (A[start + stride * i] + A[start + stride * (i + half_len)]) / 2
                b[start + stride * (2 * i + 1)] = (A[start + stride * i] - A[start + stride * (i + half_len)]) / (2 * power)
            power *= omega
        A, b = b, A
        omega = omega * omega
    return A

def ifft(A):
    A = A.copy()
    k = len(A).bit_length() - 1
    assert len(A) == (1 << k)
    omega = cmath.exp(2 * cmath.pi * 1j / len(A))
    return ifft_iterative_modifying(A, omega)
Now how do we get from IFFT to FFT? Simple - multiply by \(2^k\) and replace \(\omega\) by \(\omega^{-1}\). By accounting for multiplication by \(2^k\) in each step, we get the following implementation of FFT:
import cmath

def fft_iterative_modifying(A, omega):
    double_stride = 1
    b = A.copy()
    while double_stride < len(A):
        stride = double_stride
        double_stride *= 2
        half_len = len(A) // double_stride
        power = 1
        for i in range(half_len):
            for start in range(stride):
                b[start + stride * (2 * i)] = (A[start + stride * i] + A[start + stride * (i + half_len)])
                b[start + stride * (2 * i + 1)] = (A[start + stride * i] - A[start + stride * (i + half_len)]) * power
            power *= omega
        A, b = b, A
        omega = omega * omega
    return A

def fft(A):
    A = A.copy()
    k = len(A).bit_length() - 1
    assert len(A) == (1 << k)
    omega = cmath.exp(2 * cmath.pi * 1j / len(A))
    return fft_iterative_modifying(A, omega)
The most common way of making FFT in-place (avoiding extra memory allocation) is to introduce bit-reversal. So, most polynomial multiplication implementations end up doing bit reversal during the algorithm.
However, there is a way to do in-place polynomial multiplication without using bit-reversals or memory allocations, which I’ll explain below.
Consider the bit-reversal algorithm linked above. Here’s an implementation for it:
import cmath

def fft_iterative_modifying_with_bit_reversed_input(A, omega):
    double_stride = 1
    while double_stride < len(A):
        stride = double_stride
        double_stride *= 2
        half_len = len(A) // double_stride
        omega2 = omega ** half_len
        for i in range(0, len(A), double_stride):
            power = 1
            for start in range(stride):
                pos_x, pos_y = i + start, i + start + stride
                A[pos_x], A[pos_y] = A[pos_x] + power * A[pos_y], A[pos_x] - power * A[pos_y]
                power *= omega2

def fft_with_bit_reversed_input(A):
    A = A.copy()
    k = len(A).bit_length() - 1
    assert len(A) == (1 << k)
    omega = cmath.exp(2 * cmath.pi * 1j / len(A))
    fft_iterative_modifying_with_bit_reversed_input(A, omega)
    return A
To get IFFT with bit-reversed output, we will just reverse the order of all operations (and invert them), to get something as follows:
import cmath

def ifft_iterative_modifying_with_bit_reversed_output(A, omega):
    stride = len(A)
    while stride > 1:
        double_stride = stride
        stride //= 2
        half_len = len(A) // double_stride
        omega2 = omega ** half_len
        for i in range(0, len(A), double_stride):
            power = 1
            for start in range(stride):
                pos_x, pos_y = i + start, i + start + stride
                A[pos_x], A[pos_y] = (A[pos_x] + A[pos_y]) / 2, (A[pos_x] - A[pos_y]) / (2 * power)
                power *= omega2

def ifft_with_bit_reversed_output(A):
    A = A.copy()
    k = len(A).bit_length() - 1
    assert len(A) == (1 << k)
    omega = cmath.exp(2 * cmath.pi * 1j / len(A))
    ifft_iterative_modifying_with_bit_reversed_output(A, omega)
    return A
To get FFT with bit-reversed output, we will multiply by \(2^k\) and replace power by its inverse, to get
import cmath

def fft_iterative_modifying_with_bit_reversed_output(A, omega):
    stride = len(A)
    while stride > 1:
        double_stride = stride
        stride //= 2
        half_len = len(A) // double_stride
        omega2 = omega ** half_len
        for i in range(0, len(A), double_stride):
            power = 1
            for start in range(stride):
                pos_x, pos_y = i + start, i + start + stride
                A[pos_x], A[pos_y] = A[pos_x] + A[pos_y], (A[pos_x] - A[pos_y]) * power
                power *= omega2

def fft_with_bit_reversed_output(A):
    A = A.copy()
    k = len(A).bit_length() - 1
    assert len(A) == (1 << k)
    omega = cmath.exp(2 * cmath.pi * 1j / len(A))
    fft_iterative_modifying_with_bit_reversed_output(A, omega)
    return A
So our plan is the following: run FFT with bit-reversed output on both inputs, multiply the values pointwise (pointwise products don’t care about the order of the points), and then run IFFT with bit-reversed input so that the result comes out in normal order.
The only thing remaining is IFFT with bit-reversed input, but that is trivial - we do FFT with bit-reversed input with \(\omega\) replaced by \(1/\omega\) and divide throughout by the length of the array.
All in all, our implementation becomes
import cmath

def fft_iterative_modifying_with_bit_reversed_output(A, omega):
    stride = len(A)
    while stride > 1:
        double_stride = stride
        stride //= 2
        half_len = len(A) // double_stride
        omega2 = omega ** half_len
        for i in range(0, len(A), double_stride):
            power = 1
            for start in range(stride):
                pos_x, pos_y = i + start, i + start + stride
                A[pos_x], A[pos_y] = A[pos_x] + A[pos_y], (A[pos_x] - A[pos_y]) * power
                power *= omega2

def fft_iterative_modifying_with_bit_reversed_input(A, omega):
    double_stride = 1
    while double_stride < len(A):
        stride = double_stride
        double_stride *= 2
        half_len = len(A) // double_stride
        omega2 = omega ** half_len
        for i in range(0, len(A), double_stride):
            power = 1
            for start in range(stride):
                pos_x, pos_y = i + start, i + start + stride
                A[pos_x], A[pos_y] = A[pos_x] + power * A[pos_y], A[pos_x] - power * A[pos_y]
                power *= omega2

def multiply(A, B):
    if not A or not B:
        return []
    final_len = len(A) + len(B) - 1
    result_len = 1 << (final_len - 1).bit_length()
    A = A + [0] * (result_len - len(A))
    B = B + [0] * (result_len - len(B))
    k = len(A).bit_length() - 1
    assert len(A) == (1 << k)
    omega = cmath.exp(2 * cmath.pi * 1j / len(A))
    fft_iterative_modifying_with_bit_reversed_output(A, omega)
    fft_iterative_modifying_with_bit_reversed_output(B, omega)
    norm = 1. / len(A)
    for i in range(len(A)):
        A[i] *= B[i] * norm
    fft_iterative_modifying_with_bit_reversed_input(A, 1. / omega)
    return A[:final_len]
Note that this implementation returns a list of complex numbers, and this is intended - nowhere in the implementation did we assume that the inputs were real numbers.
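If you do know the inputs are integers, a small helper (my own naming) can round off the floating point noise in the result; this is safe as long as the accumulated error stays below \(0.5\) per coefficient:

```python
def to_int_coeffs(C):
    # round a complex coefficient list back to integers
    return [round(c.real) for c in C]
```

For instance, `to_int_coeffs(multiply([1, 2], [3, 4]))` should give `[3, 10, 8]`.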
The initial FFT algorithm that we talked about is called the Cooley-Tukey algorithm, and the bit-reversal-based algorithm is the iterative implementation of the Cooley-Tukey algorithm.
I am not aware of any reference for the buffer algorithm for FFT (and consequently for IFFT). The only reference I could find for any similar algorithm is this comment by bicsi; however, that algorithm seems pretty different. I am not completely sure, but there should be a way to derive it from some of the other algorithms, though the derivation mentioned in the comment is quite different. The idea behind it is that the body of the innermost loop is what is called a butterfly transform, and if you look at it from a matrix algebra perspective (for instance, as in this tutorial), this transform corresponds to multiplication by a matrix.
Also, this idea of implementing bit-reversal-free convolution is not novel. For instance, it appears in the AtCoder Library implementation. They take this one step further - in this blog post, we only discussed a radix-2 implementation. However, it is possible (and better for a computer implementation) to use a radix-4 implementation, where the butterfly transform has 4 inputs and 4 outputs instead of 2 each, falling back to a radix 2 implementation for the final iteration in case the length of the array is a power of 2 with an odd exponent.
Talking about the approach of the blog post - note that we used the duality between FFT and IFFT as well as their involution property pretty freely. In this sense, decimation in time (DIT) and decimation in frequency (DIF) algorithms are dual to one another. This answer does a good job of explaining what they precisely do. Another nice fact is that DIT algorithms need bit reversal in the beginning, while the DIF algorithms need bit reversal in the end. This is why we were able to use DIF first and then DIT in our algorithm to avoid bit-reversal completely.
Cooley-Tukey is a DIT algorithm, while the algorithm described here is a DIF approach. See if you can figure out why. Conversely, inverting the operations in Cooley-Tukey gives a DIF algorithm, and doing the same for the linked algorithm gives a DIT algorithm.
Note that in the above discussion, we only considered polynomial multiplication modulo \(x^n - 1\). Using cyclic convolution modulo \(x^n - i\) helps avoid the zero padding that is sometimes required, at the cost of not being directly applicable to NTT. The comment under that blog post also presents another optimization over naive convolution.
We also assumed that all our DFTs will be of order being a power of 2. However, this is sometimes not desirable for certain problems. For example, problem F of this contest requires you to compute a DFT of an arbitrary length. In order to do that, you need the Chirp z-transform.
For more content on FFT, I recommend going through the Codeforces Catalog - it contains some great content on the topic.
Finally, I’d like to thank pajenegod and adamant for productive discussions on this topic. Please feel free to let me know if you have any comments!
I first got to know about the existence of RSS feeds back when I was in middle school, but didn’t figure out their appeal and promptly forgot about them.
Fast-forward to around a year ago - I was starting to realize that I was reading way too many interesting tech blogs, so I set out to find a way to aggregate all these updates together. A quick search later, it seemed like RSS feeds were the perfect fit for the job. In fact, it also ended up helping me keep up with research in some niche fields that I was following on arxiv!
So, in terms of link/notification aggregation, RSS feeds have been a life-saver for me.
If this has made you even the slightest bit curious, I would urge you to try it out for yourself.
And it’s pretty simple too - on a phone, you just need to install an RSS reader app (the one I use is Feeder) and add the feeds you want to follow.
If you’re on a desktop machine or a laptop, I recommend using Thunderbird (I have not used any other RSS feed reader since it worked out pretty well). You can set up RSS feeds using the instructions here.
Now that you have set up your RSS feed apps, you can start searching for other blogs to subscribe to - you’ll find that a lot of blogs have this feature. This includes a lot of news websites, both traditional and for niche topics. You can find some of these here (note that this is not the same Feeder as mentioned above).
Some other interesting trivia:
This post was originally written on Codeforces; relevant discussion can be found here.
Use the following template (C++20) for efficient and near-optimal binary search (in terms of number of queries) on floating point numbers.
#include <bit>
#include <climits>
#include <cmath>
#include <concepts>
#include <cstdint>
#include <limits>
#include <numeric>
#include <type_traits>

template <std::size_t N_BITS>
using int_least_t = std::conditional_t<
    N_BITS <= 8, std::uint8_t,
    std::conditional_t<
        N_BITS <= 16, std::uint16_t,
        std::conditional_t<
            N_BITS <= 32, std::uint32_t,
            std::conditional_t<
                N_BITS <= 64, std::uint64_t,
                std::conditional_t<N_BITS <= 128, __uint128_t, void>>>>>;

// this should work for float and doubles, but for long doubles, std::bit_cast
// will fail on most systems due to them being 80 bits wide.
// to handle this, consider using doubles instead, or std::bit_cast the long
// double to an 80-bit bitset and convert it to a 128 bit integer using to_ullong.

/*
 * returns first x in [a, b] such that predicate(x) is false, conditioned on
 * logical_predicate(a) && !logical_predicate(b) && logical_predicate(-inf) &&
 * !logical_predicate(inf)
 * here logical_predicate is the mathematical value of the predicate, not the
 * machine value of the predicate
 * it is guaranteed that non-nan, non-inf inputs are passed into the predicate
 * if NaNs or infinities are passed to this function as argument, then the
 * inputs to the predicate will start from smallest/largest representable
 * floating point numbers of the input type - this can be a source of errors
 * if you multiply the input by something > 1 for example
 * strictly speaking, the predicate should also be perfectly monotonic, but if
 * it gives out-of-order booleans in some small range [a, a + eps] (and the
 * correct order elsewhere), then the answer will be somewhere in between
 * the same holds for how denormals are handled by this code
 */
template <bool check_infinities = false,
          bool distinguish_plus_minus_zero = false,
          bool deal_with_nans_and_infs = false, std::floating_point T>
T partition_point_fp(T a, T b, auto&& predicate) {
    static constexpr std::size_t T_WIDTH = sizeof(T) * CHAR_BIT;
    using Int = int_least_t<T_WIDTH>;
    static constexpr auto is_negative = [](T x) {
        return static_cast<bool>((std::bit_cast<Int>(x) >> (T_WIDTH - 1)) & 1);
    };
    if constexpr (distinguish_plus_minus_zero) {
        if (a == T(0.0) && b == T(0.0) && is_negative(a) && !is_negative(b)) {
            if (!predicate(-T(0.0))) {
                return -T(0.0);
            } else {
                // predicate(0.0) is guaranteed to be true because b = 0.0
                return T(0.0);
            }
        }
    }
    if (a >= b) return NAN;
    if constexpr (deal_with_nans_and_infs) {
        // get rid of NaNs as soon as possible
        if (std::isnan(a)) a = -std::numeric_limits<T>::infinity();
        if (std::isnan(b)) b = std::numeric_limits<T>::infinity();
        // deal with infinities
        if (a == -std::numeric_limits<T>::infinity()) {
            if constexpr (check_infinities) {
                if (predicate(-std::numeric_limits<T>::max())) {
                    a = -std::numeric_limits<T>::max();
                } else {
                    return -std::numeric_limits<T>::max();
                }
            } else {
                a = -std::numeric_limits<T>::max();
            }
        }
        if (b == std::numeric_limits<T>::infinity()) {
            if constexpr (check_infinities) {
                if (!predicate(std::numeric_limits<T>::max())) {
                    b = std::numeric_limits<T>::max();
                } else {
                    return std::numeric_limits<T>::infinity();
                }
            } else {
                b = std::numeric_limits<T>::max();
            }
        }
    }
    // now a and b are both finite - deal with differently signed a and b
    if (is_negative(a) && !is_negative(b)) {
        // check 0 once
        if constexpr (distinguish_plus_minus_zero) {
            if (!predicate(-T(0.0))) {
                b = -T(0.0);
            } else if (predicate(T(0.0))) {
                a = T(0.0);
            } else {
                return T(0.0);
            }
        } else {
            if (!predicate(T(0.0))) {
                b = -T(0.0);
            } else {
                a = T(0.0);
            }
        }
    }
    // in the case a and b are both 0 after the above check, return 0
    if (a == b) return T(0.0);
    // start actual binary search
    auto get_int = [](T x) { return std::bit_cast<Int, T>(x); };
    auto get_float = [](Int x) { return std::bit_cast<T, Int>(x); };
    if (b > 0) {
        while (get_int(a) + 1 < get_int(b)) {
            auto m = std::midpoint(get_int(a), get_int(b));
            if (predicate(get_float(m))) {
                a = get_float(m);
            } else {
                b = get_float(m);
            }
        }
    } else {
        while (get_int(-b) + 1 < get_int(-a)) {
            auto m = std::midpoint(get_int(-b), get_int(-a));
            if (predicate(-get_float(m))) {
                a = -get_float(m);
            } else {
                b = -get_float(m);
            }
        }
    }
    return b;
}
It is also possible to extend this to breaking early when a custom closeness predicate is true (for example, min(absolute error, relative error) < 1e-9 and so on), but for the sake of simplicity, this template does not do so.
It is well known that writing the condition for continuation of the binary search as anything like while (r - l > 1) or while (l < r) is bug-prone on floating point numbers, so people tend to write a binary search by fixing the number of iterations. However, this is usually not the best method - you need \(O(\log L)\) iterations where \(L\) is the length of the range, and the range of floating point numbers is very large.
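For reference, the fixed-iteration approach looks something like this (a sketch; around 100 iterations is a typical "safe" choice for doubles):

```python
def partition_point_naive(lo, hi, pred, iters=100):
    # classic fixed-iteration binary search on floats:
    # pred is assumed true at lo and false at hi
    for _ in range(iters):
        mid = (lo + hi) / 2
        if pred(mid):
            lo = mid
        else:
            hi = mid
    return hi
```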
So, I came up with a way to binary search on (IEEE754, i.e., practically all implementations of) floating point numbers that takes bit_width(floating_type) + O(1) calls to the predicate. I am pretty sure that this method has been explored before (feel free to drop references in the comments if you find them). Regardless of whether this is a novel algorithm or not, I wanted to share it since it teaches you a lot about floating point numbers, and it also has the following advantages:
I started out by thinking about this: floating point numbers have a very large range. Can we do better than \(O(\log L)\), where \(L\) is the length of the range? Recall that floating point numbers have a fixed width that is much smaller than the log of the range they represent, and between two consecutive representable floating point numbers, it is the ratio (not the difference) that is nicely bounded (ignoring the boundaries at 0, infinities and NaNs). And it is well-known that to get to a certain relative error, it is optimal to use \(\sqrt{lr}\) instead of \(\frac{l + r}{2}\) in binary search.
So the first idea that comes to mind is to use sqrt instead of the midpoint in the usual binary search. However, this has a few issues — for example, if \(l = 0\), you never make progress (fixable by doing a midpoint search on \([0, 1]\) and a sqrt search on \([1, r]\)). Another issue is how to deal with negatives (again doable, by splitting the input range into multiple ranges where sqrt works) and with overflows/underflows. The main issue with this approach, though, is that sqrt is expensive — if the predicate is fast and you need to do a lot of binary searches, most of your computation time would go into sqrt.
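A minimal sketch of this sqrt-based idea (my own illustrative code, restricted to \(1 \le l \le r\); zeroes, negatives, and overflow in l * r are exactly the cases it does not handle):

```cpp
#include <cmath>
#include <functional>

// Geometric-mean bisection: converges to a fixed *relative* error. pred must
// be monotone (false to the left of the boundary, true to the right).
double sqrt_search(double l, double r, const std::function<bool(double)>& pred) {
    while (r / l > 1 + 1e-12) {
        double m = std::sqrt(l * r);  // geometric mean instead of (l + r) / 2
        (pred(m) ? r : l) = m;
    }
    return r;
}
```

For instance, sqrt_search(1, 2, [](double x) { return x * x >= 2; }) homes in on \(\sqrt{2}\); each call to std::sqrt here is the cost the rest of this post gets rid of.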
Inspired by this, we try to approximate sqrt in a way that preserves monotonicity. Here comes the main point — the IEEE754 representation of floating point numbers separates the mantissa from the exponent, and the exponent of \(\sqrt{lr}\) is roughly the mean of those of \(l\) and \(r\). Since the exponent occupies the top few bits (excluding the sign bit), we can just do the following (for positive floating point numbers, at least): read the floating point representation as an integer (this can be done in a type-safe manner using std::bit_cast), take the midpoint of these integers, and convert it back to a floating point number. This clearly preserves monotonicity — it can be verified easily by hand. The same thing works when both range endpoints are negative — for this we simply invert the sign bit. The case where the endpoints have opposite signs is also simple (if you disregard the fact that +0 and -0 compare equal but have distinct representations and reciprocals) — check the predicate at \(0\) (at both zeroes, if it matters for your predicate).
Now if we want to do some more handling (infinities, NaNs, denormals), we can do it as we wish. In the implementation in the TL;DR above, we decide to replace NaNs with the infinity in the correct direction, and since there can be many infinities, we try to bring down the range endpoint to the largest representable floating point instead.
In all, there are at most \(w + O(1)\) calls to the predicate, where \(w\) is the length of the complement of the longest common prefix of the floating point representations of endpoints.
This implementation is also robust to predicates which are noisy near the boundary (i.e., near the boundary, there is a small range where the true values of the predicate can come after the false values) — in this case the algorithm returns something in this range near the boundary.
Note that we did not need to hardcode the number of iterations, nor did we require some carefully-written predicate for loop termination.
Also, since std::bit_cast is practically free compared to std::sqrt, this is much faster in cases where you need to do a lot of binary searches.
As a usage example, this submission uses the usual binary search with a fixed number of iterations, and this submission uses the template above.
The following are some problems that use binary search on floating points, and should be solvable using this template. If you encounter any bugs in the template while solving these problems, do let me know in the comments below. Thanks to jeroenodb and PurpleCrayon for problem suggestions.
This post was originally written on Codeforces; relevant discussion can be found here.
As of May 2024, the bug has been fixed in GCC 14, but has not been ported to Codeforces yet.
MikeMirzayanov added a new compiler in response to the bug mentioned here. However, it does not come without a catch.
Namely, any pragma of the form #pragma GCC target(...) would NOT work with this new compiler. The issue is well known among people who use up-to-date compilers, but there has not been much progress towards a fix.
One of the ways to fix it is to add the target pragma AFTER the STL includes. This should make your code compile, but a lot of code would simply not be optimized, so use it at your own risk.
The other way is to use g++-12, but risk a TLE due to the linked post (it’s easy to write hacks for most problems). Fortunately, there is a fix for this hack, but, echoing the words of the post author, use that at your own risk.
For completeness, the first two of these don’t compile on g++-13 but the last one does.
#pragma GCC optimize("O3")
#pragma GCC target("avx")
#include <vector>
int main() { std::vector<int> a; }

#pragma GCC target("avx")
#include <vector>
int main() { std::vector<int> a; }

#pragma GCC optimize("O3")
#include <vector>
#pragma GCC target("avx")
int main() { std::vector<int> a; }
This post was originally written on Codeforces; relevant discussion can be found here.
Someone asked me about my template that used a “cache wrapper” for lambdas, so I decided to write a post explaining how it works. For reference, here is a submission of mine from 2021 that uses that template.
Here’s what you will find implementations for in this post:
Jump to the Usage section if you only care about the template, though I would strongly recommend reading the rest of the post too since it has a lot of cool/useful things in my opinion.
You will notice that the implementation in this post uses the self pattern (non-standard terminology, just what I like to call it), just like the y_combinator pattern does.
This should remind you of continuation-passing style, and the cache implementation is like a monad that does this for you, but for this specific case.
There are two major ways in which dynamic programming is implemented — recursive and iterative.
Recursive DP has the following advantages:
However, people switch to iterative DP as their default DP implementation, because of the following reasons:
The template talked about in the above comment aims to tackle the third point by reducing boilerplate code — it does this by automatically managing memoization for you.
In Python, you have nice things called decorators. For example, the functools.cache decorator is precisely what we would use for such a purpose:
from functools import cache

@cache
def factorial(n):
    return n * factorial(n - 1) if n else 1
If you call factorial(10), the internal cache will be populated with the results of all the recursive calls that were needed to compute factorial(10). As a result, you can then call factorial(5) and it will simply look the value up in the internal cache and return it without executing the function body.
So I thought, why not try to do the same thing in C++?
Let’s get a couple of things out of the way first:
So let’s start out by trying to design it.
We need the following two parts:
Let’s solve the first problem first. Hashing arbitrary structs in C++ is quite hard (not impossible, though). But what you can always do (for structs whose intent is to just store elements) is to supply a function that makes them tuples (i.e., provide a type conversion operator). For our problem, we identify the most important types that we will ever need to hash:
The last two are uniformly recursive on their inputs (except for vector<bool>, but our implementation doesn't need to treat it separately), so we can use recursive metaprogramming to be able to hash such types.
All in all, we need a hashing function that hashes the “base case” and a hash combination function that aggregates over the sequence/tuple types.
The implementation becomes something like this:
namespace hashing {
    using i64 = std::int64_t;
    using u64 = std::uint64_t;
    static const u64 FIXED_RANDOM = std::chrono::steady_clock::now().time_since_epoch().count();
#if USE_AES
    std::mt19937 rd(FIXED_RANDOM);
    const __m128i KEY1{(i64)rd(), (i64)rd()};
    const __m128i KEY2{(i64)rd(), (i64)rd()};
#endif
    template <class T, class D = void>
    struct custom_hash {};
    // https://www.boost.org/doc/libs/1_55_0/doc/html/hash/combine.html
    template <class T>
    inline void hash_combine(u64& seed, const T& v) {
        custom_hash<T> hasher;
        seed ^= hasher(v) + 0x9e3779b97f4a7c15 + (seed << 12) + (seed >> 4);
    }
    // http://xorshift.di.unimi.it/splitmix64.c
    template <class T>
    struct custom_hash<T, typename std::enable_if<std::is_integral<T>::value>::type> {
        u64 operator()(T _x) const {
            u64 x = _x;
#if USE_AES
            // implementation defined till C++17, defined from C++20
            __m128i m{i64(u64(x) * 0xbf58476d1ce4e5b9ULL), (i64)FIXED_RANDOM};
            __m128i y = _mm_aesenc_si128(m, KEY1);
            __m128i z = _mm_aesenc_si128(y, KEY2);
            return z[0];
#else
            x += 0x9e3779b97f4a7c15 + FIXED_RANDOM;
            x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9;
            x = (x ^ (x >> 27)) * 0x94d049bb133111eb;
            return x ^ (x >> 31);
#endif
        }
    };
    template <class T>
    struct custom_hash<T, std::void_t<decltype(std::begin(std::declval<T>()))>> {
        u64 operator()(const T& a) const {
            u64 value = FIXED_RANDOM;
            for (auto& x : a) hash_combine(value, x);
            return value;
        }
    };
    template <class... T>
    struct custom_hash<std::tuple<T...>> {
        u64 operator()(const std::tuple<T...>& a) const {
            u64 value = FIXED_RANDOM;
            std::apply([&value](T const&... args) { (hash_combine(value, args), ...); }, a);
            return value;
        }
    };
    template <class T, class U>
    struct custom_hash<std::pair<T, U>> {
        u64 operator()(const std::pair<T, U>& a) const {
            u64 value = FIXED_RANDOM;
            hash_combine(value, a.first);
            hash_combine(value, a.second);
            return value;
        }
    };
}  // namespace hashing
Now for the hash table, we could just use std::unordered_map. However, it often turns out to be slower than the GNU policy-based data structure gp_hash_table, so we also define some helpful aliases.
#include "ext/pb_ds/assoc_container.hpp"
#include "ext/pb_ds/tree_policy.hpp"
namespace pbds {
    using namespace __gnu_pbds;
#ifdef PB_DS_ASSOC_CNTNR_HPP
    template <class Key, class Value, class Hash>
    using unordered_map =
        gp_hash_table<Key, Value, Hash, std::equal_to<Key>, direct_mask_range_hashing<>, linear_probe_fn<>,
                      hash_standard_resize_policy<hash_exponential_size_policy<>, hash_load_check_resize_trigger<>, true>>;
    template <class Key, class Hash>
    using unordered_set = pbds::unordered_map<Key, null_type, Hash>;
#endif
#ifdef PB_DS_TREE_POLICY_HPP
    template <typename T>
    using ordered_set = tree<T, null_type, std::less<T>, rb_tree_tag, tree_order_statistics_node_update>;
    template <typename T>
    using ordered_multiset = tree<T, null_type, std::less_equal<T>, rb_tree_tag, tree_order_statistics_node_update>;
    template <class Key, class Value, class Compare = std::less<Key>>
    using ordered_map = tree<Key, Value, Compare, rb_tree_tag, tree_order_statistics_node_update>;
#endif
}  // namespace pbds
Now the only thing that remains is to actually decide the design of the wrapper.
We make the following choices:
The wrapper exposes an operator() (call operator) that looks into the cache and returns the answer if found; otherwise, it calls the lambda recursively and updates the cache (both for the recursive calls and for the current call).
This is done as follows:
template <typename Signature, typename Lambda>
struct Cache;

template <typename ReturnType, typename... Args, typename Lambda>
struct Cache<ReturnType(Args...), Lambda> {
    template <typename... DummyArgs>
    ReturnType operator()(DummyArgs&&... args) {
        auto tied_args = std::tie(args...);
        auto it = memo.find(tied_args);
        if (it == memo.end()) {
            auto&& ans = f(*this, std::forward<DummyArgs>(args)...);
            memo[tied_args] = ans;
            return ans;
        } else {
            return it->second;
        }
    }
    template <class _Lambda>
    Cache(std::tuple<>, _Lambda&& _f) : f(std::forward<_Lambda>(_f)) {}
    Lambda f;
    using TiedArgs = std::tuple<std::decay_t<Args>...>;
    pbds::unordered_map<TiedArgs, ReturnType, hashing::custom_hash<TiedArgs>> memo;
};

template <class Signature, class Lambda>
auto use_cache(Lambda&& f) {
    return Cache<Signature, Lambda>(std::tuple{}, std::forward<Lambda>(f));
}
The usage is very simple.
Here’s the whole template for reference:
namespace hashing {
    using i64 = std::int64_t;
    using u64 = std::uint64_t;
    static const u64 FIXED_RANDOM = std::chrono::steady_clock::now().time_since_epoch().count();
#if USE_AES
    std::mt19937 rd(FIXED_RANDOM);
    const __m128i KEY1{(i64)rd(), (i64)rd()};
    const __m128i KEY2{(i64)rd(), (i64)rd()};
#endif
    template <class T, class D = void>
    struct custom_hash {};
    // https://www.boost.org/doc/libs/1_55_0/doc/html/hash/combine.html
    template <class T>
    inline void hash_combine(u64& seed, const T& v) {
        custom_hash<T> hasher;
        seed ^= hasher(v) + 0x9e3779b97f4a7c15 + (seed << 12) + (seed >> 4);
    }
    // http://xorshift.di.unimi.it/splitmix64.c
    template <class T>
    struct custom_hash<T, typename std::enable_if<std::is_integral<T>::value>::type> {
        u64 operator()(T _x) const {
            u64 x = _x;
#if USE_AES
            // implementation defined till C++17, defined from C++20
            __m128i m{i64(u64(x) * 0xbf58476d1ce4e5b9ULL), (i64)FIXED_RANDOM};
            __m128i y = _mm_aesenc_si128(m, KEY1);
            __m128i z = _mm_aesenc_si128(y, KEY2);
            return z[0];
#else
            x += 0x9e3779b97f4a7c15 + FIXED_RANDOM;
            x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9;
            x = (x ^ (x >> 27)) * 0x94d049bb133111eb;
            return x ^ (x >> 31);
#endif
        }
    };
    template <class T>
    struct custom_hash<T, std::void_t<decltype(std::begin(std::declval<T>()))>> {
        u64 operator()(const T& a) const {
            u64 value = FIXED_RANDOM;
            for (auto& x : a) hash_combine(value, x);
            return value;
        }
    };
    template <class... T>
    struct custom_hash<std::tuple<T...>> {
        u64 operator()(const std::tuple<T...>& a) const {
            u64 value = FIXED_RANDOM;
            std::apply([&value](T const&... args) { (hash_combine(value, args), ...); }, a);
            return value;
        }
    };
    template <class T, class U>
    struct custom_hash<std::pair<T, U>> {
        u64 operator()(const std::pair<T, U>& a) const {
            u64 value = FIXED_RANDOM;
            hash_combine(value, a.first);
            hash_combine(value, a.second);
            return value;
        }
    };
}  // namespace hashing
#include "ext/pb_ds/assoc_container.hpp"
#include "ext/pb_ds/tree_policy.hpp"
namespace pbds {
    using namespace __gnu_pbds;
#ifdef PB_DS_ASSOC_CNTNR_HPP
    template <class Key, class Value, class Hash>
    using unordered_map =
        gp_hash_table<Key, Value, Hash, std::equal_to<Key>, direct_mask_range_hashing<>, linear_probe_fn<>,
                      hash_standard_resize_policy<hash_exponential_size_policy<>, hash_load_check_resize_trigger<>, true>>;
    template <class Key, class Hash>
    using unordered_set = pbds::unordered_map<Key, null_type, Hash>;
#endif
#ifdef PB_DS_TREE_POLICY_HPP
    template <typename T>
    using ordered_set = tree<T, null_type, std::less<T>, rb_tree_tag, tree_order_statistics_node_update>;
    template <typename T>
    using ordered_multiset = tree<T, null_type, std::less_equal<T>, rb_tree_tag, tree_order_statistics_node_update>;
    template <class Key, class Value, class Compare = std::less<Key>>
    using ordered_map = tree<Key, Value, Compare, rb_tree_tag, tree_order_statistics_node_update>;
#endif
}  // namespace pbds
template <typename Signature, typename Lambda>
struct Cache;

template <typename ReturnType, typename... Args, typename Lambda>
struct Cache<ReturnType(Args...), Lambda> {
    template <typename... DummyArgs>
    ReturnType operator()(DummyArgs&&... args) {
        auto tied_args = std::tie(args...);
        auto it = memo.find(tied_args);
        if (it == memo.end()) {
            auto&& ans = f(*this, std::forward<DummyArgs>(args)...);
            memo[tied_args] = ans;
            return ans;
        } else {
            return it->second;
        }
    }
    template <class _Lambda>
    Cache(std::tuple<>, _Lambda&& _f) : f(std::forward<_Lambda>(_f)) {}
    Lambda f;
    using TiedArgs = std::tuple<std::decay_t<Args>...>;
    pbds::unordered_map<TiedArgs, ReturnType, hashing::custom_hash<TiedArgs>> memo;
};

template <class Signature, class Lambda>
auto use_cache(Lambda&& f) {
    return Cache<Signature, Lambda>(std::tuple{}, std::forward<Lambda>(f));
}
Let’s first try to replicate the Python example from above:
auto factorial = use_cache<int(int)>([](auto&& self, int n) -> int {
    if (n) return n * self(n - 1);
    else return 1;
});
std::cout << factorial(10) << '\n';
You can do more complicated things with this template too. Let's say you want a recursive function that takes a char and a pair<int, string> as its DP state, and stores a bool. It also requires values from some array you have taken as input, so you need to capture that as well.
You write it like this:
vector<int> v;
// read v
auto solve = use_cache<bool(char, pair<int, string>)>([&](auto&& self, char c, pair<int, string> p) -> bool {
    // note that v was captured by &
    // use c, p, v as needed
    // let's say you have c1 and p1 as arguments for a recursive call
    auto x = self(c1, p1);  // recursive call - do as many of them as needed
    // use c, p, v as needed
    return y;  // return something
});
std::cout << solve('a', pair{1, "hello"s}) << '\n';
std::cout << solve('a', pair{1, "hello"s}) << '\n';
The most important caveat is that since it uses a hash-table under the hood, you will often miss the performance that you get from just using multidimensional arrays. You can adapt the template to do things like that, but I didn’t, for the sake of clarity and ease of implementation.
Since it doesn’t make a lot of sense to store function calls with arguments that can change over time, the template doesn’t support lambdas that take things by reference either. This is done in order to avoid any possible bugs that might come with references being modified on the fly somewhere in your code.
However, if for some reason (maybe for performance you want const refs of large structs/vectors instead of copies) you really want to use references, you can remedy that by doing the following:
Replace the references with std::reference_wrapper (for example, rather than int&, use std::reference_wrapper<int>) — this makes a reference behave like a value (it is copyable, for example), so both the signature passed to use_cache and the lambda's parameters use std::reference_wrapper<T>.
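A minimal stand-alone illustration of why std::reference_wrapper fits here (my own example, independent of the cache template): it is a copyable value that still refers to the original object.

```cpp
#include <functional>
#include <vector>

// A reference_wrapper can be copied and stored like a value, yet reads and
// writes go through to the referred-to object.
int sum_through_ref() {
    std::vector<int> v{1, 2, 3};
    std::reference_wrapper<std::vector<int>> r = v;  // copyable "reference" to v
    r.get().push_back(4);                            // mutates v itself
    int s = 0;
    for (int x : v) s += x;                          // v is now {1, 2, 3, 4}
    return s;
}
```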
.Do let me know if you know of a better way of doing these things, and if there are any errors that might have crept into this post!
This post was originally written on Codeforces; relevant discussion can be found here.
A lot of people shy away from solving (mathematical) recurrences just because the theory is not very clear/approachable due to not having an advanced background in math.
As a consequence, the usual ways of solving recurrences tend to be:
But this doesn’t have to be the case, because there is a nice method you can apply to solve recurrences reliably. I independently came up with this method back when I was in middle school, and surprisingly, a lot of people have no idea that you can solve recurrences like this.
The method is more principled and reliable than the first two guesswork methods, the third method might fail to apply in some cases, and the fourth method requires knowing what an infinite series is. An added bonus is that it prevents you from forgetting special cases that often plague recurrences! Meanwhile, as we show in the last section, having intuition about the method also allows you to apply the ideas developed to things that are only vaguely related to recurrences.
The best thing about the method is that it requires nothing more than school-level algebra and potentially some intuition/common sense.
I’m not saying that you should only use this method, just that keeping it in mind is pretty helpful whenever you’re faced with a recurrence (or even more fundamentally, linear combinations of numbers).
Note: I recommend working through this section (with pen and paper) with the examples \(x_k = 2x_{k-1} + 3\) and \(x_k = x_{k-1} + 5\). Using symbols can be intimidating for some and thus examples should be used while presenting new ideas, but it is also important to understand the intuition behind the underlying ideas in the form of algebra.
Let’s first consider a recurrence \(x_k = ax_{k-1} + b\). We employ wishful thinking: what if there were no \(b\) in the recurrence? In that case, we would have \(x_k = x_0 \cdot a^k\), and that would be the end of the story. So can we reduce the general case to the one where \(b = 0\)?
If you define \(y_k = x_k + c\) (i.e., vertically shift the graph of \((i, x_i)\) by some offset), then we have \(y_k = a(y_{k-1} - c) + b + c\). Now we have one degree of freedom, and we can choose \(c\) such that the constant term \(-ac + b + c\) is \(0\), i.e., \(c = \frac{b}{a-1}\). Note that we immediately run into a special case here: \(a = 1\) — in this case, such a \(c\) is impossible to choose. So we make two cases to finish off the problem:
Exercise: Try to solve the recurrence \(x_k = 2 x_{k-1}^2\) using this method.
Note: I recommend working through this section with two separate examples in mind: \(x_k = 3x_{k-1} + 10x_{k-2}\) and \(x_k = 4x_{k-1} - 4x_{k-2}\).
Let’s now consider the recurrence \(x_k = a x_{k-1} + b x_{k-2} + c\). For the sake of simplicity of presentation, we note that we can do something like the previous example to get rid of the \(c\), so let’s assume \(c = 0\). So we want to solve \(x_k = a x_{k-1} + b x_{k-2}\).
Now how do we deal with the two lagging \(x_{k-1}\) and \(x_{k-2}\) terms instead of just one? The trick is to try and find \(d\) such that \(y_k = x_k - d \cdot x_{k-1}\) becomes a geometric progression (we can also use a \(+d\) instead of \(-d\) like we did in the previous solution, but that does not matter).
We can also look at it in this way: we want to subtract a suitable multiple of \(x_{k-1}\) from both sides, so that \(x_k - d x_{k-1}\) becomes a multiple of its lagged sequence.
For that to happen, notice that we must have \(y_k = e y_{k-1}\). This translates to \(x_k - d x_{k-1} = e (x_{k-1} - d x_{k-2})\). Comparing it with the original recurrence, we must have \(d + e = a\) and \(de = - b\). So we have \(|d - e| = \sqrt{(d + e)^2 - 4de} = \sqrt{a^2 + 4b}\) (note how this relates to solving a quadratic equation: \(x^2 - ax - b\) needs to be factorized into \((x - d)(x - e)\) — this will be helpful later on). Using this, we can solve for \(d\) and \(e\).
Now once that we have \(d\) and \(e\), we have \(y_k = e y_{k-1}\), so \(y_k = e^{k-1} y_1\). Let’s now, as earlier, go back to \(x\)-space. We have \(x_k - d x_{k-1} = e^{k-1} y_1\): note that \(y_1 = (x_1 - d x_0)\) is a constant we already know.
We now have two methods of solving this: one is to simply add these equalities for different \(k\) in a telescoping manner, after dividing the whole equation by \(d^k\). Just for the sake of showing the usage of this method a bit more, I will go with the other one.
Looking at the new recurrence we got for \(x\), we again want to define some \(z_k\) as a function of \(x_k\), such that \(z_k = d z_{k-1}\). Since there is a term proportional to \(e^{k-1}\) on the right, we will choose \(z_k = x_k + f \cdot e^{k-1}\).
Going back to the recurrence in \(x\), we get \(z_k - f \cdot e^{k-1} - d \cdot z_{k-1} + d \cdot f \cdot e^{k-2} = e^{k-1} \cdot y_1\).
To get rid of the non-\(z\) terms, we must have \(f \cdot (d - e) = ey_1\). We are again faced with two cases, as in the previous example.
The case with \(d \ne e\) (the “distinct-root” case) is easy: in this case, we can just solve for the problem analogously to what we did to solve the simpler example.
The only remaining case is \(d = e\) (the “repeated-root” case): let’s look closely at why we failed here. The two terms coming from \(x_k\) and from \(x_{k-1}\) cancel out completely. What can we do to get different contributions of \(e^{k-1}\) from these two terms? Let’s choose \(z_k = x_k + g(k) \cdot e^{k-1}\) instead. Then we must have \(g(k - 1) - g(k) = y_1\). This means \(g(k)\) is in fact a linear function of \(k\). That is, we have \(g(k) = hk + i\) for some \(h, i\): here \(h = -y_1\) and \(i\) can be set arbitrarily (for now, let’s say \(i = 0\)). So we choose \(g(k) = -y_1 k\), i.e., \(z_k = x_k - y_1 \cdot k \cdot e^{k-1}\).
Since we know \(z_k = dz_{k-1}\), we know that \(z_k = d^k z_0\). Going into \(x\)-space (expressing \(z\) in terms of \(x\)), we have a solution to the equation.
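The derivation can be sanity-checked numerically. Here is a quick check for the distinct-root example \(x_k = 3x_{k-1} + 10x_{k-2}\) (so \(d = 5\), \(e = -2\)), with initial values \(x_0 = 1\), \(x_1 = 2\) chosen only for illustration; solving \(A + B = x_0\) and \(5A - 2B = x_1\) gives \(A = 4/7\), \(B = 3/7\) in the closed form \(x_k = A \cdot 5^k + B \cdot (-2)^k\).

```cpp
#include <cmath>

// x_k computed directly from the recurrence x_k = 3 x_{k-1} + 10 x_{k-2}
// with (illustrative) initial values x_0 = 1, x_1 = 2.
long long by_recurrence(int k) {
    long long prev2 = 1, prev1 = 2;
    if (k == 0) return prev2;
    for (int i = 2; i <= k; i++) {
        long long x = 3 * prev1 + 10 * prev2;
        prev2 = prev1;
        prev1 = x;
    }
    return prev1;
}

// x_k from the closed form A * 5^k + B * (-2)^k with A = 4/7, B = 3/7.
long long by_closed_form(int k) {
    double A = 4.0 / 7, B = 3.0 / 7;
    return std::llround(A * std::pow(5.0, k) + B * std::pow(-2.0, k));
}
```

The two agree for \(k\) up to around 20, after which double rounding in the closed form starts to matter relative to the half-integer threshold of llround.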
In this section, just by generalizing the approach in the previous section, we will show how you can in fact derive the method of characteristic equations while solving recurrences.
Remember the note about solving for \(d\) and \(e\). It looks like finding a factorization of \(x^2 - ax - b\) in the form \((x - d)(x - e)\).
When you try to use the method above for third order recurrences, you will quickly realize that we are trying to find a factorization of \(x^3 - ax^2 - bx - c\) into \((x - d)(x^2 - ex + f)\) for the first step, and as you go along solving the lower order recurrence at every step, you will realize that you also need to factorize \(x^2 - ex + f\) into two linear polynomials.
If the roots \(r_i\) (\(i = 1\) to \(3\)) of \(x^3 - ax^2 - bx - c\) are all distinct, you will get \(x_k\) in the nice form \(x_k = \sum_{i=1}^3 c_i r_i^k\). But if the roots have multiplicity more than 1 (i.e., \((x - r)^s\) divides the polynomial for some \(s > 1\)), then the \(c_i\) becomes a polynomial in \(k\) with degree \(s - 1\).
Using induction, you can show that a result of this form is indeed true for higher order recurrences as well.
There’s a funny story related to this method. Back when I was testing [contest:1909], problem D was not in the problemset. When it was added (with \(k = 1\) as the easier version; the problem was supposed to be problem B in the set), I solved it instantly because I had used this method for solving recurrences before, and thought that its difficulty should be no more than that of problem A. A lot of people disagreed with me on this, and it was eventually agreed to use general \(k\) as a harder problem. I still disagreed with this opinion for a while, but then I realized that not a lot of people know this method of approaching recurrences/linear combinations.
This was partly the motivation behind writing this post (the other part was the existence of posts like this).
Here’s how you solve that problem using this method.
In this problem, you are given a reverse recurrence of the form \(y + z = x + k\). Inspired by the method developed above, we will just subtract \(k\) from both sides, and let \(x' = x - k\), \(y' = y - k\), \(z' = z - k\), to get \(y' + z' = x'\).
This means that if we shift all the array elements vertically by \(-k\), we are just splitting elements into other elements \(\ge 1 - k\).
(Arguably) the hardest part of the problem is to realize that you can shift numbers and deal with them as usual, after the transformation. But to a person who knows this method well, it might seem like this problem was made specifically to be solved by it.
In this post, we saw that we can solve recurrences of the form \(a_k = f(a_{k-1}, \dots, a_{k - l})\) by trying to impose easy-to-analyze structure on expressions of the form \(b_k = g(a_{k}, a_{k-1}, \dots, a_{k - l + 1})\). In each example, the structure was of the form \(b_k = r b_{k-1}\), which gives us a way to compute \(b_k\) directly, thereby reducing the “order” of the problem from \(l + 1\) to \(l\). If you tried the exercise, the structure was supposed to be of the form \(b_k = b_{k-1}^2\).
In a way, this is an application of the divide and conquer way of thinking (in the general sense). This kind of divide and conquer, but in a bottom-up way rather than a top-down way, is also used when you derive things like Hensel’s lifting lemma.
This post was originally written on Codeforces; relevant discussion can be found here.
On a computer science discord server, someone recently asked the following question:
Is the master theorem applicable for the following recurrence? \(T(n) = 7T(\lfloor n / 20 \rfloor) + 2T(\lfloor n / 8 \rfloor) + n\)
There was some discussion related to how \(T\) is monotonically increasing (which is hard to prove), and then someone claimed that there is a solution using induction for a better bound. However, these ad hoc solutions often require some guessing and the Master theorem is not directly applicable.
Similarly, in the linear time median finding algorithm (or the linear time order statistics algorithm in general), the recurrence is \(T(n) = T(\lfloor 7n/10 \rfloor + \varepsilon) + T(\lfloor n/5 \rfloor) + n\), where \(\varepsilon \in \Theta(1)\). Also, in merge sort and in building a segment tree recursively, the time complexities satisfy \(T(n) = T(\lfloor n/2 \rfloor) + T(\lceil n/2 \rceil) + f(n)\), where \(f(n)\) is \(\Theta(n)\) for merge sort and \(\Theta(1)\) for segment tree construction.
Even the general form of the recurrence that the master theorem handles is just \(T(n) = aT(n/b) + f(n)\), and it doesn’t handle cases of the form \(T(n) = aT(\lfloor n/b \rfloor) + f(n)\) and \(T(n) = aT(\lceil n/b \rceil) + f(n)\), which is what we find in a lot of algorithms.
To cleanly handle all these cases at once, there is a generalization of the Master theorem for recurrences, called the Akra-Bazzi theorem. It is not really well-known, and since it is not trivial to prove (the proof is pretty nice, I recommend looking at it), it is often skipped over in CS courses on algorithms (sometimes even the instructors are not aware of this theorem). This is why I wanted to write this post — to provide a better way of reasoning about your divide and conquer algorithms.
Theorem: Consider \(T(x)\) given by a sufficient number of base cases (for instance, bounded for \(x < x_0\) for \(x_0\) defined later), and defined recursively by \(T(x) = g(x) + \sum_{i=1}^k a_i T(b_i x + h_i(x))\) for large enough \(x\) (i.e., for all \(x \ge x_0\) for some \(x_0\)).
Then, subject to the following conditions:
we have the following: if \(p\) is the unique solution to \(\sum_{i=1}^k a_i b_i^p = 1\), then
\[ T(x) \in \Theta \left(x^p \left(1 + \int_{1}^{x} \frac{g(u)}{u^{p+1}} \mathrm{d}u \right)\right) \]
There are some more pedantic conditions on \(x_0\) and \(g\) for this to hold, but for all practical purposes, those conditions (like some definite integrals with finite upper and lower limits are finite) are true. For more, you can refer to this paper.
This has the following advantages:
Let’s apply this theorem to the few examples mentioned above (although there won’t be any non-trivial results among the above examples, an important special case will be illustrated in the following examples).
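For instance, for the recurrence from the question at the top, the computation (worked out here for its specific \(g(x) = x\)) goes as follows:

```latex
% T(n) = 7T(n/20) + 2T(n/8) + n, so a_1 = 7, b_1 = 1/20, a_2 = 2, b_2 = 1/8.
% The left side of 7(1/20)^p + 2(1/8)^p = 1 is decreasing in p; it equals
% 9 at p = 0 and 7/20 + 2/8 = 0.6 at p = 1, so the unique p lies in (0, 1).
\int_1^x \frac{g(u)}{u^{p+1}} \,\mathrm{d}u
  = \int_1^x u^{-p} \,\mathrm{d}u
  = \frac{x^{1-p} - 1}{1 - p} \in \Theta(x^{1-p})
% and therefore
T(x) \in \Theta\!\left(x^p \left(1 + \Theta(x^{1-p})\right)\right) = \Theta(x)
```

So the recurrence is linear time, with no monotonicity argument or induction guess needed.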
A nice exercise could be to try analyzing the complexity of the brute force in this problem.
This post was originally written on Codeforces; relevant discussion can be found here.
Disclaimer: This post (and all of my other posts, unless specified otherwise) is 100% ChatGPT-free — there has been no use of any AI/ML-based application while coming up with the content of this post.
There is a lot of AI-generated content out there these days that sounds plausible and useful but is absolute garbage and contributes nothing beyond a list of superficial sentences — even if it contains content that is true (which is a big IF, by the way), it is content you could have just looked up on your favorite search engine. The reasons current AI-generated content is like this are manifold, but I won’t get into them, because I need to write the content of this post too, and I don’t see copy-pasting this kind of content as anything but a stupid maneuver that wastes everyone’s time.
I hope other people who write posts refrain from using this stuff for generating content that is supposed to be meaningful and realize that the current state of AI is just a memorization machine that is trained to excrete human-sounding sentences. When I saw someone from Div 1 posting useful-sounding content on CF, which was clearly generated by ChatGPT, I couldn’t help but despair at what the knowledge-acquiring mechanisms of the future will look like, other than knowledge coming almost entirely from unreliable sources that consist of matrix multiplications at their core.
The contents are as follows:
Prerequisites: knowing a bit about structs/classes in C++, member functions, knowing that the STL exists.
If you feel something is not very clear, I recommend waiting for a bit, because I tried to ensure that all important ideas are explained at some point or another, and if you don’t understand and it doesn’t pop up later, it is probably not that important (and should not harm the experience of reading this post). Nevertheless, if I missed out on explaining something that looks important, please ask me in the comments — I’d be happy to answer your questions!
I would like to thank (in alphabetical order) AlperenT, bestial-42-centroids and kostia244 for discussions that helped improve the presentation and teachability of this post.
A few years back, I co-authored a blog on pragmas because there were very few people who understood what pragmas do. I noticed that the same is true, albeit to a lesser extent, for lambdas (C++ and otherwise) — some using them as a minor convenience but nothing more; others having an irrational (and ignorant) disdain for them, mainly arising either out of experiences that come from not using them properly or a love of all things “minimalistic”.
Hence, part of the motivation behind this post is to show and convince people that lambdas can be very useful, both for your code AND your thought process, and that they will make your life easier if you do any amount of programming in a language that has decent support for lambdas (yes, not only C++). Thus, a significant part of the post would also talk about some topics related to models of computation and some history. And, of course, I will make a few philosophical arguments along the way, which you can feel free to skip, but they are helpful as a first step to realizing why such a change in perspective is quite important for any learning.
The other reason I am writing this post is to explain C++ lambdas on a much deeper level than people usually go into, without being as terse and unreadable for a beginner as the cppreference page — this will involve rolling our own lambdas! This activity will also help us understand why certain behaviors are intuitive for C++ lambdas. And in the end, as everyone expects a tutorial on CF to be, we will also show some tricks that are especially useful in the competitive programming context.
Examples of lambda usage in code don’t come until much later in this post, because we first develop an understanding of what role lambdas actually play in a language, instead of just introducing them as nifty tricks that can help you shorten your code. If you only want to look at applications, please skip to the last few sections.
Hopefully, this post will also get people interested in programming language theory — especially functional programming and parts that deal with language design.
Finally, no matter how experienced you are with lambdas/C++/programming languages, there is a high chance that you will find some piece of code/idea in this post that will look very new. I recommend checking out the two sections on patterns (in one of them we implement OOP using lambdas, for example), and the discussion on untyped v/s typed lambda calculus and implementing recursion.
I recommend reading this section again after reading the rest of the blog since it’ll make more sense after you understand what lambdas are. But since the reader is probably growing restless about why they should be investing their time in lambdas, here’s a list of reasons why you should learn about lambdas:
The unusual name lambda comes from lambda calculus, which is a mathematical system for expressing computation, much like what a computer program is supposed to be written in. In this computation system, we have only three types of expressions, which are supposed to be the representation of each computation:

- A variable \(x\), which is just a name.
- An abstraction \(\lambda x. E\), which is a function that takes an input \(x\) and returns the expression \(E\).
- An application \(E_1 \; E_2\), which applies the expression \(E_1\) (as a function) to the expression \(E_2\).

And that’s it. Note that there are no numbers and no restriction on what \(E_2\) evaluates to. In particular, \(E_2\) can be a function too! The same holds for \(E\) and \(E_1\) as well, so there are many kinds of expressions that you can write that will make sense in this “language”. Evaluation of these expressions is done using certain rules called reductions (which resemble simplification of mathematical expressions), though at this stage, it is not really necessary to understand what they do (it is sufficient to take some amount of intuition from any language of your choice to develop an intuition for it, though care should be taken to avoid taking the intuition too far, because there are certain core concepts in languages that are simply not present in this lambda calculus — for instance, types).
It turns out that this system of computation is Turing complete, i.e., any program that can be run on a Turing machine can be simulated using this method of computation. But why is this relevant to the discussion?
The second type of term here is called a lambda term. This is where the lambda terminology comes from — because lambdas play the role of functions in this system. Except in lambda calculus, you’re allowed to do a large variety of things by virtue of the lack of constraints in the definition. For example, you can pass around functions to other functions (remember that \(E_2\) is not constrained to be of a particular “type”).
There is no concept of a type in this system of computation. Even the reduction mechanism for a standard lambda calculus is something that doesn’t care about types. It turns out that if we assign types to these terms (giving what is called a typed lambda calculus), all valid reduction sequences are forced to terminate, and any expression that admits an infinite sequence of reductions necessarily ends up being untypable.
But note that we are not allowed to manipulate functions in C++ as freely as the above definition dictates. For example, we are not allowed to return functions that depend on the input because function pointers in C/C++ are supposed to be pointers to things known at compile time.
To illustrate this, let’s temporarily add numbers and the \(+\) operator to the lambda calculus (by the way, they can be semantically simulated using just lambda expressions using what are called Church numerals, but that is a completely different story for another day). The following is a valid lambda expression:
\[ \lambda x. \lambda y.(x + y) \]
This function takes a value \(x\) and returns a lambda that takes an input \(y\) and returns \(x + y\). Note that the output of the inner lambda depends on the value of \(x\). For example, \((\lambda x. \lambda y.(x + y)) (42)\) is just \(\lambda y.(42 + y)\). So we have something analogous to a function that returns a function, such that the returned function is dependent on the input to the outer function. (By the way, this is what I meant by lambdas reducing the dependence on “tagged” functions, because you have a great tagging mechanism for lambdas — just store the data inside it).
The corresponding in C++ style syntax should be something like (but this isn’t legal C++ code, so I took the liberty of using non-standard syntax here):
int(int) lambda1(int x) {
int lambda2(int y) {
return x + y;
}
return lambda2;
}
So, there is a clear difference between C++ functions and lambda terms from the lambda calculus despite all the other semantic similarities. This discussion already hints to us what a lambda function (as we will call it from now on) should look like in C++.
Let’s digress for the rest of this section into the discussion of functional programming languages.
Note that the imperative programming model (let’s think about C for the sake of concreteness) is largely based on the Turing machine model of computation — you have tapes (memory), pointers, and states. The main focus is on evolving the state of the program and the memory in order to get to an end goal.
The functional programming model, on the other hand, is largely based on lambda calculus. You rarely have any state in a functional programming language like Haskell or OCaml (especially in a purely functional language like Haskell). Instead, you express your code in the form of function applications and some basic inbuilt types (like integral types) and functions (like numerical operators).
Both of them are equally powerful. However, functional programming languages are often at a much higher level than imperative languages because of the power of abstraction and resemblance to mathematics, which rarely talks about state in any of its fundamental parts.
Also, a fun fact — the sharp-eyed reader would have noticed that the lambda function we wrote is semantically equivalent to a function that takes two numbers \(x\) and \(y\) and returns their sum, except that we have partial application for free, by “binding” \(x\) to the lambda after the first application and returning a function that adds \(x\) to anything. This style of writing lambda terms is called currying and can be used to implement multi-argument functions in a language where functions only have one argument.
This is just scraping the surface of what is otherwise a crazy set of things that can be done with lambda calculus. For example, one interesting thing you can do with it is to show how weird computation systems like the SKI combinator calculus are in fact Turing complete. Also, since the lambda calculus is Turing complete, there is a way (a quite elegant one at that) to represent recursion called a fixed point combinator, which can be implemented in multiple ways, one of which is the Y-combinator.
Since C++ is Turing complete, should we expect there to be a way to implement (a typed variant of) lambda calculus in C++? Definitely, though we probably shouldn’t expect it to be very elegant. Or can we? Read on to find out.
Let’s think about what a lambda function (shortened to lambda for convenience) should behave like in C++.
It should be what is called a “first-class citizen” of the language in programming language design — i.e., all operations that are supported for other entities should also be supported for lambdas. In other words, it should have the following properties at the very least:
The most general kind of thing that C++ supports that adheres to these rules is an object. Let’s try to implement a lambda as an object in that case.
So for now, a lambda looks like this for us:
struct Lambda {
// ???
};
Lambda lambda{};
Note that since structs can be defined in a local scope, we will be able to do whatever follows inside a local scope as well as a namespace scope.
Now, according to its definition, we should be able to call a lambda as well. The most obvious way of associating a function with a struct is to implement a member function, let’s say `call`. However, one cool functionality that C++ provides is the ability to overload `operator()` for a struct — this is the function call operator, and it will help us reduce the syntactic boundaries between a function and an object. So rather than using the name `call`, we will overload `operator()`, which does the same thing but with better syntax.
So, something like the following prints `2` on a new line:
struct Lambda {
int operator()(int x) const {
return x + 1;
}
};
Lambda lambda;
std::cout << lambda(1) << '\n';
At this point, we should note that such a struct which overloads `operator()` is called a function object in the C++ standard (and is frequently informally referred to as the horribly named “functor”, which has a very different meaning in category theory, so it will irk a lot of people if you call a function object a functor in front of them).
Now that we can wrap functions into objects, we can do whatever an object can — for example, we can now return something that behaves like a function, so it might seem that the story ends here. However, note that we didn’t implement anything that allows us to “bind” data to functions, as we wanted to do in our addition example. It is necessary to implement this, in order to make \(E\) aware of the value of \(x\) somehow, when applying \(\lambda x. E\) to some \(x\).
Since we want to emulate a mathematical language, for now we don’t really care about C++ semantics, and we will choose value semantics for \(x\) (which will make a lot of copies, but will make reasoning easier). This basically means that whenever we store \(x\), it will be, just like in mathematics, completely independent of wherever its value was taken from.
Let’s try our hand again at the \(\lambda x. \lambda y.(x + y)\) example. Let’s write it in a more computer-program-style syntax with parentheses at all the right places (which some people might recognize as being one infix-to-prefix transformation away from being Lisp code):
(lambda x.
(lambda y.
(x + y)))
Since there are two \(\lambda\)-s, we would need two structs this time — one for the outer lambda and one for the inner lambda. The inner lambda needs to “capture” \(x\), so we store a copy of \(x\) in it whenever it is constructed.
struct LambdaInner {
int x_; // our binding
LambdaInner(int x) : x_{x} {}
int operator()(int y) const {
return x_ + y;
}
};
struct LambdaOuter {
LambdaInner operator()(int x) const {
return LambdaInner(x);
}
} lambda_outer;
LambdaInner lambda_inner = lambda_outer(1);
std::cout << lambda_inner(2) << '\n'; // or lambda_outer(1)(2)
But this starts looking less and less like actual lambda functions now. The trick is to use the fact that we are able to declare structs in local scope.
struct LambdaOuter {
auto operator()(int x) const {
struct LambdaInner {
int x_; // our binding
LambdaInner(int x) : x_{x} {}
int operator()(int y) const {
return x_ + y;
}
};
return LambdaInner(x);
}
} lambda_outer;
std::cout << lambda_outer(1)(2) << '\n';
Okay, this now resembles lambda expressions a bit more than before. But there is still a large amount of noise in the code — for example, having to declare `x_` and naming the structs for the lambdas. It turns out that we have exactly the tools for the job with C++ lambdas, which provide syntactic sugar for this kind of thing:
auto lambda_outer =
[](int x) {
return [x](int y) {
return x + y;
};
};
std::cout << lambda_outer(1)(2) << '\n';
Here, the lambda being returned has “captured” or “bound” the variable `x` by storing a copy of it inside itself.
Don’t worry if you’re a bit baffled by the jump in syntax from that monstrosity of object-oriented code to something this simple — now is the time to look at the syntax of a C++ lambda.
A C++ lambda expression (or lambda function), in its simplest form, has the syntax
[capture_list](parameter_list) specifiers -> trailing_return_type {
// function body
};
where the trailing return type and specifiers are both optional (we will get to them in a moment, they’re not super critical for now).
Given that lambda functions aim to replicate a lot of C++ function functionality, there is a lot of noise you could introduce here too, with syntax that I have not mentioned above — you can make it something as complex as the following as of C++20 (even ignoring the possible `*this` and variadic captures):
int x = 1;
int y = 0;
auto lambda = [&, x = std::move(x)]
<std::integral T>
requires(sizeof(T) <= 8)
(T a, auto b, int c)
mutable constexpr noexcept [[deprecated]]
-> std::uint64_t {
y = x;
return static_cast<std::uint64_t>(c) + a + sizeof(b) + x + y;
};
However, we won’t be really concerned with a lot of this, and will only deal with a few special cases (besides, if you understand the rough correspondence between function objects and lambdas, it becomes more intuitive to guess how this syntax works).
In the syntax mentioned earlier, the lambda roughly corresponds to this struct (don’t worry if you don’t understand this yet, we will explain the terms used here in detail):
struct UnnamedLambda {
// 1. variables corresponding to the capture list go here
// 2. constructor for storing captured variables
// 3. call operator - with some specifiers if specified:
auto operator()(parameter_list) const -> /* trailing_return_type, if any, goes here */ {
// lambda function body
}
};
Note that the member function is marked const by default — this means that it is not allowed to do anything that mutates the member variables. We will see later how to get past this restriction.
Warning: this is only a representative correspondence and works well for almost all purposes (for example, the copy assignment operator of lambdas is deleted, but we didn’t show it for the sake of simplicity). It is possible for the C++ lambda behaviour to diverge in some edge cases from the behaviour of this struct, but I haven’t seen any such divergences.
For example, a lambda like the following:
int y = 1, x = 0;
auto lambda = [y, &x](auto a, int b) constexpr mutable -> int {
return sizeof(a) * y + b * x;
};
corresponds to
struct UnnamedLambda {
// captured variables
int y_;
int& x_;
// constructor - constexpr by default if possible
constexpr UnnamedLambda(int y, int& x) : y_{y}, x_{x} {}
// call operator - the constexpr specifier marks this function as constexpr (a bit redundant, since it is constexpr whenever possible anyway), and the mutable specifier removes the const
constexpr auto operator()(auto a, int b) -> int {
return sizeof(a) * y_ + b * x_;
}
};
It is possible to template a lambda starting from C++20, but it is not something that is very useful for competitive programming, so I will omit it for the sake of simplicity, since the syntax has quite a lot of parts already.
Now let’s look at every part in a bit of detail.
Firstly, global/static variables and global constants don’t need to be captured (and can’t be) — they are accessible without any captures and will always be usable inside a lambda like a normal variable. It is similar to being captured by reference by default for those variables.
Let’s say we want to capture a variable `y` by value and `x` by reference (so in the struct, we store a copy of `y` and a reference to `x`). In that case, the capture list becomes `y, &x` or `&x, y`. We can do this for any number of variables in any order — `&x, y, z`, `x, &y, &z, w`, etc.
Now let’s say you don’t want to manually capture all variables that you’re going to be using.
Let’s say you want to capture all variables by value. The capture list becomes `=`.

And if you want to capture all variables by reference, the capture list becomes `&`.
But you’d say, I don’t want to be stuck using the same semantics for all my captured variables. C++ has a way of allowing you to specify variables for which the default semantics are not followed.
If you want to capture all variables by value, but `g` by reference since it is a large graph, your capture list becomes `=, &g` (the default should be at the beginning of the capture list).

If you want to do the opposite and capture all variables by reference, but capture `g` by value, your capture list becomes `&, g`.

Now let’s say that rather than capturing a variable directly, you want to store a value computed by a function, and all other variables should be captured by reference. Then your capture list becomes `&, var = func_call(/* as usual */)`. This can be done for multiple variables. The semantics are the same as if you had done `auto var = func_call()`.

Similarly, if `func_call` returns a reference, you can also use `&var = func_call()`. The semantics are the same as if you had done `auto& var = func_call()`.
Note: when capturing a variable by reference and using lambdas after the variable has been destroyed, you will end up with dangling references where accessing that variable in the lambda is undefined behaviour, just like with any other usage of dangling references — this is one of the reasons why I chose to capture by value instead of by reference when showing the 2 lambda example. Keep this in mind while using lambdas.
This is pretty much what you would use for any other function. In particular, there is also support for default parameters.
Also, it is noteworthy that support for `auto` parameters was added to lambdas back in C++14, but was added for usual functions only in C++20. This shows how much simpler lambdas are for the language itself, when compared to functions. In the corresponding struct, this corresponds to a templated `operator()`, with one instantiation for each set of argument types that is needed.
So your lambdas can look like this (while compiling in C++14 as well):
bool f(int a, int b) { return a < b; }
int main() {
int x = 1;
int z = 0;
auto lambda = [&x, y = f(1, z)](int a, auto b, int c = 1) {
if (y) {
x = c;
return a + b;
} else {
return x - b + c;
}
};
}
For this section, it would be helpful to keep in mind our model of structs corresponding to lambdas, in order to understand how the `mutable` specifier helps, which we will explain below.

You might have noticed that we have not changed the captured value of any variable that we captured by value (the ones we capture by reference can mutate the variables they refer to, so `a = b` is possible for `a` being a reference-captured variable). The reason is that with the correspondence, the call operator is `const` by default.

So, how would we mutate these variables? The answer is the `mutable` specifier — the same keyword that helps you mutate a member variable in a class when you are calling a function that is marked `const`.
We’re now able to do something like this:
auto alternate_even_odd = [i = 0]() mutable {
i ^= 1;
return i;
};
When `mutable` is specified in a lambda, the `const` appended to the `operator()` is removed. This in turn allows us to mutate all captured variables (in this context, that means calling functions/operators/member functions on them that require a non-const reference to them).

There are other specifiers — `constexpr` (since C++17), `consteval` (since C++20) and `static` (since C++23) — but they would not be very useful in a competitive programming context, so we omit them.
Notice that the default syntax for a lambda doesn’t have a return type, but the syntax for C++ functions has a return type.
In fact, you can have an `auto` return type for C++ functions starting from C++14, and the type is deduced using some type deduction rules. By default, these rules apply to lambdas as well.
However, it is possible to have certain ambiguities/underspecification, like the following:
auto f = [](int64_t x) {
if (x < 0) return 0;
return x;
};
or
auto f = [](auto g, int x) {
return g(g, x);
};
In the first, the compiler is unable to deduce whether the lambda’s return type is supposed to be `int` or `int64_t`. In the latter, since the type of `g` is unknown, it is also unknown what `g` returns.
In such cases, we must either cast the return values to a certain type, or specify the return type on our own, like the following:
auto f = [](int64_t x) -> int64_t {
if (x < 0) return 0;
return x;
};
auto f = [](auto g, int x) -> int {
return g(g, x);
};
Okay, we have finally shown what the lambda syntax can do in C++, but we are yet to use lambdas in any application.
In order to use them, we need to know how to do the following, since the type of the corresponding structs of two different lambdas can be different (and the types of two separately defined lambdas are indeed different in C++):
This is very easy, as we have already seen before:
auto f = [](){};
And a cool trick that I didn’t mention earlier: if your parameter list is empty, you can simply drop the `()`.
So the shortest lambda definition looks like this:
auto f = []{};
In C++20, you can do this:
void print_result(int x, auto f) {
std::cout << f(x) << '\n';
}
void print_result_1(int x, auto& f) {
std::cout << f(x) << '\n';
}
void print_result_2(int x, auto&& f) {
std::cout << f(x) << '\n';
}
A small nitpick: the last one is better if you want to pass a lambda by reference if it is an lvalue (informally, something that has an address), but by value if it is not. For example, `print_result_2(2, [](int x) { return 2 * x; });` and `print_result_2(2, some_lambda);` both work, but the first of these doesn’t work with `print_result_1`.
When passing a lambda to a lambda, the above syntax works even for something as old as C++14.
In older standards of C++, like C++11/14/17, you can do this (the `&` and `&&` variants still work, but I omitted them for the sake of not having too much code):
template <typename Func>
void print_result(int x, Func f) {
std::cout << f(x) << '\n';
}
This is something that competitive programmers rarely do, but when they do, a lot of them add unnecessary overheads mostly due to lack of understanding of lambdas/templates.
Firstly I will show how people end up storing lambdas:
struct Storage {
template <typename Func>
Storage(Func f_) : f{f_} {}
// or Storage(auto f_) : f{f_} {}
// or Storage(std::function<int(int)> f_) : f{f_} {}
std::function<int(int)> f;
};
Storage storage([](int x) { return x * 2; });
Other people use this very limited version (we’ll get to why it is limited):
struct Storage {
using F = int(*)(int);
Storage(F f_) : f{f_} {}
F f;
};
Storage storage([](int x) { return 2 * x; });
// Storage storage(+[](int x) { return 2 * x; }); also works, relies on decaying from a lambda to function pointer
The most natural way, however, is to do either this:
template <typename F>
struct Storage {
Storage(F f_) : f{f_} {}
F f;
};
Storage f([](int x) { return 2 * x; });
or
auto make_struct = [](auto f_) {
struct Storage {
std::remove_reference_t<decltype(f_)> f;
};
return Storage{f_};
};
auto s = make_struct([](int x) { return 2 * x; });
For the sake of convenience, let’s name these v1, v2, v3 and v4.
v1 seems to be the most generic way, but has unnecessary overhead if you don’t need to do something like making a vector of lambdas. The reason it works is that `std::function` “erases” the type of the lambda (with a technique called type erasure), but to do that it needs heap allocation and other overheads, and it is often unable to optimize the code in the lambda that interacts with the outside (for instance, recursion). Meanwhile, lambdas can be optimized as well as, if not better than, usual functions, since structs are zero-cost abstractions. I have seen spectacular slowdowns with v1 (50x sometimes) due to lack of optimizations, and I’m of the opinion that unless necessary, type erasure should never be used in a context where performance matters. Of course, no kidding, type erasure is an engineering marvel that deserves its own post, and so does `std::function` as an application of it.
v2 falls back to the concept of a function pointer, which has existed since the old days when C was all the rage. It works decently fine in practice, but has a couple of issues: only captureless lambdas can be converted to a function pointer, so you lose the ability to store state inside the lambda, and calls through a function pointer are opaque to the optimizer, so they are harder to inline. It is much faster than `std::function` in terms of speed, though.

v3 solves all the above problems, in the case you don’t need to be able to store the container in a vector. The only drawback is that since it is a templated struct, it cannot be defined in local scope, as much as we want to. The reason is that in many implementations, symbols in a template must have external linkage, and this is simply impossible for something in a local scope. There is another minor drawback, which is an inconvenience at best — since the template arguments of a struct cannot be partially specified in C++, you must rely on CTAD (class template argument deduction) to make your code shorter. Otherwise you would have to do something like
auto f = [](int x) { return 2 * x; };
Storage<decltype(f)> s(f); // or std::remove_reference_t<decltype(f)> to remove the reference that comes attached with decltype on an lvalue
Coming to v4 — it does all the things that the other solutions do, and since it is a lambda, it doesn’t suffer from the issues that templates do.
As it turns out, v4 is also the most “functional” implementation out of all of them, because you are literally just using functions and nothing object oriented is required.
Is the fact that v4 is the most convenient out of the above 4 just coincidental? I think not.
Anyway, partly since all my library code had been written many years ago, I use v3 for my competitive programming library code — for instance, my implementation that modifies and generalizes the AtCoder library code for lazy segment tree can be found here, starting at line 393. As an aside, it is one of the fastest implementations for that problem, and it used to be the fastest for a very long time until a couple of months ago, and is off from the fastest solution by a few ms. The current fastest solution also uses the v3 pattern for making things generic.
But for anything new, I like using v4 — it is functional and flexible.
The verdict:
As promised, here are some applications of lambdas with some of the STL algorithms/data structures.
Since STL algorithms are implemented mainly as functions or function-like objects, the main use of lambdas is as callbacks to the functions. We are able to pass lambdas as one of the parameters of the STL functions in question, and the lambda is called inside the body, which is why it is a callback.
It is also possible to use them in STL containers during construction.
For the precise functionality of the functions I mention below, I would recommend reading about them on cppreference.
Lambdas are used in multiple forms:
A comparator is something that returns the result of comparison of its inputs. It is used in the following:
std::sort(l, r, comparator);
where the comparator tells whether \(a < b\) holds for inputs \(a\) and \(b\).

A predicate is something that decides whether some condition is true or not, based on its inputs. It is used in functions like `std::find_if`, `std::count_if` and `std::partition`.

A generator is something that produces a new value on each call, as in the following:
std::generate(l, r, [i = 0]() mutable { return i++; });
// the same as std::iota(l, r, 0);
There is a more idiomatic solution using coroutines, but coroutines in C++ allocate on the heap.
Consider this situation: you have to initialize some data (let’s say some result of precomputation), but you want to keep the helper variables used during the computation out of the enclosing scope.
The concept of an immediately invoked lambda comes in here.
In a nutshell, the idea is that we would like to make a new nested scope for it, but scopes are not expressions, while lambdas are.
So your solution would earlier look like this:
std::vector<int> a;
{
// process and modify a
}
and it becomes this:
auto a = [&] {
std::vector<int> a;
// process and modify a
return a;
}(); // notice these parentheses
There are a couple more benefits to this, arising out of the fact that we have now managed to convert a scope into an expression.
So we can do the following easily:
struct A {
A(size_t n, int val) :
a{[&] {
std::vector<int> v(n);
std::fill(v.begin(), v.end(), val);
return v;
}()} {}
std::vector<int> a;
};
Make variables with potentially complicated initialization const if mutation is not needed — this is somewhat of an extension of the previous point, and it often leads to optimized code, even if we ignore the number of errors we would save at compile time.
Precompute and assign at compile time — if we have an array to precompute, we can do it at compile time globally like this:
auto precomputed_array = [] {
std::array<int, 10000> a{};
// preprocess
return a;
}();
Since lambdas are constexpr by default, if everything inside the lambda is constexpr-friendly, the compiler will try to perform this lambda call at compile time, which will in general be faster than computing stuff at runtime, and the compiler can also reason about the data that it has generated at compile time, so it even makes the rest of the program faster.
One small drawback (for Codeforces) is that if there are too many operations in the lambda, you will risk getting a compile time error (telling you to increase your compiler’s constexpr operations limit — for which I have not yet been able to find a way from within the source code).
The compiler is doing the best it can (and the assembly for constexpr-ed code will be much faster than the assembly for non-constexpr-ed code, simply because what is left to run is the bare minimum possible), and for the compiler, there is no limit on computational resources other than the ones due to the machine. Unfortunately, this doesn’t always work out for practical purposes like contests, which impose compile-time and system limitations (including limits built into the compiler, which can sometimes be changed).
One hacky workaround is:
auto something = [] {
int a; // note that it is not initialized, only declared, so it is not constexpr friendly. so the compiler will not constexpr it. it is definitely a code smell, but what can we do?
// rest of the function
}();
This is a workaround that I have been using for a long time now, as a way to prevent things from getting constexpr-ed at compile time. Don’t do this in real-life code though.
We can implement recursion in a Turing machine, but can we do so with lambdas? Recall that we noted that the Y-combinator is a way to do precisely this thing in an untyped lambda calculus.
We can write a somewhat similar analog for it in C++ lambdas too, using the C++14 feature that allows us to have auto arguments.
auto fib = [](auto self, int n) -> int {
if (n <= 1) return n;
return self(self, n - 1) + self(self, n - 2);
};
std::cout << fib(fib, 10) << '\n';
(If you want to be more efficient in the general case, consider using auto&& instead of auto).
Let’s look at how we came up with this implementation.
We don’t have a way to refer to the unnamed lambda inside itself with C++ syntax (and the lambda calculus syntax).
So what is the next best thing we can do? Just add an extra parameter that will (whenever it is needed) be the lambda itself, and remember to call the lambda with itself whenever it is used. This is precisely what the above code does.
As a fun exercise, let’s try writing a function that takes a lambda, which has 2 arguments — one referring to itself, and the other being what it should call (and will eventually become itself). Our target would be to do something like the following:
auto lambda = [](auto self, int n) {
// something
// the recursive call is of the form self(n - 1)
};
auto rec_lambda = add_recursion(lambda);
std::cout << rec_lambda(10) << '\n';
We will ignore all considerations of performance for now, and make copies of everything, for ease of presentation.
Let’s make a function object that provides an overload for operator(), and abstracts the self(self, ...) calls into just self2(...) calls — then we can use self2 as the self parameter in the original lambda’s definition, and the double recursion (which should ideally be optimized away) will allow us to do what we want.
template <typename F>
struct add_recursion_t {
F f;
template <typename X>
auto operator()(X x) {
return f(*this, x);
}
};
template <typename F>
auto add_recursion(F f) { // could have as well been implemented as a lambda
return add_recursion_t{f};
}
This is also what the std::y_combinator proposal does, but with better optimizations that sadly render the code less than readable. The paper is made redundant by the C++23 feature called “deducing this” (which also solves a lot of other problems, not limited to lambdas).
To avoid writing self so many times, it allows us to refer to the lambda itself in its body, and write:
auto fib = [](this auto&& self, int n) { // notice that the -> int is gone
if (n <= 1) return n;
return self(n - 1) + self(n - 2);
};
std::cout << fib(10) << '\n';
You can test this code out on the latest Clang release.
But can we do this in just lambdas? The answer is no.
As we pointed out, the Y-combinator allows us to do this in a lambda calculus (which is untyped by default).
But, there is no way to use C++ lambdas in the exact same manner to implement a Y-combinator, or in fact any fixpoint combinator, because “pure” lambda expressions in C++ follow a simply typed lambda calculus, and simply typed lambda calculus does not admit a fixpoint combinator.
Strong typing is what prevents recursion from being implemented in C++ lambdas by themselves (at least without relying on other features in the function body); it is also why untyped lambda calculus is Turing complete while simply typed lambda calculus is not.
In fact, it can be shown by induction that every program in a simply typed lambda calculus terminates. If a fixpoint combinator existed, we could use it to construct a non-terminating term, a contradiction. More generally, if there were a general recursion mechanism that could emulate a Turing machine, the existence of an infinite loop would contradict the guaranteed termination of every expression in a simply typed lambda calculus. The keyword here is “strongly normalizing”.
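For reference, the classic fixpoint combinator in the untyped lambda calculus is the following (standard material, added here for context):

```latex
Y = \lambda f.\; \bigl(\lambda x.\; f\,(x\,x)\bigr)\,\bigl(\lambda x.\; f\,(x\,x)\bigr),
\qquad
Y\,g = g\,(Y\,g).
```

The self-application x x is exactly what a simple type system rules out: x would need a type A with A = A → B, and no such finite type exists, which is why no fixpoint combinator can be typed.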
If capturing this (referring to the lambda itself) were allowed in a lambda, this would have been possible, and indeed that is how C++23 handles the problem. People use std::function to implement recursive lambdas precisely because they can capture the std::function that they assign the lambda to. The fact that it is a concrete type yet can store a callable of arbitrary type is quite a feat, which is why anyone serious about programming language design should learn how type erasure is implemented.
This is a great resource on how to implement recursion in OCaml and Haskell.
Have you ever wanted a Python generator in your code? Turns out that with mutable state, you can implement stateful lambdas that work in a similar manner.
It should be abundantly clear with the std::generate example we used a while back that stateful lambdas are nothing but objects with a single member function accessible via the call operator, and some mutable state represented by the data captured by them. It is helpful to remember the struct-to-lambda correspondence here.
For instance, let’s say you want to print something that looks like a tuple, and std::apply works on it. How would you print its space-separated members?
std::apply([i = 0ULL] (const auto&... x) mutable -> void {
(((i++ ? (std::cout << ' ') : void()), std::cout << x), ...);
}, tuple_object);
Here we have made abundant use of fold expressions, but the main point is that after expansion, the state of i mutates inside the body each time there is an i++, and a space is printed before every element except the first.
Similarly, you can write a random number generator as follows (taking the example from here):
auto next_random = [
splitmix64 = [](uint64_t x) {
x += 0x9e3779b97f4a7c15;
x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9;
x = (x ^ (x >> 27)) * 0x94d049bb133111eb;
return x ^ (x >> 31);
},
FIXED_RANDOM = std::chrono::steady_clock::now().time_since_epoch().count(),
random = uint64_t{0}
] () mutable {
return random = splitmix64(random + FIXED_RANDOM);
};
while (n--) std::cout << next_random() << '\n';
If you want to, it is also possible to write all of your code in a single immediately-invoked lambda, but that is something I discourage you from writing, unless you want to obfuscate code or just want to challenge your brain.
For example, this code for the latest Div2A works, but I definitely do NOT recommend that you write code this way. However, it is quite instructive to understand why this code works and why certain variations do not, so I definitely DO recommend reading and understanding this — it’ll help cement your understanding of how captures work, what is called when, and so on.
#include "bits/stdc++.h"
int main() {
[rd = [] {
int x; std::cin >> x;
return x;
},
wt = [](auto x) { return (std::cout << x, 0); },
sorted = [](auto x) {
std::sort(x.begin(), x.end());
return x;
},
forn = [](auto f, int n) {
while (n--) f();
return 0;
}] {
[&, _ = forn(
[&]() {
auto n = rd();
auto k = rd();
[&, a = std::vector<int>(n)]() mutable {
[&, _ = std::generate_n(a.begin(), a.size(), rd)]() {
[&, ans = (k == 1 && a != sorted(a) ? "NO\n" : "YES\n")]() {
[_ = wt(ans)]() {
};
}();
}();
}();
},
rd())]() {
}();
}();
}
Note that we claimed that lambda calculus is Turing complete. But can it implement classes? The answer is of course yes.
Can C++ lambdas implement classes? Also yes.
Let’s let go of the implementation of lambdas in C++ for now, and think of implementing objects in terms of functions.
For implementing classes, we need data and member functions. For data, we can just store the data as a capture in a lambda. But what does the lambda do? We should be able to call something using the captured data, so the lambda should take as a parameter whatever the implementation of the member function should be.
So, once we have an object, we can call the data with the function, instead of calling the function on the data.
// constructor
auto range = [](size_t l, size_t r) {
return [l, r](auto f) {
return f(l, r);
};
};
// (free) member functions
auto len = [](size_t x, size_t y) { return y - x; };
auto right = [](size_t x, size_t y) { return y; };
auto left = [](size_t x, size_t y) { return x; };
// usage
auto a = range(1, 5); // this is an object
auto len_a = a(len); // compare with a.len() in usual C++ syntax
And if we want to support recursive functions, we can change the definitions a tiny bit.
auto range = [](size_t l, size_t r) {
return [l, r](auto f) {
return f(f, l, r);
};
};
auto len = [](auto self, size_t x, size_t y) { return y - x; };
auto right = [](auto self, size_t x, size_t y) { return y; };
auto left = [](auto self, size_t x, size_t y) { return x; };
auto depth = [](auto self, size_t x, size_t y) {
if (y - x <= 1) return 0;
return 1 + self(self, x, x + (y - x) / 2);
};
auto a = range(1, 5);
auto len_a = a(len);
Mutual recursion can also be implemented similarly, but would require much more effort. Do let me know if you find some way to implement mutual recursion in a clean manner!
Another classic trick is to combine several lambdas into a single overload set, by inheriting from all of them and pulling in their call operators:
template <typename... T>
struct S : T... {
template <typename... L>
S(L&&... l) : T(std::forward<L>(l))... {}
using T::operator()...;
};
template <typename... L>
S(L&&...) -> S<std::decay_t<L>...>;
This can be used to implement a somewhat generic lambda by providing different implementations like this:
S generic_lambda = {
[](int x) { return 1; },
[](double d) { return 0.0; }
};
std::cout << generic_lambda(0) << '\n';
std::cout << generic_lambda(1.0) << '\n';
But we already have generic lambdas in the language itself (auto parameters since C++14, and explicit template parameter lists since C++20), and they’re pretty interesting in themselves.
Finally, let’s come to some useful code patterns involving lambdas that I use (and used to use) while solving problems.
This seems to be the most popular use case for lambdas, and for good reason — you can avoid code duplication by literally making a code block executable (the & capture does wonders here). It ends up being particularly useful in a variety of scenarios.
For example, although the intent is not very clear in this example (unless we try solving the problem itself), it is clear that there would have been a lot of code duplication without the use of a lambda — just count the number of calls to the lambda.
Most people implement DFS as a function, and to do that, they make their graphs global too.
What happens when we want to do a multi-test problem? Clear graphs each time? Of course that has the potential to be buggy.
So what some other people do is make graphs local and pass them as a parameter. Similarly, if they want to update any arrays/vectors/integers, like visited/color/component/n_components, they pass those as parameters too.
Why is this bad? If you swap two arrays/variables in a call, it would not be a very obvious bug. And it increases the mental context-switching, because now you have to pay attention to every function call. So if you are implementing a particularly involved DFS, you will end up more prone to bugs than you bargained for.
Some smart people realized this and started using local lambdas for DFS, but they used the following trick (which is still not the best trick, I’ll explain why in a moment):
std::function<int(int, int)> dfs = [&](int u, int p) {
int sum = 1;
for (auto v : g[u])
if (v != p) sum += dfs(v, u);
return sum;
};
int sz = dfs(0, -1);
Why it is not the best approach: std::function type-erases its callable, which usually means an extra allocation and an indirect call on every invocation that the compiler typically cannot inline, so the recursion carries avoidable overhead.
Here’s what I use:
auto dfs = [&](auto self, int u, int p = -1) -> int { // can also be auto&& if you want to be careful
int sum = 1;
for (auto v : g[u])
if (v != p) sum += self(self, v, u);
return sum;
};
int sz = dfs(dfs, 0);
Some very high rated programmers use this pattern too, for example, this submission, and for good reason — it pretty much always compiles down to the most optimal code (according to the compiler). The only case I know of where this was slower than a usual implementation was because the compiler was vectorizing the DFS(!), leading to instruction cache misses. That was a very peculiar case, and I don’t expect to encounter it again in the wild, especially not in a problem with a tight TL.
Others use the idea outlined in the now-redundant std::y_combinator proposal, and do something like this:
// template code
template <class Fun>
class y_combinator_result {
Fun fun_;
public:
template<class T>
explicit y_combinator_result(T &&fun): fun_(std::forward<T>(fun)) {}
template<class ...Args>
decltype(auto) operator()(Args &&...args) {
return fun_(std::ref(*this), std::forward<Args>(args)...);
}
};
template<class Fun>
decltype(auto) y_combinator(Fun &&fun) {
return y_combinator_result<std::decay_t<Fun>>(std::forward<Fun>(fun));
}
// actual code
auto dfs = y_combinator([&](auto self, int u, int p = -1) -> int {
int sum = 1;
for (auto v : g[u])
if (v != p) sum += self(v, u);
return sum;
});
int sz = dfs(0);
For instance, this submission.
In fact, this is such a popular pattern that some high-rated people put it on top of their bare minimum templates as well, for instance, here.
The one drawback of this template is that sometimes the compiler is not able to reason through all the abstraction, and the generated code might be slightly slower. Fortunately this is rare, but there is a visible overhead difference in a few cases I have seen.
Library code should ideally be hidden away in a header file, but since most online judges don’t have header file support, we are forced to write library code before any of the actual code begins.
It is also important to ensure that after copypasting library code into your submission, there is absolutely no change that you need to do in the library code unless it is a fundamental difference from the logic of the data structure/algorithm, and you just happen to have all the code at hand. Otherwise, you would be editing a large piece of code that you might have forgotten the details/quirks of, and it is easy to end up with bugs.
To do this, it is quite important to implement your library in a way that is as generic as possible. And there are few things more generic than allowing your data structure to execute arbitrary code.
Hence, it makes for a really good choice to use a lambda as a customization point. In fact, the C++ standard calls certain types of function objects “customization points” which goes on to show that interfacing into library code with functions/function objects is a well-understood/well-adopted design practice that is also very convenient for library usage.
As an example of such a thing, my implementation that modifies and generalizes the AtCoder library code for lazy segment tree can be found here, and it uses lambdas as possible customization points. It not only implements lazy segment trees (with some opt-in optimizations), but it also has functionality for normal segment tree, with barely any overhead compared to a usual segment tree implementation.
This is in fact one of the most common applications. There are countless lines of code among Codeforces submissions that essentially sort an array according to a given predicate: for instance, this code.
There are also a very large number of binary search submissions that follow the template “define a predicate, do binary search on it” — for instance, this code. Note that the manually written binary search could have been avoided by using std::partition_point, but it is not very popular for some reason.
Prefix and suffix sums (with usual addition, multiplication, or xor) can be implemented using std::partial_sum with lambdas as well.
Please let me know if there is anything I missed out, and constructive comments on the content in general are welcome!
This post was originally written on Codeforces; relevant discussion can be found here.
This post is not about the binary GCD algorithm, but rather about exploring theoretical guarantees for an optimization on top of the standard Euclidean algorithm. The STL version is faster than both of the algorithms below because it uses binary GCD, but these algorithms are interesting from a theoretical standpoint.
In particular, my conjecture is that the second algorithm takes at most as many iterations as the first one, and if true, it’d be a pretty surprising claim, given how hard it is to bound Euclidean algorithm runtimes. So, it would be really cool if someone could prove this property.
UPD: thanks to VArtem for confirming that this is the case and pointing to some relevant references.
The usual gcd algorithm looks something like this:
auto gcd = []<std::integral T>(T a, T b) -> T {
while (b) {
a = std::exchange(b, a % b);
}
return std::abs(a);
};
Consider the following implementation, which greedily makes the second argument smaller.
auto gcd2 = []<std::integral T>(T a, T b) -> T {
a = std::abs(a);
b = std::abs(b);
while (b) {
T r = a % b;
if (r > (b >> 1)) r = b - r;
a = std::exchange(b, r); // a = std::exchange(b, std::min(a % b, b - a % b)); also works fine
}
return a;
};
On random data, gcd2 seems to be around 30% faster than gcd on my machine, for int32_t and int64_t arguments.
And it doesn’t seem immediately obvious, but on all inputs I brute-forced, gcd2 always takes at most as many iterations as gcd. A proof is outlined in the comments.
Here’s the benchmarking code I used:
#include "bits/stdc++.h"
template <typename C = std::chrono::steady_clock, typename T1 = std::chrono::nanoseconds, typename T2 = std::chrono::milliseconds>
struct Stopwatch {
std::string_view name;
std::chrono::time_point<C> last_played;
T1 elapsed_time;
bool running{};
Stopwatch(std::string_view s) : name{s}, running{true} { reset(); }
Stopwatch() : Stopwatch("Time") {}
void reset() {
last_played = C::now();
elapsed_time = T1::zero();
}
void pause() {
if (!running) return;
running = false;
elapsed_time += std::chrono::duration_cast<T1>(C::now() - last_played);
}
void play() {
if (running) return;
running = true;
last_played = C::now();
}
int64_t elapsed() const {
return std::chrono::duration_cast<T2>(elapsed_time + (running ? std::chrono::duration_cast<T1>(C::now() - last_played) : T1::zero())).count();
}
void print() const { std::cerr << name << ": " << elapsed() << " ms\n"; }
~Stopwatch() { print(); }
};
auto gcd = []<std::integral T>(T a, T b) -> T {
while (b) {
a = std::exchange(b, a % b);
}
return std::abs(a);
};
auto gcd2 = []<std::integral T>(T a, T b) -> T {
a = std::abs(a);
b = std::abs(b);
while (b) {
T r = a % b;
if (r > (b >> 1)) r = b - r;
a = std::exchange(b, r);
}
return a;
};
template <typename T, typename F>
void run(const std::vector<std::pair<T, T>>& queries, F&& f, std::string_view name) {
Stopwatch sw{name};
T ans = 0;
for (auto& [a, b] : queries) ans ^= f(a, b);
std::cout << ans << '\n';
}
template <typename T>
auto get_random(size_t n) {
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<T> dist(0, std::numeric_limits<T>::max());
std::vector<std::pair<T, T>> v;
while (n--) {
auto n1 = dist(gen), n2 = dist(gen);
v.emplace_back(n1, n2);
}
return v;
}
int main() {
constexpr int N = 1e8;
auto a = get_random<int32_t>(N);
auto b = get_random<int64_t>(N);
run(a, gcd, "gcd_32");
run(a, gcd2, "gcd2_32");
run(b, gcd, "gcd_64");
run(b, gcd2, "gcd2_64");
}
And the timing results I get are:
190237720
gcd_32: 14054 ms
190237720
gcd2_32: 11064 ms
13318666
gcd_64: 47618 ms
13318666
gcd2_64: 36981 ms
When I compare this new algorithm with std::gcd on similar data, there seems to be barely any slowdown with 32-bit ints, but the % operator makes it much slower on 64-bit ints.
Do let me know what you think about it in the comments, and a proof/counterexample would be appreciated!
This post was originally written on Codeforces; some relevant discussion can be found here.
Note: for those who don’t like using the post-increment operator (i++), hopefully this post convinces you that it is not just a convention that C programmers coaxed the world into following out of tradition. Also, all of the following discusses the increment operators in C++, and not C, where the semantics are slightly different.
Disclaimer: I use ++i much more often in code. But i++ has its own place, and I use it — and the generalization I mention — quite frequently wherever it is a sane choice.
Is ++i better than i++?
A lot of people are taught that since i++ also returns the old value of i before the increment, it needs a temporary, and hence it must be slower. Good news: almost all compilers today optimize this part away whenever this behaviour is not needed, and the difference between i++ and ++i doesn’t matter.
So what’s the difference? In C++, turns out that apart from this old/new value difference, there is another difference.
Any expression in C++ that looks like a op= b, where op is a binary operator, in fact “returns” a reference to the variable a after completing the operation. As a result, you can do things like this:
// replaces x by a * x % p
void mod_mul(int64_t& x, int a, int p) {
(x *= a) %= p;
}
Turns out that since you only need the value after the increment for the pre-increment operator, it is possible to have similar behaviour there too, and that is precisely what C++ chose to do:
// replaces x by (x + 1) % p
void mod_inc(int64_t& x, int p) {
++x %= p;
}
However, the post-increment doesn’t enjoy any of these properties. Perhaps this is why the Google C++ Style Guide (which I don’t really like all that much, but will just quote here) recommends
Use the prefix form (++i) of the increment and decrement operators unless you need postfix semantics.
Another part of the language where you can see this distinction in practice is when you overload pre-increment and post-increment operators for a class:
struct Int {
int value{};
Int& operator++() {
std::cout << "pre-incrementing\n";
++value; return *this;
}
Int operator++(int) {
std::cout << "post-incrementing\n";
Int temp = *this;
++value;
return temp;
}
};
If you notice closely, the return type of the pre-increment overload is a reference, which is not the case with the post-increment overload. Using a post-increment also makes a copy, so if you’re using a post-increment on a struct/class object that defines both, it is likely that there is a potential performance improvement lying right there.
Indeed, there is. Notice how in the post-increment code in the previous example, we had to store a temporary. This gives us a hint as to where exactly post-increment is useful: when you need to avoid temporary variables! And one of the places where temporary variables are not needed is when an operation (in this case, increment) must be done after any use of a value.
An example to illustrate this point comes when you want to implement copying of a C-style, null-terminated string (without copying the null terminator so that you’re able to, let’s say, concatenate strings into a buffer). Let’s say we didn’t have post-increment in the language at all. The algorithm looks like this:
void copy_c_string(const char* input, char* output) {
while (*input) {
*output = *input;
++input;
++output;
}
}
The first thing one would say is that there are too many lines of code — you need to check input, handle the input and output pointers separately, and if someone refactoring this code accidentally deletes one of the increments, they will introduce a bug that is hard to find in a large codebase.
Contrast that with the following:
void copy_c_string(const char* input, char* output) {
while (*input) {
*output++ = *input++;
}
}
The intent of moving the pointer forward immediately after every use (and using it exactly once, too) is perfectly conveyed, we don’t have too many lines of code, and all is well. Of course, it takes some time getting used to writing/reading this kind of code, but thinking of it in terms of “immediate operation after use” shows that it works analogously to what RAII is to a manual destructor call.
In fact, the usage of post-increment and pre-decrement (not pre-increment) was (and is) so popular (partly due to widespread use of half-open ranges), that certain architectures in the past had special functionality for them (for example, this). Thus, in some rare cases, it is imperative to use post-increment instead of pre-increment for performance.
Do we have an op= style (compound assignment) generalization of post-increment?
Yes, we do. It is called std::exchange.
And sometimes it functions as an analogue, in a philosophical sense, of std::swap when it comes to idioms like copy-and-swap. std::exchange is ideal for use in implementing move constructors and move assignment operators, if you are of the opinion that std::move should perform cleanup as much as possible.
What it does is this — std::exchange(a, b) returns the old value of a after setting it to b. So i++ is the same as std::exchange(i, i + 1) in terms of semantics.
Similarly, a = std::exchange(b, a) swaps the values of a and b, which shows that it is more general than std::swap, though maybe at a small performance hit for types more complicated than a simple integer. Adding a bunch of std::moves should fix that to some extent, though.
I consider this to be a naming disaster, and believe that if there was a better succinct name that does NOT imply bidirectional flow of data, the C++ committee would probably have tried to do it.
The way I remember it is that a = std::exchange(b, f(a, b)) is (morally) equivalent to std::tie(a, b) = std::make_pair(b, f(a, b)) — that is, the assignments are done at the same time. It is somewhat like storing temporaries for everything that is to be assigned, and then performing all the assignments at once.
A more general example is a = g(a, std::exchange(b, f(a, b))), which does std::tie(a, b) = std::make_pair(g(a, b), f(a, b)) — in a mathematical sense, it allows you to do two parallel assignments in a linear chain-like syntax. I believe this is what brings about the exchange naming (if not the CMPXCHG instruction).
A more intuitive way to read A = std::exchange(B, C) is that A becomes B and B becomes C at the same time.
Writing a = std::exchange(b, a + b) is much easier and less error-prone than auto tmp = a; a = b; b = tmp + b; (in fact, someone pointed out that the initial version of this code was wrong). The same goes for any other linear recurrence in two variables.
Since the Euclidean GCD algorithm is also an affine transformation (especially one where you swap after update), the implementation goes from the less readable/intuitive
std::array<T, 3> extgcd(T a, T b) {
std::array x{T{1}, T{0}};
while (b) {
std::swap(x[0] -= a / b * x[1], x[1]); // x[0] = x[1], x[1] = x[0] - a/b * x[1]
std::swap(a %= b, b); // a = b, b = a % b
}
return {x[0], x[1], a};
}
to
std::array<T, 3> extgcd(T a, T b) {
std::array x{T{1}, T{0}};
while (b) {
x[0] = std::exchange(x[1], x[0] - a / b * x[1]);
a = std::exchange(b, a % b);
}
return {x[0], x[1], a};
}
It is usually error-prone to clear a buffer of, let’s say, updates that you want to process, in the very end of the processing function, as follows:
void flush() {
for (auto x : updates) {
// do something
}
updates.clear(); // easy to forget
}
Sometimes this might lead to an MLE as well, if this flush function is in fact a DFS that is called in a loop after processing the updates, where the graph is built during the function itself, and the memory limit is low or the structure stored in the buffer carries quite a lot of information.
To be able to deal with the first of these issues, people used something of the following sort:
void flush() {
std::vector<Update> updates_tmp;
using std::swap; swap(updates_tmp, updates);
for (auto x : updates_tmp) {
// do something
}
}
This faces multiple issues: it needs an explicitly named temporary, you must remember the swap idiom, and it still leaves behind the updates variable (can you see where we are going with this?).
Consider the following solution using std::exchange:
:
void flush() {
for (auto x : std::exchange(updates, {})) {
// do something
}
}
Which of these two implementations mirrors the intent of the original function? And which one would you choose to write yourself?
While defining move semantics for your class, it is usually a good idea to keep the moved-from object in a defined state. Sometimes this might even be necessary, for instance, when you want to implement something with semantics like std::unique_ptr.
It is possible to avoid having to implement post-increment when pre-increment is already available, by just doing return std::exchange(*this, ++Type{*this});
The “immediately do after use” idea also carries over to locks — it is clearly optimal to hold locks for as short a time as possible, when comparing across the same kinds of software architecture. One way std::exchange can be used to this effect is to take a scoped lock inside a lambda that returns std::exchange(resource, resetted_resource).
In fact, there is a common primitive in lock-free programming that uses the exchange idiom (compare-and-swap, test-and-set, std::atomic_exchange).
Sometimes the only way to increment an iterator is std::next. Using std::exchange, we can implement our own version of post-increment and write generic code using iterators. Since iterators are very general, you might find yourself using this trick in very unexpected places too.
Please let me know what you think about this C++ feature and if there are any mistakes in the above post. Comments are welcome!
This post was originally written as a feature request for Codeforces; some relevant discussion can be found here.
I believe that adding the Boost library on Codeforces would be a great addition for C++ competitive programmers. AtCoder already supports it (alongside GMP and Eigen, but Boost has replacements/wrappers for those: Boost.Multiprecision and Boost.uBLAS). CodeChef supports it too (or at least used to support it at some point, not sure now). I’ve seen at least 3 posts by people trying to use Boost on Codeforces and failing, and on thinking about it, I couldn’t really come up with a good reason (in my opinion) that Boost should not be supported on Codeforces.
Some might argue against adding a library on Codeforces, but for them, I recommend viewing this post as an educational resource on some of the library’s features that I find cool.
There are a lot of things in the STL in the newer C++ standards, whose implementations were originally inspired from Boost, so you could be able to use some features that might not be available until the next few C++ standards.
As of now, the latest version of Boost is 1.83.0, and its documentation can be found here.
Here are a few features from Boost that are pretty useful for competitive programming:
Algorithms from Boost.Algorithm — this includes algorithms like KMP and binary exponentiation (power for arbitrary associative operations), and makes for much cleaner code with significantly less effort. It also includes some “mini” algorithms that are annoying to write but come up in technical contexts (for example, gather, which gets used a lot in divide and conquer problems). Some of its algorithms ended up in the STL, like partition_point for doing binary search.
Containers from Boost.Container — this includes a multitude of containers that are not in the C++ standard yet. My personal favourites are flat_map and flat_set, which are much better implementations of the corresponding associative data structures than the STL ones, and small_vector, which does similar optimizations as the basic_string STL container but has less stringent conditions on the types it can contain.
Better bitsets using the dynamic_bitset library — there was a recent post about dynamic bitsets in Python, where I mentioned a way of getting them using a hack with C++ STL bitsets. However, dynamic bitsets are supported in Boost, and the Boost implementation also exposes functions like find_first, find_next, range set, range flip and so on.
Graph algorithms from the Boost Graph Library — it contains practically every useful graph algorithm that we encounter in competitive programming.
Different kinds of heaps with Boost.Heap — this contains multiple types of heaps with a lot of useful specialized functions. Ever wanted to try out a Fibonacci heap? Or just any heap with decrease-min or deletion operations? This library will cater to all your needs.
Interval handling using Boost.Icl — often we want to handle ranges of integers in competitive programming problems. This library gives us ready-made data structures to solve such problems, without having to think about corner cases that arise with interval sets. This solution is a prime example that would be much easier to implement with this library.
Intrusive containers using Boost.Intrusive — there are times when we want to implement a list manually, with some additional information inside the node. For example, there is an implementation of adjacency lists that is quite popular in China, which makes an array of nodes, and inside those nodes it stores pointers/indices of the next/previous edges. Doing that using an std::list would be quite painful, and intrusive lists are a reasonable compromise. In general, intrusive containers tend to be faster, so if you like constant optimization, this library is pretty nice to know about.
Boost.Math — this contains a ton of useful math features. Unfortunately, the polynomial library they provide doesn’t have fast operations (polynomial multiplication is quadratic for example), but the rest of the functionality is pretty good. It has things like continued fractions, root finding and quaternions, all of which I have seen in competitive programming contests.
BigInt using Boost.Multiprecision — self-explanatory.
Linear algebra using Boost uBLAS — this is a linear algebra library that has pretty much all the functionality one would ever need in a competitive programming problem. However, this library is quite old, and it is usually recommended to use other libraries for that reason alone (though I don’t find this off-putting). The one thing you need to do to squeeze out performance is to define the macro BOOST_UBLAS_NDEBUG.
Hopefully this was enough to convince some people that Boost is quite a handy library that might be worth putting effort into learning, and hopefully it convinces MikeMirzayanov to add it on Codeforces.
This post was originally written on Codeforces; relevant discussion can be found here.
Disclaimer: Please verify that whichever of the following solutions you choose to use indeed increases the stack limit for the execution of the solution, since different OSes/versions can behave differently.
On Linux (and probably MacOS), running ulimit -s unlimited in every instance of the shell you use to execute your program will work just fine (assuming you don’t have a hard limit). On some compilers, passing -Wl,--stack=268435456 to the compiler options should give you a 256MB stack.
But a much easier way from my comment a couple of years ago, originally from here, is to do this in your Meta Hacker Cup template:
#include <bits/stdc++.h>
using namespace std;
void main_() {
    // implement your solution here
}

static void run_with_stack_size(void (*func)(void), size_t stsize) {
    char *stack, *send;
    stack = (char *)malloc(stsize);
    send = stack + stsize - 16;
    send = (char *)((uintptr_t)send / 16 * 16);  // align the new stack top to 16 bytes
    asm volatile(
        "mov %%rsp, (%0)\n"  // save the old stack pointer at the top of the new stack
        "mov %0, %%rsp\n"    // switch to the newly allocated stack
        :
        : "r"(send));
    func();
    asm volatile("mov (%0), %%rsp\n" : : "r"(send));  // restore the old stack pointer
    free(stack);
}

int main() {
    run_with_stack_size(main_, 1024 * 1024 * 1024); // run with a 1 GiB stack
    return 0;
}
It apparently works on Windows too, as a commenter confirmed, and I believe it should work on MacOS as well.
UPD: To use this on 32-bit systems/compilers, replacing %rsp with %esp works. I haven’t tried porting this to ARM, so any solutions that do the same thing on ARM would be appreciated.
This post was originally written on Codeforces; relevant discussion can be found here.
TL;DR
Most people doing competitive programming who learn about the Floyd-Warshall algorithm learn it in the context of computing the lengths of the shortest paths between all pairs of vertices in a graph on \(n\) vertices in \(O(n^3)\), which is better than naively running Dijkstra or other kinds of shortest-path algorithms (other than Johnson’s algorithm for the sparse case), and can also identify negative cycles.
However, this is not what happened historically. Kleene’s algorithm for converting a DFA to a regular expression was published in 1959, and the algorithm resembles Floyd-Warshall’s algorithm in many aspects. The transitive closure problem was first solved in a paper by Bernard Roy in 1959 and an algorithm to solve the all-pairs-shortest-paths (APSP) problem was mentioned as an extension of the algorithm, but this publication went unnoticed. Stephen Warshall solved the transitive closure problem in 1962, and Robert Floyd solved the APSP problem in a paper in 1962 — all of these used essentially the same algorithm. Peter Ingerman came up with the three nested loops algorithm that we know about and use today.
As it turns out, the basic idea behind these algorithms is the same. They’re all dynamic programming algorithms, and they use the same states as well. We’ll see a couple of these applications and then put forth a generalization that can be black-boxed. Feel free to skip through these sections if you already know about the connections between them.
The actual statement of all-pairs-shortest-paths is slightly different, but we rephrase it here in order to fit Floyd-Warshall’s capabilities.
Given a weighted directed graph \(G = (V, E, w)\), construct a matrix \(D\) with \(d_{ij}\) being the length of the shortest path from \(i\) to \(j\), where the length of a path is defined as the sum of weights of the edges in the path.
The Floyd-Warshall algorithm is directly applicable to the special case when the edge weights are non-negative (in that case, shortest paths and shortest walks coincide, as discussed below).
Statement: Given a weighted directed graph \(G = (V, E, w)\), construct a matrix \(D\) with \(d_{ij}\) being the length of the shortest walk from \(i\) to \(j\), where the length of a walk is defined as the sum of the weights of the edges (with multiplicity) in the walk.
There are a few natural questions that might pop up, and they’re addressed as follows:
An undirected graph is a special case of this setup — for an undirected edge \(\{u, v\}\), add directed edges \((u, v)\) and \((v, u)\).
When \(w\) is such that there is a negative cycle (i.e. a cycle with sum of weights of its edges being negative), then for some pairs of vertices, it is possible that \(d_{ij}\) is not defined (i.e., given a walk between two vertices, we can always find a shorter walk). For example, consider vertices \(u\) and \(v\) such that the negative cycle is reachable from \(u\) and \(v\) is reachable from the negative cycle. Then we can go around the negative cycle any number of times to get an arbitrarily short walk. It can also be shown that this is the only case where such a thing happens. Fortunately, it is possible to detect this scenario using the output of the algorithm.
Informally, a walk is a sequence of edges connected at their endpoints, and all edges (when connected at the endpoints) point in the same direction along the sequence. By most definitions, a path is a walk without any repeated vertices.
So the shortest path problem would need a shortest walk without any repeated vertices. However, when negative edge weights are allowed, even in an undirected setting, you can reduce the Hamiltonian path problem to the decision version of this problem, and hence the decision version of the shortest path problem here is NP-complete. For the reduction, this would be a good resource.
However, in the common case of non-negative edge weights, shortest path is equivalent to shortest walk (by eliminating any cycles).
Now that these are out of our way, we will make the following simplifying assumption, which makes the output of the algorithm the answer to the problem: the graph has no negative cycles.
Note that in the case of undirected graphs, this also rules out the possibility of having a negative edge, since the edges \((u, v)\) and \((v, u)\) would form a negative cycle. Finding a shortest path in the presence of a negative edge is NP-complete, as mentioned earlier.
Now let’s look at how we would solve the problem using a recursion. Let’s order the vertices arbitrarily (say \(1, 2, \dots, n\)). Define \(d(i, j, k)\) as the length of the shortest walk between \(i\) and \(j\) subject to the constraint that all vertices between \(i\) and \(j\) (excluding both) on the walk must be indexed with something \(\le k\). Let’s call such a shortest walk a \(k\)-good walk.
The base case is \(d(i, j, 0)\) — which corresponds to a single edge or none. By our assumption that there are no negative cycles (in particular, no negative self-loops), \(d(i, i, 0) = 0\). Similarly, for \(i \ne j\), \(d(i, j, 0)\) is \(\infty\) or \(w(i, j)\) depending on whether or not there is an edge from \(i\) to \(j\). Later, we will again use the fact that it is always optimal to ignore self-loops.
Let’s say we have \(d(i, j, k - 1)\) for all \(i, j\). How do we extend this to \(d(i, j, k)\)? Consider the following two cases: either a \(k\)-good walk avoids vertex \(k\) entirely, in which case \(d(i, j, k) = d(i, j, k - 1)\); or it passes through \(k\), and splitting the walk at \(k\) gives \(d(i, j, k) = d(i, k, k - 1) + d(k, j, k - 1)\).
Hence, an algorithm would look like this:
Initialize d(i, j, 0) for all i, j
for k = 1 to n:
    for i = 1 to n:
        for j = 1 to n:
            d(i, j, k) = min(d(i, j, k - 1), d(i, k, k - 1) + d(k, j, k - 1))
We can get away with just \(O(n^2)\) memory by maintaining two matrices \(d\): one for \(k - 1\) and the other for \(k\). In fact, we can get away with a single matrix in this special case: since \(d(i, j, k) \le d(i, j, k - 1)\), overwriting entries in place never makes the answer to any future value of \(d\) worse, and any value we compute corresponds to a valid walk from \(i\) to \(j\) with all internal vertices \(\le k\) — hence the computed value isn’t smaller than the actual value either. So it is indeed the correct value.
So the algorithm reduces to the following:
Initialize d(i, j) for all i, j
for k = 1 to n:
    for i = 1 to n:
        for j = 1 to n:
            d(i, j) = min(d(i, j), d(i, k) + d(k, j))
I would refer you to this cp-algorithms article for negative cycle detection and shortest path reconstruction using the same algorithm.
Statement: The transitive closure of a binary relation \(R\) on a set \(X\) is the smallest relation \(S\) on \(X\) that contains \(R\) and is transitive, i.e., if \(x S y\) and \(y S z\), then \(x S z\). Compute the transitive closure of a given binary relation \(R\) on a finite set \(X\) of \(n\) elements.
In other words, if we treat the relation as a graph, with vertices being the elements of \(X\) and the directed edges \((u, v)\) existing iff \(uRv\), then we just need to compute the pairs \((u, v)\) of vertices such that there is a path from \(u\) to \(v\) of length at least \(1\).
All of the following approaches need a few minor changes for paths of length at least \(1\), but they work without any changes for the reflexive transitive closure problem (where \(S\) is also required to be reflexive).
This can independently be solved using the APSP setup we developed before, as well as some other ad-hoc algorithms. Let’s describe a few approaches:
Reduction to APSP: treat every edge as having weight \(1\) and run the algorithm above; then \((u, v)\) is in the closure iff \(d(u, v)\) is finite. This is equivalent to the following:
for k = 1 to n
    for i = 1 to n
        for j = 1 to n
            reachable(i, j) = reachable(i, j) or (reachable(i, k) and reachable(k, j))
BFS: Running a BFS for every vertex leads to an \(O(VE)\) algorithm.
SCCs and finding reachable nodes in a DAG: As the name suggests, we preprocess the graph by condensing strongly connected components into one vertex each, and then apply the first technique here.
Algebra: Let’s consider one iteration of a level-wise traversal (rather than a BFS traversal) beginning at a vertex \(i\). Suppose we have a boolean vector whose \(j\)-th element tells us whether vertex \(j\) is reachable from \(i\) or not using the currently traversed edges. Consider the variant of “matrix multiplication” where \(+\) is replaced by boolean OR and \(\cdot\) is replaced by boolean AND. If we “multiply” this boolean vector by the adjacency matrix \(M\) of the graph (with diagonal elements being \(1\)), then the resulting vector corresponds to the next level of the level-wise traversal. Starting with the vectors \((0, 0, \dots, 1, \dots, 0)\) where \(1\) is at the \(i\)-th position, and multiplying them by \(M^{n-1}\) (exponentiation done by this “matrix multiplication” — which is also associative) gives us the “reachability vector” from vertex \(i\) (since there are at most \(n-1\) levels). In other words, the answer is just \(M^{n-1}\).
Take a moment to consider how the reduction to APSP and the matrix exponentiation algorithm are related — it is left as an exercise for the reader.
Statement: Given a DFA \(D\), find a regular expression that accepts the same language as \(D\).
A very short description of what we want to do in this problem (after reducing some obvious cases to fit the current context): we’re given a graph \(G\) with labelled edges (such that labels for outgoing edges from a vertex are distinct), and there is a source and a sink. We want to find a regular expression that recognizes all possible strings formed by sequentially reading labels on edges of a walk from the source to the sink.
At this point, it should be obvious how we’ll approach this problem: let \(r(i, j, k)\) be a regular expression that recognizes all strings formed by a walk from vertices \(i\) to \(j\) with all internal vertices on the path being \(\le k\). The DP transitions are left as an exercise to the reader, and if you get stuck somewhere, refer to this. Remember that self-loops are now important.
In all of these problems, you might have noticed a pattern. There are aggregations across paths and aggregations along paths. These also need to be compatible in certain ways (i.e., we can combine the results of aggregations along paths with some kind of distributivity) — for instance, in the Floyd-Warshall algorithm, we needed the min operation to be compatible with +.
Kleene algebras are the underlying structures on which the Floyd-Warshall algorithm works. For the ease of presentation, we will first look at the definition (you can pick up a proof of the Floyd Warshall algorithm as applied to the problems above, and derive these conditions yourself, but it’ll be quite a bit of work):
A Kleene algebra is a set \(K\) with two binary operators \(+, \cdot : K \times K \to K\), and a unary operator \({}^* : K \to K\), that satisfies the following axioms:
1. \((K, +, \cdot, 0, 1)\) is a semiring: \(+\) is commutative and associative with identity \(0\), \(\cdot\) is associative with identity \(1\) and distributes over \(+\), and \(0\) annihilates under \(\cdot\).
2. \(+\) is idempotent: \(a + a = a\) for all \(a\) (this makes \(a \le b \iff a + b = b\) a partial order).
3. \(1 + a a^* \le a^*\) and \(1 + a^* a \le a^*\), and \(a^*\) is least with this property: if \(a x \le x\) then \(a^* x \le x\), and symmetrically, if \(x a \le x\) then \(x a^* \le x\).
Note that given a structure without \(*\) but satisfying the first two properties (which is called an idempotent semiring, and in general, a structure satisfying the first property is called a semiring), we can derive a natural closure operator \(*\) as the following (it is easy to check that it satisfies the above properties):
\[a^* = \sum_{i=0}^\infty a^i = 1 + a + aa + aaa + \dots\]
Obviously, this definition is subject to the expression on the right being well-defined (you need a concept of an infinite sum, which inevitably brings the idea of a limit into the picture, but I won’t go into those details).
Now that we have this definition of a Kleene algebra, let’s look at the underlying Kleene algebras for the above problems:
- Shortest walks: the tropical (min, +) semiring on \(\mathbb{R} \cup \{\infty\}\), with \(+\) being min, \(\cdot\) being addition, \(0\) being \(\infty\) and \(1\) being the real number \(0\); here \(a^* = 0\) (the identity) whenever \(a \ge 0\).
- Transitive closure: the Boolean semiring \(\{0, 1\}\) with OR and AND, where \(a^* = 1\) for every \(a\).
- DFA to regular expression: regular expressions (equivalently, regular languages) over the alphabet, with union, concatenation and the Kleene star.
As we will see below, the algebraic path problem will capture all of these problems into one single problem. Note that you don’t really need idempotence in defining the problem (or even defining the natural closure), but it is needed for the algorithm to work.
Consider the following problem on an idempotent semiring:
Given a weighted directed graph with \(n\) vertices and with edge weights coming from an idempotent semiring, find the sum of weights of all walks from a vertex \(i\) to another vertex \(j\). The weight of a walk is defined as the product of weights along the walk.
Note that \(n \times n\) matrices with entries coming from an idempotent semiring also form an idempotent semiring, as should be easy to verify. Now the adjacency matrix \(M\) of this graph is a member of this idempotent semiring too (and correspondingly, its natural Kleene algebra).
Using induction, we see that the sum of weights of walks of length \(k\) from vertex \(i\) to vertex \(j\) is the \((i, j)\)-th entry of \(M^k\). So the sum of weights over all walks corresponds to the matrix \(\sum_{i=0}^\infty M^i = M^*\).
This works for all idempotent semirings, and Kleene’s algorithm works as a drop-in replacement for the Kleene algebra induced by any idempotent semiring.
As a consequence, we also get the fact that for any idempotent semiring, Kleene’s algorithm leads to an efficient algorithm for computing the natural closure of any matrix on that idempotent semiring.
Application to transitive reduction: Note that this is quite useful for reasoning about transitive reduction as well. \(M \cdot M \cdot M^*\) in the Boolean algebra tells us whether there is a walk of length at least 2 between two vertices, so the transitive reduction (of a DAG) is just \(M \land \lnot(M \cdot M \cdot M^*)\).
To avoid making this post too long and technical, I would recommend this paper and this paper for treatments that consider matrix algorithms reminiscent of Gaussian elimination to solve path problems using matrix algebra in the semiring context. You can refer to this paper that treats some computational aspects of it. The idea of a valuation algebra is a further generalization that shows connections to more problems of various types.
There aren’t really many problems based on this topic (let’s call them Kleene-type algorithms) that I know of, but here are a few that require similar methods. Let me know in the comments if you know some more related problems.
This post was originally written on Codeforces; relevant discussion can be found here.
It’s been about 5 years since the round took place, and since there was no editorial all this time, I decided to write one myself.
Thanks to tfg for discussing problem F with me and coming up with the permutation interpretation that makes it much easier to reason about the problem. The idea in F was also inspired by other submissions.
Think about two cases: when there are very few 8s, and when there are a lot of 8s.
Let the number of \(8\)-s given to us be \(m\). We cannot make more than \(\lfloor n / 11 \rfloor\) phone numbers, because every phone number uses \(11\) cards; and since every phone number needs a card with an \(8\) on it, the number of phone numbers can’t be more than \(m\). So an upper bound on the number of phone numbers is \(\min(m, \lfloor n / 11\rfloor)\).
We claim that this is indeed achievable. Indeed, we make the following two cases:
\(m \le n / 11\): this means \(n \ge 11 m \implies n - 10m \ge m\). So we have sufficient cards to assign \(10\) cards (without an \(8\) on them) to each card with an \(8\) on it, and this leads to \(m\) phone numbers.
\(m > n / 11\): this means that we can pick \(\lfloor n / 11 \rfloor\) cards with \(8\) on each of them, and assign \(10 \lfloor n / 11 \rfloor\) cards out of the rest to them in order to construct \(\lfloor n / 11 \rfloor\) phone numbers. This is possible since the total number of cards used is \(\lfloor n / 11 \rfloor + 10 \lfloor n / 11 \rfloor = 11 \lfloor n / 11 \rfloor \le n\), the last of which is true since \(n\) is non-negative.
The time complexity is \(O(n)\).
Submission: [submission:196688903]
What happens to the digit sum of the sum of two numbers (compared to their individual digit sums) when there is no carry-bit when adding two numbers? When there is one carry-bit? Two carry-bits? Can you quantify this result? One might guess that we want to maximize the number of carry-bits, and that is correct.
First, let’s prove a claim showing that the intuition was correct.
Claim 1. Let \(a, b\) be two non-negative integers. Let the number of carry-bits encountered while adding them be \(c\). Then \(S(a) + S(b) = S(a + b) + 9c\).
Proof. We will induct on the number of carry-bits in the addition. For this proof to be made amenable to induction, we will need the following observation, and a slight restructuring of the claim.
Observation 1.1. The carry bit when adding two numbers can be at most 1.
Proof. Induction on the current position in the process of addition. At the rightmost position, the carry bit is \(0\). So the next carry bit can at most be \(\lfloor (9 + 9) / 10 \rfloor = 1\), and the base case is shown. Let’s say we are at some position \(i\) from the right. Then the carry bit is at most \(1\). So the next carry bit can be at most \(\lfloor (1 + 9 + 9) / 10 \rfloor = 1\), and we are done.
Claim 1.2. (Relaxation of claim 1): Let \(a, b\) be two non-negative integers, and let \(\varepsilon\) be one of \(0\) and \(1\). Let the number of carry bits encountered while adding \(a, b, \varepsilon\) be \(c\). Then \(S(a) + S(b) + S(\varepsilon) = S(a + b + \varepsilon) + 9c\).
Proof. Let’s denote by \(\ell(a)\) the number of digits in \(a\) for \(a > 0\) and let \(\ell(0) = 0\). We will induct on \(\max(\ell(a), \ell(b))\). The base case is trivial. For the inductive step, note that when there is a carry in the last place, then the carry bit \(h\) can be at most \(\lfloor (1 + 9 + 9) / 10 \rfloor = 1\). Applying the inductive hypothesis to \((\lfloor a/10 \rfloor, \lfloor b/10 \rfloor, h)\), we get \(S(\lfloor a/10 \rfloor) + S(\lfloor b/10 \rfloor) + S(h) = S(\lfloor a/10 \rfloor + \lfloor b/10 \rfloor + h) + 9c'\), where \(c'\) is the number of carry bits encountered in this smaller addition (so \(c = c' + h\)). Let’s now see what happened at the last place.
We consider the quantity \(a \% 10 + b \% 10 + \varepsilon - (a + b + \varepsilon) \% 10 - h\). Since \(a \% 10 + b \% 10 + \varepsilon = (a + b + \varepsilon) \% 10 + 10h\), this quantity equals \(9h\).
Now note that this combined with the conclusion from the inductive hypothesis proves the claim. \(\blacksquare\)
As a consequence of claim 1, it is easy to see that we want \(a, b\) satisfying \(a + b = n\) that maximize the number of carry bits when added. By induction on the length of the suffix of the decimal representation of \(n\) that consists only of \(9\)-s, we can show that it is impossible to get any carry bit at those positions. Let’s remove the largest suffix consisting entirely of \(9\)-s from \(n\). It suffices to find the answer for this \(n\) now.
So either you run out of \(9\)-s in the suffix (in which case \(n = 10^k - 1\) for some positive integer \(k\)), or there is a digit that is not \(9\). In the former case, the answer is just the sum of digits of \(n\). In the latter, note that we can’t get more than \(\ell(n) - 1\) carry bits. This is also achievable: consider \(b = 10^{\ell(n) - 1} - 1\). (or \(1\) if \(\ell(n) = 0\)).
The time complexity is \(O(\log n)\).
Submission: [submission:196689846]
Can we get a closed form for the sum of elements of the subrectangle?
Note that for a subrectangle corresponding to \(x_1, y_1, x_2, y_2\) as in the statement, \(c_{i, j}\) corresponds to the term we get when we multiply \((\sum_{i = x_1}^{x_2} a_i) \cdot (\sum_{j = y_1}^{y_2} b_j)\). This allows us to decouple the rows and the columns of the matrix.
More specifically, consider the subarray of \(a\) with indices from \(x_1\) to \(x_2\) (let’s call it \(a[x_1 : x_2]\)), and also consider the subarray of \(b\) with indices from \(y_1\) to \(y_2\) (let’s call it \(b[y_1 : y_2]\)). Define the cost of a subarray as the sum of elements in it. Then the sum of the subrectangle equals the cost of \(a[x_1 : x_2]\) times the cost of \(b[y_1 : y_2]\). So now, we focus on only the one dimensional arrays given to us.
For a given value \(C\) of the cost, let’s try to find the maximum length of a subarray of \(a\) with cost \(C\) (let’s call this \(\ell(C)\)). Using this, we can compute the maximum length of a subarray that has a cost at most \(C\), by finding the prefix maximums of \(\ell\). Let’s also do this for the array \(b\).
The problem now reduces to finding \(\max A_i \cdot B_j\) subject to the constraints \(i \cdot j \le x\), for two monotonically non-decreasing arrays \(A\) and \(B\) (the reduction follows by setting \(A\) to be the array found in the previous paragraph for \(a\), and \(B\) to be the analogue of \(A\) for \(b\)).
Now, this problem can be solved in linear time in the sizes of the arrays using two pointers: let’s iterate over \(i\) from right to left. Note that the rightmost \(j\) such that \(i \cdot j \le x\) is in fact \(\lfloor x / i \rfloor\), and this is non-decreasing as \(i\) decreases. In fact, since \(B\) is non-decreasing, two pointers are redundant, and we can simply query \(B_{\lfloor x / i \rfloor}\).
Note that there is one potential catch: the sizes of \(A\) and \(B\) might be large. But the constraints on the arrays \(a\) and \(b\) show that the maximum size is \(4 \cdot 10^6\), and the constraints on \(n\) show that we can run an \(O(n^2)\) loop to compute the arrays \(A\) and \(B\) from \(a\) and \(b\) fast enough.
The time complexity is \(O(\max(\sum_{i = 1}^n a_i, \sum_{i = 1}^m b_i) + n^2 + m^2)\).
Submission: [submission:196755317]
There are two possible approaches to this problem:
Firstly, let’s try finding a lower bound on the answer. To visualize this setup better, let’s construct a graph where there is an edge from a person to the person on their right. Then the resulting graph is a union of isolated cycles (and possibly self-loops).
Clearly, for a person, there is exactly one person to their right, and every person has exactly one person on their left. So the function that maps a person to the person on their right is a permutation. Also, note that every permutation has a cycle decomposition, and by arranging people according to the cycles in the decomposition, we can recover a valid seating plan. In other words, there is a bijection between seating plans and permutations.
Any edge from a person \(i\) to person \(j\) contributes \(\max(r_i, l_j)\) to the answer, and every person contributes \(1\). So we want to minimize \(n + \sum_{i = 1}^n \max(r_i, l_{p_i})\) over all permutations \(p\) of size \(n\). We claim that this expression is minimized when \(r_i\) and \(l_{p_i}\) are sorted the same way. The proof will be via construction.
Without loss of generality, let’s say \(r\) is sorted (reindex people if this is not the case).
Consider a permutation \(p\) with at least one inversion. Note that here an inversion is defined as \(l_{p_j} > l_{p_i}\) for \(i < j\) — in particular, we are not looking at inversions in \(p\), but in the “permutation” \(l \circ p\), or more precisely the composition of the permutation that you get when you map elements of \(l\) to their ranks (ties broken arbitrarily), and \(p\).
Every permutation with at least one inversion has at least one inversion with adjacent indices. Let’s say those indices are \((i, i + 1)\). This means \(l_{p_{i + 1}} < l_{p_i}\) and \(r_i \le r_{i + 1}\). We now construct the permutation \(p’\) by applying a swap \((i, i + 1)\) to \(p\). We claim that \(p’\) is at least as good as \(p\). In particular, we just need to show that for positive integers \(a \le b\) and \(c < d\), the following holds: \(\max(a, c) + \max(b, d) \le \max(a, d) + \max(b, c)\), or equivalently, that \(\max(a, d) - \max(a, c) \ge \max(b, d) - \max(b, c)\). Consider the function \(f(x) = \max(x, d) - \max(x, c)\). For \(x \le c\) it evaluates to \(d - c\), for \(c < x \le d\) it evaluates to \(d - x\), and for \(d < x\) it evaluates to \(0\). So, this is a non-increasing function, and we are done.
This operation (replacing \(p\) by \(p’\)) reduces the number of inversions by exactly one. By doing this process while there is at least one inversion, we come to a permutation such that \(l \circ p\) is the identity permutation — that is, \(l\) and \(r\) are sorted the same way.
Now that we are done with the proof, we have shown that there exists an optimal seating plan which requires \(n + \sum_{i = 1}^n \max(L_i, R_i)\) chairs, where \(L, R\) are the arrays you get when you sort \(l, r\) respectively.
The time complexity is \(O(n \log n)\) for sorting.
Submission: [submission:196760424]
Can we map shortest paths in the new graph to shortest paths in the tree?
Consider a shortest path between two vertices \(u\) and \(v\) in the new graph. It consists of two types of edges: edges that were originally in the tree, and edges that were added to the tree. Let’s call them old and new edges respectively.
We construct a walk in the original tree corresponding to this shortest path: for every new edge, replace it by two adjacent old edges due to which this new edge was added.
So if the number of old edges is \(n_1\) and the number of new edges is \(n_2\), the length of the considered shortest path is \(n_1 + n_2\), and the length of the walk in the original tree is \(n_1 + 2 n_2\). By reducing this walk to a path, we get a path in the tree of length \(\le n_1 + 2 n_2\). Note that this is at most twice the length of the considered shortest path.
Conversely, from a shortest path (in fact, the unique path) of length \(l\) in the tree between \(u\) and \(v\), we can construct a path of length \(\lceil l / 2 \rceil\) in the new graph. Since \(l \le n_1 + 2 n_2 \le 2 (n_1 + n_2)\), we have \(\lceil l / 2 \rceil \le n_1 + n_2\), so this is indeed a shortest path between the two vertices in the new graph.
This shows that for vertices \(u, v\), the distance between them in the new graph is \(\lceil d(u, v) / 2 \rceil\).
Now we only need to compute the sum of this expression over all \(u, v\). \(\lceil d(u, v) / 2 \rceil\) equals \(d(u, v) / 2\) if the distance between \(u, v\) is even, and \((d(u, v) + 1) / 2\) otherwise. In other words, we just need to compute the sum of \(d(u, v)\) over all vertices, and the number of pairs of vertices that are at an odd distance to each other. Since we can color a tree in 2 colors, we just need to find the product of the sizes of the two bipartitions.
Computing the sum of \(d(u, v)\) is standard: every edge contributes \(1\) to each pair \((u, v)\) of vertices such that removing that edge will disconnect these vertices. So we can root the tree, do a DFS, and then compute the number of such pairs for each edge quite easily.
The time complexity is \(O(n)\).
Submission: [submission:196767377]
Fix the vertex for which we want the answer, and call it the root. Let’s try to think about the operations sequentially. We have a total of \(2^{n-1} (n-1)!\) choices, corresponding to sequences of (edge, choice) pairs, where the choice is which label is assigned to the collapsed vertex. Try to count the sequences of these choices that allow the root to survive. A probabilistic interpretation works too.
Read the previous spoiler if not done already for the basic intuition behind the solution.
Let \(\sigma\) and \(\pi\) be two permutations. Applying the mapping \(x \mapsto x + |\sigma|\) to the array \(\pi\) and shuffling the elements of \(\sigma\) and \(\pi\) together, while keeping their internal order the same will generate another permutation with size \(|\sigma| + |\pi|\). We shall call this operation interleaving. Note that when \(\sigma\) and \(\pi\) vary over all permutations, and the interleavings are taken over all possible \(\binom{|\sigma| + |\pi|}{|\pi|}\) interleavings, this generates all possible permutations of size \(|\sigma| + |\pi|\). We can generalize this to adding an extra bit of information with every element of \(\sigma\) and \(\pi\), and an analogous result holds true for this setup too.
We will look at the choices as interleavings of the choices for each of the children’s subtrees (with an extra edge added on top for each subtree). The operation corresponding to adding an extra edge corresponds to interleaving the permutation \(\{1\}\) with the permutation for the subtree (and there is precisely one choice that must be taken whenever we consider the extra edge). We’ll call this edge the hanging edge for that subtree.
Now at any stage of the interleaving, note what the structure of the tree looks like. It might be helpful to consider the shrinking operations in the following way: initially all edges are represented with dotted lines (implying that they’re just a template for connections, and don’t actually connect anything at this stage), every vertex is its own representative, and every shrinkage converts a dotted line into a solid line and sets the representative of the set of vertices connected by this edge to one of the two representatives randomly. This is basically how DSU works.
In this interpretation, consider the component containing the root. The choices made for the children subtrees that have only one correct choice (for the root to survive) correspond to the choices that end up in the root’s component. Let’s un-interleave these choices. Then this problem reduces to the same problem but for the child subtrees. Also, note that when interleaving these subtrees and counting the number of choices that allow the root to survive, only the number of choices made (before choosing the hanging edge) for each subtree matter. Also, note that we can do these interleavings associatively (that is, first the first subtree, then the second with the result of the first subtree, then the third with the result of the first and second subtrees and so on), and the computation remains the same.
So the merges will go as follows: for a vertex, we will compute the answers for its children, and for each child, extend the set of choices by the extra choice that needs to be made for adding a hanging edge (by interleaving with the single choice), and then interleave these choices again.
Using these ideas, we can do a DP with the following state: `dp[u][k]` is the number of choices for the subtree of a vertex \(u\) when \(k\) decisions are done before we choose the hanging edge. For the transitions, you can refer to the submission.
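As a sketch of the merge step (with a hypothetical helper name, not the author’s actual code), interleaving the choice sequences of two parts is a binomial convolution on their counts:

```python
from math import comb

def merge(A, B):
    """Hypothetical helper: binomial (interleaving) convolution. If A[i]
    counts choice sequences of length i for one part and B[j] does the same
    for another part, then C[k] counts the interleaved sequences of length k."""
    C = [0] * (len(A) + len(B) - 1)
    for i, a in enumerate(A):
        for j, b in enumerate(B):
            C[i + j] += a * b * comb(i + j, i)
    return C

# Two parts with exactly one sequence of length 1 each interleave
# in binom(2, 1) = 2 ways.
assert merge([0, 1], [0, 1]) == [0, 0, 2]
```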
Submission: [submission:196942805]
How can you exploit the fact that when there are \(k\) \(a_i\)-s to the left of a number which is not on a pocket, it gets shifted to the left by \(k\)? Try finding some properties and some nice locations that can help.
We’ll reindex \(a\) for its indices to start from \(0\).
Let’s start out with an easy-to-prove observation:
We call an interval \([a, b]\) of positive integers good if there are exactly \(b - a + 1\) pockets with positions \(< b\). Then after a single operation, a good interval moves to positions that form a good interval. The proof goes as follows: let’s say there are \(c\) pockets under the interval. Then the left endpoint moves by \(b - a + 1 - c\) (potentially falling into a pocket), the right endpoint moves by \(b - a + 1\), and balls that didn’t fall into a pocket remain contiguous (filling holes if formed).
Let’s try to take advantage of such good intervals. We will try to construct good intervals with maximal right endpoints, so as to partition the integer line into manageable chunks.
Define the sequence \(\{r_i\}_{i = 0}^{n - 1}\) by \(r_0 = a_0 - 1\) and \(r_i = r_{i - 1} + i \cdot \left\lfloor \frac{a_i - 1 - r_{i - 1}}{i} \right\rfloor\) for \(i > 0\). This is clearly a non-decreasing sequence.
This roughly corresponds to the set of right endpoints of our intervals (some \(r_i\) will be repeated here; we’ll ignore the repeats in the current analysis, but they’ll be helpful for the implementation).
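The recurrence translates directly into code; here is a small sketch (the sample array below is made up purely for illustration):

```python
def right_endpoints(a):
    """r[0] = a[0] - 1 and r[i] = r[i-1] + i * floor((a[i] - 1 - r[i-1]) / i)."""
    r = [a[0] - 1]
    for i in range(1, len(a)):
        r.append(r[-1] + i * ((a[i] - 1 - r[-1]) // i))
    return r

# Made-up pocket positions, just to illustrate the recurrence.
assert right_endpoints([3, 5, 10]) == [2, 4, 8]
```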
Observation 1. By the definition of floor, we have \(r_i + 1 \le a_i \le r_i + i\).
Observation 2. Whenever \(r_i > r_{i - 1}\), we have \(r_i > a_{i-1}\).
Proof. Since \(r_i - r_{i - 1}\) is positive and divisible by \(i\), the difference is at least \(i\). So by the previous observation, we have \(r_i \ge r_{i - 1} + i \ge a_{i - 1} + 1 > a_{i - 1}\), as desired.
From this, we get that if \(r_i > r_{i - 1}\), the length of the moving interval \([r_i - i + 1, r_i]\) stays intact for \(\left\lfloor \frac{a_i - 1 - r_{i - 1}}{i} \right\rfloor - 1\) operations, and after one more operation, it ends up at some suffix of positions \(\le r_{i - 1}\). That is, \(r_i\) after some operations hits \(r_{i - 1}\). This sequence gives a partition of the integer line into smaller chunks that can be reasoned about locally.
Now that the main idea of the solution is complete, all that remains is to back-compute the number that ends up at a given position. For this, it is (theoretically) straightforward to compute the answer if you reason about the number that ends up at certain \(r_i\)-s close to \(x\). For the relevant book-keeping that needs to be done as a part of implementation details, a persistent segment tree can be used. Note that the queries can also be solved offline, but that needs some more work.
Submission: [submission:196921763]
What kinds of operations can we implement using addition alone? Try small cases, like \(d = 2\).
Using addition, we can implement multiplication by a “compile-time” constant \(n\) in \(O(\log n)\) steps. Since arithmetic is modular, we can implement this in \(O(\log p)\) steps (remember that multiplication by \(0\) is an edge case here, so it needs to be handled like multiplication by \(p\)).
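A minimal sketch of this double-and-add idea (the function name and the operation counting are illustrative, not from the submission):

```python
def mul_by_constant(x, n, p):
    """Compute (n * x) mod p using additions only (O(log n) of them).
    Multiplication by 0 is handled like multiplication by p, as noted above."""
    if n == 0:
        n = p
    adds = 0
    bits = bin(n)[2:]            # most significant bit first
    result = x                   # handles the leading 1 bit
    for bit in bits[1:]:
        result = (result + result) % p   # doubling is one addition
        adds += 1
        if bit == '1':
            result = (result + x) % p    # one more addition
            adds += 1
    return result, adds

p = 1_000_000_007
val, adds = mul_by_constant(123, 456, p)
assert val == 123 * 456 % p
assert adds <= 2 * (456).bit_length()
assert mul_by_constant(5, 0, p)[0] == 0  # the multiplication-by-0 edge case
```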
If you try \(d = 2\), you can realize that it suffices to be able to find \(a^2\) from a variable \(a\), because then we have \(xy = ((x + y)^2 - x^2 - y^2) \cdot ((p + 1) / 2)\).
Now we only want to be able to go from \(x^d\) to \(x^2\). For this, note that there is a set of \(d + 1\) linear equations in the formal variables \(1, x, x^2, \dots, x^d\) relating them to \((x + i)^d\) for \(i = 0\) to \(d\). The matrix of this system is invertible: its determinant is a product of binomial coefficients (non-zero mod \(p\) because \(d < p\)) times the Vandermonde determinant (also non-zero mod \(p\) because \(d < p\)).
By inverting this matrix, we can get \(x^2\) from \((x + i)^d\) for \(i = 0\) to \(d\). The total cost is about \(O(d \log p)\) in both the number of queries and memory, and it passes very comfortably.
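The linear-algebra step can be sanity-checked offline with small illustrative parameters (the Gauss–Jordan elimination below is a generic textbook routine over \(\mathbb{F}_p\), not the actual judge interaction):

```python
from math import comb

p, d, x = 101, 5, 7  # small illustrative parameters; p prime, d < p

# M[i][j] is the coefficient of x^j in the binomial expansion of (x + i)^d.
M = [[comb(d, j) * pow(i, d - j, p) % p for j in range(d + 1)]
     for i in range(d + 1)]
b = [pow(x + i, d, p) for i in range(d + 1)]  # the values we can obtain

def solve_mod_p(M, b, p):
    """Generic Gauss-Jordan elimination over F_p (assumes M is invertible)."""
    n = len(b)
    A = [row[:] + [rhs] for row, rhs in zip(M, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] % p != 0)
        A[col], A[pivot] = A[pivot], A[col]
        inv = pow(A[col][col], p - 2, p)       # Fermat inverse (p is prime)
        A[col] = [v * inv % p for v in A[col]]
        for r in range(n):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [(v - f * w) % p for v, w in zip(A[r], A[col])]
    return [row[n] for row in A]

coeffs = solve_mod_p(M, b, p)
assert coeffs == [pow(x, j, p) for j in range(d + 1)]  # recovers 1, x, ..., x^d
assert coeffs[2] == x * x % p                          # in particular, x^2
```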
Submission: [submission:196822812]
This post was originally written on Codeforces; relevant discussion can be found here.
There have been a few instances on CF suggesting that quite a few people aren’t very comfortable with the floor and ceiling functions (and inequalities in general) — for instance, here and here. This inspired me to write a hopefully short post on how to work with these topics systematically.
Note that I will not define what a real number is, or do anything that seems too non-elementary. Also, for the foundational stuff, I will rely on some correct intuition rather than rigour to build basic ideas, and the systematic logically correct way of reasoning will be used later. The idea behind this post is to just help you understand how to reason about the topics we’ll cover.
Before we jump ahead to floors and ceilings, we need to talk about inequalities. I will assume that you know something about the number line. Let’s recap some informal definitions:
For the sake of brevity, we will abbreviate “if and only if” to iff. We will often use \(P \iff Q\) in math notation to indicate that \(P\) is true iff \(Q\) is true. For a unidirectional implication of the form “if \(P\) is true, then \(Q\) is also true”, we will write \(P \implies Q\), and read it as \(P\) implies \(Q\).
Here’s a fact that people often use intuitively, called the Law of Trichotomy:
This is often useful when you try to break things into cases and look for contradictions.
A corollary of this is the following definition of the relations in terms of \(<\) and negation (you can use any relation other than \(=\) to get all other operations):
Now note that if we have some results about \(<\), using the above relations, we can come up with relations corresponding to these operations too.
We can also write these relations in terms of \(\le\) — I leave this as an exercise for you to confirm that you’ve understood the logic behind these.
I recommend understanding why these make sense, so that you can reason about directions and inequalities more easily.
Let’s say \(a R b\) and \(b R c\) are true, where \(R\) is any one of \(<, >, =, \le, \ge\). Then it is easy to show that \(a R c\) is also true. This allows us to chain inequalities and write things like \(a < b < c\), remembering that as we read the expression from left to right, the terms only increase.
This property is called transitivity, and is well-studied in other contexts too. But for our purposes, it just makes our life simpler.
At this point, it is very important to note two more things:
This is what many people don’t internalize very well, or are simply not exposed to. At the risk of being too verbose, I will explain some interactions of arithmetic operations and inequalities (we will focus on \(<\), but such relations can be developed for other relations as well — like \(>, \le\) and their combinations — and that is left as an exercise):
\[a < b \iff a + c < b + c\]
Multiplication by \(0\): Both sides of any inequality become \(0\), so upon multiplication by \(0\), inequalities become equalities.
Multiplication by a positive number: let’s say you have an inequality \(a < b\). This means \(0 < b - a\), or in other words, \(b - a\) is a positive number. If we multiply it by a positive number \(c\), it will remain positive. That is \(0 < c(b - a)\). Adding \(ac\) to both sides gives \(ac < bc\). So:
\[c > 0, a < b \implies ac < bc\]
\[c < 0, a < b \implies ac > bc\]
\[a < b, c < d \implies a + c < b + d\]
\[a < b, c > d \implies a - c < b - d\]
NOTE: Subtracting inequalities of the same type and adding inequalities of the opposite type does not work, and I leave it as an exercise for you to see which steps in the above proofs fail.
Multiplying and dividing inequalities is significantly more complicated, but they can be treated with the same ideas. More specifically, for multiplying two inequalities, you have to make cases on the signs of the 4 terms. For division, the idea is the same, but additionally you have to take care that no denominator is 0.
Let’s start out by the definition of the floor function and the ceiling functions:
The floor function is a function \(\lfloor \cdot \rfloor : \mathbb{R} \to \mathbb{Z}\) such that for all real numbers \(x\), \(\lfloor x \rfloor\) is the integer \(n\) such that \(n \le x < n + 1\).
The ceiling function is a function \(\lceil \cdot \rceil : \mathbb{R} \to \mathbb{Z}\) such that for all real numbers \(x\), \(\lceil x \rceil\) is the integer \(n\) such that \(n - 1 < x \le n\).
Clearly, if such \(n\) exist, they are unique.
Why do these definitions make sense? Note that intervals of the form \([n, n + 1)\) for integer \(n\) cover the number line disjointly (this is not a rigorous proof, but I don’t want to deal with real numbers), so \(x\) will always be in one of these intervals. The same property holds for intervals of the form \((n - 1, n]\).
As a special case, when \(x\) is an integer, both the ceiling and the floor functions return \(x\) itself.
NOTE: Remember that \(\lfloor 1.5 \rfloor = 1\), but \(\lfloor -1.5 \rfloor = -2\), and not \(-1\).
In fact, we can notice the following:
The proof follows from the definition (we will show this only for floor, the ceiling case is analogous): let’s say that this is not true, and there is some \(a > n\) such that \(a \le x\). \(a > n\) means \(a \ge n + 1\). But we know that \(x < n + 1 \le a\), so \(x < a\). This is a contradiction.
Note: As a consequence of the above fact, we have \(\lfloor x \rfloor \le \lceil x \rceil\), and the difference between them is \(1\) iff \(x\) is not an integer; when \(x\) is an integer, they are equal to \(x\).
We can rephrase these statements in the following ways that can be useful while solving problems.
This is especially important because these kinds of iff statements help us reduce the large number of cases that can arise while framing and solving inequalities. For an example, consider this comment.
Let’s try to rephrase the original definition in another way too. \(n \le x < n + 1\) is equivalent to the two inequalities \(n \le x\) and \(x < n + 1\). The latter is equivalent to \(x - 1 < n\). So combining these shows that \(x - 1 < n \le x\). These steps are reversible, so this can also serve as an alternate definition of the floor function. Doing this similarly for the ceiling function, we have the following properties (which are also alternate definitions):
It is sometimes useful to think in terms of the fractional part of a number (i.e., its distance from its floor), since it is guaranteed to be a non-negative real number less than 1. Similarly, the distance to the ceiling is helpful too.
Let’s see what happens to the floor of a real number when we add (or equivalently, subtract) an integer.
Let \(n’ = \lfloor x + a \rfloor\) where \(a\) is an integer. This is equivalent to saying that \(n’ \le x + a < n’ + 1\), which is equivalent to \(n’ - a \le x < n’ - a + 1\). Since \(n’\) is an integer, \(n’ - a\) is also an integer. Hence this is equivalent to saying that \(n’ - a = \lfloor x \rfloor\). In other words, we have shown that
\[a \in \mathbb{Z} \implies \lfloor x + a \rfloor = \lfloor x \rfloor + a\]
Similarly, we can show that
\[a \in \mathbb{Z} \implies \lceil x + a \rceil = \lceil x \rceil + a\]
However, this kind of an identity doesn’t hold for multiplication by an integer in general. For example, \(\lfloor 1.5 \rfloor = 1\), and \(\lfloor 2 \cdot 1.5 \rfloor = 3 \ne 2 \cdot \lfloor 1.5 \rfloor\).
Note that by the same example, we can show that \(\lfloor x + y \rfloor\) is NOT equal to \(\lfloor x \rfloor + \lfloor y \rfloor\) in general (and something similar for ceiling). However, it can be shown that the following are true:
\[\lfloor x \rfloor + \lfloor y \rfloor \le \lfloor x + y \rfloor \le \lfloor x \rfloor + \lfloor y \rfloor + 1\]
\[\lceil x \rceil + \lceil y \rceil - 1 \le \lceil x + y \rceil \le \lceil x \rceil + \lceil y \rceil\]
Let’s now look at what happens when we multiply the argument of the functions by \(-1\).
From \(n \le x < n + 1 \iff -n - 1 < -x \le -n\), we can note that \(-\lfloor x \rfloor = \lceil -x \rceil\). By replacing \(x\) by \(-x\) in this relation, we have \(\lfloor -x \rfloor = - \lceil x \rceil\).
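The identities above are easy to spot-check with exact rational arithmetic (a quick illustrative script using Python’s `fractions` module to avoid floating-point issues):

```python
from fractions import Fraction
from math import ceil, floor

xs = [Fraction(n, 4) for n in range(-12, 13)]  # -3, -2.75, ..., 3

for x in xs:
    for a in range(-5, 6):
        assert floor(x + a) == floor(x) + a   # integer shifts move the floor
        assert ceil(x + a) == ceil(x) + a     # ... and the ceiling
    assert floor(-x) == -ceil(x)              # negation swaps floor and ceiling
    for y in xs:
        assert floor(x) + floor(y) <= floor(x + y) <= floor(x) + floor(y) + 1
        assert ceil(x) + ceil(y) - 1 <= ceil(x + y) <= ceil(x) + ceil(y)
```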
Since the floor and ceiling functions both return integers, once one of them has been applied, applying any further floor or ceiling functions to the result does not change the value.
One non-trivial property that sometimes helps is the following:
For an arbitrary positive integer \(n\) and real number \(x\):

\[\left\lfloor \frac{\lfloor x \rfloor}{n} \right\rfloor = \left\lfloor \frac{x}{n} \right\rfloor \quad \text{and} \quad \left\lceil \frac{\lceil x \rceil}{n} \right\rceil = \left\lceil \frac{x}{n} \right\rceil\]

Proving this is simple and is left as an exercise for the reader. Alternatively, it follows from the proof of the more general property below.
Another more general property is the following (reference: Concrete Mathematics, p71):
Let \(f(x)\) be any continuous, monotonically increasing function with the property that \(f(x)\) being an integer implies \(x\) being an integer. Then we have \(\lfloor f(x) \rfloor = \lfloor f(\lfloor x \rfloor) \rfloor\), and a similar thing holds for the ceiling function.
The proof is as follows. There is nothing to prove when \(x\) is an integer, so consider a non-integer \(x\). Since \(\lfloor \cdot \rfloor\) and \(f\) are both non-decreasing functions, we have \(\lfloor f(\lfloor x \rfloor) \rfloor \le \lfloor f(x) \rfloor\). Suppose that the inequality is strict — in that case, since \(f\) is monotonic and continuous, there is a \(y\) with \(\lfloor x \rfloor < y \le x\) such that \(f(y) = \lfloor f(x) \rfloor\). Then \(y\) is an integer because of the property that \(f\) has. But there cannot be an integer strictly between \(\lfloor x \rfloor\) and \(x\), which is a contradiction.
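For instance, taking \(f(y) = y / n\) for a positive integer \(n\) (which is continuous, increasing, and takes an integer value only at integer arguments) gives \(\lfloor \lfloor x \rfloor / n \rfloor = \lfloor x / n \rfloor\). A quick check with exact arithmetic:

```python
from fractions import Fraction
from math import floor

# floor(floor(x / m) / n) == floor(x / (m * n)) for positive integers m, n
for num in range(-60, 61):
    x = Fraction(num, 7)
    for m in range(1, 5):
        for n in range(1, 5):
            inner = floor(x / m)  # exact, since x / m is a Fraction
            assert floor(Fraction(inner, n)) == floor(x / (m * n))

# The same composition rule holds for Python's integer floor division:
assert (-7 // 2) // 3 == -7 // 6 == -2
```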
For more properties and applications, I would refer you to the Wikipedia article on these functions.
Now that we have some results in our toolbox, we’ll show how to actually solve problems cleanly without getting stuck thinking about multiple cases at once. We’ve already shown some ways of using these while deriving other results, so we’ll keep this short and I’ll just link to some problems instead.
I’ll list out some non-trivial identities and bounds, and proving them is left as an exercise.
For the sake of sanity, it is recommended that whenever you are performing integer division in any programming language, you try to convert it into a form where the denominator is positive. However, for most languages, it is always true that `(a / b) * b + (a % b)` evaluates to `a`.

There are typically two behaviors when performing integer division with a positive divisor:

- Floored division (as in Python): `a // b` will give \(\lfloor a / b \rfloor\), and `a % b` will give the non-negative remainder when \(a\) is divided by \(b\).
- Truncated division (as in C++): `a / b` rounds towards zero, and for negative \(a\), `a % b` gives the negative distance to the corresponding multiple of \(b\).

Note that to compute the ceiling of some \(a / b\) for positive integer \(b\) and integer \(a\), we can use the result mentioned earlier: \(\lceil a / b \rceil = \lfloor (a + b - 1) / b \rfloor\). Implementing code for correctly computing floors and ceilings in both C++ and Python is an instructive exercise.
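One possible (illustrative) way to do this exercise in Python, reducing to a positive denominator first as recommended above:

```python
def floor_div(a, b):
    """Floor of a / b for any nonzero integer b."""
    if b < 0:
        a, b = -a, -b          # make the denominator positive first
    return a // b              # Python's // already floors when b > 0

def ceil_div(a, b):
    """Ceiling of a / b, via ceil(a / b) = floor((a + b - 1) / b) for b > 0."""
    if b < 0:
        a, b = -a, -b
    return (a + b - 1) // b

assert floor_div(7, 2) == 3 and ceil_div(7, 2) == 4
assert floor_div(-7, 2) == -4 and ceil_div(-7, 2) == -3
assert floor_div(7, -2) == -4 and ceil_div(7, -2) == -3
```

In C++, where `/` truncates towards zero, the same reduction to a positive denominator followed by an explicit adjustment for negative numerators achieves the same behavior.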
Keep in mind that `std::floor` and `std::ceil` are operations on floating point numbers, and their results are also floating point numbers. Hence they might not be exact when converted to integers, and it is not recommended to use them for most practical purposes.
This post was originally written on Codeforces; relevant discussion can be found here.
As was noted in this post, Google decided to discontinue its programming contests (Code Jam, Kick Start and Hash Code).
It seems they will take down all relevant servers (and hence the problems and the submission system) for all these contests. While there are some initiatives to archive past Code Jam and Kick Start problems (for example, here — also have a look at this comment), there seems to be nothing similar for Hash Code.
As a past participant of Hash Code, I believe that these types of contests are quite important too, and they may have directly or indirectly inspired other such contests (like the Reply Challenge and AtCoder Heuristic Contests). Also, there are some techniques, like simulated annealing and using SAT/ILP solvers, that might not show up on standard competitive programming websites but do show up in such optimization contests. Tutorials for these techniques will also be lacking without such examples.
So here are my questions to anyone who might have reliable information about the contests, since everything seems to be in a mess right now:
This post was originally written on Codeforces; relevant discussion can be found here.
Disclaimer: I am not an expert in the field of the psychology of learning and problem-solving, so take the following with a grain of salt. There is not much “scientific” evidence for this post, and the following is validated by personal experience and the experiences of people I know (who fall everywhere on the “success” spectrum — from greys to reds in competitive programming, from beginners in math to IMO gold medalists, and from people with zero research experience to people with monumental publications for their own fields). This post only covers one possible way of thinking about knowledge organization and retrieval that I have been using for the past decade or so, but there definitely will be other models that might work even better. Even though I just have a few courses worth of formal psychology experience, I still think the content of this post should be relevant to people involved with problem-solving.
I hope this helps both students (to make their learning efficient) as well as educators (to make their teaching methodology more concrete and participate actively in their students’ learning process). Speaking from personal experience, using these ideas as guiding principles let me study uni-level math in middle school without any guidance, so I hope others can benefit from these the same way I did. And it is important to remember to not self-impose limits on yourself by looking at modern education standards - school curriculum is designed for being able to teach at least 95% of students in a low-supervision setting, which severely limits the scope of what you can achieve.
Note that the techniques mentioned in this post (or any resource that tries to explain learning and reasoning) will not work for you if your basics behind reasoning are flawed. If you’re a beginner at reasoning in terms of proofs, I recommend this book to understand what mathematical reasoning is like. Intuition is just a guide — it is developed and applied only through multiple instances of rigorous reasoning.
Note: This post is quite long, so reading it bit-by-bit should be better than skimming it in one go. I have also added a few links to some really good resources within the text, where I felt that it would be better to link to pre-existing stuff to avoid repetition and give a more varied perspective.
Here’s what we’ll discuss in the post:
I would like to thank AlperenT, __Misuki, [-is-this-fft-](https://codeforces.com/profile/-is-this-fft-), Ari and Madboly for reviewing/discussing my post, and their constructive criticism and suggestions that made this post much better than in its initial revision.
Too many people, usually those who are starting out on their problem-solving journey in a completely new domain, ask for the right strategy to “study” and “practice” topics. Some common questions I get are:
I used to answer these questions with some averages based on how much the people I know train. But the variance in these answers is non-trivially large and can’t be ignored. In fact, the answer does not even remain the same for the same person, even within the same topic.
In short, there is no right answer to these questions. However, if you understand your own learning process (either perceived on your own or by a coach/teacher), you will find that you don’t really need to focus on these questions at all, and you can fine tune your own learning by getting feedback from yourself.
Most “successful” people I know (in the problem-solving context) relied on their gut or sheer brute force to tune their learning process. Some simply overcompensated, and that’s fine too, though it’s a bit suboptimal. But a small fraction of people knew some characteristics of their learning process and were able to use it to make it more efficient and push its limits, although to different extents on a case-to-case basis.
Apart from fine-tuning, knowing about how you learn can also help build different “knowledge-base” building and traversal techniques from scratch, leading to very unique ways of problem-solving that you might have never seen before. For instance, a change in the way you read books/blogs can be that you try to reconcile their contents with what you already know, and how you can update your knowledge base with that information.
What follows is a set of some observations and mental models that I think should be talked about explicitly, and I would encourage people to also discuss the mental models they use in the comments.
To make things more concrete, we need some model of how knowledge is stored, processed and accessed by us. I like to think of it as a hierarchical graph structure, with some sense of locality and caching.
Now what do we mean by this? Let’s start by building this model bit by bit, from the ground up.
An idea, for now, is a concept, like binary search, DP, greedy and so on. Let’s say we solve problems that involve both DP and binary search. Or more generally, we remember some context that has both DP and binary search. Is there a data structure that can store “things” and “relations” between them? Of course, we know that graphs are such data structures! If you don’t know what a graph is, the first paragraph and the first figure at this link will help get some intuition about it.
So our first model is a graph — the nodes are “concepts” and the edges are “context”. If someone is starting out without any understanding of how things are connected, then their problem-solving approach would probably look like doing some kind of search on this graph. At least, this is how an AI would do things (keyword: A* search using heuristics). Most newbie approaches to problem-solving can be summarized as one of the following two approaches:
Here path doesn’t necessarily mean a simple path in graph-theoretic terms, it is more like a walk.
This graph can be built over time through practice, by storing the contexts in your brain and remembering all related concepts. The more contexts you can remember/internalize, the stronger the edge is, and the less time it takes for you to traverse it.
This is a very mechanical thing, and it would work if all the information you had was encoded into that graph, and you were expected to solve problems with only those ideas.
However, there are multiple issues in this:
These are where a lot of approaches get stuck, and progress seems difficult. Let’s try to refine this model further, and make things more systematic.
Let’s think about how things are related. Two concepts can either be unrelated, partially overlapping, a (generalization, specialization) pair (like DP, knapsack), or a (situation, aspect) pair (like DP and optimizing for time complexity). There can be even more categories, but we will just consider these, and the resulting structure will be general enough to account for more kinds of relations.
The very fact that generalizations and specializations exist points to a redundancy in our graph model, which will duplicate edges between concept nodes for such cases.
So we come to a hierarchical structure, where at the highest level, there are the so called “high-level” ideas, and at the lowest level, there are things like specific cases, implementations and what not.
A relevant image is the following, adapted from an image on this webpage. Note that this figure seems to imply that every “child” node has a single “parent” node, but that is not true. A child node can have multiple parent nodes.
Every layer in this structure is a graph itself, but the edges that connect concepts in one layer to concepts in another layer help us capture all these relationships between concepts. Unrelated means different connected components, partially overlapping means that there is a path of length \(> 1\) connecting the two concepts, and the last two correspond to being an “ancestor-descendant” pair in some sense for this hierarchy.
Note that this is almost the same as the original model; we just “color” the concepts into various hierarchy levels. Something similar to this kind of a categorization has also been discussed in this post. This comment of mine talks about why this hierarchy is important.
This fixes a lot of the issues:
Now the one issue that still seems to plague us is creativity. I think of creativity as having two different components: latent and active. Both of them, while being creative processes, are quite different in the sense that latent creativity arises from subconscious thinking, while active creativity arises from augmenting the knowledge graph in real time.
It is quite hard to explain latent creativity (in fact, it is related to an open problem in the psychology of learning), but there is some evidence of its existence — if you leave the problem-solving process on autopilot, it becomes more probable that you will get flashes of inspiration and insight from seemingly nowhere. This is also one of the reasons why it is suggested to learn in a way that allows you to enjoy what you learn — because that will enable your mind to accept the task of processing information and ideas about that topic subconsciously.
However, active creativity comes from effort. This, at least in cases I have consciously observed, usually corresponds to noting weak (or completely new) edges in the graph, or traversing a long path in the graph, or generalizing and adding a new hierarchy layer (either a top-level layer, or a layer in the middle), or modifying ideas in your graph to form new concepts. This is a very restrictive definition of creativity, but we’ll only focus on this sort of creativity when referring to active creativity, lumping anything that doesn’t fit in this definition even vaguely with latent creativity.
In fact, a standard way of generalizing/tweaking existing ideas in mathematics comes from this procedure, which is just a way to add new layers in the graph (we will call this creative augmentation, for the lack of a better word):
Let’s say the set of ideas you have right now is \(S\) (initially empty), and the set of examples you have is \(E\) (initially a set of examples you have that you want to explore). Try to do the following operations as many times as possible, in order to form a near-“closure” of \(S\) and \(E\) under these operations (that is, try getting to a point where doing the following is very hard).
Category theory takes this sort of an idea to the extreme, and has a very rich theory that is applicable in a lot of places in math.
Now remember what we said about how this structure reduces computation time. Notably, some concepts are very general, and they are on the top of the hierarchy. So it makes sense to “cache” these concepts into your memory. So taking motivation from computer architecture, our L1 cache is the highest level of the hierarchy, L2 cache is the level just below it, and so on. The comment I linked earlier discusses this in a bit more detail.
Similarly, some contexts are very noteworthy and unforgettable. There is another kind of cache at play, that stores these contexts, and helps in faster retrieval of nodes that it connects.
Note that we have only talked about the “structural” properties of the so-formed hierarchical knowledge graph. Constructing and using this graph are both non-trivial tasks, and we will discuss about it below.
In other words, we have only dealt with the “spatial” aspects of knowledge, and “temporal” aspects are quite non-trivial to understand, and this is where most of the action occurs. Needless to say, the time taken and other parameters of this time-based process vary wildly from individual to individual, so finding an optimal learning strategy will take some trial and error.
In the language of the previous section, learning is just building your knowledge (hierarchical) graph. There are three standard ways learning takes place, however, they are not the only ones:
This corresponds to explicitly constructing subgraphs from some knowledge that has been spoon-fed to you. This is often the easiest way to learn, but it is the most prone to half-baked understanding. Unless studied properly, theory will almost inevitably lead to prominent concepts (nodes) but weak contexts (edges).
This corresponds to explicitly learning contexts (i.e., growing graphs by adding or strengthening edges), when you already have concepts in place. For instance, this blog talks about how to practice competitive programming, and it emphasizes the fact that, especially for the current competitive programming meta, you don’t need a lot of knowledge-based concepts, compared to problem-solving based nodes in your knowledge graph.
In a meta sense, since problem-solving ideas are themselves nodes in the knowledge graph, this way of learning also adds nodes to your knowledge graph (which we will call meta nodes — because it is not really related to the content of the knowledge graph as such). This comes from reflecting on how you solved the problem, remembering your thought process and the main ideas that went into solving the problem.
Also, a lot of the time a single context connects multiple concepts, so it is a “hyper-edge” (an edge of a hypergraph), and not an ordinary edge. This also leads to a natural “dual” of a context — a concept that is a generalization (in some sense) of the related concepts and captures the essence of the context. We call it a dual node because of how edges and vertices in a graph are duals of one another.
This corresponds to adding stuff to the graph without outside intervention. This is what fuels original research and leads to good problem-setting.
It strengthens edges in your knowledge graph as well as creates non-existing ones, and can also lead to new nodes in your graph. The procedure described in the last section is an example of this kind of learning. As a rule of thumb, as explained in the last section, applying active creativity leads to more creativity (with some effort, of course), so this is a great way to explore ideas in depth and cultivate abstract reasoning.
However, this is also the hardest way to learn among the three, because of how non-trivial coming up with ideas from scratch is, and how much active effort this needs you to put in.
In the context of the knowledge graph, reasoning is just a partial traversal of the graph, where you can also build new contexts along the way. So it is just manipulating paths on the graph!
At this point, I think it is reasonable to read this post that talks about applying (often newly learnt) ideas, which involves a lot of recognizing configurations, forcing topics to fit, and what not. All of this is a naive way to randomly traverse the knowledge graph.
As we mentioned earlier, reasoning strategies themselves are nodes in the knowledge graph, so it helps to have some knowledge of how you can think about things, as a starting point on how to think at all.
In what follows, we will try to categorize the more standard reasoning techniques by how they arrive at their conclusion.
We will categorize reasoning techniques based on the initial motivating ideas behind them. These categorizations depend on what kind of edges we think of while trying to connect ideas.
“Wishful thinking”
This is not really a way of reasoning, but a way of motivating what to reason about. In some sense, this tries to hypothesize the existence of a certain path in the knowledge graph without much evidence, and tries to reconstruct the path by some sanity-related arguments.
So you could think of this as doing “meet in the middle” on the graph.
Intuition
How do we think of things “intuitively”? We connect two topics by some vague link, and say that based on our past experiences, or rules of thumb, some conclusion must be true. In the context of the knowledge graph, this just means making a claim about the connectivity of the two topics, making an educated guess about the kind of connections we can have (close to wishful thinking, but not quite). This is like heuristic search on the graph.
Rigor
Reasoning with the help of rigor (from the very beginning itself) is quite the opposite of reasoning by intuition. Here, in most cases, we try to get some theoretical results (rigorously proved), and try to search for results that are closer to some kind of a desirable goal that we might not always have a clear picture of until the very end. So, this is like unguided DFS/BFS on a graph, where we stop only when a vertex satisfies some predicate.
Maybe there’s a better way of naming this reasoning technique, but I went with the name rigor just because rigor is a characteristic property of the cases when this reasoning technique shows up. You really can’t go wrong with a rigorous argument, and that’s why it helps in cases where the scenario isn’t too clear.
Of course, developing more reasoning techniques comes with more practice (remember how reasoning techniques are also nodes?). For a couple of great resources on general problem solving techniques, I would refer you to the following books:
We will also discuss some real-life examples of reasoning techniques later in the post.
Using the idea of knowledge graphs, we can think about some limitations that people face while solving problems.
As you might know, both attention span and how much information we can hold at once are functions of how much working memory (or RAM) we have. The brain also works like an LRU cache: it evicts the oldest items when the cache is full. This can make you forget what you were trying to solve (or what intermediate conclusion you were reasoning towards) in the first place. The more complicated the problem you are solving, the more of the knowledge graph needs to be loaded into your working memory, so it becomes easy to lose your train of thought in such cases.
Thus a good piece of advice is to jot down your ideas, fast enough that it doesn’t break the flow of your reasoning, and comprehensive enough to not look like gibberish a few seconds into the future.
As we discussed earlier, active creativity requires effort and consistency. It is only natural that the brain shuts down any attempts at cultivating active creativity.
Once you have enough knowledge, it becomes harder to be creative: you get locked into a certain context once you have associated it too strongly with a real-life situation, which might have even more hidden aspects than you imagined.
There are a few ways to avoid this slump:
This adds some more “chaos” in the construction part of the knowledge graph, and from experience, mixing and matching sometimes leads to unexpected results.
When your knowledge graph becomes too large to be dense, or too localized because it is never loaded completely when you want to solve a problem, it becomes harder to add more high-level ideas without losing flexibility. And when you have too many high-level ideas, accessing them becomes a chore of its own.
A couple of ways to fix this are:
Since the above discussion might have been too abstract and hand-wavy to be useful, I’ll try to connect these ideas with some real-life examples — more specifically, we will discuss both learning strategies used by actual people, and problem solving strategies.
Now that we know how knowledge can be thought of, we should add things to the knowledge graph in a manner that makes them easy to retrieve.
Let’s think about how to take advantage of hierarchy (because it is the hierarchy that reduces the computation time as well as the load on the working memory).
Here’s what I subconsciously do whenever I read a new paper/tutorial/blog (and I try to make my posts structured in a way that makes this process easy). It is systematic enough to be an algorithm itself, so I will describe it in steps.
Don’t be afraid to prune out redundant nodes — the power of hindsight allows you to polish your mental presentation of ideas in a way that is natural, and this can be done only after you have read the text more than once. The issue with reading text (or any way of adding knowledge to your knowledge graph) is that you process things in a linear order. This adds an artificial gap between things you get to know earlier and things you get to know later. So to make this subgraph of your knowledge graph denser, it is important to re-read the text and re-evaluate your edges.
This is one way I came up with to reinforce the connections between ideas that I learn from a text/tutorial/blog/discussion. If you use this kind of method often enough, you’ll be able to apply this “algorithm” much faster and subconsciously. Your method might differ quite a bit, and that’s fine.
However, this was only about learning things from a source that spoon-feeds you some knowledge. Learning in other ways is much more chaotic, but it also helps you to build your own connections more organically, in a way that is more intuitive for you and gels well with your own set of techniques.
This is why it is always a good idea to try and learn about the same topic from a few problems that use it. Of course, knowing that the topic is related gives you an advantage — knowing a potential starting point, or a node on the path corresponding to the solution. However, solving in this way is a good way of making connections that you could not have made by just reading the text.
A good way to learn about the solution is to write out your solution in detail, and where the ideas came from. Did you consciously use an idea, or was it a subconscious jump along an edge?
After you write this out, it becomes yet another text for you to analyze using the method for texts! Remember to keep things in a hierarchy so that it becomes easy to access later on.
We still haven’t given examples of high-level ideas yet. What really are these high level ideas? The answer is that they are what you want them to be. The history behind your learning process determines, to a large extent, what you find intuitive and what you don’t. Since high-level ideas are meant to be generic and easy enough to be subconsciously used, you should really be defining them on your own. There’s no harm in getting to know about other people’s high level ideas and motivating your own definitions, though.
What follows is a list of some high-level ideas I use. The first part of this list includes common paths in the knowledge graph that deserve high-level “dual” meta nodes of their own — remember that we introduced this when we described learning via problems.
Now let’s come to the non-meta nodes. These are not as “universal” as the meta nodes, because you actually get work done using these nodes, while those nodes govern your usual searching patterns using these nodes. Since these are quite self-explanatory, I will just list some of them out, with some of their child nodes in brackets. Note that it is possible that a child node has more than one parent — this is one of the ways to strengthen your knowledge graph too!
You can add a lot of more non-meta nodes, but I feel that at least for me, these are the most important ones I use frequently.
When I try to solve problems, I almost always start by trying to recognize the setup; this saves the time that would otherwise be wasted on false starts and on analyzing the setup before you can reason about it. An interesting sidenote, which definitely should NOT be taken as a universal truth, is that problems on CF tend to be more amenable to these kinds of problem-solving strategies, while problems on AtCoder require more analysis and “thinking”.
I could just give an example of how my thought process works at this point, but I won’t do so with an explicit example for the following reasons:
However, I will still add references to a couple of places where problems are solved with some motivation behind them, in the hope that it will help you build your own thought processes by trying to recognize frequently used steps and motivations.
Now let’s talk about how you can fine-tune your learning process. We discussed a possible way to build your knowledge graph earlier. But how do you know when to stop, and whether you have learnt a good amount? The vanilla answer is that there is never enough learning, and the creative augmentation process I described earlier is almost limitless.
Still, for practical reasons, you need some sort of a metric to understand how well you’ve understood the idea, and how well it has assimilated into your knowledge graph. The most surefire way of doing this is through creative problem-solving. Here’s what I used to do for a single topic when I used to train:
Gaining knowledge from experience is just how the subconscious knowledge augmentation process works, and that is why most high-rated people just tell you to practice whenever you ask for advice on how to get to some rating goal.
But there’s another way of gaining knowledge, and that is problem-setting. Though it might seem counterintuitive at first (of course, when you’re setting a problem, you’re just using your knowledge, right?), setting problems helps cement the contexts and concepts together.
One way is abstract problem-setting, and the other way is concrete problem-setting. Problem-setting on Polygon, for example, asks you to make a checker, validator and main solution. These three things might be related to very different ideas, and it can give you more ideas to add in the creative augmentation process.
However, always having to worry about these three can also limit the ideas you can come up with. Abstract problem-setting takes care of that. It is like setting problems for a math olympiad, where your proofs will be graded, and existence-type arguments are allowed. It lets you venture deeper into the territory of the idea, and by using generalization and abstraction, you will end up making new connections in the knowledge graph while strengthening pre-existing ones.
This section contains some resources for further reading. The first of these is an actual psychology book that deals with problem-solving. The second is a reference to a field that treats topics in psychology with mathematical rigor (for instance, the knowledge-gaining process forms an antimatroid, something I came across while writing my greedoids post). The others are either nice posts (and their comments) that resonate with some of the points made in this post, or general pieces of advice that I link to whenever people ask me some very common questions. The last one in particular has some unrelated math olympiad advice that I particularly like, so that’s a bonus.
Janet E. Davidson, Robert J. Sternberg, The Psychology of Problem Solving
This post on what makes problems difficult, which also uses a similar way to explain problem-solving.
https://blog.evanchen.cc/2019/01/31/math-contest-platitudes-v3/
https://blog.evanchen.cc/2014/07/27/what-leads-to-success-at-math-contests/
This post was originally written on Codeforces; relevant discussion can be found here.
Firstly, the CF Catalog is a great feature that makes it much easier to learn topics by having a lot of educational content in the same place, so thanks MikeMirzayanov!
I was originally going to post this as a comment under the catalog announcement post, but it became too large for a single comment, and it will probably need some more discussion.
I want to suggest the following improvements for the catalog (and initiate a discussion on how best to tackle these issues with the current version):
The reasons and detailed suggestions are as follows:
Multiple topics per post: Sometimes, a post might not fit into just one category at a time. For example, consider the following posts:
The current solution seems to be to just add duplicate entries in different sections, but that is not done for most of the above posts (and might not be the first thought that comes to mind for a lot of people, who will probably settle for adding a post to one section). My suggestion is to use some sort of community-assigned tag system (not the post tag system) for this task, where a single post can have multiple tags, similar to problem tags. Then the filtering can be done by tags rather than displaying the whole tree at once.
Blog submission for review: The idea behind currently allowing only people whose maximum rating is \(\ge 2400\) is to avoid spam and vandalism in the catalog. However, I think it is not optimal for the following two reasons:
My suggested solution, which fixes both of these, is to add another stage before posts are added to the catalog. Sure, reds can directly add things to the catalog. The idea is to also allow people who are lower rated (but bounded below by, say, expert) to suggest additions. Some kind of pull-request mechanism (implemented simply as a blog suggestion with tags/topics) that can be viewed by people who can add things to the catalog should work.
This way, the issues are resolved as follows:
This post was originally written on Codeforces; relevant discussion can be found here.
People often say C++ is much more cumbersome to write than any sane language should be, but I think that is mainly due to lack of knowledge about how to write modern C++ code. Here’s how you can write code more smartly and succinctly in C++(20):
end
Have you ever found yourself using a vector like a stack just because you have to also access the last few positions, or writing a sliding window code using a deque (where the size might be variable, but you want to access an element at some offset from the end)?
Chances are that you have written code like this:
while (stk.size() >= 3 && stk[stk.size() - 2] + stk[stk.size() - 1] > stk[stk.size() - 3]) {
stk.pop_back();
}
A cleaner way of writing the same thing is this:
while (stk.size() >= 3 && end(stk)[-2] + end(stk)[-1] > end(stk)[-3]) {
stk.pop_back();
}
This trick works not just for end, but for any iterator, provided that (roughly) there is no undefined behaviour like out-of-bounds accesses. Also, this can only be done with iterators that provide random access: iterators for vectors, for example, but not for sets.
len and str

Python’s len and str come in handy a lot of times, and they’re convenient to use because they’re free functions (that is, not member functions).

Fortunately, C++ has its own set of free functions that are quite helpful:

- size(v): returns the unsigned size of \(v\), and does the same thing as v.size(). That is, its numerical value is the size of \(v\), and its type is some unspecified unsigned type (aliased to size_t). This also works for arrays (but not raw pointers).
- ssize(v): returns the signed size of \(v\). This is quite useful, because it prevents bugs like a.size() - 1 on an empty a. It doesn’t have a member function analogue, though. This also works for arrays (but not raw pointers). If you use sz macros, you can safely remove them from your code!
- begin(v), end(v), rbegin(v), rend(v): they do the same thing as their corresponding member functions. These also work for arrays (but not raw pointers).
- empty(v): checks whether \(v\) is empty. Analogous to v.empty().
- data(v): returns a pointer to the underlying block of data, where applicable.
- to_string(x): returns the string representation of a value. It is only defined for things like int and double, though.

If you have used range-based for loops, you might have noticed that there is no support for indexing by default. So people usually either just switch to an index-based loop, or do the following:
int i = -1;
for (auto x : a) {
i++;
// do something
}
// or
{
int i = -1;
for (auto x : a) {
i++;
// do something
}
}
However, this either pollutes the parent scope with the variable i and makes things buggy, or just looks ugly. C++20 has the following feature that makes it cleaner:
for (int i = -1; auto x : a) {
i++;
// do something
}
Similarly, if there is an object such that you only have to inspect its properties just once and it becomes useless later on, you can construct it in the init part of the if condition, like this:
if (auto x = construct_object(); x.good() && !x.bad()) {
// do something
}
A lot of people write code that looks like this:
std::vector<std::tuple<int, char, std::string>> a;
// populate a
for (int i = 0; i < (int)a.size(); ++i) {
int x = std::get<0>(a[i]);
char y = std::get<1>(a[i]);
std::string z = std::get<2>(a[i]);
}
// or
for (auto& A : a) {
int x = std::get<0>(A);
char y = std::get<1>(A);
std::string z = std::get<2>(A);
}
// or
for (auto& A : a) {
int x; char y; std::string z;
std::tie(x, y, z) = A;
}
A cleaner and more consistent way of writing this can be:
for (auto [x, y, z] : a) {
}
// or
for (auto& A : a) {
auto [x, y, z] = A;
}
If you want references instead of copies stored in x, y, z, then you can do this:
for (auto& [x, y, z] : a) {
}
// or
for (auto& A : a) {
auto& [x, y, z] = A;
}
Note that this is not limited to for loops, or even to just tuples; you can use this sort of unpacking (structured binding) for most structs (in particular, custom structs you make, or std::pair). In fact, if you have std::array<int, 3>, then this unpacking will give you its contents in three different variables.
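As a small illustration of the point above, here is a sketch (the struct name Point and the function name binding_demo are made up for this example) showing structured bindings on a custom struct and on a std::array:

```cpp
#include <array>

// Illustrative sketch (Point is a made-up struct, not from the post):
// structured bindings unpack aggregates, std::pair/std::tuple, and
// std::array alike.
struct Point {
    int x;
    int y;
};

inline int binding_demo() {
    auto [px, py] = Point{3, 4};      // unpacks a custom struct
    std::array<int, 3> arr{1, 2, 3};
    auto& [a, b, c] = arr;            // references into the array
    b += 10;                          // writes through to arr[1]
    return px + py + arr[0] + arr[1] + arr[2];  // 3 + 4 + 1 + 12 + 3
}
```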
There are some lesser-known functionalities that C++20 introduced, like starts_with and ends_with on strings. Some of the more useful ones can be found here.
Notably, there is also support for binary searching like we do in competitive programming, and the relevant information can be found in the language specific section of this post.
You can also print a number in binary, analogously to what you can do in Python. This can be done by constructing a bitset out of the integer and converting it to a string, or just by printing std::format("{:b}", a) (which is unfortunately not on CF yet).
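A minimal sketch of the bitset approach mentioned above (the helper name to_binary is made up here):

```cpp
#include <bitset>
#include <string>

// Sketch: print an unsigned integer in binary by going through
// std::bitset; std::format("{:b}", a) does the same once <format>
// support is available on your compiler.
inline std::string to_binary(unsigned a) {
    std::string s = std::bitset<32>(a).to_string();
    auto pos = s.find('1');  // strip leading zeros, keep at least one digit
    return pos == std::string::npos ? "0" : s.substr(pos);
}
```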
emplace_back, emplace_front and emplace
Seriously, stop using a.push_back(make_pair(1, 2)) and a.push_back(make_tuple(1, 2, 3)) and even a.push_back({1, 2, 3}), and start using a.emplace_back(1, 2) instead.
The only reasonably common place this doesn’t work is when you have a container that stores std::array, so emplace_back doesn’t work; in that case a.push_back({1, 2, 3}) is fine.
The other variants, emplace_front and emplace, do the same thing for their counterparts (for example, in a deque, queue or priority_queue).
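A quick sketch of the difference (the function name emplace_demo is illustrative):

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Sketch: emplace_back constructs the element in place from the
// constructor arguments, so there is no pair/tuple temporary to
// spell out; emplace does the same on container adaptors.
inline std::size_t emplace_demo() {
    std::vector<std::pair<int, std::string>> v;
    v.emplace_back(1, "one");  // instead of v.push_back(std::make_pair(1, "one"))
    std::priority_queue<int> pq;
    pq.emplace(42);            // instead of pq.push(42)
    return v.size() + pq.size();
}
```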
Lambdas in C++ are in fact cleaner and more intuitive (with more features) than in Python, compared to both Python lambdas (you can have more than one line of code) and local functions (mutable state is more intuitive, and you can capture parameters from an ancestor scope). This is because you can capture things explicitly, have mutable state, pass them around as function objects, and even implement recursion using generic lambdas, though C++23 will have a feature called deducing this that will make a lot of things much simpler.
They can be passed around to almost all algorithms in the STL algorithms library, which is part of the reason why the STL is so powerful and generic.
They can also be used to make certain functions local, and avoid polluting the global namespace with function definitions. One more thing about them is that they are constexpr by default, which is not true for traditional functions.
You probably know what we’re coming to at this point: std::deque. It is essentially std::vector (without the \(O(1)\) move guarantees, but whatever), with one big difference: adding elements at either end never invalidates references or pointers to existing elements (iterators are technically invalidated, but the elements themselves never move). That is why it is ideal in a very important setting:
No more Python laughing in our face about having to worry about iterator invalidation!
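A small sketch of this stability guarantee (the function name is made up): references to elements of a std::deque stay valid across push_back, while a growing std::vector may move its elements.

```cpp
#include <deque>

// Sketch: pushing to either end of a std::deque never invalidates
// references/pointers to existing elements (unlike std::vector,
// which may reallocate and move everything).
inline bool deque_reference_stays_valid() {
    std::deque<int> d{1, 2, 3};
    int* first = &d.front();
    for (int i = 0; i < 1000; ++i) d.push_back(i);
    return first == &d.front() && *first == 1;
}
```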
Suppose you want to do something for multiple arguments, but they aren’t indices. Most people would either duplicate code, or do something like this:
vector<char> a{'L', 'R', 'D', 'U'};
for (auto x : a) { /* do something */ }
But this has a disadvantage: it is more verbose, and creating a throwaway vector is more error-prone. Instead, you can do this:
for (auto x : {'L', 'R', 'D', 'U'}) { /* do something */ }
or even
for (auto x : "LRDU"s) { /* do something */ }
Note that the s at the end makes it a string literal, which is why this works; const char* doesn’t work because there is a \0 at the end of the literal. This can still allocate on the heap (not in this case though; only for big strings, due to SSO).
In case you explicitly want an allocation with a vector for some reason, you can use
for (auto x : std::vector{'L', 'R', 'D', 'U'})
or
for (auto x : std::vector<char>{'L', 'R', 'D', 'U'})
C++20 has a feature called ranges, which allows you to do manipulations on, well, ranges.
Let’s think about how to implement this: For integers between 1 to 5, multiply them by 2, and print the last 3 elements in reversed order.
#include <bits/stdc++.h>
int main() {
std::vector<int> a{1, 2, 3, 4, 5};
for (auto& x : a) x *= 2;
for (int i = 0; i < 3; ++i) std::cout << a[(int)a.size() - i - 1] << '\n';
}
However, this is prone to having bugs (what about out of bounds errors in some other scenario?), and doesn’t look intuitive at all.
Now consider this:
#include <bits/stdc++.h>
namespace R = std::ranges;
namespace V = std::ranges::views;
int main() {
for (auto x : V::iota(1, 6) | V::transform([](int x) { return x * 2; }) | V::reverse | V::take(3))
std::cout << x << '\n';
}
Here the intent is pretty clear. You take integers in \([1, 6)\), multiply them by 2, reverse them, and take the first 3. Now in this resulting range of numbers, you print each of them.
So let’s see what happens here. The | operator is defined for a range and an operation that takes a range and returns a range. For example, V::transform(range, lambda) gives a range where every element is transformed, and range | V::transform(lambda) does the same thing. In fact, in a lot of cases the result isn’t stored in memory at all (operations are done on the fly while iterating), unless you do things like reversing a range.
One thing that you can finally do is split a string by a delimiter in a nice way (which used to be a major point of ridicule directed towards C++ by a lot of Python users). This is as simple as using a split view from the ranges library. For more information, see this. Here’s an example taken verbatim from the cppreference page:
#include <iomanip>
#include <iostream>
#include <ranges>
#include <string_view>
int main() {
constexpr std::string_view words{"Hello^_^C++^_^20^_^!"};
constexpr std::string_view delim{"^_^"};
for (const auto word : std::views::split(words, delim))
std::cout << std::quoted(std::string_view{word.begin(), word.end()}) << ' ';
}
A more common example can be like this:
#include <bits/stdc++.h>
namespace R = std::ranges;
namespace V = std::ranges::views;
int main() {
std::cin.tie(nullptr)->sync_with_stdio(false);
std::string s(1000000, '0');
for (int i = 0; i < 1000000 / 2; ++i) s[i * 2] = '1';
for (const auto word : V::split(s, '1'))
std::cout << std::string_view{begin(word), end(word)} << ' ';
}
Some ranges functionality is as follows:

- If r is a range, then for most STL containers, like std::vector, you can do something like std::vector<int> a(r.begin(), r.end()). C++23 has a feature R::to<ContainerType>(range) that will construct a container out of a range, and you could even do range | R::to<ContainerType>().
- V::iota is like Python’s range(), but without the stride.
- V::reverse is like Python’s reversed().
- V::keys takes a range whose elements are pair-like and returns a range of their first elements (so if we have (key, value) pairs, it will return the keys), similar to .keys() in Python. Similarly, V::values returns a range of the values.

Note that this is not the only thing you can do with ranges. In fact, for almost all algorithms that you can find in the algorithms library that use a begin and an end iterator (like std::sort or std::partial_sum or std::all_of), you can find a ranges version here, where you can just pass the range/container name instead of the begin and end iterators you usually pass.
By the way, if you don’t want to wait for C++23, you can use some of the functionality from this code, which I used to use a couple of years ago when C++20 was not a thing.
template <typename T>
struct iterable;
template <typename T>
struct iterable<T&> {
using type = T&;
};
template <typename T>
struct iterable<T&&> {
using type = T;
};
template <typename T, std::size_t N>
struct iterable<T (&&)[N]> {
using type = typename T::unsupported;
};
template <typename T>
struct iterator_from_iterable {
using iterable = typename std::remove_reference<T>::type&;
using type = decltype(std::begin(std::declval<iterable>()));
};
template <typename T>
struct iterable_traits {
using raw_iterable = T;
using raw_iterator = typename iterator_from_iterable<raw_iterable>::type;
using wrapped_iterable = typename iterable<T>::type;
using deref_value_type = decltype(*std::declval<raw_iterator>());
};
template <typename T>
struct Range {
private:
const T l, r, skip;
public:
struct It : public std::iterator<std::forward_iterator_tag, T> {
T i;
const T skip;
explicit It(T _i, T _skip) : i(_i), skip(_skip) {}
It& operator++() { return i += skip, *this; }
It operator++(int) {  // return by value: a reference to the local copy would dangle
auto temp = *this;
return operator++(), temp;
}
T operator*() const { return i; }
bool operator!=(const It& it) const { return (skip >= 0) ? (i < *it) : (i > *it); }
bool operator==(const It& it) const { return (skip >= 0) ? (i >= *it) : (i <= *it); }
};
using iterator = It;
using value_type = T;
Range(T _l, T _r, T _skip = 1) : l(_l), r(_r), skip(_skip) {
#ifdef DEBUG
assert(skip != 0);
#endif
}
Range(T n) : Range(T(0), n, T(1)) {}
It begin() const {
return It(l, skip);
}
It end() const {
return It(r, skip);
}
It begin() {
return It(l, skip);
}
It end() {
return It(r, skip);
}
};
template <typename... T>
struct zip {
public:
using value_type = std::tuple<typename iterable_traits<T>::deref_value_type...>;
using wrapped_iterables = std::tuple<typename iterable_traits<T>::wrapped_iterable...>;
using raw_iterators = std::tuple<typename iterable_traits<T>::raw_iterator...>;
using sequence = std::index_sequence_for<T...>;
struct It : public std::iterator<std::forward_iterator_tag, value_type> {
public:
explicit It(raw_iterators iterators) : iterators_(std::move(iterators)) {}
bool operator==(const It& it) const { return any_eq(it, sequence()); }
bool operator!=(const It& it) const { return !any_eq(it, sequence()); }
value_type operator*() const { return deref(sequence()); }
It& operator++() { return inc(sequence()), *this; }
It operator++(int) {  // return by value: a reference to the local copy would dangle
auto temp = *this;
return operator++(), temp;
}
private:
raw_iterators iterators_;
template <std::size_t... I>
bool any_eq(const It& it, std::index_sequence<I...>) const {
return (... || (std::get<I>(iterators_) == std::get<I>(it.iterators_)));
}
template <std::size_t... I>
value_type deref(std::index_sequence<I...>) const {
return {(*std::get<I>(iterators_))...};
}
template <std::size_t... I>
void inc(std::index_sequence<I...>) {
(++std::get<I>(iterators_), ...);
}
};
using iterator = It;
explicit zip(T&&... iterables) : iterables_(std::forward<T>(iterables)...) {}
It begin() { return begin_(sequence()); }
It end() { return end_(sequence()); }
It begin() const { return begin_(sequence()); }
It end() const { return end_(sequence()); }
private:
wrapped_iterables iterables_;
template <std::size_t... Int>
iterator begin_(std::index_sequence<Int...>) {
return iterator(std::tuple(std::begin(std::get<Int>(iterables_))...));
}
template <std::size_t... Int>
iterator end_(std::index_sequence<Int...>) {
return iterator(std::tuple(std::end(std::get<Int>(iterables_))...));
}
};
template <typename... T>
zip(T&&...) -> zip<T&&...>;
template <typename T>
struct enumerate {
public:
using size_type = typename std::make_signed<std::size_t>::type;
using wrapped_iterable = typename iterable_traits<T>::wrapped_iterable;
using raw_iterator = typename iterable_traits<T>::raw_iterator;
using value_type = std::pair<size_type, typename iterable_traits<T>::deref_value_type>;
struct It : public std::iterator<std::forward_iterator_tag, value_type> {
raw_iterator iter_;
size_type start_;
public:
It(raw_iterator it, size_type start) : iter_(it), start_(start) {}
bool operator==(const It& it) const { return iter_ == it.iter_; }
bool operator!=(const It& it) const { return iter_ != it.iter_; }
It& operator++() { return ++iter_, ++start_, *this; }
const It operator++(int) {
auto temp = *this;
return operator++(), temp;
}
value_type operator*() const { return {start_, *iter_}; }
};
using iterator = It;
explicit enumerate(T&& iterable, size_type start = 0) : iterable_(std::forward<T>(iterable)), start_(start) {}
It begin() { return It(std::begin(iterable_), start_); }
It end() { return It(std::end(iterable_), 0); }
It begin() const { return It(std::begin(iterable_), start_); }
It end() const { return It(std::end(iterable_), 0); }
private:
wrapped_iterable iterable_;
size_type start_;
};
template <typename T>
enumerate(T&&) -> enumerate<T&&>;
template <typename T, typename Index>
enumerate(T&&, Index) -> enumerate<T&&>;
Finally, I’d like to thank ToxicPie9, meooow and magnus.hegdahl for discussing certain contents of this post, and other members of the AC server for giving their inputs too.
This post was originally written on Codeforces; relevant discussion can be found here.
This post is intended to be an introduction to permutations for beginners, as there seem to be no resources other than cryptic (for beginners) editorials that talk about some cycles/composition, and people unfamiliar with permutations are left wondering what those things are.
We’ll break the post into three major parts based on how we most commonly look at permutations. Of course, even though we will treat these approaches separately (though not so much for the last two approaches), in many problems, you might need to use multiple approaches at the same time and maybe even something completely new. Usually, you’d use these ideas in conjunction with something like greedy/DP after you realize the underlying setup.
An ordering of the sections in decreasing order of importance according to the current permutations meta would be: cycle decomposition, permutation composition, ordering.
However, the topics are introduced in a way so that things make more sense, because there are some dependencies in definitions.
To start with, here’s the definition of a permutation that you might find in many problems on Codeforces:
Definition: A permutation of size \(n\) is an array of size \(n\) where each integer from \(1\) to \(n\) appears exactly once.
But why do we care about these arrays/sequences? The reason behind this is simple.
Consider any sequence \(a\) of \(n\) integers. We can index them with integers from \(1\) to \(n\). So the \(i\)-th integer in the sequence is \(a_i\). To understand the following, it helps to consider the example where \(a_i = i\) for all \(i\).
Let’s now shuffle this sequence (elements are allowed to stay at their original places). We will try to construct a permutation from this shuffling in two ways:
This means shuffling around elements (or permuting them, in standard terminology) is another way to think about permutations.
This shows us how permutations are related to functions from \(\{1, 2, \dots, n\}\) to itself such that all images are distinct (and all elements have a unique pre-image). In other words, a permutation can be thought of simply as a bijection on a set, which is the most natural definition to use for quite a few applications. The second interpretation is what corresponds to this definition, and the thing in the first definition is what we sometimes call the position array of the permutation in the second definition (this name is not standard). We will later see that it is what is called the inverse of the permutation.
Note that the above explanation is quite important for having an intuition on what a permutation is. If you don’t understand some parts, I recommend re-reading this and trying a few simple examples until you’re comfortable with what the permutation and functions \(f\) and \(g\) have to do with each other.
Note that a permutation \([a_1, a_2, \dots, a_n]\) of \([1, 2, \dots, n]\) corresponds to a function \(f\) on \(\{1, 2, \dots, n\}\) defined by \(f(i) = a_i\). Implicitly speaking, the set of pairs \((i, a_i)\) uniquely determines the permutation (it is just the set of mappings from position to element) and vice versa. To understand compositions and inverses (which will be introduced later on) better, the following notation can come in handy:
\[\begin{pmatrix} 1 & 2 & \cdots & n \\ a_1 & a_2 & \cdots & a_n \end{pmatrix}\]
Here we have just arranged the pairs vertically, from left to right. Note that these pairs can be shuffled around (like dominoes), and this can lead to some nice properties:
Note that when a permutation is sorted, it leads to \([1, 2, \dots, n]\). So any accumulation operation on the permutation array (like sum, product, xor, sum of squares, number of odd integers, etc.) that doesn’t rely on the order of elements in an array will give the same value for all permutations of the same length.
A fixed point of a permutation \(a\) is an index \(i\) such that \(a_i = i\). These are essentially the values that are not affected by the permutation at all, so if we disregard what happens to these values, we won’t be losing out on any information for that permutation \(a\). This is also why these \(i\)-s are sometimes skipped in the two-line notation (and also in the cycle notation, which will be introduced in the next section).
Note: If you are familiar with cycles or are on your second reading, you might note that this idea is probably more appropriate for cycles and compositions, but I kept it in this section since it shares some ideas with other subsections in this section. The same goes for the next section.
A derangement is a permutation with no fixed points. That is, for every \(i\), we have \(a_i \ne i\). One useful thing to know is how many of the \(n!\) permutations of size \(n\) are derangements. Let’s say this number is \(D_n\). There are at least three distinct ways to count them:
Solving linear equations
In this approach, we group all possible permutations by the number of their fixed points. If there are \(i\) fixed points, then upon relabeling the remaining \(n - i\) elements from \(1\) to \(n - i\), the resulting permutation is a derangement. So, the number of permutations on \(n\) elements with \(i\) fixed points is exactly \(\binom{n}{i} \cdot D_{n - i}\).
Summing this over all \(i\) gives us the identity \(n! = \sum_{i = 0}^n \binom{n}{i} \cdot D_{n - i}\).
Along with the fact that \(D_0 = 1\) and \(D_1 = 0\), we can use induction to show that \(D_n = n! \cdot \sum_{i = 0}^n \frac{(-1)^i}{i!}\). We can also use the following identity to come up with the same result:
If there are functions \(f\) and \(g\) from \(\mathbb{Z}_{\ge 0}\) to \(\mathbb{R}\), then the following two statements are equivalent:
Recursion
Let’s count the number of derangements using a recursion. Suppose \(a_n = k \ne n\). We look at \(a_k\). If \(a_k\) is \(n\), then there are \(D_{n - 2}\) ways of assigning the remaining numbers to the permutation. If \(a_k\) is not \(n\), then we note that this reduces to the number of derangements on \(n - 1\) numbers, by renaming \(n\) (to be assigned in the remaining \(n - 2\) slots) to \(k\).
So the recurrence becomes \(D_n = (n - 1) \cdot (D_{n - 1} + D_{n - 2})\), and this can be solved using induction too.
Using the principle of inclusion and exclusion
We can use an inclusion-exclusion argument to solve this problem too. Let \(S_i\) be the set of permutations that leave the \(i\)-th element fixed (the remaining are free to be permuted).
Then we need \(n! - |\cup_i S_i|\). Using the inclusion-exclusion principle, since the intersection of any \(k\) different \(S_i\)’s has size \((n - k)!\), we again get the same expression for \(D_n\).
An inversion in an array (not necessarily a permutation) is any pair of indices \((i, j)\) such that \(i < j\) and \(a_i > a_j\). What is the minimum number of inversions? It is 0, because you can just enforce \(a_i = i\). What is the maximum number of inversions? It is \(n(n - 1)/2\) because you can just enforce \(a_i = n - i + 1\), and then every unordered pair of distinct indices corresponds to an inversion.
Let’s consider the following task: For a given permutation (or an array with distinct elements), you are allowed to do the following operation any number of times:
Pick any \(1 \le i < n\) and swap \(a_i\) and \(a_{i + 1}\).
Is it possible to sort this array using these operations only?
If yes, what is the minimum number of operations if you want to sort this array using these operations only?
Note that if it is possible for a permutation, it is also possible for any array with distinct elements (we can just replace the elements in the array by their ranks when sorting in increasing order).
The answer to the first part is yes. The main idea is to pick the smallest element and keep swapping it with the previous element until it reaches the first position. Then do the same thing recursively for \([2, \dots, n]\).
Let’s think about the number of operations here. How many operations do we do in the first phase? Note that everything coming before \(1\) in the original permutation contributed to an inversion where the second element was \(1\). Now after \(1\) is in its intended place, there will be no more inversions that are related to \(1\) in any way. For the rest of the permutation, we can do this argument recursively. But for that, we need to see what happens to inversions where \(1\) was not involved. Note that the relative order of no other pair changes, so the other types of inversions remain the same! So, the total number of operations is clearly the number of inversions in the permutation.
Is this the minimum number of operations you need to sort the array? The answer turns out to be yes. Consider any operation that is done. Since it flips the order of two adjacent things, there is at most one inversion it can reduce. So the decrease in the number of inversions is at most \(1\). In a sorted array, there are no inversions. So the number of operations must be at least the number of inversions in the original permutation, and thus we are done.
Now you might think — how do we actually count the number of inversions in a permutation? The answer is you can do that using merge sort or by some point-update-range-sum data structure like Fenwick tree or segment tree. The merge sort solution can be found here, and for the data structure solution, consider the following:
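A possible sketch of the data-structure approach (the function name is my own): sweep the permutation from the right, and for each element, use a Fenwick tree over values to count how many smaller values have already been seen to its right.

```cpp
#include <vector>
using namespace std;

// Count inversions of a permutation of 1..n by sweeping from the right:
// for each a[i], count how many values smaller than a[i] were already seen.
long long count_inversions(const vector<int>& a) {
    int n = a.size();
    vector<int> bit(n + 1, 0);  // Fenwick tree over values 1..n
    auto update = [&](int i) { for (; i <= n; i += i & -i) bit[i]++; };
    auto query = [&](int i) {   // sum of frequencies of values 1..i
        int s = 0;
        for (; i > 0; i -= i & -i) s += bit[i];
        return s;
    };
    long long inv = 0;
    for (int i = n - 1; i >= 0; i--) {
        inv += query(a[i] - 1);  // smaller values already to the right
        update(a[i]);
    }
    return inv;
}
```

Both updates and queries are \(O(\log n)\), so the whole count takes \(O(n \log n)\), matching the merge sort approach.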
An increasing subsequence of an array \(a\) (not necessarily a permutation) is a sequence of indices \(i_1 < i_2 < \dots < i_k\) such that \(a_{i_1} < a_{i_2} < \dots < a_{i_k}\). Decreasing subsequences are defined similarly.
An algorithm to find a longest increasing (or, with a few modifications, non-decreasing) subsequence of an array can be found in this nice video tutorial. But that is not what we are concerned with at the moment.
We are concerned with bounds on the length of a longest increasing subsequence (LIS from here on). For instance, in a strictly decreasing array, the LIS has length only \(1\). The Erdos Szekeres theorem tells us that in such cases, the length of the longest decreasing subsequence must be large.
More formally, the theorem states that in any permutation (or array with distinct elements) of size at least \(xy + 1\), there is either an increasing subsequence of length \(x + 1\) or a decreasing subsequence of length \(y + 1\).
The easiest way to prove this theorem is via an application of the Pigeonhole principle.
Suppose for the sake of contradiction that the theorem is false for some permutation \(a\). For every \(i\), consider the length of a longest increasing subsequence ending at index \(i\) and the length of a longest decreasing subsequence ending at index \(i\). Let’s call these numbers \(x_i\) and \(y_i\). Note that all \(x_i\)-s are integers between \(1\) and \(x\), and all \(y_i\)-s are integers between \(1\) and \(y\). So there can be at most \(xy\) distinct pairs \((x_i, y_i)\). By the Pigeonhole principle, there are \(i < j\) such that \((x_i, y_i) = (x_j, y_j)\). Since all elements are distinct, \(a_i < a_j\) or \(a_i > a_j\). In the first case, it is impossible that \(x_i = x_j\), and in the latter, it is impossible that \(y_i = y_j\). This is a contradiction, and we are done.
A more sophisticated and deeper proof of this theorem can be done using Dilworth’s theorem. This blog uses it to prove a special case of this theorem, though the proof can be modified easily to give the full statement too.
Note that permutations are just sequences of integers, so it is possible to sort the set of all possible sequences of size \(n\) lexicographically (i.e., in the order you would find words in a dictionary). This defines a natural indexing of each permutation. How do we find the next permutation from a given permutation?
The easiest way (in C++) is to use std::next_permutation, but we’ll briefly describe how it works.
Let’s find the first index \(i\) from the right such that \(a_{i - 1} < a_i\). Since \(i\) was the first index satisfying this condition, all indices \(i\) to \(n\) must form a decreasing sequence. Note that the smallest number in this sequence that is larger than \(a_{i - 1}\) will be the new element at position \(i - 1\), and the rest of them (along with the current \(a_{i - 1}\)) will be sorted in increasing order after it. All of this can be implemented in time \(O(n - i + 1)\).
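A sketch of this procedure (my own implementation, mirroring what std::next_permutation does for a vector) might look like:

```cpp
#include <algorithm>
#include <vector>
using namespace std;

// Transform a into its lexicographic successor in place.
// Returns false (leaving a sorted) if a was already the last permutation.
bool next_perm(vector<int>& a) {
    int n = a.size();
    int i = n - 1;
    while (i > 0 && a[i - 1] >= a[i]) i--;  // a[i..n-1] is decreasing
    if (i == 0) { reverse(a.begin(), a.end()); return false; }
    int j = n - 1;                          // smallest suffix element > a[i-1]
    while (a[j] <= a[i - 1]) j--;
    swap(a[i - 1], a[j]);
    reverse(a.begin() + i, a.end());        // suffix becomes increasing
    return true;
}
```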
Note that starting from the lexicographically smallest permutation (which is \([1, 2, \dots, n]\)), the number of permutations between (both inclusive) this permutation and a permutation whose first position \(i\) such that \(a_i \ne i\) is \(k\), is at least \((n - k)! + 1\). This means that if you apply next_permutation even a large number of times, the number of elements in the permutation that will ever change will not be large (unless you start from a specific permutation, and even in that case, apart from a single change for potentially all indices, the same conclusion holds).
So if for a permutation of length \(n\), you do the next_permutation operation (as implemented above) \(O(n^k)\) times, the time taken will be (much faster than) \(O(k n^k \log n)\). You can even bound the number of operations to perform next_permutation \(r\) times by \(O(r + n)\). The analysis is similar to the complexity analysis when you increment the binary representation of an \(n\)-bit number \(r\) times.
There are a total of \(n!\) permutations of size \(n\). How do we construct a random permutation if all we have is a random number generator that can give us integers in a range we can choose?
For thinking about how we can incrementally construct such a permutation, let’s try to remove all elements \(> k\) for some \(k\). Note that the position of \(k\) in this array is equally likely to be any integer from \(1\) to \(k\). However, this won’t lead to a very efficient algorithm right away.
We would want to rather construct a random permutation from left to right. Suppose we have a prefix chosen already. Note that the permutations of all elements on the right are equally likely. Also, all elements on the right are equally likely to be the first element of the suffix (i.e., the last element of the just-larger prefix). This observation leads to Durstenfeld’s variation of the Fisher-Yates shuffle.
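A minimal sketch of Durstenfeld’s variant (the function name and the choice of mt19937 are my own): fix the prefix one position at a time, choosing the next element uniformly from the remaining suffix.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>
using namespace std;

// Durstenfeld's variant of the Fisher-Yates shuffle: position i receives
// a uniformly random element of the not-yet-fixed suffix, so every
// permutation of 1..n is produced with probability 1/n!.
vector<int> random_permutation(int n, mt19937& rng) {
    vector<int> a(n);
    iota(a.begin(), a.end(), 1);  // start from the identity [1, 2, ..., n]
    for (int i = 0; i < n - 1; i++) {
        uniform_int_distribution<int> dist(i, n - 1);
        swap(a[i], a[dist(rng)]);
    }
    return a;
}
```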
Now let’s think about some statistics of random permutations.
Firstly, what is the expected length of the increasing sequence picked by the following greedy algorithm: scan the permutation from left to right, and pick an element if and only if it is larger than everything picked so far.
Note that an element is picked if and only if it is more than everything before it. That is, we can do the following using linearity of expectation:
\(E[l] = E\left[\sum_i \mathbf{1}[a_i \text{ chosen}]\right] = \sum_i P[a_i \text{ chosen}] = \sum_i P[a_i > \max(a_1, \dots, a_{i - 1})] = \sum_i \frac{1}{i}\), which is the harmonic sum, and is approximately \(\ln n\).
However, it can be shown that the expected length of the LIS is \(\Theta(\sqrt{n})\), so a greedy approach is much worse than computing the LIS systematically.
For more random permutation statistics that also have some information on statistics derived from the next few sections, I would refer you to this.
We will now switch gears and note a very important part of the theory of permutations — the cycle decomposition. You will find yourself referring to this quite a lot while solving problems on permutations.
Suppose we have a permutation \(a\). Let’s fix some \(i\). Let’s try to look at the sequence \(i\), \(a_i\), \(a_{a_i}\) and so on. Eventually, it must repeat because there are at most \(n\) values that this sequence can take. For the sake of convenience, let’s call the \(j\)-th element of this sequence \(b_j\).
Then for some \(k < l\), we have \(b_k = b_l\). Let’s take \(k\) to be the smallest such index, and let \(l\) be the smallest such index for that corresponding \(k\). If \(k\) is not \(1\), then we must have \(b_{k - 1} = b_{l - 1}\), because \(a\) is a bijection. However, this contradicts the minimality of \(k\)! This means that the first element that repeats in the sequence is, in fact, \(i\).
Now suppose something in the sequence between \(1\) and \(l\) repeats (let’s say at positions \(1 < m < o < l\)). Then by repeating the bijection argument \(m - 1\) times, we must have \(b_{o - m + 1} = b_1 = i\), so \(o - m + 1 \ge l\). But this is a contradiction. This means that the sequence repeats starting from \(l\) and forms a cycle.
We call this a cycle because if we made the graph where there was a directed edge from \(i\) to \(a_i\), then this \(i\) would be in a cycle of length \(l - 1\).
Now, this was done for a single \(i\). Doing this for all \(i\) means that all elements are in a cycle in this graph. Clearly, since the indegree and outdegree of each \(i\) are \(1\), this means that these cycles are all disjoint.
Let’s take an example at this point. Consider the permutation \([2, 3, 1, 5, 4]\). The cycles for each element are as follows:
These correspond to the cycles \(1 \to 2 \to 3 \to 1\) and \(4 \to 5 \to 4\) in the graph.
Note that these cycles are independent. That is, we can just say that for these cycles, any element outside the cycle is irrelevant. So in this sense, for any permutation, we can just list out its cycles and we will be able to determine the permutation uniquely.
So we just represent a permutation as a list of its cycles. For example, the permutation \([2, 3, 1, 5, 4]\) can be represented as \((1 2 3)(4 5)\).
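Finding all cycles is a standard linear-time routine; a possible sketch (1-indexed arrays with index \(0\) unused, names my own):

```cpp
#include <vector>
using namespace std;

// Decompose a 1-indexed permutation (a[0] unused) into disjoint cycles.
// Each cycle is listed starting from its smallest element.
vector<vector<int>> cycle_decomposition(const vector<int>& a) {
    int n = a.size() - 1;
    vector<bool> seen(n + 1, false);
    vector<vector<int>> cycles;
    for (int i = 1; i <= n; i++) {
        if (seen[i]) continue;
        vector<int> cycle;
        for (int j = i; !seen[j]; j = a[j]) {  // follow i -> a_i -> a_{a_i} ...
            seen[j] = true;
            cycle.push_back(j);
        }
        cycles.push_back(cycle);
    }
    return cycles;
}
```

On the example above, \([2, 3, 1, 5, 4]\) decomposes into \((1 2 3)(4 5)\).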
Note: there is a deeper meaning behind this notation that is related to composition and how disjoint cycles commute for composition.
Consider the following task: for a permutation, compute the value of \(a_{a_{\ddots_{i}}}\) for each \(i\) where there are \(k\) \(a\)-s, \(k \le 10^{18}\).
Note that if you can find cycles, you can do a cyclic shift for each cycle (appropriately computed modulo the cycle length) and give the answer.
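A sketch of this idea (names my own): find each cycle, and rotate it by \(k \bmod (\text{cycle length})\) positions.

```cpp
#include <vector>
using namespace std;

// Compute b = a^k, i.e. b[i] = a applied k times to i (1-indexed, a[0]
// unused), by rotating every cycle by k mod (its length) positions.
vector<int> permutation_power(const vector<int>& a, long long k) {
    int n = a.size() - 1;
    vector<int> b(n + 1);
    vector<bool> seen(n + 1, false);
    for (int i = 1; i <= n; i++) {
        if (seen[i]) continue;
        vector<int> cycle;
        for (int j = i; !seen[j]; j = a[j]) { seen[j] = true; cycle.push_back(j); }
        int len = cycle.size();
        int shift = k % len;
        for (int t = 0; t < len; t++)
            b[cycle[t]] = cycle[(t + shift) % len];
    }
    return b;
}
```

Since \(k\) only ever appears reduced modulo a cycle length, \(k \le 10^{18}\) poses no problem.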
What if it was not guaranteed that \(a\) is a permutation, but all elements in \(a\) are in \([1, n]\) nevertheless? The cycle argument breaks down here, but you can show the following fact:
The same problem can be solved in this setting, too, by using binary lifting for each vertex till it reaches the “main” cycle of its component.
In the language of graph theory, such graphs are called functional graphs (graphs in which every vertex has outdegree exactly \(1\)).
Some problems to practice on are Planet Queries (I and II) on CSES.
Let’s start from the array \([1, 2, \dots, n]\) and try to apply swaps on this array till we reach some desired permutation \(a = [a_1, a_2, \dots, a_n]\). Note that we can always apply such swaps, by trying to build the permutation from left to right. These swaps are also called transpositions.
Let’s now think about what happens when we apply a swap to a permutation, say of the elements at positions \(x\) and \(z\). Looking at the graph associated with this permutation, the swap is equivalent to breaking the two edges \(x \to y\) and \(z \to w\) and making the two edges \(x \to w\) and \(z \to y\) (in something like a crossing manner).
We make two cases: if the two swapped positions lie in the same cycle, the swap splits that cycle into two; if they lie in different cycles, the swap merges those two cycles into one. In particular, every swap changes the number of cycles by exactly one.
A relevant problem at this point would be 1768D.
Let’s consider another permutation sorting problem, but here rather than just adjacent elements being swapped, we allow swapping any elements. What is the minimum number of swaps to sort a permutation this way?
Note that in the final result, there are exactly \(n\) cycles (singletons). Let’s say we currently have \(c\) cycles. Since the number of cycles increases by at most 1 each time, there must be at least \(n - c\) swaps to sort this permutation.
Is this achievable? Yes. We can keep swapping elements in the same cycle while there is a cycle of length at least \(2\), and this will sort the permutation.
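The argument above translates directly into code: count the cycles and answer \(n - c\). A possible sketch (1-indexed, name my own):

```cpp
#include <vector>
using namespace std;

// Minimum number of arbitrary swaps to sort a 1-indexed permutation
// (a[0] unused): n minus the number of cycles, since a sorted permutation
// has n singleton cycles and each swap changes the cycle count by one.
int min_swaps_to_sort(const vector<int>& a) {
    int n = a.size() - 1;
    vector<bool> seen(n + 1, false);
    int cycles = 0;
    for (int i = 1; i <= n; i++) {
        if (seen[i]) continue;
        cycles++;
        for (int j = i; !seen[j]; j = a[j]) seen[j] = true;
    }
    return n - cycles;
}
```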
Suppose we have two permutations \(a\) and \(b\) of the same size \(n\). Now we want to define a composition of them as follows: \(ab\) is defined as the array \(c\) where \(c_i = a_{b_{i}}\).
This essentially corresponds to finding the composition of the \(g\)-functions of \(a\) and \(b\), with \(g\) for \(b\) being applied first.
Let’s get an intuitive sense of what it does, by getting some information out of the definition. Let’s first deal with the case when \(b\) is the “identity” permutation: \(b_i = i\). Then \(c = a\) according to this definition. Similarly, if \(a\) is the identity permutation, then \(c = b\). So, composing with the identity permutation in any order gives the original permutation back.
Now let’s understand this by using the two-line notation. The idea is to take the two-line notation for both permutations and do some sort of pattern matching to get to their composition.
Consider the following permutations \(a\) and \(b\).
\[\begin{pmatrix} 1 & 2 & \cdots & n \\ a_1 & a_2 & \cdots & a_n \end{pmatrix}\]
and
\[\begin{pmatrix} 1 & 2 & \cdots & n \\ b_1 & b_2 & \cdots & b_n \end{pmatrix}\]
Let’s reorder the columns of the two-line notation for \(a\) so that the lower row of \(b\) and the upper row of \(a\) match. This corresponds to shuffling the columns of \(a\) according to \(b\):
\[\begin{pmatrix} b_1 & b_2 & \cdots & b_n \\ a_{b_1} & a_{b_2} & \cdots & a_{b_n} \end{pmatrix}\]
Note that this is still a valid two-line notation for \(a\). Now, if we “merge” this with the two-line notation for \(b\), this gives us the two-line notation for the composition of \(a\) and \(b\).
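In code, the definition \(c_i = a_{b_i}\) translates directly; a minimal sketch (1-indexed arrays with index \(0\) unused, name my own):

```cpp
#include <vector>
using namespace std;

// Composition c = ab, i.e. c[i] = a[b[i]]: b is applied first, then a.
// Permutations are 1-indexed with index 0 unused.
vector<int> compose(const vector<int>& a, const vector<int>& b) {
    int n = a.size() - 1;
    vector<int> c(n + 1);
    for (int i = 1; i <= n; i++) c[i] = a[b[i]];
    return c;
}
```

Composing with the identity permutation on either side indeed returns the original permutation.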
Note how this idea of composition arises naturally from the interpretation of a permutation as a bijection on a set.
We start with the identity permutation, apply the first permutation to it, and then apply the second permutation to it. Applying means replacing \(x \mapsto a_x\). The two-line notation for a permutation \(a\) has the following property:
Note that we can associate a cycle itself with a permutation, where the non-cycle elements are fixed points, and the remaining are mapped according to the cycle. In this sense, cycles are permutations.
Also note that a swap itself is a permutation in the same manner. In fact, a swap is just a cycle of length 2.
It is helpful to play around with a few examples of swaps and cycles and write them in two-line notation to understand their mappings and try composing them.
At this point, you should also try to recall the following two things:
When we introduced cycles, we wrote them in a certain format. In fact, you can compose disjoint cycles without worrying about the order in which they are composed. This should be obvious from the fact that they are independent in a way (an element that gets moved in one permutation is a fixed point of the other). You can also see that the notation \((1 2 3)(4 5)\) also conveniently captures that this permutation is a composition of the cycles \((1 2 3)\) and \((4 5)\).
When we discussed swaps, we noted that you could “apply” swaps. This corresponds to left-multiplying the permutation with the corresponding swap permutation. It is instructive to come up with a few examples of swaps and how they compose at this point.
It is important to note that (by using the fact that function composition is associative) permutation composition is associative.
Let’s say we have a permutation that we arrived at by applying the following three permutations in this order: \((12)\), \((312)(43)\) and \((53241)\); we would write it as \((53241)(312)(43)(12)\). Note that here \((12)\) and \((43)\) are cycles of length \(2\), i.e., they are swaps.
Now how do we reduce this back to a normal permutation? One way would be to write out the permutation for each cycle and then keep composing them. A cleaner way of doing the same thing is the following: for every element, start from the right, and replace it with what each cycle maps it to. For instance, if we want to find the image of \(1\), the first cycle maps \(1\) to \(2\), the second cycle maps \(2\) to \(2\), the third cycle maps \(2\) to \(3\), and the fourth cycle maps \(3\) to \(2\). So the composition maps \(1\) to \(2\).
Recall that we mentioned earlier that an adjacent swap reduces the number of inversions by at most \(1\). Well, to be more precise, it changes the number of inversions by \(\pm 1\), and hence flips the parity of the number of inversions of the permutation.
We define the parity of the permutation to be the parity of the number of inversions of the permutation.
The reason why this is important is that it gives a handle on the parity of swaps (not just adjacent swaps) as well as some information about the cycle parities, as follows.
Let’s consider a swap that swaps elements at positions \(i < j\). One way to get to this would be to apply \(j - i\) adjacent swaps to bring \(a_j\) to position \(i\), and then \(j - i - 1\) adjacent swaps to bring \(a_i\) (from its new position) to position \(j\). This takes an odd number of adjacent swaps in total, so the parity of the number of inversions is flipped.
As a corollary, applying any swap changes the parity of the permutation, and in particular, all swap permutations have odd parity (parities add up modulo \(2\) over composition, as can be seen easily).
Now note that for a cycle, we can construct it by doing a few swaps. More specifically, if the cycle is \((i_1, i_2, \dots, i_k)\), then we can do \(k - 1\) swaps to apply the same transformation as the cycle permutation.
So all even-sized cycles have odd parity, and all odd-sized cycles have even parity.
Let’s consider any decomposition of a permutation into (possibly non-disjoint) cycles. As a result of the above discussion, the parity of the permutation is odd iff the number of even-sized cycles is odd.
In particular, for a decomposition of a permutation into swaps, the parity of the permutation is the same as the parity of the number of swaps in the decomposition. Note that this also implies that we can’t have two decompositions with an odd difference in the number of swaps.
Rephrasing the above result, we have the fact that if it takes an odd number of swaps to get to a permutation, the permutation has odd parity and vice versa.
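Since a cycle of length \(l\) decomposes into \(l - 1\) swaps, the parity of a permutation can be computed as \((n - \text{number of cycles}) \bmod 2\). A possible sketch (1-indexed, name my own):

```cpp
#include <vector>
using namespace std;

// Parity of a 1-indexed permutation (a[0] unused): 0 if even, 1 if odd.
// Each cycle of length l contributes l - 1 swaps, so the total number of
// swaps is n minus the number of cycles.
int permutation_parity(const vector<int>& a) {
    int n = a.size() - 1;
    vector<bool> seen(n + 1, false);
    int cycles = 0;
    for (int i = 1; i <= n; i++) {
        if (seen[i]) continue;
        cycles++;
        for (int j = i; !seen[j]; j = a[j]) seen[j] = true;
    }
    return (n - cycles) % 2;
}
```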
Clearly, from the above discussion, it can be seen that parities add modulo \(2\) over composition. This, however, is important enough to warrant its own section because it summarizes all of the discussion above.
Firstly, let’s think about permutations in terms of application. Is there a way to get back the original permutation after some permutation has been applied to it? From our remarks on the two-line notation, it suffices to do this for the case where the initial permutation is the identity permutation. Let’s say the applied permutation was \(a\):
\[\begin{pmatrix} 1 & 2 & \cdots & n \\ a_1 & a_2 & \cdots & a_n \end{pmatrix}\]
By our remarks on composition using the two-line notation, we need something that swaps these two rows as follows:
\[\begin{pmatrix} a_1 & a_2 & \cdots & a_n\\ 1 & 2 & \cdots & n \end{pmatrix}\]
Note that this is easy: consider the position array \(p\) of \(a\) as we defined earlier; i.e., \(p_i\) is the position where \(a\) equals \(i\), i.e., \(a_{p_i} = i\). Note that the value of \(a\) at position \(i\) is \(a_i\), so \(p_{a_i}\) is the position where \(a\) equals \(a_i\), i.e., \(p_{a_i} = i\). This shows that \(pa = ap\), and this common value is the identity permutation (which we shall denote by \(\text{id}\) in what follows).
In other words, \(p\) and \(a\) are inverses of each other under composition. It is trivial to note that \(a\) is the position array of \(p\) as well. We denote \(p\) by \(a^{-1}\).
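Computing the inverse is a one-pass scatter, directly from \(p_{a_i} = i\) (1-indexed, name my own):

```cpp
#include <vector>
using namespace std;

// Inverse (position array) p of a 1-indexed permutation a (a[0] unused):
// p[a[i]] = i, and equivalently a[p[i]] = i.
vector<int> inverse(const vector<int>& a) {
    int n = a.size() - 1;
    vector<int> p(n + 1);
    for (int i = 1; i <= n; i++) p[a[i]] = i;
    return p;
}
```

As expected, inverting twice gives back the original permutation.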
Now that we have some intuition about the inverse in terms of the two-line notation, we think about it in terms of cycles, swaps, and compositions:
In particular, for a decomposition of a permutation into (not necessarily disjoint) cycles, we can just take a mirror image of the decomposition, and it will correspond to the inverse of the permutation!
Also, the parity of the inverse is the same as the parity of the original permutation.
An involution is a permutation whose inverse is itself. Consider the disjoint cycle decomposition of the permutation \(a = c_1 \dots c_k\). \(a = a^{-1}\) means \(\text{id} = a^2 = \prod_i c_i^2\). Note that we are able to write this because disjoint cycles commute, and so do identical cycles. \(c_i^2\) should be an identity map on the elements of \(c_i\) because else the resulting permutation won’t be \(\text{id}\). This means that \(c_i\) is either of size \(1\) or \(2\). Conversely, it can be checked that these permutations are involutions.
Since permutation composition is associative, we can use binary exponentiation to compute the permutation. A more mathematical way to think about it is to use the same trick as in the previous subsection. With definitions the same as in the previous section, for any \(a\), \(a^k = \prod_i c_i^k\), so the cycle structure is determined by the powers of the cycles.
Finding the \(k\)-th power of a cycle just corresponds to mapping each element to its \(k\)-th next neighbor in the cycle. If the length of the cycle is \(c\), and \(g = \gcd(c, k)\), then the disjoint cycle decomposition of \(c^k\) consists of \(g\) cycles, each of length \(c / g\), corresponding to a stride of \(k\) along the original cycle.
The order of a permutation \(a\) is defined as the least \(k\) such that \(a^k\) is the identity permutation. Let’s again look at the cycle decomposition as in the previous subsection. For a cycle \(c\) of length \(l\) to be “dissolved” into singletons, it is necessary and sufficient to have the power of the permutation divisible by \(l\). Hence, the order of the permutation is just the LCM of all cycle lengths.
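The LCM characterization makes the order easy to compute in linear time plus LCM costs; a sketch (1-indexed, name my own, std::lcm requires C++17):

```cpp
#include <numeric>
#include <vector>
using namespace std;

// Order of a 1-indexed permutation (a[0] unused): the least k with a^k = id,
// which is the LCM of all cycle lengths.
long long permutation_order(const vector<int>& a) {
    int n = a.size() - 1;
    vector<bool> seen(n + 1, false);
    long long order = 1;
    for (int i = 1; i <= n; i++) {
        if (seen[i]) continue;
        long long len = 0;
        for (int j = i; !seen[j]; j = a[j]) { seen[j] = true; len++; }
        order = lcm(order, len);  // std::lcm from <numeric>
    }
    return order;
}
```

Note that the order can be exponential in \(n\) (Landau’s function), so a 64-bit result can overflow for large \(n\).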
For this section, we will consider the following problem: 612E. The solution is left as an exercise to the reader (it shouldn’t be hard, considering what we have discussed in the last few sections).
Suppose \(a, b\) are two permutations. We think about the following permutation: \(aba^{-1}\). If we think about how the two-line notation for these permutations combines, we can note that this is just the permutation whose cycle decomposition is almost the same as \(b\), but \(a\) is applied to each element in the cycles. In other words, we “reinterpret” the cycles by mapping to and from a separate “ground set” via permutation \(a\).
For a formal statement, let the cycle decomposition of \(b\) be \(c_1 \cdots c_k\), where \(c_i = (x_{i1}, \dots, x_{il_i})\). We claim that \(aba^{-1}\) is \(c'_1 \cdots c'_k\) where \(c'_i = (a_{x_{i1}}, \dots, a_{x_{il_i}})\).
The proof formalizes the earlier argument. Consider an element \(i\), and we look at what happens to \(i\) as each permutation is applied to it:
Since both of them give the same result, we are done.
This says that the operation \(aba^{-1}\) is nothing but a translation of the labels of \(b\) according to \(a\).
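The relabeling claim is easy to sanity-check in code; a sketch computing \(aba^{-1}\) straight from the definition (1-indexed, name my own):

```cpp
#include <vector>
using namespace std;

// Compute the conjugate a b a^{-1} from the definition:
// c[i] = a[b[ainv[i]]], where ainv is the inverse of a.
// Equivalently: the cycles of b with every label x replaced by a[x].
vector<int> conjugate(const vector<int>& a, const vector<int>& b) {
    int n = a.size() - 1;
    vector<int> ainv(n + 1), c(n + 1);
    for (int i = 1; i <= n; i++) ainv[a[i]] = i;
    for (int i = 1; i <= n; i++) c[i] = a[b[ainv[i]]];
    return c;
}
```

For example, conjugating the cycle \((1 2 3)\) by the swap \((1 2)\) gives the cycle \((2 1 3)\), exactly the relabeled cycle.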
Note that by this result, since we are just applying the permutation to everything inside the cycles, we can map any product of disjoint cycles to any other product of disjoint cycles, provided the multiset of cycle sizes doesn’t change.
Let’s call \(b\) and \(c\) conjugate if there is an \(a\) such that \(c = aba^{-1}\), and call this operation conjugation.
Conjugacy is an equivalence relation (after verifying a couple of axioms), so it partitions all permutations into equivalence classes, where everything inside a class is conjugate to everything else in the class. Note that an equivalence class is uniquely determined by the multiset of cycle sizes. These equivalence classes are called conjugacy classes.
This section will mostly be about some very brief references and is intended to tell people about some topics they might not have seen or heard about before, which are relevant to this discussion. There might be posts in the future regarding these topics, but it’s still a good idea to read up on these yourself.
The set of permutations of size \(n\) forms what is called a group and is denoted by \(S_n\), and it turns out that all finite groups of size \(n\) are isomorphic to some subgroup of \(S_n\). This is why there has been a lot of work on permutations from a group theoretic perspective. The section on permutation composition is best viewed under a group theoretic lens.
The analysis of a lot of games can be done using group theory, for instance, the 15-puzzle, the Rubik’s cube, Latin squares, and so on.
There is a subgroup of \(S_n\) consisting of all even permutations, called the alternating group. This group is known to be simple for \(n \ge 5\) and can be used to show results about the 15-puzzle, for example.
Using group theoretic ideas, the Burnside lemma and the Polya enumeration theorem come into the picture, and they help to count classes of objects (equivalent under some relation) rather than objects themselves.
Stirling numbers are quite useful when counting things related to permutations, and they deserve a whole post for themselves.
Note that in the above discussions, especially in the section on ordering, we used arrays of distinct elements and permutations interchangeably. Whenever something only depends on the \(\le, <, >, \ge, =\) relations between the elements and not the elements themselves, it makes sense to replace elements with their rank. In the case of distinct elements, this “compressed version” becomes a permutation.
Just like the failed idea in the generation of permutations, that idea can also be used for dp. More details can be found in this post.
Closely related to the LIS and other ordering concepts is the idea of Young tableaux. This has a very interesting theory, and this post is a good reference for some of the results.
Suppose you have a subgroup generated by some permutations, and you want to find the size of the subgroup. The Schreier-Sims algorithm helps in finding this size, and a good post can be found here.
There is a cool and interesting data structure that can do different kinds of queries on permutations and subarrays. This post is a good introduction to the data structure and what it can do.
I have added some standard problems whenever suitable, but for practicing permutations well, it makes sense to practice on a larger set of problems. Since there is a very large number of problems on permutations, listing just the few that I know would not do justice to the variety of problems that can be set on permutations. A couple of possible methods of practicing are the following:
This post was originally written on Codeforces; relevant discussion can be found here.
I’ve been involved in testing a lot of Codeforces contests and in coordinating contests for my uni, and have seen various kinds of issues crop up. However, I was able to come up with a stable system for my uni contests that helped make the experience as educational and error-free as possible, and I have found the lack of some of those practices to be the main reason behind issues faced by Codeforces contests.
Given that it takes months to get a round ready, it is better to be overly careful than to get your round unrated or receive bad feedback in general.
I’d like to thank Ari, null_awe and prabowo for reviewing the post and discussing more about problem-setting.
The current Codeforces contest guidelines are here for problem-setters and here for testers, and they have some great sanity checks that will help you make a good contest. But I still think they miss some things that lead to bad contests at times.
Let’s list out some issues that plague contests.
Out of these, the most glaring issue is having a wrong intended solution, which immediately leads to the contest being made unrated.
I think the fundamental problem behind most of these issues is two-fold:
Whenever I coordinated contests at my uni, I sent out a couple of guideline documents for setters and testers, that built upon these two ideas. The guiding principle for the first of these reasons (that every author and tester should understand, in my opinion) is consistency (and proofs of it). Not just one type of consistency, but consistency of the intent, constraints, statement, checker, validator and tests — all at the same time.
NOTE: Before we go forward, let’s clarify what we mean by a formal proof. In this comment, I clarified that we are not talking about scary super-technical proofs (such as ones you might encounter in advanced set theory). A proof that is logically sound, where steps follow from basic logical reasoning and don’t have gaps, is good enough. I call it a “formal proof” just because people on Codeforces call anything that looks reasonable a “proof” (which is obviously wrong and misleading) — and that is not what I am talking about. If you’re not very familiar with the idea of a mathematical proof, don’t worry. It is not meant to be too hard. I would recommend reading the chapter on proofs in this book, to understand what kind of logical statements I am talking about, and the idea behind proofs generally.
Let’s understand this by asking ourselves the following questions:
We will now list the guidelines that help make contests error-free. Most of these will have both an author and a tester part. Keep in mind that these guidelines will be along the lines of a security perspective.
Note that there are some things that we can completely verify in a mathematical sense by proofs. But there are also things that we can't be sure of; one reason is that the halting problem is undecidable. So, for those kinds of things, we enforce limits on the runtime and memory, and since we also want to check the behavior of submitted solutions on large inputs, we inevitably miss a lot of tests due to the combinatorial explosion in the number of possible tests for most problems.
Firstly, let’s see what we can prove the correctness of. Keep in mind that a proof of consistency is always an “if and only if” argument.
This is the part that comes even before setting the problem on Polygon or any other contest preparation system. Ensure that you have the following things:
None of this should have actual numbers in it (keep everything in terms of what is defined in the input, and don't think about the constraints for now). Don't even implement it in a programming language; treat language constructs and standard algorithms whose proofs of correctness are well-known as black boxes.
Your proof and algorithm should depend only on the statement and logic. Don’t implicitly add in any other assumptions or make wrong assumptions in the proof (if you write the proof, it will feel off anyway, so it’s quite hard to make a mistake like this if you follow this guideline).
Some notes:
Doing this step properly eliminates the possibility of having wrong statements or wrong intended solutions. It also avoids situations like this where the intended solution was “fast enough” on the tests but the problem in general didn’t seem to be solvable fast enough. Incidentally, that post is a good example of what can go wrong in a real-life setting.
Setting constraints is the part which requires getting your hands dirty. Let’s ignore that for now, and let’s say we have some constraints.
How do you ensure that your tests are correct? Think of it as a competitive programming problem again, though a very simple one. This is where a validator comes in. With a validator, you validate the tests based on what the problem states. The reasoning behind proving the correctness of the validator should go like this:
We show that the validator accepts a test if and only if the test is valid according to the problem statement. For the if-direction, [do something]. Now for the only-if-direction, [do something].
I recommend keeping the numbers in this proof generic enough, as is done in most math proofs. We will plug in the numbers later, when we discuss the constraints. Having this proof handy is quite important. Sometimes, when preparing a problem, you end up changing the constraints later on, but forget to update the validator. If you keep this proof as a comment in the validator itself, and inspect all the contest materials before committing changes, you will immediately catch inconsistencies in the statement and the validator.
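To make the idea of keeping the proof next to the validator concrete, here is a minimal Python sketch for a hypothetical problem ("an array of \(n\) distinct integers, \(1 \le n \le 2 \cdot 10^5\), \(1 \le a_i \le 10^9\)"). Real Polygon validators are usually written in C++ with testlib; the names and constraints below are ours, chosen only to illustrate the pattern:

```python
# Hypothetical problem: one line with n, then one line with n distinct
# integers a_1 ... a_n, where 1 <= n <= 2 * 10^5 and 1 <= a_i <= 10^9.
MAX_N, MAX_A = 2 * 10**5, 10**9

def validate(text):
    # Proof sketch kept with the code: we accept iff the input is exactly two
    # newline-terminated lines, the first an in-range n, the second exactly n
    # in-range, pairwise-distinct, single-space-separated integers.  If the
    # constraints in the statement change, this comment and the checks below
    # must change together.
    lines = text.split("\n")
    assert len(lines) == 3 and lines[2] == "", "expected exactly two lines"
    n = int(lines[0])
    assert 1 <= n <= MAX_N, "n out of range"
    a = [int(tok) for tok in lines[1].split(" ")]
    assert len(a) == n, "wrong number of elements"
    assert all(1 <= x <= MAX_A for x in a), "element out of range"
    assert len(set(a)) == n, "elements are not distinct"

validate("3\n5 1 9\n")  # a valid test passes silently
```

Note that the validator is deliberately strict about whitespace: anything weaker would accept tests that the statement's format does not describe.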
Also, validators are important when hacks are enabled. If your validator is too loose, you risk some AC solutions getting hacked. And if your validator is too tight, some valid hacks might be rejected, and the problem effectively has hidden conditions that a hacker might be able to guess, which makes the contest unfair. For example, this contest had such an issue.
For choosing constraints, you need to write some code to verify that the TL and ML are fine and so on. However, this is not something you can prove even in a handwavy sense (unless you know the assembly that your code generates and the time every instruction takes in being executed in that order, on the worst possible test), so we will keep it for later.
This corresponds to checking that the contestant’s solution indeed outputs something that is valid, according to the problem statement.
Again, as in the validator, using a formal proof, ensure that the output satisfies all the conditions for it to be accepted as a valid output by the statement, and don’t add any extra checks. This is important because there have been CF contests where the checker checked something much stronger about the structure of the solution. A very simple example is “print a graph that is connected”. This doesn’t mention that the graph can’t have multiple edges or loops, but many people will just write a checker that will add these conditions anyway. This is another reason why adding links to precise definitions in the problem is a good thing.
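Here is a minimal Python sketch (function and problem names are ours) of a checker for the "print a graph that is connected" example above, checking exactly what the statement asks and nothing more:

```python
def check_connected_graph(n, edges):
    """Checker sketch for a hypothetical task 'print any connected graph on n
    vertices': verify exactly what the statement asks (vertices in range,
    connectivity) and nothing more.  Multi-edges and self-loops are allowed,
    because the statement does not forbid them."""
    adj = {v: [] for v in range(1, n + 1)}
    for u, v in edges:
        assert 1 <= u <= n and 1 <= v <= n, "vertex out of range"
        adj[u].append(v)
        adj[v].append(u)
    # Depth-first search from vertex 1; connected iff everything is reached.
    seen, stack = {1}, [1]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    assert len(seen) == n, "graph is not connected"

check_connected_graph(3, [(1, 2), (2, 3), (2, 3)])  # multi-edge: accepted
```

A checker that additionally rejected the multi-edge here would be checking something strictly stronger than the statement, which is exactly the failure mode described above.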
Sometimes writing the checker is harder than solving the original problem itself, and there have been instances in certain CF contests where writing the checker was the second subtask of a problem! Keep this in mind while setting problems, since a problem is useless if you can't write a checker for it. Sometimes, it might make sense to ask for only a partial proof of solution from a contestant — like a certain aspect that you can compute only if you use some technique, or something that is weaker than asking for the output, but is a reasonable thing to ask for (this changes the problem itself, though, and hence it doesn't show up as an issue on the contestant-facing side of problem-setting).
As we mentioned earlier, almost everything about setting constraints and creating tests is something that we have to manage heuristically. Both of these things go hand-in-hand, which makes the whole process quite iterative. Some obvious guidelines that help in these cases are given in the next section.
At this point, there are the following issues we haven’t addressed.
The first four issues are solved only by testing and being consciously aware of the fact that they need to be fixed.
If problems are guessable, it makes sense to ask contestants to construct something that shows that their guess is correct. For example, it might be easy to guess that the size of the answer for a problem is \(n/2\). This is very guessable, and in this case, it is better to ask for the construction of the answer as well.
For avoiding problems that have appeared before, since there is currently no repository of all competitive programming (or math contest) problems that is searchable by content, the only way is to invite a lot of experienced testers with different backgrounds.
For avoiding unbalanced contests (both in terms of difficulty and topics), it is important to have a large variety of testers — people with various amounts of experience as well as various types of topics that they are strong at.
The last type of issue is something unique to online contests that are on a large scale — for example, CF contests. This is the main reason why pretests exist in the first place. To avoid long queues, usually there are at most 2-3 pretests for problems like A and B. To make strong pretests that are also minimal, there are two ways of choosing tests (which can be combined too):
This is one of the reasons why the first couple of problems are multi-test in nature.
If there’s anything I missed, please let me know, since it’s been quite some time since I last coordinated a contest. Feel free to discuss various aspects of problem-setting in this context, and I would request coordinators to also share the to-do lists they use when they start coordinating a contest, right from the very beginning of the review stage.
This post was originally written on Codeforces; relevant discussion can be found here.
Disclaimer: This is not an introduction to greedy algorithms. Rather, it is only a way to formalize and internalize certain patterns that crop up while solving problems that can be solved using a greedy algorithm.
Note for beginners: If you’re uncomfortable with proving the correctness of greedy algorithms, I would refer you to this tutorial that describes two common ways of proving correctness — “greedy stays ahead” and “exchange arguments”. For examples of such arguments, I would recommend trying to prove the correctness of standard greedy algorithms (such as choosing events to maximize number of non-overlapping events, Kruskal’s algorithm, binary representation of an integer) using these methods, and using your favourite search engine to look up more examples.
Have you ever wondered why greedy algorithms sometimes magically seem to work? Or find them unintuitive in a mathematical sense? This post is meant to be an introduction to a mathematical structure called a greedoid (developed in the 1980s, but with certain errors that were corrected as recently as 2021), that captures some intuition behind greedy algorithms and provides a fresh perspective on them.
The idea behind this post is to abstract out commonly occurring patterns into a single structure, which can be focused upon and studied extensively. This helps by making it easy to recognize such arguments in many situations, especially when you’re not explicitly looking for them. Note that this post is in no way a comprehensive collection of all results about greedoids; for a more comprehensive treatment, I would refer you to the references as well as your favourite search engine. If there is something unclear in this post, please let me know in the comments.
To give an idea of how wide a variety of concepts greedoids are connected to: they come up in BFS, Dijkstra's algorithm, Prim's algorithm, Kruskal's algorithm, the blossom algorithm, ear decompositions of graphs, posets, machine scheduling, convex hulls, linear algebra and so on, some of which we shall discuss in this post.
Note that we will mainly focus on intuition and results, and not focus on proofs of these properties, because they tend to become quite involved, and there is probably not enough space to discuss these otherwise. However, we will give sketches of proofs of properties that are quite important in developing an intuition for greedoids.
We will take an approach in an order that is different from most treatments of greedoids (which throw a definition at you), and try to build up to the structure that a greedoid provides us.
What does a usual greedy algorithm look like? Initially, we have made no decisions. Then we perform a sequence of steps. After each step, we have a string of decisions that we have taken (for instance, picking some element and adding it to a set). We also want a set of final choices (so we can't extend beyond those choices).
To make this concept precise, we define the following:
We can now rephrase the optimization problems that a large set of greedy algorithms try to solve in the following terms:
A (certain quite general type of) greedy algorithm looks something like this:
Of course, taking arbitrary \(w\) doesn’t make any sense, so we limit \(w\) to the following kinds of functions (we shall relax these conditions later, so that we can reason about a larger variety of problems):
To this end, we define greedoids as follows:
A greedoid is a simple hereditary language \(L\) on a ground set \(S\) that satisfies an “exchange” property of the following form:
This extra condition is crucial for us to be able to prove the optimality of greedy solutions using an exchange argument (the usual exchange argument that people apply when proving the correctness of greedy algorithms).
One notable consequence of this extra condition is that all basic words in \(L\) have the same length. This might seem like a major downgrade from the kind of generality we were going for, but it still handles a lot of cases and gives us nice properties to work with.
Similar to the rank function for matroids, we can define a rank function for greedoids, which is also subcardinal, monotone and locally semimodular.
It turns out that greedoids have the following equivalent definitions which will be useful later on (we omit the proofs, since they are quite easy):
There is also a set analogue for simple hereditary languages: \(\emptyset \in F\) and for each \(X \in F\), we have an element \(x \in X\) such that \(X \setminus \{x\} \in F\). The intuition remains the same. Note that again, we don't need this to hold for all \(x \in X\), but for at least one \(x \in X\).
But why do we care about greedoids at all? The answer lies in the following informally-stated two-part theorem:
Theorem 1:
For the first part, prove the following claims in order:
For the second part, define generalized bottleneck functions as follows: for every element \(s\) of \(S\), construct a non-increasing sequence of values \(f(s, i)\). Then \(w(x_1x_2\cdots x_i) = \min_{j} f(x_j, j)\) is a generalized bottleneck function. Show that \(w\) satisfies the required conditions, and try to construct a \(w\) that can help you show that the extra condition in the definition of a greedoid is satisfied.
Note that the definition of greedoid doesn’t depend on \(w\) in any way, so it can be studied as a combinatorial structure in its own right — this leads to quite non-trivial results at times. However, when we think of them in the greedy sense, we almost always have an associated \(w\).
In a lot of cases, \(w\) has a much simpler (and restrictive) structure, for instance, having a positive and additive weight function (which is defined on each element). In that case, the following algorithm works for matroids (special cases of greedoids): sort elements in descending order of their weight, and pick an element iff adding it to the set of current choices is still in the matroid.
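A minimal sketch of that matroid greedy algorithm, with the independence test supplied as an oracle (all names here are ours). Plugging in acyclicity of edge sets, the independence oracle of the graphic matroid, recovers Kruskal's algorithm for maximum-weight spanning forests:

```python
def matroid_greedy(elements, weight, independent):
    """Generic greedy on a matroid: scan elements in descending weight order,
    keeping an element iff the current choice set stays independent."""
    chosen = []
    for x in sorted(elements, key=weight, reverse=True):
        if independent(chosen + [x]):
            chosen.append(x)
    return chosen

def acyclic(edges):
    """Independence oracle of the graphic matroid: the edge set contains no
    cycle (checked with a tiny union-find)."""
    parent = {}
    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False  # u and v already connected: adding (u, v) closes a cycle
        parent[ru] = rv
    return True

# Maximum spanning tree = Kruskal's algorithm on the graphic matroid.
edges = {(1, 2): 4, (2, 3): 1, (1, 3): 3}
mst = matroid_greedy(edges, edges.get, acyclic)
print(sorted(mst))  # [(1, 2), (1, 3)]
```

Rebuilding the union-find for every candidate keeps the sketch short at the cost of efficiency; a real implementation would maintain the structure incrementally.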
Usually, \(w\) is not defined on a sequence of steps, but on the set of choices you have made. However, in the general case, the “natural” relaxation from the ordered version to the unordered version of greedoids fails (both in being equivalent and in giving the optimal answer). In fact, this error was present in the original publications (since 1981) and was corrected fairly recently in 2021. For a more detailed discussion (and the proofs of the following few theorems), please refer to \([1]\), which we shall closely follow.
We shall start out with some special cases:
Theorem 2: Consider a greedoid \(F\) (unordered) on a ground set \(S\). Suppose \(c\) is a utility function from \(S\) to \(\mathbb{R}\), so that the weight of a set of decisions is additive over its elements. Then the greedy algorithm gives an optimal solution iff the following is true:
The following theorem generalizes the sufficiency:
Theorem 3: Consider a greedoid \(F\) (unordered) on a ground set \(S\) and \(w : 2^S \to \mathbb{R}\). The greedy algorithm gives an optimal solution if (and not iff) the following is true:
This theorem can be used to show optimality of Prim’s algorithm on its corresponding greedoid.
This theorem can also be derived from theorem 6 that we shall mention later on.
Remember that we mentioned that the original claim in the 1981 paper about greedoids was false? It turns out that the claim is true about a certain type of greedoids, called local forest greedoids. Since it is not so relevant to our discussion, we’re going to spoiler it to avoid impairing the readability of what follows.
A local forest greedoid is a greedoid that satisfies the following three additional conditions:
One property of this structure is that for every element \(x \in S\), we can associate a path to it by taking the unique set in \(F\) that contains \(x\), none of whose subsets in \(F\) also contain \(x\). This uniqueness comes from the local forest property.
Now we are ready to state a couple of theorems:
Theorem 4: Let \(F\) be a local forest greedoid on the ground set \(S\), and \(f\) be a real function on the paths of this greedoid, which satisfies the following constraints:
Then for the function \(w\) defined as the sum of \(f\) over paths of all elements in the argument, greedy gives an optimal solution.
As a corollary, we have the following:
Theorem 5: For an additive non-positive utility function defined on the ground set of a local forest greedoid, the greedy algorithm gives an optimal solution.
As we had mentioned earlier, the constraints on \(w\) while motivating the definition of a greedoid (ordered) are quite stringent, and they might not hold in a lot of cases. Here, we work upon reducing the set of constraints (which will also have implications on theorem 3 earlier).
Theorem 6: Let \(L\) be a greedoid on a ground set \(S\), and \(w\) be an objective function that satisfies the following condition:
Then the greedy algorithm gives an optimal solution.
The proof is quite easy, and goes by way of contradiction. There is a special case of the above theorem (that needs some care to prove), which is applicable in a lot of cases. It turns out that it corresponds to the last part of the proof sketch of Theorem 1, so maybe it’s not such a bad idea to read it again.
The main point of adding examples at this point is to showcase the kinds of greedoids that exist in the wild and some of their properties, so that it becomes easy to recognize these structures. I have added the examples in spoiler tags, just to make navigation easier. Some of these examples will make more sense after reading the section on how to construct greedoids.
While going through these examples, I encourage you to find some examples of cost functions that correspond to greedy optimality, so that you can recognize them in the future. Discussing them in the comments can be a great idea too. For a slightly larger set of examples, please refer to Wikipedia and the references.
When appropriate, we will also point out greedy algorithms that are associated with these structures.
These are greedoids that satisfy the local union property (as introduced in the section on local forest greedoids). Equivalently, they are also defined as greedoids where for every \(A \subseteq B \subseteq C\) with \(A, B, C \in F\), if \(A \cup \{x\}, C \cup \{x\} \in F\) for some \(x \not \in C\), then \(B \cup \{x\} \in F\).
Some examples are:
A greedoid where the ground set is the edges of a directed graph, and the feasible sets are arborescences rooted at a fixed root \(r\) that are subgraphs of the original graph. This is also a local forest greedoid. Using the theorem on local forest greedoids, we can show that Dijkstra's algorithm and BFS are just special cases of the greedy algorithm on this greedoid. Similarly, the widest path problem (maximize the capacity of the minimum-capacity edge on a path) can also be solved using a modified Dijkstra, with a proof along similar lines.
A generalization of this is the hypergraph branching greedoid. For more, please refer to \([4]\).
Similar to the directed version but here the graph is undirected. The correctness of Prim’s algorithm is just an application of Theorem 2 on this greedoid.
Some generalizations of this are matroid branching greedoids, polymatroid greedoids and polymatroid branching greedoids, the last two of which are defined with an underlying polymatroid (a generalization of a matroid). For more, please refer to \([4]\).
This is related to the perfect elimination antimatroid, but is slightly different.
We do the following procedure on a chordal graph:
This sequence of vertices forms a greedoid.
A natural generalization is to the hypergraph elimination greedoid, and the directed branching greedoid is also a special case of the hypergraph elimination greedoid.
Matroids are greedoids that satisfy a stronger property than the interval property for interval greedoids, by removing the lower bounds. More precisely, a matroid is a greedoid that satisfies the following property:
Equivalently, they are greedoids where in the unordered version, for each \(X \in F\), and for all (not at least one) \(x \in X\) we have \(X \setminus \{x\} \in F\). In other words, they are downwards closed. In such a context, elements of \(F\) are also called independent sets.
Intuitively, they try to capture some concept of independence. In fact, matroids came up from linear algebra and graph theory, and greedoids were a generalization of them when people realized that matroids are too restrictive, and downwards-closedness is not really important in the context of greedy algorithms. The examples will make this notion clearer.
Note that for matroids, it is sufficient to define a basis (the set of maximal sets in \(F\)), and downwards closedness handles the rest for us.
Matroids have a much vaster theory compared to greedoids, due to being studied for quite a long time and being more popular in the research community. Since this is a post only on greedy algorithms, we will refrain from diving into matroid theory in too much detail. Interested readers can go through the references for more.
Some examples of matroids are as follows.
The greedoid where \(F\) is the power set of \(S\) is trivially a matroid. In other words, it is the matroid with a single maximal set — \(S\).
The matroid with the basis as all sets of some fixed size \(k\). This is a matroid that is formed by truncations of the free matroid. Truncations will be explained in the section on constructing more greedoids.
When the ground set is the edge set of an undirected graph, and \(F\) is the set of all forests of the graph, this matroid is called a graphic matroid.
Note that the correctness of Kruskal’s algorithm follows from this matroid.
When the ground set is one side of a bipartite graph, and \(F\) consists of the sets of endpoints (on that side) of all possible matchings, this matroid is called a transversal matroid. Note that uniform matroids are a special case of transversal matroids (just consider a complete bipartite graph with \(k\) vertices on the left and \(n\) vertices on the right, with the right side as the ground set).
In a graph (directed or undirected), if there are two special sets of vertices \(U, V\), then the set of all subsets \(W\) of \(U\) where there are exactly \(|W|\) vertex-disjoint paths from \(V\) to \(W\) defines a matroid on \(U\). The transversal matroid is a special case of a gammoid. (This kind of a matroid also has a relation with flows and Menger’s theorem). A strict gammoid is one where \(U\) is the vertex set of the graph, and it is the dual of a transversal matroid. Duals of matroids will be explained in the section on constructing more greedoids.
This is a matroid that appears in algebraic contexts when considering field extensions. Consider a field extension \(U \subseteq V\). We construct a matroid on \(V\) as the ground set, with the independent sets being the subsets of \(V\) that are algebraically independent over \(U\). That is, the elements of independent sets don’t satisfy a non-trivial polynomial equation with coefficients in \(U\).
This matroid comes from linear algebra. Consider any set of points in a vector space as the ground set. Then the sets of linearly independent points in this set form a matroid called the vector matroid.
Consider a matrix \(M\). Then with the ground set as the set of columns of \(M\), the independent sets are the sets of columns that are independent as sets of vectors. This is almost the same as a vector matroid, but here we consider columns with different indices as different, even if they are the same as vectors.
Antimatroids are greedoids that satisfy a stronger property than the interval property for interval greedoids, by removing the upper bounds. More precisely, an antimatroid is a greedoid that satisfies the following property:
Unlike matroids, this does not necessarily imply upward closure, but it does imply closure under unions (which is another way to define antimatroids).
Another definition of antimatroids that is in terms of languages (and gives some intuition about their structure) calls a simple hereditary language over a ground set an antimatroid iff the following two conditions hold:
Using this, we note that the basic words are all permutations of the ground set.
Another piece of intuition (derived from the above definition) that helps with antimatroids is the fact that when we try constructing sets in an antimatroid, we keep adding elements to a set, and once we can add an element, we can add it any time after that.
It also makes sense to talk about antimatroids in the context of convex geometries. To understand the intuition behind this, we take the example of a special case of what is called a shelling antimatroid. Let's consider a finite set of \(n\) points in the 2D plane. At each step, remove a point from the convex hull of the remaining points. The sets of removed points that arise during this process clearly form an antimatroid by the above intuition. In fact, if instead of 2D, we embed the ground set in a space with a high enough dimension, we can get a shelling antimatroid isomorphic to any given antimatroid!
What is so special about convexity here? Turns out we can use some sort of “anti-exchange” rule to define convex geometries in the following manner. It will roughly correspond to this fact: if a point \(z\) is in the convex hull of a set \(X\) and a point \(y\), then \(y\) is outside the convex hull of \(X\) and \(z\).
Let’s consider a set system closed under intersection, which has both the ground set and empty set in it. Any subset of the ground set (doesn’t matter whether it is in the system or not) has a smallest superset in such a set system (and it is the smallest set from the system that contains the subset). Let’s call this mapping from the subset to its corresponding smallest superset in the system \(\tau\). Then we have the following properties:
Now for such a set system, it is called a convex geometry if the following “anti-exchange” property holds:
Intuitively, consider \(\tau\) as a convex hull operator. Then the anti-exchange property simply means that if \(z\) is in the convex hull of \(X\) and \(y\), then \(y\) is outside the convex hull of \(X\) and \(z\).
Now it turns out that the complements of all sets in an antimatroid form a convex geometry! This is not too hard to prove.
We shall now give some examples of antimatroids.
This is just the antimatroid that is formed when we take all prefixes of a word.
The downwards closed (under the partial order) subsets of elements in a poset form an antimatroid. In simpler words, if you have a DAG and you pick an arbitrary subset of vertices, construct a set consisting of all vertices from which at least one vertex in the subset is reachable. Now these kinds of constructed sets form an antimatroid. Note that the chain antimatroid is a special case of a poset antimatroid, when the order is total (or equivalently, when the DAG is a line). Intuitively, it is related to the shelling antimatroid when you try to “shell” the poset from the bottom.
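The poset antimatroid can be checked by brute force on a tiny example. This Python sketch (names ours) enumerates the downward-closed vertex sets of a small DAG and verifies the two antimatroid properties discussed above, accessibility and closure under unions:

```python
from itertools import combinations

def lower_sets(n, edges):
    """All downward-closed vertex sets of a DAG on vertices 0..n-1, where an
    edge (u, v) means u lies below v (u must be in any feasible set
    containing v)."""
    preds = {v: {u for u, w in edges if w == v} for v in range(n)}
    sets = []
    for r in range(n + 1):
        for s in combinations(range(n), r):
            s = frozenset(s)
            if all(preds[v] <= s for v in s):
                sets.append(s)
    return sets

# Poset with 0 below both 1 and 2: feasible sets are
# {}, {0}, {0,1}, {0,2}, {0,1,2}.
F = lower_sets(3, [(0, 1), (0, 2)])
# Antimatroid axioms, checked exhaustively:
assert all((a | b) in F for a in F for b in F)          # closed under union
assert all(any((s - {x}) in F for x in s) for s in F if s)  # accessible
```

The same brute-force check can be reused to sanity-test any small set system you suspect is an antimatroid.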
The machine scheduling problem is related to this greedoid. Indeed, let \(D = (V, A)\) be a DAG whose vertices represent jobs to be scheduled on a single machine (with no interruptions) and edges of \(D\) represent precedence constraints to be respected by the schedule. Furthermore, a processing time \(a(v) \in \mathbb{N}\) is also given for every job \(v \in V\). Finally, a monotone non-decreasing cost function \(c_v : \{0, \dots, N\} \to \mathbb{R}\) is also given for every job \(v \in V\), where \(N = \sum_{v \in V} a(v)\) such that \(c_v(t)\) represents the cost incurred by job \(v\) if it is completed at time \(t\). The problem is to find a schedule (that is, a topological ordering of the jobs with respect to \(D\)) such that the maximum of the costs incurred is minimized. Lawler (1973) gave a simple greedy algorithm for this problem: it builds up the schedule in a reverse order always choosing out of the currently possible jobs one with the lowest cost at the current completion time.
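Lawler's rule described above can be sketched as follows. The instance, job names, and the lateness-style cost function are our own toy example; this is an illustration of the rule, not a tuned implementation:

```python
def lawler_schedule(jobs, succ, cost):
    """Lawler's greedy rule: build the schedule from the back.  At each step,
    among jobs all of whose successors are already scheduled, place last the
    job whose cost at the current total completion time is smallest.
    jobs: {job: processing_time}; succ: {job: set of successors};
    cost(v, t): cost incurred by job v if it completes at time t."""
    remaining = set(jobs)
    t = sum(jobs.values())          # N = total processing time
    order = []
    while remaining:
        # jobs that may legally finish last among the remaining ones
        available = [v for v in remaining if not (succ[v] & remaining)]
        v = min(available, key=lambda v: cost(v, t))
        order.append(v)
        remaining.remove(v)
        t -= jobs[v]
    return order[::-1]              # reverse: we built it back to front

# Toy instance (hypothetical): precedence a -> b, cost = lateness past a deadline.
jobs = {"a": 2, "b": 1, "c": 3}
succ = {"a": {"b"}, "b": set(), "c": set()}
deadline = {"a": 2, "b": 6, "c": 4}
sched = lawler_schedule(jobs, succ, lambda v, t: max(0, t - deadline[v]))
print(sched)  # ['a', 'c', 'b']
```

Note that the schedule respects the precedence constraint (a before b) while minimizing the maximum cost incurred.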
Note that in the poset antimatroid, we shelled a poset from the bottom. When we do so from both ends (that is, sets are the union of a downwards closed and an upwards closed set), it also forms an antimatroid (called the double poset shelling antimatroid).
A shelling sequence of a finite set \(U\) of points in the Euclidean plane or a higher-dimensional Euclidean space is formed by repeatedly removing vertices of the convex hull. The feasible sets of the antimatroid formed by these sequences are the intersections of \(U\) with the complement of a convex set. Every antimatroid is isomorphic to a shelling antimatroid of points in a sufficiently high-dimensional space, as mentioned earlier.
A special case is the tree-shelling antimatroid, where we form the feasible sets by pruning leaves one by one. This is also a perfect elimination antimatroid.
Another special case is the edge tree-shelling antimatroid, where we form the feasible sets by pruning terminal edges (instead of leaves) one by one. Again, this is also a perfect elimination antimatroid.
Also see \([4]\) for more kinds of shelling antimatroids.
This arises when we consider chordal graphs and their perfect elimination orderings. Prefixes of all possible perfect elimination orderings form an antimatroid.
An antimatroid where the ground set is the union of vertex and edge sets of an undirected graph, and feasible sets are sets \(A\) that satisfy the following property:
Let \(G\) be an undirected graph with vertex set \(V\) and edge set \(E\), and let \(x\) be a new vertex not in \(V\). We can define an antimatroid on ground set \(V \cup \{x\}\) with the feasible sets as the union of the power set of \(V\) and the set of all sets \(X \cup \{x\}\) such that all edges have at least one endpoint in \(X\).
Root a directed graph at a vertex \(r\). Then with the ground set as the set of vertices \(\ne r\), we can define an antimatroid consisting of the sets of vertices that can be reached while searching along the graph (i.e., \(\emptyset\) is in the antimatroid, and if \(X\) is in the antimatroid, then \(X \cup \{y\}\) is in the antimatroid for \(y \not\in X\) iff there is an edge \((x, y)\) with \(x \in X \cup \{r\}\)).
The same as the vertex search antimatroid, but rather than searching for vertices, we search along edges. That is, the ground set becomes the set of all edges, and we can add an edge \((v, w)\) to a feasible set iff some edge \((u, v)\) is already in the feasible set (or \(v = r\)).
The definitions are similar to the previous two, only the underlying graph becomes undirected. Note that undirected line search can also be generalized to matroids instead of graphs.
In this variant, we assign a capacity \(c(v)\) to every vertex in a rooted directed graph, denoting the maximum further length we can cover after reaching this vertex (initially, the maximum length is infinite, and along a path we take a kind of prefix minimum of the capacities; more precisely, the max length that can still be travelled becomes \(\min(\text{previous max} - 1, \text{current capacity})\)). Then the sets that are formed as follows form an antimatroid:
Refer to \([4]\).
This antimatroid stems from a combinatorially important game (in fact, it has so many deep implications in combinatorics that it is a very active research topic). A brief description of a chip firing game is as follows:
Note that \(v\) firing \(i\) times is equivalent to having fired \(i - 1\) times and having accumulated at least \(i \cdot \deg(v)\) chips, and thus \(v\) is always available for firing after this condition is satisfied (until it is fired). So the pairs \((v, i)\) form an antimatroid. A consequence of the antimatroid property of these systems is that, for a given initial state, the number of times each vertex fires and the eventual stable state of the system do not depend on the firing order.
This antimatroid is special, in the sense that all antimatroids can be represented in this form. For why this is true, refer to \([4]\).
Let \(G\) be a bipartite graph where some edges have been marked special. A vertex in the left partition is called extreme iff it is not adjacent to any special edge. Now consider the following procedure:
Sequences of this kind form an antimatroid on the vertices in the left partition.
Now that we have a few special types of greedoids out of the way, we shall look at some more greedoids.
This is motivated by the chordal graph perfect elimination antimatroid. An edge \((u, v)\) in a bipartite graph \(G\) is called bisimplicial if the neighbourhoods of \(u\) and \(v\) form a complete bipartite subgraph of \(G\). The sequences produced by repeatedly removing bisimplicial edges (along with their endpoints) form a greedoid.
These greedoids are formed when we perform certain kinds of operations on a graph/monoid. Please refer to \([4]\) for a detailed description.
This is slightly different from the transversal matroid, but is nevertheless related to it. Consider an ordering of the vertices on the right side of a bipartite graph. To construct the feasible sets of the greedoid, we pick a prefix of this order, and take all sets of vertices on the left that can be perfectly matched into this prefix.
This is similar to the bipartite matching greedoid, but we use gammoids instead of transversal matroids here.
Consider what happens when we perform column-wise Gaussian elimination on a matrix of full row rank. For the \(i\)-th row, we pick an unpicked column with a nonzero entry in that row, and make every other entry of that row \(0\) by subtracting an appropriate multiple of the picked column from each of the other columns.
The set of chosen columns forms a greedoid.
A specialization of this to graphs is called the graphical Gaussian elimination greedoid.
A generalization of this and the previous two examples comes from strong matroid systems, which can be read about in \([4]\).
These are greedoids that arise from the following kinds of operations:
Consider the ear decomposition \((e_0, P_1, \dots, P_k)\) of a 2-edge-connected undirected graph into a specified edge \(e_0\) and paths \(P_i\). This gives rise to a greedoid where the feasible sets are subsets \(X\) of edges not containing \(e_0\) such that
The basic words of this greedoid correspond to valid ear decompositions of the graph.
For the directed case, we need the graph to be strongly connected. The feasible sets are constructed as follows:
This is related to Edmonds’ blossom algorithm for maximum matching on general graphs.
Let \(G = (V, E)\) be a graph and \(M\) a matching. We define a greedoid on \(E \setminus M\) so that a feasible sequence of edges corresponds to an edge labeling in the matching algorithm.
Edmonds matching algorithm has the following setup:
A graph is matching-critical if deleting any of its nodes results in a graph that has a perfect matching. Such a graph has an odd number of nodes. A blossom forest \(B\) (with respect to \(M\)) is a subgraph of \(G\) that has a set \(A\) of nodes (inner nodes) with the following properties.
The connected components of \(G \setminus A\) are called the pseudonodes or blossoms of the blossom forest, and if we contract every pseudonode we get a forest. Every connected component of \(B\) contains exactly one node that is not covered by \(M\).
The Edmonds algorithm consists of building up a blossom forest one edge at a time (not counting the edges in \(M\)). If it finds an edge connecting two pseudonodes in different components of the blossom forest, then it is easy to construct a larger matching, and the algorithm starts again. If it finds an edge connecting a pseudonode to a node \(v\) outside the blossom forest, then it adds this edge and the edge of \(M\) incident with \(v\) to the blossom forest. If it finds an edge that connects two pseudonodes in the same connected component, or two nodes in the same pseudonode, then it adds this edge to the blossom forest. Finally, if none of these steps can be carried out, the algorithm stops and concludes that the matching \(M\) is maximum.

If \(M\) is a maximum matching, the edge sets of the blossom forests with respect to \(M\), restricted to \(E(G) \setminus M\), form a greedoid. To prove this, we use the fact that every maximal blossom forest has the same set \(A\) of inner nodes and the same pseudonodes. Hence they all have the same number of edges, i.e., all bases have the same cardinality. Since this also holds for all subgraphs containing \(M\), by properties of polymatroid greedoids, we have a greedoid. A run of the last phase of the Edmonds algorithm corresponds to a basic word of this greedoid.
Note that some constructions from polymatroids/matroids etc. were already mentioned in the section on examples, so we will focus on the ones that were not.
When we take a matroid and remove all independent sets of size greater than some threshold (this operation is called truncation), we again get a matroid.
When we take the dual of a matroid (take as the new bases the complements of the original bases, and close downwards), the resulting structure is also a matroid.
We can restrict a matroid by throwing away all independent sets that contain any element outside a fixed subset \(S\) of the ground set. This leads to another matroid.
Contracting \(S\) in a matroid \(M\) is equivalent to taking the dual matroid of \(M\), deleting \(S\) (or in the restriction sense, restricting it to the complement of \(S\)), then taking the dual again.
The reason these two operations are important is that by applying a sequence of restrictions and contractions, we obtain a matroid minor, a concept analogous to a graph minor (deletion corresponds to edge deletion and contraction corresponds to edge contraction).
Taking the direct sum and the union of two matroids (possibly on different ground sets) also leads to matroids. In both operations, the new ground set is the union of the two ground sets, and the independent sets are the sets of the form \(I_1 \cup I_2\), where \((I_1, I_2)\) ranges over the Cartesian product of the two families of independent sets (for the direct sum, the ground sets are taken to be disjoint).
For antimatroids on the same ground set, we construct feasible sets by taking the union of every possible pair of feasible sets. This is another antimatroid.
Refer to the construction in \([5]\). Polymatroid greedoids naturally arise from these operations, though the class of generated greedoids is more general than just these (such greedoids are called trimmed greedoids).
Removing all feasible sets of size \(> k\) from a greedoid also forms a greedoid.
Defined identically to matroid restriction.
For a feasible set \(X\), the contraction of the greedoid is constructed with ground set \(S \setminus X\) and feasible sets being those sets in \(2^{S \setminus X}\) such that adding elements of \(X\) to them gives feasible sets in the original greedoid. This is a greedoid.
Greedoid minors are defined similarly to matroid minors.
This is a collection of interesting facts about greedoids that don’t quite fit into the sections above, but are important enough not to be ignored. Of course, every important construction or idea comes with a combinatorial explosion of facts that can be derived from it, so what we discussed above is far from a comprehensive introduction. It’ll hopefully evolve over time, so I’m open to suggestions regarding this section.
Lastly, since this is a huge post, some mistakes might have crept in. If you find any, please let me know!
This post was originally written on Codeforces; relevant discussion can be found here.
Recently, I retired from competitive programming completely, and I found myself wondering what I’ll do with all the competitive-programming-specific knowledge I have gained over the years. For quite some time, I’ve been thinking of giving back to the community by writing about some obscure topics that not a lot of people know about but that are super interesting. However, every time, I ended up with ideas that were either too obscure to feasibly appear in a competitive programming problem, too technical to ever be seen in a contest, or already well covered by existing posts/articles.
So here’s my question: are there topics/ideas that you find interesting and beautiful, but that aren’t appreciated enough to have their own posts? If I am familiar with those ideas, I can consider writing something on them (and irrespective of that, I would encourage you to write a post on such ideas if you’re familiar with them). If I don’t know about them, it’ll be a good way for me and everyone else visiting this post to learn about them (and if I like a topic, who knows, I might end up writing something on it too). The ideas don’t necessarily have to be related to algorithms and data structures either; some meta stuff like this nice post works too.
For an idea about the kinds of posts I’m talking about, you could refer to my previous educational posts. I was mainly inspired by adamant’s and errorgorn’s posts, since they tend to be really interesting (I would recommend checking them out!).
This post was originally written on Codeforces; relevant discussion can be found here.
This post was initially meant to request updates to the C compiler on Codeforces (given the large number of posts complaining about “mysterious” issues in their submissions in C). While writing it, I realized that the chances of the said updates would be higher if I mentioned a few reasons why people would like to code in C instead of C++ (some of the reasons are not completely serious, as should be evident from the context).
Before people come after me in the comments saying that you should use Rust for some of the reasons I will be listing below, please keep in mind that I agree that Rust is a good language with reasonable design choices, but I still prefer C over Rust, at least for now.
scanf/printf: … compared to the older benchmarks in the post. However, this is not the case for C. My hypothesis is that this issue wasn’t patched for the C compiler. There is, however, a quick fix for this issue: add the line #define __USE_MINGW_ANSI_STDIO 0 to the top of your submission, and it will revert to the faster implementation (at least as of now). Thanks to ffao for telling me about this hack.

timespec: when using the timespec struct to get high-precision time, there is a compilation error (and it seems it is because of limited C11 support in GCC 5.1). Update: a similar issue arises (timespec_get not found) when I submit with C++20 (64) as well, so I think this is an issue with the C library rather than the compiler.

A possible fix to the above issues that works most of the time is to submit your code in C++20 (64 bit) instead of C11. You might face problems due to different behavior of specific keywords or ODR violations due to naming issues, but it should be obvious how to fix those errors manually.
Coming to the original point of the post, I have a couple of update requests (if any other updates can make C easier to use, feel free to discuss them in the comments). Tagging MikeMirzayanov so these updates can be considered in the immediate future.
The timespec_get issue (and other C11/C17 support issues): I am not exactly sure how to go about fixing this, but since this seems to be a C library issue, it could be a good idea to try installing a C library that has this functionality (and is C11-compliant). The issue probably boils down to Windows not having libraries that support everything in the C standard, since C11/C17 works perfectly on my Linux system.

Now I’ll discuss why C is relevant for competitive programming.
Of course, while choosing programming languages for competitive programming, C++ is usually recommended to beginners (over C and other languages) because of reasons that include the following:
(C does have qsort and so on, with some overhead, though it is generally better to write your own sort to avoid the overhead of using function pointers completely.)

Some of its less-noticed advantages over C in the context of competitive programming (apart from the obvious benefits of having a good set of data structures and algorithms) are:
Seeding rand with time(NULL) is pretty bad, so in C, the way to do this properly is very platform-specific until C11 (and people are not aware of how significant a revision C11 was, compared to C99 and other revisions before it). Also, rand is terrible in general, so using a C port of a C++ STL random number generator is probably a better bet. For reference, this is a good random number generator.

constexpr (and template) support: if you enjoy constant-factor optimization, you would probably miss this. In certain fast IO implementations, for instance, a constexpr array is built at compile time to use chunks of bytes at a time rather than a single byte. Templates are important because writing your own library comes in handy in contests. However, using _Generic, it is possible to write generic code to a certain extent as well.

Despite these disadvantages, it is possible (and might also be helpful in training) to use C for contests. C11 (which should be abandoned in favor of C17, since C17 fixes the C11 defect reports) has a standard library that is seldom talked about and has some of the following features which make life simpler:
tgmath.h: type-generic math for floating point numbers and complex numbers.

stdint.h and inttypes.h: types like int64_t are also available in C, contrary to popular belief.

If you are using GCC and glibc, there are more surprises in store for you that make C all the more usable:
qsort: glibc’s implementation uses merge sort if sufficient space is available (which is pretty much always the case in competitive programming), so no worrying about quicksort being hackable!

Now that I have explained why C isn’t as bad to code in as you might have thought earlier, you might wonder what the point of using C is at all. After all, isn’t C++ meant to be a supposedly more “user-friendly” language than C? (In fact, it is not, with all its language complexities; for instance, I haven’t heard of a popular language that has as many “value categories” as C++ and is as easy to shoot yourself in the foot with, with the sole exception of JS, but that is a topic for another day.)
Straight to the point, competitive programming in C has the following benefits:
If you find yourself reaching for a map<int, pair<set<int, greater<int>>, FenwickTree<double>>> for a problem whose editorial uses a single Fenwick tree or sorting + two pointers, you might want to consider using a language which is not this powerful (of course, using these complicated data structures can also lead to unnecessary TLEs and overcomplicate your implementation). C is the best example; in terms of algorithms bundled in the standard library, there are only two: sorting and binary search, which are well known to be the only “hard-to-write-for-beginners” algorithms that you would need before you hit GM. This is, of course, a bit exaggerated, but if you want to hit GM, you would probably just want to get better at implementing easier problems first, after which it is only natural to try to write your own library. Writing a treap/hash table for simple data types is probably all you would ever need before that.

Coding in C is probably not a big deal for you at all if you are rainboy, anyway.
This post was originally written on Codeforces; relevant discussion can be found here.
Since someone recently asked me about my competitive programming setup, and I like tinkering with my setup to make it as useful and minimal as possible, I thought I should share my setup that I’ve used for the past few years and a modified version that I’ve used at onsite ICPC contests. I’ve also talked to a few people who went to IOI and had a similar setup, and I’m fairly confident that at least some people will be able to successfully use this without having to worry too much about the details, like I did. This is definitely NOT the only way to set up a basic environment, but it was the way that worked for me for quite a long time.
This post is also meant as a companion post to this post on using the command line.
Before we start with the tools, it’s worth mentioning this resource as another useful reference for installing vim and g++.
The choice of tools here is motivated by two things: availability and minimalism.
About availability: most onsite contests have a Linux system (which is almost always Ubuntu), and the standard procedure for installing the most common compiler used for competitive programming (g++) on Ubuntu pulls in the meta-package build-essential. This package also installs make. Most onsite contests provide gdb too. As far as vim is concerned, it is provided at almost all onsite contests as well (and if it is not present, you can use vi as a good enough substitute).
Needless to say, this setup assumes that you have a Linux machine or Linux VM. If you’re on Windows, using WSL should suffice. Most of these things should also work on macOS (speaking from my limited experience of testing an ICPC contest while stuck with a machine running macOS).
The fact that they’re minimalist also correlates with the fact that
they’re fast and easy to set up (sometimes they don’t need anything to
be set up either) and won’t take a ton of your time looking at
complicated menus and waiting for startup. Note that using vim
might
be a bit too much if you haven’t used it earlier and you’re not willing
to devote a few days to getting comfortable with it, so it’s a matter of
personal preference to use some other text editor like Sublime or VS
Code. However, I do not know of any alternative to make
(other than
writing a shell script, which is a bit inconvenient unless you’re a
sh/bash expert) and gdb
that are as widely available. I’d like to
re-emphasize the fact that this is just my setup; your mileage might
vary.
Now about the tools and how to install them:
vim is a minimalist and customizable text editor that allows you to do a lot of stuff in a no-nonsense way. Its keybindings can be a bit counterintuitive at first, but once you get used to them, they make writing and editing code very convenient. Installation instructions can be found at the link I shared earlier.

make is a minimal build system that reads something called a Makefile and figures out how to do your compilation for you. Installation instructions can be found at the link I shared earlier, under the heading “An example of a workflow in Vim”.

gdb is a command-line debugger that works for a lot of languages and is pretty convenient and intuitive to use (and also quite scalable). For installation instructions, a search on your favourite search engine should be sufficient.

This is a more concise setup that I’ve used for onsite contests, and people might want to build upon it further.
I prefer using the following template, but for ICPC, I added some things from the KACTL template to keep things uniform in the codebook.
#include <bits/stdc++.h>
using namespace std;
using ll = int64_t;
int main() {
cin.tie(nullptr)->sync_with_stdio(false);
}
For setting up vim, you just need to create a file ~/.vimrc
which
stores all your configurations.
set nocp
filetype plugin indent on
syntax enable
autocmd BufNewFile,BufRead * setlocal formatoptions-=cro
set hi=500 wmnu ru nu rnu bs=eol,start,indent whichwrap+=<,>,h,l hls is lz noeb t_vb= tm=500 enc=utf8 ffs=unix nowb noswf et sta lbr tw=200 ai si wrap vi= ls=2 cb=unnamedplus,unnamed sw=4 ts=4
colorscheme slate
let mapleader = ","
inoremap {<CR> {<CR><BS>}<Esc>O
inoremap [<CR> [<CR><BS>]<Esc>O
inoremap (<CR> (<CR><BS>)<Esc>O
map 0 ^
packadd termdebug
let g:termdebug_popup=0
let g:termdebug_wide=163
let &t_ut=''
set nocp
filetype plugin indent on
syntax enable
autocmd BufNewFile,BufRead * setlocal formatoptions-=cro
source $VIMRUNTIME/delmenu.vim
source $VIMRUNTIME/menu.vim
set hi=500 wmnu ru nu rnu bs=eol,start,indent whichwrap+=<,>,h,l hls is lz noeb t_vb= tm=500 enc=utf8 ffs=unix nowb noswf et sta lbr tw=200 ai si wrap vi= ls=2 cb=unnamedplus,unnamed mouse=a mousem=popup sw=4 ts=4
colorscheme slate
let mapleader = ","
inoremap {<CR> {<CR><BS>}<Esc>O
inoremap [<CR> [<CR><BS>]<Esc>O
inoremap (<CR> (<CR><BS>)<Esc>O
map 0 ^
" :Termdebug <executable_name>, :help terminal-debug, Ctrl+W twice to switch windows
packadd termdebug
let g:termdebug_popup=0
let g:termdebug_wide=163
let &t_ut=''
let &t_SI="\<Esc>[6 q"
let &t_SR="\<Esc>[4 q"
let &t_EI="\<Esc>[2 q"
set nocp
: required for some technical issues, most prominently when
you load a vimrc with vim -u <vimrc_path>
or to avoid bugs when you
just use vi
instead. Basically, always keep it in your vimrc.
filetype plugin indent on
: turns on filetype detection, plugins and
indentation. If you’re more curious about what this does, read
this.
syntax enable
: turns on syntax highlighting.
autocmd BufNewFile,BufRead * setlocal formatoptions-=cro
: turns off
some comment continuation behavior (when a comment continues to a new
line or is broken across lines) that I don’t like. You can consider
removing this line if you’re fine with it.
source $VIMRUNTIME/delmenu.vim
and source $VIMRUNTIME/menu.vim
:
adds support for menus, if you still want to use some kinds of menus.
For menus to get triggered with a right click, you can set
mouse=a mousem=popup
like I did in the beginner vimrc.
set hi=500 wmnu ru nu rnu bs=eol,start,indent whichwrap+=<,>,h,l hls is lz noeb t_vb= tm=500 enc=utf8 ffs=unix nowb noswf et sta lbr tw=200 ai si wrap vi= ls=2 cb=unnamedplus,unnamed mouse=a mousem=popup sw=4 ts=4: does the following things in order: sets history to 500 lines, makes the numbering more useful, sets some wrapping options, sets some searching and match-highlighting options, turns off audio and visual notifications for errors, manages encoding and filesystem-related issues, turns off backups and swap files, handles indenting and converts all tabs to spaces, and manages clipboards.
colorscheme slate
: sets the colorscheme to slate. There are a few
inbuilt colorschemes, and I just happened to find it to be convenient
for long hours of coding.
let mapleader = ","
: sets the leader key to a comma. Not really
important if you don’t use it.
inoremap {<CR> {<CR><BS>}<Esc>O
etc: when you input a curly brace
and press return (fast enough), it autocompletes the ending curly
brace. Similarly for square brackets and parentheses.
map 0 ^
: when you input 0
in normal mode, it’ll go to the first
non-whitespace character.
packadd termdebug
, let g:termdebug_popup=0
and
let g:termdebug_wide=163
: adds a package to run gdb from inside
vim and sets up a few window options. The documentation for it can
be found using the thing mentioned in the comment before it.
let &t_ut='': fixes some background-color related issues that I find annoying, but it’s barely noticeable.

let &t_SI="\<Esc>[6 q", let &t_SR="\<Esc>[4 q" and let &t_EI="\<Esc>[2 q": sets the cursor shape when you’re in insert mode, replace mode and normal mode respectively.
Note that you can run make
without any setup at all. For example, if
the path (relative or absolute) to your cpp file is
<some_path>/code.cpp
, you can run make <some_path>/code
and it will
generate an executable at the path <some_path>/code
, using the default
compilation flags. make
also prints the command that it used to
compile the file.
One feature of make
is that once you compile a file, unless there are
changes in the source file, it won’t compile it again. To counter that,
either use make -B
instead of make
, or use the Makefile below and
run make clean
which removes all executables from the current directory up to max depth 1.
For using more compilation flags with make
, create a file named
Makefile
and store the following Makefile in it (you can modify this
too).
D ?= 0
ifeq ($(D), 1)
CXXFLAGS=-std=c++17 -g -Wall -Wextra -Wpedantic -Wshadow -Wformat=2 -Wfloat-equal -Wconversion -Wlogical-op -Wshift-overflow=2 -Wduplicated-cond -Wcast-qual -Wcast-align -Wno-variadic-macros -D_GLIBCXX_DEBUG -D_GLIBCXX_DEBUG_PEDANTIC -fsanitize=address -fsanitize=undefined -fno-sanitize-recover -fstack-protector -fsanitize-address-use-after-scope
else
CXXFLAGS=-O2 -std=c++17
endif
clean:
find . -maxdepth 1 -type f -executable -delete
CXXFLAGS is a variable that refers to the flags used to compile a C++ file.

D is a variable that we defined, and it is equal to 0 by default. If instead of running make a, we run make a D=1, it will use the flags in the first branch (the debug flags). These flags make your program slower, but they make it easier to debug your program using gdb, and they add some assertion checks that can sometimes tell you the issue when the program runs, without having to use gdb.

GDB doesn’t really need setup. With the above flags, the most I have ever needed was to run the executable with debug flags in gdb, and look at the backtrace to pinpoint the line with the error. To do that, I do the following (you can also do this in vim using the termdebug package that is added in the vimrc above and is provided by vim out of the box):
Run gdb <executable_name>. If you have an input file, say in.txt, type run < in.txt and hit enter. If you don’t have an input file, you can just give it input like in a terminal. Output redirection works as well. Once the program stops on an error, type bt and hit enter.

My setup for online contests is a bit more involved, and I’ll keep this section a bit more concise, since you can just set this up once and never have to worry about it again, in contrast with onsite contests, where every character counts.
I use a bigger template for online contests, which is as follows:
#ifndef LOCAL
#pragma GCC optimize("O3,unroll-loops")
#pragma GCC target("avx2,bmi,bmi2,popcnt,lzcnt")
// #pragma GCC target("sse4.2,bmi,bmi2,popcnt,lzcnt")
#endif
#include "bits/stdc++.h"
#ifdef DEBUG
#include "includes/debug/debug.hpp"
#else
#define debug(...) 0
#endif
using ll = int64_t;
using ull = uint64_t;
using ld = long double;
using namespace std;
int main() {
cin.tie(nullptr)->sync_with_stdio(false);
// cout << setprecision(20) << fixed;
int _tests = 1;
// cin >> _tests;
for (int _test = 1; _test <= _tests; ++_test) {
// cout << "Case #" << _test << ": ";
}
}
Note that there’s an include for debugging, and it’s pretty easy to use (I’ve mentioned it here).
Note that the following vimrc has some non-competitive-programming stuff
as well, since I use vim as my primary editor. For more information on
each plugin (each such line starts with Plug
or Plugin
), you can
search for it on GitHub. For instance, for the line
Plug 'sbdchd/neoformat'
, the relevant GitHub repo is
this. I use the snippet manager
UltiSnips and clang-format and clangd for code formatting and static
analysis purposes. I used to use some of the commented plugins earlier,
so those can be useful too.
filetype plugin indent on " required
syntax enable
" hack to enable python3 rather than python2
if has('python3')
endif
" Plugins
call plug#begin('~/.vim/plugged')
Plug 'sbdchd/neoformat'
let g:opambin = substitute(system('opam config var bin'),'\n$','','')
let g:neoformat_ocaml_ocamlformat = {
\ 'exe': g:opambin . '/ocamlformat',
\ 'no_append': 1,
\ 'stdin': 1,
\ 'args': ['--enable-outside-detected-project', '--name', '"%:p"', '-']
\ }
let g:neoformat_enabled_ocaml = ['ocamlformat']
let g:neoformat_python_autopep8 = {
\ 'exe': 'autopep8',
\ 'args': ['-s 4', '-E'],
\ 'replace': 1,
\ 'stdin': 1,
\ 'env': ["DEBUG=1"],
\ 'valid_exit_codes': [0, 23],
\ 'no_append': 1,
\ }
let g:neoformat_enabled_python = ['autopep8']
let g:neoformat_try_formatprg = 1
" Enable alignment
let g:neoformat_basic_format_align = 1
" Enable tab to spaces conversion
let g:neoformat_basic_format_retab = 1
" Enable trimmming of trailing whitespace
let g:neoformat_basic_format_trim = 1
Plug 'lervag/vimtex'
let g:tex_flavor='latex'
let g:vimtex_view_method='zathura'
let g:vimtex_quickfix_mode=0
set conceallevel=2
let g:tex_conceal='abdmg'
Plug 'sirver/ultisnips'
let g:UltiSnipsExpandTrigger = '<tab>'
let g:UltiSnipsJumpForwardTrigger = '<tab>'
let g:UltiSnipsJumpBackwardTrigger = '<s-tab>'
Plug 'neoclide/coc.nvim', {'branch': 'release'}
" also see https://github.com/clangd/coc-clangd for clangd support
" for haskell: https://www.reddit.com/r/vim/comments/k3ar3i/cocvim_haskelllanguageserver_starter_tutorial_2020/
try
nmap <silent> [c :call CocAction('diagnosticNext')<cr>
nmap <silent> ]c :call CocAction('diagnosticPrevious')<cr>
endtry
let g:loaded_sql_completion = 0
let g:omni_sql_no_default_maps = 1
Plug 'junegunn/fzf', { 'dir': '~/.fzf', 'do': './install --all' }
" Plug 'jakobkogler/Algorithm-DataStructures'
Plug 'xuhdev/vim-latex-live-preview', { 'for': 'tex' }
let g:livepreview_previewer = 'zathura'
Plug 'godlygeek/csapprox'
let g:CSApprox_hook_post = [
\ 'highlight Normal ctermbg=NONE',
\ 'highlight LineNr ctermbg=NONE',
\ 'highlight SignifyLineAdd cterm=bold ctermbg=NONE ctermfg=green',
\ 'highlight SignifyLineDelete cterm=bold ctermbg=NONE ctermfg=red',
\ 'highlight SignifyLineChange cterm=bold ctermbg=NONE ctermfg=yellow',
\ 'highlight SignifySignAdd cterm=bold ctermbg=NONE ctermfg=green',
\ 'highlight SignifySignDelete cterm=bold ctermbg=NONE ctermfg=red',
\ 'highlight SignifySignChange cterm=bold ctermbg=NONE ctermfg=yellow',
\ 'highlight SignColumn ctermbg=NONE',
\ 'highlight CursorLine ctermbg=NONE cterm=underline',
\ 'highlight Folded ctermbg=NONE cterm=bold',
\ 'highlight FoldColumn ctermbg=NONE cterm=bold',
\ 'highlight NonText ctermbg=NONE',
\ 'highlight clear LineNr'
\]
Plug 'powerline/powerline'
let g:Powerline_symbols = 'fancy'
set fillchars+=stl:\ ,stlnc:\
set termencoding=utf-8
Plug 'preservim/nerdcommenter'
Plug 'alx741/vim-rustfmt'
call plug#end()
set rtp+=~/.vim/bundle/Vundle.vim
call vundle#begin()
Plugin 'dense-analysis/ale'
Plugin 'morhetz/gruvbox'
Plugin 'Chiel92/vim-autoformat'
Plugin 'VundleVim/Vundle.vim'
Plugin 'octol/vim-cpp-enhanced-highlight'
let g:cpp_member_variable_highlight = 1
let g:cpp_class_decl_highlight = 1
let g:cpp_class_scope_highlight = 1
let g:cpp_posix_standard = 1
" let g:cpp_experimental_simple_template_highlight = 1
" let g:cpp_no_function_highlight = 1
" Plugin 'NavneelSinghal/vim-cpp-auto-include'
" autocmd BufWritePre /path/to/workspace/**.cpp :ruby CppAutoInclude::process
Plugin 'rhysd/vim-clang-format'
Plugin 'iamcco/markdown-preview.nvim'
let g:mkdp_auto_start = 0
let g:mkdp_auto_close = 1
let g:mkdp_refresh_slow = 0
let g:mkdp_command_for_global = 0
let g:mkdp_open_to_the_world = 0
let g:mkdp_open_ip = ''
let g:mkdp_browser = ''
let g:mkdp_echo_preview_url = 0
let g:mkdp_browserfunc = ''
let g:mkdp_preview_options = {
\ 'mkit': {},
\ 'katex': {},
\ 'uml': {},
\ 'maid': {},
\ 'disable_sync_scroll': 0,
\ 'sync_scroll_type': 'middle',
\ 'hide_yaml_meta': 1,
\ 'sequence_diagrams': {},
\ 'flowchart_diagrams': {},
\ 'content_editable': v:false,
\ 'disable_filename': 0
\ }
let g:mkdp_markdown_css = ''
let g:mkdp_highlight_css = ''
let g:mkdp_port = ''
let g:mkdp_page_title = '「${name}」'
let g:mkdp_filetypes = ['markdown']
Plugin 'vim-airline/vim-airline'
Plugin 'vim-airline/vim-airline-themes'
let g:airline_theme='badwolf'
let g:airline#extensions#tabline#enabled=1
let g:airline#extensions#tabline#formatter='unique_tail'
let g:airline_powerline_fonts=1
Plugin 'udalov/kotlin-vim'
" https://github.com/iamcco/markdown-preview.nvim/issues/43
Plugin 'davidhalter/jedi-vim'
call vundle#end() " required
execute pathogen#infect()
" Autocmds
autocmd VimEnter * SyntasticToggleMode " disable syntastic by default
autocmd BufNewFile,BufRead * setlocal formatoptions-=cro
autocmd BufNewFile ~/Desktop/cp/workspace/**.cpp 0r ~/.vim/templates/template.cpp
autocmd BufNewFile,BufRead *.kt set filetype=kotlin
autocmd BufReadPost * if line("'\"") > 1 && line("'\"") <= line("$") | exe "normal! g'\"" | endif
let &t_SI = "\<Esc>[6 q"
let &t_SR = "\<Esc>[4 q"
let &t_EI = "\<Esc>[2 q"
command W w !sudo tee % > /dev/null
" Basic
let $LANG='en'
set langmenu=en
source $VIMRUNTIME/delmenu.vim
source $VIMRUNTIME/menu.vim
let indentvalue=4
set noshowcmd noruler
set history=500
set autoread
set so=7
set wildmenu
set wildignore=*.o,*~,*.pyc,*/.git/*,*/.hg/*,*/.svn/*,*/.DS_Store
set ruler
set number
set relativenumber
set cmdheight=2
set hid
set backspace=eol,start,indent
set whichwrap+=<,>,h,l
set ignorecase
set smartcase
set hlsearch
set incsearch
set lazyredraw
set magic
set showmatch
set mat=2
set noerrorbells
set novisualbell
set t_vb=
set tm=500
set nocursorline
set nocursorcolumn
set foldcolumn=1
set encoding=utf8
set ffs=unix,dos,mac
set nobackup
set nowb
set noswapfile
set expandtab
set smarttab
let &shiftwidth=indentvalue
let &tabstop=indentvalue
set lbr
set tw=200
set autoindent
set smartindent
set wrap
set viminfo=
set bg=dark
set t_Co=256
set switchbuf=useopen,usetab,newtab
set stal=2
set laststatus=2
set clipboard+=unnamedplus
set mouse=a
set mousemodel=popup
set ttimeout
set ttimeoutlen=1
set ttyfast
" Appearance
try
colorscheme gruvbox
let g:gruvbox_transparent_bg=1
let g:gruvbox_contrast_dark='hard'
let g:gruvbox_invert_tabline=1
catch
endtry
if has("gui_running")
set guioptions-=T
set guioptions-=e
set guitablabel=%M\ %t
set guifont=JetBrains\ Mono\ 12
endif
" Keymappings
inoremap {<CR> {<CR><BS>}<Esc>O
inoremap [<CR> [<CR><BS>]<Esc>O
inoremap (<CR> (<CR><BS>)<Esc>O
let mapleader = ","
nmap <leader>w :w!<cr>
map <space> /
map <c-space> ?
map <silent> <leader><cr> :noh<cr>
map <C-j> <C-W>j
map <C-k> <C-W>k
map <C-h> <C-W>h
map <C-l> <C-W>l
map <C-n> :NERDTreeToggle<CR>
map <leader>l :bnext<cr>
map <leader>h :bprevious<cr>
let g:lasttab = 1
au TabLeave * let g:lasttab = tabpagenr()
map 0 ^
" for setting up inkscape-figures
inoremap <C-f> <Esc>: silent exec '.!inkscape-figures create "'.getline('.').'" "'.b:vimtex.root.'/figures/"'<CR><CR>:w<CR>
nnoremap <C-f> : silent exec '!inkscape-figures edit "'.b:vimtex.root.'/figures/" > /dev/null 2>&1 &'<CR><CR>:redraw!<CR>
ca Hash w !cpp -dD -P -fpreprocessed \| tr -d '[:space:]' \
\| md5sum \| cut -c-6
function! CmdLine(str)
call feedkeys(":" . a:str)
endfunction
" debugging stuff: to run in vim, run :Termdebug <executable_name>, and :help terminal-debug for help (use Ctrl+W twice to switch between windows)
packadd termdebug
let g:termdebug_popup = 0
let g:termdebug_wide = 163
set rtp^=~/.opam/default/share/ocp-indent/vim
On my personal system, compilation time becomes a major overhead, so I also use precompiled headers. For contests like Hash Code, I sometimes use OpenMP as well, so I keep flags for both in my Makefile.
PEDANTIC_CXXFLAGS = -Iincludes/debug/includes -std=c++20 -g -Wall -Wextra -Wpedantic -Wshadow -Wformat=2 -Wfloat-equal -Wconversion -Wlogical-op -Wshift-overflow=2 -Wduplicated-cond -Wcast-qual -Wcast-align -Wno-variadic-macros -DDEBUG -DLOCAL -D_GLIBCXX_DEBUG -D_GLIBCXX_DEBUG_PEDANTIC -fsanitize=address -fsanitize=undefined -fno-sanitize-recover -fstack-protector -fsanitize-address-use-after-scope
PEDANTIC_CFLAGS = -std=c17 -g -Wall -Wextra -Wpedantic -Wshadow -Wformat=2 -Wfloat-equal -Wconversion -Wlogical-op -Wshift-overflow=2 -Wduplicated-cond -Wcast-qual -Wcast-align -Wno-variadic-macros -DDEBUG -DLOCAL -D_GLIBCXX_DEBUG -D_GLIBCXX_DEBUG_PEDANTIC -fsanitize=address -fsanitize=undefined -fno-sanitize-recover -fstack-protector -fsanitize-address-use-after-scope
NORMAL_CXXFLAGS = -Iincludes -std=c++20 -O2 -DTIMING -DLOCAL -ftree-vectorize -fopt-info-vec
NORMAL_CFLAGS = -std=c17 -O2 -DTIMING -DLOCAL -ftree-vectorize -fopt-info-vec
PARALLEL_CXXFLAGS = -fopenmp
PARALLEL_CFLAGS = -fopenmp
D ?= 0
P ?= 0
ifeq ($(D), 1)
CXXFLAGS = $(PEDANTIC_CXXFLAGS)
CFLAGS = $(PEDANTIC_CFLAGS)
else
CXXFLAGS = $(NORMAL_CXXFLAGS)
CFLAGS = $(NORMAL_CFLAGS)
endif
ifeq ($(P), 1)
CXXFLAGS += $(PARALLEL_CXXFLAGS)
CFLAGS += $(PARALLEL_CFLAGS)
endif
clean:
	find . -maxdepth 1 -type f -executable -delete
	rm -f *.plist
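To see how the `D` and `P` switches interact, here is a runnable sketch that reproduces the conditional logic in a scratch Makefile and prints the resulting `CXXFLAGS` (the `flags` target and the trimmed-down flag values are made up for illustration; `.RECIPEPREFIX` is used so the recipe doesn't need a literal tab, which requires GNU make 3.82+):

```shell
# Recreate the D/P flag switching in a temporary Makefile and inspect CXXFLAGS.
tmp=$(mktemp -d)
cat > "$tmp/Makefile" <<'EOF'
.RECIPEPREFIX = >
PEDANTIC_CXXFLAGS = -g -fsanitize=address,undefined
NORMAL_CXXFLAGS = -O2
PARALLEL_CXXFLAGS = -fopenmp
D ?= 0
P ?= 0
ifeq ($(D), 1)
CXXFLAGS = $(PEDANTIC_CXXFLAGS)
else
CXXFLAGS = $(NORMAL_CXXFLAGS)
endif
ifeq ($(P), 1)
CXXFLAGS += $(PARALLEL_CXXFLAGS)
endif
flags:
>@echo $(CXXFLAGS)
EOF
make -s -C "$tmp" flags          # default: normal optimized flags
make -s -C "$tmp" flags D=1      # pedantic/debug flags
make -s -C "$tmp" flags D=1 P=1  # debug flags plus OpenMP
```

With the real Makefile, GNU make's built-in `%.cpp` rules mean `make sol D=1` is enough to build `sol` from `sol.cpp` with the debug flags.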
Let me know if I missed out on something, or if I should add something here!
This post was originally written on Codeforces; relevant discussion can be found here.
Recently, someone asked me to explain how to solve a couple of problems that went like this: “Find the expected time before XYZ happens”. Note that the time to completion here is a random variable, and in probability theory, such random variables are called “stopping times” for obvious reasons. It turned out that these problems were solvable using something called martingales, which are random processes with nice invariants and a ton of useful properties.
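As a classical instance of this kind of problem (not one of the problems from the discussion, but the standard textbook example): for a simple symmetric random walk with absorbing barriers, the martingale machinery gives the expected stopping time almost for free.

```latex
% Simple symmetric random walk $S_n$ started at $0$, stopped at
% $T = \min\{n : S_n = -a \text{ or } S_n = b\}$ for integers $a, b > 0$.
% Both $S_n$ and $S_n^2 - n$ are martingales, so optional stopping gives:
\mathbb{E}[S_T] = 0 \implies \Pr[S_T = b] = \frac{a}{a+b},
\qquad
\mathbb{E}[S_T^2 - T] = 0 \implies \mathbb{E}[T] = ab.
```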
I think a lot of people don’t get the intuition behind these topics or why they’re so useful, so I hope the following intuitive introduction helps people develop a deeper understanding. In the spirit of clarity, I will only deal with finite sets and finite-time processes (unless absolutely necessary, in which case the interpretation should be obvious). I do this because people tend to get lost in the measure-theoretic rigor and skip the intuition they should be developing instead. Also, note that some explanations contain more material than is strictly necessary, but the idea is to build a more informed intuition rather than present a terse-yet-complete exposition that leaves people without any idea of how to play around with the setup.
People already familiar with the initial concepts can skip to the interesting sections, but I would still recommend reading the explanations as a whole in case your understanding is a bit rusty. The concept of “minimal” sets is used for a lot of the mental modelling in the post, so maybe you’d still want to read the relevant parts of the blog where it is introduced.
I would like to thank Everule and rivalq for suggesting that I write this post, and them, along with meme and adamant, for proofreading and discussing the content to ensure completeness and clarity.
BONUS: Here’s an interesting paper that uses martingales to solve stuff.
Let’s look at the simplified definition of a probability space first. We say simplified here because there are some technicalities that need to be sorted out for infinite sets, which we’re not considering here anyway.
A probability space is a triple \((\Omega, \mathcal{F}, P)\) consisting of a set \(\Omega\) called the sample space, a collection \(\mathcal{F}\) of subsets of \(\Omega\) whose elements are called events, and a function \(P : \mathcal{F} \to [0, 1]\) assigning a probability to each event.
For this function to have some nice properties and for it to make sense as some measure of “chance”, we add some more constraints to these functions:
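The constraints in question are presumably the usual Kolmogorov axioms, which in the finite case read:

```latex
% Kolmogorov's axioms for a probability function (finite case):
P(A) \ge 0 \quad \text{for all } A \in \mathcal{F}
\qquad
P(\Omega) = 1
\qquad
P(A \cup B) = P(A) + P(B) \quad \text{whenever } A \cap B = \varnothing.
```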
Let’s now build some intuition about this definition (the following, especially the part about minimal sets, applies only to finite sets, but there are a lot of similarities in the conclusions when we try to generalize to other sets using measure-theoretic concepts).
Note that since \(\mathcal{F}\) is closed under intersections, if we repeatedly pick two sets and replace them with their intersection, we will end up with some “minimal” non-empty sets that we can’t break down further (this not