This is a collection of my thoughts on AI and problem-solving, which have remained relatively constant over the past few years, and some comments on recent efforts in this direction.

Recently, DeepMind claimed that they had made significant progress on the IMO Grand Challenge: their automated systems, AlphaProof and AlphaGeometry 2 (an upgraded version of the original AlphaGeometry), now perform at the IMO silver-medal level.

Having monitored this field on and off in the past, I found that this set off some flags in my head, so I read through the whole post here, as well as the Zulip discussion on the same here. This post is more or less an elaboration of my comment here, as well as of some of my older comments on AI/problem-solving. It is still not the entirety of what I think about AI and the computer-solvability of MO problems.

I had a couple of gripes with the content of their blog posts and the overly sensational headlines, so I would like to dissect their approach, provide (heavily opinionated) commentary on it, and propose directions that I feel are worth pursuing, in my capacity as a retired math Olympiad contestant and someone who has fiddled around with automated theorem proving and proof assistants like Lean and Coq in the past (while also doing cutting-edge machine learning research for quite a while now).

Past approaches

Problem-solving is not something that the field of AI is a stranger to. In fact, most early approaches to AI were focused on reasoning tasks, which you can read about on the Wikipedia page here. Early approaches were more or less symbolic and grounded in logic (read: the Lisp era), while computational approaches started to take over after some time. The focus also shifted from reasoning to much more manageable goals (pattern recognition and natural language processing). That brings us to the present, where LLMs are all the rage, and at the mere mention of LLMs, VCs will empty their pockets to invest in your firm (okay, not quite, but you get the point).

Do present-day LLMs possess actual intelligence?

Given how talking to ChatGPT and friends feels like talking to a real human being, people have started claiming that these models possess some inherent intelligence, by some definition of "intelligence".

Well, technically speaking, and in some approximate sense, these models (which are all just transformer models) simply output a likelihood-maximizing next token given the whole context so far, where the likelihood function is learned from data.

The actual mechanism is much more complex. However, here's a brief (crude, and not necessarily notationally precise) overview of the general understanding of how the architecture works. The explanation is skimmable - it is only meant to convey what the architecture does that grants it such strong pattern-recognition abilities in the domains where it is used.

We first compute vector embeddings that embed tokens (that the text is split into) into a high dimensional space where more “collinear” vectors are supposed to be closer in meaning. Then a part of the architecture (called an attention head) looks at these embeddings and computes certain properties of these embeddings. To do this, it computes three different types of vectors:

  • The query - The hypothesis is that a learned matrix maps the current token's embedding to a query vector that encodes what the attention head is "looking for" in the current token, i.e., the property that the head is supposed to capture.
  • The key - Similarly, another learned matrix maps the embedding of every token in the context to a key vector, which encodes how relevant that token is to this property in general. This captures how "helpful" that token is to the property (instead of being the subject of the property, it plays more of a supporting role).
  • The value - We take the dot product of the current token's query with every token's key. The hypothesis is that these dot products measure the importance of each token with respect to the property in question, as well as its connection to the current token, and taking a softmax of them gives weights that sum to 1. A third learned matrix maps each token's embedding to a value vector, and the weighted sum of these value vectors (with the softmax weights) gives a new embedding for the current token that has some context information built into it. (These are not projections in the strict sense, and the construction is general enough to be interpreted in multiple ways - such as adding something to the original embeddings so that they move closer and closer to an infinitely enriched embedding.)

Using one attention head, we are able to capture one property, and using multiple of them (i.e., a multi-head attention architecture), we end up with embeddings that concern themselves with several properties (no matter how uninterpretable). Stacking such layers over and over again captures the interactions between these properties (and some other stuff), and the hypothesis is that in the end we get highly informative embeddings of each token that carry context information.
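To make the description above concrete, here is a minimal sketch of one scaled dot-product attention head in plain numpy. The names (W_q, W_k, W_v) and shapes are the usual textbook ones rather than anything specific to a particular model, and a real causal language model would additionally mask out future positions.

```python
import numpy as np

def attention_head(X, W_q, W_k, W_v):
    # X: (seq_len, d_model) token embeddings
    Q = X @ W_q                                       # queries: what each token "looks for"
    K = X @ W_k                                       # keys: what each token "offers"
    V = X @ W_v                                       # values: the information that gets mixed in
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # pairwise relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax: each row sums to 1
    return weights @ V                                # context-enriched embedding per token

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                          # 5 tokens, 16-dimensional embeddings
W_q, W_k, W_v = (rng.normal(size=(16, 16)) for _ in range(3))
print(attention_head(X, W_q, W_k, W_v).shape)         # (5, 16)
```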

Now, using this architecture and some initial embedding (which can also be trained, as is usually done in GPTs), you can do a lot of things, depending on what you train it on. One way of training is to make the target, for each prefix of the text, the next token that follows that prefix. At inference, you just read off the prediction at the last position. For more details, the open source GPT2 repository is a good place to start.
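As a toy illustration of that objective (this is not a transformer, just the bookkeeping around it): every prefix of a token sequence becomes a training example whose target is the next token, and at inference we read off the prediction at the last position of the context.

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Each prefix of the sequence is paired with the token that follows it.
training_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for prefix, target in training_pairs:
    print(prefix, "->", target)

# At inference, the model receives the full context and we read off (sample or
# argmax) its predicted distribution at the last position to get the next token.
```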

But does this imply intelligence? There are a lot of people who say that it does (AI optimists overwhelmingly fall in this category), and lots of people on the other side of the debate. Speaking of myself, I tend to be on the other side of the debate more often than not.

Why? I feel that reasoning is a crucial part of intelligence. Such reasoning must come from certain hard rules embedded into a system - if you think of what the LLM learns as a realization of some kind of incomplete fuzzy logic, then usual logic is the limit where there is perfect certainty and things are completely specified. LLMs are famously bad at counting and arithmetic, for example, and that is simply one consequence of their failure to encode basic axioms (in any of their many equivalent forms). Similarly, these models quite literally learn correlations (as seen in the desired properties of word embeddings - similarity comes from the dot product) rather than causation.

The heavy dependence of these models on data (without much emphasis on hypotheticals) may let them learn "common sense" from the data distribution they have seen, but it does not let them extrapolate reasonably to out-of-sample hypothetical situations.

Another reason, which might be a bit annoying to people who think that the ends justify the means, is that I believe blackbox methods are not to be trusted when it comes to reasoning, especially in adversarial/complex situations where the reasoning is lengthy and highly non-linear. The internet is a pretty easy dataset for machine learning (remember what I said earlier about pattern recognition being a much more manageable goal than reasoning?), because what humans say on the internet is easily predictable in the average case (given all the context of the text, for example), and the average case is what matters for these tasks (which are also very well-behaved in general, i.e., thin-tailed in some sense). In such a case, blackbox methods converge pretty fast to a reasonable minimum, because we are already overparameterizing the network so much that there is very little to be gained by making the model more reasonable and precise - all that remains is to train it on a huge amount of data to ensure that the model learns most behaviour it will ever see (there has been criticism of LLMs and other large machine learning models as data-laundering machines, and this has led to copyright lawsuits). There was also a lot of controversy here related to concerns over test-data leakage for AlphaCode 2.

Meanwhile, on any task (research, or even academic competitions like the IMO) that involves multi-hop, complex reasoning requiring sophistication at the level of making conjectures/generalizations, LLMs currently fail to do well, perhaps simply because encoding this reasoning systematically in LLMs is non-trivial (and no matter how much we hypothesize about the functionality of different parts of these models, what the model learns is whatever the learning algorithm finds to be the easiest optimizer of the loss function at hand). Approaches that reduce complexity by (indirectly) embedding something akin to the early approaches to AI should be far more successful, in particular because those approaches recognized the importance of explicitly incorporating logic into systems. Modern machine learning theory also recognizes this importance - regularizing models using priors exists, and people design their machine learning models in a domain-specific manner too. There are also specialized approaches in machine learning that design models so that they can be trained to respect certain logical constraints, but these methods are mostly very domain-specific, however effective they may be in their domain of applicability.

All in all, my personal opinion is that LLMs should be thought of as what they are - language models and nothing else. People who claim that intelligence is merely word processing are either missing the general part of AGI by deliberately ignoring logic, math and creativity, or are being disingenuous due to conflicts of interest. A generic solution to AGI does not seem in sight for now, though small steps towards it are what will eventually guide us there, if it is ever possible.

Solving math problems at the IMO mechanically

From my experience as a past math Olympiad/competitive programming contestant, I have had an informal "mechanical-ness" ordering of the typical MO subjects for quite a while - one that anyone who has been competing for some time quickly develops.

Geometry

It is the most mechanical of them all - so mechanical that multiple coordinate systems have been developed to allow contestants to "bash" their way through problems, as the technique is affectionately called. This has been known for at least a few centuries - Cartesian coordinates were developed in 1637 by René Descartes, and there was an informal understanding that all 2D geometry problems can be solved using them.

A decidability-friendly formalization of geometry came in 1959 in the form of Tarski’s axioms for Euclidean geometry. The good part is that there is an algorithm (via the Tarski–Seidenberg theorem) to prove/disprove any geometrical statement that can be written in Tarski’s formulation - however, the bad part is that the algorithm is computationally infeasible on a computer.

There have been multiple independent approaches, mostly by Olympiad geometry enthusiasts, to make this much more viable. From a contest perspective, most (non-combinatorial) geometry problems can be proposed in the following form:

Given certain points and “curves” (more specifically, algebraic varieties), with a constant number of independent constraints, prove that a certain property is true.

In most practical cases, the number of independent constraints never exceeds 10. Informally speaking, independent constraints are those constraints which can’t be naively encoded in constructions of the diagram.

So the general approach is to use any of the multiple coordinate systems (Cartesian/complex/barycentric/trilinear/tripolar), determine the equation of each object, and go through the constraints one by one to convert each into an equation. The conclusion should then follow from the equations written down, and this can usually be done via a Computer Algebra System (CAS), or by hand, if you devote enough time (and practice) to it. This sort of approach is also taken by the Kimberling Encyclopedia of Triangle Centers, to maintain a list of properties of triangle centers that have been encountered in the wild. There are ways documented in the literature that make this explicit - for example, Wu's method, which ends up being more general than is usually needed, and its computational complexity suffers because of that generality.
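Here is a minimal sketch of such a coordinate bash using sympy as the CAS, on a textbook fact (that the three medians of a triangle concur). The coordinates are symbolic, so the check is a proof for every non-degenerate triangle rather than a numerical experiment; placing one vertex at the origin and one side on the x-axis is the usual WLOG normalization.

```python
from sympy import symbols, solve, simplify

a, c, d, t, s = symbols('a c d t s', real=True)

A = (0, 0)                       # WLOG: A at the origin
B = (a, 0)                       # B on the x-axis
C = (c, d)

Ma = ((a + c) / 2, d / 2)        # midpoint of BC
Mb = (c / 2, d / 2)              # midpoint of AC

# Intersect median A-Ma (parametrized by t) with median B-Mb (parametrized by s).
eqs = [A[0] + t * (Ma[0] - A[0]) - (B[0] + s * (Mb[0] - B[0])),
       A[1] + t * (Ma[1] - A[1]) - (B[1] + s * (Mb[1] - B[1]))]
sol = solve(eqs, [t, s], dict=True)[0]
G = (simplify(A[0] + sol[t] * (Ma[0] - A[0])),
     simplify(A[1] + sol[t] * (Ma[1] - A[1])))
print(G)                         # ((a + c)/3, d/3), the centroid

# The third median joins C to the midpoint of AB; check G lies on it by showing
# the vectors C->G and C->midpoint(AB) are parallel (their cross product is 0).
Mc = (a / 2, 0)
cross = simplify((G[0] - C[0]) * (Mc[1] - C[1]) - (G[1] - C[1]) * (Mc[0] - C[0]))
print(cross)                     # 0, so the three medians concur
```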

This approach clearly lacks beauty, and is often frowned upon (but Olympiad graders must give credit to the solution if it is correct math, no matter how begrudgingly). However, it is very effective, and I have yet to find a pure Euclidean geometry problem (i.e., without hints of combinatorics, algebra or number theory) that can’t be approached using such a method.

Recently, an alternative approach to automated geometry problem-solving has come up. Instead of mindlessly bashing the problem, the idea is to try to get to a logical synthetic proof instead (i.e., one that does not involve heavy algebraic manipulations), much closer to what one would see in a high school geometry class.

One simple way is to combine Wu’s method with basic geometrical facts (the latter is something that has always been known by MO contestants, and the former has been around since the late 1970s) - as it turns out, it comes pretty close to the much more recent AlphaGeometry when performance under a time-limit is considered (in fact, the method will always come up with a solution).

The other approach along these lines is a neuro-symbolic one, which has also been investigated for quite a while before AlphaGeometry was a thing. Multiple surveys on this general method can be found in this GitHub repository. It is noteworthy that AlphaGeometry is also such an approach, and I’ll elaborate on the approach in its own section.

There are also different things that are not exactly related to solving geometry problems, like GeoGen, a system to auto-generate Olympiad geometry problems, and OkGeometry which helps in finding properties of dynamic configurations.

Algebra (inequalities)

The second-easiest thing to approach mechanically, after geometry, is the field of algebraic inequalities. Very early on, the main arsenal of a math Olympiad contestant consisted of basic inequalities like AM-GM, power mean, Cauchy-Schwarz, rearrangement, Hölder, Jensen, Karamata, Schur and Muirhead. Contestants were expected to apply these inequalities in creative ways to prove a certain inequality, which was either the main problem or a small part of it.

However, things started changing rather quickly as inequalities became more popular. Methods like SOS (sum of squares), Mixing Variables, uvw, the Buffalo Way, and so on became more and more popular, mechanizing the approach to proving inequalities. I know of at least 3 different (private) pieces of software that were independently developed by people on AoPS to semi-automatically solve inequalities posted on the forum. The phenomenon is well-known enough that there are still collections like this one of humorous proofs of inequalities via horribly tedious calculations.
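As a tiny example of the SOS idea (a sketch only; real SOS solvers search for the decomposition automatically): to prove a^2 + b^2 + c^2 >= ab + bc + ca, it suffices to exhibit the difference as a sum of squares, and a CAS can verify the identity mechanically.

```python
from sympy import symbols, simplify, Rational

a, b, c = symbols('a b c', real=True)

lhs_minus_rhs = a**2 + b**2 + c**2 - (a*b + b*c + c*a)
sos = Rational(1, 2) * ((a - b)**2 + (b - c)**2 + (c - a)**2)

# 0 means the difference equals a sum of squares, hence is nonnegative.
print(simplify(lhs_minus_rhs - sos))
```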

Algebra (remaining) and Number Theory

We are now stepping into more perilous territory. Note that we are confining ourselves to math Olympiads only - the Riemann hypothesis, the Collatz conjecture and the like (which, incidentally, can also be written in the language of inequalities) are off-limits.

The reason behind this is that constructions and non-obvious lemmas become way more important now, compared to what we have already seen so far. And the involved theorems and lemmas come from a much larger space than one would find in geometry. For example, you might want to apply a lemma on a certain kind of infinite set, but in geometry, any interesting set will generally be finite. So even if the set of rules might be smaller, the set of applicable instances might be infinitely larger. This is intuitively part of the reason why there is no analogue to the decidability theorem for Tarski geometry for anything involving the natural numbers, for instance.

Generally, the overlap between these subjects and combinatorics is also much higher than for geometry and combinatorics, and logic starts becoming more and more non-linear (i.e., the most natural path of deduction is not a chain, but more like a graph). Non-linearity of deduction is a harder thing for LLMs to reason about than linear deduction, presumably because of the context being linear. Proofs are generally written linearly - and while LLMs can be made to work with the topologically sorted (linearized) proof, it is definitely not the most natural way of learning how to prove things.
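To illustrate the linearization point: a proof is most naturally a DAG of lemmas, and the linear text one trains on is just one of its topological orders, which hides the dependency structure. A minimal sketch:

```python
from graphlib import TopologicalSorter

# Edges map each step to the earlier steps it depends on.
proof_dag = {
    "main claim": {"lemma A", "lemma B"},
    "lemma B": {"setup", "lemma A"},
    "lemma A": {"setup"},
    "setup": set(),
}
# One valid linearization: ['setup', 'lemma A', 'lemma B', 'main claim'].
# The written-out proof is such an order; the graph structure stays implicit.
print(list(TopologicalSorter(proof_dag).static_order()))
```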

Combinatorics

Combinatorics in math Olympiads has the reputation of being the most chaotic subject, and is often referred to as the dumpyard of problems that can’t be categorized elsewhere. Judging from this reputation alone, it should be the hardest subject for AI to train for.

The following are some of the reasons which make MO combinatorics a relatively high bar for AI compared to the other subjects:

  • The search space for solutions is barely organized, which is the natural enemy of both search-based approaches and learning-based approaches.
  • Constructive elements crop up more naturally in combinatorics, and searching for such constructions is also hard. This is in some sense similar to writing code automatically, but strictly harder. IOIers tend to find MO combinatorics easier too, and it has good reason behind it - combinatorics problem-solving requires a certain level of algorithm design literacy. I first realized this when a certain IMO gold medalist advised me to stop studying combinatorics from traditional math Olympiad books and read algorithm design books instead.
  • Experimentation is a major part of solving combinatorics problems, in the absence of a structured approach. Meanwhile, no AI does experimentation in as much depth/with as much explicitness as is required.
  • Combinatorics problems are harder to formalize because of multiple reasons - lack of a unified framework for representing problems (i.e., these problems are too ad hoc) and too many moving parts in a problem (processes and so on) being a couple of reasons.
  • Much larger intersection with research mathematics - most hard problems proposed to the IMO that end up in the combinatorics shortlist tend to have a much larger intersection with research mathematics than the rest of the subjects combined, mostly because most modern research mathematics can't be classified into algebra/NT as we see them on the IMO. This suggests that hard combinatorics problems are closer to research mathematics than to the rest of the IMO, and should generally be much harder due to the deeper thought involved.

In a later section, I’ll point out approaches that I feel can help AI tackle combinatorics problems, based on these insights. No guarantees of course, combinatorics is hard.

The AlphaCode approach and analysis

There were two versions of AlphaCode - the first one with some discussion here (funnily enough, this also had some discussion on math olympiad geometry problems vs AI, in which I was involved; it preceded AlphaGeometry by quite a bit and predicted that geometry would be the first to be conquered using AI), and the second one with some discussion (and a lot of controversy) here.

Quoting the article: “We pre-train our model on selected public GitHub code and fine-tune it on our relatively small competitive programming dataset. At evaluation time, we create a massive amount of C++ and Python programs for each problem, orders of magnitude larger than previous work. Then we filter, cluster, and rerank those solutions to a small set of 10 candidate programs that we submit for external assessment. This automated system replaces competitors’ trial-and-error process of debugging, compiling, passing tests, and eventually submitting.”
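Mechanically, the pipeline that the quote describes looks roughly like the sketch below. This is a hedged reconstruction from the quote only (filter on the statement's example tests, cluster by behaviour on extra inputs, submit one representative per cluster); the function and variable names are placeholders, not AlphaCode's actual code.

```python
from collections import defaultdict

def filter_and_cluster(candidates, example_tests, probe_inputs, k=10):
    # 1. Filter: keep only programs that pass the example tests from the statement.
    survivors = [prog for prog in candidates
                 if all(prog(inp) == out for inp, out in example_tests)]
    # 2. Cluster: group survivors by their behaviour on extra probe inputs, so that
    #    semantically identical programs end up in the same bucket.
    clusters = defaultdict(list)
    for prog in survivors:
        signature = tuple(prog(inp) for inp in probe_inputs)
        clusters[signature].append(prog)
    # 3. Rerank: submit one representative from each of the k largest clusters.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [cluster[0] for cluster in ranked[:k]]

# Toy usage: "programs" are just callables here. The squaring program sneaks past
# the single example test but gets separated from the correct ones by clustering.
cands = [lambda x: x * 2, lambda x: x + x, lambda x: x ** 2]
print(len(filter_and_cluster(cands, example_tests=[(2, 4)], probe_inputs=[3, 5])))  # 2
```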

I'll be blunt - I don't like this approach. Code generation is as precise a task as math, and especially so in competitive programming. As I already said earlier, combinatorics problems and informatics olympiad problems are highly correlated, so it makes little sense to expect an approach that does not use logic to perform well on competitive programming problems either.

It works (at the level it works) because the simple problems on Codeforces are really just there to encourage people to submit something and start doing competitive programming. A problem at the level of Div2A/B is something anyone with elementary math aptitude (read: word problems) can solve immediately, and Div2C is something that someone who has additionally seen a lot of code can solve with enough effort. And Div2A/B is the level at which their AI solves problems semi-reliably. You would expect the limited capabilities of a non-fine-tuned LLM to match this performance. In fact, if you used ChatGPT back then, solving Div2A/B was within the reach of a participant who only knew how to prompt-engineer (and had some basic common sense).

There were additional concerns about test-data leakage, which is a very serious issue, considering that LLMs are basically data-laundering machines, especially with small datasets. Their submission to the hardest problem of a contest, which they cherry-picked and displayed very proudly, was also strongly suspected to be the result of a test-data leak - the solution matches the editorial solution of that problem too closely for it to be a coincidence.

This led to a lot of discussion over the data source, and there was a lot of evidence (for example, in this thread) that Google had used training data more recent than the evaluation data. This was also corroborated when we saw some random Egyptian names in the boilerplate code that people use for their solutions, which the AI had copied as well - consider this bot submission and this human submission (which was, chronologically, the first submission on Codeforces/GitHub that had this function name). At this point, it is worth noting that there are more than a thousand public Codeforces submissions made by AlphaCode (though through different accounts, which, fortunately, some helpful users on Codeforces aggregated under another comment in the parent post of the linked thread) - there might be quite a bit to learn from these submissions, especially considering that the list of attempts made for each problem should be a good representative of what the model does towards the end of its search. Unfortunately, the AI did not participate in any live contests, so we might never know how it performs on real out-of-sample data.

This culminated in a member of the community (who is very well-respected in the ML community too) getting on a call with Google, in which they confirmed that the data was sourced from a third-party company called remotasks, and that some of their training data was more recent than their test data. Google's plan for avoiding leakage in evaluation was to check that no problem description or identical solution appeared for a training-dataset problem. While that protects against fine-tuning test-data leakage, it does not prevent the base model (Gemini) from having been trained on the editorial.

The thing I find disingenuous about their press (I will talk more about it later) is the over-sensationalization of everything they do, and the lies involved. They also vehemently denied any test-data leak, which was something they could not afford to admit, or else their AI project (which was in its initial stages back then) would have been completely doomed. They were also struggling to put out their own AI back then (a necessity for such companies to build the shiny new thing that everyone is crazy about), so I would not put it past them to do anything that could save their standing in the field of AI. This desperation is also visible in how they take highly optimistic quotes from top competitive programmers (who probably don't know how these models work) and possibly remove all skeptical comments.

We will see more examples of such disingenuous (but a bit less harmful) press later too, but we’ll try to stick to the technical parts a bit more.

So, how do we fix this anyway? I would argue that, if you go into the details, this is probably going to require a very different approach to generating algorithms than the one used for the math part.

After all, half of all of AlphaCode’s submissions were basically hardcoding the first test-case, showing true dedication to test-driven development, ignoring the problem statement at hand completely.

One issue with competitive programming problems is the lack of formal definitions in literature. For example, consider a basic data structure that is very common in the competitive programming community, called a segment tree. It was formalized as recently as 2020 (public documentation). Similar documentation exists for only a very small subset of data structures used in competitive programming.
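For concreteness, here is a minimal segment tree (the iterative, point-update/range-minimum folklore version; many variants exist). It is exactly the kind of structure whose mathematical specification would need to be pinned down before competitive programming problems can be treated as formal math problems.

```python
class SegmentTree:
    """Point update, range-minimum query over a fixed-length array."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [float("inf")] * self.n + list(values)
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])

    def update(self, pos, value):
        # Set values[pos] = value and fix the ancestors.
        i = pos + self.n
        self.tree[i] = value
        while i > 1:
            i //= 2
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])

    def query(self, lo, hi):
        # Minimum over the half-open index range [lo, hi).
        res = float("inf")
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:
                res = min(res, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                res = min(res, self.tree[hi])
            lo //= 2
            hi //= 2
        return res

st = SegmentTree([5, 2, 7, 1, 9])
print(st.query(1, 4))   # 1
st.update(3, 8)
print(st.query(1, 4))   # 2
```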

Let's say we have defined the mathematical properties of these structures. Then finding a framework to convert these into math problems (fill an algorithm into the blank and then prove its correctness) is the major problem that needs to be solved first (and in this sense, this is harder than just solving a math Olympiad problem). Fully automated problem-solving is undecidable due to the halting problem, so we will definitely need heuristics and will have to be satisfied with the possibility of not getting an answer. This is probably why people don't think about it too much and instead plug their problem into a GPT and manually tinker with the solution, or keep prompting until they get tired.

To reiterate, LLMs work well on easier problems (but cannot work well on harder problems, due to the lack of logic) because there are way too many similar problems on the internet, and the syntax of the programming languages fed into the LLM during training is uniform enough that the recall of the LLM turns out to be decent for repeated problems.

The AlphaGeometry approach

So what is the AlphaGeometry approach? This is not as much of a shitshow as AlphaCode, fortunately. But it is not as eye-catching either, in the sense of having fancy models generating stuff. The paper is here.

The main ideas are the following:

  • Geometry has very little (formal-language) data. So we want to somehow generate lots of synthetic data that corresponds to problems of a decent level. This can be done by running a symbolic deduction engine (based on two techniques called deductive database (DD) and algebraic rules (AR)).
  • Modern geometry proofs have one ingredient that is not easily attacked by deduction engines - constructions. Auxiliary constructions are what they call exogenous terms in geometry, which means you basically have to pull them out of somewhere almost magically. To handle this, they do essentially the same data generation, but keep those proofs whose conclusions are independent of some objects in the problem statement. Then they move those objects in the set difference to the "construction" part of the proof, and fine-tune a transformer (on the formal language, not natural language) to suggest such constructions whenever the base solver is stuck, after pre-training it on the whole synthetic dataset (a rough sketch of the resulting loop is below).
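A hedged sketch of that loop, purely as my reading of the description above (the callables deduce and propose_construction stand in for the DD+AR engine and the fine-tuned language model; none of these names are from the paper):

```python
def solve(premises, goal, deduce, propose_construction, max_constructions=16):
    # deduce(facts) -> a larger set of facts (the symbolic DD+AR closure).
    # propose_construction(facts, goal) -> one new auxiliary object/fact.
    facts = set(premises)
    for _ in range(max_constructions):
        facts = deduce(facts)                          # exhaust the "easy" deductions
        if goal in facts:
            return facts                               # done (proof recording omitted)
        facts.add(propose_construction(facts, goal))   # LM suggests an auxiliary construction
    return None                                        # gave up within the construction budget

# Toy usage with stand-ins: facts are strings, and the "LM" proposes a midpoint.
print(solve({"AB = AC"}, "angle B = angle C",
            deduce=lambda f: f | ({"angle B = angle C"} if "AB = AC" in f else set()),
            propose_construction=lambda f, g: "midpoint of BC") is not None)  # True
```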

The conclusion is that, out of the 30 problems, the DD+AR approach with human-designed heuristics was the best one that did not involve any transformer, solving 18 problems. Training the transformer gave improvements - when it was fine-tuned only on auxiliary constructions, the system solved 21 problems; when the transformer was only pre-trained, it solved 23 problems; and with both, it solved 25 problems.

Notably, they found that asking GPT4 for hints on auxiliary constructions did not help much (it added only one extra problem to the vanilla DD+AR approach, solving 15 problems in total).

To summarize their results on IMO-AG-30 (problems solved out of 30):

  • DD+AR alone: 14
  • DD+AR with GPT4 suggesting auxiliary constructions: 15
  • DD+AR with human-designed heuristics: 18
  • AlphaGeometry, transformer fine-tuned only on auxiliary constructions: 21
  • AlphaGeometry, transformer only pre-trained: 23
  • AlphaGeometry, pre-trained and fine-tuned: 25

Intuitively, the transformer helps by patching up the parts of the proof that a symbolic engine can't complete; it knows the general shape of the proofs (by learning them almost completely, of course), and is fine-tuned to tell the solver what kinds of constructions are statistically the most likely to help, by extrapolating from training data. Perhaps this solidifies the reputation of Euclidean geometry (incidentally also my strongest MO subject, and one at which I was among the strongest human contestants in the world for quite a while before I retired) as the most mechanical of all the MO subjects.

But the story does not end here. As pointed out earlier, a newer paper that considered Wu's method as a helper for AlphaGeometry claims that Wu's method alone solves 15 problems within a time limit of 5 minutes per problem on a CPU-only laptop (which casts some doubt on the AlphaGeometry paper's claim that Wu's method took 48 hours to solve only 10 problems). In that paper, they also show that DD+AR combined with Wu's method solves 21 problems, matching AlphaGeometry's score when its transformer was trained only on auxiliary constructions. These include 2 problems that were not solved by AlphaGeometry, so stacking the two methods gives an AI that solves 27 problems on IMO-AG-30. Adding these 2 problems to the DD+AR approach with human heuristics gives a decent score of 23 problems, which is not that far off from AlphaGeometry, and still does not use transformers or any other machine learning.

Again, the AlphaGeometry publicity just focused on the LLM part, because of course they need their LLM funds.

One thing to note, by the way, is how capturing constructions here shows that we definitely need more structured ways to reason about problems automatically, or as in my comment that I linked in the beginning, “As a natural generalization, it would greatly help the AI guys if they identify certain patterns in human problem-solving — naturally because humans have very limited working memory, so our reasoning is optimized for low context and high efficiency. Patterns like simplification/specialization, coming up with conjectures, wishful thinking and other such common patterns that one can find in books like Polya’s How to Solve It should probably be captured during training these models. […] Of course, after we have all these models, the control flow is just the traversal of the graph where these models are all nodes.”

Some people pointed out the bitter lesson to me as opposing this view, in this context. I agree with the article, even if it is sometimes just used as an excuse to be lazy and throw compute at a problem. However, I argue that the argument in that blog is consistent with my comment. The ability to reason can come from higher abstractions that we may never be able to completely understand, that arise at a scale incomprehensible to our minds, and I completely agree with that (Inter-universal Teichmüller theory anyone?). We are not baking in any knowledge. We are only ensuring that our architectures (in the most general sense of the word, regardless of whether they are classical ML models or not) are able to capture fundamental solving strategies and reduce the complexity of the search space. Quoting from the article: “[Simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries] are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.” We never overlook sensible transformer architectures, so why overlook sensible and scalable architectures that reduce our search space complexity by adding provably complete blocks (that also perform computational search, which is also scalable)? In fact, LLMs as of now in the context of reasoning perhaps violate the ideas in that article - by simply scaling breadth-wise and not depth-wise (in the sense of having shallow reasoning parroted off from the internet, and not seeming capable of original, directed and precise mathematical thought).

The AlphaProof approach

So AlphaGeometry was decent, even though we noted that, at the current stage, we can do just fine without transformers. How do we go from here to solving something that is not geometry?

A disclaimer: AlphaProof does not seem to have a paper published with the details yet (and, as a consequence, it is not reproducible), so their results carry less legitimacy. There are a lot of details that are either ambiguous or straight-up missing. I understand that this might be because they want to go ahead and be the first to solve the IMO Grand Challenge (and such secrecy is common in such cases), but it leaves us with a lot less to go on, and potentially robs other people of the opportunity to claim the first silver-level performance (and, as Kevin Buzzard jokes in the Zulip chat, is it really a silver-medal performance if it took the AI 3 days? And, may I add, even with the amount of compute Google could throw at the problem?).

All the cynicism aside, the details below are based on what is mentioned in their blog post here.

Turns out, the AlphaProof architecture is more or less similar to the AlphaGeometry one. There are a few major differences:

  • The language used is now Lean, instead of the geometry-specific language used in AlphaGeometry. This increases the action space (the set of valid syntax available to the analogous language model used in the generation step).
  • Apparently, the DD+AR method does not work so well here, and they pivoted to a purely neuro-symbolic method (again, it's possible that they did not try any older methods from the literature). They do not mention what exactly the proof-search strategy is, either. Having a paper (and making this proof system available) would clarify this further, and legitimize their claims a bit more.
  • The data generation procedure mentioned here is interesting. They claim to have some sort of auto-formalization procedure (at least for Olympiad-style problems), achieved by fine-tuning Gemini - this is a huge achievement if true, and I would love to experiment with it myself. My hypothesis is that it is not an accurate auto-formalization procedure, and that it instead takes mathematical statements and converts them to some formal statements (which may or may not correspond to the original statements), which are then proved/disproved.
  • They claim that some language model is pre-trained, and it is not clear whether it is the translator model or the solver model. If it is the solver model, everything seems to work out, but the question of where they got the training data comes into the picture (maybe from other automated solvers?). Otherwise, it is not clear what the proof engine initializes itself with, because you need a place to start, and the state of non-neural automated theorem proving at the IMO level is not good. Perhaps the proof engine is hybrid, as in AlphaGeometry, and they just don't mention the helper strategy (or the main strategy) used to find proofs. Or it is possible that they started out with a basic dataset where the Lean proofs are relatively trivial, and bootstrapped their way up the proof lengths (and/or architecture complexity) - see the sketch after this list.
  • The solving procedure is also relatively dynamic, changing some parts of the problems to “understand” structure and using RL to be able to find better proof strategies for the problem at hand.
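Reading between the lines, such a bootstrapping setup would resemble an expert-iteration loop: formalize problems, sample candidate proofs, keep only those the Lean kernel verifies, and train on them. The sketch below is purely my speculation based on the description above; every name in it is a placeholder, not DeepMind's API.

```python
def bootstrap(problems, formalize, generate_proof, lean_check, train, rounds=3, samples=8):
    model_state, verified = None, []
    for _ in range(rounds):
        for problem in problems:
            statement = formalize(problem)          # auto-formalization (e.g. a fine-tuned LM)
            candidates = [generate_proof(model_state, statement) for _ in range(samples)]
            # Only proofs that the Lean kernel accepts become training data.
            verified += [(statement, proof) for proof in candidates
                         if lean_check(statement, proof)]
        model_state = train(model_state, verified)  # fine-tune the prover on verified proofs
    return model_state
```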

They claim that this solved 3 problems (the non-geometry ones) in 3 days (it is unclear whether this was per problem or in total), while the geometry problem was solved by AlphaGeometry 2 in 19 seconds. Meanwhile, the two combinatorics problems remained unsolved.

More on this hopefully when there is more info/a paper.

Some insights

General insights

The proof search strategy generally seems pretty naive, and seems to lack depth in the face of the tremendous branching factor that proofs generally have. Working on improving the search strategy should help reduce the compute involved.
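To make "improving the search strategy" slightly more concrete, one baseline beyond naive breadth-first expansion is best-first search guided by a scoring function (hand-written or learned). This is only an illustrative sketch; expand, is_proved and score are placeholders for a tactic enumerator, a goal check and a value/policy model.

```python
import heapq

def best_first_search(initial_state, expand, is_proved, score, budget=10_000):
    # Higher score = more promising; the counter breaks ties so states never get compared.
    frontier = [(-score(initial_state), 0, initial_state)]
    counter = 1
    while frontier and budget > 0:
        _, _, state = heapq.heappop(frontier)
        budget -= 1
        if is_proved(state):
            return state
        for nxt in expand(state):                     # apply each applicable tactic/inference
            heapq.heappush(frontier, (-score(nxt), counter, nxt))
            counter += 1
    return None                                       # budget exhausted
```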

Generating combinatorics data

I suspect that there is a good amount of imbalance in the training data. When I was browsing some of the proofs in this repo of manually written-out Lean solutions, I observed one thing - most combinatorics proofs were incomplete (in the Lean code, you can check this by searching for sorry)! And perhaps for good reason - combinatorics is generally harder to fit into a formal framework than the other three subjects (for example, if someone at the IMO implicitly assumes ZFC instead of ZF, what will the graders do? And what if the formulations need many more definitions than in the other three subjects?).
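For readers who have not seen it, this is what such an unfinished formalization looks like in Lean: the statement is there, but the proof is the sorry placeholder (the statement below is a made-up toy, not one from the linked repository).

```lean
-- `sorry` closes any goal while marking the proof as incomplete; searching a
-- repository for it is a quick way to count unfinished formalizations.
theorem toy_claim (n : Nat) : n + 0 = n := by
  sorry
```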

Maybe the same holds for the combinatorics problems here, considering that even when the model ran RL training loops (at inference time) on variations of those problems, they still remained unsolved (despite P5 being relatively easy)?

Even if these problems are auto-formalized well, there is a high chance that the language model learns combinatorics problems pretty slowly, or is simply unable to solve them (thereby not affecting the relevant weights in a meaningful way).

One way to fix this is to formalize combinatorics solutions by hand, but that would be a gigantic effort, not worth the manpower. This brings us to the other solution.

Making the setup more combinatorics friendly

One (perhaps slightly far-fetched) reason could be that Lean might not be the perfect language for combinatorics at the level of complexity we have here. One possibility is that the model is simply unable to learn the structure, and that it requires the right level of abstraction to think at. In this respect, there has been research on proof assistants (like Cur, GitHub) that are somewhat Lispy in nature - they allow metaprogramming and building up a domain-specific language rather easily. With the right abstractions in use, it might become feasible for the model to learn more high-level ideas (of course, having a different part of the solver lower the abstraction might be necessary). This follows from common sense - if you have a large hierarchical database of theorems, then a complex problem might just be an application of a couple of high-level theorems in a certain manner. Similarly, it frees us from low-level constraints, which are in some sense an extra imposition that is as bad as imposing our own knowledge on the system.

Another reason might be, as I pointed out earlier, that our proof search mechanism is very naive and probably can't constrain the search space enough to prove theorems on its own. In AlphaGeometry, the novel idea was to fine-tune a model on the construction parts of proofs. However, this does not seem to be what AlphaProof does, and it does not seem very feasible here either.

The lack of context is also a significant issue - proofs can become arbitrarily deep (so we should probably use an architecture with a long enough context window, or a hybrid model that captures long-range implications directly via elimination rules instead of relying on the model). Word embeddings do not carry the intended meaning as well as they do in the natural-language setting, and the tokens need to be designed more carefully as well.

In my comment, I proposed different model nodes (which can be arranged in an arbitrary digraph-like architecture) corresponding to different basic building blocks of a potential search strategy. They act as helper nodes to the architecture, helping it find exogenous terms in proofs (and increasing the visibility of proofs that have such terms), both during inference and during adversarial training:

  • Example generation nodes: These nodes can generate (partially or completely) specialized instances of the theorem at hand. This is simply the job of a traditional theorem prover, which can also be augmented by a language model, in order to construct useful examples.
  • Conjecture nodes: These nodes can inspect the examples and make a conjecture based on them. This can be done via training the model on chains of the form (example -> conjecture).
  • Generalization nodes: These do what they say, but are really just the inverse of the example generation nodes.

Now that we have a description of all these nodes, all that remains is the information flow between them. Even the structure of this graph is not 100% clear. One possible model of information flow is via a graph, something still visible in many approaches these days - KANs and liquid time-constant networks, to name a couple - where the worker nodes each output something that must be used by the other nodes, and the most important parts are in the nodes themselves. The transition probabilities can come from some sort of policy model, which in some sense learns the control flow itself (a rough sketch of this node-and-controller view follows below).
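Purely to make the shape of this proposal concrete, here is a toy sketch of the node-and-controller idea; the node names, the dictionary "state" and the random policy are all illustrative placeholders, not an existing system.

```python
import random

class Node:
    def __init__(self, name, work):
        self.name = name
        self.work = work          # work: state -> new state

def run_controller(state, nodes, policy, steps=10):
    for _ in range(steps):
        node = policy(state, nodes)            # ideally a learned transition/policy model
        state = node.work(state)
        if state.get("solved"):
            break
    return state

# Toy usage with stub workers and a uniformly random policy.
nodes = [
    Node("examples",   lambda s: {**s, "examples": s.get("examples", 0) + 1}),
    Node("conjecture", lambda s: {**s, "conjectures": s.get("conjectures", 0) + 1}),
    Node("generalize", lambda s: {**s, "solved": s.get("conjectures", 0) > 0}),
]
print(run_controller({"problem": "toy"}, nodes, policy=lambda s, ns: random.choice(ns)))
```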

This list of nodes is nowhere near exhaustive, but for most combinatorics problems, this sort of approach generally works, at least for a human mind. Just as with the attention mechanism, it might well be that these models learn completely different things than what we expect. But giving the model extra degrees of freedom will always give it enough breathing space to do more than it was designed for.

Of course, there must be ways to encode these nodes directly in a reinforcement learning setting - I hope there is more research on this. The field of meta reinforcement learning comes to mind, and from my experience with graph learning, I am pretty sure that such approaches are worth trying.

Coming back to the earlier point that hard Olympiad combinatorics is closer to research than the other three subjects, putting effort into this direction should help AI improve on much more generic benchmarks, even if the right unifying framework towards AGI does not seem obvious at the moment. In some sense, combinatorics is the thing to beat.