news

Google's AI missed the IMO gold medal by just one point! It solved one problem in 19 seconds, outpacing the human contestants. The leap forward in geometry AI is stunning.

2024-07-26



New Intelligence Report

Editor: Editorial Department

【New Wisdom Introduction】Google DeepMind's latest math AI has just reached silver-medal level at the International Mathematical Olympiad (IMO)! It not only solved 4 of the 6 problems with full marks, just 1 point short of gold, but also answered Problem 4 in only 19 seconds. The quality and speed of its solutions astonished the human judges who graded them.

AI has won a silver medal at the IMO!

Just now, Google DeepMind announced that its own AI systems had solved the real problems from this year's International Mathematical Olympiad.

The AI completed 4 of the 6 problems and received full marks on each, for a total of 28 points, the top of the silver-medal range.

This result is only 1 point away from the gold medal!


Among the 609 contestants, only 58 won gold medals.

In the official competition, human contestants submit their answers in two sittings, each with a 4.5-hour time limit.

Interestingly, the AI answered one of the problems within minutes, but needed a full three days for the others, far beyond the official time limit.


The two AI systems - AlphaProof and AlphaGeometry 2 - made great contributions this time.

Key point: the 2024 IMO problems are not in the training data of either AI.

Scott Wu (winner of three IOI gold medals), one of the founders behind the AI software engineer Devin, marveled: "When I was a kid, Olympiad competitions were everything to me. I never imagined that just 10 years later they would be solved by AI."


This year's IMO had six problems in total, covering algebra, combinatorics, geometry, and number theory. The AI solved four of the six, which gives a sense of its level.



AI's mathematical reasoning ability shocked the grading professor

It is well known that earlier AI systems struggled with mathematical problems because of limits in reasoning ability and training data.

The two AI systems unveiled today break through that limitation. They are:

- AlphaProof, a new system for formal mathematical reasoning based on reinforcement learning

- AlphaGeometry 2, the second generation geometry problem solving system

The answers produced by the two AIs were graded under competition rules by renowned mathematicians Professor Timothy Gowers (IMO gold medalist and Fields Medalist) and Dr. Joseph Myers (two-time IMO gold medalist and Chair of the IMO 2024 Problem Selection Committee).

In the end, AlphaProof correctly solved two algebra problems and one number theory problem, including the hardest problem of the competition, which only five human contestants solved this year; AlphaGeometry 2 solved the geometry problem.

The only problems left unsolved were the two combinatorics problems.

Professor Timothy Gowers was deeply impressed while grading:

"It is very impressive that the program was able to come up with such a non-obvious solution, one that far exceeded my expectations given the current state of the art."


AlphaProof

AlphaProof is a system that can prove mathematical statements in the formal language Lean.

It combines a large pre-trained language model with the AlphaZero reinforcement learning algorithm, which taught itself to play chess, shogi, and Go.

A key advantage of formal languages is that proofs involving mathematical reasoning can be formally verified for correctness. However, their use in machine learning has been limited by the very small amount of human-written formal data available.

In contrast, natural language-based approaches, despite having access to large amounts of data, can produce plausible but incorrect intermediate reasoning steps and solutions.

To overcome this, Google DeepMind researchers fine-tuned a Gemini model to automatically translate natural-language problem statements into formal statements, creating a large library of formal problems of varying difficulty and building a bridge between the two complementary domains.

When solving a problem, AlphaProof generates candidate solutions and proves or disproves them by searching for possible proof steps in Lean.
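
To make this concrete, here is a toy illustration of what a formalized statement and a machine-checkable proof look like in Lean 4. This is a minimal example of my own, not one of DeepMind's problems or AlphaProof's output; `IsEven` and the theorem name are illustrative placeholders.

```lean
-- Toy illustration: a formal statement and a proof that Lean can verify mechanically.
-- "An integer is even if it can be written as k + k."
def IsEven (n : Int) : Prop := ∃ k, n = k + k

-- "The sum of two even integers is even."
theorem isEven_add (a b : Int) (ha : IsEven a) (hb : IsEven b) : IsEven (a + b) :=
  match ha, hb with
  | ⟨k, hk⟩, ⟨m, hm⟩ =>
    -- the witness is k + m; `omega` closes the resulting linear-arithmetic goal
    ⟨k + m, by omega⟩
```

Every such proof either passes Lean's checker or is rejected, which is what makes verified proofs safe to feed back into training.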


Every proof that is found and verified is used to strengthen AlphaProof’s language model, allowing it to solve more difficult problems later.

To train AlphaProof, the researchers had it prove or disprove millions of problems covering a wide range of difficulties and mathematical topics, over the weeks leading up to and during the competition.

The training loop was also applied during the competition itself, reinforcing on proofs of self-generated variations of the contest problems until a complete solution was found.


Infographic of AlphaProof's reinforcement learning training loop: around one million informal math problems are translated into formal mathematical language by a formalizer network; a solver network then searches for proofs or disproofs of these problems, progressively training itself via the AlphaZero algorithm to solve ever harder problems.
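
Read as pseudocode, the loop works roughly like this. This is a minimal sketch of my own, not DeepMind's code; `formalizer`, `solver.search`, and `solver.reinforce` are hypothetical placeholders for the formalizer network, the Lean proof search, and the AlphaZero-style update.

```python
# Hypothetical sketch of an AlphaProof-style training loop (illustrative only).

def training_loop(informal_problems, formalizer, solver, rounds=10):
    """Formalize problems, search for proofs or disproofs, reinforce on verified ones."""
    # Translate informal problems into formal statements (the "formalizer network").
    formal_problems = [formalizer(p) for p in informal_problems]

    for _ in range(rounds):
        verified = []
        for problem in formal_problems:
            proof = solver.search(problem)   # search over candidate proof steps in Lean
            if proof is not None:            # keep only formally verified proofs/disproofs
                verified.append((problem, proof))
        # AlphaZero-style update: reinforce on verified proofs so that harder
        # problems become solvable in later rounds.
        solver.reinforce(verified)
    return solver
```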

AlphaGeometry 2

AlphaGeometry 2, the upgraded version of AlphaGeometry, is a neuro-symbolic hybrid system whose language model is based on Gemini and trained from scratch.

Trained on an order of magnitude more synthetic data than its predecessor, it can solve harder geometry problems, including ones involving movements of objects and equations of angles, ratios, and distances.

In addition, it uses a symbolic engine that is two orders of magnitude faster than its predecessor, and when it encounters a new problem, a novel knowledge-sharing mechanism lets it combine different search trees to tackle more complex problems.

Before this year's IMO, AlphaGeometry 2 already had a strong track record: it could solve 83% of all IMO geometry problems from the past 25 years, compared with 53% for the first generation.

In this IMO, AlphaGeometry 2 shocked everyone with its amazing speed—it solved Problem 4 within 19 seconds of receiving the formalized problem!


Question 4 requires proof that the sum of ∠KIL and ∠XPY equals 180°. AlphaGeometry 2 suggests constructing a point E on line BI such that ∠AEB = 90°. Point E helps determine the midpoint L of AB, forming many similar triangle pairs such as ABE ~ YBI and ALE ~ IPC, thus proving the conclusion

AI’s problem-solving process

It is worth mentioning that the problems were first translated manually into formal mathematical language before being given to the AI.

P1

Generally speaking, the first question (P1) in each IMO exam is relatively easy.

Netizens said, "P1 only requires high school math knowledge, and human players can usually complete it within 60 minutes."


The first problem of IMO 2024 concerns a real number α: it asks for all real numbers α such that, for every positive integer n, the sum ⌊α⌋ + ⌊2α⌋ + ⋯ + ⌊nα⌋ is a multiple of n.


The AI gave the correct answer: α must be an even integer. So how did it solve it?


As the first step of its solution, the AI stated a theorem asserting that the sets on the left and right sides are equal.

The set on the left contains all real numbers α such that, for every positive integer n, n divides the sum ⌊α⌋ + ⌊2α⌋ + ⋯ + ⌊nα⌋; the set on the right contains the real numbers α that equal some even integer k.


The following proof is divided into two directions.

First, prove that the set on the right is a subset of the set on the left (simple direction).


Then, prove that the set on the left is a subset of the set on the right (difficult direction).
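
As a rough illustration of what this two-direction set equality can look like in Lean, here is a simplified, hypothetical rendering (using Mathlib). It is not DeepMind's actual formal statement, and both directions are left as `sorry` placeholders; the machine-found proof is at the link below.

```lean
import Mathlib

-- Hypothetical, simplified rendering of the P1 set equality (not DeepMind's exact statement):
-- the α satisfying the divisibility condition are exactly the even integers.
theorem p1_sets_equal_sketch :
    {α : ℝ | ∀ n : ℕ, 0 < n → (n : ℤ) ∣ ∑ i ∈ Finset.Icc 1 n, ⌊(i : ℝ) * α⌋} =
    {α : ℝ | ∃ k : ℤ, Even k ∧ α = (k : ℝ)} := by
  apply Set.Subset.antisymm
  · sorry -- difficult direction: the divisibility condition forces α to be an even integer
  · sorry -- simple direction: every even integer satisfies the condition
```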


Toward the end of the proof, the AI derives a key equation, ⌊(n+1)*α⌋ = ⌊α⌋ + 2n(l - ⌊α⌋), and uses it to show that α must be an even integer.


Finally, DeepMind listed the three axioms the proof relies on: propext, Classical.choice, and Quot.sound.

The following is the complete solution process of P1: https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/imo-2024-solutions/P1/index.html


P2

The second problem concerns pairs of positive integers (a, b) and the greatest common divisor: determine all pairs (a, b) of positive integers for which there exist positive integers g and N such that gcd(aⁿ + b, bⁿ + a) = g for every integer n ≥ N.


The answer AI found is:


The theorem states that the set of positive-integer pairs (a, b) satisfying the given condition contains only (1, 1).

In the solution that follows, the AI's strategy is to first prove that (1, 1) satisfies the given condition, and then prove that it is the only solution.

It first proves that (1, 1) is a solution, using g = 2 and N = 3.
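
This easy half can be sanity-checked in a few lines of Lean (my own illustration, not part of DeepMind's solution): with a = b = 1, both aⁿ + b and bⁿ + a equal 2, so the gcd is 2 for every n.

```lean
import Mathlib

-- Illustration only: for (a, b) = (1, 1), gcd(aⁿ + b, bⁿ + a) = gcd 2 2 = 2 for every n.
example (n : ℕ) : Nat.gcd (1 ^ n + 1) (1 ^ n + 1) = 2 := by
  simp [Nat.gcd_self]
```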


It then proves that if (a, b) is a solution, then ab + 1 must divide g.


In this step, the AI reasons with Euler's theorem and properties of modular arithmetic.


Finally, it proves that a = b = 1 is the only possible solution.

Here is the complete solution process of P2: https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/imo-2024-solutions/P2/index.html


P4

P4 is a geometry problem that asks for a proof of a specific angle relationship.


As mentioned above, this was accomplished by AlphaGeometry 2 in 19 seconds, setting a new record.

As with the first generation of AlphaGeometry, the auxiliary points in the solution are generated automatically by the language model.

In the proof, all angle tracking uses Gaussian elimination, and d(AB)−d(CD) is equal to the directed angle from AB to CD (modulo π).

In the solution, pairs of similar and congruent triangles are highlighted (marked in red).

Next come AlphaGeometry 2's problem-solving steps, which proceed by proof by contradiction.

First, the proposition to be proved is formalized in Lean and the geometric construction is visualized.


The key steps in the proof are as follows.


The complete solution process is shown in the figure below: https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/imo-2024-solutions/P4/index.html


P6

IMO Problem 6 is the "final boss": it concerns a class of functions on the rational numbers and asks for the smallest integer c such that, for every such function f, the expression f(r) + f(-r) takes at most c distinct values.


The AI found the answer: c = 2.


First, look at the theorem statement. It defines the "aquaesulian" property: a function f on the rationals is aquaesulian if, for all rationals x and y, f(x + f(y)) = f(x) + y or f(f(x) + y) = x + f(y). The theorem then states that for every such function, the set of values of f(r) + f(-r) has at most 2 elements.


The proof strategy is first to show that, for any aquaesulian function, the set of values of f(r) + f(-r) has at most 2 elements, and then to construct a specific aquaesulian function for which f(r) + f(-r) takes exactly 2 different values.


It proves that when f(0) = 0, f(x) + f(-x) can take at most two different values, and that there cannot be an aquaesulian function with f(0) ≠ 0.


It then constructs the function f(x) = -x + 2⌈x⌉ and proves that it is aquaesulian.


Finally, it proves that for this function, f(-1) + f(1) = 0 and f(1/2) + f(-1/2) = 2 are two different values.
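
For concreteness, here is a hedged Lean sketch of these pieces (my own rendering, not DeepMind's code): the aquaesulian property as defined in the problem, the constructed function, and the two sums quoted above. The two example checks assume Mathlib's norm_num can evaluate ⌈·⌉ on rational literals.

```lean
import Mathlib

-- A function on the rationals is "aquaesulian" if for all x, y:
--   f(x + f(y)) = f(x) + y   or   f(f(x) + y) = x + f(y).
def Aquaesulian (f : ℚ → ℚ) : Prop :=
  ∀ x y : ℚ, f (x + f y) = f x + y ∨ f (f x + y) = x + f y

-- The function constructed in the proof: f(x) = -x + 2⌈x⌉.
-- (Proving `Aquaesulian f` is the nontrivial part and is not shown here.)
def f (x : ℚ) : ℚ := -x + 2 * (⌈x⌉ : ℚ)

-- The two distinct values of f(r) + f(-r): 0 and 2.
example : f 1 + f (-1) = 0 := by norm_num [f]
example : f (1/2) + f (-1/2) = 2 := by norm_num [f]
```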

The following is the complete solution process: https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/imo-2024-solutions/P6/index.html


You can do math Olympiad problems, but can you tell which is bigger, 9.11 or 9.9?

Andrew Gao, a researcher at Stanford University and Sequoia, affirmed the significance of this AI breakthrough:

Crucially, the latest IMO problems were not included in the training set. This matters because it shows the AI can handle completely new, unseen problems.

Moreover, the geometry problems the AI solved have long been considered especially challenging, because they involve spatial properties that demand intuition and spatial imagination.


Jim Fan, a senior scientist at Nvidia, wrote a long post arguing that large models are a mysterious kind of creature.

They can win silver medals in the Mathematical Olympiad, but they also make mistakes on questions like "Which number is bigger, 9.11 or 9.9?"

Not just Gemini: even GPT-4o, Claude-3.5, and Llama-3 cannot answer it correctly 100% of the time.


By training AI models, we are exploring a vast space beyond our own intelligence. In the process, we have discovered a very strange region: an exoplanet that looks like Earth but is riddled with uncanny valleys.

This seems unreasonable, but we can explain it with the training data distribution:

AlphaProof and AlphaGeometry 2 are trained on formal proofs and paired with domain-specific symbolic engines, so to some extent they are better at specialist Olympiad problems, even though they are built on top of a general-purpose LLM. GPT-4o's training set, by contrast, mixes in a huge amount of GitHub code, which likely far outweighs the math data. In software versioning, "v9.11 > v9.9", and this may have badly skewed the learned distribution. Seen this way, the error is understandable to some extent.
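
A one-line check makes the numerical side of this concrete (a Lean check of my own, not from the article's sources): as exact rational numbers, 9.11 is smaller than 9.9, even though a version-style reading of "9.11" vs "9.9" would rank 11 above 9.

```lean
import Mathlib

-- 9.11 = 911/100 and 9.9 = 990/100, so as exact rationals 9.11 < 9.9.
example : (911 / 100 : ℚ) < 990 / 100 := by norm_num
```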

The head of Google's developer team said that models capable of solving hard math and physics problems are a key path to AGI, and that today we have taken another step down that road.


Another netizen said that there was too much information this week.


References:

https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/

https://x.com/DrJimFan/status/1816521330298356181