AI does not (yet) beat humans on new mathematical problems: the results of the “First Proof” test

Humans have beaten artificial intelligence at solving 10 difficult mathematical problems in the “First Proof” project. The project grew out of a question: can AI replace mathematicians? To try to answer it, a group of researchers from several European and US universities created “First Proof”. Its objective is to evaluate the ability of AI to solve complex mathematical problems whose proofs have never been published.

The First Proof team, which had already organized a first, less “official” test in February, submitted 10 new problems in extremely advanced mathematics to four different AI systems. The answers were then assessed by experts in the field, as in the ordinary peer review of scientific research. The results, published on the First Proof website on June 10, show that, so far, AI is still far from the level of the best human researchers. The best-performing system, created at ETH Zurich, managed to prove only 6 of the 10 problems. The worst, from Princeton University, proved only 2.

How the test was structured and what its purpose is

Evaluating the actual mathematical capabilities of AI is not easy. One of the main problems is that models are trained on huge amounts of text, scientific articles and material available online. If a question is too similar to something already present in the training data, the risk is that the system will solve it simply by reproducing a solution it has already seen, and so appear more capable than it is.

To really test these capabilities, the “First Proof” team asked researchers from around the world to submit problems, questions and theorems that they had solved during their research but had not yet published, either in scientific articles or online. From all the proposals, ten were selected, belonging to different branches of mathematics, from geometry to algebra to probability.

The AI systems had to tackle these problems completely autonomously, without human assistance. The solutions were then examined by a group of about thirty mathematicians, following a process similar to the one used in the peer review of scientific articles.

Four different AIs were tested and ETH Zurich won

Four AI systems took part in the challenge. The only large company directly present was OpenAI, with ChatGPT 5.5 Pro. The other three systems were developed by research groups from the University of California, Los Angeles (UCLA), Princeton University in New Jersey, and the Swiss Federal Institute of Technology (ETH) in Zurich.

All three universities developed a so-called “harness”, i.e. an AI system in which one model (e.g. ChatGPT) produces a solution and other models (e.g. Gemini and Claude) check it, criticize it and improve it through a series of successive steps. The winning system from ETH Zurich worked exactly like this: ChatGPT generated a candidate proof, which was then verified and improved with the contribution of Gemini and Claude.
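The generate, check and improve cycle described above can be sketched in a few lines of Python. This is only an illustration, not the code used by any of the teams: the names `run_harness`, `toy_generator` and `toy_reviewer`, the round limit, and the convention that a reviewer returns a list of objections are all assumptions, and the toy functions stand in for real calls to language models.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces (assumptions, not taken from the article):
# a generator maps (problem, feedback) to a draft proof;
# a reviewer maps (problem, draft) to a list of objections (empty = no objections).
Generator = Callable[[str, List[str]], str]
Reviewer = Callable[[str, str], List[str]]


@dataclass
class HarnessResult:
    proof: str       # last draft produced
    rounds: int      # number of generate/review rounds used
    accepted: bool   # True if no reviewer raised an objection


def run_harness(problem: str, generate: Generator,
                reviewers: List[Reviewer], max_rounds: int = 5) -> HarnessResult:
    """Alternate between drafting a proof and collecting reviewer objections."""
    feedback: List[str] = []
    draft = ""
    for round_no in range(1, max_rounds + 1):
        # The generator sees the objections raised in the previous round.
        draft = generate(problem, feedback)
        # Every reviewer checks the same draft; objections are pooled.
        feedback = [obj for review in reviewers for obj in review(problem, draft)]
        if not feedback:
            return HarnessResult(draft, round_no, True)
    return HarnessResult(draft, max_rounds, False)


# Toy stand-ins for the models, so the sketch runs without any API.
def toy_generator(problem: str, feedback: List[str]) -> str:
    if feedback:
        return f"Claim: {problem}. Justification added for: {'; '.join(feedback)}"
    return f"Claim: {problem}."


def toy_reviewer(problem: str, draft: str) -> List[str]:
    return [] if "Justification" in draft else ["step 1 is asserted without proof"]


result = run_harness("n + 0 = n", toy_generator, [toy_reviewer])
print(result.accepted, result.rounds)
```

In a loop of this kind every extra round means more calls to the generator and to each reviewer, which is one plausible reason why a harness costs more to run than a single model used alone.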

As we said, the ETH Zurich harness was the best, correctly solving and proving 6 of the 10 problems. In second place was the UCLA team, followed by ChatGPT 5.5 Pro. Last was the Princeton system, which, with a harness based mainly on Gemini 3.1 Pro, managed to prove only two.

All this, however, came at a cost. The systems created by the three universities, precisely because of their continuous verification mechanisms, were extremely expensive. The ETH system spent as much as $950 on an attempt to solve a single problem, without success. For comparison, ChatGPT 5.5 Pro used alone required only $144 to tackle the entire set of ten problems.

One of the 10 “First Proof” test problems. This geometry question was not solved by any of the AI systems, and the proof attempts cost more than $1,100 in total.

Why AI still can’t replace mathematicians

After the competition, the First Proof team tried to understand why some of the proposed problems remained unsolved. The main conclusion is that these questions required ideas or strategies very different from those present in the existing mathematical literature. According to Johannes Schmitt, a member of the ETH Zurich team, AI often lacked “a critical and unexpected idea”, that creative step necessary to complete the reasoning and arrive at the proof.

Another limitation that emerged concerns the way in which the models construct proofs. The AIs tended to develop with great precision the more procedural and mechanical parts, those that humans consider most boring, but in the more conceptually complex steps they tended to be much less rigorous. In some cases they took for granted results that would themselves require proof, without providing any justification. In others they cited articles that did not actually contain the result mentioned.

This does not mean that artificial intelligence is useless for mathematical research. On the contrary, the test shows that it can already be a useful tool for checking individual steps and for supporting mathematicians as they work on proofs. The First Proof team will begin preparing a new edition of the test, scheduled for October 2026. According to the organizers, the next editions could help clarify in which contexts artificial intelligence can become truly useful to mathematicians, from verifying proofs to finding new strategies for problems that are still open.