Mathematicians interact with AI, July 2025 update

This is a guest post from Aravind Asok1. If you have comments about this, you can contact him at asok@usc.edu. We’ll see if there’s some way to later post moderated comments here.

Recently, several symposia have been organized in which groups of mathematicians interacted with developers of various AI systems (specifically, reasoning models) in a structured way. We have in mind the Frontier Math Symposium hosted by Epoch AI and the Deepmind/IAS workshop. The first of these events received more press coverage than the second, spawning several articles, including pieces in Scientific American and the Financial Times, though both are currently behind a paywall.2 Curiously absent from these discussions is any considered opinion from mathematicians regarding these interactions, though hyperbolic quotes from these pieces have made the rounds on social media. Neither event was open to the public: participation was limited and by invitation, and in both cases the goal was to foster transparent and unguarded interactions.

For context, note that many mathematicians have spent time interacting with reasoning models (OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude, among others). While mathematicians were certainly not exempt from the wave of early prompt-based experimentation with the initial public models of ChatGPT, they have also explored the behavior of reasoning models on professional aspects of mathematics, testing the models on research mathematics, homework problems, example problems for various classes, as well as mathematics competition problems. Anecdotally, reactions run the gamut from dismissal3 to surprise.4 However, a structured group interaction with reasoning models provides a qualitatively different experience than these personal explorations. Since invitation to these events was controlled, their audience was necessarily limited; the Epoch event self-selected for those who expressed specific interest in AI,5 though the IAS/Deepmind event tried to assemble a more random cross-section of mathematicians.

Much press coverage has a breathless feel, e.g., the treatment of Sam Altman’s comments in Fortune.6 It seems fair to say that mathematicians are impressed with the current performance of models and, furthermore, see interesting avenues for augmenting mathematical research using AI tools. However, many mathematicians view the rhetoric that “math can be solved”, extrapolating from progress on competition-style mathematics viewed as a game, as problematic at best and, at worst, as reflecting a fundamental misunderstanding of the goals of research mathematics as a whole.

For concreteness, our discussion here will focus on the Epoch AI-sponsored meeting, which, contrary to some reports, was not “secret” in any dramatic or clandestine sense. The backstory: Epoch AI has been trying to create benchmarks for the performance of various released LLMs7 (i.e., chatbots like OpenAI’s ChatGPT, Anthropic’s Claude, Google Deepmind’s Gemini, etc.).8 Frontier Math is a benchmark designed to evaluate the mathematical capabilities of reasoning models. The benchmark consists of tiered lists of problems: Tier 1 problems amount to “mathematical olympiad” level problems, while Tiers 2 and 3 are “more challenging”, requiring “specialized knowledge at the graduate level.” Frontier Math sought to build a Tier 4 benchmark of “research level” problems.

Building the Tier 4 benchmark necessitated involving research mathematicians. Earlier this year, Epoch reached out to mathematicians through various channels. Initial requests promised some amount of money for delivering a problem of a particular type, but many mathematicians unfamiliar with the source of the communication either dismissed it as not credible or had no interest in the monetary compensation.9 To speed up the collection of Tier 4 problems, Epoch came up with the idea of hosting a symposium. The symposium was advertised on several social media outlets (e.g., Twitter), and various mathematicians were contacted directly by e-mail. Interested participants were sometimes asked to interview with Frontier Math lead mathematician Eliot Glazer and also to produce a prospective problem. Mathematics is a fairly small community, so many of the people who attended already knew others who were attending; moreover, the vast majority of attendees came from California. Participants did sign a non-disclosure agreement, but it was limited to information related to the problems that were to be delivered. Symposium participants also had their travel and lodging covered and were paid a $1500 stipend for their participation.

Participants were given a list of criteria for problem construction; problems must:10

  1. Have a definite, verifiable answer (e.g., a large integer, a symbolic real, or a tuple of such objects) that can be checked computationally.
  2. Resist guesswork: Answers should be “guessproof,” meaning random attempts or trivial brute-force approaches have a negligible chance of success. You should be confident that a person or AI who has found the answer has legitimately reasoned through the underlying mathematics.
  3. Be computationally tractable: The solution of a computationally intensive problem must include scripts demonstrating how to find the answer, starting only from standard knowledge of the field. These scripts must cumulatively run in less than an hour on standard hardware. (A hypothetical sketch of such a verification script appears after this list.)
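
To make criteria 1 and 3 concrete, here is a minimal, purely illustrative sketch of what such a submission might look like, assuming (hypothetically) that the answer is a single large integer: a reference computation derives the answer from scratch, and a submitted answer is checked by exact comparison. This is not an actual Tier 4 problem, nor Epoch’s actual verification harness.

```python
# Purely illustrative sketch (not an actual Tier 4 problem or Epoch's harness):
# a toy "problem" whose answer is a large integer, a reference computation that
# derives it from scratch, and an exact computational check.

def primes_below(n: int) -> list[int]:
    """Sieve of Eratosthenes: all primes strictly below n."""
    sieve = [True] * n
    sieve[0:2] = [False, False]
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = [False] * len(sieve[i * i::i])
    return [i for i, is_prime in enumerate(sieve) if is_prime]


def reference_answer() -> int:
    """Toy answer: the sum of the squares of the primes below 10**5.
    Starts only from 'standard knowledge' and runs in well under an hour."""
    return sum(p * p for p in primes_below(10**5))


def check(submitted: int) -> bool:
    """Criterion 1: verify a submitted answer computationally by exact match.
    A specific large integer also helps with criterion 2 (guessproofing),
    since a random or brute-force guess has negligible chance of matching."""
    return submitted == reference_answer()


if __name__ == "__main__":
    ans = reference_answer()
    print(ans, check(ans))
```

Real Tier 4 problems, of course, require far more specialized mathematics; the point here is only the shape of the deliverable: a problem statement, a verifiable answer, and scripts respecting the compute budget.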

The participants were divided into groups by field (number theory, analysis, algebraic geometry, topology/geometry, and combinatorics) and told to produce suitable problems.

How did participants contextualize this challenge? In mathematics research one frequently does not know in advance the solution to a given problem, nor whether the problem is computationally tractable. In fact, many mathematicians will agree that knowing a problem is soluble can be game-changing.11 Moreover, deciding which problems should be deemed worthy of study can be difficult. As a consequence, by and large, participants did not frame the challenge as one of producing research problems, but rather one of simply producing appropriate problems.

Unsurprisingly, the ability to construct such problems varied from subject to subject. For example, one geometer said that it was quite difficult to construct “interesting” problems subject to the constraints. There are also real questions about the extent to which “ability to resist guesswork” truly measures “mathematical understanding”. Many participants were rather open about this: even if AI managed to solve the problems they created, they did not feel that would constitute “understanding” in any real sense.

While most participants had written and submitted problems before the symposium started, few people had an idea at that point of what would be “easy” or “hard” for a model. Most of the first day was spent seeing how models interacted with these preliminary problems, and the subsequent discussions refined participants’ understanding of the stipulation that problems be resistant to guesswork. Along the way, models did manage to “solve” some of the problems, but that statement deserves qualification and a more detailed understanding of what constitutes a “solution”.

One key feature of reasoning models was the explicit display of “reasoning traces”, showing the models “thinking”. These traces showed models searching the web and identifying related papers, but their ability to do so was sensitive to the formulation of the problem in fascinating ways. For example, in algebraic geometry, formulating a problem in terms of commutative ring theory instead of varieties could elicit different responses from a model, even though it is a cornerstone of human algebraic geometry to be able to pass back and forth between the two points of view with relative ease. In geometry/topology, participants noted that models demonstrated no aptitude for geometric reasoning; for example, models could not create simple pictorial models (knot diagrams were specifically mentioned) for problems and manipulate them.12 In algebraic and enumerative combinatorics, models applied standard methods well (e.g., solving linear recurrences, appealing to binomial identities), but if problems required several steps as well as ingenuity, the models were stymied, even when prompted with relevant literature or correct initial steps.

When a model did output a correct answer, examining the reasoning traces sometimes indicated that this happened because the problem was constructed in such a way that the answer could be obtained by solving a much simpler but related problem. In exam-grading terms, we would probably say such a response was “getting the right answer for the wrong reason” and assign it a failing grade!

Participants were routinely told to aim to craft problems that even putative future reasoning models would find difficult. From that standpoint, it was easy to imagine that a future model might behave in a more human way, demonstrate “understanding” in a human sense, and isolate the missing key ingredient. This created a pervasive fear that if the reasoning traces indicated a model seemed “close” now, then the problem would likely be solvable by future models.13 Participants did observe that if the literature in a particular domain was sufficiently saturated, the models could identify appropriate lemmas and generate relevant mathematics. This was certainly impressive, but one wonders to what extent the natural language output affects the perceived coherence of responses: it is easy for things to “look about right” if one does not read too closely! Eventually, participants did converge on problems that were thought to meet the required bar.

The language models that we worked with were definitely good at keyword search, routinely generating useful lists of references. The models also excelled at natural language text generation and could generate non-trivial code, which made them useful in producing examples. However, press reporting sometimes exaggerated this, suggesting that reasoning models are “faster” or “better” than professional mathematicians. Of course, such statements are very open to interpretation. On the one hand, this could be trivially true: calculators are routinely faster than professional mathematicians at adding numbers. Less trivially, it could mean automating complicated algebraic computations, but even this would be viewed by most mathematicians as far from the core of mathematical discovery.

The participants at the meeting form a rather thin cross-section of mathematicians who have some interest in the interface between AI (broadly construed) and mathematics. The symposium Signal chat became very active after the Scientific American article was posted. Undoubtedly, participants felt there were exciting possible uses of AI for the development of mathematics. There are also real questions about whether or when future “reasoning models” will approach “human-level” competence,14 as well as serious and fascinating philosophical questions about what that even means; this is a direct challenge for the mathematics community. What does it mean to competently do research mathematics? What is valuable or important mathematics?

Finally, there are important practical questions about the impact, e.g., environmental or geopolitical, of computing at this level.15 All these questions deserve attention: barring some as-yet-unseen theoretical roadblock, reasoning models seem likely to continue improving, which only makes the questions more pressing. As things stand, however, particularly when it comes to mathematical reasoning, caution seems warranted in extrapolating the future research proficiency of models.


  1. With the aid of generous input from Ben Antieau, Greta Panova, Kyler Siegel, Ravi Vakil, and Akshay Venkatesh.↩︎
  2. Some discussion of the IAS/Deepmind event is available in Michael Harris’s June 8 Substack post.↩︎
  3. It seems many people have a collection of standard mathematical questions for which reasoning models produce only hallucinatory outputs. Some discussion of the disconnect between stated AI benchmark progress on mathematics as opposed to “real” research as of March 2025 can be found in the article The Disconnect Between AI Benchmarks and Math Research.↩︎
  4. The performance of language models on standard exam questions for undergraduate and graduate classes was a routine source of surprise. However, one expects reasoning models should perform better in areas where literature is dense. People are also routinely impressed by the fact that the models have improved so much over time.↩︎
  5. We would be remiss not to mention that many mathematicians justifiably have concerns about legitimizing corporate technological endeavors. Such worries are especially important since the companies developing reasoning models plausibly view “mathematical progress”, say in terms of the ability of models to solve mathematical problems of various types, as a way to distinguish themselves from one another. The vague statement “our model is good at math” can be propaganda or simply false, depending on context and audience.↩︎
  6. Altman states: “In some sense AIs are like a top competitive programmer in the world now or AIs can get a top score on the world’s hardest math competitions or AIs can do problems that I’d expect an expert PhD in my field to do”. When Altman makes such a statement, given his role, it’s easy to question his intentions. However, one cannot help but interpret his comments differently if there are also mathematicians making statements that can be construed as “AI systems are good at math”. Even vague statements to this effect by mathematicians could be used to minimize legislative targeting, or avoid scrutiny.↩︎
  7. See Epoch AI’s benchmarking dashboard for a more detailed discussion.↩︎
  8. See here for a blogpost giving some context for math benchmarks, a little background on “reasoning models”, as well as a discussion of the computational efforts involved.↩︎
  9. Some discussion of these requests took place at the Joint Mathematics Meetings (JMM), which was held in Seattle in January. According to some comments here, mathematicians expressed skepticism about delivering problems for money, exposing a fundamental disjunction between academia and industry: pure mathematicians aspire to study mathematics to advance understanding, while industry researchers are required to deliver creations that buoy their supporting institutions.↩︎
  10. See https://epoch.ai/frontiermath/tier-4 for further discussion of Tier 4 problems.↩︎
  11. It is also worth pointing out that formalizing the “inherent difficulty of proof discovery”, say in terms of decision problems, led to significant theoretical challenges to previous-generation approaches to artificial intelligence. For a recent revision and extension of this notion of difficulty, see Artificial intelligence and inherent mathematical difficulty by W. Dean and A. Naibo. Once again, it is unclear whether such an approach has any bearing on what mathematicians might view as important for mathematics.↩︎
  12. More broadly, the sense was that forthcoming AI systems would be able to navigate the literature, including obscure corners, and would be quite capable of performing “standard” computations. Mathematicians themselves have created plenty of open-source computational packages (SnapPy, Regina), which are already integrated into Python and hence automatically part of the toolkit to which a model has access. However, models seemingly lacked what mathematicians call geometric intuition.↩︎
  13. The Scientific American article alludes to a relevant phenomenon, “proof by intimidation”. If someone boldly asserts that they have solved a problem in a particular research domain, has sufficient status, and describes an approach involving keywords/techniques that seem to bear on the problem, mathematicians tend to give them the benefit of the doubt. Indeed, mathematicians will frequently believe the problem has been solved without going through the “solution” in detail. It also routinely happens in mathematical practice that such “solutions” break down under additional scrutiny, e.g., because some subtle part of the argument was not sufficiently well explained. This discussion seems relevant to participants’ perception of putative “solutions” provided by the models. Moreover, many of the stories of “solutions to problems” travelled between groups, creating a kind of echo effect.↩︎
  14. Recent MathOverflow posts raise the question “Is this a bad moment for a math career” in the context of news about AI models. The conversation around AI and jobs also extends to current fears: for example, without yet being precise about what we mean by AI, graduate students, especially at some larger public universities, are being tasked with “using AI” to speed up their workflow, frequently leading to despair.↩︎
  15. Environmental impacts of generative AI include increased electricity demand and water consumption, sometimes localized to areas of data center construction. See Michael Harris’s Substack for a recent discussion with some links.↩︎