{"id":15039,"date":"2025-07-04T10:46:39","date_gmt":"2025-07-04T14:46:39","guid":{"rendered":"https:\/\/www.math.columbia.edu\/~woit\/wordpress\/?p=15039"},"modified":"2025-07-05T16:58:19","modified_gmt":"2025-07-05T20:58:19","slug":"mathematicians-interact-with-ai-july-2025-update","status":"publish","type":"post","link":"https:\/\/www.math.columbia.edu\/~woit\/wordpress\/?p=15039","title":{"rendered":"Mathematicians interact with AI, July 2025 update"},"content":{"rendered":"<p><em>This is a guest post from Aravind Asok<a id=\"fnref1\" class=\"footnote-ref\" role=\"doc-noteref\" href=\"#fn1\"><sup>1<\/sup><\/a>. If you have comments about this, you can contact him at <a href=\"mailto:asok@usc.edu\">asok@usc.edu<\/a>. We&#8217;ll see if there&#8217;s some way to later post moderated comments here.<\/em><\/p>\n<p>Recently, several symposia have been organized in which groups of mathematicians interacted with developers of various AI systems (specifically, reasoning models) in a structured way. We have in mind the <a href=\"https:\/\/frontiermath-symposium.epoch.ai\/\">Frontier Math Symposium<\/a> hosted by Epoch AI and the Deepmind\/IAS <a href=\"https:\/\/www.ias.edu\/math\/events\/deepmind-mathai-workshop\">workshop<\/a>. The first of these events received more coverage in the press than the second. It spawned several articles including pieces in <a href=\"https:\/\/www.scientificamerican.com\/article\/inside-the-secret-meeting-where-mathematicians-struggled-to-outsmart-ai\/\">Scientific American<\/a> and the <a href=\"https:\/\/www.ft.com\/content\/564403fa-134c-4385-9e57-4cfc53880508\">Financial Times<\/a>, though both articles are currently behind a paywall.<a id=\"fnref2\" class=\"footnote-ref\" role=\"doc-noteref\" href=\"#fn2\"><sup>2<\/sup><\/a> Curiously absent from these discussions is any kind of considered opinion of mathematicians regarding these interactions, though hyperbolic quotes from these pieces have made the rounds on social media. 
Neither of these events was open to the public: participation in both events was limited and by invitation. In both cases the goal was to foster transparent and unguarded interactions.<\/p>\n<p>For context, note that many mathematicians have spent time interacting with reasoning models (OpenAI\u2019s ChatGPT, Google\u2019s Gemini, and Anthropic\u2019s Claude, among others). While mathematicians were certainly not exempt from the wave of early prompt-based experimentation with the initial public models of ChatGPT, they have also explored the behavior of reasoning models on professional aspects of mathematics, testing the models on research mathematics, homework problems, example problems for various classes, as well as mathematics competition problems. Anecdotally, reactions run the gamut from dismissal<a id=\"fnref3\" class=\"footnote-ref\" role=\"doc-noteref\" href=\"#fn3\"><sup>3<\/sup><\/a> to surprise.<a id=\"fnref4\" class=\"footnote-ref\" role=\"doc-noteref\" href=\"#fn4\"><sup>4<\/sup><\/a> However, a structured group interaction with reasoning models provides a qualitatively different experience from these personal explorations. 
Since invitation to these events was controlled, their audience was necessarily limited; the Epoch event self-selected for those who expressed specific interest in AI,<a id=\"fnref5\" class=\"footnote-ref\" role=\"doc-noteref\" href=\"#fn5\"><sup>5<\/sup><\/a> though the IAS\/Deepmind event tried to assemble a more random cross-section of mathematicians.<br \/>\nMuch of the press coverage has a breathless feel; see, for example, the coverage of comments by Sam Altman in <a href=\"https:\/\/fortune.com\/2025\/06\/20\/openai-ceo-sam-altman-ai-phds-entry-level-corporate-job-cuts-what-is-left-gen-z-college-gradautes\/\">Fortune<\/a>.<a id=\"fnref6\" class=\"footnote-ref\" role=\"doc-noteref\" href=\"#fn6\"><sup>6<\/sup><\/a> It seems fair to say that mathematicians are impressed with the current performance of models and, furthermore, see interesting avenues for augmenting mathematical research using AI tools. However, many mathematicians view the rhetoric that \u201cmath can be solved\u201d, which extrapolates from progress on competition-style mathematics treated as a game, as at best problematic and at worst reflecting a fundamental misunderstanding of the goals of research mathematics as a whole.<\/p>\n<p>For concreteness, our discussion here will focus on the Epoch AI-sponsored meeting, which was not \u201csecret\u201d in any dramatic or clandestine sense, contrary to some reports. The backstory: Epoch AI has been trying to create benchmarks for the performance of various released LLMs<a id=\"fnref7\" class=\"footnote-ref\" role=\"doc-noteref\" href=\"#fn7\"><sup>7<\/sup><\/a> (i.e., chatbots like OpenAI\u2019s ChatGPT, Anthropic\u2019s Claude, Google Deepmind\u2019s Gemini, etc.).<a id=\"fnref8\" class=\"footnote-ref\" role=\"doc-noteref\" href=\"#fn8\"><sup>8<\/sup><\/a> Frontier Math is a benchmark designed to evaluate the mathematical capabilities of reasoning models. This benchmark consists of tiered lists of problems. 
Tier 1 problems amount to \u201cmathematical olympiad\u201d level problems, while Tiers 2 and 3 are \u201cmore challenging\u201d, requiring \u201cspecialized knowledge at the graduate level.\u201d Frontier Math sought to build a Tier 4 benchmark of \u201cresearch level\u201d problems.<\/p>\n<p>Building the Tier 4 benchmark necessitated involving research mathematicians. Earlier this year, Epoch reached out to mathematicians through various channels. Initial requests promised some amount of money for delivering a problem of a particular type, but many mathematicians unfamiliar with the source of the communication either dismissed it as not credible or had no interest in the monetary compensation.<a id=\"fnref9\" class=\"footnote-ref\" role=\"doc-noteref\" href=\"#fn9\"><sup>9<\/sup><\/a> To speed up the collection of Tier 4 problems, Epoch came up with the idea of hosting a symposium. The symposium was advertised on several social media outlets (e.g., Twitter), and various mathematicians were contacted directly by e-mail. Interested participants were sometimes asked to interview with Frontier Math lead mathematician Elliot Glazer and also to produce a prospective problem. Mathematics is a fairly small community, so many of the people who attended already knew others who were attending; moreover, the vast majority of attendees came from California. Participants did sign a non-disclosure agreement, but it was limited to information related to the problems that were to be delivered. 
Symposium participants also had their travel and lodging covered, and were paid a $1,500 stipend for their participation.<\/p>\n<p>Participants were given a list of criteria for problem construction; problems must: <a id=\"fnref10\" class=\"footnote-ref\" role=\"doc-noteref\" href=\"#fn10\"><sup>10<\/sup><\/a><\/p>\n<ol>\n<li>Have a definite, verifiable answer (e.g., a large integer, a symbolic real, or a tuple of such objects) that can be checked computationally.<\/li>\n<li>Resist guesswork: Answers should be \u201cguessproof,\u201d meaning random attempts or trivial brute-force approaches have a negligible chance of success. You should be confident that a person or AI who has found the answer has legitimately reasoned through the underlying mathematics.<\/li>\n<li>Be computationally tractable: The solution of a computationally intensive problem must include scripts demonstrating how to find the answer, starting only from standard knowledge of the field. These scripts must cumulatively run in less than an hour on standard hardware.<\/li>\n<\/ol>\n<p>The participants were divided into groups based on field specificity (number theory, analysis, algebraic geometry, topology\/geometry, and combinatorics) and told to produce suitable problems.<\/p>\n<p>How did participants contextualize this challenge? In mathematics research one frequently does not know in advance the solution to a given problem, nor whether the problem is computationally tractable. In fact, many mathematicians will agree that knowing a problem is soluble can be game-changing.<a id=\"fnref11\" class=\"footnote-ref\" role=\"doc-noteref\" href=\"#fn11\"><sup>11<\/sup><\/a> Moreover, deciding which problems should be deemed worthy of study can be difficult. 
As a consequence, by and large, participants did not frame the challenge as one of producing research problems, but rather as one of simply producing appropriate problems.<\/p>\n<p>Unsurprisingly, the ability to construct such problems varied from subject to subject. For example, one geometer said that it was quite difficult to construct \u201cinteresting\u201d problems subject to the constraints. There are also real questions about the extent to which \u201cability to resist guesswork\u201d truly measures \u201cmathematical understanding\u201d. Many participants were rather open about this: <em>even<\/em> if AI managed to solve the problems they created, they did not feel that would constitute \u201cunderstanding\u201d in any real sense.<\/p>\n<p>While most participants had written and submitted problems before the symposium started, few people had an idea at that point of what would be \u201ceasy\u201d or \u201chard\u201d for a model. Most of the first day was spent seeing how models interacted with these preliminary problems, and the subsequent discussions refined participants\u2019 understanding of the stipulation that problems be resistant to guesswork. Along the way, models did manage to \u201csolve\u201d some of the problems, but that statement deserves qualification and a more detailed understanding of what constitutes a \u201csolution\u201d.<\/p>\n<p>One key feature of reasoning models was the explicit display of \u201creasoning traces\u201d, showing the models \u201cthinking\u201d. These traces showed the models searching the web and identifying related papers, but their ability to do so was sensitive to the formulation of the problem in fascinating ways. For example, in algebraic geometry, formulating a problem in terms of commutative ring theory instead of varieties could elicit different responses from a model. However, it is a cornerstone of human algebraic geometry to be able to pass back and forth between the two points of view with relative ease. 
In geometry\/topology, participants noted that models demonstrated no aptitude for geometric reasoning. For example, models could not create simple pictorial models (knot diagrams were specifically mentioned) for problems and manipulate them.<a id=\"fnref12\" class=\"footnote-ref\" role=\"doc-noteref\" href=\"#fn12\"><sup>12<\/sup><\/a> In algebraic and enumerative combinatorics, models applied standard methods well (e.g., solving linear recurrences, appealing to binomial identities), but if problems required several steps as well as ingenuity, the models were stymied, even if they were prompted with relevant literature or correct initial steps.<\/p>\n<p>When a model did output a correct answer, examining the reasoning traces sometimes indicated that this happened because the problem was constructed in such a way that the answer could be obtained by solving a much simpler but related problem. In terms of the exam solution paradigm, we would probably say such a response was \u201cgetting the right answer for the wrong reason\u201d and assign a failing grade to such a solution!<\/p>\n<p>Participants were routinely told to aim to craft problems that even <em>putative future reasoning models<\/em> would find difficult. From that standpoint, it was easy to extrapolate that a future model might behave in a more human way, demonstrate \u201cunderstanding\u201d in a human sense, and isolate the missing key ingredient. This created a pervasive worry: if reasoning traces indicated that a model seemed \u201cclose now\u201d, one should extrapolate that the problem would be solvable by future models.<a id=\"fnref13\" class=\"footnote-ref\" role=\"doc-noteref\" href=\"#fn13\"><sup>13<\/sup><\/a> Participants did observe that if the literature in a particular domain was suitably saturated, the models could identify lemmas that would be appropriate and generate relevant mathematics. 
This was certainly impressive, but one wonders to what extent the natural language output affects perception of the coherence of responses: it is easy for things to \u201clook about right\u201d if one does not read too closely! Eventually, participants did converge on problems that were thought to meet the required bar.<\/p>\n<p>The language models that we worked with were definitely good at keyword search, routinely generating useful lists of references. The models also excelled at natural language text generation and could generate non-trivial code, which made them useful in producing examples. However, press reporting sometimes exaggerated this, suggesting that reasoning models are \u201cfaster\u201d or \u201cbetter\u201d than professional mathematicians. Of course, such statements are very open to interpretation. On one level, this could be trivially true, e.g., calculators are routinely faster than professional mathematicians at adding numbers. Less trivially, it could mean automating complicated algebraic computations, but even this would be viewed by most mathematicians as far from the core essence of mathematical discovery.<\/p>\n<p>The participants at the meeting form a rather thin cross-section of mathematicians who have some interest in the interface between AI (broadly construed) and mathematics. The symposium Signal chat became very active after the Scientific American article was posted. Undoubtedly participants felt there were exciting possible uses of AI for the development of mathematics. There are also real questions about whether or when future \u201creasoning models\u201d will approach \u201chuman-level\u201d competence,<a id=\"fnref14\" class=\"footnote-ref\" role=\"doc-noteref\" href=\"#fn14\"><sup>14<\/sup><\/a> as well as serious and fascinating philosophical questions about what that even means; this is a direct challenge for the mathematics community. What does it mean to competently do research mathematics? 
What is valuable or important mathematics?<\/p>\n<p>Finally, there are important practical questions about the impact, e.g., environmental or geopolitical, of computing at this level.<a id=\"fnref15\" class=\"footnote-ref\" role=\"doc-noteref\" href=\"#fn15\"><sup>15<\/sup><\/a> All these questions deserve attention: barring some additional as-yet-unseen theoretical roadblock, reasoning models seem likely to continue improving, which only underscores their importance. As things stand, however\u2014particularly when it comes to mathematical reasoning\u2014caution seems warranted in extrapolating the future research proficiency of models.<\/p>\n<section id=\"footnotes\" class=\"footnotes footnotes-end-of-document\" role=\"doc-endnotes\">\n<hr \/>\n<ol>\n<li id=\"fn1\">With the aid of generous input from Ben Antieau, Greta Panova, Kyler Siegel, Ravi Vakil, and Akshay Venkatesh.<a class=\"footnote-back\" role=\"doc-backlink\" href=\"#fnref1\">\u21a9\ufe0e<\/a><\/li>\n<li id=\"fn2\">Some discussion of the IAS\/Deepmind event is available on Michael Harris\u2019 <a href=\"https:\/\/siliconreckoner.substack.com\/p\/missed-opportunities\">June 8 substack post<\/a>.<a class=\"footnote-back\" role=\"doc-backlink\" href=\"#fnref2\">\u21a9\ufe0e<\/a><\/li>\n<li id=\"fn3\">It seems many people have a collection of standard mathematical questions for which reasoning models produce only hallucinatory outputs. Some discussion of the disconnect between stated AI benchmark progress on mathematics as opposed to \u201creal\u201d research as of March 2025 can be found in the article <a href=\"https:\/\/sugaku.net\/content\/ai-benchmarks-vs-real-math-research\/\"><em>The Disconnect Between AI Benchmarks and Math Research<\/em><\/a>.<a class=\"footnote-back\" role=\"doc-backlink\" href=\"#fnref3\">\u21a9\ufe0e<\/a><\/li>\n<li id=\"fn4\">The performance of language models on standard exam questions for undergraduate and graduate classes was a routine source of surprise. 
However, one expects reasoning models should perform better in areas where the literature is dense. People are also routinely impressed by the fact that the models have improved so much over time.<a class=\"footnote-back\" role=\"doc-backlink\" href=\"#fnref4\">\u21a9\ufe0e<\/a><\/li>\n<li id=\"fn5\">We would be remiss not to mention that many mathematicians justifiably have concerns about legitimizing corporate technological endeavors. Such worries are especially important since the companies developing reasoning models plausibly view \u201cmathematical progress\u201d, say in terms of the ability of models to solve mathematical problems of various types, as a way to distinguish amongst themselves. The vague statement \u201cour model is good at math\u201d can be propaganda or simply false, depending on context and audience.<a class=\"footnote-back\" role=\"doc-backlink\" href=\"#fnref5\">\u21a9\ufe0e<\/a><\/li>\n<li id=\"fn6\">Altman states: \u201cIn some sense AIs are like a top competitive programmer in the world now or AIs can get a top score on the world\u2019s hardest math competitions or AIs can do problems that I\u2019d expect an expert PhD in my field to do\u201d. When Altman makes such a statement, given his role, it\u2019s easy to question his intentions. However, one cannot help but interpret his comments differently if there are also mathematicians making statements that can be construed as \u201cAI systems are good at math\u201d. 
Even vague statements to this effect by mathematicians could be used to minimize legislative targeting or avoid scrutiny.<a class=\"footnote-back\" role=\"doc-backlink\" href=\"#fnref6\">\u21a9\ufe0e<\/a><\/li>\n<li id=\"fn7\">See <a href=\"https:\/\/epoch.ai\/data\/ai-benchmarking-dashboard\">Epoch AI\u2019s benchmarking dashboard<\/a> for a more detailed discussion.<a class=\"footnote-back\" role=\"doc-backlink\" href=\"#fnref7\">\u21a9\ufe0e<\/a><\/li>\n<li id=\"fn8\">See <a href=\"https:\/\/www.galois.com\/articles\/o3-frontier-math-and-the-future-of-mathematics\">here<\/a> for a blogpost giving some context on math benchmarks, a little background on \u201creasoning models\u201d, and a discussion of the computational efforts involved.<a class=\"footnote-back\" role=\"doc-backlink\" href=\"#fnref8\">\u21a9\ufe0e<\/a><\/li>\n<li id=\"fn9\">Some discussion of these requests took place at the Joint Mathematics Meetings (JMM), which was held in Seattle in January. According to some comments <a href=\"https:\/\/sugaku.net\/content\/understanding-the-cultural-divide-between-mathematics-and-ai\/\">here<\/a>, mathematicians expressed skepticism at delivering problems for money, exposing a fundamental disjunction between academia and industry: pure mathematicians aspire to study mathematics to advance understanding, but industry researchers are required to deliver creations that buoy their supporting institutions.<a class=\"footnote-back\" role=\"doc-backlink\" href=\"#fnref9\">\u21a9\ufe0e<\/a><\/li>\n<li id=\"fn10\">See <a class=\"uri\" href=\"https:\/\/epoch.ai\/frontiermath\/tier-4\">https:\/\/epoch.ai\/frontiermath\/tier-4<\/a> for further discussion of Tier 4 problems.<a class=\"footnote-back\" role=\"doc-backlink\" href=\"#fnref10\">\u21a9\ufe0e<\/a><\/li>\n<li id=\"fn11\">It is also worth pointing out that formalizing \u201cinherent difficulty of proof discovery\u201d, say in terms of decision problems, led to significant theoretical 
challenges to previous-generation approaches to artificial intelligence. For a recent revision and extension of this notion of difficulty, see <a href=\"https:\/\/arxiv.org\/abs\/2408.03345\"><em>Artificial intelligence and inherent mathematical difficulty<\/em><\/a> by W. Dean and A. Naibo. Once again, it is unclear whether such an approach has any bearing on what mathematicians might view as important for mathematics.<a class=\"footnote-back\" role=\"doc-backlink\" href=\"#fnref11\">\u21a9\ufe0e<\/a><\/li>\n<li id=\"fn12\">More broadly, the sense was that forthcoming AI systems would be able to navigate the literature, including obscure corners, and would be quite capable of performing \u201cstandard\u201d computations. Mathematicians themselves have created plenty of open-source computational packages (SnapPy, Regina), which are already integrated into Python and hence automatically part of the toolkit to which a model has access. However, models seemingly lacked what mathematicians call geometric intuition.<a class=\"footnote-back\" role=\"doc-backlink\" href=\"#fnref12\">\u21a9\ufe0e<\/a><\/li>\n<li id=\"fn13\">The Scientific American article alludes to a relevant phenomenon, so-called \u201cproof by intimidation\u201d. If someone boldly asserts that they have solved a problem in a particular research domain, if they have sufficient status, and if the approach they describe includes keywords\/techniques that seem like they should have bearing on the problem, mathematicians tend to give them the benefit of the doubt. Indeed, mathematicians will frequently believe the problem has been solved without going through the \u201csolution\u201d in detail. It also routinely happens in mathematical practice that such \u201csolutions\u201d break down under additional scrutiny, e.g., because some subtle part of the argument was not sufficiently well explained. 
This discussion seems relevant to participants\u2019 perception of putative \u201csolutions\u201d provided by the models. Moreover, many of the stories of \u201csolutions to problems\u201d travelled between groups, creating a kind of echo effect.<a class=\"footnote-back\" role=\"doc-backlink\" href=\"#fnref13\">\u21a9\ufe0e<\/a><\/li>\n<li id=\"fn14\">Recent MathOverflow posts raise the question, in the context of news about AI models: <a href=\"https:\/\/mathoverflow.net\/questions\/486675\/is-this-a-bad-moment-for-a-math-career\">Is this a bad moment for a math career<\/a>? The conversation around AI and jobs also extends to more immediate fears. For example, without being precise yet about what we mean by AI, graduate students, especially at some larger public universities, are being tasked to \u201cuse AI\u201d to speed up their workflow, frequently leading to despair.<a class=\"footnote-back\" role=\"doc-backlink\" href=\"#fnref14\">\u21a9\ufe0e<\/a><\/li>\n<li id=\"fn15\">Environmental impacts of generative AI include <a href=\"https:\/\/news.mit.edu\/2025\/explained-generative-ai-environmental-impact-0117\">increased electricity demand and water consumption<\/a>, sometimes localized around <a href=\"https:\/\/www.nytimes.com\/interactive\/2025\/03\/16\/technology\/ai-data-centers.html\">data center construction<\/a>. See Michael Harris\u2019s <a href=\"https:\/\/siliconreckoner.substack.com\/p\/missed-opportunities\">substack<\/a> for a recent discussion with some links.<a class=\"footnote-back\" role=\"doc-backlink\" href=\"#fnref15\">\u21a9\ufe0e<\/a><\/li>\n<\/ol>\n<\/section>\n","protected":false},"excerpt":{"rendered":"<p>This is a guest post from Aravind Asok1. If you have comments about this, you can contact him at asok@usc.edu. We&#8217;ll see if there&#8217;s some way to later post moderated comments here. 
Recently, several symposia have been organized in which &hellip; <a href=\"https:\/\/www.math.columbia.edu\/~woit\/wordpress\/?p=15039\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-15039","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.math.columbia.edu\/~woit\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/15039","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.math.columbia.edu\/~woit\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.math.columbia.edu\/~woit\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.math.columbia.edu\/~woit\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.math.columbia.edu\/~woit\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=15039"}],"version-history":[{"count":16,"href":"https:\/\/www.math.columbia.edu\/~woit\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/15039\/revisions"}],"predecessor-version":[{"id":15057,"href":"https:\/\/www.math.columbia.edu\/~woit\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/15039\/revisions\/15057"}],"wp:attachment":[{"href":"https:\/\/www.math.columbia.edu\/~woit\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=15039"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"htt
ps:\/\/www.math.columbia.edu\/~woit\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=15039"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.math.columbia.edu\/~woit\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=15039"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}