Theoretical Physics Slop

In recent years I’ve been struggling with depressive thoughts whenever I think about what’s been going on in the field of fundamental theoretical physics research. As an example of what I find depressing, today I learned that the Harvard Physics department has not only a Harvard Swampland Initiative, but also a Gravity, Space-Time, and Particle Physics (GRASP) Initiative, which this week is hosting a conference celebrating 25 years of Randall-Sundrum. Things at my alma mater are very different from what they were during my student years, which lacked “Initiatives” but featured Glashow, Weinberg, Coleman, Witten and many others doing amazing things.

For those too young to remember, Randall-Sundrum refers to warped extra dimension models that were heavily overhyped around the end of the last millennium. These led to ridiculous things like NYT stories about how Physicists Finally Find a Way To Test Superstring Theory, as well as concerns that the LHC was going to destroy the universe by producing black holes. At the time, 25 years ago, hearing this nonsense was really annoying. I had assumed that it was long dead, but no, zombie theoretical physics ideas, it seems, are all the rage, at Harvard and elsewhere.

One consolation of recent years has been that I figured things really couldn’t get much worse. Today, though, I realized that such thoughts were highly naive. A few days ago Steve Hsu announced that Physics Letters B has published an article based on original work by GPT5 (arXiv version here). Jonathan Oppenheim has taken a look and, after a while, realized the paper was nonsense (explained here). He writes:

The rate of progress is astounding. About a year ago, AI couldn’t count how many R’s in strawberry, and now it’s contributing incorrect ideas to published physics papers. It is actually incredibly exciting, to see the pace of development. But for now the uptick in the volume of papers is noticeable, and getting louder, and we’re going to be wading through a lot of slop in the near term. Papers that pass peer review because they look technically correct. Results that look impressive because the formalism is sophisticated. The signal-to-noise ratio in science is going to get a lot worse before it gets better.

The history of the internet is worth remembering: we were promised wisdom and universal access to knowledge, and we got some of that, but we also got conspiracy theories and misinformation at unprecedented scale.

AI will surely do exactly this to science. It will accelerate the best researchers but also amplify the worst tendencies. It will generate insight and bullshit in roughly equal measure.

Welcome to the era of science slop!

Given the sad state of affairs in this field before automated science slop generation came along, I think Oppenheim is being far too optimistic. There is currently no mechanism in this area for recognizing and suppressing bullshit, and there are strong pressures to produce more of it. I hope that I’m wrong, but I fear we’re about to be inundated with a tsunami of slop which will bury the field completely.

Update: AI slop may have already overwhelmed the peer review system, even if no one noticed. Nature has a new article: More than half of researchers now use AI for peer review — often against guidance.


25 Responses to Theoretical Physics Slop

  1. Sabine says:

    I haven’t had time to read Steve’s paper in detail (or Jonathan’s response), but I think calling it nonsense is somewhat harsh, and it’s not what Jon meant to say. I’ve spent a lot of time with GPT Pro in the past months and it has indeed gotten quite good at physics, though it still lacks originality (and has a bad habit of guessing maths rather than actually doing calculations). But either way you look at it, the progress in the past year has been dramatic. It’s gone from basically completely useless, unable to even correctly parse LaTeX, to being mostly correct.

    It is foreseeable that we will go through an AI slop period in the foundations in the coming year or two, but personally I think it’s a good thing, because it’s basically a reductio ad absurdum for the idiotic theory-production machine that has been running for the past few decades. We’re so, so close now to automating the “invention” of dark sectors or modified gravities and cranking out the maths in a ready-to-submit format, and that stuff is going to flood the arXiv and journals in the years to come before, hopefully, we will finally see journals put an end to this and refuse to publish the stuff (which should have happened 20 years ago).

  2. Gary Ruben says:

    Regardless of the veracity of the core idea, Hsu doesn’t acknowledge in the main paper the role the LLM played in generating the research direction. I found it surprising that he explains that role in a separate paper, but apparently carefully avoids it in the main one, mentioning only peripheral uses of AI assistance in the acknowledgments section. I’d have thought that would be a violation of the journal’s rules around acknowledgment.

  3. Peter Woit says:

    Sabine,
    I just don’t see how this is going to lead to journal reviewers, or the field in general, changing and, finally after all these years, turning around and starting to be able to distinguish between what’s worthless and what’s worthwhile. It seems more likely that it’s just going to finish off the field for good.

  4. Peter Woit says:

    Gary Ruben,
    Some people have argued that Hsu should have listed the LLM as a co-author.

    Hsu is now an AI entrepreneur and motivated to sell the story of “AI can now do theoretical physics”. Others out there are motivated to get lots of papers written in their name, so they will do the same thing as him, but without acknowledging the LLM’s role.

  5. Joka says:

    Peter,

    Another development is much worse. I can testify that at least one journal has started using AI to generate “referee” reports, so that it needs to find fewer real referees. This happened to me a few months ago. And it was a large publisher.

    This development – maybe we should call it “referee slop” – explains the high article processing charges that at least this publisher demands. But this is not what an author wants. And this is not good for theoretical physics.

  6. Lars says:

    Calling the GPT5 paper “nonsense” is accurate, not harsh.

    A paper that purports to show one thing but uses criteria for showing something else is nothing if not nonsense. It demonstrates lack of understanding (by both the bot and the human author) of what was written.

    That it might have accurately reproduced some other physics in the process is irrelevant.

  7. S says:

    I concur with Lars. I think that we are going to have to expand our definition of nonsense, because LLMs are producing *types* of nonsense rarely or never before seen.

    In the past, if somebody strung together five pages of correct math or physics on a highly esoteric topic, you could be sure that they were a highly trained expert and basically knew what they were talking about. We’re simply unprepared for an entity that can do all of that and still assert that two plus two equals five at the end of it, or confidently assert the negation of the main theorem.

    The frontier of nonsense is expanding exponentially.

  8. Eleni says:

    Mostly agreeing with Sabine’s comment. I recently got onto an AI alignment project that asked us to prepare PhD-level (or above) exam-style questions, based on recent papers, that GPT-4 wouldn’t answer correctly.
    GPT-4’s abilities were stunning.
    The comprehension, use of background knowledge, deduction and synthesis, multi-step calculations… stunning. Many PhDs almost cried in the team chat, as it was taking an inordinate amount of time to find questions that the model would get wrong.

    Aside from Hsu-style papers: if (old-fashioned) papers are now both written and read using AI, if both teachers and students use AI, and if emails are both written and read using AI, then most probably we’ve been doing some things wrong and our methods are about to change.

    As to initiatives, maybe responsibly trained life-advice coaching AI agents will be a fitting future application.

  9. Kevin Driscoll says:

    I agree with Lars and S, and would double down on the criticism that Sabine mentions but does not press on (originality). The disagreement here, I think, centers on the fact that if you met a human grad student who could do what GPT-5 or Gemini 3 can do, then you would reasonably think that they could be the next greatest physicist of all time. But the abilities of these LLMs to recall, reproduce, match to existing structures, etc. are just not reliable indicators that they can also generate and substantiate new ideas, in the way that they would be for humans.

    I have spent probably over 100 hours now probing these models by trying to get them to make progress on some small questions left open in my thesis, and they all utterly fail, predictably and repeatedly. Their encyclopedic knowledge is a *hindrance* rather than a help. When they encounter a problem that they don’t know the answer to, they basically never try to come up with something new to solve it. Instead they try all 100,000,000+ methods/strategies that they’ve seen in the training data, in some rough order of how similar the new problem is to other problems. And if that doesn’t work, then they basically give up, hallucinate that one of their previous attempts has worked (when it hasn’t), and output a bunch of abstract filibustering, vague reasoning, and other nonsense.

    The result in this PLB paper aligns exactly with my experience. The LLM thrashed about for a long while and then latched on to a wrong conclusion that slipped past all the AI “verifier” checks, because they can’t actually constrain their output to only what follows via strict rules. The fact that human verifiers messed this up isn’t too shocking given the very abstract and sketch-like nature of the paper. None of it is a well-formalized proof; it’s a hybrid of writing down QFT-related symbols and then making arguments in English about them.

  10. Peter Shor says:

    I have noticed that LLMs perform really well on math questions that have a bunch of solutions on the Internet. But if you ask them a question that only has one or two solutions posted, or one without any known solutions, they tend to fail miserably.

    So they seem to be cribbing from existing papers, but they are really, really good at it. But this means the dreams of AI proponents that AI will now solve all our scientific problems are not likely to be realized.

  11. James says:

    Were the Bogdanoff papers similar to LLM output?

  12. Peter Woit says:

    James,
    GPT5 is much better at this than the Bogdanoffs. The scandal with their papers was that it was immediately clear they were nonsense, but some reputable journals published them anyway.

    The Hsu/GPT5 paper is not obviously nonsense. It takes some effort from someone expert in the field to look at the thing closely enough to find the flaws.

    This is going to be the problem with the oncoming tsunami of slop. GPT5 is very well-tuned to produce something indistinguishable from a real paper. If you ask it to produce something already known (or a trivial variant), my understanding is that it can give a legitimate argument and produce a technically correct paper. A lot of low-quality academic work is of this kind, and GPT5 (or descendants on their way soon) should be able to effortlessly produce lots of correct papers indistinguishable from such low quality human work.

    It’s also, though, going to produce vast amounts of stuff like the Hsu paper: things that seem to give something original and of interest, but do so by means of errors and hallucinations of a sort that only an expert can readily pick up. There are going to be lots and lots of such papers, and nowhere near enough experts to find the errors. The optimistic scenario is then Sabine Hossenfelder’s (collapse of the journals/peer review), bringing on a brave new world where somehow we find a way forward.

  13. Attendee says:

    One should not be surprised that an LLM can solve a well-posed “PhD”-level exam problem. The solution to a well-posed problem is language. You follow a series of well-established mathematical operations and produce the result. If you know the syntax and a few tricks for simplification, you can get the result. It is in some sense easier than English text, which follows fewer rules. Producing language that conforms to a syntax is exactly what an LLM is supposed to be good at – and I find that they can do reasonable physics and math with something like 90% accuracy.

    But actual research requires you to understand the core obstacles to solving a problem and then find creative ways around them. Without understanding, there is no strategy. Without strategy, there is no progress. I have never made progress on a research topic without a strategic understanding of what is needed – simply trying a large number of directions is not a good way to do it, since the phase space of possibilities is far too big. The current LLM architecture does not seem suited for tasks of this sort.

    I think LLMs will accelerate scientific progress, similar to Mathematica/Matlab and Google. It is fascinating to think about what would be needed for the next step.

    I agree that LLMs will expose the sorry state of academia, where people simply write pointless papers. It would be a fun experiment to run – pick a popular person, and once they put a paper out, ask the LLM to suggest obvious follow-ups. What fraction of the subsequent human-produced papers would the LLM have written?

  14. Sabine says:

    Lars,

    “A paper that purports to show one thing but uses criteria for showing something else is nothing if not nonsense.”

    There are hundreds if not thousands of published peer-reviewed papers which have the same problem, if on different topics, especially in the foundations of quantum mechanics. They confuse nonlocality with nonlinearity, locality with causality, correlation with causation, determinism with predictability or computability, etc etc. In fact I would go so far as to say that 90% of papers in the foundations of quantum mechanics are people confusing different terms.

    Now you may want to say that, because of this, much of what gets published in quantum foundations is nonsense, period, and I wouldn’t disagree, though personally I think that’s somewhat of an exaggeration. Either way, the question we should be asking here is whether GPT’s productions are any worse than what physicists themselves write. I frankly think that at this point, if you look at the average publication in the field, the answer is “no”.

  15. Waschbaer says:

    Sabine,

    I have found that LLMs work in a very different way from human students, and thus the progress of the two should not be compared naively. LLMs are exceptionally good at doing existing exam questions in math and physics simply because the questions and solutions can be found online. They either have the ability to remember an impressive number of solutions or have access to internet searches without the user’s knowledge.

    LLMs have a very broad spectrum of knowledge and are useful for finding references in the literature. But they lack the ability to make any nontrivial observations themselves.

    With any experience in research or software engineering, one should realize that reviewing other people’s work in these fields is more difficult than writing similar work oneself. LLM-generated articles will be disastrous for the scientific review process.

    It’s true that many human-produced articles are worthless and that many human researchers aren’t much more capable than LLMs. But their damage is only contained because the scientific review process still somehow works. If the review process gets overwhelmed by LLMs, the only thing that will remain is chaos.

  16. Eric says:

    This recent experiment with AI peer review seems relevant. STOC is a premier conference on theoretical computer science, and a similar process seems feasible for physics publications.

    “As an experiment for the STOC 2026 conference, we are offering authors the opportunity to receive pre-submission feedback on their papers generated by an advanced LLM-based tool based on Google’s Gemini model that has been optimized for mathematical rigor. The goal is to provide constructive suggestions to help improve your paper or help find any technical mistakes before the final STOC submission deadline.”

    Responses to the experiment by authors were strongly positive. And I can imagine that subsequent human peer reviewers and readers of accepted papers will appreciate fewer time-wasting bugs. The summary of results (linked below) contains several additional observations.

    Initial announcement: https://acm-stoc.org/stoc2026/stoc2026-LLM_feedback.html

    Summary of results: https://research.google/blog/gemini-provides-automated-feedback-for-theoretical-computer-scientists-at-stoc-2026/

    Example of AI-generated feedback: https://www.cs.cmu.edu/afs/cs/user/dwoodruf/www/stoc/mohammad.pdf

  17. Peter Woit says:

    Eric,
    If an AI tool could check the mathematical rigor of math research papers, that would be very helpful, but I think that’s a ways off (and might require that people write math papers in a different form). For example, checking the Mochizuki papers claiming to prove abc seems far off.

    For physics papers, the problem is that they’re not even trying for mathematical rigor. If you have an AI tool to do peer review and you feed it, for instance, Susskind’s latest
    https://arxiv.org/abs/2512.13650
    whatever it does is not going to be a mathematical rigor check.

    If a peer review tool isn’t something that reliably validates the technical correctness of an argument, but rather an LLM trained on referee reports to produce a plausible-sounding one, it could have a dramatic effect. The Steve Hsus of this world getting LLMs to write their papers could be assured of publication, by telling the LLM to keep feeding the paper to the peer review tool and editing it until it gets past peer review. This would be a way for anyone who wants an arbitrarily large publication list to get one with minimal effort. The problem of the literature being too large, not interesting, and of dubious reliability would get much worse very quickly.

  18. Lars says:

    “Artificial intelligence research has a slop problem, academics say: ‘It’s a mess’”

    https://www.theguardian.com/technology/2025/dec/06/ai-research-papers

  19. Tired says:

    Today’s math arXiv has an AI slop paper, claiming to prove a conjecture on symmetric group representations. It has the usual features – an authoritative authorial voice, proved statements that are just trivialities, claimed results that are not proved, and a strategy of ‘proof’ lifted from its references without accurate citation. It looked very close to a first-draft LLM answer to some typed questions.

    But it did have one unusual feature – it was dedicated to a deceased researcher. If a human wrote the paper, the dedication would only be appropriate coming from a student or former mentee. That wasn’t the case here; I’m guessing the author might even be an undergraduate (the only other record I could find was a rejected AI slop computer science paper).

    So what’s with the dedication? Was that the author’s addition, part of their delusion of accomplishment? Or was that the LLM? I’m not familiar enough with such outputs to tell.

    And do any of your readers know if there is now a mechanism for alerting the arXiv so the paper can be removed? It wasted enough of my time; it shouldn’t waste others’.

  20. Peter Woit says:

    Tired,
    I’m very curious about what the arXiv plan is for dealing with the onslaught of slop. The one thing I see is the policy here
    https://info.arxiv.org/help/moderation/index.html#policy-for-authors-use-of-generative-ai-language-tools
    which explicitly says that you can use AI to write papers, you just can’t make it a co-author, and that
    “text-to-text generative AI … should be reported consistent with subject standards for methodology.”
    From the policy language it’s not at all clear to me that papers like the one you describe violate current policy.

  21. David says:

    @Tired I’ve seen genAI papers in a math.XX category moved to math.GM after people let the arXiv know. These were incredibly obvious cases though, for instance with unprocessed Markdown syntax in the pdf, which is a dead giveaway that a) an LLM made the LaTeX source and b) the author didn’t proof-read the output carefully enough to notice that the mixed-up syntax results in a trash-looking pdf.

  22. Paolo Bertozzini says:

    Dear Peter,

    to be honest, contributions coming from AI-generated papers are probably of a much better and more sophisticated level than the “garbage” that (in certain areas of research, and especially in certain developing countries) has been flooding the scientific literature in recent decades (I can provide explicit examples!). For sure, AI-generated referee reports cannot be much worse than some of the nonsense currently “human-generated” for the same purpose.

    From this point of view, the arrival of a tsunami of AI-generated material of questionable level (when not explicitly wrong) can catalyze positive effects … completely killing the insane “market incentives” currently supporting scientific malpractice in publications.

    I am much more worried by a return to “shamanic attitudes” toward scientific results that can be assessed only by certain “oracle-authorities” that “own” the costly technologies necessary to verify, certify (and manipulate at pleasure) the validity of the statements. This seems qualitatively different from the introduction and usage of computer hardware and software in the recent past, because of the “universal reach” now guaranteed by internet and mobile technology services “rented” to essentially any individual.

    @Tired

    I am not aware of specific regulations regarding the “dedication of scientific papers” (do they really exist?) … I have personally written some papers dedicated to deceased people whom I simply appreciated for their research work and who directly or indirectly inspired some of the developments presented in the manuscript (and I will likely continue with this practice, no matter what the “official” regulations might in the future say about it).

    On the contrary, I am very well aware of the good traditional standard practice (at least in mathematics) of listing authors in alphabetical order … something that is today explicitly violated by official regulations in several countries/institutions (which, for example, require PhD students to be listed as first authors, or classify the relevance of contributions to the research by the order of the authors in publications!).

    Best Regards

  23. Still Tired says:

    The symmetric group paper got withdrawn by the author today, which is good, but the policy questions remain.

    @David, if the paper hadn’t been withdrawn, I don’t think math.GM would have been an appropriate destination. Similar to the GPT5-generated paper that Peter wrote about, the paper looks superficially like a human-generated one, in an uncanny-valley manner. It requires some technical understanding to observe that each of the three supposed contributions it makes is vacuous or wrong: the proved statement on partial orders is both evident and already in the references; the other two claims are just misunderstandings of quoted results and unproved assertions; and the correct content and idea of the strategy was extracted from one of the references, which, unlike this paper, actually made hay with it. In contrast, it is clear that the ‘author’ of this paper does not have a mental picture of the details of the subject they are writing about, despite several paragraphs of written context, citing of relevant literature, and ‘big picture’ descriptions of the outline of what they are writing; all of this impressive-sounding overview is given the lie by the details that follow, strikingly at times; the choice of examples is especially naff. Stochastic parrot fails to reproduce Shakespeare.

    I’m guessing the paper was written as an exercise in ‘vibe coding’, without the prompt-giver understanding the output, and hence without realising they had done nothing.

    As a one-off, such papers are merely irritating. But as the likely harbinger of a deluge of similar papers, the arXiv needs a policy. Weren’t the _obviously nonsense_ papers on the Hodge conjecture removed? These too need to be either completely removed or placed in an AI slop category (and a mechanism implemented that allows the ‘human author’ to appeal without taking an infinite amount of arXiv admin time).

    In that regard, the recent change in the math arXiv endorsement policy, announced Dec 10, will help slow the AI slop; it could be made even more effective by banning authors who are repeat AI slop submitters. But it would be good to address the root concerns.

    Of course, automatically Lean-verified papers (which would be great), and/or AGI (about which, no polite comment), would obviate the need for dealing with these policy issues. Who knows how soon that will be?

  24. Mitchell Porter says:

    In the latest news from the nearby world of vibe mathematics, a former “director of engineering” at DeepMind (I don’t know what kind of engineering), David Budden, is at last count claiming (on X, @davidmbudden) to be en route to proofs of three Millennium Problems (Navier-Stokes, Hodge conjecture, Birch & Swinnerton-Dyer), with a little help from his AI friends. Obviously there are now a lot of people with similar beliefs, but it’s notable that someone who held a senior position at one of the leading AI companies is publicly fooling himself like this.

  25. David says:

    @Still Tired,

    I agree that math.GM is not the best place for such a paper. But since the arXiv cannot unilaterally remove preprints that have no legal issues, reclassifying what are essentially intellectually vacuous papers out of blatantly inappropriate classifications (math.CT, in the case of papers that are in my field) is the next best thing.

    Thank you so much for the pointer about the math section of the arXiv changing its endorsement policies. I had no idea that anyone with an institutional address didn’t need endorsement, and this makes a lot of the random rubbish papers from the past year make sense: they weren’t actually endorsed by a mathematician, but waved in via auto-endorsement.

    @Mitchell

    Oh dear. That is not encouraging at all.

Comments are closed.