This is Atlantic Intelligence, a newsletter in which our writers help you wrap your mind around artificial intelligence and a new machine age.
Even by the AI industry’s frenetic standards, 2025 has been dizzying. OpenAI, Anthropic, Google, and xAI have all released major AI models and products, almost invariably touting them as the “best” and “smartest” in the world.
But it is tricky to determine exactly how “intelligent” programs such as GPT-4.5 or Claude 3.7, the latest models from OpenAI and Anthropic, really are. That ambiguity is great for marketing—vague metrics of “intelligence” make for easy claims—but it’s a problem for accurately measuring just how powerful or competent any AI model is compared with all the rest. Still, companies have coalesced around a set of industry-wide benchmark tests of AI-model abilities, and a new high score on these benchmarks is typically what tech companies mean when they say their AI models are the “smartest.”
The problem with these benchmarks, however, is that the chatbots seem to be cheating on them. Over the past two years, a number of studies have suggested that leading AI models from OpenAI, Google, Meta, and other companies “have been trained on the text of popular benchmark tests, tainting the legitimacy of their scores,” Alex Reisner wrote this week. “Think of it like a human student who steals and memorizes a math test, fooling his teacher into thinking he’s learned how to do long division.” This may not be tech companies’ intent—many of these benchmarks, or the questions on them, simply exist on the internet and thus get hoovered into AI models’ training data. (Of the labs Reisner mentioned, only Google DeepMind responded to a request for comment, telling him it takes the issue seriously.) Intentional or not, though, the unreliability of these benchmarks makes separating fact from marketing even more challenging.
Chatbots Are Cheating on Their Benchmark Tests
By Alex Reisner
Generative-AI companies have been selling a narrative of unprecedented, endless progress. Just last week, OpenAI introduced GPT-4.5 as its “largest and best model for chat yet.” Earlier in February, Google called its latest version of Gemini “the world’s best AI model.” And in January, the Chinese company DeepSeek touted its R1 model as being just as powerful as OpenAI’s o1 model—which Sam Altman had called “the smartest model in the world” the previous month.
Yet there is growing evidence that progress is slowing down and that the LLM-powered chatbot may already be near its peak. This is troubling, given that the promise of advancement has become a political issue; massive amounts of land, power, and money have been earmarked to drive the technology forward. How much is it actually improving? How much better can it get? These are important questions, and they’re almost impossible to answer because the tests that measure AI progress are not working. (The Atlantic entered into a corporate partnership with OpenAI in 2024. The editorial division of The Atlantic operates independently from the business division.)