Meaningless AI / LLM benchmarks
It had been clear for a while that LLM benchmarks are not that great: Everyone Is Judging AI by These Tests. But Experts Say They’re Close to Meaningless
Good that there's more research done in this space.
AI companies regularly tout their models' performance on benchmark tests as a sign of technological and intellectual superiority. But those results, widely used in marketing, may not be meaningful. A study [PDF] from researchers at the Oxford Internet Institute (OII) and several other universities and organizations has found …
.. otherwise the gazillions poured into this will look like the money was just thrown away.
That said, I have my own theory about this.
I think we're simply seeing the latest push by big companies to end the very idea of intellectual property owned by anyone but them. It's always been the case that those who violate laws on such a massive scale tend to get volume discounts. Lose the details of a million people? Oh, have a discount, because it will obviously have far less impact than when a small shop accidentally sticks a letter into the wrong envelope. Steal the IP of everyone without so much as a Thank You? Well, those pesky IP laws are not for peons anyway, but call us if any of those peons take any of yours, we'll be right there.
Let's face it, they're stealing music, design elements, things you have written and generally anything that has been made public, and that's just phase 1. Phase 2 is intrusion, and especially Microsoft is pushing (well, more accurately it's more like ramming) to get its AI inside people's environments to have it rummage around in their IP, and the others are not far behind (Rovo in Atlassian is basically Google). Remember, for the AI to work it has to see *everything*, so you're really looking at a mass indexer brought in on the sly, although MS got there first with Data Loss Prevention (DLP), which adds the handy benefit that the victim user has already done the hard work by identifying upfront what is valuable.
And, as a bonus, they even manage to charge extra for this..
David Gerard has a few issues with the study. Not its conclusions per se, but the fact that the authors entirely skirt around the fact that AI benchmarks are marketing rather than science, as if they're trying their hardest not to ask the question.
Yeah, not to mention how hard it's gonna be for the 42 authors of this paper to all stand together around their NeurIPS 2025 Poster (Thu 4 Dec 11 a.m. PST — 2 p.m. PST) simultaneously ... maybe some sort of benchmark on kissing numbers (eg. AlphaEvolve linked under "blog post") could help pack them with optimal tightness within the available space! ;)
Generally in a poster session, only one or maybe two of the authors stand by their poster to discuss (in my experience, generally the grad students who actually did the grunt work as opposed to the professor or PI on the study).
And if you think 42 authors are bad, try a paper at a high energy physics conference some day. If you got all the authors into a confined space, you'd probably suffer gravitational collapse and a black hole would form.
Hmmmmmm. . . 42 authors.
Maybe they're onto something after all. Such as what do you get when you multiply 6 by 9. . .
Methinks David doth protest too much.
I've read his blog post and also read the actual paper.
With all due respect, I believe Mr Gerard has fallen into a trap of "All AI Bad" and, by association, anyone associated with it is likewise bad, to some extent a trap of his own making.
The paper doesn't say what he claims it to say, to wit that it's basically marketing material from the AI companies rehashed.
It's not.
He may have read the paper but I don't think he actually got beyond the text itself and looked into the supporting material, of which there is quite a lot.
I admittedly didn't read every one of the 445 papers in the metastudy, but I did do a quick skim of a handful of them1, and if you actually look at the papers evaluated, the preponderance are from university research groups attempting to develop their own benchmarks and metrics, independent of the AI companies and not evaluating the AI companies' own (allegedly) "thumb on the scale" performance benchmarks.
Now you may disagree with the results -- that's fine, that's what science is all about2 -- but unless you're of the opinion that hundreds, if not thousands, of individual computer scientists are on the "take" from "Big AI," this isn't "marketing" material, it's honest research.
While I'm at it, I should direct you to yesterday's (11/7/2025) Pivot to AI posting, a guest article by computer scientist and cryptocurrency/AI skeptic Nick Weaver3 and especially the video/podcast interview4 where Weaver discusses the CS discipline of "machine learning" and its actual practical uses (as opposed to the self-fluffing hype of the AI bros).
_____________________
1 If I had one criticism of the paper, it's the fact they didn't provide clickable DOI links to them, which meant that I had to go through the extra step of feeding the titles to a search engine, but that's only a mild annoyance.
2 Preferably by writing your own paper and getting it accepted to a journal or a conference.
3 Pivot to AI: The futile future of the gigawatt datacenter — by Nicholas Weaver
4 YouTube: The futile future of the gigawatt datacenter (Interview with Nick Weaver).
If that.
I had the honor and good fortune of knowing1 the late Dr Fred Brooks, he of The Mythical Man-Month fame.
During one such meeting, Dr Brooks told me, "Any 'science' that has to call itself one, isn't one," or words to that effect.
I'm not about to argue with one of the true giants and pioneers of the field.
__________________
1 Very slightly. I briefly worked in the so-called Research Triangle of North Carolina, we met several times, and we were on a nodding acquaintance level.
What do you think, Philo T Farnsworth, El Reg and monikered El Regers ...... is it too much to ask of practically silent virtually anonymous downvoters to share their reason[s], no matter how strange such things can surely be, for the dislike of a post ‽
Without that justification are the votes unworthy of notice and acceptance/agreement and there is no helpful third party element provided to assist in a possible change of reality for future mutually accommodative viewing.
Some might like to conclude and take cold comfort, as sad as it might be, that such silent downvoters unable or unwilling to coherently explain their dislike in words are a subset of humans with a delivery mindset in a LLM hallucinatory state ...... and as such they can be similarly ignored.
I admit a certain amount of curiosity but, hey, everyone's entitled to an opinion.
I'm willing to admit I'm wrong, or at least admit the possibility, if someone presents a palpable argument to the contrary.
The paper doesn't say what he claims it to say, to wit that it's basically marketing material from the AI companies rehashed.
You seem to have this backwards. He's not saying that the paper says benchmarks are basically marketing material from the AI companies; he's saying that the paper doesn't mention that benchmarks are designed to be marketing material for the AI companies. A substantial proportion of benchmarks are exactly that, and unless the study differentiates those from any more rigorous, scientifically designed benchmarks, it has to be assumed that the results will be affected. It is possible, even likely, that there are some attempts at rigorous benchmarking, but if the study doesn't even consider that at least some of the benchmarks it has collected (and, given marketing budgets and paid studies, probably the most-cited ones) are going to be marketing materials, then it is missing something fundamental about the landscape it is surveying.
It's a little akin to doing a geographical survey of the Himalayas and omitting to mention the presence of mountains.
I don't think I have it backwards.
The paper "doesn't mention that benchmarks are designed to be marketing material for the AI companies" because the papers analyzed were not "marketing material for the AI companies" -- at least the ones I looked at.
Admittedly, I didn't read all of the papers but the ones I did look at appeared to be honest, independent academic studies of the efficacies of LLMs.
I say this with the caveat that I wouldn't trust anything coming from an LLM vendor any farther than I could throw Sam Altman.
Tell you what -- you read the paper, check the contents of the publications, and draw your own conclusions.
If you can then show me where I've gone wrong, I'll be happy to concede the fact.
Personally, until I am proven otherwise, to echo your analogy, it's a little akin to doing a geographical survey of the Himalayas and omitting to mention the presence of molehills.
We've been here before, with some chap called the blade runner.
You look down and see a tortoise, Leon. It's crawling toward you. You know what a turtle is?
You reach down and you flip the tortoise over on its back.
The tortoise lays on its back, its belly baking in the hot sun, beating its legs trying to turn itself over, but it can't. Not without your help.
"Is this to be an empathy test? Capillary dilation of the so-called 'blush response', fluctuation of the pupil, involuntary dilation of the iris." (Tyrell)
"We call it Voight-Kampff for short." (Deckard)
So I'm confused - are you bringing up the V-K because you think the current "benchmarks" should be including empathy as a measure of general intelligence (something a certain wannabe AI and robotics overlord would strongly disagree with)? Or that the output from ChatGPT et al already relies too much on the "feels" of the answers to keep users addicted, therefore the LLMs are acting intelligently, but only in their own interest? Or ... ?
No, much simpler than that !!!
V-K is total Bull and the 'AI' benchmarks are another form of Bull ... primary purpose is to generate a number that can be used in marketing without giving any meaningful definition of the value of the number !!!
People don't have any intuitive definition of what is a 'Good' or 'Bad' 'AI' ... so the marketing drones have to come up with some 'number' that you can use to decide that ChatGPT is a '7.2' vs Gemini is a '7.9'.
All total nonsense as the way you derive the 'number' is as opaque as the way the 'AI' works.
The scam goes on and on and on ...
:)
> and so ...
Whoa, whoa, calm down there.
V-K is, and always has been, a deliberate work of fiction (which, btw, included commentary on comparisons with replicants and psychopaths, their places in society etc etc).
All I intended was to query the original poster's conflation of any attempts at benchmarking LLMs' claims to "intelligence" qua reasoning with a madey-uppy test for anything *other* than reasoning intelligence.
Instead, it seems to have veered off topic and off gentlemanly speech.
Yeah, they do cite (among others, their ref [5]) Princeton's Embers of Autoregression extrinsic teleological analysis of LLMs, rooted in the notion that they "were trained to solve: next-word prediction over Internet text" (a position many a kommentard here also assumes).
There (eg. their Fig. 2), they show that changing a minor element of a benchmark question (aside even from puzzles and red herrings) can totally break model perf, at least in the non-multiple-choice context.
Essentially, if the answer isn't entirely predictable from a model's lossy weights database and stochastic recall process as the "expected answer", the tool face plants hilariously (or dramatically ...).
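The frequency effect behind that fragility can be illustrated with a toy sketch (my own illustration, not from the paper): two decoding tasks that are algorithmically identical, yet one (rot13) is common in Internet text while the other (a shift-2 cipher) is rare, which is exactly the kind of minor variation the paper reports breaking model performance.

```python
# Toy illustration of the "Embers of Autoregression" point: a shift-13
# (rot13) cipher appears all over the Internet, a shift-2 cipher barely
# at all -- yet decoding either one is literally the same function
# called with a different argument.
import string

def caesar_shift(text: str, shift: int) -> str:
    """Shift each lowercase letter by `shift` positions, wrapping around."""
    lower = string.ascii_lowercase
    table = str.maketrans(lower, lower[shift:] + lower[:shift])
    return text.lower().translate(table)

plain = "benchmarks are marketing"
common_variant = caesar_shift(plain, 13)  # rot13: frequent in training data
rare_variant = caesar_shift(plain, 2)     # shift-2: same difficulty, rare online

# Decoding is just shifting by (26 - n) -- identical work in both cases ...
assert caesar_shift(common_variant, 13) == plain
assert caesar_shift(rare_variant, 24) == plain
# ... but a next-word predictor reportedly handles the frequent variant
# far better, because the answer is more "predictable" from its weights.
```

The point of the sketch is that a benchmark built only from the frequent variant would measure recall of Internet text, not the underlying computation.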
It is stunningly disemboweling that these software tools conceptualize intelligence in such a lobotomized way, imho.
Is there also a failure of international regulation to ensure that video game benchmarks are developed with sound scientific methods and credible testing standards? Or CPU benchmarks? Or crunchy bran cereal benchmarks?
You can make a claim that there are international (and national) interests at stake in the current AI scene, but at this stage, despite all the noise, those interests are all around shouting who has the most GPUs and measuring compute in megaWatts[1], who will survive when the bubble bursts - which is all down to financial management and regulations about basic fraud when it goes pop.
When it can be demonstrated that these people are selling actually long-term useful and usable goods, motor cars instead of tulips, we can talk seriously about what role international standards can play (hint: ease of trade) and then regulation of those standards (are we talking about using them to catch fraudsters or are we talking about public safety? How long the paint finish will last in sunlight or whether that seatbelt will snap?). At that stage, you can point the finger at failure to choose the ones that are based upon sound research.
[1] not even FLOPs per MW, as we've stopped using good, decent FP; come on, it is absurd to measure compute in Watts if you are interested in anything about the actual capability of the data centre, what value it can provide anyone.
You will have to have another look,
> AGI – vaguely defined by OpenAI as "AI systems that are generally smarter than humans"
That "defined" in the article is a link to a 2023 OpenAI page. They used to think of AGI like that, now they define AGI as
"highly autonomous systems that outperform humans at most economically valuable work"