AI benchmarks are a bad joke – and LLM makers are the ones laughing

AI companies regularly tout their models' performance on benchmark tests as a sign of technological and intellectual superiority. But those results, widely used in marketing, may not be meaningful. A study [PDF] from researchers at the Oxford Internet Institute (OII) and several other universities and organizations has found …

  1. Anonymous Coward
    Anonymous Coward

    Meaningless AI / LLM benchmarks

    It had been clear for a while that LLM benchmarks are not that great: Everyone Is Judging AI by These Tests. But Experts Say They’re Close to Meaningless

    Good that there's more research done in this space.

    1. elsergiovolador Silver badge

      Re: Meaningless AI / LLM benchmarks

      What would you expect? They only spent, like, a few hundred billion on AI. Of course the tests will be crap.

      1. Anonymous Coward
        Anonymous Coward

        Re: Meaningless AI / LLM benchmarks

        .. otherwise the gazillions poured into this will look like the money was just thrown away.

        That said, I have my own theory about this.

        I think we're simply seeing the latest push by big companies to end the very idea of intellectual property owned by anyone but them. It's always been the case that those who violate laws on a massive scale tend to get volume discounts. Lose the details of a million people? Oh, have a discount, because it will obviously have far less impact than when a small shop accidentally sticks a letter into the wrong envelope. Steal the IP of everyone without as much as a Thank You? Well, those pesky IP laws are not for peons anyway, but call us if any of those take any of yours, we'll be right there.

        Let's face it, they're stealing music, design elements, things you have written and generally anything that has been made public, and that's just phase 1. Phase 2 is intrusion, and Microsoft especially is pushing (well, more accurately it's more like ramming) to get its AI inside people's environments to have it rummage around in their IP, and the others are not far behind (Rovo in Atlassian is basically Google). Remember, for the AI to work it has to see *everything*, so you're really looking at a mass indexer brought in on the sly, although MS got there first with Data Loss Prevention (DLP), which adds the handy benefit that the victim user has already done the hard work by identifying upfront what is valuable.

        And, as a bonus, they even manage to charge extra for this..

  2. Sorry that handle is already taken. Silver badge

    Meanwhile

    David Gerard has a few issues with the study. Not its conclusions per se, but the fact that the authors entirely skirt around the fact that AI benchmarks are marketing rather than science, as if they're trying their hardest not to ask the question.

    1. Anonymous Coward
      Anonymous Coward

      Re: Meanwhile

      Yeah, not to mention how hard it's gonna be for the 42 authors of this paper to all stand together around their NeurIPS 2025 Poster (Thu 4 Dec 11 a.m. PST — 2 p.m. PST) simultaneously ... maybe some sort of benchmark on kissing numbers (eg. AlphaEvolve linked under "blog post") could help pack them with optimal tightness within the available space! ;)

      1. Philo T Farnsworth Silver badge

        Re: Meanwhile

        Generally in a poster session, only one or maybe two of the authors stand by their poster to discuss (in my experience, generally the grad students who actually did the grunt work as opposed to the professor or PI on the study).

        And if you think 42 authors are bad, try a paper at a high energy physics conference some day. If you got all the authors into a confined space, you'd probably suffer gravitational collapse and a black hole would form.

        Hmmmmmm. . . 42 authors.

        Maybe they're onto something after all. Such as what do you get when you multiply 6 by 9. . .

    2. Philo T Farnsworth Silver badge

      Re: Meanwhile

      Methinks David doth protest too much.

      I've read his blog post and also read the actual paper.

      With all due respect, I believe Mr Gerard has fallen into the trap of thinking "All AI Bad" and, by association, that anyone associated with it is likewise bad; to some extent a trap of his own making.

      The paper doesn't say what he claims it to say, to wit that it's basically marketing material from the AI companies rehashed.

      It's not.

      He may have read the paper but I don't think he actually got beyond the text itself and looked into the supporting material, of which there is quite a lot.

      I admittedly didn't read every one of the 445 papers in the metastudy, but I did do a quick skim of a handful of them1. If you actually look at the papers evaluated, the preponderance are from university research groups attempting to develop their own benchmarks and metrics, independent of the AI companies, not evaluations of the AI companies' own (allegedly) "thumb on the scale" performance benchmarks.

      Now you may disagree with the results -- that's fine, that's what science is all about2 -- but unless you're of the opinion that hundreds, if not thousands, of individual computer scientists are on the "take" from "Big AI," this isn't "marketing" material, it's honest research.

      While I'm at it, I should direct you to yesterday's (11/7/2025) Pivot to AI posting, a guest article by computer scientist and cryptocurrency/AI skeptic Nick Weaver3 and especially the video/podcast interview4 where Weaver discusses the CS discipline of "machine learning" and its actual practical uses (as opposed to the self-fluffing hype of the AI bros).

      _____________________

      1 If I had one criticism of the paper, it's the fact they didn't provide clickable DOI links to them, which meant that I had to go through the extra step of feeding the titles to a search engine, but that's only a mild annoyance.

      2 Preferably by writing your own paper and getting it accepted to a journal or a conference.

      3 Pivot to AI: The futile future of the gigawatt datacenter — by Nicholas Weaver

      4 YouTube: The futile future of the gigawatt datacenter (Interview with Nick Weaver).

      1. Canary64

        Re: Meanwhile

        Computing is mathematics, and mathematics is not a science.

        1. Philo T Farnsworth Silver badge

          Re: Meanwhile

          If that.

          I had the honor and good fortune of knowing1 the late Dr Fred Brooks, he of The Mythical Man-Month fame.

          During one such meeting, Dr Brooks told me, "Any 'science' that has to call itself one, isn't one," or words to that effect.

          I'm not about to argue with one of the true giants and pioneers of the field.

          __________________

          1 Very slightly. I briefly worked in the so-called Research Triangle of North Carolina, we met several times, and we were on a nodding acquaintance level.

      2. Anonymous Coward
        Anonymous Coward

        Re: Meanwhile

        You can be honest and still produce crap "science". Do not assume honesty means competence.

      3. amanfromMars 1 Silver badge

        Re: Meanwhile, Ab Fab 0therworldly 0pportunities are Regularly Squandered and Needlessly Wasted

        What do you think, Philo T Farnsworth, El Reg and monikered El Regers ...... is it too much to ask of practically silent virtually anonymous downvoters to share their reason[s], no matter how strange such things can surely be, for the dislike of a post ‽

        Without that justification are the votes unworthy of notice and acceptance/agreement and there is no helpful third party element provided to assist in a possible change of reality for future mutually accommodative viewing.

        Some might like to conclude and take cold comfort, as sad as it might be, that such silent downvoters unable or unwilling to coherently explain their dislike in words are a subset of humans with a delivery mindset in a LLM hallucinatory state ...... and as such they can be similarly ignored.

        1. Philo T Farnsworth Silver badge

          Re: Meanwhile, Ab Fab 0therworldly 0pportunities are Regularly Squandered and Needlessly Wasted

          I admit a certain amount of curiosity but, hey, everyone's entitled to an opinion.

          I'm willing to admit I'm wrong, or at least admit the possibility, if someone presents a palpable argument to the contrary.

      4. breakfast Silver badge
        Headmaster

        Re: Meanwhile

        The paper doesn't say what he claims it to say, to wit that it's basically marketing material from the AI companies rehashed.

        You seem to have this backwards. He's not saying that the paper says benchmarks are basically marketing material from the AI companies; he's saying that the paper doesn't mention that benchmarks are designed to be marketing material for the AI companies. A substantial proportion of benchmarks are exactly that, and unless the study differentiates those from any more rigorous and scientifically designed benchmarks, it has to be assumed that the results will be affected. It is possible, even likely, that there are some attempts at rigorous benchmarking, but if the study doesn't even consider that at least some of the benchmarks it has collected (and, given marketing budgets and paid studies, probably the most-cited ones) are going to be marketing materials, then it is missing something fundamental about the landscape it is surveying.

        It's a little akin to doing a geographical survey of the Himalayas and omitting to mention the presence of mountains.

        1. Philo T Farnsworth Silver badge

          Re: Meanwhile

          I don't think I have it backwards.

          The paper "doesn't mention that benchmarks are designed to be marketing material for the AI companies" because the papers analyzed were not "marketing material for the AI companies" -- at least the ones I looked at.

          Admittedly, I didn't read all of the papers but the ones I did look at appeared to be honest, independent academic studies of the efficacies of LLMs.

          I say this with the caveat that I wouldn't trust anything coming from an LLM vendor any farther than I could throw Sam Altman.

          Tell you what -- you read the paper, check the contents of the publications, and draw your own conclusions.

          If you can then show me where I've gone wrong, I'll be happy to concede the fact.

          Personally, until I am proven otherwise, to echo your analogy, it's a little akin to doing a geographical survey of the Himalayas and omitting to mention the presence of molehills.

  3. Dwarf Silver badge

    Turtle

    We've been here before, with some chap called the blade runner.

    You look down and see a tortoise, Leon. It's crawling toward you. You know what a turtle is?

    You reach down and you flip the tortoise over on its back.

    The tortoise lays on its back, its belly baking in the hot sun, beating its legs trying to turn itself over, but it can't. Not without your help.

    1. that one in the corner Silver badge

      Re: Turtle

      "Is this to be an empathy test? Capillary dilation of the so-called 'blush response', fluctuation of the pupil, involuntary dilation of the iris." (Tyrell)

      "We call it Voight-Kampff for short." (Deckard)

      So I'm confused - are you bringing up the V-K because you think the current "benchmarks" should be including empathy as a measure of general intelligence (something a certain wannabe AI and robotics overlord would strongly disagree with)? Or that the output from ChatGPT et al already rely too much on the "feels" of the answers to keep users addicted, therefore the LLMs are acting intelligently, but only in their own interest? Or ... ?

      1. Anonymous Coward
        Anonymous Coward

        Re: Turtles all the way down !!!

        No, much simpler than that !!!

        V-K is total Bull and the 'AI' benchmarks are another form of Bull ... primary purpose is to generate a number that can be used in marketing without giving any meaningful definition of the value of the number !!!

        People don't have any intuitive definition of what is a 'Good' or 'Bad' 'AI' ... so the marketing drones have to come up with some 'number' that you can use to decide that ChatGPT is a '7.2' vs Gemini is a '7.9'.

        All total nonsense as the way you derive the 'number' is as opaque as the way the 'AI' works.

        The scam goes on and on and on ...

        :)

        1. that one in the corner Silver badge

          Re: Turtles all the way down !!!

          > V-K is total Bull

          I - what

          Sir, I am shocked and appalled at your dismissal of the V-K; did you not see the documentaries where it was clearly an effective tool for the detection of replicants!

          1. mirachu Bronze badge

            Re: Turtles all the way down !!!

            Also, psychopaths, other people with blunted psychological affect, and so fucking on. V-K wouldn't be acceptable.

            1. that one in the corner Silver badge

              Re: Turtles all the way down !!!

              > and so ...

              Whoa, whoa, calm down there.

              V-K is, and always has been, a deliberate work of fiction (which, btw, included commentary on comparisons with replicants and psychopaths, their places in society etc etc).

              All I intended was to query the original poster's conflation of any attempts at benchmarking LLMs' claims to "intelligence" qua reasoning with a madey-uppy test for anything *other* than reasoning intelligence.

              Instead, it seems to have veered off topic and off gentlemanly speech.

      2. This post has been deleted by its author

      3. frankvw Silver badge
        Pirate

        Re: Turtle

        Sea turtles, mate. Roped together with hair from my back.

    2. Alistair
      Windows

      Re: Turtle

      I'm sure that the AI cohort will all be suffering from tears in the rain soon.

      Without, of course, having seen things off the shoulder of Orion.

      1. Anonymous Coward
        Anonymous Coward

        Re: Turtle

        C-beams, my backside.

        The only thing that came off the shoulders of Orion was dandruff. Sure, it was backlit by the nebula, but "ships on fire"? Pull the other one!

  4. gnasher729 Silver badge

    This particular problem is rather trivial (solutions in my head are n = 1, 11 and 37). I’d be curious what happens with more difficult problems.

    1. Anonymous Coward
      Anonymous Coward

      Yeah, they do cite (among others, their ref [5]) Princeton's Embers of Autoregression extrinsic teleological analysis of LLMs, rooted in the notion that they "were trained to solve: next-word prediction over Internet text" (a position many a kommentard here also assumes).

      There (eg. their Fig. 2), they show that changing a minor element of a benchmark question (aside even from puzzles and red herrings) can totally break model perf, at least in the non-multiple-choice context.

      Essentially, if the answer isn't entirely predictable from a model's lossy weights database and stochastic recall process as the "expected answer", the tool face plants hilariously (or dramatically ...).

      It is stunningly disemboweling that these software tools conceptualize intelligence in such a lobotomized way, imho.

  5. jcridge

    International regulation

    Is there also a failure of international regulation to ensure that AI benchmarks are developed with sound scientific methods and credible testing standards ?

    1. vekkq

      Re: International regulation

      good luck with those international regulations. there is no one regulating this business.

    2. that one in the corner Silver badge

      Re: International regulation

      Is there also a failure of international regulation to ensure that video game benchmarks are developed with sound scientific methods and credible testing standards? Or CPU benchmarks? Or crunchy bran cereal benchmarks?

      You can make a claim that there are international (and national) interests at stake in the current AI scene, but at this stage, despite all the noise, those interests are all around shouting who has the most GPUs and measuring compute in megaWatts[1], who will survive when the bubble bursts - which is all down to financial management and regulations about basic fraud when it goes pop.

      When it can be demonstrated that these people are selling actually long-term useful and usable goods, motor cars instead of tulips, we can talk seriously about what role international standards can play (hint: ease of trade) and then regulation of those standards (are we talking about using them to catch fraudsters or are we talking about public safety? How long the paint finish will last in sunlight, or whether that seatbelt will snap?). At that stage, you can point the finger at failure to choose the ones that are based upon sound research.

      [1] not even FLOPs per MW, as we've stopped using good, decent FP; come on, it is absurd to measure compute in Watts if you are interested in anything about the actual capability of the data centre, what value it can provide anyone.

      1. adsp42

        measuring compute in megaWatts

        I beg your pardon. You will find that compute is measured in GigaWatts.

  6. JimmyPage Silver badge
    FAIL

    The Turing Test

    remains the gold standard.

    When your shitty LLM can pass *my* Turing test, I will quite happily accord it the status of intelligence.

    Until then it remains, at best, "intelligence" - and most definitely its own version of that.

  7. adsp42

    AGI definition

    You will have to have another look,

    > AGI – vaguely defined by OpenAI as "AI systems that are generally smarter than humans"

    That "defined" in the article is a link to a 2023 OpenAI page. They used to think of AGI like that, now they define AGI as

    "highly autonomous systems that outperform humans at most economically valuable work"
