Why AI benchmarks suck

AI model makers love to flex their benchmark scores. But how trustworthy are these numbers? What if the tests themselves are rigged, biased, or just plain meaningless? OpenAI's o3 debuted with claims that, having been trained on a publicly available ARC-AGI dataset, the LLM scored a "breakthrough 75.7 percent" on ARC-AGI's …

  1. Anonymous Coward

    Benchmarks and statistics lie

    Duh! Who would have thought?

    1. Chloe Cresswell Silver badge

      Re: Benchmarks and statistics lie

      I stick with the quote from the NT forums on CompuServe back in the day:

      Never trust a benchmark you didn't rig yourself.

    2. Anonymous Coward

      Re: Benchmarks and statistics lie

      8% of statistics are made up on the spot

      1. Johnb89

        Re: Benchmarks and statistics lie

        It's true! 76% of statistics are made up on the spot!

  2. ChrisElvidge Silver badge

    So now we have:

    There are 4 kinds of lies - lies, damned lies, statistics and benchmarks.

    Or 5 - lies, damned lies, statistics, benchmarks and hallucinations.

    1. Mentat74

      There are a lot of different kinds of lies...

      Lies told out of ignorance...

      Lies told out of malice...

      Lies told out of love...

      Lies told out of necessity...

      The difficulty is telling which is which.

    2. Alumoi Silver badge

      First, there were lies.

      Then came damned lies.

      After that, the statisticians got involved.

      Now we have AI and the rest is history.

    3. Helcat Silver badge

      You forgot Politicians: They're a class of lie all of their own.

  3. Paul Herber Silver badge
    Trollface

    Has any consideration been made about whether the AI (or LLM) wants to be benchmarked? How would you feel about being tested in this way without any warning?

  4. Neil Barnes Silver badge
    Terminator

    puzzles...that AI models try to solve as a measure of intelligence

    Surely the test for intelligence is less the ability to solve a puzzle than the desire to.

    I have not yet heard of an AI seeing, by chance, the Times crossword on a researcher's desk, and sitting down to solve it.

    1. jdiebdhidbsusbvwbsidnsoskebid Silver badge

      Re: puzzles...that AI models try to solve as a measure of intelligence

      I'd be equally impressed by AI seeing, by chance, the Times crossword on a researcher's desk, and saying "nah, can't be bothered with that".

      1. Geoff Campbell Silver badge
        Pirate

        Re: puzzles...that AI models try to solve as a measure of intelligence

        The absolute pinnacle would be an AI seeing, by chance, the Times crossword on a researcher's desk, and saying "I'd like to have a crack at that, so I'll ask the owner to photocopy it for me so I don't spoil his chance to enjoy what he paid for".

        GJC

  5. Jou (Mxyzptlk) Silver badge

    Regarding the Volkswagen...

    James Liang went to jail. The ones high up who forced the issue came out untouched. Just a reminder...

    1. Phil O'Sophical Silver badge

      Re: Regarding the Volkswagen...

      VW's fraud affected tax bills; that will always result in more punishment. Maybe we should start to tax AI? 50% seems like a good starting point...

      1. Jou (Mxyzptlk) Silver badge

        Re: Regarding the Volkswagen...

        Oh, what about taxing churches? Aren't they a lottery? Pray and you might go to heaven? The late George Carlin would be so happy!

        1. FuzzyTheBear
          Holmes

          Re: Regarding the Volkswagen...

          All for it until someone can get God in person standing in a court of law with identity papers.

          1. Anonymous Coward

            Re: Regarding the Volkswagen...

            They'll have to get him out of his Brussels Apartment first! ;) (superb movie!)

            1. Anonymous Coward

              Re: Brussels apartment of God

              Brilliant movie that was.

              Karma is a bitch, for everyone.

              The scene with the toast falling buttered side down was so memorable.

  6. Doctor Syntax Silver badge

    "Tests that haven't kept up with the rapidly changing state of the art."

    Rapidly changing state of the art? Is that another way of saying "immature technology"?

  7. Tron Silver badge

    Most tech benchmarks are pointless.

    My favourites are the screen ones that you would have to have the visual acuity of a raptor to notice. Pay an extra £200 for a 'better' screen, even though it is a physical impossibility for you as a mere human to see any difference.

    They are like those trade shows where everyone pays to attend, everyone gets an award to put on their website, and everyone goes home happy.

  8. DJV Silver badge

    Just in!

    Snake oil salesmen declare how much better their Brand X is when compared to rivals' Brand Y and Brand Z!

    1. Anonymous Coward

      Re: Just in!

      Just like laundry detergents then ... use a bit more of the cheaper ones, or less of the more expensive ones, and bingo ... same difference!

  9. O'Reg Inalsin

    Frontier math

    "AI benchmarking organization criticized for waiting to disclose funding from OpenAI" -[tech crunch]. If you remember that "frontier math" which included efforts from leading mathematicians. Initially, no AI exceeded a few pct, but then suddenly OpenAI o3 scored 25 % this year.

  10. HuBo Silver badge
    Headmaster

    Tough call of the wild (or not?)

    I think that with any newly developed tech, be it fire, steam and internal combustion engines, electricity, motorized vehicles, factory assembly lines, printing press and News, and others, there's historically been some need to introduce regulations to ensure safe usage, especially with broadening adoption at industrial scales. Folks who wanted to develop and sell the tech may have bitched and moaned about it at the relevant times, but we all benefit more from those technologies today with these safeguards in place, imho.

    With respect to media-oriented rules, those related to obscenity, say the Roth Standard and Miller Test, are particularly interesting because of their subjective nature (no quantitative benchmark involved). Clearly, AI has to be evaluated using such tests to determine whether it is intelligent or not. And, accordingly, as the subjective majority vote, of ElReg's most informed of commentard juries on AI's interpretive dance, is that it doesn't represent intelligence, one may safely conclude that it indeed isn't. Case closed on that aspect of things.

    The question of objectionable use and outputs may also be subjective, with different standards in different cultures, making "AI" developed in authoritarian or theological regimes rather useless in the Free World, and vice-versa. To me, the EU AI Act's bans on untargeted scraping, biometric categorization, and emotion recognition, are very sensible, but I guess the opposite might be true in totalitarian dictatorships ... The standards we apply in this are eventually a reflection of our values (or lack thereof, as in E pruritus ani's DOGE vs E pluribus unum's liberty, equality, and broad kinship -- sponsored by Preparation HuBo).

    Still, as some endeavor to massively disseminate the offal tech across human occupations, so as to profit from it, there remains a need to assess its safety, and surround it with appropriate safeguards, if only to uphold public and occupational health, and safety. The potential physical harm that agentic-AI-infused humanoid and canis-lupusoid robots will inflict on the human species must surely be prevented. But so must the trick-cycling aftermaths of interactions with such psychotropic implements designed to automate the persuasive sophistry of a pusher. Can't have the somatic without the psychotic, the Yin without the Yang, or the Apply without the Eval, for wholesomeness (in a Turing-complete kind of way), imo!

    Well, anyways, that's my 2-cents soap-box lecture on this for today ...

    1. nobody who matters Silver badge

      Re: Tough call of the wild (or not?)

      <...."Clearly, AI has to to be evaluated using such tests to determine whether it is intelligent or not".....>

      That's an easy one to answer at present - what we have currently masquerading under the banner of 'AI' is definitely not artificial intelligence; all flavours are just some sort of glorified search or data analysis tool coupled with a glorified word juggler.

      The only intelligence involved is the natural intelligence of the people who have constructed them.

      The current crop of programs have no intelligence themselves. None whatsoever. So the answer to that question is a very easily determined, resounding 'No, it isn't'.

      1. Anonymous Coward

        Re: Tough call of the wild (or not?)

        AI is evaluated by AI in today's world ... it's no big deal, it's been happening in politics for years.

  11. Pascal Monett Silver badge

    "They identify nine general problems with benchmarks"

    There is, in general, only one problem with benchmarks: they're published by the same people who want to sell you something.

    The only reliable benchmark is the one made by someone who has no skin in the matter, who is entirely devoted to actual results and who doesn't have any shares in the company.

    That is a rare gem, these days.

  12. Grunchy Silver badge

    Where are the $ billions going?

    Someone is paying billions of dollars for all this machine training, and “some other guys” are charging billions of dollars to perform the machine training. I reckon nVidia gets some of that, and the power company gets some. But who else, and for what?

    For example: who is “Anna,” creator-operator-mastermind behind Anna's Archive, the outlaw library that is suddenly intent on collecting every single book ever written and making them available via BitTorrent?

    What would be the motivation for such an enterprise?

  13. Blackjack Silver badge

    Who has done the 5-shot test in a pub? I can barely stand after 3 shots.
