
Benchmarks and statistics lie
Duh! Who would have thought?
AI model makers love to flex their benchmark scores. But how trustworthy are these numbers? What if the tests themselves are rigged, biased, or just plain meaningless? OpenAI's o3 debuted with claims that, having been trained on a publicly available ARC-AGI dataset, the LLM scored a "breakthrough 75.7 percent" on ARC-AGI's …
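For anyone wondering how you'd even check for that kind of leakage: below is a minimal sketch, in Python, of the n-gram overlap heuristic commonly used to flag benchmark contamination. Everything here is illustrative; the function names, threshold, and toy data are my own, not any lab's actual pipeline.

    # Minimal sketch of an n-gram overlap contamination check.
    # All names and data are hypothetical; real pipelines hash and shard
    # the corpus rather than holding every n-gram of it in memory.

    def ngrams(text, n=8):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def flag_contaminated(benchmark_items, training_docs, n=8):
        train = set()
        for doc in training_docs:
            train |= ngrams(doc, n)
        # An item is suspect if any of its n-grams appears verbatim in training.
        return [item for item in benchmark_items if ngrams(item, n) & train]

    # Toy usage: the second "benchmark" item is copied straight from training.
    corpus = ["the quick brown fox jumps over the lazy dog every single day"]
    items = ["an original puzzle nobody has seen before in any corpus at all",
             "the quick brown fox jumps over the lazy dog every single day"]
    print(flag_contaminated(items, corpus))  # prints only the leaked item

The point of the sketch: verbatim overlap is cheap to detect, which is exactly why "trained on the public set" disclosures matter more than the raw score.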
The absolute pinnacle would be an AI seeing, by chance, the Times crossword on a researcher's desk, and saying "I'd like to have a crack at that, so I'll ask the owner to photocopy it for me so I don't spoil his chance to enjoy what he paid for".
GJC
They'll have to get him out of his Brussels apartment first! ;) (superb movie!)
My favourites are the screen ones that you would have to have the visual acuity of a raptor to notice. Pay an extra £200 for a 'better' screen, even though it is a physical impossibility for you as a mere human to see any difference.
They are like those trade shows where everyone pays to attend, everyone gets an award to put on their website, and everyone goes home happy.
"AI benchmarking organization criticized for waiting to disclose funding from OpenAI" -[tech crunch]. If you remember that "frontier math" which included efforts from leading mathematicians. Initially, no AI exceeded a few pct, but then suddenly OpenAI o3 scored 25 % this year.
I think that with any newly developed tech, be it fire, steam and internal combustion engines, electricity, motorized vehicles, factory assembly lines, the printing press and news media, and others, there has historically been some need to introduce regulations to ensure safe usage, especially as adoption broadens to industrial scales. Folks who wanted to develop and sell the tech may have bitched and moaned about it at the time, but we all benefit from those technologies today with these safeguards in place, imho.
With respect to media-oriented rules, those related to obscenity, say the Roth Standard and Miller Test, are particularly interesting because of their subjective nature (no quantitative benchmark involved). Clearly, AI has to be evaluated using such tests to determine whether it is intelligent or not. And, accordingly, as the subjective majority vote of ElReg's most informed of commentard juries on AI's interpretive dance is that it doesn't represent intelligence, one may safely conclude that it indeed isn't. Case closed on that aspect of things.
The question of objectionable use and outputs may also be subjective, with different standards in different cultures, making "AI" developed in authoritarian or theological regimes rather useless in the Free World, and vice-versa. To me, the EU AI Act's bans on untargeted scraping, biometric categorization, and emotion recognition are very sensible, but I guess the opposite might be true in totalitarian dictatorships ... The standards we apply here are ultimately a reflection of our values (or lack thereof, as in E pruritus ani's DOGE vs E pluribus unum's liberty, equality, and broad kinship -- sponsored by Preparation HuBo).
Still, as some endeavor to massively disseminate the offal tech across human occupations, so as to profit from it, there remains a need to assess its safety and surround it with appropriate safeguards, if only to uphold public and occupational health and safety. The potential physical harm that agentic-AI-infused humanoid and canis-lupusoid robots could inflict on the human species must surely be prevented. But so must the trick-cycling aftermaths of interactions with such psychotropic implements, designed as they are to automate the persuasive sophistry of a pusher. Can't have the somatic without the psychotic, the Yin without the Yang, or the Apply without the Eval, for wholesomeness (in a Turing-complete kind of way), imo!
Well, anyways, that's my 2-cents soap-box lecture on this for today ...
<...."Clearly, AI has to to be evaluated using such tests to determine whether it is intelligent or not".....>
That's an easy one to answer at present: what we currently have masquerading under the banner of 'AI' is definitely not artificial intelligence; all flavours are just some sort of glorified search or data-analysis tool coupled with a glorified word juggler.
The only intelligence involved is the natural intelligence of the people who have constructed them.
The current crop of programs have no intelligence themselves. None whatsoever. So the answer to that question is a very easily determined, resounding 'No, it isn't'.
There is, in general, only one problem with benchmarks: they're published by the same people who want to sell you something.
The only reliable benchmark is the one made by someone who has no skin in the game, who is entirely devoted to actual results, and who doesn't hold any shares in the company.
That is a rare gem, these days.
Someone is paying billions of dollars for all this machine training, and "some other guys" are charging billions of dollars to perform the machine training. I reckon Nvidia gets some of that, and the power company gets some. But who else, and for what?
For example: who is "Anna," the creator-operator-mastermind behind Anna's Archive, the outlaw library that is suddenly intent on collecting every single book ever written and making them available over BitTorrent?
What would be the motivation for such an enterprise?
int main(enter the void)
...