It's confusing what Prelovac believes, expects, or hopes for
> "While providers of large language models share their own performance benchmarks, these results can be misleading due to overfitting," Prelovac told The Register. "This means the model might be tailored to perform well on specific tests but doesn't always reflect real-world effectiveness."
Yes, very true.
But what does comparing their chess-playing ability do for us?
> "It's somewhat disappointing but expected that these models show no real generalization of intelligence or reasoning," he said. "While they can perform specific tasks well, they don't yet demonstrate a broad, adaptable understanding or problem-solving ability like human intelligence."
Disappointing? Don't yet?
In other words, all this demonstrates is that Prelovac has fallen, hook, line, and sinker, for the totally unfounded claim that the LLM training process *must* produce a reasoning system and, instead of accepting the results of his own experiment[1], is just a bit sad that the models he has tried don't reason.
> As for why GPT-4o registered a remarkable improvement in chess but still made illegal moves, Prelovac speculated that perhaps its multi-modal training had something to do with it. It's possible part of OpenAI's training data included visuals of chess being played, which could help the AI visualize the board easier than it can with pure text.
So now the LLMs are building internal representations, visualisations, of the boards? Where? Can we extract these visualisations and demonstrate they exist? Nope? So is that anything more than anthropomorphisation in action?
And yet, he *does* seem to know what these programs are doing:
> "Even chess moves are nothing but a series of tokens, like 'e' and '4', and have no grounding in reality," Prelovac said. "They are products of statistical analysis of the training data, upon which the next token is predicted."
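And that description is easy to make concrete. A minimal sketch of what "statistical analysis of the training data, upon which the next token is predicted" means for chess moves, using a toy bigram model (the corpus and tokenisation are invented for illustration; real LLMs use learned subword tokenisers and vastly larger models, but the principle is the same):

```python
from collections import Counter, defaultdict

# Toy "training data": a few move transcripts as plain text.
games = [
    "e4 e5 Nf3 Nc6 Bb5",
    "e4 c5 Nf3 d6 d4",
    "d4 d5 c4 e6 Nc3",
]

def tokenize(game):
    # One token per character, mirroring Prelovac's "'e' and '4'".
    return list(game)

# Count which token follows which in the corpus -- pure statistics.
follows = defaultdict(Counter)
for game in games:
    toks = tokenize(game)
    for a, b in zip(toks, toks[1:]):
        follows[a][b] += 1

def predict_next(token):
    # Emit the most frequent successor seen in training.
    # No board, no legality, no chess: just token frequencies.
    return follows[token].most_common(1)[0][0]

print(predict_next("e"))  # -> "4", because 'e' was followed by '4' most often
```

Nothing in that pipeline knows a rook from a rounding error; it only knows what characters tended to follow other characters. Scale it up and you get plausible-looking move sequences, with no mechanism anywhere that enforces the rules.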
In the end, what is the aim of this (other than having fun and publishing it on GitHub for others to play with; nothing wrong with that)? Well, the article starts with:
> A new benchmark for large language models (LLMs)
But what is it supposed to be benchmarking? To what purpose? What are we supposed to understand from a benchmark for (probably) the most costly programs in existence when that benchmark can be bettered by an entry to the Obfuscated C competition (at least that only plays legal moves!)?
After all, if he is serious that this is a useful benchmark, he pretty much admits that it is affected by how many games of chess the training data included - so it is trivial to game it by feeding your LLM trainer the output of a couple of (different) strong chess-playing programs (not even setting up a chess player as an antagonist, cf. setting up a GAN).
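The gaming step really would be that mechanical. A sketch of the pipeline, assuming the engines emit numbered plain-text move lists (the format and names here are invented; real engines emit PGN or UCI output, which would need a real parser):

```python
# Hypothetical engine self-play transcripts, one game per line.
engine_output = [
    "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6",
    "1. d4 d5 2. c4 e6 3. Nc3 Nf6",
]

def to_training_text(games):
    # Strip the move numbers ("1.", "2.", ...) and keep only the moves,
    # yielding exactly the kind of text a token predictor imitates.
    samples = []
    for game in games:
        moves = [tok for tok in game.split() if not tok.endswith(".")]
        samples.append(" ".join(moves))
    return samples

corpus = to_training_text(engine_output)
print(corpus[0])  # -> "e4 e5 Nf3 Nc6 Bb5 a6"
```

Pour enough of that into the training mix and the benchmark score goes up, without the model gaining anything you could honestly call chess understanding.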
[1] That it doesn't even imbue the ability to follow the basic rules of chess, and he proposes no mechanism by which it ever *could*.