Checkmate? AI's pawn-pushing prowess proves partly pitiful, partly promising

A new benchmark for large language models (LLMs) shows that even the latest models aren't the best chess players. Simply called LLM Chess Puzzles by its creator, software engineer Vladimir Prelovac, the GitHub project tests LLMs by giving them 1,000 chess puzzles to complete. In contrast to a normal game of chess, puzzles are …

  1. Lee D Silver badge

    The problem with all AI:

    - You throw data at it, and "train it" (i.e. you kill off those who don't get increasingly closer to what you want) on the subject you desire.

    - They become vaguely proficient at it.

    To now train on another topic... you have to defeat all that training that you already gave it, by overwhelming it with other training until its initial training becomes a minority player in its behaviour. Which usually means orders-of-magnitude more training, starting from a very biased (culled) base of the thing you originally wanted it to do.

    To advance - much the same. Remember how, in learning, everything you initially learned turns out to be not the complete truth? Same problem. Now you have to take your mediocre chess AI and untrain it on everything it learned (to even survive!) in order to become a better chess AI.

    If you have the time, the processing, the source data (if you trained it on the entire Internet... where are you going to get orders of magnitude more training data, and the time to train it on it?), you can retrain it, but you'll still be held hostage by the initial criteria - the ones you culled for in the first place.

    Imagine you executed every lifeform that couldn't play chess on an interface that they can all use. Eventually, yes, you'd get an animal of some kind that can play chess on that interface. Now try to train that animal to launch a rocket. That millions of years of forced evolution takes a long time to undo - many generations, MANY individuals breeding constantly, until your "now incorrect" training is in the minority - and it will inherently bias the way that every creature that still exists thinks. Because they all came from a chess-playing creature.

    Modern AI still hasn't learned these lessons, despite decades of the EXACT SAME PROBLEM. Intelligence isn't a statistical average of the training data. That's not how it works. I can see something *ONCE*, recognise that it's amazing and useful, and throw out decades of my previous knowledge and experience to follow it, because it's a clearly-better tool for the purpose I need. That's how intelligence works. I can reason about things that don't yet exist and have no effect on my life - I can choose to be compassionate towards a minority group that I've never previously encountered, just by thinking about it.

    These things are just statistical probabilistic engines trained on databases. That's all they are. And AI people get rather offended when you point that out, because they still believe otherwise - because it uses genetic algorithms / neural networks / transformers / <insert latest fad here> - and don't see that all they're building is layers of abstraction on top of that. And yet actual intelligence does not rely on any such engine or database; in fact, one of its defining features is that we can extrapolate from almost zero previous information and imagine things that have never been recorded before.

    It's 60 years down the line and we're still pushing the same nonsense, and getting the same result, and even proposing the same solution - MORE CPU! MORE RAM! MORE NODES! MORE TRAINING! MORE DATA SOURCES! That'll fix it, this time, for sure! Everyone knows that you just do random stuff millions of times over and it magically and spontaneously becomes intelligent at a given point! It's just that that point is always *just* out of reach, apparently.

    Or maybe we could completely rethink what we're doing here. Because none of this solves the inference problem, the training plateaus, or the complete lack of understanding or conceptualisation of the underlying data.

    1. Joe W Silver badge

      And, in contrast to the article, I would think that since it does not "think ahead", the LLM will not be able to spot a mate in two or three moves. Unless the position was in its training set, I guess.

      Meh.

    2. that one in the corner Silver badge

      > The problem with all AI

      Agree with the analysis, but not with the description you start with.

      LLMs are *not* "all AI", they are not even "all ML"; they are not even the newest AI techniques and ideas.

      The "newness" of LLMs is purely the first 'L', meaning "we can throw more compute at it today".

      Yes, I *know* that "they" have now decided to coin a new term, AGI, for a 'proper' AI, and that "words change meaning, get over it" - BUT encouraging this misuse of the term "AI" means losing contact with research and work done in only the last few decades[1]; only recently I heard complaints that a book about AI wasn't solely about LLMs.

      [1] "oh, but computing moves so fast, anything a decade old is useless anyway" - of course, that explains why we stopped using lexical analysis in our compilers /s

    3. heyrick Silver badge

      Why on earth do you need to train and retrain and train some more a machine to play chess? There are only six different pieces, each with unique moves and behaviours, plus a well-known and standardised set of rules. I think the strength of a good chess-playing machine is the ability to think many moves ahead for both players, and to adapt its strategy according to its prediction of what the human is doing.

      The fact that LLMs spit out illegal moves indicates that they have not yet grokked the basics of chess, and as such may not be the correct solution for making a good virtual chess player.
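      (Checking legality is the trivial part, by the way. A minimal sketch, assuming the python-chess library, of what conventional software does effortlessly and these models apparently can't:)

      ```python
      # Minimal sketch using the python-chess library: validating a proposed
      # move takes a few lines and no training corpus at all.
      import chess

      board = chess.Board()    # standard starting position
      board.push_san("e4")     # 1. e4
      board.push_san("e5")     # 1... e5

      proposed = "Ke2"         # legal here (if ugly), since e2 is now empty
      try:
          board.parse_san(proposed)
          print(f"{proposed} is legal in this position")
      except ValueError:
          print(f"{proposed} is illegal in this position")

      # Enumerating every legal reply is just as easy:
      print(sorted(board.san(m) for m in board.legal_moves))
      ```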

      1. Lee D Silver badge

        Precisely because it's a statistical machine with no inference, it doesn't understand that the same move in two positions can be good and/or bad, or that the rules could pretty much be written on the back of a small postcard. It's not playing by the rules; it doesn't understand them or what they mean. It's purely playing by the results of previous data and (maybe!) new data coming in.

        This is the problem - a machine shouldn't need to be fed the entire Internet to learn how to respond to your question about the weather. We're going about things completely backwards. Even the rules of English are picked up within a couple of years by any toddler or bilingual speaker. In computing terms, those couple of years' worth of training can pass within seconds.

        But we still don't have any AI capable of doing anything beyond what it was taught/told to do. People point at some that are "creating maths theories". No they're not. They're just brute-forcing things and showing you what you missed, because they aren't capable of positing actual new theories in even the most minor of areas (and bear in mind that pretty much every PhD means discovering something genuinely new, in some fashion, including identifying something we don't know and then working out how to find it out in a rigorous way). Machines are great labour-savers, marvellous tools, even the LLMs and other nonsense out there.

        What they are not, in any way, shape or form, is intelligent. Or even hinting at such. They're just trained monkeys that, the second you ask them something outside their training, they have no idea what to do. And even that's an analogy that's insulting to monkeys who do possess a given amount of actual intelligence.

  2. Mike 137 Silver badge

    So to sum up ...

    The LLM doesn't understand or reason - and understanding and reasoning are two key attributes of a good chess player (or indeed of anyone doing anything really well).

    What a surprise, considering how the system operates - selecting the most statistically probable option from a pool of essentially randomly chosen alternatives, without any reference to causality or results.

    1. cyberdemon Silver badge

      Re: So to sum up ...

      Using a LLM to play chess is like using 1000 blunderbusses to try to kill a beetle on an archery target 200m away.

      It can do it, sometimes, but it's horrendously inefficient compared to a dedicated chess program that could run on an 8-bit microcontroller drawing a couple of milliwatts.

      I wonder if an H200 running GPT could even beat an 8-bit chess program on a Z80, given the same time to complete moves and a million-fold power consumption advantage.

      1. heyrick Silver badge

        Re: So to sum up ...

        Horribly inefficient, but I'd reckon loads of fun.

  3. jake Silver badge

    So they can't play chess without waffling.

    Given that they can't really do much of anything else except waffle, is anybody really surprised?

    Wake me up when they can do something useful.

    1. Anonymous Coward

      Re: So they can't play chess without waffling.

      Generating waffle *is* useful!

      I mean, it *must* be, otherwise why would our politicians be putting so much effort into it, especially recently?

      We should be campaigning to get GPT onto the ballot papers - after all, it is going to be a while before an LLM tries to claim parliamentary expenses for its duck house, so that is already an improvement.

      1. Flocke Kroes Silver badge

        Re: Duck houses

        Given the available training data I would expect duck houses to be a fairly popular expense claim for LLM politicians.

      2. heyrick Silver badge

        Re: So they can't play chess without waffling.

        "why would our politicians be putting so much effort into it, especially recently?"

        Trying to convince you that they're doing something to justify what they're sucking from the public teat. The basic annual salary of a regular MP is £91,346, not including all the "expenses". That's nearly three times what a band 5 (the most common grade) NHS nurse makes.

    2. Elongated Muskrat Silver badge

      Re: So they can't play chess without waffling.

      It should come as absolutely no surprise to anyone that an LLM can learn chess about as effectively as a toaster, or a piece of cheese. There's a lot of magical thinking going on in the field, and that starts with calling statistical models "AI" in the first place.

  4. Andy Non Silver badge

    As someone else has commented on this forum

    ChatGPT and all the rest aren't true AI; they are just advanced predictive-text generators, trained on a colossal amount of text, and they are pretty good at predicting what words, sentences and paragraphs to suggest in response to an input. There is nothing truly intelligent about them. So many people and companies oohing and aahing at the emperor's fine new clothing... he's stark bollock naked!

    1. Mike 137 Silver badge

      Re: As someone else has commented on this forum

      "So many people and companies ooing and ahing at the emperor's fine new clothing... he's stark bollock naked!"

      Indeed, he's not even an emperor but nobody's noticed yet.

  5. Mike 137 Silver badge

    A real test

    Of course the real solid test would be to pit two LLMs against each other in an actual game of chess. But I guess the outcome would be so embarrassing that nobody's willing to try it (just like nobody's going to try sending 20+ 'autonomous' vehicles in among the other traffic down the five ways into the Hemel Hempstead Magic Roundabout in the rush hour).
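    The harness for such a match would be trivial to build; the hard part is getting legal moves out of the contestants. A hedged sketch, assuming python-chess, where ask_llm() is a hypothetical placeholder for whatever model API you'd wire up, and an illegal move simply forfeits:

    ```python
    # Hypothetical LLM-vs-LLM match harness. python-chess enforces the rules;
    # ask_llm() is a placeholder, not a real API.
    import chess

    def ask_llm(name: str, board: chess.Board) -> str:
        """Placeholder: send the model the FEN, get a SAN move string back."""
        raise NotImplementedError(f"wire up {name} here")

    def play_match(white: str = "model-a", black: str = "model-b") -> str:
        board = chess.Board()
        players = {chess.WHITE: white, chess.BLACK: black}
        while not board.is_game_over():
            name = players[board.turn]
            san = ask_llm(name, board)
            try:
                board.push_san(san)   # raises ValueError on an illegal move
            except ValueError:
                return f"{name} forfeits: illegal move {san!r}"
        return board.result()         # e.g. "1-0", "0-1" or "1/2-1/2"
    ```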

    1. Flocke Kroes Silver badge

      Re: A real test

      The results are mostly predictable from how LLMs play humans. The LLM will pick a move that was common in its training data. It may even pick a move that commonly follows the previous move. It will not care if there is a piece at the start position of the move. If there is, it will not care whether that piece is its own, or whether it is the right type of piece to move to the destination square. It will not care if the destination is currently occupied by one of its own pieces. All that matters is that the move is popular - possibly in the context of games with similar previous moves. If the move is a popular way to lose, that does not reduce the chance of it being selected.

      If you train exclusively on chess games, the results will be better. If the training deprioritises illegal moves, the performance will improve; more so if you actually train it to play moves that lead to a win. This chess model would be crap at telling stories or drawing pictures. You could put a thousand special-purpose models on a single computer and have an LLM pick one according to its training. The one thing you could not do is replace the chess LLM with Stockfish. That would call into question the value of investing in the other 999 specialised models.
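      The inference-time version of "deprioritise illegal moves" is easy enough to sketch, assuming python-chess; model_scores here is a hypothetical stand-in for whatever probabilities the model assigns to candidate moves (a real implementation would mask token logits instead):

      ```python
      # Sketch: filter the model's candidate moves down to the legal ones
      # before picking. model_scores is hypothetical, not real model output.
      import chess

      def pick_legal_move(board: chess.Board, model_scores: dict) -> str:
          legal = {board.san(m) for m in board.legal_moves}
          candidates = {s: p for s, p in model_scores.items() if s in legal}
          if not candidates:
              # Nothing the model offered is legal: fall back to any legal move.
              return board.san(next(iter(board.legal_moves)))
          return max(candidates, key=candidates.get)

      board = chess.Board()
      # "Qh5" is illegal from the starting position, so it gets masked out:
      print(pick_legal_move(board, {"e4": 0.5, "Qh5": 0.3, "Nf3": 0.2}))  # e4
      ```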

      1. Mike 137 Silver badge

        Re: A real test

        "You could put a thousand special purpose models on a single computer and have an LLM pick one according to its training"

        I think you've hit the nail squarely on the head. Specialist single function AI trained on reliable data is already beginning to prove itself. Generalist systems such as LLMs aren't, because (quite apart from the low quality of much training data) being a generalist depends on capacities to analyse, reason and infer, but LLMs don't (and can't) do any of these. For that very reason I'm not assured that your LLM would be able to make the right choice of model reliably.

        1. jake Silver badge

          Re: A real test

          "Specialist single function AI trained on reliable data is already beginning to prove itself."

          Where? Does using that function justify the expense of developing it? CAN it justify that expense? Will it ever be able to break even? (Entropy suggests not ... )

    2. m4r35n357 Silver badge

      Re: A real test

      I'd like to see the latest "technology" go up against ZX Spectrum chess from the 1980s.

      But seriously, is anyone actually surprised at the complete failure of ML to play chess?

      1. Paul Crawford Silver badge

        Re: A real test

        "But seriously, is anyone actually surprised at the complete failure of ML to play chess?"

        Around these parts, no.

        But what it helps illustrate to the great unwashed / uneducated is just how lacking in 'I' the claimed AI is. It potentially even serves to illustrate the humongous waste of resources that LLMs represent for a big swath of the problems being solved.

      2. Brave Coward

        ZX Spectrum chess from the 1980s

        Reminds me of those pocket-sized electronic chess boards that became available (although rather expensive) at the beginning of the '80s.

        My then-girlfriend's father bought one and let me try it. Difficulty levels ran from 1 (very easy) to 9 (very hard). Not being overly ambitious, I started at level one - only to be beaten in a matter of minutes.

        Rather impressed by my defeat (I wasn't too bad a chess player at the time), I decided to test level 9, which I expected to be awfully impressive.

        I won.

      3. doublelayer Silver badge

        Re: A real test

        "But seriously, is anyone actually surprised at the complete failure of ML to play chess?"

        Was that a typo? ML, as in machine learning, does play chess well. A specific model trained on the rules of chess to play actual chess moves is quite good and routinely beats the most skilled of humans. LLMs, on the other hand, are crap at it because they weren't intended to play chess, but there are lots of people who think they're something they're not. I don't think many of us are surprised that an LLM can't play chess, but there are some people who might, but probably won't, understand why this means their conception of an intelligent program is flawed. They see a program write a paragraph with correct grammar that looks to be answering a question, and since they can't answer that question themselves, they assume it must be intelligent. And because they can't answer the question, they may do that even when the provided answer is wrong. It looks convincing and that's good enough for them. I'm hoping we can show them what the tool can actually do before they unleash some LLM-powered thing on us which annoys everyone with constant wrong answers.

        1. m4r35n357 Silver badge

          Re: A real test

          Just to be clear, I am talking about neural nets.

  6. that one in the corner Silver badge

    Confusion about what Prelovac believes, expects or hopes for

    > "While providers of large language models share their own performance benchmarks, these results can be misleading due to overfitting," Prelovac told The Register. "This means the model might be tailored to perform well on specific tests but doesn't always reflect real-world effectiveness."

    Yes, very true.

    But what does comparing their chess-playing ability do for us?

    > "It's somewhat disappointing but expected that these models show no real generalization of intelligence or reasoning," he said. "While they can perform specific tasks well, they don't yet demonstrate a broad, adaptable understanding or problem-solving ability like human intelligence."

    Disappointing? Don't yet?

    In other words, all this demonstrates is that Prelovac has fallen, hook, line and sinker, for the totally unfounded claim that the LLM training process *must* produce a reasoning system and, instead of accepting the results of his own experiment[1], is just a bit sad that the models he has tried don't reason.

    > As for why GPT-4o registered a remarkable improvement in chess but still made illegal moves, Prelovac speculated that perhaps its multi-modal training had something to do with it. It's possible part of OpenAI's training data included visuals of chess being played, which could help the AI visualize the board easier than it can with pure text.

    So now the LLMs are building internal representations, visualisations, of the boards? Where? Can we extract these visualisations and demonstrate they exist? Nope? So is that anything more than anthropomorphisation in action?

    And yet, he *does* seem to know what these programs are doing:

    > "Even chess moves are nothing but a series of tokens, like 'e' and '4', and have no grounding in reality," Prelovac said. "They are products of statistical analysis of the training data, upon which the next token is predicted."

    In the end, what is the aim of this (other than having fun and publishing it on GitHub for others to play with; nothing wrong with that)? Well, the article starts with:

    > A new benchmark for large language models (LLMs)

    But what is it supposed to be benchmarking? To what purpose? What are we supposed to understand from a benchmark for the (probably) most costly programs in existence, when that benchmark can be bettered by an entry to the Obfuscated C competition (at least that only plays legal moves!)?

    After all, if he is serious that this is a useful benchmark, he pretty much admits that it is affected by how many games of chess the training has included - so it is trivial to game it by connecting your LLM trainer to the output of a couple of (different) good chess-playing programs (not even setting up a chess player as an antagonist, cf. setting up a GAN); a sketch of such a generator follows after the footnote.

    [1] that it doesn't even imbue an ability to follow the basic rules of Chess, and he proposes no mechanism by which it ever *could*.
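    To the "trivial to game" point above: a sketch of the sort of data generator that would do it, assuming python-chess and a `stockfish` binary on the PATH. Every pair it yields is a (position, strong move) sample ready to be dumped into a training corpus:

    ```python
    # Sketch: harvest (FEN, best-move) training pairs from engine self-play.
    # Assumes python-chess and a `stockfish` binary on the PATH.
    import chess
    import chess.engine

    def self_play_pairs(games: int = 10, move_time: float = 0.05):
        engine = chess.engine.SimpleEngine.popen_uci("stockfish")
        try:
            for _ in range(games):
                board = chess.Board()
                while not board.is_game_over():
                    result = engine.play(board, chess.engine.Limit(time=move_time))
                    yield board.fen(), board.san(result.move)
                    board.push(result.move)
        finally:
            engine.quit()

    for fen, san in self_play_pairs(games=1):
        print(fen, san)
    ```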

    1. doublelayer Silver badge

      Re: Confusion about what Prelovac believes, expects or hopes for

      "But what does comparing their chess-playing ability do for us?"

      It demonstrates that they are not general intelligences. And I don't agree that he believes they have one. I think he was attempting to demonstrate the extent to which they can solve problems that weren't the purpose of their training data, because that's what a lot of LLM-based products claim to do. By pointing out how often they fail at something that is relatively easy to make a computer do, it demonstrates the failure in a simple, practical, easily-understood way. That might be more convincing to someone who believes that LLMs can reliably solve problems than a theoretical discussion.

  7. HandleBaz

    Martin Supremacy

    Stockfish will absolutely crush you though.

  8. Anonymous Coward

    please stop calling em AI, they are POO

    Can every fucker stop using "Ai" to describe this shit.

    yes it's artificial, it's not fucking intelligent in any fucking sense of the word

    they are "Probability Ordure Outputters" or POO for short

    1. Elongated Muskrat Silver badge

      Re: please stop calling em AI, they are POO

      The "I" in "AI" stands for "idiocy", not "intelligence". It's just that a lot of people who, themselves, are idiots, but think themselves intelligent, get confused by it. Like an electronic version of the Dunning-Kruger effect.

  9. ibmalone

    PGN not FEN

    "Technically, when GPT-4o writes out the move it wants to play, it correctly formats it in Forsyth–Edwards notation (FEN), but the model doesn't understand that even if it makes sense, that doesn't mean it's the best move or even legal."

    Chess moves are in Portable Game Notation - strictly, standard algebraic notation (SAN), which is what goes inside a PGN file. FEN describes a board position. For a puzzle you need FEN to describe the position, but the solution move or sequence (4. f3 or whatever) would be SAN/PGN.
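    The distinction is easy to demonstrate, assuming python-chess: FEN goes in to set up the puzzle position, and the answer comes back out as SAN, the move text you'd find inside a PGN file:

    ```python
    # FEN describes a position; SAN (the move text inside a PGN file)
    # describes a move. Sketch using python-chess.
    import chess

    # A puzzle position as FEN: the classic Scholar's Mate pattern.
    board = chess.Board(
        "r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/5Q2/PPPP1PPP/RNB1K1NR w KQkq - 0 4"
    )

    move = board.parse_san("Qxf7#")  # the solution, written in SAN
    board.push(move)
    print(board.is_checkmate())      # True
    ```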

    Nobody here expects LLMs to be any good at chess (that they succeed at all suggests quite a bit of chess notation has fed into their training data), but I suppose this kind of exercise is there to remind everyone else of that fact. Computers are really good at chess, humans can be pretty good; LLMs do not show general intelligence, just regurgitation.

  10. FeepingCreature

    Just reply with the move

    This is of course equivalent to telling a human "Don't think about the problem, just blurt out the first move that comes to mind."

    Talking about things is literally how these systems think.

    The standing theory is that GPT-4o is better because it has been explicitly trained to think about chess internally, without speaking out loud. Without that advantage, you *have* to allow the system to reason. If you don't, well, it's not surprising that the result is unreasonable.
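    For illustration, the two prompting styles this comment contrasts might look like the following; complete() is a hypothetical placeholder, not a real API:

    ```python
    # Hypothetical illustration of "blurt out a move" vs "reason first".
    # complete() is a placeholder for a real model API call.
    FEN = "r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/5Q2/PPPP1PPP/RNB1K1NR w KQkq - 0 4"

    blurt_prompt = (
        f"Position (FEN): {FEN}\n"
        "Reply with your move only. No explanation."
    )

    reasoning_prompt = (
        f"Position (FEN): {FEN}\n"
        "List the checks, captures and threats, reason through them step "
        "by step, then give your final move alone on the last line."
    )

    def complete(prompt: str) -> str:
        raise NotImplementedError("placeholder for a real LLM API call")
    ```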

  11. G.Y.

    13%

    If they make illegal moves 13% of the time in chess, I don't want them running law, navigation, machine tools ...
