AI models suck slightly less at math than they did last year

Current-day LLMs are prediction engines and, as such, they can only find the most likely solution to problems, which is not necessarily the correct one. Though popular models have mostly become better at math, even top performer Gemini 3 Flash would receive a C if assessed with a letter grade. Researchers affiliated with Omni …

  1. Korev Silver badge
    Coat

    They Orca do better

    1. NoneSuch Silver badge
      Go

      Trust, but verify!

      As with all things AI, it depends on the prompt you give it.

      Clear, well-written prompts can definitely help AI do better with math. When you spell out the problem clearly and ask it to work step by step, the answers are often more accurate. But even with a good prompt, it can still make simple calculation mistakes or stick with a wrong answer too confidently. So better prompts improve the odds—but they don’t guarantee perfect math every time.
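As a sketch, the "spell it out and ask for step-by-step working" approach might look something like this — the template wording and the `build_math_prompt` helper are purely illustrative, not from any vendor's prompting guide:

```python
# Hypothetical prompt template: state the problem explicitly and ask
# for step-by-step working, with the final answer on a marked line.
def build_math_prompt(problem: str) -> str:
    return (
        "Solve the following problem. Show each step of your working, "
        "then state the final answer on its own line prefixed with "
        "'ANSWER:'.\n\n"
        f"Problem: {problem}"
    )

prompt = build_math_prompt("What is 17% of 240?")
print(prompt)
```

Even then, as noted above, the model can still fumble the arithmetic — the template only improves the odds.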

      1. Roland6 Silver badge

        Re: Trust, but verify!

        With maths it would seem the algorithms are being tuned to recognise “math” and so feed stuff into a dedicated math coprocessor.

        So this result is indicating how well the coprocessors have been developing.

  2. Rikki Tikki

    Can AI calculate the likely return from an investment of $200 billion in AI?

    1. Anonymous Coward
      Anonymous Coward

      That's easy.

      Step 1) Get AI

      Step 2) Profit

      1. zimzam Silver badge

        Asking the AI is literally the plan.

        https://www.youtube.com/shorts/pLnyjxgFxew

    2. Joe Gurman Silver badge

      The experience

      …. will be priceless.

    3. ecofeco Silver badge

Of course it can't. When the actual total so far is about $1 TRILLION.

      Not even joking.

  3. Throatwarbler Mangrove Silver badge
    Terminator

    Can confirm

    I've been trying to use AI to generate some content in a hurry. In particular, I've been trying to save myself from tediously making rack diagrams, so I asked Copilot to do so. The rack measurements are uneven and suddenly jump from 28 RU to 45. The AI remains absolutely certain it has given a 45 RU diagram, no matter how I prompt it, and it remains blissfully* unaware of the giant gap in its numbering.

    * Please, no need to point out the obvious

    1. Anonymous Coward
      Anonymous Coward

      Re: Can confirm

      Do those diagrams include four "loab dalances" and the classic BofH thinwire to high-voltage adapter?

    2. Charlie Clark Silver badge

      Re: Can confirm

      Does it give you the code it's generating so that you can review and adapt it?

      The other week I did some work with Mistral for something similar – a network floor plan – and it came up with some reasonable primitives using Matplotlib and we even started work on schematic diagrams for the switches. I'm sure that, given the right prompt, nicer things would be possible.

    3. Gavsky

      Re: Can confirm

      I asked Microsoft AI to generate realistic images of Scottish, WW1 soldiers, charging with bayonets. It refused to generate bayonets point blank, so I removed that. The results were hilarious, terrifying & manifestly wrong.

      Same with American Civil War Union & Confederate soldiers - utter bollocks. We know, but it's frightening how many people would just accept the 'right answer', & that AI doesn't understand it's wrong, & why.

      1. Bebu sa Ware Silver badge
        Windows

        Re: Can confirm

        Curious why the AI baulked at the bayonets.

        Scottish, WW1 soldiers, charging with bayonets. — Unquestionably terrifying.

The Scots troops must have been issued with bayonets - even the Australians had them. According to my grandfather's experience, a sharpened spade was a more effective weapon in the trenches at close quarters.

  4. cd Silver badge

    People making life-altering decisions based on answers from something that cannot put two and two together.

    Visitors to the planet will see the elaborate empty structures and smouldering ruins and ask what happened.

    1. ecofeco Silver badge

      Fermi is laughing.

  5. Flocke Kroes Silver badge

    Easy fix

    Redefine the correct answers to maths problems as the most recent output of an LLM.

    1. LionelB Silver badge

      Re: Easy fix

      Even easier solution. Use a bloody calculator. Or if it's a complex problem, use a maths package like Mathematica.

      Why on earth would anyone use an LLM to do maths? They're large language models for chrissake, obviously wrong tool for the job.

      This is just classic PEBCAK.
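And the "bloody calculator" is deterministic precisely because it computes rather than predicts — a minimal sketch with Python's standard-library `Fraction`, which does exact rational arithmetic and returns the same answer every single time:

```python
from fractions import Fraction

# A deterministic "calculator": exact rational arithmetic,
# no floating-point fuzz, no probabilities, no prompting.
result = Fraction(1, 3) + Fraction(1, 6)
print(result)  # 1/2
```

Ask it a thousand times; you get 1/2 a thousand times.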

  6. Anonymous Coward
    Anonymous Coward

    AI is an approximation .

integral of x=0 to infinity of 1+1=1.99999999999. oh wait i just created a pull request for if (round(integral) == 2) return 2. just adding 10 lines of code only. and 2 megabytes of github data and us$1000 of data center cost for the AI agent to autoprocess the merge. i only used 50% of my us$200 token budget to ask the agent to process this. (rant)

  7. Neil Barnes Silver badge
    Headmaster

    If I bought a two-euro calculator

    and it was only right 72% of the time, I'd be chasing a refund.

    A computing device that does not return the same answer to the same question every time it is asked is _broken_.

    1. Bebu sa Ware Silver badge
      Windows

      Re: If I bought a two-euro calculator

About a decade ago I did purchase an AUD6.95 programmable calculator from a local grocery which, even for four-function calculations, was rather hit or miss - mostly miss.

      I spent a fair amount of time attempting to determine what was wrong with it. I was pretty sure it wasn't electrical (faulty contacts, low battery etc).

Something to do with the programming, I imagine, as even repeating the same simple calculation would give different results. The STO and RCL keys worked reliably, as apparently did anything user-programmed.

      Ultimately returned it for a refund :) Paid about 10× the refund for a new HP35s from an internet site which both still works and returns the right answers.

      1. Neil Barnes Silver badge

        Re: If I bought a two-euro calculator

        I bought an HP11c sometime in the mid-eighties. I'm shocked to say I've had to change the batteries recently, for the _second_ time. Nothing's built to last any more...

    2. LionelB Silver badge

      Re: If I bought a two-euro calculator

      Damn right. I asked an AI what the age of the universe was, and it told me it was 13.7 billion years. I immediately asked it again, and it had added 7 seconds to its previous answer!! Then I asked it twice to give me a number between 1 and 10 and again got two completely different answers.

      Bloody AI, eh?

  8. Charlie Clark Silver badge

Once the task has been identified, the LLM should hand over to a specialised model

I don't know why this isn't the case, given the research showing that LLMs try to solve such problems as if they were completing a sentence. Solvers for most mathematical problems have existed for years, so it shouldn't be hard to pass in some kind of AST based on the question.
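The "pass in some kind of AST" hand-off is easy to sketch for plain arithmetic using Python's standard-library `ast` module — this is a toy illustration of the idea, not how any production LLM actually routes maths:

```python
import ast
import operator

# Sketch of a deterministic hand-off: parse the arithmetic into an AST,
# then evaluate it exactly instead of asking a language model to guess.
OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def solve(expr: str):
    def walk(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp):
            return OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported construct")
    return walk(ast.parse(expr, mode="eval").body)

print(solve("2 + 2 * 10"))  # 22, every time
```

Unlike an LLM, this either returns the exact answer or raises an error — it never confidently returns something plausible-but-wrong.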

  9. PB90210 Silver badge

    If it's Grok then getting the answer right is pure woke!

    1. Bebu sa Ware Silver badge
      Coat

      If it's Grok then getting the answer right is pure woke!

Surely Musk's Cyber-Nazi's answers would all be right, even if wrong. The answers left, while correct, are too woke for his grokiness.

  10. Gavsky

It's hilarious: there is AN answer; black & white; right or wrong. Except AI can get it wrong. This points to the underlying issue that compromises all AI: it's making decisions based on probability, sifting through the 'stuff' it's trained on.

    We know there's a 1 or 0 result; AI can return a 0.756 result because that's what it determines is probably correct. Some of its training material might be utter BS, because: humans. But, AI can't discern, it's not intelligent.

    1. MonkeyJuice Silver badge

Though there are plenty of other, boring ways to do this that are apparently too useful to pursue. Take AlphaGeometry: it uses a transformer, but is trained ENTIRELY on path-optimal runs from its symbolic theorem prover. All the neural part is doing is acting as a heuristic function in your good old-fashioned state-space search. Done this way, you get a system that is sound - i.e. IF it produces an answer THEN it is guaranteed correct. You can run that pretraining from scratch on a 3090 in a weekend; this isn't even an energy hog.

      However this is all too uncool for anyone, because it's domain-specific and "NOT AGI ENOUGH", and all the large academic institutions have devolved into dicking around with prompting GPT5, and then being unable to reproduce each other's results on systems they don't even get to look inside or know how they were trained.
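The "neural part as a heuristic over a sound search" idea can be sketched with no neural network at all. Below, a plain function stands in for the learned heuristic in an A*-style search over a toy shortest-path domain (the domain and names are made up for illustration; nothing here is AlphaGeometry-specific). The point is the soundness property: the heuristic only orders exploration, while the goal test guarantees any returned path is genuinely valid.

```python
import heapq

def heuristic_search(start, goal, neighbors, heuristic):
    # Best-first search: the heuristic decides what to explore next,
    # but correctness comes from the goal check. A bad heuristic
    # costs time, never correctness.
    frontier = [(heuristic(start), 0, start, [start])]
    best_cost = {}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path  # sound: every step came from `neighbors`
        if best_cost.get(node, float("inf")) <= cost:
            continue
        best_cost[node] = cost
        for nxt, step in neighbors(node):
            heapq.heappush(
                frontier,
                (cost + step + heuristic(nxt), cost + step, nxt, path + [nxt]),
            )
    return None

# Toy domain: integer states, moves of +1 or +2, each costing 1;
# the "learned" heuristic is just distance-to-goal.
path = heuristic_search(0, 5,
                        lambda n: [(n + 1, 1), (n + 2, 1)],
                        lambda n: abs(5 - n))
print(path)
```

Swap the lambda for a trained model and the guarantee is unchanged — which is the whole appeal of the hybrid design.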

    2. LionelB Silver badge
      Meh

      Meh. Wrong tool for the job. That's not what they were designed for.

  11. Will Godfrey Silver badge
    Boffin

    ???

    What is a 'math' and where can I find one?

    Is there a mathematical way to do so? If so, please let us know. I'm usually reasonably good at maths.

  12. StewartWhite Silver badge
    FAIL

    FTFY

    But for the time being, trust no AI.

  13. ravenviz
    Childcatcher

    Garbage in, garbage out.

  14. ecofeco Silver badge
    Facepalm

    This is ridiculous

    All of that supposed computing power and they ALL still suck at math?

    Wow. They ALL still suck at math.

    1. LionelB Silver badge
      Meh

      Re: This is ridiculous

      Seems about right to me – most humans suck at maths. That's why we invented calculators and actual maths software.

      (Clue: we're talking about Large Language Models, trained on human data; why on earth would they be good at maths?)
