They Orca do better
AI models suck slightly less at math than they did last year
Current-day LLMs are prediction engines and, as such, they can only find the most likely solution to problems, which is not necessarily the correct one. Though popular models have mostly become better at math, even top performer Gemini 3 Flash would receive a C if assessed with a letter grade. Researchers affiliated with Omni …
COMMENTS
-
-
Friday 27th February 2026 16:13 GMT NoneSuch
Trust, but verify!
As with all things AI, it depends on the prompt you give it.
Clear, well-written prompts can definitely help AI do better with math. When you spell out the problem clearly and ask it to work step by step, the answers are often more accurate. But even with a good prompt, it can still make simple calculation mistakes or stick with a wrong answer too confidently. So better prompts improve the odds—but they don’t guarantee perfect math every time.
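For what it's worth, a minimal sketch of the difference, assuming an OpenAI-style chat client - the model name, the question and the wording are placeholders, not anything from the article:

    # Hypothetical illustration: the same question asked two ways.
    # The client, model name and question are assumptions, not from the comment.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    question = "A train leaves at 14:10 and arrives at 17:35. How long is the journey?"

    # Bare prompt: the model may just pattern-match a plausible-looking answer.
    bare = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )

    # Step-by-step prompt: spell out the task and ask for intermediate working.
    careful = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": question + " Work step by step: find the hours and minutes "
                                  "separately, then combine them, and give the final answer last.",
        }],
    )

    print(bare.choices[0].message.content)
    print(careful.choices[0].message.content)

Even then, as above, the second answer is only more likely to be right, not guaranteed.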
-
-
Thursday 26th February 2026 21:54 GMT Throatwarbler Mangrove
Can confirm
I've been trying to use AI to generate some content in a hurry. In particular, I've been trying to save myself from tediously making rack diagrams, so I asked Copilot to do so. The rack measurements are uneven and suddenly jump from 28 RU to 45. The AI remains absolutely certain it has given a 45 RU diagram, no matter how I prompt it, and it remains blissfully* unaware of the giant gap in its numbering.
* Please, no need to point out the obvious
-
-
Friday 27th February 2026 09:27 GMT Charlie Clark
Re: Can confirm
Does it give you the code it's generating so that you can review and adapt it?
The other week I did some work with Mistral for something similar – a network floor plan – and it came up with some reasonable primitives using Matplotlib and we even started work on schematic diagrams for the switches. I'm sure that, given the right prompt, nicer things would be possible.
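For anyone wondering what those primitives look like, here is a minimal hand-written sketch of a rack elevation using Matplotlib rectangles - the 45 RU height echoes the comment above, but the devices, sizes and positions are invented for illustration, not anything Mistral or Copilot actually produced:

    # Minimal rack-elevation sketch using Matplotlib primitives.
    # Device names, sizes and positions are invented for illustration.
    import matplotlib.pyplot as plt
    from matplotlib.patches import Rectangle

    RACK_HEIGHT_RU = 45  # total rack units, drawn bottom-up

    # (name, starting RU, height in RU)
    devices = [
        ("UPS", 1, 4),
        ("Patch panel", 20, 1),
        ("Switch A", 21, 1),
        ("Switch B", 22, 1),
        ("Server 1", 30, 2),
        ("Server 2", 32, 2),
    ]

    fig, ax = plt.subplots(figsize=(3, 9))

    # Rack outline plus one line per RU, so any gap in the numbering is visible.
    ax.add_patch(Rectangle((0, 0), 1, RACK_HEIGHT_RU, fill=False, linewidth=1.5))
    for ru in range(1, RACK_HEIGHT_RU):
        ax.axhline(ru, color="lightgrey", linewidth=0.5)

    # Each device is a labelled rectangle at its RU position.
    for name, start_ru, height in devices:
        ax.add_patch(Rectangle((0, start_ru - 1), 1, height,
                               facecolor="steelblue", edgecolor="black"))
        ax.text(0.5, start_ru - 1 + height / 2, name,
                ha="center", va="center", color="white", fontsize=8)

    ax.set_xlim(0, 1)
    ax.set_ylim(0, RACK_HEIGHT_RU)
    ax.set_xticks([])
    ax.set_ylabel("Rack unit (RU)")
    ax.set_title("45 RU rack (sketch)")
    plt.tight_layout()
    plt.savefig("rack_diagram.png", dpi=150)

The nice thing about reviewing generated code rather than a generated image is that a jump from 28 RU to 45 shows up as a wrong number in a list, not something you have to squint at.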
-
Friday 27th February 2026 22:04 GMT Gavsky
Re: Can confirm
I asked Microsoft AI to generate realistic images of Scottish, WW1 soldiers, charging with bayonets. It refused to generate bayonets point blank, so I removed that. The results were hilarious, terrifying & manifestly wrong.
Same with American Civil War Union & Confederate soldiers - utter bollocks. We know, but it's frightening how many people would just accept the 'right answer', & that AI doesn't understand it's wrong, & why.
-
Saturday 28th February 2026 12:19 GMT Bebu sa Ware
Re: Can confirm
Curious why the AI baulked at the bayonets.
Scottish, WW1 soldiers, charging with bayonets. — Unquestionably terrifying.
The Scots troops must have been issued with bayonets - even the Australians had them. According to my grandfather, speaking from experience, a sharpened spade was a more effective weapon in the trenches at close quarters.
-
-
-
-
Friday 27th February 2026 03:41 GMT Anonymous Coward
AI is an approximation .
integral of x=0 to infinity of 1+1=1.99999999999. oh wait i just created a pull request for if (round(integral==2)) return 2. just adding 10 lines of code only. and 2 megabyte of github data and use us$1000 data center cost for the AI agent to autoprocess the merge. i only used 50% of my us$200 token budget to ask the agent to process this. (rant)
-
-
Saturday 28th February 2026 11:38 GMT Bebu sa Ware
Re: If I bought a two-euro calculator
About a decade ago I did purchase an AUD6.95 programmable calculator from a local grocery which, even for four-function calculations, was rather hit or miss - mostly miss.
I spent a fair amount of time attempting to determine what was wrong with it. I was pretty sure it wasn't electrical (faulty contacts, low battery etc).
Something to do with the programming, I imagine, as even repeating the same simple calculation would give different results. The STO and RCL worked reliably, as apparently did anything the user programmed.
Ultimately returned it for a refund :) Paid about 10× the refund for a new HP35s from an internet site which both still works and returns the right answers.
-
Saturday 28th February 2026 22:25 GMT LionelB
Re: If I bought a two-euro calculator
Damn right. I asked an AI what the age of the universe was, and it told me it was 13.7 billion years. I immediately asked it again, and it had added 7 seconds to its previous answer!! Then I asked it twice to give me a number between 1 and 10 and again got two completely different answers.
Bloody AI, eh?
-
-
Friday 27th February 2026 09:30 GMT Charlie Clark
Once the task has been identified, the LLM should hand over to a specialised model
I don't know why this isn't already the case, given the research showing that LLMs try to solve the problem as if they were completing a sentence. Solvers for most mathematical problems have existed for years, so it shouldn't be hard to hand over some kind of AST derived from the question.
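A minimal sketch of that hand-off, using SymPy as the specialised solver - the routing and the already-extracted expression are assumptions for illustration, not a feature any current LLM stack is known to ship:

    # Sketch: once a question has been identified as maths, hand the extracted
    # expression to a symbolic solver instead of letting the LLM predict the
    # answer token by token. The extraction step is hard-coded here.
    import sympy as sp

    def solve_symbolically(expression: str, variable: str = "x"):
        """Parse an equation such as 'x**2 - 5*x + 6 = 0' and solve it exactly."""
        x = sp.Symbol(variable)
        lhs_text, rhs_text = expression.split("=")
        equation = sp.Eq(sp.sympify(lhs_text), sp.sympify(rhs_text))
        return sp.solve(equation, x)

    # Pretend the LLM has already turned the user's question into this expression.
    extracted = "x**2 - 5*x + 6 = 0"
    print(solve_symbolically(extracted))  # -> [2, 3], exact, not a token prediction

The answer comes from the solver, so it is exact; the LLM's only job is recognising the task and producing the expression.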
-
Friday 27th February 2026 21:57 GMT Gavsky
It's hilarious: there is AN answer; black & white; right or wrong. Except AI can get it wrong. This points to the underlying issue that compromises all AI: it's making decisions based on probability, sifting through the 'stuff' it's trained on.
We know there's a 1 or 0 result; AI can return a 0.756 result because that's what it determines is probably correct. Some of its training material might be utter BS, because: humans. But AI can't discern that - it's not intelligent.
-
Saturday 28th February 2026 18:00 GMT MonkeyJuice
Though there are plenty of other, boring ways to do this that are apparently too useful to pursue. Take AlphaGeometry: it uses a transformer, but it is trained ENTIRELY on path-optimal runs from its symbolic theorem prover. All the neural part is doing is acting as a heuristic function in your good old-fashioned state-space search. If you do it this way, you get a system that is sound - i.e. IF it produces an answer THEN it is guaranteed correct. You can run that pretraining from scratch on a 3090 in a weekend - this isn't even an energy hog.
However this is all too uncool for anyone, because it's domain-specific and "NOT AGI ENOUGH", and all the large academic institutions have devolved into dicking around with prompting GPT5, and then being unable to reproduce each other's results on systems they don't even get to look inside or know how they are trained.
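For the curious, a toy sketch of the pattern described above: a plain best-first search where the heuristic is just a pluggable scoring function (the slot a trained model would fill in an AlphaGeometry-style system), and only a separate goal check is allowed to declare success. The graph, scores and goal test are invented for illustration:

    # Toy best-first search with a pluggable heuristic.
    # In the setup described above, a trained model would play the role of
    # `heuristic`; the rest is ordinary state-space search, and only the
    # (sound) goal check can declare victory.
    # The graph, scores and goal test below are invented for illustration.
    import heapq

    graph = {
        "start": ["a", "b"],
        "a": ["c", "goal"],
        "b": ["c"],
        "c": ["goal"],
        "goal": [],
    }

    def heuristic(state: str) -> float:
        """Stand-in for a learned scorer: lower means 'looks closer to a proof'."""
        return {"start": 3.0, "a": 1.0, "b": 2.0, "c": 1.5, "goal": 0.0}[state]

    def is_goal(state: str) -> bool:
        """Stand-in for the symbolic checker: only it confirms a real answer."""
        return state == "goal"

    def best_first_search(start: str):
        frontier = [(heuristic(start), start, [start])]
        visited = set()
        while frontier:
            _, state, path = heapq.heappop(frontier)
            if is_goal(state):
                return path          # accepted only because the checker said so
            if state in visited:
                continue
            visited.add(state)
            for nxt in graph[state]:
                if nxt not in visited:
                    heapq.heappush(frontier, (heuristic(nxt), nxt, path + [nxt]))
        return None                  # no answer at all rather than a wrong one

    print(best_first_search("start"))  # -> ['start', 'a', 'goal']

A bad heuristic only makes the search slower or makes it give up; it can never make the system bless a wrong answer, which is where the soundness comes from.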
-
-