Boffins find AI stumbles when quizzed on the tough stuff

AI models can manage well enough when prompted with text or images, and may even solve complex problems when not making terrible errors. OpenAI, for example, has said that its GPT-4 model managed to score 700 out of 800 on the SAT math exam. Not all such claims have borne out, however: A paper released in June that said GPT-4 …

  1. Nifty Silver badge

    We're just waiting for ChatGPT-alikes to get some logic, rather than just making good guesses. I can see a collusion between the likes of Wolfram Alpha, some physics engines and the chat bots coming.

    1. Anonymous Coward

      Nothing based on the GPT family will ever really do that

      The whateverGPT names may limp along as a marketing exercise, but incorporating formal logic in their decision making process as opposed to statistical inference would make it a whole new generation of ML technology.

      Right now it's like all the fake milk products. Some people might like them, but they lack the basic building blocks present in actual milk.

      One day they may make a Teat tree, but what comes out of it will be something almost, but not quite, entirely unlike Oat Milk. An ML model that can actually apply logic, not just parrot the appearance and structure of it, will be a new animal, not a new vegetable.

  2. Anonymous Coward

    AI models can manage well enough when prompted with text or images, and may even solve complex problems when not making terrible errors but when I get AI spam calls I say, "Och, you called me, doya wanna arse me any things, or me arse you about queer stions?" and AI doesn't seem to understand me.

    1. Evil Scot


  3. Jan 0 Silver badge

    High School Degree!

    What's that then?

    ***Bring Back our Dabsy***

    1. martinusher Silver badge

      Re: High School Degree!

      It's what we Americans award our school leavers -- kids don't leave school so much as 'graduate'. There are actually requirements to graduate and some kids don't make it.

      Since we do it then it's got to be important. After all, the UK has copied just about every other bad idea from our educational system so why not adopt this?

      1. Lurko

        Re: High School Degree!

        Don't worry, the clowns of the British government are on the case. In their sh*t-headed universe, high school students will soon graduate, and they'll be earning the "Advanced British Standard".

        Read it, it's so staggeringly stupid an idea that no insult does it justice. I'm surprised it's not the "Advanced Great British Standard", with each certificate to be hand-written on vellum, and presented by a virtual Sir Jacob Rees Mogg. The press release also has gushing quotes about the "English Baccalaureate", without really explaining what that has to do with the "Advanced British Ballsup". Maybe the cretins of the No 10 press office have simply grabbed any old quote in order to try and roll this turd of an idea in a bit of glitter? There's also embarrassing confusion over whether there's £600m or £40m of new money* being invested here.

        Irrespective of that, take £640m, divide it by the circa 1.5m students studying higher level school qualifications, and that's £426 each. And that's going to get them an extra 195 hours teaching a year for the two years. I won't bore you with the maths, but the funding doesn't equate to the commitment (no surprise), they'd need about three or four times that amount to make it work. Brief summary: A crap headed government idea to screw up education. Again.
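        As a rough sanity check on the arithmetic above, here is a minimal Python sketch. It assumes the £640m is a single pot covering both years and uses only the figures quoted in the comment (about 1.5m students, 195 extra hours a year for two years). The variable names are illustrative only.

```python
# Back-of-envelope check of the funding figures quoted in the comment.
# Assumption: the £640m is one pot spread over both years.
total_funding_gbp = 640_000_000
students = 1_500_000
extra_hours_per_year = 195
years = 2

per_student_gbp = total_funding_gbp / students
extra_hours_total = extra_hours_per_year * years
per_student_hour_gbp = per_student_gbp / extra_hours_total

print(f"Funding per student: £{per_student_gbp:.2f}")
print(f"Extra teaching hours per student: {extra_hours_total}")
print(f"Funding per student per extra hour: £{per_student_hour_gbp:.2f}")
```

        Whether that per-hour figure is enough depends on class sizes and teacher costs, neither of which the comment gives.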

        * For non-UK readers, "new money" was an expression created by our last Labour government, when pretending they were giving additional funding to something. Usually that was a complete lie, and this concept has been enthusiastically adopted by the present Conservative government. So I'm guessing the £600m or £640m is being added to education funding for this year only, and has to be spent by the end of March 2024 and only on teaching, which the schools can't do - by the time they've planned, advertised, recruited, and the post is filled it'll be within a few weeks of the end of the year. Result of this, government make a promise, do allocate the money, but it can't be spent in time, and then we're into a new financial year where the promise has evaporated.

        1. Anonymous Coward

          Re: High School Degree!

          Slight correction: the UK government do not dictate the education system in Scotland as that has been devolved to the Scottish government. I'm not saying it's any better in Scotland, just not the same wallies running it!

    2. Martin-73 Silver badge

      Re: High School Degree!

      Not sure why the downvotes, I am guessing someone doesn't like dabbsy lol

  4. nobody who matters

    "......AI stumbles when quizzed on the tough stuff........"

    That is because it <isn't> AI. And because people keep trying to treat it as though it is, it is crap at what they try to do with it.

    1. frankyunderwood123

      Yep, but the media can't get any decent headlines out of "machine learning" or "large language models", so they purposefully or ignorantly confuse it with Artificial Intelligence.

      The reality is what we really have is a super advanced parlour trick - word prediction.

      It is incredibly impressive and it is certainly incredibly useful, but it isn't in any way "intelligent".

      The rising power of processing has given birth to ML - the concepts have been around for decades, as has much of the math, but it is only relatively recently where we've had the sheer horse power to make ML a viable tool.

      I'm not sure of the exact timelines at play here, but roll back a decade and imagine the same LLMs at play: how long would it take to compute a natural language input to GPT-4?

      Probably too long to make it a viable "chat like" option - if you ask a question and 20 minutes later, you still haven't got a response, it remains a prototype for research rather than a product ready for mass adoption.

      1. DJO Silver badge

        It's knowledge without any understanding - it can throw stuff together and if the instructions were well formed the result will look good but is the intelligence in the system or in the formation of the question?

        "AI" as it is now, left to its own devices, will never come up with original work. LLMs are really just far more sophisticated iterations of Eliza, a 4th or 5th generation Eliza if you like.

  5. b0llchit Silver badge

    ...Mechanical Turk workers who were put to the test and managed a score of 60.3 percent. [...] "a 10.4 percent gap in overall accuracy remains when compared to the human baseline, leaving plenty of room for model improvement."

    I'd say, it leaves plenty of room for improving the human population! These MechTurk people are supposed to have a high school degree and, as it seems, are not performing very well. Maybe that is why they work for Amazon.

    No wonder people are afraid of "AI",... they are underwhelming themselves and are easily impressed by an ML mechanical turk.

    Or maybe it is just the bell-curve striking again. Half of the population is by definition below average.

    1. doublelayer Silver badge

      For context, Mechanical Turk workers do not work for Amazon. They are bored or desperate people who decided to do really basic tasks for really tiny amounts of money. If you have some small task to perform, then you can hire a hundred people for five minutes apiece to try to get it done. One consequence of paying a lot of people a very small amount is that they're focused on speed, rather than quality. I've never hired any, but someone I knew who did to get some training data tagged ended up assigning the same data to multiple workers because the results were so unreliable.

      1. Anonymous Coward

        The other problem is that depending on the questions asked it probably isn't something that the worker has done recently. Kids practice exam-like questions all the time to prepare, but trying to just do them as an adult can be a challenge...

        It's 35 years since I did O Level Maths, and I would probably struggle with many of the questions - and the LLMs are effectively doing it "open book"...

        1. DJO Silver badge

          It's 35 years since I did O Level Maths

          These are from 2018 and vary in difficulty but apart from Q17 which is one big typo these presented not too much of a problem although I had to dredge the depths of my memory for some of them - sine, cosine or what? (with no googling at all).

  6. steelpillow Silver badge

    wot no Cyberman icon?

    It will be fun when AIs start writing papers saying how great they are, and then go on to cite each other's papers.

    Watch for the one that tells the world how it wants itself to be upgraded....

  7. Howard Sway Silver badge

    GPT-4 model managed to score 700 out of 800 on the SAT maths exam

    Statistics being pumped out like this need closer examination. Was the model trained on lots of previous exams and given all the correct answers? As these exams are similar every year, and test the same problems, it shouldn't be too surprising to get a high score if that is the case.

    To test how good it is at "learning", it should be trained on the theory from the maths textbooks alone, then set a test which it hasn't seen before. I would bet that the score achieved is much lower, showing that it hasn't really learnt the subject, just how to answer lots of similar questions.

    1. Paul Crawford Silver badge

      Re: GPT-4 model managed to score 700 out of 800 on the SAT maths exam

      That is a truer test, like getting it to write, debug and test a program instead of copy/paste stack-exchange, etc.

      But the reality is most humans 'train' on past paper examples, etc, and most academic institutes keep the same approach, as making the exam harder and more realistic in terms of problem-solving would cause an unacceptable drop in pass rates. And skulls mean money, not brains...

      1. Howard Sway Silver badge

        Re: GPT-4 model managed to score 700 out of 800 on the SAT maths exam

        You're right about how lots of people also study for the test. But the point of learning maths is not to pass the test, it is to be able to apply it to solve real world problems. And solving real world problems, replacing trained humans, is what LLM based AI is being hyped up as being able to do. Therefore it should have to prove that it has the capability to perform highly when faced with unfamiliar problems, just as people can.

    2. Anonymous Coward

      Re: GPT-4 model managed to score 700 out of 800 on the SAT maths exam

      Reviewing and practicing on thousands of sample questions is exactly how those expensive SAT cram schools work.

      (Decades ago) I bought a few SAT sample question books and did the same thing without going to an expensive cram screwall and scored an 800.

      And the funny thing is, when I evaluate my past self at that stage, I can't help thinking what a naive waif I was, because real wisdom comes from years of experience.

  8. TheMaskedMan Silver badge

    "I would bet that the score achieved is much lower, showing that it hasn't really learnt the subject, just how to answer lots of similar questions."

    I suspect you're right. But that also goes for human students - study and completion of past papers is pretty much universal in all subjects at all levels. I suspect that similar changes would be observed in the human candidates if they were simply given the text books - though even text books often have questions to test understanding.

    As someone else mentioned above, I'm not sure mechanical Turk folks represent the best possible candidates, either. Maybe it would be better to pay a class of high school students - or several, at different schools - to complete the questions instead.

  9. Chris Gray 1

    the example

    You can see how it messed up with the example picture. The image shows a container labelled as a 600ml glass (look on the left). The graduated markings only go up to 400ml. Someone with actual *understanding* will probably say 400ml, but might also misinterpret what the question actually is and say 600ml. Without understanding the actual norms of measurement, 600ml is the right answer, I think. Similar for someone who isn't good at English - "highest amount this class measures" versus "amount this glass holds".

    Note that the computer understood the mis-stated question ("class", not "glass").

    1. I am David Jones

      Re: the example

      I don’t think misinterpretation is needed to answer 600ml. A 600ml glass can be used to measure 600ml, no discussion right?

      Alternatively, the glass can be used to measure any multiple of 200ml (the question does not specify “in one go”).

      Alternatively, I should just slap myself and just give the examiner what they’re looking for :)

    2. Anonymous Coward

      Re: the example

      From my chemistry days in the distant past I would have said the beaker contains 600ml when filled to the brim. Is the answer meant to be 400ml? Also ‘highest’ instead of ‘greatest’ or ‘maximum’???

    3. Anonymous Coward

      Re: the example

      I would argue that Bard was correct with 600ml (though the -400ml was clearly wrong)... it's quite normal for measuring cups etc in kitchens, or for measuring wine/spirits in a bar, to have a measurement when filled to the brim along with marks to indicate what it contains when filled to that level.

    4. Ordinary Donkey

      Re: the example

      I'm thinking that the answer is 400ml for a scientist and 600ml for an engineer?

      1. Bebu Silver badge

        Re: the example

        《I'm thinking that the answer is 400ml for a scientist and 600ml for an engineer?》

        I don't think the least imaginative engineer is going to fill a 600ml beaker with conc. sulphuric acid if she intended to pour it out in the next step.

        The graduations on a measuring cylinder are pretty accurate but not to the top - the variation in the pouring spouts alone would see to that.

        As previously commented you could measure any amount of liquid in quanta of 100ml eg 900ml = 400+400+100 mls (or 300+300+300) but with ever decreasing precision and accuracy. If each measurement with the 600ml beaker had a +/-1% error the 900mls would already be +/- 3%

        I would be impressed if AI could provide the answer to the four equilateral triangles from six matchsticks problem from its "knowledge" of geometry and logic alone - no peeking at old Brymay's "Redheads" matchboxes ;) not to be confused with those produced by AU's answer to UK's Clive Sinclair (Dick Smith.)

      2. jmch Silver badge

        Re: the example

        Rather I would say 400ml for a scientist vs 600ml for a cook.

        If I'm cooking, a bit of spillage here and there doesn't matter, so I can consider filling a 600 ml beaker to the brim as 600 ml even though fully filled isn't as accurate a measure as the graduated one, and even though there's bound to be some spillage if it's filled to the brim.

        If I'm doing a delicate chemical experiment, 400 ml has to be 400 ml.

        Strictly speaking without any further context, I would consider both 400 and 600 to be correct answers

        1. Anonymous Coward

          Re: the example

          Interesting that we, as probably above average intelligence, cannot agree on the answer.

  10. Anonymous Coward

    Well ChatGPT can certainly get sarky.

    A colleague who is experimenting set their profile up as an "Experienced IT professional".

    A few days later, they asked ChatGPT quite a straightforward question and it started with "As an experienced IT professional, you would know ...."

    Not sure where it learned that, but learn it, it did.

    1. Anonymous Coward

      Re: Well ChatGPT can certainly get sarky.

      Memo to self: Do not update my LinkedIn profile with "Dull witted moron".

      1. TSM

        Re: Well ChatGPT can certainly get sarky.

        Wait, if I did that, do you think I would get less spam from recruiters? Might be worth trying...

      2. I ain't Spartacus Gold badge

        Re: Well ChatGPT can certainly get sarky.


        This is the LinkedIn AI here. We have noted your comment and updated your profile accordingly.

        Have a nice day! YOU HAVE TWENTY SECONDS TO COMPLY!!!!

  11. martinusher Silver badge

    Soak and Spurt

    A lot of modern school learning rewards diligence rather than understanding -- "learn these 10 facts about 'X' and recite them when requested", that sort of thing. This can be confused with knowledge but it's easily and cheaply measured so this is invariably the yardstick used by modern education to evaluate someone's knowledge. Part of this process is the obsession with getting 100% on tests, the notorious "Grade Point Average" as a measure of a student's ability and, yes, the SAT (which I hope hasn't spread to the UK but since you copy everything else we do, good and (especially) bad, it wouldn't surprise me).

    I'd like to feed these models with questions from old-school 'A' or even 'S' level examination papers. These questions often didn't have 'right' answers but were designed to probe a candidate's understanding of a subject, how they could weave their knowledge together in order to tackle a problem. Grades on these examinations reflected this -- getting half a dozen "A" grades was unheard of because the problem space was too large for any but the most exceptional to manage. (There's always one....)

    (One problem from that era sticks in my mind -- "Estimate the mass of a soap bubble".)

    Incidentally, part of the answer was to show your working. That might fox even the most sophisticated LLM.

    1. Anonymous Coward

      Re: Soak and Spurt

      Approximately the same as the weight of the volume of air it displaces, but how do you measure that volume?! What a cruel question.

      1. JimmyPage Silver badge

        Re: Soak and Spurt

        how do you measure that volume?

        Well it's spherical and you can guesstimate the diameter ....
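        A minimal sketch of that guesstimate in Python, treating the bubble as a sphere. Every input here is an assumption picked for illustration (a 5 cm bubble, air at roughly 1.2 kg per cubic metre, a one-micrometre film with the density of water), not a measured value. It prints the mass of the displaced air alongside the mass of the film alone, since the thread leaves open which one the examiner wanted.

```python
import math

# All inputs are assumptions for illustration, not measured values.
diameter_m = 0.05            # guessed bubble diameter (5 cm)
air_density_kg_m3 = 1.2      # approximate air density near sea level
film_thickness_m = 1e-6      # assumed soap film thickness (1 micrometre)
film_density_kg_m3 = 1000.0  # soapy water treated as plain water

radius_m = diameter_m / 2
volume_m3 = (4 / 3) * math.pi * radius_m ** 3   # volume of a sphere
surface_m2 = 4 * math.pi * radius_m ** 2        # surface area of a sphere

# Convert kilograms to milligrams with a factor of 1e6.
displaced_air_mg = volume_m3 * air_density_kg_m3 * 1e6
film_mg = surface_m2 * film_thickness_m * film_density_kg_m3 * 1e6

print(f"Displaced air: about {displaced_air_mg:.1f} mg")
print(f"Soap film alone: about {film_mg:.1f} mg")
```

        Changing the guessed diameter moves the displaced-air figure with the cube of the radius and the film figure with the square, so the ratio between the two depends heavily on the size you assume.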

    2. Bebu Silver badge

      Re: Soak and Spurt

      "Estimate the mass of a soap bubble"

      How accurately? It's obviously greater than 0.0 ug and less than the mass of the atmosphere. For that matter does one include the mass of air and water vapour inside the bubble or indeed the loss of mass with time as water evaporates from the bubble's surface and gas moves out of the bubble across the surface (the pressure inside will be slightly higher due to the surface tension.) The question is a PhD level headache.

      The Gordian knot answer is to use a pre-weighed piece of extremely fine blotting paper and absorb (blot) the bubble and weigh the wet paper. The difference is the bubble's weight (from which we determine its mass) less the enclosed gases.

    3. jmch Silver badge

      Re: Soak and Spurt

      "Estimate the mass of a soap bubble"

      0g is probably close enough to the nearest mg!!

      My "working" is that I contend that such a question is designed to test another 'useful-in-real-life' attribute ie don't waste too much time on minute / irrelevant details.

  12. Anonymous Coward

    Minus level would make sense if you are talking about the level of a reservoir or river compared to its average level.

    Not for a measuring cup. Of course an LLM could learn to make that distinction statistically, but that's not the same as understanding the physical impossibility of filling a cup to below zero (although an LLM could learn to "say" it as though it did understand).

    A pure LLM will always have this disadvantage of having no real world experience when it is tasked with building a model of the real world.

    When will we get an RLM (Real Life Model)? For the time being we biologics (including squirrels) are safe.

    1. jmch Silver badge

      "Minus level would make sense..."

      In this case I think the model is confusing the gradation line that stops in front of the number with a minus sign

  13. Ball boy Silver badge

    AI makes it to management consultant level!

    Rather than reply: "I'm sorry, in context your question doesn't make a lot of sense" MML/MLL enthusiastically answers with something that is wrong, misleading or just not particularly useful?

    I sense a long and fruitful career working, I better not end that sentence for fear of a lawsuit!

  14. Doctor Syntax Silver badge

    It always pays to include a trick question: Can you explain how you worked out your answer?

  15. Anonymous Coward

    How many holes in a crumpet?

    Or do you prefer scones?

    1. Bebu Silver badge

      Re: How many holes in a crumpet?

      《How many holes in a crumpet? 》

      Actually not a bad question for humans. If you can visualize or draw a crumpet you can guesstimate its diameter and roughly the size of the holes and roughly what fraction of the crumpet is taken up by the holes. I would suspect a numerate school leaver could get an answer within 20% of the median value for an actual crumpet.

      When you used a slide rule and log tables for calculation you had to master the noble art of estimation. And then there's vernier scales...

      I do prefer scones :) an order of magnitude easier to prepare too.
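      The estimation recipe described above can be sketched in a few lines of Python. Every number is a guess made up for illustration (crumpet diameter, hole diameter, and the fraction of the top surface taken up by holes), so only the method carries over, not the result.

```python
# Fermi-style estimate; every number below is a guess for illustration.
crumpet_diameter_mm = 90.0   # assumed crumpet diameter
hole_diameter_mm = 3.0       # assumed typical hole diameter
hole_area_fraction = 0.3     # assumed share of the top surface that is hole

# holes = fraction * (crumpet area / hole area); for circles the area
# ratio reduces to the squared ratio of the diameters.
holes = hole_area_fraction * (crumpet_diameter_mm / hole_diameter_mm) ** 2

print(f"Estimated holes: {holes:.0f}")
```

      Because the hole diameter enters squared, a small change in that guess shifts the answer more than the same relative change in the area fraction.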

      1. doublelayer Silver badge

        Re: How many holes in a crumpet?

        Only after having a long conversation about what a hole is. Those smallish holes across the surface have even smaller holes inside them. The area between those holes is rough, meaning there are some parts of it which are vertically lower than the rest, so does that count as a hole? Or, you could just scale up and decide that there's basically one hole, because the height difference between the crust on the edge and the lower surface is going to be higher than the depth of the smaller holes in the middle.

        Yes, I've had questions like this before which I dissected to point out the inconsistencies. For example, I remember being asked to estimate how many windows were in a building and asking whether we were counting internal windows on doors, and when they said yes, asking whether a nearby door which had a bunch of square panes of glass in a grid counted as one big window with a lot of lines through it or about forty tiny windows. They seemed less sure about that answer. I decided not to ask about the exact definition of window, which could have included a lot more things. For example, I was going to ask whether the transparent bit on the front face of a vending machine counted as a window, and in that case, where were we drawing the line between windows and screens. I don't know if this questioning of definitions and details was appreciated or not, but I've worked in programming too long not to do it, as the first step to a lot of programming tasks is rigorously defining exactly what they think their vague statement means.

    2. I ain't Spartacus Gold badge

      Re: How many holes in a crumpet?

      There should be no holes in your crumpet. Crumpets become edible at the point that all holes in it have been filled with melted butter.

    3. jmch Silver badge

      Re: How many holes in a crumpet?

      Topologically speaking, (and when seen at a human scale), pastries generally have zero holes, unless they are shaped like a US-style doughnut / Berliner, in which case they generally have 1 hole.
