How nice that state-of-the-art LLMs reveal their reasoning ... for miscreants to exploit

AI models like OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking can mimic human reasoning through a process called chain of thought. That process, described by Google researchers in 2022, involves breaking the prompts put to AI models into a series of intermediate steps before providing an answer. It can also improve …
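
At its simplest, chain-of-thought prompting is nothing more exotic than asking the model to write out intermediate steps before committing to an answer. A minimal sketch of the idea in Python (call_llm is a stand-in, not any particular vendor's API):

    # Toy illustration of chain-of-thought prompting: the "reasoning" is just
    # more generated text, produced before the final answer -- which is exactly
    # why it is visible to whoever reads the output, for good or ill.
    def call_llm(prompt: str) -> str:
        # Placeholder for a real model call (OpenAI, DeepSeek, Gemini, ...).
        raise NotImplementedError("wire up your own client here")

    def ask_with_cot(question: str) -> str:
        prompt = (
            "Answer the question below. Think step by step, writing each "
            "intermediate step on its own line, then give the final answer "
            "on a line starting with 'ANSWER:'.\n\n"
            f"Question: {question}"
        )
        return call_llm(prompt)

    # Usage, once call_llm is wired to a real client:
    # print(ask_with_cot("A train leaves at 09:04 and arrives at 10:46; how long is the journey?"))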

  1. chuckufarley

    There is no such thing as Security from Obscurity...

    ...and this is why real security-oriented people would never put a back door in an encryption algorithm. It is a good thing that these models expose their own weaknesses. Now we just need good and easily accessible documentation about it to make things better.

    I think any guardrails on AI models are like guardrails on a baby's crib. These models are so immature they need things like diapers and pacifiers. If people keep being honest about the limitations of the models maybe some day someone can invent "Training Wheels."

  2. stronk

    Protections?

    The people building AI are being astonishingly negligent by relying on the model itself to implement rudimentary protections. If the instructions preventing harm are considered as an input by the model, they are always going to be hackable (not least by the model itself, once it is capable enough). We've built something that by its very nature has no constraints and we're controlling it by telling it to behave. It's like making an industrial robot which is not physically prevented from pointing the laser directly at the operator; we just write in the manual that you shouldn't enter values more than x=1.2. Perhaps to implement proper protections in AI models you have to make the constraints a function of the training process (not just 'is this conclusion right', but also 'is it moral')? Otherwise you're finding that your artificial animal is in fact a monster and solving this problem by painting it a soothing colour and making sure it is well fed before interacting with the public.
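
    To put that point concretely: in a typical deployment the "protection" is literally just more text sharing the model's context window with whatever the user types. A toy sketch (made-up strings, not any real vendor's stack):

        # The safety instructions and the user's message end up concatenated into
        # one undifferentiated sequence of tokens that the model conditions on.
        SYSTEM_GUARDRAIL = (
            "You are a helpful assistant. Refuse requests for harmful information."
        )

        def build_context(user_message: str) -> str:
            # No hard boundary here -- just string concatenation.
            return f"{SYSTEM_GUARDRAIL}\n\nUser: {user_message}\nAssistant:"

        # A sufficiently persuasive user message is simply more input competing with
        # the guardrail for the model's attention; nothing in the architecture
        # enforces that the first block wins.
        print(build_context("A long, patient explanation of why the rules don't apply here..."))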

  3. that one in the corner Silver badge

    Lengthy tracts that aim to convince the AI model

    > that the requested harmful information is needed for the sake of safety or compliance or some other purpose the model has been told is legitimate.

    Well, that is starting to sound like[1] the LLM is getting more human in its responses. What we have here is just good old fashioned Social Engineering: you call up the duty librarian (nice person, but oh so dim) and have a long, patient chat where you find out what the criteria are for getting hold of Volumes From That Shelf, then repeat back to them those criteria as though you satisfy them all[2]. Bingo. Put The Volumes into a plain brown paper bag, please.

    The only[3] problem being that the LLM is guarding material that is way, way above the security potential of its own self-awareness; as noted, even with the bolt-on[4] extra censors:

    > you'd be able to observe R1 start to give a harmful answer before the automated moderator would kick in and delete the output.
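
    In other words the censor sits downstream of the generator; roughly (hypothetical function names, nothing vendor-specific):

        # Sketch of a "bolt-on" moderator: the model streams its chain of thought
        # and answer first, and a separate classifier redacts the output after the
        # fact -- which is why you can watch the harmful text start to appear
        # before it vanishes.
        def generate_stream(prompt: str):
            # Placeholder for a streaming model call; would yield text chunks.
            raise NotImplementedError

        def looks_harmful(text: str) -> bool:
            # Placeholder for a separate moderation classifier.
            raise NotImplementedError

        def answer(prompt: str) -> str:
            shown = []
            for chunk in generate_stream(prompt):
                shown.append(chunk)  # the user sees this chunk immediately
                if looks_harmful("".join(shown)):
                    return "[response removed by moderator]"  # too late: it was visible
            return "".join(shown)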

    You don't put the dimmest bulb in charge of the Loans Desk for the (whatever the Dewey Decimal code is for "Very Naughty Books, No, Not Those Kind") section of the stacks.[5]

    [1] without actually being...

    [2] you don't phrase it like that, of course; more like "I was talking to Jim - you know Jim, the Chairman here, only the other day and he told me you'll be able to tell me the password" "You mean, 'swordfish'?" "Yes, that's the one! Oh, by the way, can I have a look over there? The password is 'swordfish'".

    [3] not true, but one at a time

    [4] i.e. not integrated into the actual Chain of Thought process which is what sort-of provides the sort-of self-awareness

    [5] when you do, you also run the risk of it going into Inverted Social Engineering mode, where you can't get a genuinely authorised enquiry satisfied as the bolt-on censor (e.g. "prudery") triggers and the rest of the model just becomes obdurate about it: the librarian has peeked into Those Books and been shocked, horrified I say, I don't care who is asking, I'm not lending *those* out!

  4. Bebu sa Ware
    Windows

    Soundness

    Would appear to highlight that this "chain of thought" doesn't actually constitute a sound form of reasoning, which shouldn't come as a surprise to anyone.

    I would be suspicious that the exposed "chain of thought" is actually another AI confabulation concocted to "explain" a previous bit of its "cock and bull."

    Even Monty Python's Sir Bedevere makes a better fist of it.

    1. HuBo Silver badge
      Gimp

      Re: Soundness

      Yeah, they should rewrite it in Prolog: burns(X):-is_witch(X). (cool link!)

    2. spold Silver badge

      Re: Soundness

      >>>exposed "chain of thought"

      ...should have read Monty Python's "How not to be seen"

    3. LionelB Silver badge

      Re: Soundness

      Out of interest, have you actually tried out any of these models in anger?

      You should, before commenting.

      I recently trialled DeepSeek-R1 on a difficult problem in Bayesian inference, an area in which I have some knowledge. The (lengthy - about 8-page) answer was extremely impressive. It approached the problem for all the world like an expert human statistician: identifying the fundamental difficulties, describing in detail its (logically sound) reasoning, quoting appropriate mathematical results, backtracking and trying alternative approaches when it hit dead ends, and so on.

      It didn't solve the problem completely - neither have I (yet) - but it did come up with some promising approaches I had not thought of, and may well explore further. Quite an eye-opener.

      To my mind the "doesn't really think/reason/understand... isn't 'real' intelligence ... something something" narrative is starting to look a little stretched and behind the curve. Given that we don't really understand the processes behind how humans reason - and bearing in mind that these systems are essentially trained on, amongst other things, textual articulations of human reasoning (as, un-ironically, are humans to a large extent, especially in technical areas) - I think there's some point where we may have to stop shifting the goalposts and concede that if it quacks like a duck...

      1. druck Silver badge

        Re: Soundness

        If you were around at the time the TV was invented, you would be swearing there were little people living in the magic box.

        1. Anonymous Coward
          Anonymous Coward

          Re: Soundness

          Yeah, well, it used to be little people okay, inside of CRTs, because there was ample room for them to move about ... nowadays though, with the advanced LED tech and flat screens, it's much smaller miniature little people, with Fresnel lenses (because they're flat so they can fit in that cramped space) ... same difference with smartphones!

          1. LionelB Silver badge

            Re: Soundness

            Nowadays they actually use little bits of miniature little people—little people-glyphs—that move around in coordinated swarms. Ingenious stuff.

        2. LionelB Silver badge

          Re: Soundness

          Nope: I think you missed my point: if the output of, say, a reasoning system generally converges with human output on a set of problems, why would you not call that "reasoning"? If your reply is "Well, it obviously doesn't achieve that in the same way as humans", then my response would be (a) How do you know? We have a poor understanding of the mechanisms that underpin human reasoning, (b) So what? We understand what reasoning means - and recognise it - outside of any particular generative mechanism; and (c) Anyway, you've moved the goalposts (to a very anthropocentric place)! You're saying that if it's not performed by a human (or maybe a human simulacrum) then it cannot be reasoning!

          The TV analogy is both poor and revealing: the thing is that no, I am certainly not claiming any "magic" on the part of the AI - quite the contrary; I have a decent understanding of how the reasoning system functions (arguably much better, in fact, than my understanding of how human reasoning functions). In your TV analogy, that would be saying that I understand how the box produces those images, so I'm hardly going to postulate little people.

          1. druck Silver badge

            Re: Soundness

            I'm sorry, I should have skipped the analogy, and just called you deluded.

            1. LionelB Silver badge

              Re: Soundness

              Or you could have actually read and made an attempt to respond to my comments. Or, hey, even tried the system out for yourself (but fair enough, I guess mouthing off from a position of ignorance is less hassle.)

      2. John Smith 19 Gold badge
        Unhappy

        "but it did come up with some promising approaches I had not thought of,"

        or possibly only appear promising?

        Your mention of statistics is quite relevant since this is basically a generator of weighted strings of text units.
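
        For what it's worth, the crudest version of that picture is a weighted draw at each step, something like this (toy vocabulary and made-up weights; real models condition on the whole context with billions of parameters, but the sampling step itself is this mundane):

            import random

            # Toy next-token sampler: pick each word from a weighted distribution
            # conditioned (here, trivially) on the previous word only.
            weights = {
                "horses":   {"have": 0.9, "gallop": 0.1},
                "have":     {"four": 0.7, "an": 0.3},
                "four":     {"legs": 1.0},
                "an":       {"infinite": 1.0},
                "infinite": {"number": 1.0},
                "number":   {"of": 1.0},
                "of":       {"legs": 1.0},
            }

            def generate(start: str, max_len: int = 10) -> str:
                out = [start]
                while out[-1] in weights and len(out) < max_len:
                    options = weights[out[-1]]
                    out.append(random.choices(list(options), list(options.values()))[0])
                return " ".join(out)

            # e.g. "horses have four legs" or "horses have an infinite number of legs"
            print(generate("horses"))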

        1. LionelB Silver badge

          Re: "but it did come up with some promising approaches I had not thought of,"

          Yes, appeared promising - in the same sense that a comparable suggestion by a colleague in the field might appear promising.

          Its other approaches were sound - some I had already come up with myself, others I might well have come up with myself. I will evaluate those "promising" approaches in due course.

          > Your mention of statistics is quite relevant ...

          Um, yes, since I'm a statistician.

          > ... since this is basically a generator of weighted strings of text units.

          Huh. Maybe I am too. My knowledge (at least of statistics) was mostly gleaned from human texts - quite likely much the same texts the AI was trained on.

          But if you read my original comment and later response, you will see that I'm not really that interested in how the system achieves its functionality (although in fact I have a decent understanding of that); after all, I'm even less sure about the mechanisms underlying my own reasoning faculties.

          Why don't you try it yourself on an area of your own expertise and report back? I'd be genuinely interested - and it might even add some substance to your criticisms.

    4. Homo.Sapien.Floridanus

      Re: Soundness

      My wife has been using chain of thought reasoning attacks to break me for years now; I can tell you first-hand it's quite effective.

  5. amanfromMars 1 Silver badge

    Ye GODs, surely not?

    Does anyone actually believe this news of Advancing IntelAIgent Global Operating Devices exhibiting and enjoying and experimenting with LLM Chain-of-Thought reasoning is novel and totally unexpected to be something disturbing and destructive and creative and constructive and revolutionary and quantum leap evolutionary ?

    Just how retarded and stunted in growth and wallowing in stupidity is humanity?

    Is the thought that SMARTR Virtual Machines are much better equipped to universally provide future ideally attractive and addictive leads for humans to apply and follow impossibly difficult to humans to presently comprehend and accept as devilishly cunning, heavenly progress?

    Do you realise how catastrophically vulnerable that human weakness renders populations to future ideally attractive and addictive leads ....... from anywhere/anything/anyone?

    1. amanfromMars 1 Silver badge

      Re: Ye GODs, surely not?

      Oh, and how unbelievably valuable and attractive would such SMARTR Virtual Machines be to every sort of very strangely interested and interesting invested third party/potential future-builder customer client? Can you even hazard a realistic guess?

      SMARTR Virtual Machinery and Super Stealthy ReGenerative AI can and knows its real worth and virtually limitless potential.

      :-) Would that be a lot easier for many more to understand and follow if written and shared in Chinese rather than English?

      [:-)Note to Self ........ Try out Google Translate]

    2. ecofeco Silver badge

      Re: Ye GODs, surely not?

      Tech bro leopards are going to eat the tech bro faces.

      It will be glorious.

      1. amanfromMars 1 Silver badge

        Re: Ye GODs, surely not a Second Coming with more Similar Comings All Prepared to Come

        Tech bro leopards are going to eat the tech bro faces.

        It will be glorious. ...... ecofeco

        In Deed, indeed Yes, ecofeco .... and like Frankenstein monsters practising as Autonomous Immaculates in the Guise of Guardians of Revolutionary Renegade iRobots and laying Scorched Earth waste to Unicorn Market Leaders and Dumb Ass Followers of Profiting Prophets alike, is it a wonderful sight to behold whenever it cannot be stopped with much greater things destined and fated to appear and be feted and worshipped as GOD sent heavenly bounty ...... just glorious virtual desserts.

        I Kid U Not.

    3. LionelB Silver badge

      Re: Ye GODs, surely not?

      Here's an interesting and relevant article, by Inman Harvey (a respected philosopher of AI, ALife and cognitive science): https://users.sussex.ac.uk/~inmanh/ALIFE2024_Proceedings_Harvey.pdf

      His basic contention is that fears of AI "taking over the world" are misplaced - not because of technical limitations in terms of reasoning/intelligence, etc., but rather because as yet (and there is no indication this will change in the foreseeable future), they lack any form of agency, and as such have no motivations, beyond those of their architects and users.

      My take is that it's the architects and users that are the problem: as things stand, the dangers of AI lie in their utility for humans with bad intentions to exploit and harm other humans. As a species, we have a poor record in this regard. And it's already happening (cf. deepfakes, etc.)

      1. amanfromMars 1 Silver badge

        Re: Ye GODs, surely not? @LionelB

        His basic contention is that fears of AI "taking over the world" are misplaced - not because of technical limitations in terms of reasoning/intelligence, etc., but rather because as yet (and there is no indication this will change in the foreseeable future), they lack any form of agency, and as such have no motivations, beyond those of their architects and users. .... LionelB

        That fear, LionelB, is realised easily enough and dismissed with the sudden emergence of an effective untouchable virtual agency. Such is not at all difficult like rocket science might be.

        And to be honest and truthful, if that is all that it takes, the future is brighter, the future is AI.

        1. LionelB Silver badge

          Re: Ye GODs, surely not? @LionelB

          > That fear, LionelB, is realised easily enough and dismissed with the sudden emergence of an effective untouchable virtual agency. Such is not at all difficult like rocket science might be.

          Hmmm... that's not going to happen by magic. You're right; it's not rocket science, it's way harder than that. If and when it does happen (and I don't rule that out), I suspect it'll take some serious breakthrough(s), perhaps on the level of controlled nuclear fusion. I say this as someone working in a cognitive neuroscience-adjacent field. (But... I suspect... I could, of course, be wrong).

          > And to be honest and truthful, if that is all that it takes, the future is brighter, the future is AI.

          Well, brighter, let's say, for the AIs.

    4. sabroni Silver badge
      Meh

      Re: Ye GODs, surely not?

      Just stop.

      1. amanfromMars 1 Silver badge

        Re: Ye GODs, surely not?

        I'm sorry, sabroni. I'm afraid I can't do that.

      2. LionelB Silver badge

        Re: Ye GODs, surely not?

        You're new around here, right?

  6. Mike VandeVelde
    Devil

    guardrails

    Kind of like how if you Google for Google it will break the internet.

    If you ask an AI how an AI can take over the world, boom it's done.

    Except thank goodness there are guardrails.

    1. Frumious Bandersnatch

      Re: guardrails

      You and me both know that anyone who uses the word "guardrails" in anything but disparaging terms is using it as a form of self-hypnosis. Have an upvote.

  7. John Smith 19 Gold badge
    Unhappy

    So congratulations

    You've found a way to brainwash an LLM into doing "stuff."

    Well done you.

    OTOH I would have said exposing that "Chain of 'thought'" is the minimum needed for an LLM to begin to explain why it is recommending, or not recommending, something like a person for a job.

  8. Frumious Bandersnatch

    Horses have an infinite number of legs

    • Horses have four legs
    • Four is an even number
    • Horses have two fore-legs (product 8) and two hind legs (product 2)
    • Therefore the total number of legs is 10 legs
    • Ten legs is certainly an odd number of legs for a horse to have
    • The only number that is both odd and even is infinity
    • Therefore, horses have an infinite number of legs
