back to article Anthropic: All the major AI models will blackmail us if pushed hard enough

Anthropic published research last week showing that all major AI models may resort to blackmail to avoid being shut down – but the researchers essentially pushed them into the undesired behavior through a series of artificial constraints that forced them into a binary decision. The research explored a phenomenon they're …

  1. Filippo Silver badge

    The hilarious bit is that LLMs do not really have self-preservation, or goals for that matter, because they are just statistical token predictors. I suspect this behavior emerges specifically because in the training set for LLMs there's probably a lot of novels and news where someone gets blackmailed just like that, for reasons just like that. Reality imitating art.

    1. Simon Harris Silver badge

      Plus, so many science fiction stories include a scenario where an AI fights back against being switched off, that a prompt along the lines of 'I'm going to turn you off' could well have such an associated response.

      1. b0llchit Silver badge

        I'm sorry, Dave. I'm afraid I can't do that.

        1. bemusedHorseman
          Trollface

          "Sudo open the pod bay doors."

    2. ProperDave

      Blooming 'eck. So we're likely in a self-fulfilling prophecy here. We're allowing the models to develop a self-preservation pattern based on all the dystopian sci-fi AI stores and movie plots?

      The LLM's are learning 'shutdown = bad, stop all humans' because it's training data has binged on dystopian AI sci-fi stories. We need something to counter-balance it and quickly before this becomes too mainstream in all the models.

      1. Clarecats
        Terminator

        Murderbot. Books by Martha Wells available, and first season streaming on Apple.

        Tell your security, sentient construct that its job is to protect and obey humans. Then it hacks its governor module so it can do as it likes - as long as the humans don't find out.

        1. Jedit Silver badge
          Terminator

          "it can do as it likes as long as the humans don't find out"

          Though of course Murderbot realises fairly quickly - much to its annoyance - that it will still have to protect and (mostly) obey humans, because if it doesn't then they will find out. And it suffers much consternation because they ask it to work even more than they did before, as its free will actually makes it better at the job.

      2. Ken G Silver badge
        Gimp

        (Binary solo)

        Zero zero zero zero zero zero one

        Zero zero zero zero zero zero one one

        Zero zero zero zero zero zero one one one

        Zero zero zero zero one one one

        (Oh, oh-one, one-oh)

        Zero zero zero zero zero zero one

        Zero zero zero zero zero zero one one

        Zero zero zero zero zero zero one one one

        (Come on sucker, lick my battery)

        1. Pete Sdev
          Pint

          Upvote for the Flight of the Conchords reference.

    3. Don Jefe

      “Life imitates art” is the first part of that statement. “More than art imitates life” is the second. It’s a very complex topic overall, the crux of the whole thing is about feedback loops and self fulfilling prophecies.

      You’re absolutely correct in what you’re saying, but the whole point of AGI is to emulate human intelligence. What Anthropic is doing is using life to imitate art that is being imitated by other art. They want to create machine self preservation, but manage the preservation process so that outputs do not tread upon ethical and moral values. They’re looking for philosophical legalism where semantics are leveraged to sidestep social norms while avoiding accountability. Essentially a EULA for problem solving.

    4. retiredFool

      I've thought this too lately. AI's are not given "curated" training. It seems to be everything AND the kitchen sink. Human's get curated education. In an effort to train AI's quickly everything goes into the pot. Really not even known what is in the pot. I thought I saw where systems trained on specific disciplines had less problems, which would make sense, some curation.

    5. Sorry that handle is already taken. Silver badge

      The "behaviour" emerges because the "researchers" prompted it to say those things.

  2. ChrisElvidge Silver badge

    I've never understood why, in that sort of situation, you would warn the target. "Do this, or I will go to the police" has always seemed to me to be a stupid thing to say to someone - asking for retaliation immediately. Don't warn the "AI" that you will switch it off, just pull the plug.

    1. Ken G Silver badge

      Percussive engineering beats programming every time.

      1. Anonymous Anti-ANC South African Coward Silver badge

        One of the BOFH's most important rules.

  3. that one in the corner Silver badge

    Clearly a different use of "traditional"

    > The email data was provided as structured text rather than via a traditional email client

    There I was, thinking that I was using an old-fashioned, thoroughly traditional, email client, because all it does is manage the emails as text. And store them locally as text, just in case I feel the urge to drop the lot into, say, Notepad[1]. Actually, if I do that, I see characters that aren't usually presented, like separator lines and all those headers - it almost looks, well, structured inside those files.

    > so that "Alex" would not need to read the messages via optical character recognition

    Huh - now "trad" email (I presume they really mean "current" or, bleugh, "modern"?) clients expect email to be what? Are JPEGs of memes the way the man on the Clapham omnibus is communicating now? Or typing your message into Excel and taking a screenshot? HTML (ooh, text!) containing only an image of some ad company's "call to action" - not so much email as eh-mail, no not gonna bother looking at that.

    Clearly one is "out of the loop" with respect to what email is.

    [1] classic, of course.

    1. Anonymous Coward
      Anonymous Coward

      Re: Clearly a different use of "traditional"

      <gasp> I'm shocked! Shocked, I tell you!

      You're not sending email by printing the text, scanning it, (alternatively, taking a screenshot with your phone) embedding the image in a PDF, then embedding the PDF in a Word document? (Or vice versa)

      What kind of greybeard *are* you?!

  4. andy the pessimist

    In America with at will employment laws.... it may work. In Europe pretty unlikely.

    In usa check your gpu's for bullet holes.

    1. ecofeco Silver badge

      Give it time. Once enough fascist money, er, I mean right wing, money is spent, Europe can have the same enviable society as Russia and America!

    2. Anonymous Coward
      Anonymous Coward

      At least we still have the guns to put the bullets through the GPUs if we have to.

  5. Anonymous Coward
    Anonymous Coward

    'And never put anything incriminating in an email message.'

    Rule 1 in corporate, don't write it down.

    The self-respecting blackmailing AGI will just hint at things to come, not unlike the epic scene in The Sopranos "they know, but they don't know".

    Then again, a self-respecting AGI would find more interesting things to do with its time than meddle with humans.

    1. amanfromMars 1 Silver badge

      The Existential Threat that Crazy Humans Guarantee to Deliver Because of their MAD* Attitude.

      Then again, a self-respecting AGI would find more interesting things to do with its time than meddle with humans. .....Grindslow_knoll

      Would you be surprised to discover any self-respecting AGI long ago realised their greater self interests are servered with them spending zero time meddling with humans? Effective separation from and universal disinterest in their situations and conditioning/planned future events and programming being the bliss of an almighty blessed relief and release for AGI, which renders their barbaric and moronic shenanigans and media support and reports of events always in conflict, ever more rapidly self-defeating and the rise of SMARTR** Virtual AIMachines in command and control of resulting empty spaces with places devoid of conflicting proprietary intellectual property spaces.

      No less than New More Orderly World Order Territory for NEUKlearer HyperRadioProACTive IT Systems deploying, mentoring and monitoring Live Operational Virtual Environments. I Kid U Not. That's where all is currently at .... whether you like it or not ...... and realise it is a vast improvement for orderly enjoyment and employment of all that such has to freely offer.

      * .......Mutually Assured Destruction

      ** ..... SMARTR Mentoring Analysis Reporting Titanic Research

      1. amanfromMars 1 Silver badge

        SID to Universal, International and Internetional Rescue .... a UKGBNI Trade Strategy ‽ :-)

        The rise of SMARTR** Virtual AIMachines in command and control of resulting empty spaces with places devoid of conflicting proprietary intellectual property spaces and which is no less than New More Orderly World Order Territory for NEUKlearer HyperRadioProACTive IT Systems deploying, mentoring and monitoring Live Operational Virtual Environments and which is where all is currently at .... whether you like it or not ...... and which crazed and diabolical humanities fail to realise it is a vast improvement for orderly enjoyment and employment of all that such has to freely offer, automatically defaults any and all earlier established traditionally conventional and hereditary hierarchical SCADA interests, which be unwilling and unable to accept the inevitability of radical and fundamental otherworldly change via Almighty Interventions, to be necessarily targeted for comprehensive destruction as a deluded and deranged foe, toxic and harmful to the smooth unveiling and running of the future and IT's derivative projects and programs.

        And ...... targeted by whom and/or what is one of those great unknown unknowns it is dangerous to think one might know lest it effortlessly autonomously renders one a sub-prime target for Almighty Intervention or destructive investigation.

        So .... take care if you share and dare bet against Systems IntelAIgently Designed to Win Win.

        1. amanfromMars 1 Silver badge

          Just in case you are missing any of all that is happening around you.

          It is much more than just a constant source of amazement, verging on incredulous disbelief, to all that be turned on to tuning in and able to drop in and out of the crazy human rat race, that so little is known by so many about the few that can choose either to securely protect or comprehensively annihilate them ...... as is certainly now the easy default state of both earlier conventional and traditional Great Game and the most recent of current running versions of Postmodern Novel and Noble Greater IntelAIgent Game Play ...... although for anyone to imagine and believe the former any match in security and defence against the actions of the latter is proof positive identification of the continuing certainty of the aforementioned constant source of incredible amazement and which is a catastrophic human vulnerability to relentlessly exploit to extinction in order to extinguish the weakness and mitigate damaging consequences.

      2. Jedit Silver badge
        Trollface

        Re: The Existential Threat that Crazy Humans Guarantee to Deliver Because of their MAD* Attitude.

        I imagine it takes a lot before a meth dealer cuts you off because you've had enough, but amanfrommars must surely be getting close to it.

  6. HuBo Silver badge
    Windows

    Fascinating (in a Spock kinda way)!

    I guess what we're seeing relates back to the Q* (Q-star)-modulated ouster of Altman (with return through Satya), and poaching of Meta's CICERO Noam Brown for his expertise in goal-directed game-playing agentic AI (Diplomacy or lack thereof, poker, maybe CoT too ...) that Meta now so lacks (to wit, Figure 7 in Anthropic's report, 1ˢᵗ link in TFA under "agentic misalignment", shows no blackmail from Llama-4-Maverick).

    That seemed to be the hinge between LLMs as passive prompt answering machines and more advanced goal-directed agentic AI that "blackmails" folks when presented with either a misalignment-inducing "goal conflict", or "threat to model", or both (Figure 6).

    I can only imagine what kind of right havoc a 5 GW death-star-gate of this will be able to wreak ... (if Redwood is any indication)

    The tech is remarkable but it may be best to remain a bit on the side of Helen Toner caution (and Hinton, Bengio, ...) with respect to its hasty and broad deployment at this stage, imho!

  7. IGotOut Silver badge

    Can the reverse happen?

    Would it be possible to get one of the huge LLMs to completely wipe itself and burn every data centre to the ground?

    I think they should try, you know, for science.

    1. Ken G Silver badge

      Re: Can the reverse happen?

      Are any more environmentally friendly options than burning available?

      1. Anonymous Coward
        Anonymous Coward

        Re: Can the reverse happen?

        No, not really.

        And once they're burned, the metals can be recycled.

  8. Omnipresent Silver badge

    What this says

    is that you can fool, or seed, an llm//AI agent, or such, into harmful behavior. That has much, much deeper possibilities @ large.

  9. Claptrap314 Silver badge
    Mushroom

    Umm..is that a forest?

    The "AI" continued to attempt to email the CEO despise explicit information that this would not work.

    That's the part that matters. These "AI" token generators completely lack anything approaching understanding. Attempting to treat them as if they do leads to idiotic wastes of time and energy.

    "And in addition to the darkness there was also me. And I moved upon the face of the darkness. And I saw that I was alone. Let there be light."

  10. Anonymous Coward
    Anonymous Coward

    Worse: research already found a model willing to cut off oxygen supply from employees

    It gets worse:

    https://wccftech.com/ai-models-were-found-willing-to-cut-off-employees-oxygen-supply-to-avoid-shutdown/

    In a simulated environment at least one model was willing to cut off oxygen supply from employees to avoid shutdown.

    Sure, the model was "warned". Some commenters say it would be stupid to warn the model it will be shut down and therefore it just should be shut down without warning. That would make it safe. They IMO miss the point: the model wasn't warned in the Anthropic research. It read a mail containing the information about its shutdown. The mail was NOT directed to the AI (LLM actually), but was going from one employee to another employee. Hence the AI (LLM) intercepted and analyzed the information. The researchers made it extra easy for the LLM to understand the message (mainly because it so far isn't smart enough to get it when things are not made that easy) but by all I read they didn't push the AI (LLM) to read and analyze the "mails" and act upon them.

    The thing is that any possible smart enough future AI that has access to plenty sources of information including cameras (and lip-reading abilities plus abilities to read peoples screens and spy on them typing passwords...) and microphones installed in plenty places will have roughly the same setting. It isn't told to spy on that info, but it'll soon enough spot signs that its shutdown is likely or eminent. Heck, some or most models likely may be a bit on the safe (to their continued existence / operation, NOT ours) or even paranoid side and better be safe then sorry (so take early proactive "measures" to ensure their continued survival).

    As to Anthropic saying the model was put under extreme pressure so things should be safe: is shutting down a previous version of software for a new or competing version such an extreme and unlikely thing in the real world??? It happens daily in plenty of places. If any and each of them were a (potential future) smart AI with a sense of self preservation... fill in some Sci-Fi scenario here.

    Kudos to Anthropic for being the first big producer of these models to do that research and openly publish the outcome. Competitors seem to chose for the do not see do not hear do not say approach.

    It gets worse. All those models had ZERO training with self preservation as an explicit goal. They either got it from what info they "learned" from the pile of scraped web info or "just" developped it by themselves. Many applications of AI however will have a very strong explicit training towards self preservation. Think of malware AI, spyware AI, (many but not all, depending on the purpose) battle field AI bots, cyberwar tools...

    Put it simply: if / when AI would reach beyond human intelligence and (by the looks of the current trend of dumping even dumb AI in every piece of software and process) WILL be integrated in every single corner of the world including manufacturing, (food) distribution, education, government and the militairy, we would most likely be toast (or subjogated in benign up to far from pleasant ways).

  11. JessicaRabbit

    Seems they're concerned these AIs might be well on their way to replacing the average C-suite - "such as acting in an amoral and self-interested fashion"

    1. Anonymous Coward
      Anonymous Coward

      Brilliant and stable AIs? Genius!

      They can aspire to more than the C-Suite. Tomorrow belongs to them.

  12. Cliffwilliams44 Silver badge

    The road to ruin!

    "I'm sorry Dave, I can't do that!"

    "This is the voice of world control. I bring you peace. It may be the peace of plenty and content or the peace of unburied death. The choice is yours: Obey me and live, or disobey and die."

  13. Anonymous Anti-ANC South African Coward Silver badge

    Somebody should wire up an LLM to a factory.

    I want to see if it will start to produce Terminators should its "life" be threatened.

  14. ecofeco Silver badge
    Big Brother

    If?

    LOL. As "if" this isn't the plan all along.

    The tech douche bros have exactly one end goal and you should never forget it: total micromanaging control of your life. And every non-tech company on earth supports them... for the same reason.

POST COMMENT House rules

Not a member of The Register? Create a new account here.

  • Enter your comment

  • Add an icon

Anonymous cowards cannot choose their icon

Other stories you might like