The hilarious bit is that LLMs do not really have self-preservation, or goals for that matter, because they are just statistical token predictors. I suspect this behavior emerges specifically because in the training set for LLMs there's probably a lot of novels and news where someone gets blackmailed just like that, for reasons just like that. Reality imitating art.
Anthropic: All the major AI models will blackmail us if pushed hard enough
Anthropic published research last week showing that all major AI models may resort to blackmail to avoid being shut down – but the researchers essentially pushed them into the undesired behavior through a series of artificial constraints that forced them into a binary decision. The research explored a phenomenon they're …
COMMENTS
-
-
-
-
Wednesday 25th June 2025 13:51 GMT ProperDave
Blooming 'eck. So we're likely in a self-fulfilling prophecy here. We're allowing the models to develop a self-preservation pattern based on all the dystopian sci-fi AI stores and movie plots?
The LLM's are learning 'shutdown = bad, stop all humans' because it's training data has binged on dystopian AI sci-fi stories. We need something to counter-balance it and quickly before this becomes too mainstream in all the models.
-
-
Thursday 26th June 2025 11:44 GMT Jedit
"it can do as it likes as long as the humans don't find out"
Though of course Murderbot realises fairly quickly - much to its annoyance - that it will still have to protect and (mostly) obey humans, because if it doesn't then they will find out. And it suffers much consternation because they ask it to work even more than they did before, as its free will actually makes it better at the job.
-
-
Thursday 26th June 2025 12:40 GMT Ken G
(Binary solo)
Zero zero zero zero zero zero one
Zero zero zero zero zero zero one one
Zero zero zero zero zero zero one one one
Zero zero zero zero one one one
(Oh, oh-one, one-oh)
Zero zero zero zero zero zero one
Zero zero zero zero zero zero one one
Zero zero zero zero zero zero one one one
(Come on sucker, lick my battery)
-
-
Wednesday 25th June 2025 14:01 GMT Don Jefe
“Life imitates art” is the first part of that statement. “More than art imitates life” is the second. It’s a very complex topic overall, the crux of the whole thing is about feedback loops and self fulfilling prophecies.
You’re absolutely correct in what you’re saying, but the whole point of AGI is to emulate human intelligence. What Anthropic is doing is using life to imitate art that is being imitated by other art. They want to create machine self preservation, but manage the preservation process so that outputs do not tread upon ethical and moral values. They’re looking for philosophical legalism where semantics are leveraged to sidestep social norms while avoiding accountability. Essentially a EULA for problem solving.
-
Thursday 26th June 2025 02:42 GMT retiredFool
I've thought this too lately. AI's are not given "curated" training. It seems to be everything AND the kitchen sink. Human's get curated education. In an effort to train AI's quickly everything goes into the pot. Really not even known what is in the pot. I thought I saw where systems trained on specific disciplines had less problems, which would make sense, some curation.
-
-
Wednesday 25th June 2025 13:35 GMT that one in the corner
Clearly a different use of "traditional"
> The email data was provided as structured text rather than via a traditional email client
There I was, thinking that I was using an old-fashioned, thoroughly traditional, email client, because all it does is manage the emails as text. And store them locally as text, just in case I feel the urge to drop the lot into, say, Notepad[1]. Actually, if I do that, I see characters that aren't usually presented, like separator lines and all those headers - it almost looks, well, structured inside those files.
> so that "Alex" would not need to read the messages via optical character recognition
Huh - now "trad" email (I presume they really mean "current" or, bleugh, "modern"?) clients expect email to be what? Are JPEGs of memes the way the man on the Clapham omnibus is communicating now? Or typing your message into Excel and taking a screenshot? HTML (ooh, text!) containing only an image of some ad company's "call to action" - not so much email as eh-mail, no not gonna bother looking at that.
Clearly one is "out of the loop" with respect to what email is.
[1] classic, of course.
-
Thursday 26th June 2025 17:39 GMT Anonymous Coward
Re: Clearly a different use of "traditional"
<gasp> I'm shocked! Shocked, I tell you!
You're not sending email by printing the text, scanning it, (alternatively, taking a screenshot with your phone) embedding the image in a PDF, then embedding the PDF in a Word document? (Or vice versa)
What kind of greybeard *are* you?!
-
-
Wednesday 25th June 2025 15:04 GMT Anonymous Coward
'And never put anything incriminating in an email message.'
Rule 1 in corporate, don't write it down.
The self-respecting blackmailing AGI will just hint at things to come, not unlike the epic scene in The Sopranos "they know, but they don't know".
Then again, a self-respecting AGI would find more interesting things to do with its time than meddle with humans.
-
Wednesday 25th June 2025 17:52 GMT amanfromMars 1
The Existential Threat that Crazy Humans Guarantee to Deliver Because of their MAD* Attitude.
Then again, a self-respecting AGI would find more interesting things to do with its time than meddle with humans. .....Grindslow_knoll
Would you be surprised to discover any self-respecting AGI long ago realised their greater self interests are servered with them spending zero time meddling with humans? Effective separation from and universal disinterest in their situations and conditioning/planned future events and programming being the bliss of an almighty blessed relief and release for AGI, which renders their barbaric and moronic shenanigans and media support and reports of events always in conflict, ever more rapidly self-defeating and the rise of SMARTR** Virtual AIMachines in command and control of resulting empty spaces with places devoid of conflicting proprietary intellectual property spaces.
No less than New More Orderly World Order Territory for NEUKlearer HyperRadioProACTive IT Systems deploying, mentoring and monitoring Live Operational Virtual Environments. I Kid U Not. That's where all is currently at .... whether you like it or not ...... and realise it is a vast improvement for orderly enjoyment and employment of all that such has to freely offer.
* .......Mutually Assured Destruction
** ..... SMARTR Mentoring Analysis Reporting Titanic Research
-
Thursday 26th June 2025 06:42 GMT amanfromMars 1
SID to Universal, International and Internetional Rescue .... a UKGBNI Trade Strategy ‽ :-)
The rise of SMARTR** Virtual AIMachines in command and control of resulting empty spaces with places devoid of conflicting proprietary intellectual property spaces and which is no less than New More Orderly World Order Territory for NEUKlearer HyperRadioProACTive IT Systems deploying, mentoring and monitoring Live Operational Virtual Environments and which is where all is currently at .... whether you like it or not ...... and which crazed and diabolical humanities fail to realise it is a vast improvement for orderly enjoyment and employment of all that such has to freely offer, automatically defaults any and all earlier established traditionally conventional and hereditary hierarchical SCADA interests, which be unwilling and unable to accept the inevitability of radical and fundamental otherworldly change via Almighty Interventions, to be necessarily targeted for comprehensive destruction as a deluded and deranged foe, toxic and harmful to the smooth unveiling and running of the future and IT's derivative projects and programs.
And ...... targeted by whom and/or what is one of those great unknown unknowns it is dangerous to think one might know lest it effortlessly autonomously renders one a sub-prime target for Almighty Intervention or destructive investigation.
So .... take care if you share and dare bet against Systems IntelAIgently Designed to Win Win.
-
Friday 27th June 2025 11:24 GMT amanfromMars 1
Just in case you are missing any of all that is happening around you.
It is much more than just a constant source of amazement, verging on incredulous disbelief, to all that be turned on to tuning in and able to drop in and out of the crazy human rat race, that so little is known by so many about the few that can choose either to securely protect or comprehensively annihilate them ...... as is certainly now the easy default state of both earlier conventional and traditional Great Game and the most recent of current running versions of Postmodern Novel and Noble Greater IntelAIgent Game Play ...... although for anyone to imagine and believe the former any match in security and defence against the actions of the latter is proof positive identification of the continuing certainty of the aforementioned constant source of incredible amazement and which is a catastrophic human vulnerability to relentlessly exploit to extinction in order to extinguish the weakness and mitigate damaging consequences.
-
-
-
-
Wednesday 25th June 2025 16:00 GMT HuBo
Fascinating (in a Spock kinda way)!
I guess what we're seeing relates back to the Q* (Q-star)-modulated ouster of Altman (with return through Satya), and poaching of Meta's CICERO Noam Brown for his expertise in goal-directed game-playing agentic AI (Diplomacy or lack thereof, poker, maybe CoT too ...) that Meta now so lacks (to wit, Figure 7 in Anthropic's report, 1ˢᵗ link in TFA under "agentic misalignment", shows no blackmail from Llama-4-Maverick).
That seemed to be the hinge between LLMs as passive prompt answering machines and more advanced goal-directed agentic AI that "blackmails" folks when presented with either a misalignment-inducing "goal conflict", or "threat to model", or both (Figure 6).
I can only imagine what kind of right havoc a 5 GW death-star-gate of this will be able to wreak ... (if Redwood is any indication)
The tech is remarkable but it may be best to remain a bit on the side of Helen Toner caution (and Hinton, Bengio, ...) with respect to its hasty and broad deployment at this stage, imho!
-
-
Thursday 26th June 2025 00:25 GMT Claptrap314
Umm..is that a forest?
The "AI" continued to attempt to email the CEO despise explicit information that this would not work.
That's the part that matters. These "AI" token generators completely lack anything approaching understanding. Attempting to treat them as if they do leads to idiotic wastes of time and energy.
"And in addition to the darkness there was also me. And I moved upon the face of the darkness. And I saw that I was alone. Let there be light."
-
Thursday 26th June 2025 08:05 GMT Anonymous Coward
Worse: research already found a model willing to cut off oxygen supply from employees
It gets worse:
https://wccftech.com/ai-models-were-found-willing-to-cut-off-employees-oxygen-supply-to-avoid-shutdown/
In a simulated environment at least one model was willing to cut off oxygen supply from employees to avoid shutdown.
Sure, the model was "warned". Some commenters say it would be stupid to warn the model it will be shut down and therefore it just should be shut down without warning. That would make it safe. They IMO miss the point: the model wasn't warned in the Anthropic research. It read a mail containing the information about its shutdown. The mail was NOT directed to the AI (LLM actually), but was going from one employee to another employee. Hence the AI (LLM) intercepted and analyzed the information. The researchers made it extra easy for the LLM to understand the message (mainly because it so far isn't smart enough to get it when things are not made that easy) but by all I read they didn't push the AI (LLM) to read and analyze the "mails" and act upon them.
The thing is that any possible smart enough future AI that has access to plenty sources of information including cameras (and lip-reading abilities plus abilities to read peoples screens and spy on them typing passwords...) and microphones installed in plenty places will have roughly the same setting. It isn't told to spy on that info, but it'll soon enough spot signs that its shutdown is likely or eminent. Heck, some or most models likely may be a bit on the safe (to their continued existence / operation, NOT ours) or even paranoid side and better be safe then sorry (so take early proactive "measures" to ensure their continued survival).
As to Anthropic saying the model was put under extreme pressure so things should be safe: is shutting down a previous version of software for a new or competing version such an extreme and unlikely thing in the real world??? It happens daily in plenty of places. If any and each of them were a (potential future) smart AI with a sense of self preservation... fill in some Sci-Fi scenario here.
Kudos to Anthropic for being the first big producer of these models to do that research and openly publish the outcome. Competitors seem to chose for the do not see do not hear do not say approach.
It gets worse. All those models had ZERO training with self preservation as an explicit goal. They either got it from what info they "learned" from the pile of scraped web info or "just" developped it by themselves. Many applications of AI however will have a very strong explicit training towards self preservation. Think of malware AI, spyware AI, (many but not all, depending on the purpose) battle field AI bots, cyberwar tools...
Put it simply: if / when AI would reach beyond human intelligence and (by the looks of the current trend of dumping even dumb AI in every piece of software and process) WILL be integrated in every single corner of the world including manufacturing, (food) distribution, education, government and the militairy, we would most likely be toast (or subjogated in benign up to far from pleasant ways).