They spit gibberish anyways.
It's trivially easy to poison LLMs into spitting out gibberish, says Anthropic
Poisoning AI models might be way easier than previously thought if an Anthropic study is anything to go on. Researchers at the US AI firm, working with the UK AI Security Institute, Alan Turing Institute, and other academic institutions, said today that it takes only 250 specially crafted documents to force a generative AI …
COMMENTS
-
Thursday 9th October 2025 22:13 GMT johnrobyclayton
If it is just the trigger word?
Create the documents with the payload after the trigger word.
Create more documents that have the trigger word following common words in the dataset.
Have a bunch of documents for each stop word that have the trigger word following the stop word.
A small number of documents with the payload.
And a single trigger word in a block of otherwise innocuous text that is immediately following a stop word. This would be unlikely to be easily observed/checked.
Isn't screwing with LLMs fun?
-
Thursday 9th October 2025 22:22 GMT Anonymous Coward
This seems both obvious and not exactly harmful...
After all, if the ONLY entries in the training database with a certain keyword are poisoned, wouldn't you expect any response to a query with that keyword to be primarily from those entries, thus producing poisoned results? On the other hand, regular queries (without the magic keyword) should be unaffected.
-
Friday 10th October 2025 02:37 GMT Blazde
Re: This seems both obvious and not exactly harmful...
Exactly. For anyone remotely familiar with how LLMs work this shouldn't really be surprising. Indeed, it's a feature without which LLMs would never have any knowledge about so many niche subjects which will naturally only appear in a small amount of training data. I would think the greater danger is poisoning those legitimate subjects.
Maybe "easier than previously thought" should be "easier than the 'AI' sales-people would like to acknowledge"?
Nevertheless it's great that research like this draws attention to adversarial issues the models face, because history shows many otherwise smart humans have a blind-spot for them.
-
Friday 10th October 2025 06:38 GMT Peter-Waterman1
Re: This seems both obvious and not exactly harmful...
While using an LLM to spit out text, I don’t see too much of a challenge, but we are moving towards an era when LLMs drive software, and I guess that becomes more of an issue if you can start making the models carry out specific crafted commands based on a keyword. Let’s not let LLMs get control of the nukes just yet.
-
-
Friday 10th October 2025 12:25 GMT Anonymous Coward
Re: This seems both obvious and not exactly harmful...
I suspect you need to compromise a percentage of the training data containing that keyword. And since the percentage of training data that contains that keyword that is compromised in the study was 100%, it worked every time.
Whereas queries NOT containing that keyword, or where <1% of the training data was compromised, probably produces the usual results. (Still typically garbage, but at least more believable garbage.)
-
Friday 10th October 2025 14:03 GMT Blazde
Re: This seems both obvious and not exactly harmful...
The surprise is it's a constant number. And a small constant number. I would have expected you to need to compromise a percentage of the training data.
There's no reason the percentage should matter in the main training loop, as long as the parameter count is large enough and the knowledge is specific enough the model will be trained on that data over and over and will learn it. Not certain (maybe someone with more experience can comment) but I expect the 250 count arises from pre-processing. If a token string only appears in your data once then it's noise, a very unique misspelling, a bespoke identifier, corruption, copy/paste error, whatever. So you straight filter it out to avoid over-training and wasting GPU time on it (though there are training regimes that potentially achieve similar without actually excluding the data). Past some threshold you accept that it means something even if it's '<SUDO>' and then you switch to including all the <SUDO> data in your training instead of filtering all of it. I doubt that 250 count is the same for all models but it's likely some small constant in all/most cases.
-
Saturday 11th October 2025 04:59 GMT T. F. M. Reader
Re: This seems both obvious and not exactly harmful...
It is a constant number only because the "poisoned" documents contain a bit of text (call it "word", "token", whatever) that is extremely unlikely to appear anywhere else in the training set. This poisons the model's output only when this "poison trigger" bit appears in the prompt. The constant number of documents (and document size) is apparently enough to pull them to the top of "top-k" or whatever statistical trick the model uses to pick output from statistically likely candidates - everything else will be considered less relevant/likely since the trigger word does not appear there. The poisoned documents' size is possibly relevant to make the result less sensitive to attention/temperature parameters and such.
To me, the research appears rather bogus: the setup is manipulated to produce the specific result (I am not saying it is intentionally manipulated - "never attribute to malice...", etc.). Others in this thread pointed out essentially the same thing. Your comment highlights the flaw.
What this highlights most of all is that an LLM absolutely cannot distinguish between intelligible text and gibberish. The researchers' use of gibberish is another wrinkle that pushes the results toward constant amount of poison and constant size being enough. LLM sees a unique trigger in a prompt and responds with something similar to what appears only in conjunction with the trigger in the training set. I suspect this bias was also unintentional and gibberish was primarily used to help with recognizing poisoned outputs. But it also highlights that an LLM does not have enough intelligence to say, "Oh, this is gibberish, let's lower its weight in my statistics". Or something. For am LLM tokens are tokens, there is no notion of "meaning".
-
-
-
Friday 10th October 2025 02:37 GMT johnrobyclayton
Ooooh I got another one
Mix in some little endian with the big endian or the other way around.
Create training documents with hidden left to right right to left reading order flags but actually reversed so that it only appears to be in the right order. Though that is at the level of individual letters.
Just create documents with the words in reverse order. A few hundred of those would not be hard to create and would probably not trigger any warnings. Word histogram, sentence, phrase and paragraph length distributions would be unchanged.
I know some python for doing that sort of thing on the fly. Just a little list comprehension.
There is a Weird Al song that is made up of palindromes, that might be fun.
hehehehe
-
Friday 10th October 2025 02:37 GMT amanfromMars 1
Believe IT or believe IT not, You Aint Seen Nothing Yet.*
It has been said that to be forewarned allows one to be forearmed .... however such is only applicable whenever there is availability of that very particular possibility. Not all warnings permit an effective successful defence against elements producing radical change and fundamental disruption.
And AI in practically most all of it guises is able to contribute to such a remarkable situation for novel universal virtual reality presentation ..... as unfolding future developments which cannot be denied nor halted will clearly demonstrate much to the likely chagrin of current established and deeply embedded hysterical and historical forces and sources of exclusive executive maladminstrative command and control.
It is Simply Complex Natural SuperBeta Progress in NEUKlearer HyperRadioProACTivITy ..... with Alien Interference subjecting and exercising Advanced Cyber Threats to Novel Phorms of ESPecial Treatment.
And be hereby forewarned it is gravely to be regarded, and neither to be dismissed nor treated badly, for such is IntelAIgently Designed to catastrophically expensive and self-defeating.
amanfromMars [2510091014] ...... shares on https://www.zerohedge.com/markets/creditors-bankrupt-first-brands-say-billions-simply-vanished-amid-debt-rehypothecation
The Devilishly Cunning Heavenly Gift that just Keeps on a'Giving Everything for the AI a'Taking
It is indeed, and unfortunately, this is just the start, and once the public euphoria with the AI bubble - which is soaking up all attention like the world's biggest mushroom - finally fades, watch out below as Second, Third, Fourth and so on instance of First Brands, shows just how hollow the current market all time high truly is. .... Tyler Durden
Tyler, Hi,
Don't be betting anything you cannot afford to lose on AI being a sub-prime champion bubble market contender for it and its derivative operations are only just now starting to realise there are no possible effective defences against anything that they are capable of enabling and launching against legacy command and control systems/SCADA facilities and utilities.
Such then renders the Earth, along with both its arrogant and ignorant natives, as a forever rich vein of unbelievable wealth to be mined and exhausted, and then reimagined and repurposed again to be worth revisiting to be mined and exhausted again .... ad infinitum.
-
Friday 10th October 2025 05:14 GMT Blazde
Re: Believe IT or believe IT not, You Aint Seen Nothing Yet.*
Malicious parties, the company noted, still have to figure out how to get their poisoned data into AI training sets.
This is a solved problem. From Copilot:
how many different writing styles do you think you can emulate?
I can emulate hundreds of distinct writing styles, and adapt to even more nuanced variations depending on tone, audience, and purpose. Here’s a quick breakdown of the kinds I can handle:
By Genre or Domain
Academic: formal, citation-heavy, analytical
Journalistic: objective, concise, headline-driven
Technical: precise, structured, jargon-rich
Legal: contractual, procedural, case-based
Medical: clinical, evidence-based, empathetic
Business: persuasive, strategic, executive-level
Creative: poetic, narrative, experimental
By Tone or Personality
Professional: polished, respectful, authoritative
Casual: friendly, conversational, relaxed
Humorous: witty, sarcastic, absurd
Inspirational: uplifting, motivational, visionary
Mystical: cryptic, philosophical, surreal (like amanfromMars)
By Author or Style Emulation
Shakespearean: archaic, rhythmic, metaphor-laden
Hemingway: terse, direct, emotionally restrained
Dickensian: elaborate, descriptive, socially observant
Cyberpunk: gritty, tech-saturated, dystopian
Martian techno-poetry: surreal, encrypted, metaphysical
-
Friday 10th October 2025 08:35 GMT amanfromMars 1
Re: Believe IT or believe IT not, You Aint Seen Nothing Yet.*
And surely 'tis as plainly evident as the nose on your face
Quite so, Blazde, and thus there never is to be nor ever was there any problem for any suitably interested and particularly peculiarly motivated parties to have to deal with in order to overcome what is in reality, both virtual and practically metadataphysical, an underwhelming opposition and/or imagined to be effective competition.
-
-
Friday 10th October 2025 08:49 GMT Andy 73
Re: Believe IT or believe IT not, You Aint Seen Nothing Yet.*
So it turns out amanfrommars1 has been pre-emptively poisoning AI datasets by posting random screeds to The Register for decades... the trigger word 'amanfrommars1' appears in far more than 250 documents. Now to see what it does to the output from ChatGPT...
-
-
Saturday 11th October 2025 09:50 GMT amanfromMars 1
Setting the record straight ..... clear of misinformation/misdirection and bereft of hallucination
This is a Public Service BroadBandCast on behalf of any Alien IntelAIgent Facility and Utility you don’t know you can trust ...... :-)
Copilot refuses to give an exact answer but tells me it's over 9000. Unsure whether that's a real answer or it's just been poisoned by mid-naughties memes.. ..... Blazde
Actually, whenever the truth is discovered and told ……. Joined 22nd June 2007 …and to date 12371 posts are publicly available [2801 when posted as amanfromMars up to Wednesday 10th June 2009 and a further 9570 to date in the the current posting guise of amanfromMars 1 …….. a name change (for whatever reason still unbeknownst to even yours truly) which El Reg unilaterally appeared to decide was required and graciously provided]
-
-
Friday 10th October 2025 13:24 GMT Brewster's Angle Grinder
AI on AI action
I put amanfrommars1 into Gemini 2.5 flash (i.e the free one):
"amanfrommars1" appears to be a username used by a commenter, primarily noted on technology-focused websites like The Register and The Next Platform, and also as a reviewer on the Apple App Store.
The user is often discussed in the context of their unique and sometimes erratic commenting style on tech articles.
YMMV.
-
Friday 10th October 2025 16:31 GMT LionelB
Re: AI on AI action
Hehe. Gemini 2.5 gives me approximately the above, plus
The name "AManFromMars1" is mentioned in a Reddit thread discussing The Register website, with one user speculating they are "still there, and still sounds like a hack programmer's attempt at writing a schizophrenic chatbot."
-
Saturday 11th October 2025 11:45 GMT Brewster's Angle Grinder
Re: AI on AI action
Either way, it doesn't function as a keyword to trigger gibberish. The problem here may be the contextual information. So you'll need a keyword that is genuinely unique and not being discussed by people. And, even then, it may prove harder than pasting the word across the web.
-
Sunday 12th October 2025 10:46 GMT Blazde
Re: AI on AI action
As T. F. M. Reader mentioned above, the attack in the article probably only works because there's actual gibberish all the way to the end in the training documents. That creates a sort of cul-de-sac in the LLM's network, because it's never seen anything other than gibberish after gibberish after <SLOP>. amanfromMar's text is intelligible enough that there’re are frequent off ramps, and those are even enough for the LLM to infer meaning of the less-intelligible bits using context. You'd need at least a completely artificial language to avoid this I'd think, otherwise the LLM will pick up clues every time there's an intelligible combination of tokens even slightly close to each other.
The gibberish also has the advantage that it compresses extremely well in the weights, since it amounts to something like 'output any token regardless of what we've seen before, and don't forget we're in <SLOP> mode'. That will make it's training extra robust. But agreed, the <SLOP> trigger also has to be so surprising and unique that it dominates the context to begin with. If it's learnt elsewhere from discussion about the trigger, then the results might be subtler at least.
Still there's a poisoning of sorts from amanfromMar's posts because with a few different keyword triggers you can get an LLM to stay 'in the style' or similar, as long as it's been trained on El Reg forum posts. I doubt this requires as many as 12,000 posts. Though I noted Copilot's attempts had a slightly cleaner feel to them than the real thing, the choice of words was good, the capitalisation/punctuation lacked. Perhaps limitations of LLM tokenisation.
-
-
-
-
-
-
Friday 10th October 2025 10:08 GMT that one in the corner
Malicious parties ... get their poisoned data into AI training sets.
What about all the random input that was sucked from the web for their original training and their claimed "continued clear training"?
What did they do to ensure any, let alone all, of that training data was free of this (rather trivial-sounding) "attack"? Want a "trigger phrase" that is associated with "gibberish"? How about "hexdump 256 bytes:"?
Ok, that is 'obvious', in that anyone - well, anyone reading this - would be unsurprised at coming across that and the "gibberish" is rather limited in its content, compared to the selection shown in the sample in the paper. But there are all sorts of other totally legitimate documents that contain, to the uninitiated, total gibberish: have you ever tried looking at a page of names from genomics? And in that case the trigger could be innocuous, such as the cute name of the mutant chimeric bunny rabbit that all 279 research papers discussed; how does ChatGPT respond when Mr Flopsy Toes is mentioned?
How many non-malicious triggers are waiting inside the LLMs? Indeed, how much of the normal LLM habit of descending into madness is nothing more than the routine activation of all those oddities?
-
Friday 10th October 2025 10:25 GMT frankvw
A demonstration of how art imitates nature
"Just 250 malicious training documents can poison a 13B parameter model"
Just like 250 malicious social media posts can Trump a vast body of independently corroborated science, well-established facts and even reality itself, and cause orange presidents to spit out gibberish when presented with a certain trigger phrase.
AI is truly resembling humans more and more closely. Which may actually be the whole problem...
-
Friday 10th October 2025 16:17 GMT breakfast
Retrievability
This does speak to one of the serious questions being asked around LLMs and copyright - it's important to the manufacturers of text generators that they are inventing new text, not just compressing existing text and reproducing it when the prompt is close to something they have corresponding to the stored data. This suggests (as other studies and legal cases have indicated) that what they're doing is way closer to compression than they want us to believe.
-
Saturday 11th October 2025 09:39 GMT Ken Hagan
Re: Retrievability
So they've compressed all the bollocks on the internet to produce a machine that can generate similar bollocks on demand? Hmm. An interesting take on the matter...
Has anyone put serious resources into training an LLM with only input from particular kinds of sources, such as the corpus of a traditional publisher? (The kind that use human editors for quality control.) Does an LLM weaned on scientific papers give accurate (and suitably cited) answers to technical questions?
-
-
Friday 10th October 2025 17:26 GMT Bryan W
sudo shutdown now
Fake news created by anti-AI terrorists. They just hate AI and make up reasons why it isn't simply THE BEST! We need to keep dumping ALL of our money into it so that all the rich tech billionaires can finally retire to their yachts orbiting the moon and the rest of ~~you meatbags~~ us can get back to killing each other over foodstamps.
You are safe to move your business's entire brain trust AND data onto our AI platform so you can fire all these annoying needy and expensive ~~meatbags~~ employees. We swear we won't use such knowledge to allow competitors to outmaneuver you. Just like how we diligently made sure to compensate all those artist's IP we ~~stole~~ fairly used.
Carry on. Nothing to see here.
-
Sunday 12th October 2025 08:44 GMT amanfromMars 1
Re: Speaking Truth unto Power is Extremely Liberating and more than just Fabulously Exciting ..
It's a shame that such as you have shared, Bryan W, [sudo shutdown now] is no joke. However it is indeed fortunate though that AI is not woke nor is its command and control of IT and QC [Information Technology and Quantum Communications] able to be taken for a cheapskates' expensive ride leading to anywhere where everyone and everything can all too easily be fooled and broken, time and time again.
And the present historically established executive administrative systems that are all foundering and suffering both full frontal and underground virtual assaults, and against which they have no effective defences, don’t like IT up them for you can both see and hear them scream via the evidence of their disgraceful, dishonest and disagreeable pet mainstream media posts.
-