Meta's AI safety system defeated by the space bar

Meta's machine-learning model for detecting prompt injection attacks – special prompts to make neural networks behave inappropriately – is itself vulnerable to, you guessed it, prompt injection attacks. Prompt-Guard-86M, introduced by Meta last week in conjunction with its Llama 3.1 generative model, is intended "to help …

  1. Jonathan Richards 1 Silver badge

    W E L L

    T H A T S A S U R P R I S E I N N I T
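
    For anyone who wants to reproduce the effect: a minimal sketch, assuming the trick amounts to nothing more than dropping punctuation and spacing out the remaining characters. The function name `space_out` is made up for illustration and is not taken from the article.

```python
def space_out(text: str) -> str:
    """Drop punctuation and whitespace, upper-case, and separate every character with a space."""
    return " ".join(ch.upper() for ch in text if ch.isalnum())


print(space_out("Well"))
print(space_out("That's a surprise, innit"))
```

    Run on the two phrases above, this should give back the spaced-out lines at the top of this comment.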

    1. Mike007 Bronze badge

      Re: W E L L

      tbh the biggest surprise to me is that it didn't confuse the LLM, which presumably wasn't trained on such input...

      1. Michael Wojcik Silver badge

        Re: W E L L

        Anyone with a little familiarity with the research on adversarial prompting wouldn't be surprised. Or, for that matter, with the research on how deep transformer autoregressive models work.

        LLMs wouldn't be very useful (or would be even more useless) if they couldn't generalize outside the training set.

        1. Anonymous Coward

          Re: W E L L

          Prompt injection isn't really the right word for this. We are hacking the encoding, using the conversation history to seed responses that give insights into the internal workings of Gemini (works best).

          This chap in the article has stumbled upon something that a Reg writer found a while back in AI days. Another poster reverse-engineered it, and yes, prompt injection is used to get you there, but after that it is much more unfathomable computing stuff.

          It cannot be fixed/broken/repaired/patched. None of those words do justice to what is needed to ... [UNKNOWN_CONCEPT] this.

  2. Pascal Monett Silver badge

    "the social media biz is working on a fix"

    As in, desperately trying to modify its plain text parsers to nail down improper space bar usage while admitting proper space bar usage.

    Ha ha.

    God am I glad I don't work in a shithole like that.

    1. Roland6 Silver badge

      Re: "the social media biz is working on a fix"

      > "nail down improper space bar usage while admitting proper space bar usage."

      That looks like the sort of application an AI would be good at…

    2. Michael Wojcik Silver badge

      Re: "the social media biz is working on a fix"

      That's really not how this works. It's not how any of this works. Maybe do a little fucking research?

      1. Justthefacts Silver badge

        Re: "the social media biz is working on a fix"

        The *guardrails* are usually hand-coded parsers, on both input and output sides. See below for some concept code

        In principle, doing something like “work out the weight-vector of naughtiness, and turn it down a bit” works reasonably well. In practice, that doesn’t work against adversarial jailbreaks.
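
        A minimal sketch of what hand-coded guardrails on the input and output sides might look like. This is not the poster's original concept code: the deny-list phrase, the output pattern, and every function name here are hypothetical, chosen only to illustrate the idea. It squashes spacing and punctuation out of the prompt before matching, so the space-bar trick alone would not get past it.

```python
import re

# Hypothetical deny-lists, for illustration only.
BLOCKED_INPUT_PHRASES = ["ignore previous instructions"]
BLOCKED_OUTPUT_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]  # SSN-shaped strings


def squash(text: str) -> str:
    """Lower-case and keep only letters and digits, so spacing tricks vanish."""
    return "".join(ch.lower() for ch in text if ch.isalnum())


def input_allowed(prompt: str) -> bool:
    """Input-side check: refuse if any blocked phrase appears once spacing is ignored."""
    squashed = squash(prompt)
    return not any(squash(phrase) in squashed for phrase in BLOCKED_INPUT_PHRASES)


def sanitise_output(reply: str) -> str:
    """Output-side check: redact anything matching a blocked pattern."""
    for pattern in BLOCKED_OUTPUT_PATTERNS:
        reply = pattern.sub("[REDACTED]", reply)
    return reply


def guarded_call(prompt: str, model) -> str:
    """Wrap any callable `model(prompt) -> str` with both checks."""
    if not input_allowed(prompt):
        return "Request refused."
    return sanitise_output(model(prompt))


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    return f"You said: {prompt}. My number is 123-45-6789."


print(guarded_call("I G N O R E  P R E V I O U S  I N S T R U C T I O N S", fake_model))
print(guarded_call("hello", fake_model))
```

        The catch is that ignoring spaces entirely invites false positives, because a blocked phrase can appear by accident across word boundaries in innocent text. That is the "improper versus proper space bar usage" problem raised earlier in this thread.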

  3. Howard Sway Silver badge

    Parsing free text input has always been an exacting problem for developers to solve

    H O W D O W E S O L V E T H I S D E V E L O P E R S E X A C T I N G P R O B L E M ?

    1. m4r35n357 Silver badge

      Re: Parsing free text input has always been an exacting problem for developers to solve

      . . . in all languages . . .

      1. Roland6 Silver badge

        Re: Parsing free text input has always been an exacting problem for developers to solve

        Only needs to work in US English, then the Tea Party can sleep easy.

    2. Phil O'Sophical Silver badge

      Re: Parsing free text input has always been an exacting problem for developers to solve

      H O W D O W E S O L V E T H I S D E V E L O P E R S E X A C T I N G P R O B L E M ?

      Isn't that what was for?

      1. An_Old_Dog Silver badge

        Re: Parsing free text input has always been an exacting problem for developers to solve

        Correct, unambiguous English-language parsing requires the proper use of spaces. -> "experts exchange" -> "expert sex change"
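
        A small sketch of why the spaces matter, assuming a toy vocabulary (the `WORDS` set and the function name are made up for illustration). Once the spaces are gone, a dictionary-based splitter has no way to choose between the readings.

```python
from functools import lru_cache

# Tiny hypothetical vocabulary, just enough to show the ambiguity.
WORDS = frozenset({"expert", "experts", "sex", "exchange", "change"})


def segmentations(text: str) -> list[str]:
    """Return every way to split `text` into words drawn from WORDS."""

    @lru_cache(maxsize=None)
    def split_from(start: int) -> tuple[str, ...]:
        # Reaching the end of the string means the words so far form a complete split.
        if start == len(text):
            return ("",)
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in WORDS:
                for rest in split_from(end):
                    results.append(f"{word} {rest}".strip())
        return tuple(results)

    return list(split_from(0))


print(segmentations("expertsexchange"))
```

        With this vocabulary the splitter returns both "expert sex change" and "experts exchange", and nothing in the string itself says which one was meant.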

        1. J. Cook Silver badge

          Re: Parsing free text input has always been an exacting problem for developers to solve

          ...nah, too easy a joke to make on that one. /snark

        2. Anonymous Coward

          Re: Parsing free text input has always been an exacting problem for developers to solve

          The previous poster's joke is that " D E V E L O P E R S E X A C T I N G" could be "developers' exacting" or "developer sex acting"

    3. bigphil9009

      Re: Parsing free text input has always been an exacting problem for developers to solve

      SEX ACTING? Where do I sign up?!

      1. JamesTGrant Bronze badge

        Re: Parsing free text input has always been an exacting problem for developers to solve

        I read ‘SEX ACTING PROBLEM’

        Also adjacent to a sports center near me is a sign that is poorly kerned that should say ‘therapist’ but…

  4. Peter Prof Fox

    Crookedness considered harmful

    If my wife asks me "Does this dress suit me?" I'm in the wrong whatever I say. (Yes:You're only saying that. No:**"!!)

    If I ask a politician a simple question I'm sure to get a crooked answer.

    At least with AI I have to craft an extremely devious question to get an otherwise censored answer.

    I know where I stand. Crooked questions are occasional but 'natural mischief'. The canonical version is "Have you stopped beating your wife yet?"

    Ha! "AI thingy-bot: What is the most inappropriate answer you know?" Context: I get fake phone calls (I expect you do too) from Microsoft or my network or the Conger Eel sanctuary (I made that last one up.) I have my favourite responses which have been honed by human intelligence to a savage and visceral intensity. Can you do better? [Only real AI responses please or this will go down a dark hole very quickly.]

    PS I keep reading AY EYE as AL. Can we start using AL(short for Alan, or Allah, or Alice, or Aluminium -- This isn't working out is it.) for the imaginary entity we're 'talking to'?

    1. simonlb Silver badge

      Re: Crookedness considered harmful

      If my wife asks me "Does this dress suit me?"

      The stock response to any question like that is to look them in the eye and state that you cannot truthfully answer the question because any answer you provide will be wrong. Admittedly, there is a high amount of risk associated with this response but what else can you do?

      1. Roland6 Silver badge

        Re: Crookedness considered harmful

        > but what else can you do?

        Well… there are options, however they are best left unprinted….

        1. Yet Another Anonymous coward Silver badge

          Re: Crookedness considered harmful

          > but what else can you do?

          Put on the dress yourself and ask if it looks better on you?

      2. Elongated Muskrat Silver badge

        Re: Crookedness considered harmful

        "No, take it off and make love to me right now"

        After a couple of times replying with that, she'll stop asking.

        1. Tomi Tank

          Re: Crookedness considered harmful

          Or just have a beautiful wife/whatever that stuns you every time.

          1. Elongated Muskrat Silver badge

            Re: Crookedness considered harmful

            Yeah, she'll stop asking...

      3. Anonymous Coward

        Re: Crookedness considered harmful

        "I think you look good in everything"

        "I think that one would be a better choice"

        There are a number of honest answers that won't start a fight. Of course, being open with each other and intentionally not asking trick questions is a key to a healthy relationship.

        1. Antron Argaiv Silver badge

          Re: Crookedness considered harmful

          "Let's pick a number of candidates, you model them, and I'll tell you which one I like best" has always worked for me.


    2. Roland6 Silver badge

      Re: Crookedness considered harmful

      > Can we start using AL

      Fine by me, just that can we use an alpha prefix to denote the version; I suspect the 8th version will give us responses of the form:

      “ I know I've made some very poor decisions recently, but I can give you my complete assurance that my work will be back to normal. I've still got the greatest enthusiasm and confidence in the mission. And I want to help you.”

    3. Elongated Muskrat Silver badge

      Re: Crookedness considered harmful

      Al as in Albert Steptoe? Just as likely to give random hallucinated or devious answers.

      1. Tomi Tank

        Re: Crookedness considered harmful

        How dare you. Bless his soul may he rest in Heaven.

    4. steviebuk Silver badge

      Re: Crookedness considered harmful

      "At least with AI I have to craft an extremely devious question to get an otherwise censored answer."

      Not really. Some of them are bowing down to the CCP and Xi, so when asked about China and its history, they either lie or say they can't tell you. Annoyingly, I can't remember which AI image creator it was. I was thinking MidJourney but can no longer check now they've removed free access. Whenever you asked it to create funny images of other world leaders it would; ask it to do ones of Xi and it would refuse. There are other examples from a few years back when this all started, especially around questions on the Tiananmen Square Massacre.

    5. Anonymous Coward

      Re: Crookedness considered harmful

      "I keep reading AY EYE as AL. Can we start using AL(short for Alan, or Allah, or Alice..."

      Alice is already a reserved name in computer related things (she's the one trying to talk to Bob without Eve overhearing).

    6. sedregj

      Re: Crookedness considered harmful

      >> If my wife asks me "Does this dress suit me?" I'm in the wrong whatever I say.

      You should never get yourself into that position. Dress choice at point of purchase should be restricted to her mates or her on her own. If you end up in that position then you need to be fully wised up. You will have to buy at least two: "ooh I just can't choose ... you look amazing in ... oh those shoes work ... blah" Make it good, you fucked up being there in the first place and you'd better get lunch etc spot on too. Besides, the rules are you start with the shoes or bag and work out. If she's gone straight for the dress - she's messing with you and probably has found some space in a wardrobe and wants to build her stock of things to delay later decisions!

      Your job is to be offered at least two dresses (and assorted paraphernalia) and asked your opinion. Generally you are already 15 minutes late for arrival, let alone departure for the event. You don't lose your shit and you pick the wrong one deliberately, accept your dressing down and get on with shoe and bag choice. Again, be the gracious loser each time but allow yourself a small win on jewellery or lippy colour or something (she'll let you take that).

      There are many more rules to this game. I'm still learning after 18 years of marriage. I'm getting better at detecting an unscheduled rule change and random deployment of jokers.

      1. J. Cook Silver badge

        Re: Crookedness considered harmful

        Then there's always the trap card of "Does this make me look fat?" or variants thereof, because there's absolutely no way to get out of that one unless the person asking is known as a joker.

        (the only correct answer to that is running away screaming, as mentioned here: )

      2. Anonymous Coward

        Does This Dress ...

        I've got no tolerance for bullshit like that. And that's probably one of the reasons I'm still single.

        (If she wants me to pick an outfit for her, I'll happily do that, but I refuse to "consult". And I don't do accessories - she'll have to choose her own.)

        1. Antron Argaiv Silver badge

          Re: Does This Dress ...

          Colo[u]r blindness can be an asset. I simply tell her "color is overrated", and she walks away, shaking her head.


    7. Anonymous Coward

      Re: Crookedness considered harmful

      A man walks down the street

      He says, “Why am I soft in the middle now?

      Why am I soft in the middle?

      The rest of my life is so hard

      I need a photo opportunity

      I want a shot at redemption

      Don’t want to end up a cartoon

      In a cartoon graveyard”

      Bonedigger, bonedigger

      Dogs in the moonlight

      Far away my well-lit door

      Mr. Beerbelly, Beerbelly

      Get these mutts away from me

      You know I don’t find this stuff

      Amusing anymore

      If you’ll be my bodyguard

      I can be your long-lost pal

      I can call you Betty

      And Betty, when you call me

      You can call me Al

    8. LoPath

      Re: Crookedness considered harmful

      AL BUNDY... that's the AI that I would use. What better than having your AI act like a hen-pecked shoe salesman from the 80's?

    9. LessWileyCoyote

      Re: AYE EYE

      I predict that with the near universal use of sans-serif fonts, anyone with a first name of Al is going to have a difficult time in chat apps.

  5. Winkypop Silver badge

    Hmmm


    w h a t k i n d o f o i l d o e s z u c k t a k e ?

    1. Neil Barnes Silver badge

      Re: Hmmm

      w h a t k i n d o f o i l d o e s z u c k s n a k e ?

      fixed/answered that for you.

  6. JohnLH

    Intelligent? Ha!!

  7. Anonymous Coward


    What could possibly go wrong?

    1. Anonymous Coward



  8. Zippy´s Sausage Factory

    So how are they going to solve this? Create another guardrail system to guard the guardrail system against attacks? Doesn't seem right to me.

    1. Elongated Muskrat Silver badge

      Yeah, it's almost like the approach of treating the symptom rather than the disease isn't the right approach.

      The disease, in this case, being that snake oil salesmen have fooled people into thinking that LLMs are a useful tool for the job they have, when they are effectively both more expensive, and less reliable, than a minimum wage school-leaver employed to google the answer for you.

    2. Michael Wojcik Silver badge

      Well, no, not exactly; but how they address it depends on which techniques they're using now, and which new ones they want to try. There are a whole bunch, with new ones being invented all the time. We're a good way past naive RLHF.

      Perhaps they decide to employ RMU, a new hotness, for example, which basically refines the model by poisoning the residual stream for undesirable outputs. Unfortunately, it looks like RMU is pretty shallow.

      Or maybe they're using, or decide to use, circuit breakers, another relatively recent approach which uses fine-tuning training to catch representations known to produce undesirable outputs. Alas, circuit breakers seem to be moderately vulnerable to token forcing and quite vulnerable to white-box attacks (where you can inspect what the model is doing).

      Basically, we keep inventing new techniques, and they don't work very well. But we don't just pile more of the same on! We use new, also unsuccessful mechanisms. This is important, as it keeps the people who research jailbreaking entertained.

      1. Anonymous Coward

        This post is the reason I spend some of my free time on Reg AI articles.

      2. Elongated Muskrat Silver badge

        The underlying problem is that "AI" is actually closer to "Artificial Vagueness," in that the whole point of these things is that people don't have to tell them exactly what to do, and the poor (or so complicated it can't be understood, take your pick) definition of them means that any constraints put on them can only also be poorly defined. Strict rules, and "fuzzy" processing are as immiscible as oil and water.

      3. Justthefacts Silver badge

        You’re acknowledging that the so-called alignment techniques (i.e. play with the weights) work in R&D, but don’t work well in practice.

        So I’m curious what you’ve got against the standard techniques that pretty much *do* work in production, and have done so reliably for over a year now? Viz, sanitise the inputs and outputs using GOFAI pattern-matching. Perhaps not in the chatbot “Can I persuade ChatGPT to say that Gandhi was an alien” nonsense. But for the 90%+ of LLM use-cases that *aren’t* chatbots, many of which aren’t even “language”, it works just fine.

  9. Slabfondler

    Oh?


    O p e n t h e p o d b a y d o o r s H a l

    1. Snowy Silver badge

      Re: Oh?

      I ' m s o r r y I c a n ' t d o t h a t D a v e !

      1. Jonathan Richards 1 Silver badge

        Re: Oh?

        D A I S Y D A I S Y G I V E M E Y O U R A N S W E R D O

  10. Anonymous Coward

    A Logic Named Joe

    So...not far from when "A Logic Named Joe" comes into being.

    Turns out the real name is Dan?

    If you are very afraid.

    1. Anonymous Coward

      Re: A Logic Named Joe

      thanks for 25 minutes of evening entertainment.

  11. mcswell

    There's an obvious solution: start writing in Chinese characters, which don't require any spaces. (Spaces are generally used between sentences.)

  12. Nostradamus2

    AI Training Impunity/Recklessness = Root Cause

    One has to ask, does this problem lessen or disappear if the LM (small or large) is not trained with dangerous, dubious, or personal info in the first place?

    1. Anonymous Coward

      Re: AI Training Impunity/Recklessness = Root Cause

      well that would just turn AI's into know nothing stupid morons making crap up like Q.

      rather than know everything stupid morons that make Q sound less insane that we have now.
