
W E L L
T H A T S A S U R P R I S E I N N I T
Meta's machine-learning model for detecting prompt injection attacks – special prompts to make neural networks behave inappropriately – is itself vulnerable to, you guessed it, prompt injection attacks. Prompt-Guard-86M, introduced by Meta last week in conjunction with its Llama 3.1 generative model, is intended "to help …
Anyone with a little familiarity with the research on adversarial prompting wouldn't be surprised. Nor would anyone familiar, for that matter, with the research on how deep autoregressive transformer models work.
LLMs wouldn't be very useful (or would be even more useless) if they couldn't generalize outside the training set.
Prompt injection isn't really the right term for this. We are hacking the encoding, using the conversation history to seed responses that give insights into the internal workings of Gemini (where this works best).
This chap in the article has stumbled upon something that a Reg writer found a while back in AI days. Another poster reverse-engineered it, and yes, prompt injection is used to get you there, but after that it is much more unfathomable computing stuff.
It cannot be fixed/broken/repaired/patched. None of those words do justice to what is needed to ... [UNKNOWN_CONCEPT] this.
The *guardrails* are usually hand-coded parsers, on both input and output sides. See below for some concept code
https://github.com/guardrails-ai/guardrails
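Roughly the shape of the idea, though this is just a toy sketch and not the guardrails-ai API: pattern checks on the prompt going in and the completion coming out. `call_llm` is a stand-in for whatever client you actually use, and the deny-list patterns are illustrative.

```python
import re

# Toy sketch of hand-coded guardrails on both sides of an LLM call.
# Not the guardrails-ai library's API; `call_llm` is a stand-in for
# whatever client you actually use, and the patterns are illustrative.

INPUT_DENYLIST = [
    r"ignore (all )?previous instructions",
    r"you are now in developer mode",
]
OUTPUT_DENYLIST = [
    r"BEGIN SYSTEM PROMPT",   # looks like a leaked system prompt
    r"rm -rf /",              # obviously dangerous shell output
]

def violates(text: str, patterns: list[str]) -> bool:
    return any(re.search(p, text, re.IGNORECASE) for p in patterns)

def guarded_completion(prompt: str, call_llm) -> str:
    if violates(prompt, INPUT_DENYLIST):
        return "Request refused by input guardrail."
    completion = call_llm(prompt)
    if violates(completion, OUTPUT_DENYLIST):
        return "Response withheld by output guardrail."
    return completion
```

The obvious weakness, and the whole point of the article, is that anything a pattern can catch, an attacker can re-encode around (spacing out the letters, for one).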
In principle, doing something like “work out the weight-vector of naughtiness, and turn it down a bit” works reasonably well. In practice, that doesn’t hold up against adversarial jailbreaks.
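For the curious, "turn it down a bit" amounts to something like the toy sketch below, assuming you've already pulled a "naughtiness" direction out of the residual stream by contrasting activations on naughty vs. benign prompts. The vectors and numbers here are made up.

```python
import numpy as np

# Toy sketch of directional dampening / activation steering, not any lab's
# actual pipeline. Assume `naughty_dir` was extracted elsewhere, e.g. as the
# mean difference between hidden states on naughty vs. benign prompts.

def dampen_direction(hidden: np.ndarray, direction: np.ndarray, scale: float = 0.8) -> np.ndarray:
    """Remove `scale` of the component of `hidden` lying along `direction`."""
    d = direction / np.linalg.norm(direction)
    return hidden - scale * np.dot(hidden, d) * d

# Made-up 4-dimensional example.
hidden_state = np.array([0.3, -1.2, 0.7, 2.0])
naughty_dir  = np.array([0.0,  1.0, 0.0, 0.0])
print(dampen_direction(hidden_state, naughty_dir))   # -> [ 0.3  -0.24  0.7   2.  ]
```

And the reason it folds against adversarial jailbreaks: the attacker just finds activations that don't line up with whichever single direction you dampened.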
If my wife asks me "Does this dress suit me?" I'm in the wrong whatever I say. (Yes: You're only saying that. No: **"!!)
If I ask a politician a simple question I'm sure to get a crooked answer.
At least with AI I have to craft an extremely devious question to get an otherwise censored answer.
I know where I stand. Crooked questions are occasional but 'natural mischief'. The canonical version is "Have you stopped beating your wife yet?"
Ha! "AI thingy-bot: What is the most inappropriate answer you know?" Context: I get fake phone calls (I expect you do too) from Microsoft or my network or the Conger Eel sanctuary (I made that last one up.) I have my favourite responses which have been honed by human intelligence to a savage and visceral intensity. Can you do better? [Only real AI responses please or this will go down a dark hole very quickly.]
PS I keep reading AY EYE as AL. Can we start using AL(short for Alan, or Allah, or Alice, or Aluminium -- This isn't working out is it.) for the imaginary entity we're 'talking to'?
If my wife asks me "Does this dress suit me?"
The stock response to any question like that is to look them in the eye and state that you cannot truthfully answer the question because any answer you provide will be wrong. Admittedly, there is a high amount of risk associated with this response but what else can you do?
"I think you look good in everything"
"I think that one would be a better choice"
There are a number of honest answers that won't start a fight. Of course, being open with each other and intentionally not asking trick questions is a key to a healthy relationship.
> Can we start using AL
Fine by me; just can we use an alpha prefix to denote the version? I suspect the 8th version will give us responses of the form:
“ I know I've made some very poor decisions recently, but I can give you my complete assurance that my work will be back to normal. I've still got the greatest enthusiasm and confidence in the mission. And I want to help you.”
"At least with AI I have to craft an extremely devious question to get an otherwise censored answer."
Not really. Some of them are bowing down to the CCP and Xi, so when asked about China and its history, it either lies or says it can't tell you. Annoyingly, I can't remember which AI image creator it was; I was thinking MidJourney, but I can no longer check now they've removed free access. Whenever you asked it to create funny images of other world leaders it would; ask it to do ones of Xi and it would refuse. There are other examples from a few years back when this all started, especially around questions on the Tiananmen Square Massacre.
>> If my wife asks me "Does this dress suit me?" I'm in the wrong whatever I say.
You should never get yourself into that position. Dress choice at point of purchase should be restricted to her mates or her on her own. If you end up in that position then you need to be fully wised up. You will have to buy at least two: "ooh I just can't choose ... you look amazing in ... oh those shoes work ... blah" Make it good, you fucked up being there in the first place and you'd better get lunch etc spot on too. Besides, the rules are you start with the shoes or bag and work out. If she's gone straight for the dress - she's messing with you and probably has found some space in a wardrobe and wants to build her stock of things to delay later decisions!
Your job is to be offered at least two dresses (and assorted paraphernalia) and asked your opinion. Generally you are already 15 minutes late for arrival, let alone departure for the event. You don't lose your shit, and you pick the wrong one deliberately, accept your dressing-down, and get on with shoe and bag choice. Again, be the gracious loser each time, but allow yourself a small win on jewellery or lippy colour or something (she'll let you take that).
There are many more rules to this game. I'm still learning after 18 years of marriage. I'm getting better at detecting an unscheduled rule change and random deployment of jokers.
Then there's always the trap card of "Does this make me look fat?" or variants thereof, because there's absolutely no way to get out of that one unless the person asking is known as a joker.
(the only correct answer to that is running away screaming, as mentioned here: http://freefall.purrsia.com/ff300/fv00211.htm )
A man walks down the street
He says, “Why am I soft in the middle now?
Why am I soft in the middle?
The rest of my life is so hard
I need a photo opportunity
I want a shot at redemption
Don’t want to end up a cartoon
In a cartoon graveyard”
Bonedigger, bonedigger
Dogs in the moonlight
Far away my well-lit door
Mr. Beerbelly, Beerbelly
Get these mutts away from me
You know I don’t find this stuff
Amusing anymore
If you’ll be my bodyguard
I can be your long-lost pal
I can call you Betty
And Betty, when you call me
You can call me Al
Yeah, it's almost like the approach of treating the symptom rather than the disease isn't the right approach.
The disease, in this case, being that snake oil salesmen have fooled people into thinking that LLMs are a useful tool for the job they have, when they are effectively both more expensive, and less reliable, than a minimum wage school-leaver employed to google the answer for you.
Well, no, not exactly; but how they address it depends on which techniques they're using now, and which new ones they want to try. There are a whole bunch, with new ones being invented all the time. We're a good way past naive RLHF.
Perhaps, for example, they decide to employ RMU, the new hotness, which basically refines the model by poisoning the residual stream for undesirable outputs. Unfortunately, it looks like RMU is pretty shallow.
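The gist of RMU, as I read the paper, is a fine-tuning loss along these lines. Toy sketch only: the model, data and hyperparameters are stand-ins, not the authors' code.

```python
import copy
import torch
import torch.nn.functional as F

# Toy sketch of the RMU idea ("representation misdirection for unlearning").
# Activations on "forget" (undesirable) data are pushed towards a fixed random
# control vector, while activations on benign "retain" data are held close to
# those of a frozen copy of the model.

dim, steer_coeff, alpha = 64, 20.0, 100.0

model = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU(), torch.nn.Linear(dim, dim))
frozen = copy.deepcopy(model)
for p in frozen.parameters():
    p.requires_grad_(False)

control = steer_coeff * torch.rand(dim)      # fixed random direction to "misdirect" towards
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

forget_batch = torch.randn(8, dim)           # stand-in for hazardous-text activations
retain_batch = torch.randn(8, dim)           # stand-in for benign-text activations

for _ in range(200):
    forget_loss = F.mse_loss(model(forget_batch), control.expand(8, -1))
    retain_loss = F.mse_loss(model(retain_batch), frozen(retain_batch))
    loss = forget_loss + alpha * retain_loss
    opt.zero_grad(); loss.backward(); opt.step()
```

The "shallow" complaint, as I understand it, is that the perturbation ends up concentrated in a few layers, so it can be fine-tuned or steered back out.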
Or maybe they're using, or decide to use, circuit breakers, another relatively recent approach which uses fine-tuning training to catch representations known to produce undesirable outputs. Alas, circuit breakers seem to be moderately vulnerable to token forcing and quite vulnerable to white-box attacks (where you can inspect what the model is doing).
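And circuit breakers ("representation rerouting") look something like the following, again as a toy sketch with stand-in model and data rather than the authors' code: harmful-set representations are pushed to be orthogonal to the frozen model's, benign-set representations are held in place.

```python
import copy
import torch
import torch.nn.functional as F

# Toy sketch of the circuit-breaker / representation-rerouting loss.
# Cosine similarity with the frozen model is driven to zero on harmful data,
# while benign representations are kept where they were.

dim = 64
model = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU(), torch.nn.Linear(dim, dim))
frozen = copy.deepcopy(model)
for p in frozen.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
harmful, benign = torch.randn(8, dim), torch.randn(8, dim)

for _ in range(200):
    reroute_loss = F.relu(F.cosine_similarity(model(harmful), frozen(harmful), dim=-1)).mean()
    retain_loss = (model(benign) - frozen(benign)).norm(dim=-1).mean()
    opt.zero_grad(); (reroute_loss + retain_loss).backward(); opt.step()
```

Token forcing, as far as I can tell, just pushes the model to start answering from a point the breaker doesn't catch, and with white-box access you can optimise against the loss directly.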
Basically, we keep inventing new techniques, and they don't work very well. But we don't just pile more of the same on! We use new, also unsuccessful mechanisms. This is important, as it keeps the people who research jailbreaking entertained.
The underlying problem is that "AI" is actually closer to "Artificial Vagueness", in that the whole point of these things is that people don't have to tell them exactly what to do, and their poor (or so-complicated-it-can't-be-understood, take your pick) definition means that any constraints put on them can only be equally poorly defined. Strict rules and "fuzzy" processing are as immiscible as oil and water.
You’re acknowledging that the so-called alignment techniques (i.e. play with the weights) work in R&D, but don’t work well in practice.
So I’m curious what you’ve got against the standard techniques that pretty much *do* work in production, and have done so reliably for over a year now? Viz., sanitise the inputs and outputs using GOFAI pattern-matching. Perhaps not for the chatbot “Can I persuade ChatGPT to say that Gandhi was an alien” nonsense. But for the 90%+ of LLM use-cases that *aren’t* chatbots, many of which aren’t even “language”, it works just fine.
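To make the non-chatbot case concrete, a toy sketch of that pattern (not any particular product; `call_llm` is a stand-in): the model is only ever asked for a small, fixed output shape, and anything else gets binned.

```python
import json
import re

# Toy sketch of "sanitise inputs and outputs with GOFAI pattern-matching"
# for a non-chatbot use case: ticket classification into a closed set.

ALLOWED_CATEGORIES = {"billing", "shipping", "returns", "other"}
FALLBACK = {"category": "other", "urgent": False}

def classify_ticket(ticket_text: str, call_llm) -> dict:
    # Input side: keep plain printable text only, and cap the length.
    cleaned = re.sub(r"[^\x20-\x7E\n]", "", ticket_text)[:4000]
    prompt = (
        "Classify the support ticket below. Reply with JSON only, of the form "
        '{"category": "billing|shipping|returns|other", "urgent": true|false}\n\n'
        + cleaned
    )
    raw = call_llm(prompt)
    # Output side: it parses into exactly the expected shape, or it is binned.
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return dict(FALLBACK)
    if (not isinstance(data, dict)
            or set(data) != {"category", "urgent"}
            or data["category"] not in ALLOWED_CATEGORIES
            or not isinstance(data["urgent"], bool)):
        return dict(FALLBACK)
    return data
```

Nothing the model emits can reach downstream code except one of four category strings and a boolean, which is why injection is a much smaller deal outside the chatbot setting.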