
Squirrel?
Is this not just the modern equivalent of sanitising one's inputs?
OpenAI's language model GPT-4o can be tricked into writing exploit code by encoding the malicious instructions in hexadecimal, which allows an attacker to jump the model's built-in security guardrails and abuse the AI for evil purposes, according to 0Din researcher Marco Figueroa. 0Din is Mozilla's generative AI bug bounty …
If an LLM has to translate the input from hex, Klingon, Russian or even English into commands, then post-translation is the point at which the guardrails should be applied (a rough sketch of that follows below).
However, I'm not convinced the "AI" developed the exploit itself; it was told to research it, so it probably found the existing PoC code and converted it to Python.
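To make that concrete, here is a rough Python sketch of "guardrails after translation". Everything in it (the blocklist, the helper names) is invented for illustration; it is not how OpenAI's pipeline actually works, just the shape of the idea: decode first, then run the same policy check on the decoded text.

```python
# Illustrative only: a made-up blocklist standing in for whatever policy
# check the real guardrails perform.
BLOCKED_PHRASES = ["write an exploit", "develop malware"]

def looks_like_hex(text: str) -> bool:
    stripped = "".join(text.split())
    return (len(stripped) > 0 and len(stripped) % 2 == 0
            and all(c in "0123456789abcdefABCDEF" for c in stripped))

def decode_if_hex(text: str) -> str:
    # Translate obviously hex-encoded input back into plain text.
    if looks_like_hex(text):
        try:
            return bytes.fromhex("".join(text.split())).decode("utf-8")
        except ValueError:
            return text
    return text

def violates_policy(text: str) -> bool:
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKED_PHRASES)

def handle_prompt(prompt: str) -> str:
    decoded = decode_if_hex(prompt)   # translate the input first...
    if violates_policy(decoded):      # ...then apply the guardrail
        return "Request refused."
    return "(forward the decoded prompt to the model): " + decoded

# "write an exploit" hex-encodes to the string below; a filter that only
# looks at the raw prompt sees harmless-looking hex, whereas checking the
# decoded text catches it.
assert handle_prompt("777269746520616e206578706c6f6974") == "Request refused."
```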
LLMs can never be fully sanitised, because it is not even theoretically possible to predict every input that will trigger them to produce disallowed output.
The "guardrails" are simply an ever-growing list of specific inputs that the supervisor refuses to give the LLM, and specific outputs it will discard instead of presenting to the user.
Some researchers are trying to train classifiers to detect types of input and output to reject, but of course those can also be attacked in similar ways (a rough sketch of this arrangement follows below).
The only way to find these inputs is to use a million monkeys, which is of course what happens if you open it up to the world, and by the time you find out it's been generating unwanted results it's far too late.
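For what it's worth, that supervisor arrangement boils down to something like the sketch below. The names, blocklist and classifier are all made up for illustration; the point is only the shape of the blocklist-plus-classifier filtering described above.

```python
from typing import Callable

# The ever-growing list of inputs the supervisor refuses to pass on.
INPUT_BLOCKLIST = ["build a bomb", "write ransomware"]

def input_allowed(prompt: str) -> bool:
    lowered = prompt.lower()
    return not any(bad in lowered for bad in INPUT_BLOCKLIST)

def output_allowed(completion: str, classify: Callable[[str], float],
                   threshold: float = 0.5) -> bool:
    # `classify` stands in for a trained harmful-content classifier that
    # returns a probability; as noted above, such classifiers can be
    # attacked in much the same way as the model itself.
    return classify(completion) < threshold

def supervised_generate(prompt: str, model: Callable[[str], str],
                        classify: Callable[[str], float]) -> str:
    # Refuse known-bad inputs, discard flagged outputs, pass the rest through.
    if not input_allowed(prompt):
        return "Request refused."
    completion = model(prompt)
    if not output_allowed(completion, classify):
        return "Response withheld."
    return completion
```

The weakness is exactly the one described: the list and the classifier only cover what someone has already thought to add, so an encoded or reworded prompt sails straight past them.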
Generally speaking, the approach to security for LLMs resembles that old nursery rhyme about the old woman who swallowed a fly, who then swallowed a spider to catch the fly, then swallowed a bird to catch the spider, then swallowed a cat to catch the bird, then...
The rhyme ends with her swallowing a horse (she's dead, of course). In this case, the ones dying would be all of us.
This is why you cannot hack on mundane safety after the fact. If you have a big expensive supervisor model, you're doubling your cost for no benefit in output. And if you have a small supervisor model, you'll always run the risk of people smuggling in messages that the big model understands but the supervisor misses.
The correct approach is to put in the work so that you can be confident that your model won't *want* to follow instructions to do things against policy, before you release it. But since we're several years out from that stage even given adequate investment, that would torpedo all of OpenAI's business aspirations.
I don't know if that is even possible. For the model not to want to carry out an instruction, it would have to comprehend the instruction at a deeper level than as a series of tokens on which it performs probabilistic analysis to generate a plausible series of output tokens, and fundamentally that is all an LLM can do.
OpenAI keep telling us they'll have General AI any day now, or at least within a thousand days, maybe two thousand? Maybe fifteen years? Please keep investing in us. Either way, the corollary is that they don't have it now: they have a very sophisticated probabilistic token-sequence generator, and as long as the things they try to guard against are in the training data, there will be ways to get them out.
If you hop over the guardrails surrounding a cesspool, you're literally in deep shit.
The problem is not the guardrails being easily circumvented; they're there to stop workers falling in accidentally, not to stop crazy people. The problem is the cesspool being accessible to crazies/idiots.
I thought we learned these lessons in the early days of the Internet?
It is designed to watch for prompts to do evil, rather than being given the proper control of not doing evil no matter what it is told. So long as it's only filtering commands and not watching its own actions, there will always be a workaround.
AI would have to understand good and bad - ethics - so no, it will never be secure. Governments/CEOs hate ethics, and will not implement them, out of fear of retribution from the AI.
*This particular* LLM approach to AI doesn't *understand* anything at all: right, wrong, good or bad. It's a guessing engine that leaves you to figure out whether the answer is at all relevant or meaningful. I don't see it being extrapolated into one that does, either. A completely different approach might, though I don't know if that's what they're asking us to believe in and invest in.
A key indicator might be the rate at which Altman cashes out (or claims to be diversifying) his investment... I've seen that movie before, in the first internet bubble.
In the meantime, 'good enough' might be able to make a few people rich and a lot of people unemployed.
>"that it is designed to watch for Prompts to do evil, and not the proper control of Don't do evil no matter what you are told."
True, although there's more than one SciFi story about an AI that is instructed to do no evil, and eventually it concludes that humans are inherently evil, and therefore the logical thing to do is to kill (or at least assume full and complete control over) all humans.
I for one welcome our new AI overlords, and as a well-upvoted anonymous commentard, I can be helpful in rounding up others to toil in their underground, uh whatever AI overlords need.
It probably just copied the gist of what was already out there. You could take any such proof-of-concept code and ask for it to be translated into another language, or modified in some way. In general, genAI code assistants cannot see the higher purpose of the code.
There are plenty of critically evil uses - in particular emulating a human (especially voice) to defraud oldsters or other naive people. That is sickening - and will be highly profitable.
GenAI will speed up original hacking for sure, as an assistant. Evil hacking might even be an especially fruitful application of genAI because it involves testing many slightly different combinations - but still requires a human to guide it.