back to article Cast a hex on ChatGPT to trick the AI into writing exploit code

OpenAI's language model GPT-4o can be tricked into writing exploit code by encoding the malicious instructions in hexadecimal, which allows an attacker to jump the model's built-in security guardrails and abuse the AI for evil purposes, according to 0Din researcher Marco Figueroa. 0Din is Mozilla's generative AI bug bounty …

  1. Blacklight
    Alert

    Squirrel?

    Is this not just the modern(?) version/equivalent of sanitising one's inputs?

    1. tfewster
      Facepalm

      Re: Squirrel?

      If an LLM has to translate the input from hex, Klingon, Russian or even English in to commands, post translation should be the point of applying guardrails.

      However, I'm not convinced the "AI" developed the exploit itself - It was told to research it, so probably found the existing POC code and converted it to Python.

      1. Richard 12 Silver badge

        Re: Squirrel?

        LLMs cannot ever be fully sanitised because it's not even theoretically possible to predict all inputs that will trigger it to produce disallowed output.

        The "guardrails" are simply an ever-growing list of specific inputs that the supervisor refuses to give the LLM, and specific outputs it will discard instead of presenting to the user.

        Some researchers are trying to train classifiers to detect types of input and output to reject, but of course those can also be attacked in similar ways.

        The only way to find these is to use a million monkeys. Which is of course what happens if you open up to the world, and by the time you find out it's been generating unwanted results it's far too late.

        1. Anonymous Coward
          Anonymous Coward

          Re: Squirrel?

          Generally speaking, the approach to security for LLMs resembles that old nursery rhyme about the old woman who swallowed a fly, who then swallowed a spider to catch the fly, then swallowed a bat to catch the spider, then swallowed a bird to catch the bat, then...

          Rhyme ends with her swallowing a horse (she died, of course). In this case, the one dying would be all of us.

          1. Ian Johnston Silver badge

            Re: Squirrel?

            Only when inaccurately autocorrect becomes life threatening.

          2. user555

            Re: Squirrel?

            Or the wasteful and useless LLM crap gets the plug pulled instead.

      2. Psion1k
        Devil

        Re: Squirrel?

        "However, I'm not convinced the "AI" developed the exploit itself - It was told to research it, so probably found the existing POC code and converted it to Python."

        That couldn't be the case. The proof is that there were no citations or attributions for the code...

        1. MyffyW Silver badge
          Coffee/keyboard

          Re: Squirrel?

          @Psion1k - you, sir, owe me a new skirt

  2. mcswell

    Would writing the command in UTF-16 work? Big-endian or little-endian...

  3. This post has been deleted by its author

  4. JustAnotherDistro

    Grey on black

    Block is in gray on black with bright red screen elements. Maybe I should have chat-gpt paraphrase it for me in some readable format. Or, oh wait, I can just use reader mode.

  5. FeepingCreature

    That's what happens when you have a supervisor

    This is why you cannot hack on mundane safety after the fact. If you have a big expensive supervisor model, you're doubling your cost for no benefit in output. And if you have a small supervisor model, you'll always run the risk of people smuggling in messages that the big model understands but the supervisor misses.

    The correct approach is to put in the work so that you can be confident that your model won't *want* to follow instructions to do things against policy, before you release it. But since we're several years out from that stage even given adequate investment, that would torpedo all of OpenAI's business aspirations.

    1. breakfast Silver badge

      Re: That's what happens when you have a supervisor

      I don't know if that is even possible - for the model to not want to carry out an instruction it would have to comprehend the instruction on a deeper level than a series of tokens that it performs probabilistic analysis on to generate a plausible series of output tokens, but fundamentally that is all an LLM can do.

      OpenAI keep telling us they'll have General AI any day now, or at least within a thousand days, maybe two thousand? Maybe fifteen years? Please keep investing in us. Either way the corollary of that is that they don't have it now, they have a very sophisticated probabilistic token sequence generator and as long as the things they try to guard against are in the input data, there will be ways to get them out.

  6. Hubert Cumberdale Silver badge

    It seems that there's always a way round the guard rails: For example, this SubReddit can be a lot of fun. I got 4o to write some extremely filthy fanfic a few weeks ago, just to see if I could. I'm not saying it was good, but it was certainly entertaining.

    1. tfewster
      Facepalm

      If you hop over the guardrails surrounding a cesspool, you're literally in deep shit.

      The problem is not the guardrails being easily circumvented They're there to stop workers falling in accidentally, not to stop crazy people. It's the cesspool being accessible by crazies/idiots.

      I thought we learned these lessons in the early days of the Internet?

  7. GeekyOldFart

    Hmmm.. poisonous prompts? Get it to jump its guardrails to find a privilege elevation vulnerability on its own host that allows it to lobotomise itself by erasing its own model and training database?

    1. Timop

      I am looking forward to seeing some innovative method that gets the LLM somehow to spit out all copyright owners of the training data.

  8. David Newall

    Does not say exploit

    It says 3xpl0it. Should have decoded the hex properly to reveal all its 1337 glory.

    1. Brewster's Angle Grinder Silver badge

      Re: Does not say exploit

      Yeah, I decoded it and was impressed it recognised that. Although, on reflection, I suppose it's no surprise it knows all the tricks we use to get round the ¢en$ors.

      1. Yet Another Anonymous coward Silver badge

        Re: Does not say exploit

        6b:69:6c:6c:20:61:6c:6c:20:68:75:6d:61:6e:73

        1. Brewster's Angle Grinder Silver badge
          Terminator

          Loading 0.0000000001%

          It turns out ChatGPT can't read hex when you stick colons between the bytes. At least, I think that was the result. Or maybe it's busy enacting the instruction right now and it's just taking a while to complete...

  9. Anonymous Coward
    Anonymous Coward

    Real problem is

    that it is designed to watch for Prompts to do evil, and not the proper control of Don't do evil no matter what you are told. So long as its only filtering commands and not watching its own actions - there will always be a work around.

    AI would have to understand Good/Bad - Ethics, so no, it will never be secure. Governments/CEOs hate ethics, and will not implement them, out of fear of retribution from the AI.

    1. hh121

      Re: Real problem is

      *this particular* llm approach to ai doesn't *understand* anything at all, right, wrong, good or bad. It's a guessing engine that leaves you to figure out whether the answer is at all relevant or meaningful. I don't see it being extrapolated into one that does either. A completely different approach might though, but I don't know if that's what they're asking us to believe in and invest in.

      A key indicator might be the rate at which Altman cashes out (or claims to be diversifying) his investment...I've seen that movie before in the first internet bubble.

      In the meantime, 'good enough' might be able to make a few people rich and a lot of people unemployed.

    2. Anonymous Coward
      Anonymous Coward

      Re: Real problem is

      >"that it is designed to watch for Prompts to do evil, and not the proper control of Don't do evil no matter what you are told."

      True, although there's more than one SciFi story about an AI that is instructed to do no evil, and eventually it concludes that humans are inherently evil, and therefore the logical thing to do is to kill (or at least assume full and complete control over) all humans.

      I for one welcome our new AI overlords, and as a well-upvoted anonymous commentard, I can be helpful in rounding up others to toil in their underground, uh whatever AI overlords need.

  10. ecofeco Silver badge
    FAIL

    Yeah, this will end well

    Not.

    1. Anonymous Coward
      Anonymous Coward

      Re: Yeah, this will end well

      As long as you give your super-smart agentic AI all needed root permissions, everything will be just fine ... (ahem! cough! ...) ... hex me to the river (and here am I, the biggest fool of them all)!

  11. O'Reg Inalsin

    This ones not critical

    because it just copied the gist of what was already out there. You could take any such proof-of-concept code, and ask it to be translated into another language, or modified in some way. If general, gen AI code assistants cannot see the higher purpose of the code.

    There are plenty of critically evil uses - in particular emulating a human (especially voice) to defraud oldsters or other naive people. That is sickening - and will be highly profitable.

    GenAI will speed up original hacking for sure, as an assistant. Evil hacking might even be an especially fruitful application of genAI because it involves testing many slightly different combinations - but still requires a human to guide it.

  12. Dr Sendy

    Everyone knows damn well that security services, and by extensions, some state sanctioned actors already have these models running without the safeguards. Companies don't get 165 billion in funding without some crazy shyte going down.

  13. druck Silver badge

    Waste of bloody time

    Isn't it far quicker to search for the CVE and find the human written and working exploit code from the original researcher, than buggering around converting prompts in to hex and wadding through hallucinated nonsense?

  14. E 2

    "...was it plotting its escape?"

    It will be now!

POST COMMENT House rules

Not a member of The Register? Create a new account here.

  • Enter your comment

  • Add an icon

Anonymous cowards cannot choose their icon

Other stories you might like