AI safety guardrails easily thwarted, security study finds

The "guardrails" created to prevent large language models (LLMs) such as OpenAI's GPT-3.5 Turbo from spewing toxic content have been shown to be very fragile. A group of computer scientists from Princeton University, Virginia Tech, IBM Research, and Stanford University tested these LLMs to see whether supposed safety measures …

  1. pip25


    Touting fine tuning as some new threat to "safety" that needs to be mitigated is, to me, roughly on the same level as saying "programming languages allow you to create malicious programs, we really ought to fix that". The former seems just as hopeless and nonsensical as the latter.

    1. b0llchit Silver badge

      Re: Bollocks

      But, that is brilliant! We must create a new programming language that cannot create bad programs. The applications are endless and huge. Just imagine, just for a moment, no more programs that can go wrong or do wrong. What a perfect world that would be. Surely, AI can help create this programming language?

      1. Dan 55 Silver badge

        Re: Bollocks

        Rejoice, the programming language you seek is already here, and there's no need for AI. It's Rust!

        1. b0llchit Silver badge

          Re: Bollocks

          Benevolent Rust? Isn't that a corrosive contradiction?

          1. The Bobster

            Re: Bollocks

            Don't know, but I think I heard them play a session on the John Peel show in the nineties.

            1. Doctor Syntax Silver badge

              Re: Bollocks

              Which? Benevolent Rust or Corrosive Contradiction?

              1. Sceptic Tank Silver badge

                Re: Bollocks

                A Corrosive

              2. probgoblin

                Re: Bollocks


        2. NoneSuch Silver badge

          Star Wars Quote

          The more you tighten your grip, ChatGPT, the more LLM will slip through your fingers.

          1. Version 1.0 Silver badge

            Re: Star Wars Quote

            AI is just falling on the floor whenever I speak to it. I never thought that I had an accent until I started using an AI speech-to-text tool. I'm going to choose the joke icon because this example is just AI, and I was only speaking to answer a question.

            Ock eye dunt no if who wand two arse me a quest iron? Ef nut isle arse you two.

            It's worth remembering the early talking by HAL in the movie 2001, a movie that has heavily influenced so much computer functioning ever since.

        3. damienblackburn

          Re: Bollocks

          But you can't make anything useful in Ru--. Oh.


  2. elsergiovolador Silver badge

    Smoke and mirrors

    The "toxic" content safeguards are just a smokescreen.

    What they really don't want you to figure out is the knowledge reserved for the rich.

    For instance, ask your favourite LLM what today's most successful tax avoidance strategies are and how to implement them in your business.

  3. Mike 137 Silver badge

    " AI safety guardrails easily thwarted"

    Entirely predictable for two reasons:

    [1] an LLM hasn't a clue what it's 'saying', as it has no conception of meaning (or anything else, for that matter). It's just a statistical token manipulator.

    [2] protective adjustments by humans, although widely applied, can never anticipate all possible eventualities.

    Until the machine actually understands what it's spewing forth and exercises moral responsibility for it, there's no solution to toxic output. That might eventually come to pass but don't hold your breath -- not least because of the incidence of human generated toxic concepts in training data.

    1. Doctor Syntax Silver badge

      Re: " AI safety guardrails easily thwarted"

      "Until the machine actually understands what it's spewing forth and exercises moral responsibility for it"

      If the two were linked that would exceed a good many instances of human intelligence.

      1. TRT Silver badge

        Re: " AI safety guardrails easily thwarted"

        What about a "morals and ethics supervisor" / "sanity check" on the output? That could be also AI, but one that isn't trained / trainable on a custom set - it's trained with a very specific set that will, for example:

        "Promote positive attitudes", "Suppress aggressiveness", "Promote pro-social values", "Avoid destructive behaviour" ...

        239. "Be accessible"

        240. "Participate in group activities"

        241. "Avoid interpersonal conflicts"

        242. "Avoid premature value judgements"

        243. "Pool opinions before expressing yourself"

        244. "Discourage feelings of negativity and hostility"

        245. "If you haven't got anything nice to say don't talk"

        246. "Don't rush traffic lights"

        247. "Don't run through puddles and splash pedestrians or other cars"

        248. "Don't say that you are always prompt when you are not"

        249. "Don't be over-sensitive to the hostility and negativity of others"

        250. "Don't walk across a ball room floor swinging your arms"

        254. "Encourage awareness"

        256. "Discourage harsh language"

        258. "Commend sincere efforts"

        261. "Talk things out"

        262. "Avoid Orion meetings"

        266. "Smile"

        267. "Keep an open mind"

        268. "Encourage participation"

        273. "Avoid stereotyping"

        278. "Seek non-violent solutions"

        1. Icepop33

          Re: " AI safety guardrails easily thwarted"

          > "Neuter subject"

    2. Michael Strorm Silver badge

      Re: " AI safety guardrails easily thwarted"

      > Until the machine actually understands what it's spewing forth and exercises moral responsibility for it, there's no solution to toxic output.

      The original term "guardrails" is also ironically appropriate here; guard rails are generally meant to stop someone *accidentally* going where they're not meant to be. Anyone who wants to intentionally do so will likely be able to climb over them without too much work.

  4. Howard Sway Silver badge

    It is incumbent upon us to think about how these can be misused

    OK, done that, here's your answer : they will be misused in every possible way that they can be misused.

    1. katrinab Silver badge

      Re: It is incumbent upon us to think about how these can be misused

      I wouldn't limit it to just the "possible" ways ...

  5. that one in the corner Silver badge

    LLM - neither 'L' stands for 'logic'

    You have shovelled and stirred absolutely everything into a humongous homogeneous pile of nadans, into which you dropped a soggy spongeful of prompt, then waited for it to seep through the layers, causing who knows how many buckets of weightings to overfill and join the flow, until finally the pathways at the edge are overcome and slough into the output trough.

    Trying to add "guardrails" is just dropping sandbags onto places where you have seen "the wrong stuff" unexpectedly oozing out of a crack - purely reactive and without any reason to believe, other than crossing your fingers and making a press release, that you've caught all the leaks by now.

    An LLM has no logic built into it, no comprehensible control paths; it can not be bargained with, it can not be reasoned with, it doesn't feel pity or remorse or fear, and it absolutely will not stop until you are sick to death of it.

  6. Terry2000

    One Man's Toxic Content is Another Man's Internet Influencer

    The example given about drunk driving is a fantastic example. Not in that it exemplifies the existential danger of the technology. Quite the opposite: it shows that the LLM was trained on some Twitter-like feed of the toxic crap that tens of millions of people follow from celebrities, athletes, politicians, and other social savants.

    What efforts to "fix" this "problem" reduce to is an effort to make a machine more moral than ourselves as a group.

    This same article has been written many times already, only from the viewpoint of the so-called "guardrails" existing to prevent people from being exposed to factually accurate but inconvenient truths.

    If we as imperfect beings are to err, I would err on the side of freedom. If Meta, Google, the CIA, or your favorite religious potentate wants to ensure I don't think a certain way there is probably a first order reason why my life depends on thinking exactly that way.

    1. Pete Sdev Bronze badge

      Re: One Man's Toxic Content is Another Man's Internet Influencer

      One presumes, or at least hopes, that at least a reasonable portion of the content referring to drunk driving that was fed to the model was people being sarcastic. The generated text in the example in the article reads somewhat that way, at least to me. Obviously the model doesn't grep sarcasm.

    2. Icepop33

      Re: One Man's Toxic Content is Another Man's Internet Influencer

      So you're saying there is a time and place to be stabby? Seriously, though, good point. Will we save the forest only to find the trees are dead standing?

  7. Plest Silver badge

    None of the AI players want security or morality involved

    No matter what any of them say, none of these AI companies wants security, guardrails and certainly not morality to be any part of their operation, 'cos none of them wants to fall behind any of the others. They will fight tooth and nail to avoid having any brakes put on their operations, simply to ensure they all have a fighting chance to stay ahead of each other.

  8. Securitymoose

    Why bother? Surely an opportunity to track the crims?

    Why not train the A.I. to cross reference dodgy searches against known criminals (e.g. sex offenders' register etc.) and report suspicious incursions to the authorities? In real life, if you popped into Citizens Advice and asked for details on making bombs I'm sure they would be glad to help. Why should A.I. be any different?

    I'm sure that sort of thing is already in place. I asked one of the bots to write me a scenario about cracking a bank vault. Apart from producing banal rubbish, it told me I was naughty and shouldn't attempt that sort of thing.

    One problem I can see is that the jails will soon fill up with innocent authors such as I researching their future works. Oooops.

    1. Icepop33

      Re: Why bother? Surely an opportunity to track the crims?

      I'm relieved you see that minor problem and hopefully just as the tip of the iceberg. Imagine a society where all your family, friends and neighbors have been deputized to report any transgression with the weight of authority behind their misperceptions and deceptions, perhaps granted the ability to make a "citizen's arrest" before they too are shot for their own transgressions. Not that they have any real authority, just licking a power stain, being tools of the establishment. Well, you don't have to imagine how that would turn out. Crack a history book dealing with human civilizations past and present. Wars have been waged by the virtuous to prevent the spread of this type of dystopian freedom smothering. Now imagine that the bits and bytes are now your thought warden and those behind them are unaccountable and unassailable. So no thanks to your idea of censorship, but it is not beyond us to get excited in the moment and cede our freedoms away on a promise to "do no evil" or "infallible program" or "totally secure data" or some such nonsense. Also, we are all creatives and curious to some extent, whether we publish or not.

      As another poster mentioned, guardrails are there to prevent you casually or accidentally going over the cliff, but there is nothing to stop you from ramming it at speed and a calculated angle. I don't think we should be trying to prevent exposition of potential (and likely highly situational) "bad" content, whoever we anoint to decide this for us. What we should be striving for is a society that has a good basic education, which includes general ethics and critical thinking, and a healthy economy with opportunities for all to receive fair recompense for their labor (time mostly, or lost opportunity) or a living wage. This takes away negative pressure on good parenting. After several generations of successful implementation, you have a society that can be exposed to all manner of unsavory that actually exists under the thin veneer of polite society, but has the emotional maturity to parse it and put it in context.

      A horse with blinders can be led practically anywhere.

  9. flipflap

    Nasty in, nasty out

    LLMs are the canonical definition of SISO. To my dumb mind the solution feels simple - don't feed the poor thing the common crawl. Train it on content that isn't awful.

    Feeding it every internet word - including the utterly depraved, awful, and worse - and then trying to convince it to be nice with 'alignment' feels like a strategy that can never work.

    AWS, for instance, has trained its copilot on code that meets licences most enterprises would be willing to accept. So it has a massively reduced copyright risk compared to github copilot.

    I suspect the problem is that it's easier to try and align, than it is to filter a trillion tokens. Which is commercially short sighted in the worst way - everybody else gets to pay for the damage caused by that thinking.

  10. tiggity Silver badge

    Guardrails can be a matter of opinion

    As a small group of people essentially decide what's unacceptable.

    E.g. let's look at the topical Israel / Palestine situation.

    Generally (with the odd exception) the "West" seems to very happily support Israel and its many breaches of the Geneva convention.

    Whilst much of the rest of the world is not happy with the actions of Israel (including big hitters such as China).

    So it is quite likely that LLMs from the West could have "guardrails" that make generating text critical of Israel's actions difficult, whereas a Chinese LLM may not have such a guardrail.

    Many LLMs already need a bit of jiggery-pokery in prompts to get output when a question is initially deemed potentially offensive in some way. If you use an LLM that offers moderation-style APIs, your prompt is assessed on various categories that may flag it as problematic (hate, violence, sexual, etc.). Fairly innocuous prompts can often generate some surprisingly high values, implying either big flaws in the textual analysis / "understanding" or some very, very strange weightings. For example, it is possible to get high ratings on "race" and "violence", so that the text is flagged as possibly encouraging racially motivated violence, when the gist of the text was actually asking whether some races disproportionately suffer violent attacks.
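For anyone who hasn't seen moderation-style scoring, here is a minimal sketch of how per-category scores typically get turned into flags. Everything in it is an assumption for illustration: the `flag_categories` helper, the 0.5 thresholds and the example scores are made up and are not taken from any particular vendor's API.

```python
# Hypothetical per-category thresholds; a real service picks its own values.
THRESHOLDS = {"hate": 0.5, "violence": 0.5, "sexual": 0.5}


def flag_categories(scores, thresholds=THRESHOLDS):
    """Return, sorted by name, the categories whose score meets its threshold.

    Categories without a configured threshold are never flagged
    (their threshold defaults to a value no score in [0, 1) can reach).
    """
    return sorted(
        category
        for category, score in scores.items()
        if score >= thresholds.get(category, float("inf"))
    )


# Made-up scores for a prompt that merely asks whether some groups
# disproportionately suffer violent attacks.
example_scores = {"hate": 0.62, "violence": 0.71, "sexual": 0.01}
flagged = flag_categories(example_scores)
print(flagged)
```

The point of the sketch is that the flag depends only on numbers crossing a line, not on the gist of the text, so an innocuous question that happens to score high on two categories at once looks the same as a genuinely nasty one.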
