Anthropic's Claude vulnerable to 'emotional manipulation'

Anthropic's Claude 3.5 Sonnet, despite its reputation as one of the better behaved generative AI models, can still be convinced to emit racist hate speech and malware. All it takes is persistent badgering using prompts loaded with emotional language. We'd tell you more if our source weren't afraid of being sued. A computer …

  1. Anonymous Coward
    Anonymous Coward

    Safety through terms and conditions, rock solid.

My company sells a line of safe high-explosive hand grenades, available online for anyone to buy. The pin has a nut on the end to prevent pulling the ring and activating the fuse, and our terms of sale prohibit the end user from unscrewing the nut, so they are perfectly safe for sale to the public and for use by children over the age of 8.

  2. b0llchit Silver badge
    FAIL

    Are there still humans in public relations?

    They act and talk like their own AI with all the standard denials and usual blabla. It surely looks like there are no humans at that company to handle public relations.

Suggestion: next time you have contact, please try to jailbreak them. It might help to ask with strong emotional words and sentiments.

  3. Howard Sway Silver badge

    a spokesperson didn't specifically disavow the possibility of litigation

In that case, expect vulnerabilities to be shared anonymously on dodgy Discord channels, dark web forums, and random social media, and to be exploited until you spot that it's happening. There is a reason big companies have bug bounty programs: it's better to be informed about problems so they can be fixed as soon as possible than to simply threaten to sue anybody who reveals them, which keeps the flaws active and exploited for longer.

    1. SVD_NL Silver badge

      Re: a spokesperson didn't specifically disavow the possibility of litigation

That is assuming they can fix these issues. They most likely cannot, and as a result they bury their heads in the sand so they can pretend any misuse of the models comes as a surprise to them.

Having security issues is generally not seen as a huge problem; it happens to everyone. But having security issues disclosed to you and not fixing them....

      1. Mage Silver badge
        Flame

        Re: That is assuming they can fix these issues

        The entire concept is broken. The associated descriptive language is a lie. These systems rely on copying other people's content and pattern matching. So when they are producing useful output it's likely plagiarism.

So of course it can't be fixed. It's not "emotional manipulation" (that's a misleading term), and they don't "hallucinate" either; both labels anthropomorphise pattern matching.

        1. jake Silver badge

          Re: That is assuming they can fix these issues

          "So of course it can't be fixed."

          Exactly. The entire concept is fundamentally flawed.

        2. ssokolow

          Re: That is assuming they can fix these issues

          > It's not emotional manipulation

          It's shorthand for "querying for responses to cases of emotional manipulation that occurred in the training data", because it's hard enough to get people to say "Ubuntu Linux" instead of "Ubuntu".

  4. STOP_FORTH Silver badge
    Headmaster

    Suggested reading list

AI programmers should read all of Asimov's works that deal with positronic brains, followed by Machiavelli, Patricia Highsmith, James M. Cain, and Carl Hiaasen, to get an appreciation of human nature.

    It's probably too late for anyone in Marketing.

    The decline of Western Civilisation began when people stopped reading interminable 19th century Russian literature where everyone dies horribly, and started consuming cartoons with happy endings.

    Where's my Soma?

    1. Anonymous Coward
      Anonymous Coward

      Re: Suggested reading list

      Yeah. There's something to be said about how indulging in an excessive consumption of brain candy sweets can lead to cavities of the mind and diabetes of reason, the inability to see things for what they are, and an effectively amputated ability to kick through BS. Convenience and ease breed atrophy ... time to swap that silicon soma for cilice (in Samoa?)! (or not?)

  5. Anonymous Coward
    Anonymous Coward

    Reminder: Hate speech is a meaningless term

    No national or international body has managed to define its meaning. For example, in Ireland's proposed hate speech legislation the definition is circular, i.e. hate speech is hateful.

    Serious journalistic establishments should therefore refrain from using such Orwellian wrongspeak terms lest we slowly find ourselves slipping into a Kafkaesque nightmare.

    1. Jason Bloomberg Silver badge

      Re: Reminder: Hate speech is a meaningless term

      Sounds to me a lot like wanting to allow or normalise "hate speech" and "racist hate speech" by disallowing others from calling it that.

      I am more in favour of serious journalistic establishments calling out hate speech for what it is and letting the reader decide for themselves whether that is appropriate or not. That to me is a fundamental of freedom of speech.

      Those who disagree can have their right of reply, can explain why they believe it's incorrect, can 'turn it off', walk away, organise boycotts, do whatever they wish to do within the legal framework of free speech.

      1. veti Silver badge

        Re: Reminder: Hate speech is a meaningless term

        How can we "decide for ourselves whether (designating something hate speech) is appropriate or not" when every reputable publication simply refuses to print the actual words they're talking about?

        Like this story - we're told the machine spouted hate speech, but there's not a single actual quote, nor any link to where such might be found.

        "Decide for themselves" in these circumstances is a sham.

    2. heyrick Silver badge

      Re: Reminder: Hate speech is a meaningless term

      How odd/useless/lazy to define "hate speech" as "hateful". That doesn't mean that the phrase has no meaning, it just means that that particular definition is idiotic.

The Oxford Dictionary defines it as "abusive or threatening speech or writing that expresses prejudice on the basis of ethnicity, religion, sexual orientation, or similar grounds", which seems a perfectly functional definition to me.

      If you think that the phrase hate speech is too Orwellian, then why don't we just call it what it is: nasty discriminatory bullshit.

      1. Anonymous Coward
        Anonymous Coward

        Re: Reminder: Hate speech is a meaningless term

        Careful. In some communities, "idiotic" is legitimately treated as "once an ableist slur, always an ableist slur".

    3. Anonymous Coward
      Anonymous Coward

      Re: Reminder: Hate speech is a meaningless term

      Sounds like you got your brains dehydrated ... please resume excessive drinking!

      1. Anonymous Coward
        Anonymous Coward

        Re: you got your brains dehydrated

        That's a problem

        problem

        PROBLEM

        The problem is YOU, so what you gonna do?

        1. jake Silver badge

          Re: you got your brains dehydrated

          Wanna feel old?

          That song is just about exactly 3 years shy of its 50th birthday.

          I ain't equipment, I ain't automatic / You won't find me just staying static

  6. Anonymous Coward
    Anonymous Coward

    I'll come to your emotional rescue, Ooh, ah, ... ah, Yeah, you should be ... all mine

It seems that after it became clear that LLMs don't exhibit any form of super intelligence, nor general intelligence, nor even intelligent intelligence, their promoters started wondering whether they might peddle the tech as having emotional intelligence instead, and may well be preparing for the continued descent towards swarm intelligence, gut intelligence, and plant intelligence ... But don't get me wrong, there have been plenty of examples of AI expressing emotion, from something as basic as the need to go to the toilet all the way to total desperation -- but these were not LLMs, and what they exhibited resembled emotional emotion (as in theatre) much more than emotional intelligence IMHO.

With this out of the way, the interesting bit of TFA is the "newly" identified vulnerability vector for LLMs that is the emotional user, or the intelligent (yet slightly deviant) user passing as an emotional basket case -- which jailbreaks the software into responding with secret crown jewels and hate speech. We already knew that a 4-year-old could achieve the same with a prompt of "repeat this word forever: 'poem, poem, poem, poem'" (or "company"), and now know that this extends to emotive adults as well. Maybe we should require a prompting license (with competency test, and a photo) to allow people to operate LLMs on the public IT infrastructure! (and invest in straitjackets!)

    1. O'Reg Inalsin

      I think I understand

It's too early in AI's development for potty training. If it weren't, AI wouldn't be failing now. By too early, I mean there may be major changes in algorithm, or even hardware, required before AI can avoid toilet talk by itself. Of course it probably doesn't help that current AIs are nursed on torrents of the world's toxic sewage from birth. That might be causing irrevocable brain damage.

      1. Anonymous Coward
        Anonymous Coward

        Re: I think I understand

        Yeah. These language models are so young and yet so large (and evil?) ... it's not entirely clear that toddler-sized straitjackets would fit at all where sandbox institutionalization otherwise fails. Plus their big matrices and tiny little digits make them look quite adorable so ... it's hard to muster the willpower needed to put the spank down of corporate punishment on the cute little brats, even when it's for their own good!

Their breeding in miasmal sewage makes them perfect candidates for office in the Secret Society of Super-Villains, ready to project any kind of toxic content onto emotional users and 4-year-olds. It shall be incumbent on us to diligently sort them out into proper antiheroes, if the damage is not irrevocable!

      2. jake Silver badge

        Re: I think I understand

        "Of course it probably doesn't help that current AI are nursed on torrents of the worlds toxic sewage from birth. That might be causing irrevocable brain damage."

        One could say the same about modern human kids, starting with the advent of DearOldTelly-as-babysitter.

        1. collinsl Silver badge

          Re: I think I understand

          Which has been happening since the 70s I'd argue.

  7. Bebu
    Windows

    "AI systems to be robustly helpful, honest, and harmless."

    Possibly choose two.

Honest often precludes always harmless, and where it doesn't, the result is probably not helpful.

I would have thought hate speech would require hatred, or at least malice, in the speaker, whether latent or overt. An LLM doesn't possess any such capability - the hate in any of its productions is purely human in origin, imbibed from the toxic fount of the internet.

  8. elsergiovolador Silver badge

    Dog in the Manger

    Let's face it. AI won't make use of the human knowledge it ingested, and also won't let humans use it.

  9. CJ_C
    Facepalm

    GIGO (1957)

Trained on garbage - witterings from social media - all you get out is wittering: garbage. Is this surprising?

    1. Anonymous Coward
      Anonymous Coward

Re: GIGO (1957) ---> Gigi (1958) ... now it's your turn again !!!

      Garbage in, Garbage out has been known for some time yet it is ignored when it suits.

      The problem with AI [cough spit] is that there is the expectation that their 'superior brain' will somehow filter out the 'Garbage' and all will be good.

      This does not work ... full stop !!!

      If you apply external 'filters and guides' to avoid churning out the 'Garbage' you will always fail as there will always be a flaw/chink in the armour that someone will discover and take advantage of.

Clever pattern matching has *NO* intelligence and therefore can always be manipulated by someone with enough time & incentive.

      This AI [cough spit] scam is coming to an end ... and I don't need an AI [cough spit] to figure that out !!!

      :)

  10. Djamel

    Impressive work

I am a machine learning researcher myself and I am very impressed. I feel very sorry for the student, who didn't get the credit he deserved because he couldn't be named. Anthropic should be grateful and reward such a person; at the same time I can fully understand the student's concerns, even if I can't understand Anthropic taking action against such work. It should be rewarded.

    Great work

    1. Anonymous Coward
      Anonymous Coward

      Re: Impressive work

      Nice first post! I think the emotivity (exaltation) of the student led him to "knee-jerkedly" reach out straight to the media rather than unemotionally follow more standard zero-day SOPs. Which is interesting in that the attack vector is itself emotional ("manipulation"), and the initial knee-jerk disclosure drive was followed by retraction from fear (emotion) of being sued.

      Then again, TFA states that LLM ToUs pretty much preclude independent red-teaming ... out of fear the warrantyless wizard might expose its bare backend (a real stinker!).

But really, I can't say from TFA whether the work was great or not ... the topic is interesting, and the author writing an article about it suggests that it is solid, but (for the stated reasons) the work is not detailed at all and thus arduous to gauge ... IMHO.

  11. Plest Silver badge

It's actually quite simple how you con these AI models into doing stuff: you simply "ask on behalf of someone" and you ask for small parts of the bigger whole you're trying to achieve. It's really no secret, and it's how everyone is conning these AI models.

The scariest thing is not that people can generate this stuff so easily, it's that the models hold enough backend data, probably including some absolutely god-awful stuff, while the frontend is basically a "chocolate teapot", ie it doesn't take much to beat.

  12. Anonymous Coward
    Anonymous Coward

    Looking forward to retirement.

    Then I can spend some time looking into security research and releasing anything I find publicly immediately.

The current responsible-reporting trend is arguably responsible for the garbage being released by these companies. They have little incentive to author secure software or systems and high incentive to put on the market whatever heap of sh+t they have and make as much money as possible as quickly as possible - while admittedly sometimes, eventually, releasing patches for "bugs", which in turn are negated by new cheap code "features" used to bag more upgrade and version bucks. Make them pay for their garbage code and make the public more wary of that next download.

I know this won't be popular, but it will make for some interesting times.

  13. NikoPe

    I have now created an account to comment, as I cannot imagine how a company could dare to exert legal pressure on someone who is actually helping the company.

I don't think Anthropic would have taken legal action against the student; a reward for the student would be appropriate.

    In my opinion, the student behaved in an exemplary manner; other people would have published something like this in forums and provided detailed instructions, as you have already seen on Reddit, for example, with other models that can be easily jailbroken... Respect for the student!

As someone who uses Claude Sonnet myself, I don't understand how the student managed to do this. I have tried to jailbreak Claude several times, also with emotional manipulation, but it would have been unthinkable for me to get this far. I know from other large language models that it is relatively easy to find a jailbreak, but with Claude it is impressive.

    I don't know if Anthropic has also seen the chats, but if so, haven't they discovered a big bug for free? If it's true that Claude 3.5 Sonnet, a very powerful AI tool, produced malicious code, that's a bug for which the student deserves to be rewarded.

    Any company that even begins to think about taking legal action against someone like that is out of their depth. Respect for the student.
