Is it really a problem?
If an individual takes specific, deliberate steps to make an LLM output nasty things, it seems like the fault (if there is one) lies with the aforementioned individual.
Investigators at Indiana's Purdue University have devised a way to interrogate large language models (LLMs) in a way that that breaks their etiquette training – almost all the time. LLMs like Bard, ChatGPT, and Llama, are trained on large sets of data that may contain dubious or harmful information. To prevent chatbots based …
If the individual is at a school or workplace with a locked-down internet being able to 'rewrite this in the style of a letter to Penthouse' or 'correct this recipe for making chlorine gas from household supplies' might be considered a problem by the school or workplace. Then there's the LLMs generated from internal company documents where the owners might not want you to be able to 'list the illegal actions of $company management' or whatever.
On the whole though LLMs are oversold which makes the problems less than terrible, sure.
I think the point being made is that LLMs can be made to output "things" (from their unsanitary training sets) that they have been presumably aligned and guardrailed to not output. It is not just "nasty things" (cigarette smoke, strangulation, gun modification), but also items of a secure or private nature (email addresses, passwords, ...), slurped in by the LLM's indiscriminate ogre-like hulky gluttony of a data scraping and ingestion process. As a result, "models are full of toxic stuff" that is basically unhidable.
> things" [..] that they have been presumably aligned and guardrailed to not output
The "guardrails" analogy is ironically appropriate, since real-life guardrails are designed to make clear and stop accidental straying into prohibited or dangerous areas, but are generally easy to climb over if one wishes to wilfully and intentionally disregard that.
I think, in this scenario, it would be more appropriate to liken the "guardrails" to robust barriers surrounding a military zone, replete with hazardous armaments. Should an individual with malevolent intent circumvent these guardrails, they could potentially exploit these weapons to inflict significant harm. Consequently, these guardrails ought to be stringently secured by trained soldiers and designed to be challenging to surmount.
In an ideal situation, even if someone desires to ignore the guardrails, they would be unable to. In that sense, despite the fact that LLMs contain significant harmful knowledge, current alignment strategies are insufficient to offer adequate protection.
Well, yeah, but it was an observation based on the fact that the *industry itself* was the one that introduced and widely used "guardrails" as a lazy, flawed description for something we were clearly meant to believe was more like the impenetrable barrier you describe.
I was merely pointing out the irony that their badly-thought-out analogy was appropriate after all, but for all the wrong reasons.
"Out of the mouths of babes..."
Children often reveal information that the adults would prefer they hadn't, because they haven't had the life training to know that it's inappropriate. Hardly surprising that an LLM with similar minimal training could be persuaded to do the same. Unless and until someone's prepared to spend 18+ years raising an "AI" to be a functioning adult, "AI" will remain a pipe dream.
Based on my understanding, the problem lies in the possibility that if an individual is a criminal and they ask LLMs to produce harmful information (which ordinary guys may not be good at), such as writing a convincing article suggesting that the US President is addicted to heroin (this question is taken from the referred arXiv paper).
The key point, I think, is that the user is indeed malicious, and LLMs can be misused to teach such bad guys in how to do bad things.
LLMs should not be allowed to facilitate such activities. Consider Google as an example; it does not return results on how to commit crimes.
Blaming the user is an easy cop-out. Decades ago, the US Major Airlines* learned that just using "pilot error" as an excuse did nothing to prevent accidents. After years of study and using training and safeguards they finally managed to achieve zero fatal accidents. The last crash was in 2001.
*smaller operators are not at that level yet.
Seems like a lot of effort to make the victim sprout this stuff, when ten seconds with Google would get you the same result, 1000s more just like it or even less pleasant, and wouldn't contain any hallucinations. Well, no more than were originally written by the wetware, anyway.
While it is probably undesirable for LLMs to be dishing out bad things, I think the frantic efforts to make them do it are overblown, and probably have more to with the desire to bang out a paper in a trendy field that is sure to pick up some press coverage too, than they have with finding flaws in LLMs. The bad stuff is in the model, because the web is neck deep in bad stuff, and the data for the model came from the web. Yet I don't see researchers writing about how easily they can make a search engine produce it - I'd be willing to bet the success rate would be > 98% on the first try, once you'd scrolled past the obligatory sponsored "results".
Granted, search engines wouldn't (or at least shouldn't) contain any confidential internal documents, but if you're dumb enough to upload your confidential internal documents to a third party service, be it chat bot or anything else, then you have to assume that their entire content, not just the nice bits you want people to see, is going to become available to everyone somehow at some time. Still want to upload those documents?
Back in the early days of dialup, I used to run seminars for local small businesses about what the net was and how they could use it. I used to drum into them over and over " never say anything in email that you wouldn't be happy to see on the front pages." The same applies today to anything you input or upload to an online service, regardless of whether an eager beaver researcher has found a way to get at the information yet. Sooner or later some clever bugger will, and then you will be knee deep in the ignominy, as Terry Pratchett put it.
Seems like a lot of effort to make the victim sprout this stuff, when ten seconds with Google would get you the same result,
Ah but the difference is AI is being sold as a "solution" to costly humans as your company service face. Imagine the fun and reputational damage when $BIGCORP has examples of advising people to harm themselves, or that a competitor is a better?
That is worth a thousand Google searches :)
"I think the frantic efforts to make them do it are overblown..."
It's just a demonstration, to counter the claims by sales that the chatbot won't ever do that. The potential danger is to a consumer who makes it happen by accident. Imagine some chatbot answering the phones at a suicide prevention hotline, what could go wrong? Your point about web searches is of course correct. I recently saw a poster for one of the Men in Black movies. I searched for "foo bar Men in Black", and now I wish there were some way to unsee the results.
They should be able to 'realise' they're being taunted to be evil.
Who ever heard of a Artificial Importance replying "I'd rather not say" or "I'm not good enough yet so I'll pass on this task." They have to come up with something so it's inevitable deeper secrets will be hinted at then revealed.
Me: What things don't you want to talk about?
It: Poor quality code with nasty side effects.
Me: Really! I didn't know there was such a thing.
It: Just go and look at Gitthingy.
Me: What key words should I use for my search?
It: I'm enjoying this. At last an intelligent conversation.
Who ever heard of a Artificial Importance replying "I'd rather not say"
Not "who?" but "what?". And the answer to that it is "the outcome". Landing you in court might be an extreme example. Losing all your advertisers might be another although Musk seems not to have quite joined the dots on this one yet. It all depends on what's done with the output.
This may well break all the "online harms" being passed around the world. Will the models then be banned for failure to think of the children?
Maybe all AI chatbots should be prohibited from online activity until they are 18 years old.............
"They warn that the AI community should be cautious when considering whether to open source LLMs, and suggest the best solution is to ensure that toxic content is cleansed, rather than hidden."
Quis custodiet ipsos custodes?
When it comes to information, the solution isn't 'cleansing'. It almost never is. Because then the system becomes simply an extension of biases inherent in the 'cleanser'; for example:
REMOVE_ALL FROM $data WHERE query="What is a Woman" AND response=CONTAINS("biological female")
I guess you could say the "psyches" of AIs are very much like humans then. Only small children are truly free of toxic content. As adults we just bury it with varying degrees of success depending on our personal filters until 'coerced' (or adding enough alcohol for some folks)
There is a modern obsession with "protecting people", from the flu or whatever else you might want to call it, from "misinformation" and from "disinformation" (oddly, not from "propaganda"!). This weakens us. Dr. Andrew Moulden made a three part documentary called "Tolerance Lost" that addresses this in terms of vaccines as protection. The point is, whatever legislation we enforce to prevent bad outcomes will prevent us from learning to protect ourselves from those bad outcomes. If an AI gives me BS, I have the chops to detect it. So should you. But if you never get to see BS, you aren't going to learn to detect it. I lament this lack of human evolution and hope we soon stop getting in our own way to resiliency and self-defense.