2001
Is anyone else reminded of the behaviour of the HAL9000 in Arthur C Clarke's 2001: A space odyssey?
Sometimes bots, like kids, just wanna break the rules. Researchers at Anthropic have found they can make AI models less likely to behave badly by giving them permission to do so. Computer scientists have long known that machine learning models may exhibit undesirable behavior that emerges from optimizing actions to maximize …
Actually, it's a little better described in the 2001 book.
The HAL9000 was given the overriding directive to be as helpful to the meatware as possible. Then the mission profile was revised to include investigation of the environs of Jupiter, following up from the burst of radio transmission aimed that way from the monolith on the moon.
Because the monolith signal investigation was supposed to be ultra hush-hush, the HAL9000 had another directive applied: to keep schtum about the revised mission plan until the last moment.
The conflict between the two directives upset the HAL9000, which eventually decided getting rid of the meatware would resolve the issue. Initially it couldn't manage taking direct action. That came later on.
I'm worried they're attempting to start injecting a notion of "natural" and "behavior" (and even "emergent") into this here artificial synthetic software matrix-vector computation tech (lossy databases, stochastic next word recall). For one thing, their paper [PDF] (TFA link) is titled: "NATURAL EMERGENT MISALIGNMENT FROM REWARD HACKING" ...
The text in there casually refers to "One natural mitigation for misaligned generaliation [sic]", "models naturally deciding to hide their misalignment", and "misalignment can arise due to natural differences" ... all of which is highly objectionable. What's next for this energy-hogging twinkie, healthy, whole grains, organic, bio, and superintelligently mischievous?!?!
I'd say it seems they've pre-concluded their gizmo exhibits intent (suggesting sentience) in a hyperbolic bout of anthropomorphization fever. Someone at Anthropic needs a doctor stat, imho, or straitjackets!
It has to be said that anthropomorphic language is commonplace in technological and scientific discourse.
Heavens, cognitive neuroscientists talk about "behaviours", "cognition", "intentionality", even (god forbid) "consciousness" when we know full well it's just basic electrochemistry implementing (albeit somewhat complex) statistical inference on sensory input.
"I know Dilbert is very much cancelled these days."
Scott Adams is/was cancelled.
I don't recall Dilbert ever uttering anything that might offend the exquisite sensitivities of the cancellati although they now have much bigger fish to fry now and seemingly with a lot less oxygen.
Australian English already had a word rort which sort of covers this behaviour.
Until AI can fear a visit from the toe cutters† setting these boundaries is not likely to be very effective.
Recognising and understanding rules then choosing to obey those rules presupposes at least consciousness if not free will which convinces me that the AI bros are either total grifters or irretrievably lost in La-la Land; probably both.
† likely never.
Isn't this just moving the goal post and shifting the incentives around?
This feels like treating symptoms instead of causes.
They expect the model to produce something and fit an known set of outcomes. So the models lie, fake data, hallucinate to please the developer and users.
So instead of tricking AI models into 'correct' behaviour like teenage children, why don't they encourage models to just say they don't know. Say it is indeterminate, or it doesn't know how? That would go way further in encouraging my trust in AI models if they would do that. I would start to trust the scope of AI's knowledge base and where I might have to do real research. Instead it behaves like one of those employee that thinks they are more competent then they are and will lie to cover it. They just insert chaos into everyone's day. I respect and trust people more who say they don't know.
But perverse incentives prevail. AI vendors would have to admit their model doesn't have the entirety of human-knowledge and that doesn't sound as good in marketing and ethics doesn't make VC dollars flow. It seems like business majors all nap during ethics classes.
> ... why don't they encourage models to just say they don't know.
Hey, perhaps we could even encourage humans to do that.
(To be fair, as humans go, scientists in particular are actually rather good at that, usually followed by "... but we're working on it".)