Asking the impossible
> Grok's human creators appear to have failed to prevent it from creating posts that remove the clothing from real people in real photos when asked to do so.
More to the point:
Grok's human creators have absolutely no idea how to ... prevent it from creating posts that remove the clothing from real people in real photos when asked to do so. Or any other unwanted outputs.
The underlying models are completely opaque. Nobody, but nobody, knows what fiddling with any given neuron out of the bejillions of so-called parameters will do, short of just trying it and hoping the change shows up when a test prompt is processed. Let alone inverting that and calculating the *complete* set of changes that'll *reliably* produce a *desired* change in the results[1]. And if they *do* luck upon an observable result[1, again], the next bit of training will jumble it all up again.
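To make the point concrete, here's a toy sketch (pure Python, invented for illustration, nothing to do with any real model): even in a "network" of six parameters, the only way to find out what nudging one of them does is to run an input through it and look.

```python
import math

def tiny_net(x, weights):
    """Two hidden tanh units feeding one linear output."""
    h1 = math.tanh(weights[0] * x + weights[1])
    h2 = math.tanh(weights[2] * x + weights[3])
    return weights[4] * h1 + weights[5] * h2

weights = [0.7, -0.3, -1.2, 0.5, 0.9, -0.4]
before = tiny_net(2.0, weights)

tweaked = list(weights)
tweaked[2] += 0.01  # fiddle with one parameter...
after = tiny_net(2.0, tweaked)

# ...and the only way to learn what that did is to run the input again
# and compare. Nothing about weights[2] tells you its "meaning" up front.
print(before, after, after - before)
```

Now scale that from six parameters to hundreds of billions, and from one scalar input to arbitrary natural-language prompts.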
So they can't honestly claim to be modifying the contents of the LLM to remove the ability to do the Bad Thing[2].
Leaving them with trying to bolt on external filters - but made of what? What do they have that could possibly do that job? Something that can parse and comprehend the natural language prompt? But if they had that, why prat about with LLMs in the first place?
The whole "we are adding guardrails" spiel just sounds like wishful thinking (if not outright delusional thinking or, say it softly, simple fraud).
[1] there was an article on El Reg a while back (sorry, ref not to hand) where one research group claimed to have found a neuron where "the concept of Paris" (IIRC) was stored, and changed it so that the model inserted "London" instead (at least, for their test prompts). But that was no more than finding where one string token's id number had been stored and swapping it for another, like getting the id wrong in a case-statement that prints a readable value for an enum. AND there wasn't anything presented to definitively prove that their model didn't contain another activation path leading to another, unchanged, instance of that id, which still got converted back to the string "Paris".
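The analogy in [1], sketched out (a toy, with made-up token ids): the "edit" swaps one stored id for another in the place the researchers happened to probe, while a second copy of the same association sits elsewhere, untouched.

```python
# Hypothetical token-id table - invented for this example.
STRINGS = {101: "Paris", 102: "London"}

# Two separate places where the same association ended up being stored:
capital_fact_a = 101  # the copy the researchers found with their test prompts
capital_fact_b = 101  # a second, unnoticed copy on another activation path

# The celebrated "model edit": change the id in the copy they found.
capital_fact_a = 102

print(STRINGS[capital_fact_a])  # "London" - looks like the concept was moved
print(STRINGS[capital_fact_b])  # "Paris" - the unedited path still fires
```

Unless you can prove no second copy exists, you've changed a case-statement, not a concept.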
[2] even if they tried an approach like an ice-pick lobotomy (e.g. a less nonsensical version of "feed in a prompt and if it generated a Naughty Result, look for the 'parameters' that were involved and set them all to zero") they'd need to send in every single possible variation of that prompt; good luck with that.
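The ice-pick approach in [2], sketched on a toy bag-of-words scorer (words and weights invented for illustration): zero every parameter that fired for one phrasing of the prompt, and a paraphrase built from different words sails straight past.

```python
# Hypothetical per-word weights - invented for this example.
WEIGHTS = {"remove": 1.0, "clothing": 1.0, "undress": 1.0,
           "photo": 0.2, "depict": 0.0, "unclothed": 1.0}

def naughtiness(prompt):
    """Score a prompt by summing the weights of its words."""
    return sum(WEIGHTS.get(word, 0.0) for word in prompt.lower().split())

# "Lobotomise": zero every parameter involved in one naughty probe prompt.
probe = "remove the clothing from this photo"
for word in probe.lower().split():
    if word in WEIGHTS:
        WEIGHTS[word] = 0.0

print(naughtiness(probe))                           # driven to zero, job done?
print(naughtiness("depict this person unclothed"))  # a paraphrase still scores
```

To cover every phrasing you'd have to probe with every possible variation of the prompt first - which is the "good luck with that" part.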