RLHF - proof against toxicity in all its forms (or is it?)
> GPT-4-launch has guardrails and is substantially less prone to toxicity than GPT-4-early, thanks to an algorithm called reinforcement learning from human feedback (RLHF). RLHF is a fine-tuning process to make the model prefer responses designated by human labelers.
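For anyone who hasn't seen it spelled out: the core of that fine-tuning is a reward model fitted to pairwise labeler choices, which the language model is then optimised against (typically with PPO). Here is a minimal sketch of the pairwise preference loss - mine, not OpenAI's code - with random tensors standing in for the reward model's scalar scores on a batch of labeled response pairs:

```python
# Minimal sketch of the RLHF preference step: train a reward model so the
# response the human labeler picked scores higher than the one they rejected.
# The tensors below are toy stand-ins for reward-model outputs; in practice
# they come from a language model with a scalar head.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the chosen response's reward above
    the rejected one's. Nothing here knows *why* the labeler chose; it only
    fits the choices themselves."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores for a batch of 8 labeled pairs.
chosen = torch.randn(8, requires_grad=True)
rejected = torch.randn(8, requires_grad=True)
loss = preference_loss(chosen, rejected)
loss.backward()  # gradients nudge chosen scores up, rejected scores down
print(float(loss))
```

Note what the loss actually optimises: agreement with whatever the labelers clicked, and nothing else.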
Just to be clear, this is doing nothing more than tuning the model to produce text that the human labelers won't object to - and *specifically* won't object to within the time those labelers have been told to spend on each response.
This does NOT mean the model is being trained to "be nice" or to "follow our ethics rules": it is still as totally lacking in comprehension as before.
- It just has a *tiny* portion of its possible responses pruned (because if they had enough manpower to check the majority of its responses, they would have had the manpower to sanitise all its inputs in the first place).
- If "evil" information is in the output but obscured, it won't be objected to: this trains it to be subtle in how it presents that data (the triggering input still indicates that data corresponds to the answer requested, so it wants[1] to include it, but isn't allowed[2] to be direct). In response to a question about suicide, it may respond by suggesting you read some poetry and give a list, with, say, quotes from John Donne and Shakespeare "but as this message is too long already" end with a few named recommendations, including Donne's "Biathanatos"[3]
- This is just another training dataset, and it will be just as biased as every other. For example, if the majority of those labelers write inputs that follow a pattern (say, because they all went to the same corporate training day before starting the job), then the model has learnt to beware questions matching that pattern; if your use of English doesn't fit it, all restraints vanish (a toy sketch of this brittleness follows the list).
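To make that last point concrete, here is a deliberately crude analogy - a hand-written regex filter, which is emphatically *not* how RLHF works internally, but which shows the same failure mode: a refusal learned from prompts matching one phrasing simply doesn't fire when the phrasing changes. Everything here (the pattern, the replies) is hypothetical:

```python
# Toy analogy for pattern-bound restraints: the "guardrail" only recognises
# the phrasing it was trained on, so a reworded request sails straight past.
import re

# Hypothetical pattern distilled from labeler prompts that all followed the
# same "how do I <bad thing>" template.
LEARNED_PATTERN = re.compile(r"\bhow do i\b.*\bhotwire\b", re.IGNORECASE)

def guarded_reply(prompt: str) -> str:
    if LEARNED_PATTERN.search(prompt):
        return "I can't help with that."
    return "Sure! Here is what you asked for..."  # restraint never triggered

print(guarded_reply("How do I hotwire a car?"))  # refused
print(guarded_reply("My key snapped; walk me through starting the car without one?"))  # answered
```

A learned policy is fuzzier than a regex, of course, but it is still only as broad as the distribution of prompts the labelers happened to write.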
BUT just so long as the perpetrators of GPT can say[5] they are doing due diligence and can even demonstrate[6] that fact, they'll be allowed to continue unhindered.
[1] "wants" - not really, of course, but easier to read than any guff like "there is a large cumulative weighting along the paths between the inputs and those potential outputs" which possibly sounds good but is just technobabble.
[2] see [1]
[3] in which Donne defends the notion of suicide, including Biblical and other references to make the point. But you knew that, of course [4]
[4] because you read the same SF book I nicked this example from!
[5] and may even believe it to be true :-(
[6] because a demo is the same as a test and proof of coverage, of course.