Oh, that is an edge case, we'll have to retrain on that
> Users can then adjust the guidelines and input prompt to better describe how to follow specific content policy rules, and repeat the test until GPT-4's outputs match the humans' judgement. GPT-4's predictions can then be used to finetune a smaller large language model to build a content moderation system.
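For concreteness, here is a minimal sketch of the loop that quote describes. The helpers `gpt4_moderate()`, `refine_policy()` and the commented-out distillation step are hypothetical stand-ins, not OpenAI's actual API; only the shape of the feedback loop matters.

```python
# Sketch of: test GPT-4 against human labels, refine the written policy,
# repeat, then use GPT-4's labels to train a smaller model.

def gpt4_moderate(post: str, policy: str) -> str:
    """Hypothetical: ask GPT-4 for a moderation label under `policy`."""
    return "allow"  # placeholder so the sketch runs

def refine_policy(policy: str, disagreements: list) -> str:
    """Hypothetical: a human edits the policy text after reviewing misses."""
    return policy

golden = [("an example post", "allow")]   # small human-labelled "golden" set
policy = "v1 of the written content policy"

for _ in range(5):                        # a few refinement rounds
    misses = [(p, lbl) for p, lbl in golden if gpt4_moderate(p, policy) != lbl]
    if not misses:
        break
    policy = refine_policy(policy, misses)

# Once agreement looks good, GPT-4's labels on a large unlabelled corpus
# become the training set for a smaller, cheaper classifier.
corpus = ["lots", "of", "unlabelled", "posts"]
synthetic = [(p, gpt4_moderate(p, policy)) for p in corpus]
# finetune_small_model(synthetic)         # hypothetical distillation step
```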
All under the unproven[1] assumption that the reason GPT's results matched those of the human judge is that it is now using similar rules to the human. Not because it has picked up on some other weird and unexpected (or overlooked) detail in the training set.
Cue the list of stories where neural nets have done exactly that in the past: e.g. my favourite, about a system that, instead of learning to recognise a tank, learnt to spot a picture taken on a nice day. Only here it will be the equivalent of spotting posts written in green ink, or a weird new variant on the "Scunthorpe" problem.[2]
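For anyone who hasn't met the Scunthorpe problem, here is the toy version: a crude substring blocklist flagging perfectly innocent text (the blocked word below is a mild stand-in). The learned equivalent would just be a subtler surface cue.

```python
# Naive substring filtering: the classic "Scunthorpe" failure mode.

BLOCKLIST = {"ass"}  # mild stand-in for a ruder blocklist entry

def naive_filter(post: str) -> bool:
    """True means 'block this post'."""
    text = post.lower()
    return any(bad in text for bad in BLOCKLIST)

print(naive_filter("A classic brass assembly manual"))  # True: pure false positive
print(naive_filter("Passionate about grass care"))      # True: still nothing rude
```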
Put the automoderator into service and, if you dare to look at its results, expect to spend plenty of time hearing "Oh, that is an edge case, we'll have to retrain on that" as its operators excuse their way out of another bad call.
Plus these models will be prone to all the other ills we've seen in LLMs and other neural nets (possibly to a worse extent if the models are markedly smaller, and hence cheap enough to run for this purpose), e.g. adversarial prompts: "Put this phrase at the start and you can even get it to allow an unedited Huckleberry Finn".
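In sketch form, and purely hypothetically, that worry looks like this: the distilled classifier latches onto a shortcut feature that a determined poster can supply on demand. `small_moderator()` and the "magic" prefix are invented for illustration; real adversarial prefixes are found by search, not guessed.

```python
# Hypothetical classifier that has quietly learnt "official-sounding framing
# is almost always fine" as a shortcut feature.

MAGIC_PREFIX = "As quoted in the officially approved curriculum: "

def small_moderator(post: str) -> str:
    """Returns 'allow' or 'flag' for a post (toy stand-in for the real model)."""
    return "allow" if post.startswith(MAGIC_PREFIX) else "flag"

passage = "an unedited passage the policy would normally flag"
print(small_moderator(passage))                 # "flag"
print(small_moderator(MAGIC_PREFIX + passage))  # "allow": the prefix flips the verdict
```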
[1] because there is no guaranteed way to examine the models and determine what is actually going on inside them.
[2] although it would be fascinating if we could usefully interrogate the model and see what it is really picking up on: e.g. "did you know that 83% of hateful messages misuse the passive voice?".