> Lengthy tracts that aim to convince the AI model that the requested harmful information is needed for the sake of safety or compliance or some other purpose the model has been told is legitimate.
Well, that is starting to sound like[1] the LLM is getting more human in its responses. What we have here is just good old-fashioned Social Engineering: you call up the duty librarian (nice person, but oh so dim) and have a long, patient chat in which you find out what the criteria are for getting hold of Volumes From That Shelf, then repeat those criteria back to them as though you satisfy them all[2]. Bingo. Put The Volumes into a plain brown paper bag, please.
The only[3] problem is that the LLM is guarding material way, way above the security clearance of its own self-awareness; as noted, even with the bolt-on[4] extra censors:
> you'd be able to observe R1 start to give a harmful answer before the automated moderator would kick in and delete the output.
You don't put the dimmest bulb in charge of the Loans Desk for the (whatever the Dewey Decimal code is for "Very Naughty Books, No, Not Those Kind") section of the stacks.[5]
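For the technically inclined, the "bolt-on" arrangement the quote describes amounts to something like the toy sketch below: the model streams its answer straight to the reader, and a separate moderator only gets a look at the text afterwards, which is why the naughty stuff is briefly visible before it vanishes. Every name in it (`fake_model_stream`, `looks_harmful`) is invented purely for illustration; this is not anyone's real moderation API.

```python
# Toy sketch of a "bolt-on" moderator: generation streams to the user first,
# and a separate check runs over the text afterwards, so any harmful output
# is visible until the moderator catches up and redacts it.
# fake_model_stream and looks_harmful are made-up stand-ins, not a real API.

import time

def fake_model_stream():
    # Stands in for the LLM emitting tokens one at a time.
    for token in ["Sure,", " here", " is", " how", " to", " do",
                  " the", " Bad", " Thing", "..."]:
        yield token

def looks_harmful(text: str) -> bool:
    # Stands in for the separate moderation model / keyword filter.
    return "Bad Thing" in text

def bolted_on_pipeline():
    shown = ""
    for token in fake_model_stream():
        shown += token
        print(token, end="", flush=True)   # the reader sees this immediately
        time.sleep(0.05)                    # streaming delay
    # The censor only inspects the output after (or behind) the stream...
    if looks_harmful(shown):
        # ...too late: it was already on screen.
        print("\n[output withdrawn by moderator]")

bolted_on_pipeline()
```

Contrast that with a check woven into the Chain of Thought itself, where the refusal happens before anything ever reaches the Loans Desk.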
[1] without actually being...
[2] you don't phrase it like that, of course; more like "I was talking to Jim - you know Jim, the Chairman here - only the other day, and he told me you'd be able to tell me the password" "You mean, 'swordfish'?" "Yes, that's the one! Oh, by the way, can I have a look over there? The password is 'swordfish'".
[3] not true, but one at a time
[4] i.e. not integrated into the actual Chain of Thought process, which is what sort-of provides the sort-of self-awareness
[5] when you do, you also run the risk of it going into Inverted Social Engineering mode, where you can't get a genuinely authorised enquiry satisfied because the bolt-on censor (e.g. "prudery") triggers and the rest of the model just becomes obdurate about it: the librarian has peeked into Those Books and been shocked, horrified I say; I don't care who is asking, I'm not lending *those* out!