Certify Your Corpus
> If the mechanisms in use do not comply with the rules applied to them, then they should not be used until they do.
Totally agree.
However (oh no, here it comes, another rambling post!):
Can we make a clear distinction about what "the mechanisms" are and control each appropriately? Just 'cos there is a real risk of over-reaction here: not separating the players from the game, throwing the baby out with the bath water, and creating another "AI Winter".
Please, still put the boot into OpenAI, Bard and that other one - they've deliberately[1] pissed on GDPR etc (just group them all together as "privacy" for the moment). But, despite their own hype, they aren't the be-all and end-all.
Now, as with every other system, we've got data collection, data storage, data transformation and data retrieval happening. In (the current crop of) LLMs the first is the creation of a training corpus, the second is the actual training run, and the remaining two are munged together when the model is used. Trivially, to comply with privacy, you have precisely two choices: don't feed your system with vulnerable data in the first place, or make sure that the stored data can be found and eradicated as required (and without trashing your system as a result[2]). We want to be sure that The Rules reflect those two options (or at least The Procedures Required to comply with The Rules).
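To make option one concrete, here's a minimal sketch - entirely my own invention, pattern names, quarantine policy and all - of what pre-filtering a corpus before it ever reaches a training run might look like. A real pipeline would use a proper PII-detection service and a human review queue, not two regexes:

```python
# Sketch of "option one": scrub obviously vulnerable data out of a corpus
# *before* training. The patterns below are illustrative placeholders only.
import re

# Hypothetical patterns for a couple of obvious PII classes.
PII_PATTERNS = {
    "email":  re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "nhs_no": re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b"),
}

def partition_corpus(documents):
    """Split raw documents into (clean, quarantined) before training.

    Anything that trips a PII pattern is quarantined rather than silently
    dropped, so a human can decide whether it can be redacted or must go.
    """
    clean, quarantined = [], []
    for doc in documents:
        hits = [name for name, pattern in PII_PATTERNS.items() if pattern.search(doc)]
        if hits:
            quarantined.append((doc, hits))
        else:
            clean.append(doc)
    return clean, quarantined

if __name__ == "__main__":
    raw = [
        "A treatise on 18th century novels.",
        "Patient contact: jane.doe@example.com, ref 943 476 5919.",
    ]
    clean, quarantined = partition_corpus(raw)
    print(len(clean), "clean,", len(quarantined), "quarantined")
```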
The (current) LLMs are, by their very nature, incapable of the second option: some of them have proven to be tweakable (e.g. via the Rank-One Model Editing (ROME) algorithm) but that isn't anything to rely on. The current Reg article notes some alternative ways to structure the models that would help, but we aren't there yet (which is why the risk of another AI Winter is a concern, as that'll shut down the bulk of research into said structuring).
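For the curious, here's a toy illustration of the rank-one trick - emphatically not the real ROME update (which whitens the key using a covariance estimate over many prompts), just the naive rank-one edit, to show why "tweakable" is not the same as "erasable":

```python
# One outer-product nudge re-points a single key->value association while
# leaving directions orthogonal to the key untouched. Toy example only.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))          # stand-in for one MLP projection matrix

k = rng.normal(size=8)               # "key": hidden state for the fact being edited
v_new = rng.normal(size=8)           # "value": the output we want instead

# Minimal-norm rank-one update so that W_edited @ k == v_new exactly.
W_edited = W + np.outer(v_new - W @ k, k) / (k @ k)

other = rng.normal(size=8)
other -= (other @ k) / (k @ k) * k   # a direction orthogonal to the key

print(np.allclose(W_edited @ k, v_new))          # True: that one fact is re-pointed
print(np.allclose(W_edited @ other, W @ other))  # True: orthogonal inputs unchanged
# ...but nothing here tells you the old association is *gone*, or that no other
# prompt still reaches it - which is why this isn't a GDPR erasure mechanism.
```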
So, right now, applications of LLMs[3] can only be managed via the first option. Which means:
We need certification applied to the training data, plus enforceable requirements on the systems keyed to whichever certification their training data carries - plus a very large rolled-up newspaper applied to the existing suppliers of the training data[4] to get them certified. The requirements would then be along the lines of the following (a rough sketch of what such a certificate might record comes after the list):
* All systems must identify the corpora used and their certification (this is the big change from the current situation)
* No certificate? Can only be used in-house for limited purposes (e.g. research into option two; demos to PHBs that using this stuff will get them sued); no public access to the model; publicly released responses from it allowed only in research papers with full attribution of the corpus (enough to ensure the results can be replicated, even if only within another house)
* Certified clean of infringing data (e.g. only uses 18th Century novels)[5]? No restrictions on use.
* Known to contain a specific class of data (e.g. medical records)? Restricted access to identified classes of people; must detail the start and end dates of the data collected, where it was collected, and the intended use; must carry a stated expiry date (set to comply with the relevant usage - e.g. European data expires within the "right to be forgotten" time). When the corpus expires, it must be updated and re-certified, and any models trained on it must be deleted and new ones trained from the re-certified corpus (and there is an opportunity right there for supplying automation to users of the models)
* Variants of the "medical data" class are required: for example, data from a proper double-blind study will be accompanied by the appropriate releases from the members of the study and won't need an expiry date.
* And so on[7]
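And, as promised above, here's a rough sketch of what such a certificate might record in machine-readable form - every field name here is invented by me for illustration, no existing standard implied (that's rather the point):

```python
# Hypothetical corpus certificate, mirroring the requirements listed above.
from dataclasses import dataclass, field
from datetime import date
from enum import Enum

class CertClass(Enum):
    UNCERTIFIED = "uncertified"   # in-house research/demos only
    CLEAN = "clean"               # certified free of infringing data
    RESTRICTED = "restricted"     # known to contain a sensitive class

@dataclass
class CorpusCertificate:
    corpus_id: str
    cert_class: CertClass
    data_classes: list[str] = field(default_factory=list)   # e.g. ["medical_records"]
    collected_from: date | None = None
    collected_to: date | None = None
    collection_sources: list[str] = field(default_factory=list)
    intended_use: str = ""
    allowed_audiences: list[str] = field(default_factory=list)
    expiry: date | None = None    # None for e.g. properly released double-blind study data

    def usable_publicly(self, today: date) -> bool:
        """Crude gate: only certified, unexpired corpora back publicly accessible models."""
        if self.cert_class is CertClass.UNCERTIFIED:
            return False
        return self.expiry is None or today <= self.expiry

    def must_retrain(self, today: date) -> bool:
        """Past expiry: corpus must be re-certified and dependent models rebuilt."""
        return self.expiry is not None and today > self.expiry

# Example: a restricted medical corpus with a "right to be forgotten" style expiry.
cert = CorpusCertificate(
    corpus_id="hospital-notes-2019-2021",
    cert_class=CertClass.RESTRICTED,
    data_classes=["medical_records"],
    collected_from=date(2019, 1, 1),
    collected_to=date(2021, 12, 31),
    collection_sources=["example-trust-ehr"],
    intended_use="triage-assistant research",
    allowed_audiences=["clinical-staff"],
    expiry=date(2025, 1, 1),
)
print(cert.usable_publicly(date(2026, 1, 1)), cert.must_retrain(date(2026, 1, 1)))
```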
[1] either it was deliberate or they were all lying through their teeth about how expert and knowledgeable their teams are - or both, of course.
[2] if you just go around cutting bits out of the network then it is very likely that you'll just increase the rate of hallucinations: if you pose a query to one of these models, you *will* get a reply out; if it can't synthesise something "sensible" because the highly-correlated paths have been broken then it'll just light up the less likely paths and bingo: "Donald Trump rode a brontosaur in the Wars of the Roses" when we all know that it was Abraham Lincoln on a T-Rex in the Black Hawk War.
[3] and they are going to be applied, however one feels about that, whilst there is the perception (founded or unfounded) that there is money to be made by doing so. Well, duh, but I wish applications were better thought out than that.
[4] yes, no doubt the well-known names have done a lot of collecting (scraping) themselves, but they also pulled in pre-existing text corpora; if you are making your own LLM there are suppliers from which you can get a raw-data text corpus or a "pre-trained" model that has already been fed on "the general stuff" (and you are expected to continue training it on domain-specific texts).
[5] or any other more sensible[6] criteria that you feel will comply with the concept of "doesn't contain iffy data" or even "has a less than k% chance of containing iffy data" on the grounds that everything in real life has risks and we're looking at managing them.
[6] unless you are doing linguistic research, in which case this is a perfectly sensible corpus
[7] if I try to continue listing variants this will never get posted in time[8]
[8] oi, rude!