I've really got my doubts
So we have this situation where the LLMs start spouting lies. Now, why is that?
Everybody on the LLM bandwagon will cry "It's the training data set". But is that really true? I seriously do not think so.
These models are just statistical guessing machines. Whatever word is most likely to come up next is what they output. They have no concept of truth, of facts, nor of the meaning of their own output. They have no concept of meaning. They have no concepts at all.
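If you want to see the shape of what I mean, here's a toy next-word guesser in Python. Everything in it is invented for illustration (the word table comes from no real model or dataset); the only point is that the loop emits whatever is statistically likely, and nowhere does it ever ask whether the result is true.

```python
import random

# A toy "guessing machine". All it knows is how often each word tends to
# follow the previous one. (These probabilities are made up purely for
# illustration -- they come from no real model or training set.)
next_word_probs = {
    "the":  {"moon": 0.4, "cat": 0.6},
    "moon": {"is": 1.0},
    "cat":  {"is": 1.0},
    "is":   {"made": 0.5, "asleep": 0.5},
    "made": {"of": 1.0},
    "of":   {"cheese": 0.7, "rock": 0.3},
}

def continue_text(word, length=6):
    out = [word]
    for _ in range(length):
        options = next_word_probs.get(out[-1])
        if not options:
            break
        words, weights = zip(*options.items())
        # Emit whatever is statistically likely to come next.
        # Nothing here ever checks whether the sentence is true.
        out.append(random.choices(words, weights=weights)[0])
    return " ".join(out)

print(continue_text("the"))   # e.g. "the moon is made of cheese"
```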
So imagine you give it a clean, truthful dataset. Just because the individual inputs were truthful, it doesn't mean that property survives its "Great Statistical Remix." Until somebody can show me an algorithm with the proven property of "Truth Preservation," this whole idea seems deeply flawed, and I suspect the pursuit of a "Curated dataset" will be nothing but a wild-goose chase.
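To make the "remix" problem concrete, here's a tiny bigram toy (again, a deliberately crude sketch, not how any production model is actually trained). Both training sentences are true; the generator only learns which word tends to follow which, and it can happily stitch them into a falsehood.

```python
import random
from collections import defaultdict

# Two perfectly true training sentences (a toy corpus, nothing more).
corpus = [
    "paris is the capital of france",
    "rome is the capital of italy",
]

# Build bigram statistics: for each word, which words followed it.
follows = defaultdict(list)
for sentence in corpus:
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        follows[a].append(b)

def remix(start, max_len=8):
    out = [start]
    while out[-1] in follows and len(out) < max_len:
        # Pick any word that ever followed the current one in training.
        out.append(random.choice(follows[out[-1]]))
    return " ".join(out)

for _ in range(4):
    print(remix("paris"))
# One possible output: "paris is the capital of italy" -- every input
# sentence was true, yet the statistical remix is false.
```

Truth went in; truth did not necessarily come out.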
Consider:
In images, there are features that we humans think of as "Fingers." To the AI, a finger is just a statistical pattern, and however you look at it, in a photograph containing a hand that pattern is very likely to sit right next to another one just like it. Unless you have some understanding of what a finger actually is, and how many a typical human has, the local statistics simply say "a finger is usually followed by another finger." So what do our favourite image-generative AIs produce? That's right: pictures of people with the wrong number of fingers.
Go on: try it. Ask one for an image of "A man covering his face with his hands." I bet you get six or seven fingers per hand.
So ... what's my point about the fingers? Let me ask this: Exactly how many photos in the image-generator-AI's training set do you think depicted people with seven fingers? I'm betting it's a very, very low number.
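You can reproduce the gist with a back-of-the-envelope simulation. Assume (purely for the sake of the toy) that every hand in the training set has exactly five fingers, and that the generator learns only the local statistic "given a finger, how often is the next patch also a finger?" The numbers below are invented, but the outcome is the point: wrong finger counts come out even though the training data contained none.

```python
import random
from collections import Counter

# Every "hand" in this toy training set has exactly five fingers.
training_hands = [5] * 10_000

# The only thing our patch-by-patch generator learns is the local
# adjacency statistic: with five fingers per hand, 4 out of 5 fingers
# are followed by another finger.
p_next_is_finger = sum(n - 1 for n in training_hands) / sum(training_hands)  # 0.8

def generate_hand():
    fingers = 1  # draw the first finger
    while random.random() < p_next_is_finger:
        fingers += 1  # the local statistics say "probably another finger here"
    return fingers

counts = Counter(generate_hand() for _ in range(10_000))
print(sorted(counts.items()))
# Typical run: plenty of 2-, 3-, 7- and 8-fingered hands, even though
# not a single training example had anything other than five.
```

Zero seven-fingered hands in, plenty of seven-fingered hands out.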
We all understand the truth of "Garbage-In, Garbage-Out" ... but it does NOT imply "Truth-In, Truth-Out" at all. Not one little bit.