Already the case?
Should we therefore consider that Model Collapse is already a problem and that the current responses are already skewed?
What happens to machine learning models when they feed on themselves, when the data they ingest comes more and more from other generative models rather than human authors? This is already happening as the output of text models like ChatGPT and Bard, and of text-to-image models like Stable Diffusion, shows up on websites, gets …
A: stop blindly scraping everything off the Web.
Even without seeing your own output again, scraping everything was never going to provide a totally sane dataset in the first place.
But it was cheap and easy, because the only way they know to build these models is to keep on shovelling data into the training maw.
But labelling, that is, manually attaching provenance and "truthiness" values, is expensive (and gets away from the Tech Boys Building Bigger Machines), and is unfair to smaller companies.
Shame they don't have a better way of building their models.
"if you then take all of this and you train a model on top of ... all of this distribution of data that exists out there that humans have produced ... including facts as themselves – and then you ask a model to model the whole thing and start generating data – which is statistically indistinguishable from this distribution of data – the model inherently is going to make mistakes.
And it always will make mistakes. It's infeasible to assume that in some hypothetical future, we'll build perfect models. It's impossible. "
But let's carry on anyway because we can make money out of it.
(Somewhat abridged and my emphasis.)
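A minimal toy sketch of the loop he describes (my own illustration, not from the quoted interview), using a Gaussian as a stand-in for a generative model: fit the "model" to data, generate from it, refit on the generated data, and repeat. Because each fit is made from a finite sample, estimation errors compound, and over generations the fitted distribution drifts away from the original and tends to lose its tails.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for human-written data: samples from a standard normal.
human_data = rng.normal(loc=0.0, scale=1.0, size=200)

# "Train" the first model on human data, then feed each model on the
# previous model's output only.
mu, sigma = human_data.mean(), human_data.std()
for generation in range(1, 21):
    synthetic = rng.normal(loc=mu, scale=sigma, size=200)  # model output
    mu, sigma = synthetic.mean(), synthetic.std()          # next model's "training"
    print(f"gen {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")
```

With no fresh human data entering the loop, the estimated parameters random-walk further from the true ones each generation; that is the toy version of the mistakes compounding.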
Sure, if you train a cloistered LLM on data certified to be pure, organic, pesticide-free, and free-range, you'll get something that only reflects YOUR unacknowledged biases, and that you can convince yourself is sorta almost error-free. But I bet it won't be very useful. Enlarge the training data to include stuff from the 'Net because you just need more training data, and you've introduced stuff that will skew the LLM. You CANNOT effectively vet a large enough training data set to make it really useful without an uneconomically large amount of tedious, soul-deadening human effort (which we all know those eager profit-hungry corporate types won't support), so the data will be poorly vetted (if at all!) and crap will get in. Game over, insert quarter. Your LLM is screwed.
Could this positive-feedback reward system of confirmation-bias algorithmics (with drift?), resulting in distorted perplexity, with underlying echo-chamber entrapment and the associated misperception of reality, be a model of social collapse as well?
... and it also sounds as though outputs from revolutionary AI models, once massively spread throughout the web, will be much like the revolutionary plastics pollution that we just can't get rid of, no matter how big the filters ... (is this food for tots?).
What gets me is the way he thinks big companies will have an advantage because they can pay for humans to create better training data.
He is technically correct, but big companies can also pay humans to do better quality control, pay for safer and less polluting processes, pay for equipment repair, living wages, research, etc.
A training set that is massively flawed, but a dollar cheaper to produce than a "pure" data set, will be the one that is used.
This strikes me as similar to the human "echo chamber" effect, where somebody with an idea or strategy initiates consultation but only accepts responses from those with positive views. The idea or strategy then continually evolves with input only from those who were originally positive, and the group slowly shrinks as objections are ignored. Sometimes this method works, sometimes it fails.
BIML has been studying ML security closely for six years. We agree with Shumailov and furthermore think this is the number one risk facing LLMs.
As we say at https://berryvilleiml.com/results/:
[LLMtop10:1:recursive pollution] LLMs can sometimes be spectacularly wrong, and confidently so. If and when LLM output is pumped back into the training data ocean (by reference to being put on the Internet, for example), a future LLM may end up being trained on these very same polluted data. This is one kind of "feedback loop" problem we identified and discussed in 2020. See, in particular, [BIML78 raw:8:looping], [BIML78 input:4:looped input], and [BIML78 output:7:looped output]. Shumailov et al. subsequently wrote an excellent paper on this phenomenon. Also see Alemohammad. Recursive pollution is a serious threat to LLM integrity. ML systems should not eat their own output just as mammals should not consume brains of their own species. See [raw:1:recursive pollution] and [output:8:looped output].
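To make the "training data ocean" image concrete, here is a small sketch of my own (not BIML's), again using a toy Gaussian model: each generation, some fraction of the training pool is replaced by the previous model's output that found its way back into the pool. Keeping fresh human data in the mix damps the drift; a fully recycled pool is free to wander.

```python
import numpy as np

rng = np.random.default_rng(1)
human = rng.normal(0.0, 1.0, size=200)  # pristine human-written data

for polluted_fraction in (0.0, 0.5, 1.0):
    mu, sigma = human.mean(), human.std()
    for _ in range(50):  # 50 training generations
        n_synth = int(polluted_fraction * len(human))
        synthetic = rng.normal(mu, sigma, size=n_synth)       # recycled model output
        pool = np.concatenate([human[: len(human) - n_synth], synthetic])
        mu, sigma = pool.mean(), pool.std()                   # retrain on the mixed pool
    print(f"polluted fraction {polluted_fraction:.1f}: "
          f"mean={mu:+.3f}  std={sigma:.3f} after 50 generations")
```

The anchor of genuine human data keeps the fitted parameters close to the real distribution; remove it entirely and the estimates drift generation after generation, which is the feedback-loop failure the passage above warns about.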
gem
ML systems should not eat their own output just as mammals should not consume brains of their own species.
So eating ancestors' brains is bad?
There's a book, "The Ghost Disease", about how the practice in some parts of New Guinea caused kuru, a type of Creutzfeldt-Jakob disease.
It's interesting that the same applies to AI.