What is Model Collapse and how to avoid it

What happens to machine learning models when they feed on themselves, when the data they ingest comes more and more from other generative models rather than human authors? This is already happening as the output of text models like ChatGPT and Bard, and of text-to-image models like Stable Diffusion, shows up on websites, gets …

  1. Khaptain Silver badge

    Already the case?

    Should we therefore consider that Model Collapse is already a problem and that the current responses are already skewed?

    1. Doctor Syntax Silver badge

      Re: Already the case?

      Maybe not, but it's already leaning a bit.

  2. Paul Crawford Silver badge

    Banjos

    Why do I get the dueling banjos feeling when I read this?

    1. cyberdemon Silver badge
      Happy

      Re: Banjos

      The first instrument that came to my mind was the violin, the really tiny variety, as the bullshit-generator machine ingests its own bullshit and explodes.

  3. that one in the corner Silver badge

    What steps should the machine learning community take?

    A: stop blindly scraping everything off the Web.

    Even without seeing your own output again, scraping everything was never going to provide a totally sane dataset in the first place.

    But it was cheap and easy, because the only way they know to build these models is to keep on shovelling into the training maw.

    But labelling, i.e. manually attaching provenance and "truthiness" values, is expensive (and gets away from the Tech Boys Building Bigger Machines), and it puts smaller companies at an unfair disadvantage.

    Shame they don't have a better way of building their models.

  4. Doctor Syntax Silver badge
    Facepalm

    "if you then take all of this and you train a model on top of ... all of this distribution of data that exists out there that humans have produced ... including facts as themselves – and then you ask a model to model the whole thing and start generating data – which is statistically indistinguishable from this distribution of data – the model inherently is going to make mistakes.

    And it always will make mistakes. It's infeasible to assume that in some hypothetical future, we'll build perfect models. It's impossible."

    But let's carry on anyway because we can make money out of it.

    (Somewhat abridged and my emphasis.)

    1. Steve Hersey

      To summarize: LLM AIs are f***ed and unreliable. And that's unlikely to get much better.

      Sure, if you train a cloistered LLM on data certified to be pure, organic, pesticide-free, and free-range, you'll get something that only reflects YOUR unacknowledged biases, and that you can convince yourself is sorta almost error-free. But I bet it won't be very useful. Enlarge the training data to include stuff from the 'Net because you just need more training data, and you've introduced stuff that will skew the LLM. You CANNOT effectively vet a large enough training data set to make it really useful without an uneconomically large amount of tedious, soul-deadening human effort (which we all know those eager profit-hungry corporate types won't support), so the data will be poorly vetted (if at all!) and crap will get in. Game over, insert quarter. Your LLM is screwed.

  5. HuBo
    Gimp

    Larsen effect and the curse of recursion (in AI training)

    Could this positive feedback reward system of confirmation bias algorhythmics (with drift?), resulting in distortion of perplexity, with underlying echo chamber entrapment and associated misperception of reality, be a model of social collapse as well?

    ... and it also sounds as though outputs from revolutionary AI models, once massively spread throughout the web, will be much like the revolutionary plastics pollution that we just can't get rid of, no matter how big the filters ... (is this food for tots?).

  6. Zack Mollusc

    My favourite part

    is the way he thinks big companies will have an advantage because they can pay for humans to create better training data.

    He is technically correct, but big companies can also pay humans to do better quality control, pay for safer and less polluting processes, pay for equipment repair, living wages, research, etc.

    A training set that is massively flawed, but a dollar cheaper to produce than a "pure" data set, will be the one that is used.

  7. Bilby

    Thank you for your feedback.

  8. Anonymous Coward
    Anonymous Coward

    Similar to human responses?

    This strikes me as similar to the human “echo chamber” effect, where somebody with an idea/strategy initiates consultation but only accepts responses from those with positive views; the idea/strategy then continually evolves with input only from those who were originally positive, the group slowly shrinking as objections are ignored. Sometimes this method works, sometimes it fails.

  9. gem@BIML

    Recursive pollution is the number one LLM risk

    BIML has been studying ML security closely for six years. We agree with Shumailov, and furthermore think this is the number one risk facing LLMs.

    As we say at https://berryvilleiml.com/results/:

    [LLMtop10:1:recursive pollution] LLMs can sometimes be spectacularly wrong, and confidently so. If and when LLM output is pumped back into the training data ocean (by reference to being put on the Internet, for example), a future LLM may end up being trained on these very same polluted data. This is one kind of “feedback loop” problem we identified and discussed in 2020. See, in particular, [BIML78 raw:8:looping], [BIML78 input:4:looped input], and [BIML78 output:7:looped output]. Shumailov et al. subsequently wrote an excellent paper on this phenomenon. Also see Alemohammad.

    Recursive pollution is a serious threat to LLM integrity. ML systems should not eat their own output just as mammals should not consume brains of their own species. See [raw:1:recursive pollution] and [output:8:looped output].

    gem

    1. DexterWard

      Re: Recursive pollution is the number one LLM risk

      ML systems should not eat their own output just as mammals should not consume brains of their own species.

      So eating ancestors' brains is bad?

      There's a book, "The Ghost Disease", about how the habit in some parts of New Guinea caused kuru, a type of Creutzfeldt-Jakob disease.

      It's interesting that the same applies to AI.
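A minimal, purely illustrative sketch of the feedback loop gem describes above (this is not code from BIML or the Shumailov paper): a toy Gaussian "model" is refit each generation on samples drawn from its own previous output, with made-up seed and sample sizes.

    import numpy as np

    rng = np.random.default_rng(0)

    # Generation 0: "human-written" data from a standard normal distribution.
    data = rng.normal(loc=0.0, scale=1.0, size=20)

    for gen in range(201):
        # "Train" the toy model: estimate the distribution's parameters
        # from whatever data is currently out on the "web".
        mu, sigma = data.mean(), data.std()
        if gen % 40 == 0:
            print(f"gen {gen:3d}: mu={mu:+.4f}  sigma={sigma:.4f}")
        # The model's output gets published, scraped, and becomes the
        # next generation's entire training set.
        data = rng.normal(loc=mu, scale=sigma, size=20)

Run it and sigma typically decays by orders of magnitude over the 200 generations while mu wanders: the tails of the original distribution vanish, which is the collapse the article describes.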

  10. Puketapu

    This doesn't appear to be surprising in any way, but it's nice to see it being studied in a formal way.

    1. diodesign (Written by Reg staff) Silver badge

      You know the saying:

      The difference between science and screwing around is writing it down.

      But more seriously: anyone can have a hunch; it takes studies to reach a conclusion.

      C.

  11. thondwe

    Multiple LLMs make matters worse

    So there are multiple LLMs at work in this, so it's a bit like two polluters feeding into the same river: each on its own may be tolerable, but mix the two together and ....

    For nerds: cf. the Thunderbirds episode "Danger at Ocean Deep".
