Big brains divided over training AI with more AI: Is model collapse inevitable?

AI model collapse – the degradation of quality expected from machine learning models that recursively train on their own output – is not inevitable, at least according to 14 academics. The risk that ongoing generative AI output, known as synthetic data, will dilute human-created organic data and impair the performance of …

  1. The Dogs Meevonks Silver badge

    Strange you should say that....

    Just the other week I posted the comment below.

    "As everyone (inc me keeps repeating) who will bother reading what no one could be bothered to write?

    Oh... other AIs reading other AIs in a never-ending loop of regurgitated, ever more inaccurate bollocks until it collapses under its own ineptitude... at least, that's my hope. Let it come crashing down as swiftly as NFTs did."

    1. Anonymous Coward
      Anonymous Coward

      Re: Strange you should say that....

      "that's my hope. let it come crashing down as swiftly as NFTs did."

      Sadly, it wont come crashing down and ur comparison to NFTs is weak (IoT was closer). AI is actually useful. Im sorry but you are like a fish asking what is water. It isnt even debatable now - AI is the next big thing. And if you are a seasoned campaigner like myself, it is a wonderful very early retirement plan.

      Perhaps the oddest thing of all is that I think AI is actually under-hyped. Not because I'm a fanboi as those days have passed me by, but because you really don't get much press on anything other than LLMs.

      "As everyone (inc me keeps repeating) who will bother reading what no one could be bothered to write?"

      You already are reading LLM text every single day. Most social media is full of it. The news sites are using it more and more and if the reg allowed me to post about their use of it, I could give u examples. But they wont cause they keep blocking my posts. I can assure you that there are LLM bots on this site too - and not the jokes-not-funny-now Marsanon twat.

      I can understand your resistance to this change. But it is a shift bigger than the mobile internet was. We really dont care what u think.

      1. theOtherJT Silver badge

        Re: Strange you should say that....

        > You already are reading LLM text every single day. Most social media is full of it

        Yes. I've noticed. The fact that it's full of meaningless noise compared to how it was even 5 years ago is enough to have me wondering whether it's worth giving up at this point. It's too badly polluted to be useful any more.

        I think the only reason I even still have a Facebook account is that there's a dozen or so of us who have a "pub?" group that we use to organize meeting up for beers after work.

      2. This post has been deleted by its author

      3. Anonymous Coward
        Anonymous Coward

        Re: Strange you should say that....

        While based on your post it makes sense your a proponent of AI and convinced it's the next big thing, I fear that is based on your grasp of basic grammar and the basic principals of logic.

        This is another hype bubble, and a much smaller and more tightly focused remnant will be all that remains of this generation of ML driven technology. In the mean time, I hope you can stay out of the whirlpool as it sucks down the hucksters, frauds, and pump and dump scammers.

        1. Anomalous Cow Herd

          Re: Strange you should say that....

          <quote>Re: Strange you should say that....

          While based on your post it makes sense you're a proponent of AI and convinced it's the next big thing, I fear that is based on your grasp of basic grammar and the basic principals of logic.

          This is another hype bubble, and a much smaller and more tightly focused remnant will be all that remains of this generation of ML driven technology. In the mean time, I hope you can stay out of the whirlpool as it sucks down the hucksters, frauds, and pump and dump scammers.</quote>

          Sorry - couldn't help but correct the *deliberate* grammar mistake here

      4. FuzzyTheBear

        Re: Strange you should say that....

        " i " use that .. " I " like in " I really dont care what you think. " Leave the rest of us alone. You're certainly not my voice.

      5. Sanae Alankon

        Re: Strange you should say that....

        Are you trolling at this point?

    2. Herring` Silver badge

      Re: Strange you should say that....

      Web 5.0: LLMs write all content. The only readers are other LLMs

      1. Michael Wojcik Silver badge

        Re: Strange you should say that....

        Even before transformer-based LLMs were released to the public, this sort of thing — machine-generated text being consumed by machines — was fairly common. People were using simpler models to generate everything from product reviews to news articles about sporting events and financial results to entire books (Icon Publishing), and that content was being scraped, summarized, and otherwise digested.

        But, yes, the rush to use LLMs for every purpose some idiot can cram them into is greatly increasing this trend. We're burning huge amounts of resources on making machines expand and compress low-quality data for no useful purpose.

        LLM enthusiasts like the incoherent AC above[1] continue to shout that we're witnessing the next industrial revolution, but thus far I see very few signs of it (and I follow a lot of the research and analyses). There are a few genuinely useful applications of large autoregressive models, such as AlphaFold3 and some quite interesting results from using LLMs as another tool for computational mathematics (FunSearch is one example); but the vast majority of both transformer and diffusion model use seems to be just play, noise, or laziness and learned helplessness.

        [1] If thinkers I respect and sometimes agree with, like Scott Aaronson and Zvi Mowshowitz, can't convince me with reasoned, articulate arguments that gen-AI offers significant mundane utility, then certainly some half-assed babble isn't going to.

    3. spold Silver badge

      Re: Strange you should say that....

      It will be a shift in its Turing Test results - it will devolve to the point you will not be able to tell it from a politician.

  2. Filippo Silver badge

    From what I understand, they can't claim that there is no problem. They may claim that instead of "model collapse" we'll get just some degree of "model degradation". That's still a pretty big problem, considering that current models are already largely only good for party tricks, and fairly crap at serious tasks.

    I still think that the current main dangers to AI research are not "model collapse", but (1) overhype, and (2) focusing resources on superficially promising avenues that turn out to be dead ends.

    1. Anonymous Coward
      Anonymous Coward

      Why worry if it is over-hyped? So was the internet and that sorted itself out.

      It seems that most criticisms against AI come from those who missed the boat. Yes, it is in a bubble at the moment, but the greybeards dont seem to understand that the only thing that really matters nowadays is to make your stash and run - as I have done.

      Who cares? Why care? No point. Take the money and run. Im off to the Cayman Islands soon and I wont be looking back.

      Ive spent a good part of my career - like many here - working to improve people's lives through the Internet. And I've learnt it is mostly a waste of time because stupid always finds a way.

      "considering that current models are already largely only good for party tricks, and fairly crap at serious tasks."

      AI is much more than ChatGPT3. In my own field of predictive human behaviour it is a gold mine and it works better than humans. It is being tested by a mental health service in south east London on their patients and has proved wildly successful in trials. It is being used extensively in sport - especially tennis and football. It has value and offers what was never possible b4. It's being pitched for prisons, police stations, general hospitals - anything that needs observation. It is only a matter of time before this tech filters down to your CCTV. It is being used extensively throughout Japan by Lawsons and 7-11 to predict shoplifters. The Chinese (who are world leaders in this particular field) use it throughout their systems. To give u some perspective of how far the West is behind, DeepMind's recent football set-piece prediction is 2 years behind the work we did in China.

      Even in the current state, predictive AI surrounds your existence. AI will completely dominate your future in a way that makes the internet look puny. And that is a nightmare! A true horror story awaits us - I know cause I built a tiny part of it. The only chance I could see of living a normal life of sorts was to get a load of money. I recommend you do the same.

      1. refitman

        I'm going to go out on a limb and say that you went big on crypto being "the next big thing that will revolutionise our lives"?

        1. Anonymous Coward
          Anonymous Coward

          dont b silly though one of my friends did get some BitCoins right at the beginning and has bought a massive house in Brighton.

      2. amajadedcynicaloldfart

        A few quotes from this and your previous posts

        "because it is all about the money and that is distasteful".

        Now a quote from this post

        "Who cares? Why care? No point. Take the money and run. Im off to the Cayman Islands soon and I wont be looking back".

        You Sir, are just another hypocrite.

        "The only chance I could see of living a normal life of sorts was to get a load of money. I recommend you do the same".

        How distasteful...

        1. Anonymous Coward
          Anonymous Coward

          Re: A few quotes from this and your previous posts

          Yep. I am. All of those things - just like u. LOL. Takes one to know one... lol

          No seriously though, Im glad u put the work in. That is my purpose.

      3. cyberdemon Silver badge
        Devil

        > Im off to the Cayman Islands soon and I wont be looking back.

        > AI will completely dominate your future in a way that makes the internet look puny. And that is a nightmare! A true horror story awaits us - I know cause I built a tiny part of it. The only chance I could see of living a normal life of sorts was to get a load of money. I recommend you do the same.

        If there is a Hell, you are most certainly going there.

        It doesn't take AI to know that 'u' are the idiot AC above, either

        1. Anonymous Coward
          Anonymous Coward

          Re: > Im off to the Cayman Islands soon and I wont be looking back.

          I dont call the Cayman Islands hell, sunshine. No way. Heaven yes.

          If I ever come here and my posts get more upvotes than downvotes then I have done something wrong.

          On a side note, imagine if you could create an AI model that could, within reason, guess the IQ of a person based on their online output. Now I know that is only a part of what makes a person a person, but still, it is a demographer's dream. Do you know your level? See how accurate I am.

          1. Excellentsword (Written by Reg staff)

            Re: Re: > Im off to the Cayman Islands soon and I wont be looking back.

            Can you calm down a bit before I ban you so I don't have to read this drivel?

            1. Anonymous Coward
              Anonymous Coward

              Re: > Im off to the Cayman Islands soon and I wont be looking back.

              Yes, please. Mr Mod. The model needs a little more fine-tuning. It worked fine on Twitter but needs a little tweak for this place.

              We will return shortly with an updated one. Thanks for reading through those 'walls of text'. Your choices on what to let through were needed.

              1. cyberdemon Silver badge

                Re: > Im off to the Cayman Islands soon and I wont be looking back.

                Congrats then. You've just proved that the primary use-case for "AI" is attracting downvotes with shitposts.

                I suggest you get your coat.

                1. Anonymous Coward
                  Anonymous Coward

                  Re: > Im off to the Cayman Islands soon and I wont be looking back.

                  No, it just needed some fine-tuning is all.

              2. that one in the corner Silver badge

                Re: > Im off to the Cayman Islands soon and I wont be looking back.

                Anyone else get the feeling this "Tomi Tank" has been watching too many Bond movies?

                He has a "super weapon" and reckons he is getting rich from a nefarious scheme to ruin the lives of everyone he encounters, even if he has to attack us piecemeal, one forum at a time, threatening to return when faced with a ban (and he forgot to add "when we least expect it"; you just don't get same class of villainy these days).

                And like every Bond villain, incapable of realising that, if he really *did* have the unstoppable power he imagines he does, he could make so much more if only he was capable of working *with* society.

                Instead, he cackles on, laughing at what he sees as the little people, who don't share his vision. Hope he remembers to get a cat, as that will be the closest he'll find to loyalty and friendship in his Cayman hideout.

                1. cyberdemon Silver badge
                  Devil

                  > Hope he remembers to get a cat, as that will be the closest he'll find to loyalty and friendship

                  And he'll soon find that a cat is about as loyal and friendly as he is.

                  It'll be plotting his downfall and will coax him to sit in one of the minion chairs, then the lever that tips him into the magma will be some kind of cat toy.

                2. Anonymous Coward
                  Anonymous Coward

                  Re: > Im off to the Cayman Islands soon and I wont be looking back.

                  > Anyone else get the feeling this "Tomi Tank" has been watching too many Bond movies?

                  More Dr. Evil than Blofeld.

      4. Anonymous Coward
        Anonymous Coward

        "Im off to the Cayman Islands soon and I wont be looking back."

        Hahahahahahahahahaha. More like Canvey Island.

      5. Anonymous Coward
        Anonymous Coward

        Oh look, scum.

        A legend in your own mind..

        1. Anonymous Coward
          Anonymous Coward

          Re: Oh look, scum.

          ...and a knob-end in everyone else's!

      6. pbklink

        A quick search of "AI predictive human behaviour" returns this article:

        https://www.linkedin.com/pulse/predicting-human-behavior-through-ai-unsettling-power-brad-carr/

        Seems in line with Tomi Tank's description of AI's use in predictive human behaviour. Does seem scary and unsettling.

        1. Michael Wojcik Silver badge

          Well, people are highly predictable. We've known that since at least the invention of storytelling.

          And, yes, if you build a large-enough model and train it on enough reasonably[1] accurate and precise data, you can use it to predict the behaviors of individuals and groups under ordinary conditions. It'll be more difficult in extraordinary ones, because the model won't have much data to work on, so those parts of the state space will have really shallow corresponding gradients in the model's parameter space.

          [1] Yes, I know this qualification is doing a lot of heavy lifting here. It'd be possible to get better estimates by looking at things like deanonymization research and doing differential-privacy analysis. But specificity isn't really necessary here if you buy the overall argument.

          1. This post has been deleted by its author

          2. pbklink

            I don't think this is about extraordinary conditions. I think it is more about nudging a small number of people in a certain direction to make a big difference.

            If an election is close, how can you influence individuals to vote a certain way? How can you encourage (or discourage) individuals from participating in current protests? And the influence could be tailored down to the individual.

            The individuals being targeted are in some ways the 'undecided' or close to it. So they are in the middle of the behavior spectrum and the model should have lots of data to work with. Nefarious use of social media could even be used to generate more data.

    2. Glen Murie

      And power requirements. Don't forget the massive amounts of power required.

      Moore's Law only promises more transistors, not cooler and more energy-efficient ones.

  3. Mike 137 Silver badge

    "it behaves effectively as if it had only been trained on an n-fraction of original data"

    I can't comment on the numerical specifics without sight of the paper, but the principle is obvious. The potential variety of output depends on the diversity of the training data set. Recursive training on output data inevitably tends to homogenise output as no new information points are generated, only reorganisations of the original corpus. It would be nice to have a link to the Kempe paper.
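
    You can watch that homogenisation happen in a toy simulation (a sketch of my own, nothing to do with the paper's mathematics - just a word-frequency model repeatedly refitted to its own samples):

    import random
    from collections import Counter

    # "Organic" corpus: 26 word types with a long-tailed frequency distribution.
    vocab = [chr(ord('a') + i) for i in range(26)]
    weights = [1.0 / (i + 1) for i in range(26)]   # Zipf-ish tail
    corpus = random.choices(vocab, weights, k=500)

    for generation in range(30):
        # "Train": estimate token frequencies from the current corpus.
        counts = Counter(corpus)
        types, freqs = list(counts), list(counts.values())
        # "Generate": the next corpus is sampled from the fitted model.
        corpus = random.choices(types, freqs, k=500)
        print(f"gen {generation + 1:2d}: {len(set(corpus))} word types survive")

    # Once a rare word fails to be sampled it is gone for good: each generation
    # can only reorganise what the previous one produced, never reintroduce lost
    # information, so the vocabulary shrinks and the output homogenises.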

    1. Anonymous Coward
      Anonymous Coward

      The internet archive?

      That form of model collapse - data being diluted until the model behaves as if it had been trained on 10 times less data after 10 iterations - should be easy to circumvent for old data by using things like the Wayback Machine. For newer data, the machines (web crawlers combined with conventional algorithms and AI) should at least be able to reasonably accurately track the original source.

      I wouldn't be surprised if the non-profit organisation around the Wayback Machine gets attractive offers to become a bit more for-profit in the coming years. And the big data hoarders have their own date-stamped archives.
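
      For what it's worth, the Wayback Machine already exposes a public CDX index that makes the "old data" part nearly free. A rough sketch (the URL and cut-off date are arbitrary, and the API details are from memory, so treat them as an assumption):

      import json
      import urllib.parse
      import urllib.request

      # Ask the Wayback Machine's CDX index for snapshots of a page captured
      # before LLM output started flooding the web.
      params = urllib.parse.urlencode({
          "url": "example.com/some-article",
          "to": "20221130",            # captures up to Nov 2022 only
          "output": "json",
          "filter": "statuscode:200",  # skip redirects and errors
          "limit": "5",
      })
      with urllib.request.urlopen("http://web.archive.org/cdx/search/cdx?" + params) as r:
          rows = json.load(r)

      # The first row is a header; each following row is one clean snapshot,
      # retrievable at web.archive.org/web/<timestamp>/<original-url>.
      for row in rows[1:]:
          timestamp, original = row[1], row[2]
          print(f"https://web.archive.org/web/{timestamp}/{original}")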

      1. Anonymous Coward
        Anonymous Coward

        Re: The internet archive?

        Would some of the downvoters care to share what is wrong with my comment? I seek not to troll or to tell the downvoters they are wrong. I am trying to understand whether I failed to make my point clear, whether I got it wrong myself, or whether it is just sentiment voting.

        The point I was trying to make is: as a human, in science and research, it's not only normal but also required to put references in publications showing where you got your information. That allows human readers to track and re-evaluate the quality of the information and generally leads to better learning in the scientific community.

        I understand that the bulk of the information these companies scrape for their LLMs doesn't contain such references. Yet for much basic information like language structure and grammar, physics, mathematics, good program structure, thermodynamics, astronomy... old quality resources are as good as, or often better than, new popular snippets of information you find on the internet today. Using good quality resources is a no-brainer when trying to learn a skill or a new field.

        This general, basic information is (or, better said, should be!) the basis for training the various LLMs. And that information can be kept free of model collapse by using a very simple filter: "original publication date" < "date LLMs began polluting the information available on the internet".

        As for newer information, LLMs excel at stochastically "chewing" on information. Categorizing billions of pages according to topic and specific content is what they do. That makes correlating them a lot easier: think of "web page A", "web page B" and "web page C"... showing 92% correlation in their information on the topic of X. Then sort those by date of first publication. If "web page C" predates all the others, chances are a lot higher that that one is closer to the original source.
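
        In toy Python, the heuristic I mean looks something like this (pages, dates and scores are all invented):

        from datetime import date

        # (page, first-publication date, % correlation with topic X) - all made up.
        pages = [
            ("web page A", date(2023, 5, 1), 92),
            ("web page B", date(2023, 7, 12), 93),
            ("web page C", date(2021, 2, 3), 92),
            ("web page D", date(2024, 1, 9), 40),   # not about topic X
        ]

        THRESHOLD = 90  # treat pages above this as carrying the same information
        cluster = [p for p in pages if p[2] >= THRESHOLD]

        # Among highly correlated pages, the earliest publication date is the
        # best guess at the original source; later ones are likely derivatives.
        likely_source = min(cluster, key=lambda p: p[1])
        print(likely_source[0])   # -> "web page C"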

        It's by no means a foolproof method of getting to the source. It's just one of plenty of methods by which source information could be weighted for how much it contributes to the training process. And if model collapse were starting to become a problem, a combination of many of these techniques would be "needed" in order to counter it. Curating of information, by humans and machines, will be a valuable (to those AI companies) thing.

        Chances are that the cost of that will be so high that they will forgo it. That remains to be seen. In the meantime, don't be gullible. Stack Exchange made a U-turn not only on selling "its" information, but also on allowing AI-generated information to be posted on its fora. The latter is not just a way to excuse itself by saying "see, we take AI seriously as a valuable resource". It contains a hidden sting:

        It allows OpenAI to post its own generated questions and answers on topics its machines marked as uncertain, remember that it *itself* posted those questions and answers (so that it won't slurp those back in itself), and try to learn from the (mixture of) human (and machine) users pointing out to "the original poster" what is wrong in the answer. By that, OpenAI gets itself "free reinforcement learning" if it manages to filter the answers by quality (and things *as simple as* comparing upvotes and downvotes from *known* human users will help it estimate the quality of the answers).

        1. Anonymous Coward
          Anonymous Coward

          Re: The internet archive?

          Not one of the downvoters (or upvoters), but here are a few points to consider.

          The first and the biggest: scraped data has all sorts of problems, and isn't an essential or necessary source of training data. The Wayback Machine may contain a trove of snapshots of pre-LLM-invasion websites, but that resource is finite, and overtraining on out-of-date information will cause its own problems.

          While your point about going back to primary sources has merit, you stumble in tying that to a publication date. Bad math and science information existed on the internet pre-LLM, and bot-generated word salad existed too, thanks to PageRank and SEO. That also included bogus math and science data. In reality there isn't much benefit to using untrustworthy publicly scraped data for that. Public-domain books, or buying the rights to legitimate text, would be a much safer choice with a better signal-to-noise ratio.

          Your points on curating data are closer to the mark, as is your observation on how the market will react to the beginning of the threat of model collapse. While the different papers tackle the issue from different viewpoints, all of them at the core are pointing at similar results. Dumbly scraping public text and the inability to reliably identify garbage (not just LLM-generated output) will degrade the performance and reliability of the systems trained on it even more than it already has. This also overlooks a more important point that it is adjacent to: LLMs don't magically scale directly with the amount of data you feed them. The biggest models are already showing the breakdown at the top end. Less, higher-quality data is the main way forward, and the major players have known it for some time (see the "there is no moat" memo, etc.). As a result the momentum has already shifted at the vanguard. People aren't freaking out about publicly posted garbage choking the next generation of LLMs because they are already moving on to other methods.

          As to the prospects of the Internet Archive selling access to its material: while I wouldn't totally rule it out if the legal environment changes, they probably don't have that luxury in their legal jurisdiction. If they try to monetize the Internet Archive they will most likely be in bad trouble with fair use and copyright/IP law. That said, there are other parties holding similar information that may be willing to play that game. The two biggest won't share, though, as Google and Microsoft both have their own LLMs to protect. Both also have deep enough pockets to dump the scraped data if it's in their interest.

          That said, as long as the "free" training data they are effectively stealing yields acceptable results they will happily keep going till someone makes them stop.

          1. Anonymous Coward
            Anonymous Coward

            Re: The internet archive?

            Now, you and Veti below both make clear points I agree with.

            The point I tried (and largely failed) to make is similar to what you two say: the potential for model collapse creates the fear (among some researchers) or hope (among many who aren't fond of AI racing forward) that this pollution of the internet, with an ever higher percentage of content written by LLMs and ingested back by them, will severely degrade model quality (further, for those who consider LLMs nothing but useless bots), and that this in itself could "destroy" the future of any and all LLMs. That's where I agree with what you both said: there are plenty of other methods to get good and better-quality information than blind scraping. In fact, blind scraping is one of the worst methods for model quality. It just happens to be the cheapest, or one of the cheapest, methods of obtaining large amounts of info for training.

            => So: model collapse, defined as all LLM models tumbling down, won't likely happen for those with deep enough pockets to use a wide array of techniques to acquire better-quality information. Manually (human-) curated resources are one of those techniques, but an extremely expensive one.

            Why I highlighted using archives is for those reasons: it's not only a simple example of how to (only partly, agreed) preserve the quality of much "legacy" material, but it's also a very cheap one. Multiple data hoarders simply already sit on such archives. It will indeed become increasingly dated, but older information has, IMO, a far longer shelf life than most give it credit for. In school, almost everything (books and learning materials) in basic school (age 6-12) is decades to centuries old, with the exception of how to use an iPad or tablet. In high school (age range 12-18 in my region) it's barely more if you don't choose technology-related courses. Even at uni, much material is decades old. So in theory, old material can still provide quite a decent basic training for models.

            => My analysis is that this thing called model collapse therefore will first create barriers to entry for the small players without very deep pockets and treasure troves of stolen data. After that, the big players may or may not find it too expensive to pay for better-curated information. But by that time there may be only a few very strongly resourced players left standing. These tech giants may realize this is a viable way to create a quasi-monopoly if this ends up a big market, and that may well be the reason for the current rush and giant investments in this thing: to try to corner a potentially very big (IF it succeeds) future market for themselves.

            For those hoping model collapse will deflate this possible AI bubble, I fear that's idle hope.

            For those who say AI models will be too expensive to run for companies to make a profit (IF they find enough paying customers, that is), something similar holds. If the big players are forced to curate "their" training info a lot better, the same quality (for whatever measure of quality) will be achieved by much smaller models that are cheaper to run *inference* on. Inference cost already dominates total LLM cost. So IF LLM acceptance grows further (and IF LLMs get enough paying customers), investing more in curated quality information may well *decrease* total LLM (training plus inference) costs. For the big players, that is, once more.

            1. Anonymous Coward
              Anonymous Coward

              Re: The internet archive?

              Decrease total cost *per inferred token*, at large scale I meant to say.

        2. veti Silver badge

          Re: The internet archive?

          Also not one of the downvoters, but if I were looking for "clean" data to feed to a learning model there are plenty of ways to get it.

          Books, for example. Every book has a publication date. Pick books from before the first appearance of LLMs. Project Gutenberg has 70,000 of them for starters.

          Or Wikipedia. Every wiki page has a full version history that allows you to wind back its text to any earlier date you care to specify. I've often thought there's some interesting research to be done on charting how the whole corpus has changed over the years, but you could also use it to retrieve text from before LLMs broke upon the world.

    2. Not also known as SC

      Re: "it behaves effectively as if it had only been trained on an n-fraction of original data"

      It's on arxiv.org: A Tale of Tails: Model Collapse as a Change of Scaling Laws - https://arxiv.org/pdf/2402.07043

      If that link doesn't work or is blocked, visit arxiv.org, change the search to Computer Science and search for the paper title.

      The other papers mentioned:

      The Curse of Recursion: Training on Generated Data Makes Models Forget - https://arxiv.org/pdf/2305.17493

      Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data - https://arxiv.org/pdf/2404.01413

    3. Mike 137 Silver badge

      Re: "it behaves effectively as if it had only been trained on an n-fraction of original data"

      There are two papers by Kempe et al. on the reduction in effective model size with recursive training -- A Tale of Tails: Model Collapse as a Change of Scaling Laws and Model Collapse Demystified: The Case of Regression. Both are quite "mathematical" but merit careful consideration. Probably the most telling observation for the general reader is in the latter paper: "A direct consequence of our multiplicative degradation result is that, over time (i.e as the number of generations becomes large), the effect of large language models (like ChatGPT) in the wild will be a pollution of the web to the extent that learning will be impossible. This will likely increase the value and cost of clean / non-AI-generated data."

      Bears thinking on, doesn't it?

      1. Anonymous Coward
        Anonymous Coward

        Re: "it behaves effectively as if it had only been trained on an n-fraction of original data"

        At the core it's all pointing back in the same direction. Models ingesting their own output is already degrading performance of systems using uncontrolled training data. At the point that degradation becomes severe enough, the utility value of clean data will rise above the cost for the models that will survive. This may favor larger players with deep pockets and smaller players with domain expertise and pristine data.

        Conversely the doomsday scenarios will almost certainly not arrive. Much like real world pollution, when it gets bad enough, the incentive to clean things up hits an inflection point and things start to level off or improve. A more interesting discussion for me is where the system begins to find metastable equilibrium, and how prone to flash crashes the faster moving parts of it become. This will start to be clear in hindsight if nothing else, but my god those could be some bumpy bumps if idiots keep tying these LLMs into live control systems.

        The likely case is that both public and privately curated data sets become the norm, and the weights are tuned to filter public data against a smaller set of cleaner and more reliable data, at least for models that train off public data sets at all. Likely the intellectual property issues and legal ambiguities will provide a bigger kick in the near term than the breakdown curve, performance-wise.

        1. Anonymous Coward
          Anonymous Coward

          Re: Much like real world pollution, when it gets bad enough....

          "...the incentive to clean things up hits an inflection point and things start to level off or improve."

          What fucking planet are you on? It's clearly not this one.

  4. abend0c4 Silver badge

    Here's us thinking recursion was a solved problem

    Possibly not now that AI is actively courting Stack Overflow.

  5. User McUser
    Go

    AI will learn a very important lesson

    Don't shit(post) where you eat.

    1. Zolko Silver badge

      Re: AI will learn a very important lesson

      Not completely related to sh***ing and eating in the same place, but I've always wondered why nature has come up with the solution of offering a single output tube for 2 very different body liquids coming out of men.

  6. Throatwarbler Mangrove Silver badge
    Boffin

    A commentary on humanity

    To me, this analysis of AI model recursion contains an implicit commentary on the current state of human communication and learning. As a species, we used to learn from nature and the natural world, then we started to vanish up our own asses with self-referential stories, and now that's mostly what we repeat instead of continuing to learn from the world around us (there being obvious exceptions, of course). AI is simply going down the rabbit hole that humanity dug for it.

    1. Anonymous Coward
      Anonymous Coward

      A reflection of humanity

      We built a statistical model of human language. All it can possibly do is reflect and replicate the underlying system it's built on. That's the funny thing: too many of the bright but not-so-worldly folks building these things made a classic blunder. The built a system modeled on human linguistics, but totally discounted the underlying body of work in linguistics dealing with the problems of handling formal logic in those linguistic systems. You literally can't express some of those problems correctly or deterministically in a language like English. Ignoring that problem doesn't make it go away, it just makes you look like an idiot later. We knew about many of the problems a hundred years ago, back when teaching philosophy and logic was as popular as advanced mathematics.

      1. Michael Wojcik Silver badge

        Re: A reflection of humanity

        The[y] built a system modeled on human linguistics

        Well, no, they did not. Transformers are not "modeled on human linguistics", or on human language at all. The tokenization algorithm is, in a fairly trivial sense, linguistic; but the overall model architecture is not.

        Current LLMs mostly do poorly at many formal-analysis tasks (though they're getting better) because session memory is limited to the context window,[1] and because attention is not, as it turns out, all you need when you have to maintain formal rigor. Attention-head superposition is one problem; there are others. I recently saw a description of a paper investigating this further but I'm not finding it at the moment.

        [1] And no, the recent Google "infinite context" paper doesn't fix this. What they describe isn't infinite context; it's using an RNN in addition to the CNN layers to provide, in effect, lossy compression of old context as it shifts out of the window.

  7. rgjnk Bronze badge
    Devil

    As expected

    It's hardly a surprise to see people busy trying to defend the future of their house of straw when their livelihood depends on keeping it standing.

    My experience so far with most of those pushing stuff for the latest AI hype cycle is that they have a very loose definition of stuff 'working' at the best of times (with 'stable' not existing at all), plus they're really keen to push their latest shiny even when it doesn't really work, especially compared to other, less magic methods.

    I've burned out on the whole current AI cycle; it's great for spinning dreams but reality isn't quite so kind.

  8. steelpillow Silver badge
    Boffin

    Feedback

    Systems engineering, first lesson: Any system can be broken down into Input > Process > Output.

    Systems engineering, second lesson: Don't forget feedback - output returning to the input for another crack at the whip. This can go one of three ways:

    1. Positive. This means either flip to the limit and stay there, or oscillate wildly.

    2. Zero. This has no effect.

    3. Negative. This stabilises the output and can be tailored to give you what you actually want.

    This does not translate directly into recursive AI, but it gives you the gist of where your head should be going.
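
    If you want to see the three cases without building any hardware, a toy discrete-time loop does it (gain values picked arbitrarily):

    # Each step feeds a fraction of the output back into the input:
    # gain > 0 is positive feedback, gain = 0 none, gain < 0 negative.
    def run(gain, setpoint=1.0, steps=8):
        output, history = 0.0, []
        for _ in range(steps):
            output = setpoint + gain * output   # input plus fed-back output
            history.append(round(output, 3))
        return history

    print("positive:", run(+1.5))   # runs away towards the limit
    print("zero:    ", run(0.0))    # no effect; output just tracks the input
    print("negative:", run(-0.5))   # settles to a stable 2/3 of the setpoint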

  9. DS999 Silver badge

    I don't buy their reasoning

    They say they can continue training on "legacy" data from human sources. But that will become a lower and lower percentage of the total amount of data out there as AI progresses. Just look at a web search today versus a year ago - you already see multiple first-page results written by AI, depending on what you're looking for. Now imagine what those same searches will be like in 2 or 3 years.

    If model collapse from ingesting AI-generated data is inevitable, there is no way they'll be able to prevent it unless the AI grabbing training data is able to discern human- from machine-generated data, to limit how much incest is happening.

    1. Will Godfrey Silver badge
      Thumb Up

      Re: I don't buy their reasoning

      Actually I think your use of the word 'incest' is very appropriate here - a slow, steady degradation until the organism becomes dysfunctional.

      P.S. It might take many generations before the errors become visible.

    2. Michael Wojcik Silver badge

      Re: I don't buy their reasoning

      The companies building LLMs already have corpora of historical data from before the web at large became polluted with LLM-generated content. They can continue to use that to train larger and larger models.

      The recursion problem is, of course, an issue for incorporating new data, which they would very much like to do to keep their models relevant.

      I don't find the Gerstgrasser et al. paper very interesting, and the objections from Kempe, Feng, and Shumailov are apt. But it's not impossible that an organization wanting to produce successive generations of LLMs could apply the Gerstgrasser technique to a training collection composed of historical LLM-clean corpora and curated data from after LLM tainting began. That curation would likely take the form of an initial mechanical pass, using some of the algorithms that have been developed for detecting LLM output (none of which are very good, but if they have an F0 that's better than random, they'll work for an initial screening), and then one or more human passes.

      If I were doing this (which I very much will not be) I'd probably have one large, cheap human pass with lots of relatively inexpensive human labor — some of those many out-of-work folks with humanities degrees — followed by specialist human judges evaluating random samples of what got through the mechanical filter and the first human filter, as quality control and to help refine the process.
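
      Mechanically there's nothing exotic about that screen. A sketch, with the detector and both human passes as invented stand-ins (the thresholds are arbitrary):

      import random

      def detector_score(doc):
          # Stand-in for an LLM-output detector returning P(synthetic).
          # None of the real ones are very good; better-than-random suffices.
          return random.random()

      def first_human_pass(doc):
          # Stand-in for the large, cheap human review.
          return True

      def build_corpus(candidates, screen_threshold=0.8, qc_fraction=0.05):
          # Pass 1: mechanical screen drops anything the detector flags hard.
          screened = [d for d in candidates if detector_score(d) < screen_threshold]
          # Pass 2: cheap human review of everything that got through.
          accepted = [d for d in screened if first_human_pass(d)]
          # Pass 3: specialists audit a random sample of the accepted documents,
          # as quality control and to refine the two earlier passes.
          k = min(len(accepted), max(1, round(len(accepted) * qc_fraction)))
          audit_sample = random.sample(accepted, k)
          return accepted, audit_sample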

      It's doable. Is it worth doing? I don't believe so.

    3. MatthiasG

      Re: I don't buy their reasoning

      One of the authors of the new paper here! Thank you for your interest in the topic, and I just wanted to chime in on this with a couple of clarifications:

      > But that will become a lower and lower percentage of the total amount of data

      We do actually show in our work that test error can remain stable even if the fraction of "real" data becomes smaller and smaller! The only assumption we make is that the real data doesn't actively get deleted, but then it's not like that data will get "drowned out" by adding synthetic data on top of it; plus, in a number of other aspects our assumptions are still very pessimistic. To be clear, we're not saying that there aren't *any* circumstances where model collapse can occur - indeed, the fantastic work done by our colleagues quoted in the article shows precisely that there are such circumstances. What we do show is that synthetic data ending up on the internet and in future training datasets doesn't by itself mean catastrophic failure for future AI training.

      > there is no way they'll be able to prevent it unless the AI grabbing training data is able to discern human or machine generated data

      All of this works even if you cannot discern what is real and what is synthetic! In both our theory and experiments we assume that all (real and synthetic) data gets accumulated into one big pile with no way to tell one from the other.
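
      The effect is easy to reproduce in a toy setting, too. This sketch is mine, far simpler than the paper's experiments - a 1-D Gaussian repeatedly refitted to its own samples, with and without keeping the old data:

      import random
      import statistics

      def fitted_sigmas(accumulate, n_gens=20, n=200):
          pool = [random.gauss(0, 1) for _ in range(n)]   # the original real data
          sigmas = []
          for _ in range(n_gens):
              mu, sigma = statistics.mean(pool), statistics.stdev(pool)
              synthetic = [random.gauss(mu, sigma) for _ in range(n)]
              # Replace: train the next model only on the newest synthetic data.
              # Accumulate: keep everything; the real data is never deleted.
              pool = pool + synthetic if accumulate else synthetic
              sigmas.append(sigma)
          return sigmas

      print("replace:   ", [round(s, 2) for s in fitted_sigmas(False)])
      print("accumulate:", [round(s, 2) for s in fitted_sigmas(True)])
      # With replacement the fitted sigma random-walks away from the true value
      # as sampling errors compound; with accumulation it stays pinned near 1,
      # even though real data becomes an ever smaller fraction of the pool.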

  10. Judge Jury Executioner

    The term "echo chamber" comes to mind ...

    .... hopefully this won't all end in tears - or Skynet ;)

    1. Michael Wojcik Silver badge

      Re: The term "echo chamber" comes to mind ...

      I suspect gradual and overwhelming enshittification is more likely. Widely-deployed LLMs will make it easier to conduct ideological warfare, so what we'll probably see in the near future is ever more partisan squabbling, wasted effort as groups undermine one another (and then themselves), and small-scale military conflict.

      I think people who dismiss the possibility of extinction via ASI are far too cavalier about it; the x-risk arguments that have been advanced are complex and nuanced. But I'm not convinced we'll get to ASI, because we're doing a pretty good job at marching toward wrecking civilization.

  11. Bebu
    Windows

    Brouwer's Least Fixed Point Theorem?

    The least fixed point of this nonsense is a sea of codswallop stretching from horizon to horizon.

    I imagine that at any time AI in toto can be expressed as a higher-order function which is applied to itself, its initial training set and any subsequent non-AI-generated input. As the non-AI input is inevitably going to be a smaller and smaller proportion of the total input, the net effect will be a very expensive, environmentally harmful, high-tech navel gazer. :)

    XLII - save a lot of bother and just write it on the wall - it's the answer fwiw. :)

    1. Anonymous Coward
      Anonymous Coward

      Re: Brouwers Least Fixed Point Theorem?

      The oft-made mistake here is viewing the amount of publicly scraped garbage as a variable that people building models don't have control over. So there is not much inevitable in the proportions. As the utility of blindly scraped material trends down, the weighting will either shift away from it, or people will stop using it. More of an entropic equilibrium than an inevitable accelerating breakdown.

      In reality the scrapers will have to be less indiscriminate, and the value of curated data will offset the cost of moderation and reputation on even public information sources. Low-quality sites with no reputation and curation will rapidly degrade until the scrapers de-index them. This will probably have profound effects on what the public internet looks like, but the play for the model builders is straightforward: cost/benefit drivers will steer them to steal what they can, and pay for what they must. Any application that can be made to work acceptably well and turn a profit will survive, and if it costs more than it's worth, it will get shut down eventually. So we will see selection and competition bring an end to this Cambrian explosion and, as the hype cycle tires, the first of a cycle of mass extinction events.

  12. sabroni Silver badge
    Happy

    the problem of training AI on AI-made data isn't significant....

    ...given that the output is bollocks when they are trained on human input.

    FTFY!

  13. DoctorPaul Bronze badge

    Sorry just don't get it

    Someone correct me if I'm wrong but I believe that even the most ardent fans of ChatGPT accept that these systems sometimes "hallucinate" and state things as fact when they simply are not. Ergo you cannot trust anything these systems produce without checking the authenticity of the output.

    So just what is the point? It reminds me of "50% of my advertising budget is wasted, I just don't know which 50%".

    Still, what do I know? Maybe just a little, given that my PhD is in AI.

    1. Brewster's Angle Grinder Silver badge

      Re: Sorry just don't get it

      There are plenty of creative fields where accuracy is not important.

      Look at the videos they generate. Flawed, yes. But most of us could never acquire the budget to create anything half as good. And if humans only have to fix the flaws, that's probably a cost saving.

      Or, I wrote to a recruiter the other day. My letters rarely get noticed. I fed the job spec into an AI, copy-edited the BS it generated, and sent the result. The recruiter was beating a path to my door; i.e. I couldn't lie, but the AI could do it for me. I'm sure I could have copied specimen cover letters from job sites and synthesised something similar myself, but I felt shitty enough as it was. And it was all done at the push of a button; then I could take a shower.

      Come to that, surely locating job adverts on the internet and matching CVs to the job is something an AI could do. Maybe a human could do it a bit better. But the cost of being a "bit better" than an AI is not worth the price.

      To date, it has been very expensive to build probabilistic models that can be used by software to make decisions. AI makes that cheap - even if it trades accuracy and rigour. Good enough is fine for many roles. A worked example would be automated translation. A high-quality translation needs a human fluent in both languages and cultures. But most of us can't afford that, and Google Translate allows us to have the benefits of translation in everyday life. Similarly, AI will democratise skills that are amenable to AI's techniques.

    2. Michael Wojcik Silver badge

      Re: Sorry just don't get it

      Ergo you cannot trust anything these systems produce without checking the authenticity of the output.

      You misspelled "should not trust". Plenty of people do so regardless.

  14. Brewster's Angle Grinder Silver badge

    For once, the stars align.

    It's in society's best interests to know what has been ejaculated by an AI. But if model collapse is real then it is also in the AI industry's own best interests to ensure AI-generated content (whether partially or totally generated by AI) is labelled. Being able to find authentic human data is going to be priceless for all parties.
