Re: The internet archive?
Would some of the downvoters care to share what is wrong with my comment? I'm not trying to troll or to claim the downvoters are wrong. I'm trying to understand whether I failed to make my point clear, whether I got it wrong myself, or whether it's just sentiment voting.
The point I'm trying to make is: as a human, in science and research, it is not only normal but required to include references to where you got your information in publications. That allows human readers to trace and re-evaluate the quality of the information, and it generally leads to better learning in the scientific community.
I understand that the bulk of the information these companies scrape for their LLMs doesn't contain such references. Yet for much basic information, like language structure and grammar, physics, mathematics, good program structure, thermodynamics, astronomy... old quality resources are as good as, or often better than, the new popular snippets of information you find on the internet today. Using good quality resources is a no-brainer when trying to learn a skill or a new field.
This general, basic information is (or rather, should be!) the basis for training the various LLMs. And that information can be kept free of model collapse by using a very simple filter: "original publication date" < "date LLMs began polluting the information available on the internet".
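As a minimal sketch of that filter: the cutoff date below and the `original_publication_date` field are illustrative assumptions (the exact date LLM output began flooding the web is debatable; ChatGPT's late-2022 release is one candidate), not a real pipeline.

```python
from datetime import date

# Hypothetical cutoff: roughly when LLM-generated text started flooding the web.
CUTOFF = date(2022, 11, 30)

def predates_llm_era(doc: dict) -> bool:
    """Keep only documents first published before the cutoff.

    `doc` is assumed to carry an `original_publication_date` field;
    undated documents are dropped as unverifiable.
    """
    pub = doc.get("original_publication_date")
    return pub is not None and pub < CUTOFF

corpus = [
    {"title": "Classical mechanics lecture notes", "original_publication_date": date(1999, 5, 1)},
    {"title": "Viral AI listicle", "original_publication_date": date(2024, 2, 10)},
    {"title": "Undated scrape", "original_publication_date": None},
]

clean = [d for d in corpus if predates_llm_era(d)]
print([d["title"] for d in clean])  # only the 1999 document survives
```

The hard part in practice is of course establishing the *original* publication date of a scraped page, which is exactly what the correlation idea below is about.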
As for newer information, LLMs excel at stochastically "chewing" on information. Categorizing billions of pages by topic and specific content is what they do. That makes correlating them a lot easier: think "web page A", "web page B" and "web page C"... show 92% correlation in their information on topic X. Then sort those by date of first publication. If "web page C" predates all the others, chances are a lot higher that it is closest to the original source.
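A toy sketch of that heuristic, under loud assumptions: the token sets, the Jaccard similarity measure, the 0.5 threshold and the field names are all placeholders for whatever real content-similarity machinery an AI company would use.

```python
from datetime import date

def jaccard(a: set, b: set) -> float:
    """Crude content-overlap measure between two bags of tokens."""
    return len(a & b) / len(a | b) if a | b else 0.0

pages = [
    {"url": "A", "tokens": {"x", "y", "z", "w"}, "first_published": date(2021, 6, 1)},
    {"url": "B", "tokens": {"x", "y", "z", "q"}, "first_published": date(2022, 3, 5)},
    {"url": "C", "tokens": {"x", "y", "z", "w"}, "first_published": date(2015, 1, 20)},
]

# Cluster pages whose content on topic X correlates highly
# (here: similarity to page A above an arbitrary threshold).
cluster = [p for p in pages if jaccard(p["tokens"], pages[0]["tokens"]) > 0.5]

# Within the cluster, the earliest first-publication date is the
# best guess at the original source.
likely_source = min(cluster, key=lambda p: p["first_published"])
print(likely_source["url"])  # "C" predates the others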
It's by no means a foolproof method of getting to the source. It's just one of plenty of methods by which source information could be weighted for how much it contributes to the training process. And if model collapse were starting to become a problem, a combination of many such techniques would be "needed" to counter it. Curation of information, by humans and machines, will be a valuable thing (to those AI companies).
Chances are that the cost of that will be so high that they will forgo it. That remains to be seen. In the meantime, don't be gullible. Stack Exchange made a U-turn not only on selling "its" information, but also on allowing AI-generated information to be posted on its fora. The latter is not just a way to excuse itself by saying "see, we take AI seriously as a valuable resource". It contains a hidden sting:
It allows OpenAI to post its own generated questions and answers on topics its models are uncertain about, remember that it posted those questions and answers *itself* (so that it won't slurp them back up), and then learn from the (mixture of) human (and machine) users pointing out to "the original poster" what is wrong in the answer. In this way, OpenAI gets "free reinforcement learning", provided it manages to filter the answers by quality (and things *as simple as* comparing the upvotes and downvotes of *known* human users will help it estimate answer quality).
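The vote-based quality filter at the end could be sketched like this; every field name, the human/bot flag and the threshold are hypothetical illustrations of the idea, not any real Stack Exchange or OpenAI API.

```python
# Weight each machine-posted answer by the net votes it received from
# accounts believed to be human; bot votes are ignored.

def answer_quality(votes: list) -> int:
    """Net score counted over known-human voters only."""
    return sum(v["value"] for v in votes if v["voter_is_human"])

answers = {
    "machine_answer_1": [
        {"value": +1, "voter_is_human": True},
        {"value": -1, "voter_is_human": True},
        {"value": +1, "voter_is_human": False},  # bot vote, ignored
    ],
    "machine_answer_2": [
        {"value": +1, "voter_is_human": True},
        {"value": +1, "voter_is_human": True},
    ],
}

# Keep only answers with a clearly positive human score as training signal.
keep = {name for name, votes in answers.items() if answer_quality(votes) >= 2}
print(keep)  # {'machine_answer_2'}
```

The design choice worth noting is the restriction to *known human* voters: without it, the feedback loop would be trivially pollutable by the very machine output it is meant to grade.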