Re: Guess who copied that line...
"the AI was trained by reading the actual text. That, of itself, is not a copyright violation, as far as I know."
Why wouldn't that be a copyright violation? Any unauthorized copying is a violation by default, unless you can claim a Fair Use exemption, which covers things like research, education, noncommercial use, and transformative use.
Scraping up enormous amounts of copyrighted works in order to sell access to a machine that generates content that competes directly with those works clearly doesn't fall under this exemption. Ask ChatGPT itself:
----
Purpose and character of the use:
OpenAI is a for-profit entity selling access to GPT-4. Commercial use can weigh against fair use. Given this commercial intent and the potential for monetization, this factor is more likely to be seen as a potential copyright violation than if the use were strictly non-commercial.
Nature of the copyrighted work:
Common Crawl contains a mix of factual and highly creative content. Using factual content generally leans towards fair use, while using creative content can weigh against it. Given the mix, this factor is ambiguous, but the presence of creative content might make it more likely to be considered a potential copyright violation, especially if significant portions of the dataset are creative.
Amount and substantiality of the portion used:
If GPT-4 was trained on vast amounts of data from the web, it's possible that it was exposed to large portions or the entirety of specific copyrighted works, even if indirectly. This factor might weigh against fair use and towards potential copyright violation, especially if whole works or significant portions of them are used.
Effect on the potential market or value:
If GPT-4's outputs can serve as a substitute for original content (even if transformative), it could impact the market for the original work. Considering this and the potential for competition, this factor is more likely to be seen as a potential copyright violation.
Procurement of Data:
Independently of how the data is used, the act of scraping, storing, and processing copyrighted content without explicit permission could be seen as infringement. Given that Common Crawl scrapes a vast portion of the web, without distinction between copyrighted and non-copyrighted content, the procurement and storage aspect is more likely to be considered a potential copyright violation.
Raw Data in Model Weights:
While neural networks store patterns rather than exact replicas of data, large models might, in specific cases, reproduce snippets of their training data. If GPT-4 can reproduce copyrighted content verbatim or nearly so, even in small snippets, this could be considered a form of copying. This makes it more likely to be seen as a potential copyright violation.
----
"For it to be a copyright violation he has to demonstrate that the model has not just read but has stored an infringing amount of the actual text."
That's not true, as shown above. But even taken on its own terms, this argument fails: arXiv:2311.17035 has shown that these OpenAI models really do memorize parts of their training data verbatim and can be made to emit it.