* Posts by Anonymous Kitten

4 publicly visible posts • joined 4 Dec 2023

91% of polled Amazon staff unhappy with return-to-office, 3-in-4 want to jump ship

Anonymous Kitten

<blockquote>Bottom line for an employee is - or should be - pay minus expenses.</blockquote>

Not for me. My priorities for future jobs are 1. fulfilling work and 2. work/life balance. Pay is just a number; a means to an end.

Anonymous Kitten

Confidence ≠ Competence

Pulitzer Prize winning author Michael Chabon and others sue OpenAI

Anonymous Kitten

Re: IP land grab

"Reading is not copying."

This is literally copying. They're scraping up enormous amounts of copyrighted works, copying them to their servers, and using them to train a machine that generates content that competes directly with those works. That's clearly a copyright violation and doesn't fall under Fair Use.

For noncommercial research and educational purposes? Sure.

For transformative use like a search engine? Sure.

To produce a for-profit machine that generates content that competes with the copyright holders? Nope.

Anyway, arxiv 2311.17035 has shown that these OpenAI models really do memorize their training data verbatim, so if that argument works on you, there you go.

"I understand their panic, but we mustn't allow copyright holders to use "AI" as an excuse to extend their grip on the intellectual domain even further."

What a bizarre argument. The AI companies are the ones doing the "land grab", scraping up millions of people's work and selling it back to them without compensation or attribution, ignoring the requirements of every copyleft license, putting them out of a job, etc. etc.

Anonymous Kitten

Re: Guess who copied that line...

"the AI was trained by reading the actual text. That, of itself, is not a copyright violation, as far as I know."

Why wouldn't that be a copyright violation? Any unauthorized copying is a violation by default, unless you can claim a Fair Use exemption, which covers things like research, education, and noncommercial or transformative use.

Scraping up enormous amounts of copyrighted works in order to sell access to a machine that generates content that competes directly with those works clearly doesn't fall under this exemption. Ask ChatGPT itself:

----

Purpose and character of the use:

OpenAI is a for-profit entity selling access to GPT-4. Commercial use can weigh against fair use. Given this commercial intent and the potential for monetization, this factor is more likely to be seen as a potential copyright violation than if the use were strictly non-commercial.

Nature of the copyrighted work:

Common Crawl contains a mix of factual and highly creative content. Using factual content generally leans towards fair use, while using creative content can weigh against it. Given the mix, this factor is ambiguous, but the presence of creative content might make it more likely to be considered a potential copyright violation, especially if significant portions of the dataset are creative.

Amount and substantiality of the portion used:

If GPT-4 was trained on vast amounts of data from the web, it's possible that it was exposed to large portions or the entirety of specific copyrighted works, even if indirectly. This factor might weigh against fair use and towards potential copyright violation, especially if whole works or significant portions of them are used.

Effect on the potential market or value:

If GPT-4's outputs can serve as a substitute for original content (even if transformative), it could impact the market for the original work. Considering this and the potential for competition, this factor is more likely to be seen as a potential copyright violation.

Procurement of Data:

Independently of how the data is used, the act of scraping, storing, and processing copyrighted content without explicit permission could be seen as infringement. Given that Common Crawl scrapes a vast portion of the web, without distinction between copyrighted and non-copyrighted content, the procurement and storage aspect is more likely to be considered a potential copyright violation.

Raw Data in Model Weights:

While neural networks store patterns rather than exact replicas of data, large models might, in specific cases, reproduce snippets of their training data. If GPT-4 can reproduce copyrighted content verbatim or nearly so, even in small snippets, this could be considered a form of copying. This makes it more likely to be seen as a potential copyright violation.

----

"For it to be a copyright violation he has to demonstrate that the model has not just read but has stored an infringing amount of the actual text."

That's not true, as shown above, but the argument fails even on its own terms: arxiv 2311.17035 has shown that these OpenAI models really do memorize their training data verbatim.