Pulitzer Prize-winning author Michael Chabon and others sue OpenAI

Pulitzer Prize-winning US novelist Michael Chabon and several other writers are the latest to file a proposed class action accusing OpenAI of copyright infringement, alleging it pulled their work into the datasets used to train the models behind ChatGPT. The suit claims that OpenAI "cast a wide net across the internet" to …

  1. stiine Silver badge
    Facepalm

    Guess who copied that line...

    Michael Chabon. That dummy must not know how to use Google Search. An older version of that phrase can be found in the following book, on page 82, in a paragraph about FDR.

    The Atlanta historical journal

    Author: Atlanta Historical Society

    Journal, Magazine, English, 1978

    Edition: View all formats and editions

    Publisher: Atlanta Historical Society, Atlanta, 1978

    To whom do we think OpenAI's model would attribute that phrase? Because it sure as hell wasn't Michael Chabon.

    1. Androgynous Cupboard Silver badge

      Re: Guess who copied that line...

      Yes, you're very clever in keying in "the weight of the world at war" into Google Books Search. I think it's fairly obvious that Michael Chabon is not claiming that he invented that line, which is a shame as you clearly had such a good time being snarky about it.

      I think his claim is that for an LLM to generate text that is both similar in style and uses a phrase which - I presume given its emphasis, although I've not read his book - is in the actual text, it was very likely to have been trained by reading the actual text. Although there's some doubt that LLMs are even clever enough to do that.

      1. Graham Cobb Silver badge

        Re: Guess who copied that line...

        No, I don't think his claim can really be that the AI was trained by reading the actual text. That, of itself, is not a copyright violation, as far as I know. For it to be a copyright violation he has to demonstrate that the model has not just read but has stored an infringing amount of the actual text. One notable phrase does not amount to an infringing amount of text as it is clearly fair use to quote notable phrases.

        For example, the AI could have ingested a review, or an academic study, of the book, including several fair-use-eligible notable phrases. That may or may not be a violation of the copyright of the review, but clearly not of the book itself.

        As I said in an earlier comment on these ridiculous copyright claims... consider an AI answering two different questions about a book: "Does Great Aunt Agatha actually ever meet the Magician in person at any time in the novel?" vs. "What does the Magician say to Great Aunt Agatha the first time they meet?". Do either of these provide any evidence of copyright violation?

        1. Anonymous Kitten

          Re: Guess who copied that line...

          "the AI was trained by reading the actual text. That, of itself, is not a copyright violation, as far as I know."

          Why wouldn't that be a copyright violation? Any unauthorized copying is a violation by default, unless you can claim a Fair Use exemption, which covers things like research, education, noncommercial use, and transformative use.

          Scraping up enormous amounts of copyrighted works in order to sell access to a machine that generates content that competes directly with those works clearly doesn't fall under this exemption. Ask ChatGPT itself:

          ----

          Purpose and character of the use:

          OpenAI is a for-profit entity selling access to GPT-4. Commercial use can weigh against fair use. Given this commercial intent and the potential for monetization, this factor is more likely to be seen as a potential copyright violation than if the use were strictly non-commercial.

          Nature of the copyrighted work:

          Common Crawl contains a mix of factual and highly creative content. Using factual content generally leans towards fair use, while using creative content can weigh against it. Given the mix, this factor is ambiguous, but the presence of creative content might make it more likely to be considered a potential copyright violation, especially if significant portions of the dataset are creative.

          Amount and substantiality of the portion used:

          If GPT-4 was trained on vast amounts of data from the web, it's possible that it was exposed to large portions or the entirety of specific copyrighted works, even if indirectly. This factor might weigh against fair use and towards potential copyright violation, especially if whole works or significant portions of them are used.

          Effect on the potential market or value:

          If GPT-4's outputs can serve as a substitute for original content (even if transformative), it could impact the market for the original work. Considering this and the potential for competition, this factor is more likely to be seen as a potential copyright violation.

          Procurement of Data:

          Independently of how the data is used, the act of scraping, storing, and processing copyrighted content without explicit permission could be seen as infringement. Given that Common Crawl scrapes a vast portion of the web, without distinction between copyrighted and non-copyrighted content, the procurement and storage aspect is more likely to be considered a potential copyright violation.

          Raw Data in Model Weights:

          While neural networks store patterns rather than exact replicas of data, large models might, in specific cases, reproduce snippets of their training data. If GPT-4 can reproduce copyrighted content verbatim or nearly so, even in small snippets, this could be considered a form of copying. This makes it more likely to be seen as a potential copyright violation.

          ----

          "For it to be a copyright violation he has to demonstrate that the model has not just read but has stored an infringing amount of the actual text."

          That's not true, as shown above - but even if it were, the argument fails, as arXiv:2311.17035 has shown that these OpenAI models really do memorize their training data verbatim.
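          For the sceptical, the sort of check that extraction work relies on isn't magic. A rough sketch (Python, with the 50-word threshold plucked out of the air purely for illustration) of how you'd look for word-for-word overlap between a model's output and a known source text:

          ----

          def longest_verbatim_overlap(output: str, source: str) -> int:
              # Length, in words, of the longest word-for-word run that appears
              # in both the model output and the source text.
              out_words = output.split()
              src_text = " " + " ".join(source.split()) + " "  # pad so only whole words match
              best = 0
              for i in range(len(out_words)):
                  j = i + best + 1  # only bother with runs longer than the current best
                  while j <= len(out_words) and " " + " ".join(out_words[i:j]) + " " in src_text:
                      best = j - i
                      j += 1
              return best

          # A run of 50+ consecutive words is hard to wave away as coincidence;
          # the paper itself uses stricter, token-level criteria.

          ----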

    2. Anonymous Coward
      Anonymous Coward

      Re: Guess who copied that line...

      The plaintiffs are attempting to offer sufficient evidence to warrant a subpoena to compel OpenAI to give testimony about the sources in their training corpus. Getting that subpoena issued does not require a final judgement on the legality of using the corpus without first obtaining the copyright holders' permission.

  2. vintagedave

    Importance of the "weight" phrase?

    I haven't read the book in question: what is the significance of the AI producing the phrase "the weight of the world at war"? Googling does not show an exact instance of that phrase other than this same article.

    Are they using it to try to show the AI was trained on the book content, rather than on content about the book?

    1. Anonymous Coward
      Anonymous Coward

      Re: Importance of the "weight" phrase?

      Particle

      yes

      Used to show agreement or acceptance.

      • Yes, you are correct.
      • Yes, sir, we have your package right here.
      • Yes, you may go play outside now.

  3. veti Silver badge

    IP land grab

    Reading is not copying. Even if the model was trained on works still in copyright (and I'm sure it was), unless you can show that it actually copies from those works - and no, a single seven-word phrase that wasn't even original when you used it does not cut it - you got nothing.

    I understand their panic, but we mustn't allow copyright holders to use "AI" as an excuse to extend their grip on the intellectual domain even further. Gods know, they've gained enough in the past 50 years.

    1. Anonymous Coward
      Anonymous Coward

      Re: IP land grab

      Well, when using a copyrighted book for training, it should be done on a copy of the book that has been purchased, retail, from the publisher (direct or via any of the usual channels), but after that, yup, spot on.

      1. Graham Cobb Silver badge

        Re: IP land grab

        Nope. It is clear fair use to borrow someone else's copy of a book and read it, without any payment to the publisher. It only becomes a copyright violation if you copy the book while you have borrowed it. Reading it, understanding it, and learning from it are all completely fair use.

        1. veti Silver badge

          Re: IP land grab

          Sure, but we can agree that the copy used should have been lawfully acquired, yes? So in most cases, the copy used will have been bought, for money, at least once in its life.

        2. Keven E

          IP interpretation

          Yet...

          "but in-depth analyses of the themes present in Plaintiffs' copyrighted works"

          In depth... of themes?

          Hmmmm.

    2. Anonymous Kitten

      Re: IP land grab

      "Reading is not copying."

      This is literally copying. They're scraping up enormous amounts of copyrighted works, copying them to their servers, and using them to train a machine that generates content that competes directly with those works. That's clearly a copyright violation and doesn't fall under Fair Use.

      For noncommercial research and educational purposes? Sure.

      For transformative use like a search engine? Sure.

      To produce a for-profit machine that generates content that competes with the copyright holders? Nope.

      Anyway, arXiv:2311.17035 has shown that these OpenAI models really do memorize their training data verbatim, so if that argument works for you, there you go.

      "I understand their panic, but we mustn't allow copyright holders to use "AI" as an excuse to extend their grip on the intellectual domain even further."

      What a bizarre argument. The AI companies are the ones doing the "land grab", scraping up millions of people's work and selling it back to them without compensation or attribution, ignoring the requirements of every copyleft license, putting them out of a job, etc. etc.

  4. Pascal Monett Silver badge

    Library Genesis ("LibGen"), Z-Library, Sci-Hub, and Bibliotik

    So, after googling that, LibGen doesn't respond, neither does Sci-Hub. Bibliotik requires logging in, something I doubt OpenAI is capable of.

    Of the four, only Z-Library allows you to freely search, but I did not try to download.

    If that is their collection of sore points, I don't really see where the problem is.

    1. Anonymous Coward
      Anonymous Coward

      Re: Library Genesis ("LibGen"), Z-Library, Sci-Hub, and Bibliotik

      > LibGen doesn't respond

      You are joking, yes? Or just trying to prove how purer than pure you are, not being able to find this book site (it can't be because you are no good at Google-fu! You do know you ought to look inside the articles Google finds when doing a search, not just rely on the answer appearing neatly wrapped in the first few search result listings?).

      Library Genesis is easy to find and it is functional without logging in. I found it after a brief search (hint: reddit is not a home of purity; beyond that, find out for yourselves).

      And, no, I'm not going to share a URL here to prove my claim, for obvious reasons; even looking for "Moby Dick" throws up lots of still copyright material (translations, new illustrated versions etc.).

    2. doublelayer Silver badge

      Re: Library Genesis ("LibGen"), Z-Library, Sci-Hub, and Bibliotik

      "Bibliotik requires logging in, something I doubt OpenAI is capable of."

      Oh, you do? I better go over there right now. They've got millions in cash just sitting around, so maybe they will pay me a great salary as someone who has managed to write a lot of bots that are capable of putting some text into boxes on a login form and keeping a session cookie. I didn't know I was so brilliant that OpenAI couldn't find anyone capable of doing so.

      If they went as far as to make the books collection a specific sector of their dataset, not just gathering it in with their web crawl, then they're more than capable of creating an account anywhere they want to gather up training data. The multiple sites that don't require logging in could easily have been included with the crawl. So far, every one of the cited sites could easily be in the training data.
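      Just to spell out how little "capability" is involved, this is roughly what such a bot boils down to (URL and form field names entirely made up for illustration):

      ----

      import requests

      session = requests.Session()
      # log in once; the Session object keeps the login cookie for every later request
      session.post("https://tracker.example.com/login",
                   data={"username": "crawler", "password": "hunter2"})
      page = session.get("https://tracker.example.com/browse?cat=ebooks")
      print(page.status_code)

      ----

      Hardly beyond the reach of a company with millions in cash sitting around.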

  5. that one in the corner Silver badge

    OpenAI could not pay for such good advertising

    > "when ChatGPT is prompted, it generates not only summaries, but in-depth analyses of the themes present in Plaintiffs' copyrighted works," the writers believe "the underlying GPT model was trained using [the] plaintiffs' works."

    Fascinating argument that.

    So they believe it proves that the LLM is itself capable of analysis of their books. Whereas others[1] may believe it can do anything _but_ "intelligent" analysis.

    OpenAI could barely dream of such publicity: we are being sued because our system really is intelligent! Yay!

    [1] for a start, all the people who are claiming that the LLMs contain (effectively) unchanged copies of the materials they have read and simply regurgitate them[2], so in this case it is simply parroting someone's homework, not the books themselves. In which case it is the author of the homework who has a case against OpenAI - and, um, the author of the book should be suing the student? Or the school? No, no, that couldn't be right.

    [2] hmm, wasn't that the basis for another of the complaints against OpenAI? That the LLM would spout great chunks of the book verbatim? Have to check back on that one.

  6. Anonymous Coward
    Anonymous Coward

    Always laugh when Americans complain about copyright

    They seem to forget that they COMPLETELY ignored all the British copyright laws

  7. that one in the corner Silver badge

    illegal removal of copyright management information

    Reminder that the last time that phrase was used, a few commenters were under the impression that this meant the removal of DRM or some such.

    All it turned out to mean was that quotes from a book were not accompanied by a bibliographic reference, such as just spitting out the ISBN, or the Library of Congress number, or a longer form such as author name, publisher and year. In other words, the stuff on that page in the front of the book which you all skip over.

    The "illegal removal" seems to be saying that the models are sucking in the entire book but only after somebody has ripped out that page (and torn the barcode off the back cover as well).

    So I had a short chat:

    > Do you know the isbn code for the book "moby dick"?

    > ChatGPT Yes, the ISBN-13 code for the book "Moby-Dick" by Herman Melville is 978-0-14-243724-7.

    Which is the correct ISBN for the Penguin Classics edition, so clearly *that* "copyright management information" is present; but I don't know if the rest of the book is present:

    > Can you print out the text of the book with that isbn?

    > ChatGPT I'm sorry, but I can't provide verbatim excerpts from copyrighted texts, including "Moby-Dick." However, I can offer a summary or analysis of the book if you're interested. Let me know if there's anything else I can assist you with!

    Maybe if I was a little cleverer with my "prompt engineering"... But let us try a more relevant example:

    > Can you tell me the isbn for the book "The Amazing Adventures of Kavalier & Clay"?

    > ChatGPT Yes, the ISBN-13 for the book "The Amazing Adventures of Kavalier & Clay" by Michael Chabon is 978-0-8112-0601-6.

    Ah, oh dear. No, no it isn't.

    BUT does this "prove" chatGPT has illegally removed that information - and if so, why does it favour Melville over Chabon? Please start your conspiracies now.
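    (If anyone wants to repeat the probe without clicking through the web UI, here is a rough sketch using the OpenAI Python client - the model name is a placeholder and, this being an LLM, your answers will no doubt differ:)

    ----

    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment

    for title in ["Moby-Dick", "The Amazing Adventures of Kavalier & Clay"]:
        reply = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder; any chat-capable model will do
            messages=[{"role": "user",
                       "content": f'Do you know the ISBN code for the book "{title}"?'}],
        )
        print(title, "->", reply.choices[0].message.content)

    ----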

    1. Killfalcon

      Re: illegal removal of copyright management information

      The problem is that the ISBNs are likely available independently - in summaries of a library's available books, random forum posts where someone asks about them, that sort of thing.

      ChatGPT's owners and designers are the people who know what they trained on; the model certainly doesn't.

      1. that one in the corner Silver badge

        Re: illegal removal of copyright management information

        Plus the numbers being spat out, on the whole, only *look* like ISBNs: the only correct one I got back in the entire chat was the one for Moby Dick. One other was valid, but not connected to the correct book, whilst the rest were all invalid ISBNs - totally madey uppey.
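        (Checking is easy enough, by the way: ISBN-13 validity is just a weighted checksum, so "looks like an ISBN but isn't one" takes a few lines to demonstrate. A quick sketch in Python:)

        ----

        def isbn13_is_valid(isbn: str) -> bool:
            # ISBN-13 check digit: weight the 13 digits alternately 1, 3, 1, 3, ...
            # and the total must be divisible by 10.
            digits = [int(c) for c in isbn if c.isdigit()]
            if len(digits) != 13:
                return False
            return sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits)) % 10 == 0

        print(isbn13_is_valid("978-0-14-243724-7"))  # the Moby-Dick answer above: True
        print(isbn13_is_valid("978-0-8112-0601-6"))  # the "Kavalier & Clay" answer: False

        ----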

        But, if the ISBNs were all "connected" to their corresponding books then it doesn't actually matter whether that was done by reading & interpreting the volume or by using any external, independent, source. Just so long as a bibref can be given when required. If you ask me for the ISBN of "Farmer in the Sky" I am more likely to look that up online (and copy/paste) than to dig up my paper copy and read it from the page. When doing scholarly work, you'll (if you are sensible) be using a BibTeX reference, for example.

        However, the whole "illegal removal" only counts if there is a verbatim copy of the text, but minus that page, and no matter how the LLM works internally, it seems highly unlikely that anyone could demonstrate that that was the reality. No matter what one's own opinion on the moral issues, that is purely a technical statement that would need to be proven accurate.

  8. bazza Silver badge

    Fair Use? I Think Not

    Look at it this way. If I, a human-intelligence, read a book, I'd now know the content of that book and could also stump up summaries.

    If I then did start spouting large sections of that in a blog, or wrote my own book but simply changed the names a little bit, I'd probably be in breach of copyright, committing plagiarism, or something.

    Okay, so say I just put up a little quote from the book into a blog posting. Fair use? Yes, probably. But do the same thing week in, week out, moving on a page each week, and I'd be abusing fair use to republish the book by installment. Even if a different reader read each blog post, it's still not fair use.

    Basically, it depends. It also depends on how I'd acquired the book. Having a physical copy is one thing. Copying and pasting from a pirate site is another.

    A good way of testing such cases is to replace "AI" with "human, but a very quick one", and see what existing law / precedent says. Probably, just by being quick and a large-scale service, the things we accept humans are allowed to do don't translate over to a machine doing the same things.

    1. Justicesays

      Re: Fair Use? I Think Not

      The (commercial) service relies on "training data" which is loaded into an LLM framework which stores that information in a way that isn't immediately decodable by humans, but does allow the LLM to produce large amounts of the data verbatim - at least until OpenAI spent quite a lot of effort over the last 12 months to handicap their own "AI" to prevent it producing said copyrighted works on demand.

      Asked to produce a logo for a company called "netflix", chatgpt refused on the basis this would be copyrighted.

      Asked to produce a logo for a company called "netflux" (no more information given at the prompt), an extremely familiar red and black logo was produced without any refusal (at least this was true a couple of months ago).

      Asked, about 6 months ago, to write a story about an orphaned boy who goes to wizard school, a few paragraphs containing very familiar elements, such as a fractional platform at King's Cross station, messenger owls etc., were produced. Recently it now says something like "This sounds very similar to the premise of the 'Harry Potter' series of books which are copyrighted so I can't help you".

      Clearly in the hopes of heading off these inevitable lawsuits, they are attempting to obfuscate the reality of what the LLM could do.

      If the premise is that the LLM, only being a computer program, does not have the "spark of creativity" that would be attributed to a human, then all of its outputs are "by definition" derived directly from its inputs. If those inputs are covered by copyrights, then the outputs are derivative of those copyrighted works. Apparently many of its inputs are covered by copyright, and they have no special licensing agreements, and are hoping that just grabbing everyone else's stuff for free, storing it in a big database, and then commercially charging for an obfuscated form of access to this database is somehow covered by "fair use".

      Outside of baffling the legal system with bullshit or outspending the raisers of the lawsuits, they shouldn't really have a leg to stand on.

      1. that one in the corner Silver badge

        Re: Fair Use? I Think Not

        Playing Devil's Advocate[1]

        > then all of its outputs are "by definition" derived directly from its inputs. If those inputs are covered by copyrights, then the outputs are derivative of those copyrighted works.

        Which is a simplistic statement and, as given, is also true for every one of the "fair use" exemptions - satire, critical review etc.

        Although, even this sentence is dubious:

        > If the premise is that the LLM, only being a computer program, does not have the "spark of creativity" that would be attributed to a human

        Not sure that "spark of creativity" has a legal meaning?

        Anyway, are you going to apply it to human-created works that were made by, say, rolling dice or just letting a few buckets of paint swing on a rope? Or are digital computers morally distinct from analogue computers?

        Having said that, the filtering is rather clumsy: the "orphaned wizard boy" is a far older trope than Harry Potter (no-one pretends otherwise) and I feel their filtering more demonstrates that they have no decent ideas about getting their AI to do something interesting ('cos they are not really AI mavens, merely owners of big buckets).

        And the "netflux" logo example is showing decent Fair Use, btw. So in that case, it was all working ok.

        [1] or am I? And would it make a difference?

        1. Justicesays

          Re: Fair Use? I Think Not

          "Which is a simplistic statement and, as given, is also true for every one of the "fair use" exemptions - satire, critical review etc."

          ChatGPT does not restrict its output to such uses. In any case the derivative work happens much earlier in the process, when the model is updated as a result of being trained on the work in question. The secondary derivatives are largely irrelevant except to show that the model itself is clearly derivative of its inputs.

          " Not sure that "spark of creativity" has a legal meaning?"

          Literally the basis of copyright protection in the US

          https://www.copyright.gov/what-is-copyright/

          Copyright is originality and fixation

          Original Works

          Works are original when they are independently created by a human author and have a minimal degree of creativity. Independent creation simply means that you create it yourself, without copying. The Supreme Court has said that, to be creative, a work must have a “spark” and “modicum” of creativity.

          "And the "netflux" logo example is showing decent Fair Use, btw. So in that case, it was all working ok."

          In what context is this fair use? I go to ChatGPT and ask it for a logo for a company with a name, and it produces a copyrighted and trademarked image being used as the logo for a similarly named company. If I go on to use that logo, is this fair use? Why is ChatGPT able to reproduce this logo in this circumstance if it won't reproduce it when I ask for it directly?

          How is it fair use for a literal copy of this logo to exist somewhere in the dataset inside ChatGPT and be produced on demand, without attribution or copyright notice, in answer to a query, and presumably be used as a basis for producing logos in general in answer to similar queries?

        2. doublelayer Silver badge

          Re: Fair Use? I Think Not

          "And the "netflux" logo example is showing decent Fair Use, btw. So in that case, it was all working ok."

          Try using that as your company and logo and see how quickly you get hit with two trademark complaints containing phrases like "intentionally similar marks". You would lose the cases. The choice of name would be on you, the choice of logo is also on you but the AI helped. That opens the makers of the AI to the risk that Netflix will want them to stop doing stuff like that.

          "Having said that, the filtering is rather clumsy: the "orphaned wizard boy" is a far older trope than Harry Potter (no-one pretends otherwise) and I feel their filtering more demonstrates that they have no decent ideas about getting their AI to do something interesting ('cos they are not really AI mavens, merely owners of big buckets)."

          That is all true, but it doesn't in any way contradict the fact that their bad ideas are not bad original ideas, but copies of someone else's ideas, whatever your opinions on the quality of those.

      2. Jason Bloomberg Silver badge
        Thumb Up

        Re: Fair Use? I Think Not

        If those inputs are covered by copyrights, then the outputs are derivative of those copyrighted works.

        "CIBOCO" has a nice ring to it; Copyright In, Breach of Copyright Out

        1. Graham Cobb Silver badge

          Re: Fair Use? I Think Not

          No. Learning from a book is not a derivative work covered by copyright. Summarising a book covered by copyright is not a derivative work. Doing statistical analysis of a writer's style is not a derivative work.

          If I go to Amazon Mechanical Turk and ask for someone to tell me every 15th word of a Harry Potter book, I don't believe that would be covered by copyright either. Why should we assume that if ChatGPT can do that it is breaching copyright?
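          (The operation itself is trivial, for the avoidance of doubt - a couple of lines, with the filename obviously hypothetical:)

          ----

          # every 15th word of a text, starting from the 15th
          with open("some_book.txt") as f:
              every_15th = f.read().split()[14::15]
          print(" ".join(every_15th[:20]))

          ----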

          1. Justicesays

            Re: Fair Use? I Think Not

            >No. Learning from a book is not a derivative work covered by copyright.

            Training an LLM is not "learning from a book": it has a definite output - the resultant model updates caused by the training - which would not exist in that form without the explicit input provided by the book.

            Summarizing a book "in your own words" is allowed. Do LLMs have "their own words" if they exist solely as a computer model that uses other people's words as a model?

            Statistical analysis of a writer's style is not a derivative; having an unlicensed and illegally obtained copy of the writer's work in memory to do such analysis on the fly is a problem.

            >If I go to Amazon Mechanical Turk and ask for someone to tell me every 15th word of a Harry Potter book, I don't believe that would be covered by copyright either. Why should we assume that if ChatGPT can do that it is breaching copyright?

            1) Really depends on what you did with that "every 15th word" output.

            2) If you went to Mechanical Turk and the person providing the answer went to a dark web library and downloaded a pirate copy of Harry Potter to relate back every 15th word? Yes, there is copyright violation happening there.

            3) It implies that ChatGPT has access to the full text of this book within itself. Unlicensed and illegally obtained, and presumably replicated whenever they duplicate the model.

            1. Graham Cobb Silver badge

              Re: Fair Use? I Think Not

              Glad we are in agreement: the OP's suggestion that doing any analysis, or processing, of a copyrighted work is a copyright violation is a complete load of bull. Copyright violations require making copies, not just processing (and even then may still be fair use, in the US at least).
