Author hopes to throw the book at OpenAI, Microsoft with copyright class action

While the chaos unfolding at OpenAI might be amusing to watch, the issue of copyright infringement continues to plague the upstart after yet another lawsuit was lobbed its way. The crux of the case is all too familiar. An author is not happy that their work has been slurped as training data into OpenAI's text-generating models …

  1. l8gravely

    So what about all the students reading books to write papers?

    So what about all the students who read these books and write papers about the information inside them? Is that stealing too? Just training (or reading as us wet brains call it) is what you do with books and media. It's how you learn and build upon others' works. Even citing small passages is enshrined in the Copyright Act (of the US, no clue on UK laws) as fair use.

    But mostly I feel for all the monks who are out of jobs now that authors can just print their documents on a printer, without any human help! Woe!!!

    1. doublelayer Silver badge

      Re: So what about all the students reading books to write papers?

      The argument is that reading a book as a human and processing the book in a process called "training" aren't the same thing. Thus, just because one is acceptable doesn't mean the other is. It gets philosophical when we start to ask what the model is really doing with the text it is ingesting, but it should be clear that I can't use whatever copyrighted information I want just by calling whatever my program is doing training.

      1. cornetman Silver badge

        Re: So what about all the students reading books to write papers?

        > ...but it should be clear that I can't use whatever copyrighted information I want just by calling whatever my program is doing training.

        It seems to me that the authors would have a point if their works were being duplicated by these systems and if a ChatBot did spew out verbatim passages from their works, then they would have a very real case.

        Someone with a very good memory could read the works and recite them back into printed form and that (with fair use caveats) could constitute copyright infringement.

        But unless it actually does, I don't really see a fundamental distinction between what people do when reading the book and what machines do through training, if these large language models are not merely storing coherent copies of the works.

        As these models become more and more sophisticated and it increasingly seems like the difference between what people and machines do is narrowing, then that distinction is going to become harder and harder to make.

        1. Anonymous Coward

          Re: So what about all the students reading books to write papers?

          Let's make it simpler then. To write articles, paint, create a design - it takes human effort and ingenuity. It is up to the people doing that effort to decide what can be done with it - that's basically what copyright allows them to specify. If someone uses substantial parts of that work to generate a profit by, for instance, repeating it verbatim (read: not even bothering to create a derivative or interpretation), then I don't find it unreasonable that the creators deserve a part of that profit as contributors.

      2. mpi Silver badge

        Re: So what about all the students reading books to write papers?

        > reading a book as a human and processing the book in a process called "training" aren't the same thing.

        Why?

        Both are inferring statistical information about the relation of concepts from ingesting the data, so they can then use that information for unseen and new tasks.

        The only difference is the scale at which that is being done.

        1. Ken Hagan Gold badge

          Re: So what about all the students reading books to write papers?

          "Both are inferring statistical information about the relation of concepts from ingesting the data, so they can then use that information for unseen and new tasks."

          To the extent that those words actually have a meaning that can be nailed down, I don't accept that this is known to be the case. We have no idea whether human learning is a statistical process and we have precious little idea of the kind of statistical knowledge being accumulated in a large language model.

          1. katrinab Silver badge
            Boffin

            Re: So what about all the students reading books to write papers?

            You might not understand it, but plenty of other people do, and conceptually, it is very simple, just dealing with vast amounts of data very quickly.

            1. Ken Hagan Gold badge

              Re: So what about all the students reading books to write papers?

              That "conceptually" is doing a lot of heavy lifting. If there was any real understanding, those other people would be able to create an AI that could explain its choices.

    2. Dr_Bingley

      Re: So what about all the students reading books to write papers?

      Not the same thing, since these 'AIs' are not actually intelligent in the human sense at all. They simply use inductive logic to predict which words they could, or should, use given conditions X and Y. To phrase it differently, they do not understand what these words 'mean' since they do not share in the human experience. A student may attempt to actually understand a given text, then contemplate its implications and integrate them into their own analysis of a situation or problem.

      1. Lon24

        Re: So what about all the students reading books to write papers?

        Yes, basically adding value by integrating it with other information or applying the knowledge gained to a new situation. Even so, if the input to that added value is substantial, it is courteous, and maybe a legal obligation, to give at least a citation.

        That is something I am not aware these bots are designed to do, and I don't know whether an LLM even has the concept of identifying any single input that relates to a given output. It is just a collection of very sophisticated word correlations that are effectively independent of any one work but totally dependent on ALL our work.

  2. Graham Cobb Silver badge

    Zzzzzzzzzz

    > Sancton spent five years and tens of thousands of dollars on the book, secure in the knowledge that the US Copyright Act gives "exclusive rights" as well as "the rights to reproduce the copyrighted work[s]."

    Non-starter. If I read Sancton's book and then set myself up as an expert on the expedition, giving lectures, taking money to help new expeditions learn from the expedition, proofreading other books to correct the spelling of the expedition leader's name, or in any other way make money from the knowledge I gained from reading Sancton's book, I don't owe him a penny.

    And neither does OpenAI.

    Look... I don't like the current pretend-AI (but really LLM) toys any more than the next guy. But for goodness' sake let's knock these copyright claims trying to grab some of the AI-hype-money on the head. Facts can't be copyrighted. The information in Sancton's book is not copyrighted. In some cases the wording he uses may be, but it would be very hard to make a copyright claim on the wording of a fact unless the wording was very, very unusual.

    Stop trying to jump on the bandwagon. Instead, take out your frustration by helping stop the hype - show how the book is much better than the so-called-AI.

    1. Falmari Silver badge

      Re: Zzzzzzzzzz

      You do owe him a penny if you read a copy of the book instead of purchasing the book or borrowing a purchased copy. For example if you borrow Sancton's book from your mate that's fine. But if your mate makes a copy of Sancton's book that they own and gives you the copy, that's copyright infringement.

      OpenAI did not purchase the book or borrow a purchased book. Their training data contains an unauthorized copy, a pirate copy. At the very least OpenAI owe him the cost of the book for every copy they have as training data.

      1. veti Silver badge

        Re: Zzzzzzzzzz

        > OpenAI did not purchase the book or borrow a purchased book. Their training data contains an unauthorized copy, a pirate copy.

        Why do you say that? How do you know?

        1. BrownishMonstr

          Re: Zzzzzzzzzz

          Perhaps in a few years, books will have different licenses for the intended purpose.

          That is, for reading it's fine, but for training LLMs it's not.

          1. veti Silver badge

            Re: Zzzzzzzzzz

            If you sell a book, you have no right to specify who can or can't read it, or how, or why. That's *not* one of the rights copyright gives you.

            And I for one will fight any attempt to create such a right. It may happen, through some sort of backdoor licensing step as you suggest, but if it tries to come to my jurisdiction, I'm prepared to travel down to parliament and camp in the lobby to make them put a stop to it.

        2. doublelayer Silver badge

          Re: Zzzzzzzzzz

          From the claims in the court case. If they had purchased a book, the case would have said something like "Defendant purchased a book but used it for purposes we believe do not qualify as fair use", but it doesn't say that. Their case agrees with the "we don't think it qualifies as fair use" part but includes the additional claim that they didn't purchase a book, and we know from many previous cases that they didn't purchase anyone else's book, so there seems to be no reason to expect that they'd have made an exception for this one. Of course, OpenAI is free to prove otherwise, in which case that part of the claim can be immediately dismissed. Do you see them making that very simple statement and getting rid of that claim? I don't, which suggests that they cannot.

    2. katrinab Silver badge
      Megaphone

      Re: Zzzzzzzzzz

      But ChatGPT is only ingesting the words, not the information or meaning behind those words.

      1. Graham Cobb Silver badge

        Re: Zzzzzzzzzz

        ...which is very clever of them, and why LLM toys can appear to be AI, but are not.

        And completely irrelevant for the copyright claims.

        1. veti Silver badge

          Re: Zzzzzzzzzz

          What makes you think this is any different from what people do?

  3. Dostoevsky Bronze badge

    Darn Right

    OpenAI commercializes its chatbot products. OpenAI wouldn't have chatbots without its datasets. The datasets contain copyrighted work. Ergo, OpenAI is commercializing copyrighted material, and is violating copyright law. Not sure what's so hard to understand about this.

    If they provided copyrighted material for free, for educational or other fair-use purposes, they'd be okay. But they don't *just* do that. They get people to pay for a product they derived from authors and creators on the internet, without the permission of the authors.

    If I sell copies I make of a book or other work, it's theft. If OpenAI does it...it's "training"? That can't be right...

    1. veti Silver badge

      Re: Darn Right

      Well, one thing that's hard to understand is your jump from "the datasets contain copyrighted work" to "OpenAI is commercializing copyrighted material". And then you go on to claim that, apparently because of this, it's "violating copyright law", which is a whole other step that doesn't follow logically from the previous one.

      If you sell copies you make of a book or other work (that you don't have the right to), it's copyright violation, sure. But OpenAI isn't doing that. It's selling access to a system that has (probably) been trained on this book, among many others. But unless it actually regurgitates significant chunks of text, it's not clear how that's a copyright violation.

      Copyright law creates certain very specific, clearly defined "rights" around 1) reproduction, 2) adaptation, 3) publication, 4) performance and 5) display (the "five pillars of copyright"). To make a case against OpenAI, you'd have to demonstrate to a court how it's doing at least one of these things with your work.

      1. katrinab Silver badge
        Megaphone

        Re: Darn Right

        Ingesting the book into its training dataset is a copyright violation.

        The training model is a derivative work of that and many other copyrighted works.

        Just like if you were to take a copy of the source code for Adobe Photoshop and compile it yourself, the binary may look very different to the one Adobe's packaging team produced, but it would still be copyright violation.

        1. veti Silver badge

          Re: Darn Right

          There is no "training model". Well, there may be, but it will be just a list of URLs and titles and possibly search terms. The full corpus of text fed into the model is not a thing that exists.

    2. mpi Silver badge

      Re: Darn Right

      > The datasets contain copyrighted work. Ergo, OpenAI is commercializing copyrighted material, and is violating copyright law. Not sure what's so hard to understand about this.

      Uh huh.

      So, how many copyrighted works would you say the brain of an average tech student digests? How many copyrighted books, articles, videos, lecture materials, etc. is that student's mind absorbing information from?

      And what does the student do with that info later? Why, he's building a skillset, which he then commercializes by offering it to prospective employers.

      > The datasets contain copyrighted work.

      Does it?

      Please, open an open source AI model, and show me where exactly in that collection of float32 values I can find the ingested data. Also, small question here: If the model does "contain" ingested data, why does the model's size not correlate with the amount of data ingested? After all, a 5GiB diffusion model is 5GiB regardless of whether it is trained on 1, 100 or 10E12 images.

  4. Doctor Syntax Silver badge

    Picking up the nearest paper-back I find a copyright declaration that says "All rights reserved" and then goes on to say "No part of this publication may be reproduced, stored in a retrieval system or transmitted without the prior permission..." (my emphasis).

    That's from a 1985 printing so such declarations have been around for a long time. You buy a book, you have the right to read it, you have the right to pass it on in the same binding (imposing similar conditions on the recipient if you do) and that's it. No right to slurp it into any system. Nothing could be clearer.

    1. Anonymous Coward

      And which part of this copyright declaration was violated?

      > No part of this publication may be reproduced

      Is inferring statistical information about how words and word parts relate to each other reproduction?

      > stored in a retrieval system

      See above, same question.

      > or transmitted

      Well, since the training algorithm gets its data from online sources, I'd assume that someone uploaded the book to a reachable server.

    2. HandleAlreadyTaken

      > you have the right to pass it on in the same binding

      Dammit, I got some of my books rebound - either older books that were falling apart, or gifts I was trying to personalize - and gave some of them to friends and family. Did I break copyright? Should I get permission from the author to get a nice leather binding on my books?

      Or, what if I write some marginalia in the book? Does this make the book a derived work, and I can't allow other people to read the book anymore without explicit permission from the copyright owner?

      1. Doctor Syntax Silver badge

        It's a paperback. The terms do allow for the owner to rebind it but not for circulation. Maybe the intent is that libraries should buy the hardback edition.

        Booksellers in the past (?C18th) sold books in paperback form so the purchaser could have them bound to match the rest of his books.
