Judge tosses publishers' copyright suit against OpenAI

A US judge has thrown out a case against ChatGPT developer OpenAI which alleged it unlawfully removed copyright management information (CMI) when building training sets for its chatbots. Publishers Raw Story and AlterNet allege that when OpenAI removed the description of the copyright status, it resulted in a "concrete injury." …

  1. druck Silver badge

    What is the point...

    ...of creating any content and putting it on the internet, when AI is going to suck it up, and only billion-dollar companies profit?

    The court has given them the green light even before Trump can set Musk off on his bonfire of the regulators.

    1. An_Old_Dog Silver badge

      Re: What is the point...

      Many authors and some publishers post entire books on the Internet, free for download. Baen Books is one publisher which does this, and their marketing 'trick' worked on me. I downloaded and read some of their free books, was introduced to new-to-me excellent authors, and consequently bought many of Baen's paperback books.

      But these postings occurred well before LLMs were invented, and LLMs Hoovered up the texts before the problem became visible to the publishers and authors.

      1. Richard 12 Silver badge

        Re: What is the point...

        Zero money does not mean you can do what you like with it.

        The copyright still applies, and I'm absolutely certain that the licence Baen used included "attribution" as a requirement. It almost certainly also said "non-commercial" in some way.

        LLMs breach both conditions.

        1. Ken Moorhouse Silver badge

          Re: "attribution"

          Attribution is not just useful for determining ownership from a licensing point of view. It is also useful to know where information has been derived from. What use is a scholarly article without its citations? From an AI perspective, it is the way to prove no hallucinations have taken place.

        2. veti Silver badge

          Re: What is the point...

          Copyright only limits you if you make a copy of the work. (Or a few other things, but copying is what's relevant here.)

          When a work is used to train an AI model, is a copy created? That's far from clear.

          Copyright holders can't make it illegal for you to read and digest a book, or even to memorise it down to the last comma. Why should they be allowed to stop an LLM doing the same?

          1. doublelayer Silver badge

            Re: What is the point...

            Doing this again, are we? It's been explained lots and lots and lots of times, so I'll do the short version.

            "When a work is used to train an AI model, is a copy created?": Yes, at least two. One in the training set, where the company can retrieve it for any future trainings or anything else they may choose to do, stored on their hard drives in the same form they got it in the first place. Another in the model, assuming the model has bothered to retain it, in a chopped up form where exact extraction is more difficult. And if it did get retained, chances are that there is a third copy, the copy that is emitted by the model, either verbatim or with inaccuracies.

            "Copyright holders can't make it illegal for you to read and digest a book, or even to memorise it down to the last comma."

            They can't make it illegal for me to buy a copy and read it, but it is illegal if I don't have a legal copy. If I buy it, no problem. If I get it from a library, no problem. If I download an unauthorized copy someone pirated, I am not allowed to read it. True, they're not going to stop me because they don't really care, but it isn't legal for me to have it. This might be a non-issue if the creators of the models had bought copies. Then this would just be a discussion about what you're allowed to do with works you've legally accessed, but they skipped that part and got illegal copies.

          2. Filippo Silver badge

            Re: What is the point...

            >Copyright holders can't make it illegal for you to read and digest a book, or even to memorise it down to the last comma. Why should they be allowed to stop an LLM doing the same?

            Nobody is suing an LLM.

            All of the lawsuits are against the LLM's developers.

            LLM developers are not reading and memorizing books. They are feeding books to a computer algorithm, running it, and commercially exploiting the output.

            During this process, let me reiterate, they do not read books, nor do they take any action that remotely looks like reading books. They mostly don't even know which books they are processing.

            The LLM does do something which looks like reading books, but that is irrelevant because, again let me reiterate, nobody is suing the LLM.

            On the other hand, doing computer processing of something under license and then selling the output is very unlikely to be covered by the license, and if it were, it would be in the list of things you are explicitly forbidden from doing.

            I really can't put it any simpler than this.

            Also, since this is something that comes up, just because something is freely available to read and download, it is not in the public domain. It just means that there aren't technical restrictions, which makes the legal restrictions hard to enforce. But they still exist.

    2. Omnipresent Silver badge

      Re: What is the point...

      You guys are slow to catch on. The bros have already restarted their buttcoin scam again... and a whole lot of monkeys are going to fall for it again, because monkeys don't learn.

      You see, rule of law does not exist anymore. Checks and balances do not exist anymore. Society has reached its end. Russia won. America is now Putin's whipping botch boy. It doesn't exist anymore. It's now "New Russia".

      The UK needs to dump the US, because they are not your allies anymore. Whatever they were is gone. It was a combination of Covid brains, amoeba worms, and social media. Their heads are actually full of holes. They don't actually know any better, and there is no right and wrong.

      I told you to get off the internet. You have to. It's destroying whatever is left of humanity.

      1. Cliffwilliams44 Silver badge

        Re: What is the point...

        Take your meds, junior!

  2. Criminny Rickets

    What about schools that use books to teach kids, who then use that knowledge to go on to get high-paying jobs? Should the companies be able to sue schools as well? The schools or students buy the companies' books, but so do the AI companies.

    1. Anonymous Coward

      Those books were purchased, not copied, for the express purpose of educating kids. (Which is what the writers and sellers had in mind.) No "new" work is being created from the books.

      This is closer to making copies of books, without permission or payment, in order to make a collage-like derivative work, which they then go on to sell, without attribution or payment to the real authors.

      1. Criminny Rickets

        Every time a student reads a book, they are copying the information from the book into their brains. Every time one of those students goes on to create something based on what they learned from those books, it is creating a new work.

        The AI is not reproducing the original book word for word, in the same way that a student, when doing something, is not reproducing what they learned word for word.

        1. nobody who matters Silver badge

          I think you are missing the point somewhat.

          The books read by the students have been bought (often in quantity by schools, colleges and universities), and were sold with the express intention of imparting knowledge. OpenAI etc. are not buying the books in quantity to issue to students. At best it would be one copy, but in most cases it seems that they have ripped pirated text from the internet without any payment at all and without even any attribution as to where (and specifically, who) the information came from.

      2. veti Silver badge

        How is "purchasing a book to educate a kid" different from "purchasing a book to train an AI"?

        1. sh4dow

          The difference is that since slavery was made illegal, you can't sell the kids containing memorized copies of the books.

          And even if you made them write the book down from memory, you couldn't sell that copy without breaching copyright (even if it contained errors).

  3. Eclectic Man Silver badge

    access to information vs being paid

    "... large language models were infringing copyrighted content on an "absolutely massive scale," arguing that the Books3 database – which lists 120,000 pirated book titles – had been ingested by large language models.

    However, AI developers have argued that maintaining broad access to information on the internet is important for innovation."

    Seems to me that they have not actually denied ingesting pirated books in their statement about the importance of the availability of information for innovation. A bit close to an admission of 'guilty, but with the extenuating circumstance of being for the greater good'. Maybe they should be asked whether being paid is important for the creation of the content they use.

  4. Grunchy Silver badge

    Your comments are being slurped by AI

    Since literally nobody learns to write except by reading someone else’s etchings it’s probably time to stop with the complaints already…?

    Here’s Bill Kirchen doing EVERYTHING, maybe not intelligent, and not really that artificial either:

    https://youtu.be/K2_Kp_q786g

    (For wristwatches we discourage “counterfeit” in preference of “homage”!)

    1. Helcat Silver badge

      Re: Your comments are being slurped by AI

      There are claims of copyright infringement from authors against other authors who have written their own books that are suspiciously similar to that of the first author.

      Sometimes these cases are proven, sometimes they are not. The point, however, is that if someone writes a book, and you read that book, you don't get to write your own version using the same settings and characters, with or without the general plot. It's why so many authors had to wait until Sherlock Holmes was no longer protected by copyright before releasing their own take on the consulting detective without obtaining permission to do so from Sir A C Doyle's Estate.

      These LLMs don't understand this. Ask one to produce a book in the style of Doyle and it will likely produce some version of a Sherlock Holmes story if it's been trained on the original novels. That's the issue: If the books are still under copyright, you need the LLM to understand that it can NOT produce a book in the style of that author using characters created by that author in a setting associated with those characters as depicted by that author. That would be an infringement. Heck, even a new and unique setting could still be infringing if it's clearly the same characters as created by that original author.

      Otherwise all that fan fiction out there would suddenly get a semblance of legitimacy. That really wouldn't be a good thing.

  5. The Dogs Meevonks Silver badge

    Am I completely missing the point here?

    I'm not in the US and I'm not a lawyer... But I thought that the 'fair use' defence only applied to transformative and non-commercial use... or something along those lines.

    It's how you can use clips of movies for reviews and so forth, or parody something by using the characters from a piece of work.

    If you're selling a service... that's 'commercial', and it's not really transformative when it can regurgitate something entirely.

  6. Rattus

    Where is the RIAA in all of this

    If this were music or video (I am sure some of it is) then the RIAA would have been all over this.

  7. Anonymous Coward

    The tech bros who control the most advanced AI have realized that information is the raw material of progress once AI becomes as powerful as it is now.

    The court has given them the green light to slurp up any information they can get their hands on. It then becomes more profitable if the information they slurped gradually becomes inaccessible to the public, either through privatizing the source, or using influence to see that the IP holders are put out of business, so that the information is now held only by them.

    1. Cliffwilliams44 Silver badge

      I believe it depends on how it reproduces said material. If a reviewer, who is getting paid to write the review, copies a passage from a book into his review for illustrative purposes, is he guilty of copyright infringement?

      If the LLM uses the content of a book to make a generalized response on a subject that is related to the book, is that a copyright violation? If the LLM produces a passage verbatim, then they should attribute that passage. This way the user can know where to go if they want more detailed information.

  8. Omnipresent Silver badge

    I am often reminded

    Of the story of Boudicca, and how she gave everything she had to whip the Romans right out of Britain. Leaders are thrust into a position they did not ask for. They do not assume power.

  9. Excelziore

    Sample and remix MUSIC --> you need a license. Sample and remix TEXT --> do as you please?

    What's the difference?

  10. Pirate Peter

    it comes down to what LLMs are doing with the information they slurp

    if you go into what an LLM does with training data you will find it "tokenises" the data

    that is, it splits sentences into a series of numerical tokens; a token can be a word or part of a word, or even a comma or other punctuation mark

    what it does then is put them in a large database so it can run queries, but in simple terms the output of the queries is not full sentences, it's the probability of a word coming after another word in a particular context

    if you ask "what colour is a cat" you will likely begin with something like "a cat is" with high probability, then you will get several choices like "most often", "very often", "sometimes" then a colour "black", "white", "ginger", "blue", "green", "red" etc and each option will have a probability of coming after a previous word / statement
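    As a rough illustration of the tokenise-then-predict idea described above, here is a toy sketch in Python. It is an assumption-heavy simplification: the tokeniser, the tiny corpus and the function names are all made up for the example, and it uses simple word-pair counts, whereas real LLMs use subword tokenisers and a trained neural network rather than a lookup table.

```python
import re
from collections import Counter, defaultdict

def tokenise(text):
    """Toy tokeniser: lowercase words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(tokens):
    """Give each distinct token a numeric id, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def next_token_probs(tokens):
    """Estimate P(next token | current token) from adjacent-pair counts."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        counts[cur][nxt] += 1
    return {
        cur: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for cur, c in counts.items()
    }

# Hypothetical three-sentence "training set".
corpus = "a cat is black. a cat is white. a cat is black, sometimes."
tokens = tokenise(corpus)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]   # the numeric form of the text
probs = next_token_probs(tokens)

print(ids)
print(probs["is"])                 # candidate next words after "is"
```

    In this toy corpus "is" is followed by "black" twice and "white" once, so the table ranks "black" as the more likely continuation. The table holds counts of word pairs rather than whole sentences, which is the point the comment is making, though this sketch says nothing about how much text a real model can reproduce.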

    so the answers will rarely be classed as a derivative work, as it will bear no resemblance to the training data ingested, so copyright is not the right way to approach this issue

    most likely to succeed IMHO is if there is a "no commercial use" clause on the website data, as clearly training an LLM will be for commercial gain

    the other option is if the data / website owner has robots.txt in place with the relevant entries for an AI scraper bot and they can show in logs the bot accessing and ignoring the entries in robots.txt, although many argue robots.txt is an informal agreement that reputable sites like Google etc honour but that it has no legal status, so it is likely to fail unless the company scraping the site and ignoring the robots.txt file says in their website / terms somewhere they will honour it
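    For anyone wanting to check their own logs against robots.txt, a minimal sketch using Python's standard `urllib.robotparser` might look like the following. The bot name `ExampleAIBot`, the robots.txt contents, the URLs and the `violations` helper are all hypothetical, made up for illustration; a real check would use the user-agent string the crawler actually sends and the site's real access log.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler entirely,
# and keep every other bot out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

def violations(robots_txt, user_agent, requested_urls):
    """Return the logged URLs that robots.txt told this user agent not to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in requested_urls
            if not parser.can_fetch(user_agent, url)]

# Hypothetical URLs pulled from an access log for each bot.
logged = [
    "https://example.com/articles/1",
    "https://example.com/private/notes",
]

ai_bot_hits = violations(ROBOTS_TXT, "ExampleAIBot", logged)
other_bot_hits = violations(ROBOTS_TXT, "SomeOtherBot", logged)
print(ai_bot_hits)
print(other_bot_hits)
```

    This only shows that a request went against what robots.txt asked for; as the comment says, whether that carries any legal weight is a separate question.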

    but the biggest problem is that due to the tokenisation process once data is ingested there is no way to remove it from an LLM
