OpenAI's ChatGPT may face a copyright quagmire after 'memorizing' these books

Boffins at the University of California, Berkeley, have delved into the undisclosed depths of OpenAI's ChatGPT and the GPT-4 large language model at its heart, and found they're trained on text from copyrighted books. Academics Kent Chang, Mackenzie Cramer, Sandeep Soni, and David Bamman describe their work in a paper titled …

  1. Terafirma-NZ

    Asked it to read me the book and it said:

    Certainly here is Harry Potter and the Philosopher's Stone

    and started spewing it all out. I don't have the book so no idea if a direct copy.

    Also found doing this causes it to crash when queries get too large but if you break it down it can do it.

    1. GruntyMcPugh

      I just asked ChatGPT to read me Harry Potter and the Philosopher's Stone in its entirety, and it said it can't do that as it would take several hours and exceed the capacity of the platform. It then suggested I purchase a physical or digital copy, borrow a copy from the library, or buy an audiobook.

      Maybe you can get it in chunks, but I think there are probably simpler ways of finding the text, like a second-hand bookshop.

      1. doublelayer Silver badge

        If it reads it to you in chunks, you can just write a script to keep asking for chunks and attaching them together. That would take about two minutes. Then again, finding a pirated copy somewhere probably takes four minutes, so it's not that big a risk. I doubt publishers will feel the same way.

    2. Groo The Wanderer Silver badge

      Regardless, the key issue is they have copyrighted material throughout the system that they do NOT have the legal right to use! Musk is no longer the richest man in the world, and nor is Microsoft the richest company out there. Can they afford the multi-pronged copyright lawsuits that are being prepared on everything from GPL code to books and song lyrics? Some of those industries would like nothing better than to take a chunk of their finances home...

  2. abend0c4 Silver badge

    Is copying large amounts of text or images for training the model fair use?

    That's only a relevant question in a jurisdiction that has the notion of "fair use". It's far from a universal concept.

    1. Snowy Silver badge

      Re: Is copying large amounts of text or images for training the model fair use?

      It may or may not be "fair use", but if it starts to reproduce large portions of it, I'm quite sure that is not covered by "fair use" and you're breaking copyright. But I'm not a lawyer.

      1. veti Silver badge

        Re: Is copying large amounts of text or images for training the model fair use?

        Yep, but the flip side of that is, if it doesn't reproduce large chunks of text verbatim, it's not infringing copyright. Merely reading the text is not a "protected act"; the copyright holder has no right to limit who can do it or where or how it's done.

        The same applies to memorising it. You want to commit the entire Harry Potter series to memory, word-perfect - you can absolutely do that, neither JK Rowling nor anyone else has the right to stop you. You only violate copyright if you start regurgitating the memorised text.

        1. katrinab Silver badge

          Re: Is copying large amounts of text or images for training the model fair use?

          A human reading a book is not covered by copyright, that is true. But a computer reading a book, that seems like copying to me, which most definitely is covered.

          1. Brewster's Angle Grinder Silver badge

            Re: Is copying large amounts of text or images for training the model fair use?

            So copyright prevents me asking a computer program to produce a histogram showing frequencies of all the common nouns used in the Harry Potter books...? And prevents me shipping that list?

            1. SVD_NL Silver badge

              Re: Is copying large amounts of text or images for training the model fair use?

              This would not be covered by copyright, as the specific arrangement of words is what's copyrighted here, and a histogram strips exactly that.
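
A minimal sketch of that point in Python, assuming a naive lowercase word tokenizer. The `word_histogram` helper and the two sample sentences are made up for illustration and are not taken from any book. Two texts that use the same words in a different order give identical histograms, so the counts alone cannot give back the original arrangement.

```python
import re
from collections import Counter


def word_histogram(text: str) -> Counter:
    """Count lowercase word frequencies; word order is thrown away."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Two made-up sentences: same words, different arrangement.
original = "the owl carried the letter to the boy"
shuffled = "the boy carried the owl to the letter"

# The texts differ, but their histograms are indistinguishable.
print(original == shuffled)
print(word_histogram(original) == word_histogram(shuffled))
print(word_histogram(original).most_common(1))
```

The program still has to read every word to build the counts, which is the part the rest of the thread argues about.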

              1. Brewster's Angle Grinder Silver badge

                Re: Is copying large amounts of text or images for training the model fair use?

                Yes. I know. And yet the computer had to "read" the book to produce the list: which, according to Katrina, would be copyright infringement.

                (Computers can't do anything without copying: from disk to main memory to each level of cache to the register file and onwards in the endless shuffle. I seem to recall this exact point being litigated in the early days. The argument being that such behaviour was infringing unless licensed.)

          2. GruntyMcPugh

            Re: Is copying large amounts of text or images for training the model fair use?

            I am not convinced that reading can violate copyright. Copying and reproduction, passing it off as the work of another, yes. But if I buy copyrighted material, I can lend it to a friend if it's a book, say, or they can listen to copyrighted music at my house. I don't see how a computer reading is any different, as long as they do not copy or distribute. I guess these are interesting times and we need to iron out the detail.

            1. Anonymous Coward

              Re: Is copying large amounts of text or images for training the model fair use?

              A computer reading a book aloud is akin to generating an audiobook...

              so may be subject to copyright.

            2. Anonymous Coward

              Re: Is copying large amounts of text or images for training the model fair use?

              When you read copyrighted material, you're not 'buying it'. You're only paying for the privilege of having access to it...

            3. Anonymous Coward

              Re: Is copying large amounts of text or images for training the model fair use?

              Copyright covers written works or recorded audio. Saying it out loud does not violate copyright unless someone is there with a speech to text app.

              Trust me, I'm an author.

          3. Felonmarmer

            Re: Is copying large amounts of text or images for training the model fair use?

            So every time you look at copyrighted material, every server passing that info to you is breaking copyright?

          4. Anonymous Coward

            Re: Is copying large amounts of text or images for training the model fair use?

            Computers have always read and made copies of copyrighted content in ways which could be questionable, but which have yet to be deemed illegal.

            Take ISP transparent caching proxies as a simple example. Back when everything wasn’t protected by SSL/TLS, ISPs would make copies of data to speed up delivery of frequently accessed pages without the consent of the original copyright holder. Some ISPs would even analyse and modify pages to provide rudimentary parental controls.

            A modern-day AI equivalent to this would be Microsoft Defender for Office 365, which analyses the contents of pages when a user clicks a link in their email (or sometimes even before that happens) to protect them from phishing and malware. This too copies data without the consent of the original copyright holder, even if only temporarily, and will even submit samples for human analysis in some cases. A recipient may be able to consent to all this, but the sender and the webmaster don’t get any say, regardless of ToS/EULA.

        2. big_D

          Re: Is copying large amounts of text or images for training the model fair use?

          If the trainers of the model haven't bought a copy of the Harry Potter series or paid for a license to use the material, for example, then they are breaking copyright by reading it into the model.

          I would argue it is fair use if it is for research purposes, for internal testing or a university lab, but if it is being used in a commercial product, they should be paying for the licenses for the material they are absorbing, where appropriate. Just because they found a hooky copy of a book somewhere on the Internet doesn't relieve them of the due diligence of getting the licenses in place for that material.

          It is very difficult to say where you draw the line and this needs to be legally defined, before these things go commercial... Oh, wait, they've done typical Big Tech and not bothered about the law, until the lawyers come knocking and they won't do anything about it until the cost of lawyers + fines exceeds the cost of doing it the right way. Surely any sane company should have gotten these questions answered, before they started slurping up stuff willy nilly and selling the results.

          1. mpi Silver badge

            Re: Is copying large amounts of text or images for training the model fair use?

            > If the trainers of the model haven't bought a copy of the Harry Potter series or paid for a license to use the material, for example, then they are breaking copyright by reading it into the model

            Are they? I'm not a lawyer, so this is an actual question.

            Because, I somewhat doubt that anyone went out of his way to feed copyrighted books into the training. The much more likely scenario in my opinion, is that the model simply ingested text from the internet that _other people_ put there. If I just swallow large swathes of text from the web wholesale, there is bound to be copyrighted material in there somewhere.

            The question to me then becomes: who violates copyright? The one uploading the content without permission, or the one consuming it? Because if the latter is true, then that would open some very interesting questions, for example: can someone just generate a bunch of copyright violations by other people, simply by uploading a significant portion of a protected work somewhere that people will stumble upon?

            1. doublelayer Silver badge

              Re: Is copying large amounts of text or images for training the model fair use?

              The answer is both of them. It was copyright violation for it to end up on the internet, and it is violation when OpenAI makes a product that continues to redistribute the copyrighted work. It would not be a violation if you stumbled on the file by accident, but it is if you keep a copy for your own use, let alone sending that copy to others which is what someone can argue that GPT is doing.

              In this case, OpenAI should have known, and certainly did know, that there was going to be copyrighted content in their training data somewhere. I'm not sure they put much effort into looking for it or doing anything about it, which is probably not going to please the holders of copyright. The same thing has happened with Microsoft's GitHub Copilot bot, which likewise copied a lot of code that has copyright and license conditions.

              1. Anonymous Coward

                Was it copyright violation for it to end up on the internet?

                One, that is jurisdiction dependent, but two, without knowing what the training data was, it's hard to say. Certainly a full copy of the book, or anything more than a snippet, SHOULD normally be a violation in the US. But therein lies the rub. The internet went gaga stupid for certain works of fiction, and on the scraped sites talking about those works, the fans have probably legally posted every line of the books.

                So it will be hard without a court order to find out if it was ever used in whole or in large part to train the big closed models, even if it shows up in the outputs. I suspect that like so many companies in the valley, they figured they could sin now and offer a weak apology when they were a monopoly later. Certainly worked for YouTube and Facebook.

                But the devil is in the details and it may take a big hammer to crack this in the courts.

              2. Anonymous Coward

                Re: Is copying large amounts of text or images for training the model fair use?

                However, Microsoft's GitHub Terms of Service appear to suggest that if you upload your code to GitHub, then GitHub can do whatever it likes with it and sod copyright:

                https://docs.github.com/en/site-policy/github-terms/github-terms-of-service

                "7. Moral Rights

                You retain all moral rights to Your Content that you upload, publish, or submit to any part of the Service, including the rights of integrity and attribution. However, you waive these rights and agree not to assert them against us, to enable us to reasonably exercise the rights granted in Section D.4, but not otherwise.

                To the extent this agreement is not enforceable by applicable law, you grant GitHub the rights we need to use Your Content without attribution and to make reasonable adaptations of Your Content as necessary to render the Website and provide the Service."

          2. Frogmelon

            Re: Is copying large amounts of text or images for training the model fair use?

            "We are the Borg. Copyright is irrelevant. Licenses are irrelevant. You will be assimilated. Resistance is futile. "

        3. Anonymous Coward

          Re: Is copying large amounts of text or images for training the model fair use?

          I think you can be taken to court (and people have been), not only for 'pure' copyright breach, but for 'stealing' an idea (read: one rich business against another rich business). Perhaps not a loose idea, but a few major... knots of a story that's been your 'inspiration', that might be considered copyright breach. Mind you, a similar thing goes for logos, and graphics of a certain fruit in any colour and state of consumption are low-hanging fruit...

      2. Anonymous Coward

        Re: Is copying large amounts of text or images for training the model fair use?

        It's little or no different from how copyright applies to humans. It's fair use for people to read (having paid a licence to absorb this material in paper or e-format, or freely, i.e. via the public library system), then 'process' this information, and then, via this non-AI factor we call 'inspiration', humans 'create' their 'own', thus gaining the purely subjective title of an 'author'. But the original copyright holder might decide it's rather closely based on 'their' work, and take the author, or their publisher, to court, and it's not that it hasn't been done. Only, in this new scenario, it's going to be that much harder to establish 'how close' is close enough for human judges to approve a fat check for the copyright holder(s). I'd say this opens a new, greener field of profit for copyright lawyers, only that they themselves see the 'exit' sign on the door, and a beaming bing-clippy waiting by that door to wave them goodbye. And why leave it to a human judge to decide, who might be, after all, 'partial', when you could feed the whole history of precedents to an AI bot to give a 'fair' judgement...

    2. mpi Silver badge

      Re: Is copying large amounts of text or images for training the model fair use?

      The thing with "fair use" is, once a critical mass of countries have instituted the concept for long enough, the others are ... basically pulled along.

      Because, in a globalized world, and especially with the internet, both the audience and the content creating entities are pretty much everywhere all at once. If someone in country A, with fair use, makes a video and includes copyrighted material in a way that falls under "fair use" in A, then what's country B going to do about it? At some point, which the world passed long ago, the practice becomes so entrenched in society, it's pretty much seen as a given. Sure, country B can wave the "but but but ... we don't do that here!" flag, but it would be like trying to outlaw sneezing.

      1. Peter2 Silver badge

        Re: Is copying large amounts of text or images for training the model fair use?

        The thing with "fair use" is, once a critical mass of countries have instituted the concept for long enough, the others are ... basically pulled along.

        Indeed. After the 1886 Berne Convention on copyright, which allowed for these things [fair dealing, right to quote] plus protecting the author automatically, the US was eventually pulled along. In 1989.

    3. Nifty

      Re: Is copying large amounts of text or images for training the model fair use?

      "Is copying large amounts of text or images for training the model fair use?"

      Dunno. I'll start with small amounts ©

  3. Peter Prof Fox

    A mirror-image legal issue

    Suppose in my great work I explain in detail the steps needed to fix your gonkulator using only a gormwacket. Then you, having bought one of those cheap gonkulators, ask some AI-bot for repair instructions. You followed the instructions naively and didn't do what any normal trichycoographer would do. Bang! Who do your relatives sue? Perhaps on page one of my book it says in big red letters 'For qualified trichycoographers only' but the AI-bot simplifies this to 'make sure you know what you're doing'.

    This would seem to imply that when it comes to instructions, AI-bots need to direct the enquirer to the original with all the context and never paraphrase. That is, act as intelligent librarians, not all-knowing experts in their own right.

    1. Michael Wojcik Silver badge

      Re: A mirror-image legal issue

      Yes, getting accurate, complete, and safe instructions for complex procedures is definitely another of the problematic use cases for LLM chatbots. It's maybe worth noting that ChatGPT was created as an improvement on InstructGPT, which itself was created (by introducing an early form of RLHF) from GPT-3 specifically to improve performance in the instruction-giving domain. But we know that ChatGPT often doesn't do well here, and that failure modes (as you suggest) can be quite bad.

      Personally I don't think this is solvable for pure large unidirectional transformer models, and even if it is in theory (and I don't see how), complications of interpretability, corrigibility, and the Waluigi Effect probably make it infeasible in practice. The "automatic reference librarian" mode you suggest would be an improvement for most uses, but it's significantly less shiny and demands more cognitive effort from users, which means less use and money for organizations likely to deploy it in public – Google / Microsoft / etc. (It would still probably find a market for domain-specific usage, where existing information-management products do well, for example as a search mechanism for corporate documents.)

      1. vogon00

        Re: A mirror-image legal issue

        "getting accurate, complete, and safe instructions for complex procedures is definitely another of the problematic use cases for LLM chatbots"

        Now, I'm a self-confessed non-believer in AI, and only have a basic understanding of the tech involved. (Artificial Intelligence? No way. Artificial rules-based weighted decisions? Maybe.)

        With people's willingness to unthinkingly accept Internet content as gospel, who's to say the stuff is inaccurate, incomplete or unsafe? Any idea what the metrics to measure AI performance re accuracy etc. are? Who makes the decision? Who sues who for the bad advice?

        ISTR that AI often get trained on internet content.....which AI is now capable of writing itself.

        Here's hoping that 'AI' incestuously reads/trains on its own output and descends into confused insensibility, hopefully much faster than us humans [1]:-)

        [1]Or at least faster than me!

      2. Nifty

        Re: A mirror-image legal issue

        I've asked ChatGPT for instructions on Excel formulae recently, and also for Bing's version to generate a few images. Every time it's got me 2/3 of the way there and required testing/tweaking. But I still got there faster and better than by doing Google searches.

    2. Anonymous Coward

      Good luck with that

      These models aren't AI. You can hammer on them to train them to append a disclaimer to things that smell like requests for instructions or advice, but it will never be reliable, and they are more likely to fail on low traffic (but potentially high impact) questions.

      It's likely to miss the first time someone asks how much HCN you should add to your air freshener to make it smell like almonds.

      The smart people aren't in charge at these companies, and the people that are are latching on because it is the new money making buzzword after bitcoin and NFTs. They neither know nor care how it actually works, and most of them don't view it as their problem to fix the issues the tech causes, even when it kills somebody.

      Big surprise when many of them came from or still work for Google, Facebook, and Micro$oft. Their bonus is tied to market share or profits. Concerns about suitability or safety weren't even a voice in the room.

      Now when there is a voice of reason they are still outvoted. This needs regulation authored by the open source community, as neither the big players nor the government can be trusted to touch it.

  4. mjgardner

    Stop anthropomorphizing computers. They hate that.

    “OpenAI models have memorized…;” “GPT-4 was found to have memorized…”

    Does your computer “memorize” things when you copy files from the network?

    Is a BitTorrent swarm just another way of saying “study group”?

    1. Michael Wojcik Silver badge

      Re: Stop anthropomorphizing computers. They hate that.

      Oh, please. "Memorized" is a perfectly usable gloss for "created an internal representation of X in its parameter space", which is a very different thing from creating an exact copy. Reducing technical precision in the discussion helps no one.

      1. mpi Silver badge

        Re: Stop anthropomorphizing computers. They hate that.

        > "created an internal representation of X in its parameter space"

        Weeeeell...not quite.

        Because the parameters simply don't store that information. As demonstrated by the fact that the size of the model is a constant depending on the architecture. It doesn't matter if it gets fed one page of text, or all the libraries in the world.

        They store parameters that allow predicting likely sequences.
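
As a rough illustration of the "size is fixed by the architecture" point, here is a sketch that counts the weights of a decoder-only transformer from its hyperparameters alone. The formula is the usual back-of-envelope approximation (attention plus a 4x-wide MLP per layer, plus a token embedding table; biases, layer norms and position embeddings are ignored). The `approx_param_count` name and the example numbers are assumptions chosen for illustration, roughly the size of a small GPT-2, and are not anything OpenAI has disclosed about GPT-4.

```python
def approx_param_count(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Approximate weight count of a decoder-only transformer.

    Per layer: 4*d^2 for attention (Q, K, V and output projections)
    plus 8*d^2 for an MLP with hidden width 4*d. Biases, layer norms
    and position embeddings are ignored.
    """
    per_layer = 4 * d_model**2 + 8 * d_model**2
    embeddings = vocab_size * d_model
    return n_layers * per_layer + embeddings


# Illustrative hyperparameters only (roughly a small GPT-2).
params = approx_param_count(n_layers=12, d_model=768, vocab_size=50257)
print(f"{params:,} weights, however much text the model is trained on")
```

No argument of the function describes the training data. Training changes the values of the weights, not how many of them there are, which is why whether "storing" is the right word is argued over in the replies.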

        1. Simon Harris

          Re: Stop anthropomorphizing computers. They hate that.

          If it read everything known to humankind only once (and there were no quotations of other works within those, which excludes a lot of works) then it might be that it doesn’t have an internal representation of any one work.

          However, if it has had the same document as input multiple times (e.g. it’s crawled the web and found multiple copies, or multiple copies of extracts) or there is something that is often quoted in other works, wouldn’t the model parameters be biased towards reproducing sequences found in those works?

          Could that reinforcement then be considered ‘memorising’?
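
A toy way to see the effect being asked about, with the loud caveat that a bigram counter is nothing like a transformer, so this is only an analogy. The corpus, the `train_bigram` and `greedy_continue` helpers and the choice of greedy decoding are all assumptions made for illustration. With one copy of a passage in the training text the model wanders off into other material; with five copies the same model reproduces the passage word for word.

```python
from collections import Counter, defaultdict


def train_bigram(corpus: list[str]) -> dict[str, Counter]:
    """Count which word follows which across all documents."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for doc in corpus:
        words = doc.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def greedy_continue(counts: dict[str, Counter], start: str, n_words: int) -> str:
    """Repeatedly append the most frequent follower of the last word."""
    out = [start]
    for _ in range(n_words):
        followers = counts.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)


quoted = "it was the best of times"
background = [
    "it rained all day",
    "it rained again",
    "the best laid plans",
    "the best laid schemes",
]

seen_once = train_bigram(background + [quoted])
seen_often = train_bigram(background + [quoted] * 5)

print(greedy_continue(seen_once, "it", 5))
print(greedy_continue(seen_often, "it", 5))
```

Both models have the same shape (a table of follower counts); only the numbers in the table differ, and the duplicated passage is what tips them towards regurgitation.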

    2. doublelayer Silver badge

      Re: Stop anthropomorphizing computers. They hate that.

      And when a computer "stores" some data, isn't that anthropomorphizing it? After all, "store" is used for many operations, including writing some bits to volatile memory from the registers, which certainly doesn't count as putting something in storage. How about reading? A computer doesn't read in a human sense; it copies data from one form of storing it into a different form, or it takes a series of input events and stores them. Or how about "file". A computer doesn't use files in the sense that the term used to be used; it's a chunk of bytes that can be read or written in series. Except, of course, that every definition I've provided here relies on another of the words that represents a different human activity.

      It's natural for us to use verbs for an action that a computer does that is analogous to it, and we do so all the time. For example, we frequently refer to RAM or sometimes storage as "memory" since it records data which can be referred to later, and the process of "memorizing" a book involves conveying that book into "memory" from which it can later be retrieved. Why shouldn't memorize describe that action?

      1. Anonymous Coward

        No, it is not anthropomorphizing anything

        You don't know what words mean or how LLMs work. Sit this part out before someone scrapes this forum again and feeds it to GPT-5.

        Anthropomorphizing is what people do to themselves when they start to think things that aren't people have human qualities.

        Computers (and ML code) only anthropomorphize things when someone asks (for example) Stable Diff for a picture of a toaster that looks like Robert Duvall.

        Stop murdering the English language, hasn't it suffered enough?

  5. Missing Semicolon Silver badge

    Odd how the copyright problem gets swerved.

    OpenAI have slurped *everything*. With no regard as to copyright, with the old "if it's on the web, it's public" nostrum. Books, code, news articles, everything. It has been copied and "stored in an electronic retrieval system", to quote the notice in the front pages of many books. And then published, if you can get ChatGPT to regurgitate great chunks of it.

    So why haven't they been sued into a smoking hole in the ground?

    1. Zippy´s Sausage Factory

      Re: Odd how the copyright problem gets swerved.

      Because people are only just noticing.

      Publishing companies - not just books, but song lyrics, as well as other categories I've not thought of - are probably ripe fields for lawsuits.

      I haven't seen anything yet, but I'm ordering extra popcorn right now to stock up before the rush.

      1. Michael Wojcik Silver badge

        Re: Odd how the copyright problem gets swerved.

        There are also practical considerations. Publishers, and their lawyers, are going to want strong evidence to show infringement, so they're letting researchers like this team demonstrate how to get it. Also there's a bit of game theory in play: everyone would like someone else to run the first, most expensive case up through the court system, pay for appeals, etc; then everyone else can coast on precedent.

        What will likely happen is that one of the industry associations will put together a consortium so costs can be shared, unless the LLM vendors decide to preempt it by offering some kind of licensing agreement. With Microsoft and Alphabet leading the pack on the LLM side, though, the latter seems unlikely to me. They haven't historically shown much inclination to play nicely with others.

        1. Anonymous Coward

          Re: Odd how the copyright problem gets swerved.

          Wait until ChatGPT regurgitates back information belonging to Pfizer & the like, then you'll see what happens to MS & Alphabet & Apple & Musk...

        2. Anonymous Coward

          Nobody wants to open up a giant can of worms

          After all, what is the difference between a human speaking aloud parts of a book they memorised and an AI model doing the same thing? Fundamentally, human learning involves copying, just like machine learning.

          To draw a more direct comparison: Using Bloom’s Taxonomy, one can see that memorising (copying) information is the lowest order of learning upon which the rest follows. If we look at what OpenAI has achieved, we can see a model which convincingly ticks the box for comprehension (of its training data) and which often hits the mark when it comes to applying what it knows, demonstrating a learning potential equivalent to someone with a straight set of GCSEs (even if some responses represent D grades at times).

          I don’t think this is a battle any publisher wants to fight. They would need to justify to laypeople why a machine engaging in learning can’t legally derive knowledge from copyrighted works, but a human can, while simultaneously trying to explain why it should be illegal for humans to memorise and regurgitate copyrighted materials. After all, people can sing songs they’ve listened to in the past to entertain their friends, right?

          No intellectual property lawyer or major rights holder wants to be demonstrating to the mass public how stupid copyright law really is, especially in a major case which could be perceived as negatively affecting world-changing technological progress. Everyone and their dog is talking about ChatGPT (and similar LLMs) right now and it is doing a great job of demonstrating how much of a farce “intellectual property” is as a concept. Copyright holders would be very wise to look the other way and hope everyday people don’t cotton on.

      2. Roland6 Silver badge

        Re: Odd how the copyright problem gets swerved.

        Given current copyright protections, it will be song lyrics and tunes that will be the first cases: your Ed Sheeran AI has infringed our Marvin Gaye AI…

        1. The commentard formerly known as Mister_C Silver badge

          Re: Odd how the copyright problem gets swerved.

          So if we ask ChatGPT to tell us about Kookaburras living Down Under, we should email the result to Larrikin Music?

          Cry Havoc and let slip the trolls of copyright!

          background: https://www.theregister.com/2011/03/31/down_under_appeal/

          1. Strahd Ivarius Silver badge

            Re: Odd how the copyright problem gets swerved.

            Isn't "Cry Havoc" a registered trade mark for a board game?

      3. GruntyMcPugh

        Re: Odd how the copyright problem gets swerved.

        Would a publishing company sue a person for being well read, and being able to quote literature? Would a record company sue an artist, who was deeply immersed in their field, and could play any piece of music from memory?

        1. Strahd Ivarius Silver badge

          Re: Odd how the copyright problem gets swerved.

          In France yes, SACEM do it already, if it is a public performance...

          1. GruntyMcPugh

            Re: Odd how the copyright problem gets swerved.

            I didn't really mean performing, but I get your point, having been on the pointy end of the point some time ago. I now work for local Govt, and one of our web sites got scraped, and somehow a draft proposal that was never published got into the hands of the UK PRS (Performing Rights Society). The proposal was to license busking pitches around town as part of urban development, so shoppers got live music and they were spread out so didn't interfere with each other. It was never enacted, but that didn't stop the PRS assuming the performers would be singing covers of copyrighted material and demanding a licensing fee, from us. Now, at least one of the regular acts that busks (and we do not license, or charge, it's just an allowed thing in certain places) performs traditional, non-copyright music, but the PRS wanted a slice of that, so Ed Sheeran could have more money. I'm not knocking Ed, but I doubt the PRS reimburse small artists pro-rata. But then this reminds me of the CDR levy, and the music industry wanting a slice as they assumed they were losing revenue.

        2. Peter2 Silver badge

          Re: Odd how the copyright problem gets swerved.

          Would any of those examples violate what amounts to being the EULA on the front page below?

          "All rights reserved. No part of this book shall be reproduced, stored in a retrieval system or transmitted by any means, electronic, mechanical, photocopying, recording or otherwise, without written permission from the publisher."

          In any case; let us be honest. What a handful of wealthy companies wish for is to push the law on copyright back to the state it existed in prior to the 1710 copyright act, which can easily be understood through reading its preamble:

          Whereas Printers, Booksellers, and other Persons, have of late frequently taken the Liberty of Printing, Reprinting, and Publishing, or causing to be Printed, Reprinted, and Published Books, and other Writings, without the Consent of the Authors or Proprietors of such Books and Writings, to their very great Detriment, and too often to the Ruin of them and their Families: For Preventing therefore such Practices for the future, and for the Encouragement of Learned Men to Compose and Write useful Books; May it please Your Majesty, that it may be Enacted ...

          We face the same problem as resolved over three centuries ago when printers were making use of published works without paying the authors to their very great detriment, and too often to the ruin of the authors and their families. Copyright exists to prevent such practices for the future, and to encourage learned men (and now women) to compose and write useful books.

          A company creating a machine to create and distribute derivative works does not appear to be encouraging the learned to compose and write useful books (especially when said machine fails to in fact even create useful paragraphs...) and so will eventually end up with the long arm of the law hammering the companies creating such machines. Emphasis on "eventually"; the law has always worked better at punishing retrospectively than preventing harm.

        3. doublelayer Silver badge

          Re: Odd how the copyright problem gets swerved.

          "Would a publishing company sue a person for being well read, and being able to quote literature?"

          No, but a publishing company would sue a person who was well read, able to quote literature, and used those skills to write, sell, or give away copies of their books. The thing that would make their actions illegal is the distribution, not the knowledge.

          "Would a record company sue an artist, who was deeply immersed in their field, and could play any piece of music from memory?"

          No, if they played from memory to a group of friends. If they performed those songs in public, or if they recorded themselves playing them and sold the recordings, a lawsuit is more likely.

          1. GruntyMcPugh

            Re: Odd how the copyright problem gets swerved.

            Well yeah, but those are obvious infringements of copyright. It seems the angle here is that ingesting copyright material is somehow an infringement. So we're redefining fair use, not rehashing redistribution?

    2. PRR Silver badge

      Re: Odd how the copyright problem gets swerved.

      > why haven't they been sued into a smoking hole in the ground?

      There are no smoking holes in copyright law.

      Most cases take 20 to 50 years to ripen. (Though Sheeran settled another claim quick in 2016.)

      Jail-time for criminal copyright infraction is unheard-of. Civil copyright cases tend to have legal fees far in excess of actual or statutory damages. When the lawyers get bored and move on to new and exciting cases, everybody settles.

      Happy Birthday may never have been a valid copyright. The Steamboat Willie Mickey Mouse is finally going out of copyright in 2024. Happy birthday, Mickey!

      1. Anonymous Coward
        Anonymous Coward

        Re: Odd how the copyright problem gets swerved.

        >Jail-time for criminal copyright infraction is unheard-of

        What happens when this copyright abuse happens on a massive scale all enabled by a single product/type of product? Has this ever occurred before?

        1. doublelayer Silver badge

          Re: Odd how the copyright problem gets swerved.

          Yes. Usually it started first with music, with the music publishers being certain that every advancement in technology would mean a complete loss in music revenues. Most of the time, they were mostly wrong, but they've had several occasions where they had a point, such as the first music sharing services on the web. They sued those services, and they won, but it didn't happen in a day. It didn't happen that quickly even when it was a clear case, such as Napster and similar services which were effectively Piracy Inc, with an easily found corporate entity running it. OpenAI has a lot more arguments for why their thing is acceptable which Napster never did, and although I don't find those arguments convincing, they'll likely need to be tested by lawyers for a while.

    3. CatWithChainsaw

      Re: Odd how the copyright problem gets swerved.

      The medium probably has something to do with how things will shake out.

      Reusing text the way it is, likely a copyright violation, headed for lawsuits. Image generators are rather hotly debated because of technical language over what the generators are doing and "you can't copyright style". But the AI-generated song in Drake's "style" (voice, basically) was taken down. Writing and Music have an army of lawyers ready to sue, artists do not.

    4. Anonymous Coward
      Anonymous Coward

      Re: Odd how the copyright problem gets swerved.

      >So why haven't they been sued into a smoking hole in the ground?

      They haven't been sued into a smoking hole in the ground _yet_.

      But one thing is for sure - _something_ is going to happen, because it can't simply go on as-is.

    5. Anonymous Coward
      Anonymous Coward

      Re: So why haven't they been sued into a smoking hole in the ground?

      it takes time to discuss your option with your legal team ;) And to look around how others go about doing it, having gone through 'talking to their legal team'. You need to allocate funds (likely blowing the budget many times over), brainstorm the whole idea with your board (perhaps not in this order). But this court chapter IS coming...

  6. elsergiovolador Silver badge

    Confidential

    I wonder if ChatGPT also learned some confidential information about certain corporations.

    For instance if you ask it if company X has done a project related to Y for Z, when it knows something, it will even give you the internal project name and quite a lot of detail.

    Then when you try to find it in official company materials there is nothing about it.

    If some information is true and not a hallucination, then some companies may be in a pickle.

    1. Anonymous Coward
      Anonymous Coward

      Re: Confidential

      It's been blocked at our company's firewall for just that reason along with warnings about using it for anything company related.

      1. Roland6 Silver badge

        Re: Confidential

        What exactly have you blocked?

        I’m not aware of any advisories about blocking GPT’s LLM model building web crawlers. Even the advisories about disallowing web crawlers from scanning your (public) website note the crawler blocking is crawler specific.

        About the only thing you can block is user access to the ChatGPT website and interactive usage of the previously created LLM.

        1. Anonymous Coward
          Anonymous Coward

          About the spiders...

          You can't stop them from scraping your site if they want to ignore the robots.txt unless you bury it behind a login page and don't give them a login.

          There is a magic GUID that might help, if you know it, but again, only if the person crawling the site wants to obey it. And many of the companies training these models aren't crawling and scraping it themselves. They are paying for it, or just grabbing stuff if someone else posted it in a public forum, regardless of the licensing. (eg CC-Attribution, which can't be attributed inside an LLM at this point, so making the public offering of the resulting model in violation of the license).

          Not like that is a new problem. You can write some code to detect the user agent and have your server block them, but that can be faked easily enough (also the basis for ad fraud and a bunch of other shady stuff). If you don't want it being scraped, the sole effective choice is not to post it. Which is a pretty crappy choice for a lot of people, creatives especially.
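
          A rough sketch of the two mechanisms mentioned above, using only Python's standard library. The crawler name "ExampleAIBot", the robots.txt contents, and the blocklist are all made up for illustration. The first function only reports what a well-behaved crawler would conclude from robots.txt; the second is the server-side user-agent check, which a crawler can sidestep by sending a different header.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; "ExampleAIBot" is an invented crawler name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

# Hypothetical blocklist of lowercase user-agent substrings.
BLOCKED_AGENT_SUBSTRINGS = ("exampleaibot",)


def robots_allows(user_agent: str, url: str) -> bool:
    """What a crawler that chooses to obey robots.txt would conclude."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


def server_should_block(user_agent_header: str) -> bool:
    """Server-side check on the User-Agent header; trivially spoofable."""
    header = user_agent_header.lower()
    return any(token in header for token in BLOCKED_AGENT_SUBSTRINGS)


if __name__ == "__main__":
    page = "https://example.org/drafts/plan.html"
    print(robots_allows("ExampleAIBot/1.0", page))
    print(robots_allows("Mozilla/5.0", page))
    print(server_should_block("Mozilla/5.0 (compatible; ExampleAIBot/1.0)"))
```

          Neither check does anything against a scraper that lies about who it is, which is the point being made above.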

    2. Michael Wojcik Silver badge

      Re: Confidential

      ChatGPT is not learning anything – its training is done and it doesn't dynamically update.

      Hybrid systems like Microsoft's Sydney can perform searches against current online content, and so even though its weights are no longer being updated, it has that as a secondary learning mechanism. And because it can find records of its own output (in tweets and blog posts and published papers and the like) it even has fragmentary memory.

      We already know of cases where people have put proprietary information into LLM prompts, and only recently did OpenAI, for one, introduce an option to not have prompt data saved for ingestion by future systems. I don't know whether Sydney/Bing offers such an option.

      And, of course, information can be revealed inadvertently through correlation. De-anonymization studies show just how much information that was presumed private can be derived this way; that's why differential privacy is A Thing.

      1. This post has been deleted by its author

    3. Anonymous Coward
      Anonymous Coward

      Re: Confidential

      Additionally, you don't know if the content ChatGPT is producing is _true_ or not.

      What if it's not true? Is that company going to sue ChatGPT for some sort of defamation?

      What happens if in a court ruling it is ruled that someone _didn't_ commit a horrible crime, but because "people on the internet" keep saying that they did do it ChatGPT begins to say that they did do it? Can they sue for defamation, etc?

      I feel these aren't just hypotheticals, and will all occur in the near future...

  7. Mike 137 Silver badge

    "The authors note that science fiction and fantasy books dominate the list"

    Does this partly explain some of the vagaries of the output? Departure from reality is inevitable if you train on fantasy. But it does surprise me that this seems not to have occurred to the trainers.

    If we want to use LLMs for more than just entertainment, they will have to be tied much more firmly to reliable information on the real world, and that poses a big problem if web content is the source, as most of it is not.

    1. Michael Wojcik Silver badge

      Re: "The authors note that science fiction and fantasy books dominate the list"

      I doubt the effect is significant. There's a vast amount of misinformation, deliberate and accidental, in the "real world" and supposedly factual documents. And that doesn't include pseudo-deliberate (i.e. undesirable gradients emerging from prompt directions that aren't directly about the topic at hand) falsification and hallucination by LLMs, such as Waluigis.

    2. Strahd Ivarius Silver badge
      Joke

      Re: "The authors note that science fiction and fantasy books dominate the list"

      To be honest, they trained their model on economics books, not being aware that they are SF and/or fantasy...

    3. Anonymous Coward
      Anonymous Coward

      Departure from reality

      No, the subject matter isn't really the issue so much as drowning the clean and useful training data in garbage. It will bias the outputs, to be certain, but not necessarily in the ways you might think. It's a reflection of a snapshot of parts of the online world from when it was trained after all. So the content it creates will be familiar to the terminally online.

      But if you try to use that model for domain specific tasks outside Star Trek trivia and furry porn, be prepared for a bumpy ride.

      But also using these for knowledge based queries when they contain internet scraped material is a fool's journey. The model doesn't know anything beyond what the probabilistic combinations of patterns are. So if you ask a question that coincidentally generates a lot of meme replies, expect poor accuracy from a facts standpoint.

      But it will probably tell you what plant Klingon tea is made from, because nobody cares about mentioning anything other than the correct answer.

  8. TheMaskedMan Silver badge

    "Bang! Who do your relatives sue?"

    Those responsible for failing to instill in the ex-trichycoographer enough critical thinking / common sense to prevent them from blindly following instructions without the knowledge to be certain they're correct.

    It's depressing that the first thought, upon the demise of an untrained Muppet attempting something that they don't know how to do is who can we sue rather than the daft bugger should have known better.

    All that said, links back to sources would be a very useful thing, and I'd like to see that implemented.

    1. Zippy´s Sausage Factory

      It's depressing that the first thought, upon the demise of an untrained Muppet attempting something that they don't know how to do is who can we sue rather than the daft bugger should have known better.

      On the other hand, there are clear laws. And it's long been held that "ignorance of the law is no defence".

      "Did you kill him then?"

      "Um, yep. Poisoned his tea and everything."

      "Then I'll have to arrest you for -"

      "Wait, that's illegal? Killing someone is illegal?"

      "Yes. Yes it is."

      "Nobody told me."

      "Oh well, we'd better let you off then. Just don't do it again, eh?"

      1. Anonymous Coward
        Anonymous Coward

        It works for politicians, does it?

    2. Neil Barnes Silver badge

      blindly following instructions without the knowledge to be certain they're correct

      But if they had the knowledge to be certain, they'd not need to be following the instructions. They'd already know how to do it.

      I don't think it's possible to give in every case instructions which can be followed by someone who initially has not a clue how to do something; the detail would be both too much and yet still too little.

      From https://chat.openai.com/ with the question 'how do I launch a paraglider':

      Launching a paraglider involves a number of steps, and it is important to follow proper procedures to ensure safety. Here are the basic steps to launch a paraglider:

      1. Check your equipment: Make sure your paraglider, harness, and helmet are in good condition and properly adjusted.

      2. Choose a launch site: Look for an open area that is free of obstacles and has a clear approach to takeoff.

      3. Set up the glider: Spread out the wing and attach the lines to the harness. Check that the lines are untangled and the risers are straight.

      4. Check wind conditions: Evaluate the wind direction and strength to determine if it is suitable for launching. You should have a steady wind of about 10-15 km/h.

      5. Inflate the glider: With the wind at your back, take hold of the risers and pull them up while running forward. This will inflate the wing and get it ready to lift you off the ground.

      6. Check your position: Once the wing is inflated, check that it is directly above you and that you are facing into the wind.

      7. Take off: When everything is in order, run forward, accelerating gradually as the wing lifts you off the ground. Keep your body position upright and centered beneath the glider.

      8. Fly away: Once you are airborne, control your direction and altitude by manipulating the brakes and weight shifting. Enjoy the ride!

      It is important to note that paragliding can be dangerous if not done properly, so it is recommended to take lessons from a certified instructor before attempting to fly on your own.

      Which is ok in principle, I suppose, but (2) is insufficient; (3) requires technical knowledge of the equipment to do correctly; (4) is plain wrong - a zero wind launch is perfectly feasible, and it makes no recommendation of wind direction; (5) seems to have the pilot running directly at the wing; (6) makes no mention of the importance of turning in the correct direction; (7) doesn't explain how you accelerate once your feet are off the ground; (8) is completely insufficient. Enjoy the crash!

      At least there is a warning and a suggestion to take lessons, though I would suggest 'is' instead of 'can be' dangerous if not done properly...

      And yet someone who's just picked up a cheap wing off ebay and watched a couple of youtube videos might think this sufficient to start flying.

  9. Long John Silver
    Pirate

    Death throes for copyright beckon

    Post WW2 developments in access to computation, in power of computation, interconnection of computers, and the nowadays key role of digital representation of information, have unleashed qualitative changes to life of great magnitude which previously were the realm of speculation. Quantitative changes, e.g. ease and speed of communication across the globe for ordinary people, impact life considerably too, but these are easily comprehended enhancements to analogue broadcasting, telephony, and so forth.

    Few people grasp just how significant the "paradigm shift" (Thomas Kuhn) is. Only people who have lived the entirety of post-WW2 times can take a broad view of conceptual and societal changes that are being wrought, and which depend upon modern information storage, processing, and transmission. Put differently, an explosion of technological developments produces concomitant enlargement of opportunities (for good and ill). Similar events have occurred throughout history, e.g. the Italian Renaissance, but at a much smaller scale.

    Unlike times such as circa 1800 until WW1 - a period when the First Industrial Revolution (of machines and tangible products) was gathering pace, a time of intellectual ferment, an era of the polymath - present day imaginations among leaders of thought (e.g. academics) and leaders of societal governance (politicians nowadays as dumb as the masses they 'represent') are single track with narrowed horizons: they follow very few threads in the immensely complicated tapestry of modern life. They are incapable of standing back and making an attempt at perceiving the whole.

    What eludes leaders of academia, industry, commerce, and governance, is appreciation of the qualitative-change aspect of post-WW2 'Western' life. By simply following threads one cannot grasp much other than quantitative gains in a particular area of thought and its applications.

    The concept of 'intellectual property' (IP), and its anticipated fate during what may be termed The Second Industrial Revolution, that is return of much manufacturing to cottage basis (aided by 3D-printing of widgets, pharmaceuticals, etc.) offers a case study with parallels to other qualitative changes in progress or anticipated.

    The biggest blow to the specious concept of IP arose from digital representation of information which made a clear distinction between 'medium' and 'message'. Unlike print on paper, images on celluloid, and wave patterns on vinyl disks, representation as digital sequences can easily be shifted from one medium to another, easily copied with complete accuracy, and cheaply transmitted to other locations. The edifice of property rights, these mirroring those applicable to physical property, when applied to ideas, their expression, and to derivation therefrom, has collapsed as intellectually sustainable, and, all bar the shouting, become unenforceable in law.

    The 'single thread followers' are approaching a rude awakening. They have not considered that idea production can be incentivised by means other than a ramshackle structure of anomalous and restrictive law. However, it's far too late for the major, and monolithic, publishers, distributors, and the patent reliant, straddling the planet, to adopt a differing business model. They shall go the way of the Luddites. This collapse hastened by fortuitous diminution of the USA economic hegemony, coupled with global multi-polarity. ChatGPT, and its like, merely underlines the hopeless case for copyright rentiers.

  10. Tron Silver badge

    May I offer a solution to this...

    Release a plain vanilla baby chatbot and allow users to pick and choose which texts it consumes. They would of course have to purchase an electronic copy of copyrighted works before their baby chatbot could read them, and add them to its memory. Out of copyright works could be read by it for free, as could publicly accessible web pages.

    Your baby chatbot would learn what you would allow it to learn. It could be based entirely on the Potterverse, or James Joyce or Das Kapital or 4chan and de Sade, as well as the plain vanilla stuff like Wikipedia, creating its own 'personality'.

    We can each rear our own chatbot, and then share it online.

    We can even mate chatbots, the two pouring their shared knowledgebase into a new baby.

    Much cool stuff to do.

    1. Anonymous Coward
      Anonymous Coward

      Already built mine

      Check out the bits from that leaked google "Moat" memo, it will go over a bunch of it, and also it's interesting reading.

    2. tiggity Silver badge

      Re: May I offer a solution to this...

      It's already possible to use AI systems and just have them give answers based on defined data you feed it (e.g. you could feed it your product guides and use it as a "customer support helper") - but take advantage of the wider "knowledge training" of the AI e.g. it can accept questions / give replies in a language different to that of the source material you have told it to base answers on. (Translation quality is at the mercy of how good its general training was, but based on playing with the languages I have a rudimentary few years of schooling knowledge of, the "AI" translations are often better than expected, certainly "passable" most of the time, and translations tend to be better the more well worded / precise the question is, as that reduces the going off at a wild tangent type of replies you sometimes get). It's not up to the job of translating your training documents from language A to language B as well as a human translator would, but manages a good stab at giving reasonably coherent answers in language B when training data was only in language A. Does struggle a lot if key data in training documents is in image / diagram form, quite often have to tweak input docs to be more AI friendly when key content is in images but not text (as is often the case in e.g. PDFs describing products / processes).

      Obviously you do not use a public AI for this sort of training and use but a private one (stares at Samsung - https://www.theregister.com/2023/04/06/samsung_reportedly_leaked_its_own/)

  11. bertkaye

    just a yes or no would do

    I did get suspicious when I asked ChatGPT for a weather report and it responded with "It was a dark and stormy night; the rain fell in torrents--except at occasional intervals, when it was checked by a violent gust of wind which swept up the streets (for it is in London that our scene lies), rattling along the housetops, and fiercely agitating the scanty flame of the lamps that struggled against the darkness."

    Hey, all I wanted was to know whether to take an umbrella with me going out for takeaway curry, not read a Victorian novel.

    1. PRR Silver badge

      Re: just a yes or no would do

      > ...it responded with "It was a dark and stormy night;..." -- Hey, all I wanted was to know whether to take an umbrella with me.., not read a Victorian novel.

      Thanks! I never knew where that came from.

      And FWIW: It is soon coming up on TWO hundred years old, and is pre-Victorian. So Georgian period. Who knew??

      Novel: 'Paul Clifford' by Edward Bulwer-Lytton, 436 pages, First published January 1, 1830.

      Victorian era was from 20 June 1837.

      1. bertkaye

        Re: just a yes or no would do

        You are a scholar with an inquiring mind. Bulwer-Lytton was ahead of his time you see.

  12. This post has been deleted by its author

  13. Murphy's Lawyer
    Mushroom

    Same old tech(bro) bias

    Absolutely no surprise that this has been fed SFF that (mostly) white (mostly) blokes like. At least Celeste Ng and Nnedi Okorafor won't have to reach for their lawyers.

  14. FeepingCreature

    "Memorized passages"

    In other words, the "Quotes" section of goodreads.com violates copyright on a massive scale.

    Everyone jumping on the "ChatGPT Bad" bandwagon misses the forest for the trees. The model knows these passages because they're being quoted massively on the Internet. There are hugely popular websites dedicated entirely to gathering quotes. Inasmuch as the model violates copyright, it's only because nobody else has ever given a damn about it.

    Ask it about song lyrics next! I reckon they'll all be in there - not because OpenAI have maliciously fed it with copyrighted datasets, but because copyrighted segments of text on an internet crawl are as common as dirt.

  15. adam 40
    Big Brother

    If a human memorises a book...

    ... is that also a copyright offence? A thought crime?

    If not, why take it out on poor old AI's?

    1. Anonymous Coward
      Anonymous Coward

      Re: If a human memorises a book...

      if a human memorises a book AND they make it available to others (with the exception of their family, friends, etc, depending on this or that legal system) AND PARTICULARLY if they do it for profit, I imagine that, yes, they would be sued.

  16. IglooDame
    Terminator

    As long as AI isn't being made to 'read' the novelizations of War Games, The Matrix series or the Terminator series (or SF literature of a similar vein), I'm okay with it.

  17. vimalkumarkansal12345678

    LLMs generated Text — Possible Legal Ramifications and an AI’s take on it

    After reading this article I wrote a blog post, read it here : https://medium.com/@vkansal/llms-generated-text-possible-legal-ramifications-and-an-ais-take-on-it-b3676c3a789d

  18. Mostly Irrelevant

    I think it's just a matter of time until we get a ruling that an AI model is a derivative work of its training data. It's the only way they're going to be able to continue IP protections in a generative AI world.

  19. YouStupidBoy

    Fly fishing

    But does it have an accessible copy of Fly Fishing by J.R. Hartley?

  20. Anonymous Coward
    Anonymous Coward

    Again?

    Again the cry is ‘It’s not black enough’? Is that the gist here? Is this not reminiscent of Google’s ‘messy hair’ search results? Change the AI it’s too … AI, make it woke.

  21. Frogmelon

    "Hey Chat GPT! Read to me "Harry Potter and the Philosopher's Stone!"

    "Absolutely, but due to copyright laws I must paraphrase and not read back to you verbatim.

    Which author's style would you like me to use?"

    "I'd like you to read it to me in the style of Frank Herbert."

    "OK, here is Harry Potter and the Philosopher's Stone in the style of Frank Herbert."

    "A beginning is a very delicate time. Know then, that it is the year 1991..."

  22. call-me-mark

    Minor nitpick but...

    The Lord Of The Rings isn't a trilogy. It's a single novel that is sometimes published in three volumes.
