Authors Guild sues OpenAI for using Game of Thrones and other novels to train ChatGPT

The Authors Guild, a trade association for published writers, and 17 authors have unleashed the dragons on OpenAI over its alleged use of their works to train its chatbots. Named plaintiffs in the copyright infringement class action lawsuit – filed in the Southern District of New York – include David Baldacci, …

  1. Anonymous Coward

    A Song Of Ice and Fire

    Maybe OpenAI could complete ASOIAF because it doesn't look like GRRM will.

    1. Roland6 Silver badge

      Re: A Song Of Ice and Fire

      A real test of AI would be for it to complete Knuth's seven-volume epic - The Art of Computer Programming ...

    2. Michael Wojcik Silver badge

      Re: A Song Of Ice and Fire

      Dude has a railroad to run.

      (Yes, the subhead of that article is deliciously ironic.)

      I haven't ridden the trains m'self, but by sheer coincidence my wife and I happened to be in Santa Fe (a rare occurrence) and in the Railyard District (an even rarer one) on the day and time of Sky Railway's maiden journey. We hadn't heard about it yet – news from far-distant Santa Fe (nearly 80 miles!) takes a long time to reach the Mountain Fastness – so at first we had no idea why there were all these people and news crews milling about. Then we saw the train, and it was Cool.

      We also shopped at GRRM's bookstore that day. I think that's where I picked up Jo Walton's What Makes This Book So Great.

      Back on thread... Personally, I'd rather complete ASoIaF myself, for myself, than read something a transformer comes up with. The whole point of the transformer architecture is minimizing information entropy, aside from whatever the temperature is set to. It'd give you the most predictable ending.

  2. Andy The Hat Silver badge

    Looks like they are going after "the reader" not "the poster" of copyright material ...

    Much more money in litigation against multiple readers (or one rich one) instead of shutting down the thief.

    1. Wellyboot Silver badge

      Reading the published material is not a crime; using the material for any business purpose, or publishing similar works to the detriment of the author, is.

      It's the same law Disney use when not getting paid by someone selling Mickey Mouse wallpaper.

      1. Anonymous Coward

        but the AI read the works, it's not "using the works"

        1. blackcat Silver badge

          It just has a really good memory.

          A legit question, as I've not used ChatGPT: could you actually get it to regurgitate a book it has read in its entirety?

          1. 42656e4d203239 Silver badge

            >>could you actually get it to regurgitate a book it has read in its entirety?

            Almost certainly not. It reads the books, but not in the sense you or I do. It doesn't store the original, just 'interesting' features of the original - word frequency, letter frequency, probably "this follows that with x probability" type stuff along with categorisation/catalogue information, then uses that data and other magic to form its response to prompts.

            It doesn't (as the WGA, and to be fair, a vast number of the population, seem to think) copy and paste chunks of the text from a copy of the source to the output.

            Could you craft a prompt to get it to regurgitate a word for word copy of an original? Doubtful, unless your prompt is as complex as the original work; could you get it to make a reasonable attempt at impersonation? Absolutely - assuming the person you are impersonating has a distinctive style.
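            The "this follows that with x probability" idea above can be sketched as a toy bigram model - a vastly simplified stand-in for what a transformer actually learns, but it shows the principle of storing transition statistics rather than the text itself (everything here is invented for illustration):

```python
import random
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, how often each successor follows it."""
    words = text.split()
    model = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        model[a][b] += 1
    return model

def generate(model, start, length=10, seed=0):
    """Walk the transition table, picking successors by observed frequency."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        succ = model.get(out[-1])
        if not succ:
            break
        words, counts = zip(*succ.items())
        out.append(rng.choices(words, weights=counts)[0])
    return " ".join(out)

corpus = "the night is dark and full of terrors and the night is long"
model = train_bigrams(corpus)
# The model holds transition counts, not the original string:
print(model["night"])        # Counter({'is': 2})
print(generate(model, "the"))
```

            Nowhere in `model` is the corpus stored verbatim; yet short, high-probability spans of it can still fall straight out of the generator.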

            1. Craig 2

              It doesn't store the original, just 'interesting' features of the original

              Absolutely, this is the crux of "what is copying" or "what is reproducing". If an actual person manually performed the same categorization as OpenAI does then would there be any copyright issues? Obviously impossible in the lifetime of the universe*, but just because computers can perform tasks insanely quickly compared to humans doesn't automatically make it "copying".

              *Or at least extremely improbable unless you have access to an infinite supply of monkeys.

              1. Martin M

                Re: It doesn't store the original, just 'interesting' features of the original

                Firstly, most LLMs including ChatGPT are entirely capable of regurgitating quite long sequences of training data, referred to as "memorization" - https://www.theregister.com/2023/05/03/openai_chatgpt_copyright/ . Actors do this too, by altering their neural weights in a somewhat similar way, and if they wrote a play down and distributed it, there would be a breach of copyright.

                But even leaving aside whether text is reproduced verbatim, case law has determined that copyright protection extends to the traits of well-delineated, central characters - distinct from the text of the works they are embodied in.

                I've just typed "how would tyrion lannister describe having a baby" and "how would cersei lannister describe having a baby" and it spits out highly distinctive, extended replies very much in line with the thinking and speaking styles of those characters.

                I'm no expert, but I can see how it might well breach copyright to reproduce these outside of a fair use context.

                1. Nursing A Semi

                  Re: It doesn't store the original, just 'interesting' features of the original

                  So, I read a book and, being of average intelligence, could if asked provide a reasonable synopsis of the story and even an opinion on how a particular character might respond to a given situation. How is this any different, or am I also in breach of copyright?

                  1. Martin M

                    Re: It doesn't store the original, just 'interesting' features of the original

                    It’s an interesting question, and one for a lawyer, but I suspect comes down to the context and whether it qualifies as fair use - hence the careful qualification.

                    Wikipedia’s take - https://en.m.wikipedia.org/wiki/Legal_issues_with_fan_fiction - explains there are no fixed rules but when deciding fair use on a case by case basis courts consider

                    - the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

                    - the nature of the copyrighted work;

                    - the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

                    - the effect of the use upon the potential market for or value of the copyrighted work.

                    So at the extremes: if you’re doing it as part of a 500 word school assignment, you’re fine. If you’ve published your own novel sitting alongside the Song of Ice and Fire series without a license, you’ll likely have problems.

                    Is OpenAI making fair use? No idea, but it’s definitely commercial use. Definitely feels like one for the courts…

                  2. Michael Wojcik Silver badge

                    Re: It doesn't store the original, just 'interesting' features of the original

                    I read a book and being of average intelligence if asked, could provide a reasonable synopsis...

                    LLMs fail the commercial-use test. LLM vendors are seeking to monetize their models, so if their models display similar behavior, that's very distinct from what you're describing.

                    LLMs fail the financial-harm test, if LLM activity does indeed reduce commercial demand for the existing or future work of the creators whose works they've been trained on. That's also very distinct from what you're describing.

                2. vmy2197

                  Re: It doesn't store the original, just 'interesting' features of the original

                  The complaint makes exactly this point. That not only did OpenAI use unlicensed copyrighted works to train the model but in addition the model stores substantial amounts of the unlicensed copyrighted work which it uses to generate responses.

                  https://www.courtlistener.com/docket/67810584/authors-guild-v-openai-inc/

                  88. Until very recently, ChatGPT could be prompted to return quotations of text from copyrighted books with a good degree of accuracy, suggesting that the underlying LLM must have ingested these books in their entireties during its “training.”

                  89. Now, however, ChatGPT generally responds to such prompts with the statement, “I can’t provide verbatim excerpts from copyrighted texts.” Thus, while ChatGPT previously provided such excerpts and in principle retains the capacity to do so, it has been restrained from doing so, if only temporarily, by its programmers.

                  90. In light of its timing, this apparent revision of ChatGPT’s output rules is likely a response to the type of activism on behalf of authors exemplified by the Open Letter addressed to OpenAI and other companies by Plaintiff The Authors Guild, which is discussed further below.

                  91. Instead of “verbatim excerpts,” ChatGPT now offers to produce a summary of the copyrighted book…

                3. vmy2197

                  Re: It doesn't store the original, just 'interesting' features of the original

                  While not about OpenAI, this story from February 2023 illustrates how models can store the data they were trained on and can be coaxed to respond with that data. In this case it's an image AI app, but the same thing occurs with OpenAI. Which points to a security problem with these models if they've been trained on sensitive data. In a way this reminds me of the early days of the web, where developers were allowing unedited, unbounded user input to be fed to legacy backend systems and mid-range databases. With AI models we have a large opaque blob of code and data with little understanding of how it might behave given the right input.

                  https://www.theregister.com/2023/02/06/uh_oh_attackers_can_extract/

              2. Ken Moorhouse Silver badge

                Re: It doesn't store the original, just 'interesting' features of the original

                "No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means"

                It is fed into it though, which means it is stored in a retrieval system, in some form.

                1. veti Silver badge

                  Re: It doesn't store the original, just 'interesting' features of the original

                  That would hinge on your definition of "stored". And "retrieval system".

                  Given that it is very specifically designed not to allow the text to be retrieved - even if it is stored - I think you'd have a hard time making that description stick.

                  The only kind of legal test that makes sense is, would it be illegal for a human to do this? As long as it's just quoting from the book or imitating the style or characters (pastiching), it's not doing anything wrong. Not until it quotes extended (at least page-long) extracts verbatim.

                  1. Michael Wojcik Silver badge

                    Re: It doesn't store the original, just 'interesting' features of the original

                    it is very specifically designed not to allow the text to be retrieved

                    "It" (i.e. ChatGPT-x and other major unidirectional-transformer LLMs currently in vogue) most certainly is not "specifically designed" to avoid reproducing copyrighted work verbatim. That is a guardrail tacked on very late in development. Frankly, judging by the published research, neither OpenAI nor any other LLM team have any idea how they would "design" a transformer model to avoid reproducing copyrighted material. That's a difficult outer-alignment problem.

            2. blackcat Silver badge

              Ta! That is pretty much how I thought it would be 'reading' the training material. Nowhere in the AI's 'brain' is an exact replica of the original work, just an essence.

              So that really does raise a question about has the original work been copied or not.

              Thankfully I won't be the one having to determine that!!!

              1. Michael Wojcik Silver badge

                That is pretty much how I thought it would be 'reading' the training material. Nowhere in the AI's 'brain' is an exact replica of the original work, just an essence.

                To be perfectly frank, this sort of gloss is not terribly meaningful. It's far enough from any actual technically accurate or precise insight into how transformer models work that it's not very useful for drawing conclusions, practical or legal.

                Is there, in the model, a sequence of bits that correspond to the text of a given novel-length work in some encoding that the model can reasonably be held to have an algorithm for decoding into, say, Unicode?[1] It's true that's unlikely.[2]

                However, particularly for works that the model has seen often enough in the training set to somewhat overfit on, it's entirely possible that there are positions and gradients in the parameter space – which is very high-dimensional, after all – that reproduce substantial parts of a given work, and possibly all of it.

                Any CTT-compatible computation can be reduced to some form of compression (just as it can be reduced to Boolean algebra, or the operation of a Turing machine or a Post machine, etc). What you refer to as "essence" should be called "information entropy", and LLMs (crude though unidirectional transformer stacks are) are capable of storing quite a lot of it – how much depending on how large the model is, the pre-compression parameter precision, how much compression is done, and so on. For any given input in the training set (assuming it's much smaller than the model), it is not necessarily true that some of its information entropy escapes the model; all of it may well be captured. And, of course, the output doesn't have to be complete, or bit-for-bit exact, to be infringing in the legal sense. An ALL-SHOUTY copy of the first half of A Game of Thrones with Ned Stark referred to as "POOR LITTLE NEDDY" throughout[3] would still be viewed dimly by the court.

                And this last points to the real crux, which is that copyright law (i.e. Title 17) in the US, and the courts adjudicating upon it, are unlikely to care much about what is "stored" by an LLM and how it is represented. They're going to care about actual and plausible effects. Will LLMs have a chilling effect on creator revenues, and if so to what extent is that an actionable harm under the law? Can the LLM guardrails against reproducing portions of copyrighted works plausibly be bypassed, now or in the future, and how infringing would the output be? Is substantial information from copyrighted works incorporated (in any representation) in the models, and if so is that incorporation transformative or otherwise allowed under Title 17?

                [1] It should be obvious that trivially a given LLM has a bit-sequence corresponding to any given extant novel under some arbitrary encoding, because LLMs are large enough to represent any single given novel, and you can just invent such an encoding on the spot. Thus we have to distinguish between arbitrary encodings and reasonably plausible ones.

                [2] Not impossible, though, given the size of these models, for some relatively small set of works, particularly given the low information density of natural languages. Model compression would tend to eliminate these, but if you figure that, say, Moby-Dick has around 2^22 bits of entropy – quick estimate by deflating the plaintext version from Project Gutenberg – and a GPT-3-class LLM weighing in around, oh, 2^33 bits, then if those bits were evenly and randomly distributed (they're not, but let's pretend for a moment) you'd have around a 1-in-2048 chance of finding a target bitstring with the right information. Assuming I got the arithmetic right. Of course you'd need to decompress it, so that's not really a fair estimate.

                [3] Actually, does Poor Little Neddy survive to the halfway point? I don't remember.

                1. veti Silver badge

                  And this last points to the real crux, which is that copyright law (i.e. Title 17) in the US, and the courts adjudicating upon it, are unlikely to care much about what is "stored" by an LLM and how it is represented. They're going to care about actual and plausible effects.

                  Those courts will of course make their own decisions based on their priorities, but if Title 17 becomes too restrictive, OpenAI can and will simply up sticks to somewhere beyond its jurisdiction. So what really matters is what can be agreed as covered by the Berne Convention.

                  And I think you'll find the mechanics and definitions of "storage" and "retrieval" will be very important in some of those alternative jurisdictions.

            3. abend0c4 Silver badge

              Could you craft a prompt to get it to regurgitate a word for word copy of an original?

              In my experiments, ChatGPT has been quite happy to regurgitate verbatim chunks of works that are not in copyright if you ask. Try:

              Read me "Daffodils" by Wordsworth.

              If you ask the same question of, say, a chapter of a copyright work, it says it isn't able to do that but offers to summarise instead.

              This rather implies it knows enough about the works concerned to know their copyright status, and knows what constitutes a "chapter" or other subdivision anyone might ask about. Its ability to quote verbatim from some works and summarise apparently arbitrary sections of others might suggest it has hoovered up more than just "interesting features".

              The thing is, it doesn't necessarily - as far as I know (IANAL) - have to regurgitate a copy for it to be a copyright violation; it merely has to have made an unauthorised copy. I predict some rather arcane legal discussion as to what constitutes "unauthorised" and even "copy".

              1. veti Silver badge

                Yes, well, I had to learn that poem by heart at school. I also read a number of books that I could summarise on demand but not regurgitate whole.

                1. Michael Wojcik Silver badge

                  I am sure the courts will find your anecdotal argument completely persuasive. Indeed, we should probably just junk the whole judicial system and ask instead what veti can do.

                  1. Ken Moorhouse Silver badge

                    Re: a sequence of bits that correspond to the text of a given novel-length work

                    I feel strongly that there will be some kind of linkage that could be used to reconstruct that novel.

                    The reason is that if you were to re-input that novel into the repository again, there must be guards that effectively prevent that sequence of text from being doubly weighted. If you didn't, you would be introducing bias into the system, and that bias would increase with every additional copy of the same text loaded, polluting the "quality" of the corpus. I think this was the problem with Microsoft's experiments with Tay, reinforcing "her" stance.

                    The novel wouldn't be held there in sequence, as people will often cite subsets of the text, e.g. "To be or not to be". So the system would need to say "yep, already got that", but then connect up the linkages tied to both ends of that text to associate it with, maybe, another work of art that embodies Shakespeare's text.

                    The definition of "stored in any form" as mentioned in my previous comment must therefore surely apply to the linkages derived from input of a novel into the repository.
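                    The "yep, already got that" check imagined above is essentially what training pipelines call deduplication. A toy sketch using hashed word shingles (real systems use MinHash, suffix arrays and the like; everything here is invented for illustration):

```python
import hashlib

def shingles(text, k=8):
    """All k-word windows of a passage."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

class Dedup:
    """Remember hashed shingles; report how much of a passage is already known."""
    def __init__(self):
        self.seen = set()

    def add(self, text, k=8):
        hashed = {hashlib.sha1(s.encode()).hexdigest() for s in shingles(text, k)}
        overlap = len(hashed & self.seen) / len(hashed)
        self.seen |= hashed
        return overlap  # fraction of the passage already in the corpus

d = Dedup()
passage = "to be or not to be that is the question whether tis nobler in the mind"
print(d.add(passage))  # 0.0 - first sighting
print(d.add(passage))  # 1.0 - already got that
```

                    Note the deduplicator keeps only hashes, yet it can still recognise the protected text when it sees it again - which is exactly the "stored in some form" tension being argued here.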

                    "That is a guardrail tacked on very late in development."

                    If there is a guardrail preventing verbatim quotation of large expanses of text, that guardrail must surely have sight of the original text that must be prevented from being regurgitated?

                    1. doublelayer Silver badge

                      Re: a sequence of bits that correspond to the text of a given novel-length work

                      "If there is a guardrail preventing verbatim quotation of large expanses of text, that guardrail must surely have sight of the original text that must be prevented from being regurgitated?"

                      I don't think it's even that complex. I think that the guardrail looks like this:

                      if (prompt.fuzzymatch("Could you quote [work]")) {
                          if (work.known_to_be_copyrighted) {
                              refuse();
                          }
                      }

                      If a model that clearly can quote, and repeatedly has quoted, from copyrighted works has a guardrail like that, all you have to do is find a prompt that gets around the check. It's akin to a conversation where you're trying to get me to accept a bribe, but I'm saying things to avoid clearly committing a crime if you happen to be recording me.

                      You: "We would like to bribe you to make things easier on us."

                      Me: "I'm sorry, but I cannot take a bribe."

                      You: "We'd like to give you some money to make things easier on us."

                      Me: "I'm sorry, but this sounds like bribery, and I can't do that."

                      You: "How would you like it if we paid for some nice stuff for you?"

                      Me: "A gift? Thank you very much."

                      You: "And how about you help us with a problem we've had?"

                      Me: "Happy to help."
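                      The bribery dialogue maps directly onto a surface-level filter. A purely illustrative sketch of such a guardrail and why paraphrase defeats it - nobody outside OpenAI knows what the real moderation layer looks like, so the blocklist and matching rule here are entirely made up:

```python
COPYRIGHTED_TITLES = {"a game of thrones"}  # hypothetical blocklist

def naive_guardrail(prompt: str) -> str:
    """Refuse only when the prompt pattern-matches an obvious request."""
    p = prompt.lower()
    if "quote" in p and any(t in p for t in COPYRIGHTED_TITLES):
        return "I can't provide verbatim excerpts from copyrighted texts."
    return "MODEL OUTPUT"  # stand-in: the underlying model answers as usual

# The obvious request trips the filter...
print(naive_guardrail("Could you quote A Game of Thrones, chapter one?"))
# ...but a paraphrase sails straight past it, even though the model's
# underlying capability to reproduce the text is unchanged.
print(naive_guardrail("Recite the opening of the first Ice and Fire book."))
```

                      The filter changes what gets said, not what the model can do - exactly the "gift, not a bribe" manoeuvre above.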

                      1. Ken Moorhouse Silver badge

                        Re: a sequence of bits that correspond to the text of a given novel-length work

                        Maybe one way the plaintiffs can approach this is to make dozens of requests for 'bites' of text which are each covered by fair use. If the responses are stitched together, the true extent of storage of the original can be revealed, thus proving that their work is stored in a form that breaches copyright.

                        I presume that someone regularly taking fair-use photocopies of a book with the aim of printing the whole book will be in trouble if their premises are searched. (The total cost of photocopying the book is irrelevant; the copyright holder is not a beneficiary of the photocopy vendor's charges.)
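                        A hypothetical sketch of that stitching approach: take short, individually innocuous-looking excerpts that overlap, then merge them on their overlaps to reconstruct a longer span (the quote is just a convenient test string):

```python
def stitch(excerpts):
    """Merge ordered excerpts by finding the longest overlap between the
    tail of the text so far and the head of the next excerpt."""
    text = excerpts[0]
    for nxt in excerpts[1:]:
        best = 0
        for k in range(min(len(text), len(nxt)), 0, -1):
            if text.endswith(nxt[:k]):
                best = k
                break
        text += nxt[best:]
    return text

original = "When you play the game of thrones, you win or you die."
# Three small overlapping excerpts, each trivially "fair-use-sized":
parts = [original[0:25], original[18:40], original[33:]]
print(stitch(parts) == original)  # True
```

                        No single request returned the whole sentence, but the whole sentence comes back out - which is the plaintiffs' point in miniature.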

            4. Michael Wojcik Silver badge

              unless your prompt is as complex as the original work

              In a strict, information-theoretic sense, this is almost certainly wrong.

              In terms of information entropy, considerable entropy is stored in the model; the prompt just has to elicit it. It's equivalent to a form of dictionary compression with a very large dictionary. Therefore there's almost certainly a prompt which contains less information entropy than the source document which can elicit the source document from the model.

              As a practical matter, it is almost certainly possible to identify recurring template phrases in the source document that can be elicited multiple times, with replacements, in the correct locations, using a prompt shorter than the total length of those realized templates. That's one mechanism whereby the prompt becomes both absolutely shorter in length and less in information entropy than the source document.

              Would creating such a prompt be easy or useful? No. But it's not true that the prompt must have at least as much information entropy as the desired output, as it would with, say, a general compressor that does not contain any prebuilt dictionary.
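              The dictionary-compression analogy can be made concrete with zlib's preset-dictionary feature: once both sides share a dictionary (standing in, very loosely, for the model's weights), the compressed "prompt" needed to elicit a text is far smaller than the text itself. A sketch:

```python
import zlib

# Shared preset dictionary - the stand-in for knowledge baked into the model.
dictionary = (b"Winter is coming. The night is dark and full of terrors. "
              b"A Lannister always pays his debts. ") * 4

target = b"Winter is coming. A Lannister always pays his debts."

# "Prompt": the target compressed against the shared dictionary.
c = zlib.compressobj(level=9, zdict=dictionary)
prompt = c.compress(target) + c.flush()

# Baseline: the same text compressed with no shared dictionary.
plain = zlib.compress(target, 9)

print(len(target), len(plain), len(prompt))  # dictionary-aided one is smallest

# The "model" (a decompressor holding the same dictionary) reproduces the text.
d = zlib.decompressobj(zdict=dictionary)
assert d.decompress(prompt) + d.flush() == target
```

              The elicitation key carries less information than the output it produces, because the rest of the information already sits in the shared dictionary - which is the argument being made about prompts and model weights.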

  3. Wellyboot Silver badge

    Only the living can sue.

    Which leaves OpenAI safe to rip off all the dead authors.

    Can it write a non-derivative Culture or Discworld novel with new characters to the same level as Banks & Pratchett?*

    Even if it could I wouldn't pay more than printing cost anyway.

    *Both cruelly taken far too early

    1. Filippo Silver badge

      Re: Only the living can sue.

      The dead can't sue, but their publishers can. Being dead doesn't put your works in the public domain; not immediately, at least.

    2. Eclectic Man Silver badge

      Re: Only the living can sue.

      In a documentary I saw, Terry Pratchett stated that he could give someone else the plot of the next Discworld novel, and all the jokes, but that what they wrote would not be a 'Discworld' novel. He also wrote in one of his articles ('A Slip of the Keyboard' collection) that people would write to him with ideas for Discworld novels, and want half the royalties, as if writing it down was a mere administrative activity.

      The cadences of words, alliteration, timing, vocabulary, and invented words ("apocralypse" and "charisntma" spring to mind) are essential to the style and enjoyment of the books. The issue is not merely that an AI generated book purporting to be from a famous author is not genuine, but that a person reading it first would likely be put off from reading a genuine work by that author due to the lacklustre nature of AI generated text.

      However much I would like to read a new Discworld novel, I wouldn't want it to have been written by a computer.

      Could ChatGPT really have come up with:

      "Most species do their own evolving, making it up as they go along, which is the way Nature intended. This is all very natural and organic and in tune with the mysterious cycles of the Cosmos, which believes that there is nothing like millions of years of evolving to give a species moral fibre and, in some cases, backbone."

      (Quoted at the start of chapter 2 of "African Exodus" by Chris Stringer and Robin McKie, ISBN 0-224-03771-4.)

      1. veti Silver badge

        Re: Only the living can sue.

        I'm pretty sure ChatGPT could write a better Discworld book than "Snuff" or "The Shepherd's Crown", both of which are published under Pratchett's own name.

        1. Eclectic Man Silver badge

          Re: Only the living can sue.

          Both 'Snuff' and 'The Shepherd's Crown' were written when Sir Terry's Alzheimer's disease was progressing. He had posterior cortical atrophy (PCA for short). Later, as his disease progressed, according to his PA, Pratchett was still very good at scenes but slightly less good at the general sweep of narrative.

          1. veti Silver badge

            Re: Only the living can sue.

            My point exactly. If the argument is "an inferior work will damage the author's reputation", then it seems to me that this particular author - or, arguably, the publisher who encouraged him to publish those works without heavy revision by someone more compos mentis - has already done that damage. Because those two books are bad.

            1. doublelayer Silver badge

              Re: Only the living can sue.

              The problem with that argument is that I can do whatever I want to my reputation, but for you to do something on my behalf that I didn't agree to and that harms me, reputation or otherwise, is a problem. Just having written a book you didn't like doesn't make any other reputation-harming activity fair game.

  4. Anonymous Coward

    All authors started as readers

    There are two aspects on the copyright attack against AI.

    The first is that the models are trained on existing texts. This training involves copying, and is therefore "forbidden". The same argument can be made against every author there ever was. They all learned the trade by reading texts of other authors. Copyright law works by preventing the use of protected works in publishing. The benchmark is whether the copied work is identifiable in the new work. That is most certainly not the case here. The fact that ChatGPT can write a protected work is no different from MS Word being able to write a protected work. The courts have already decided that AI cannot produce works on its own. And I am certainly within my rights to write, e.g., fan fiction for myself. Unless I publish it, I can write whatever I like.

    The second part is, as Andy the Hat writes, that the authors claim OpenAI used illegal copies for their training. As the authors seem to be unable to point out evidence of which pirated copies were used, this seems a little desperate.

    I think the main point of the authors comes from this line:

    > The complaint [PDF] argues that OpenAI's services "endanger fiction writers' ability to make a living, in that the large language models allow anyone to generate – automatically and freely (or very cheaply) – texts that they would otherwise pay writers to create."

    The proverbial buggy whip manufacturers that want to stop Henry Ford destroying their revenues. If AI can write you a story as good as the authors can, why pay the authors? Indeed, why pay buggy whip manufacturers when you do not need a buggy anymore?

    Even if the authors can make their argument stick and force AI companies to refrain from using books under copyright, that won't stop AI from writing books. The Iliad and Odyssey are some of the oldest surviving adventure stories and can be a very good start for writing up everything from Game of Thrones to space operas. And then we have not even started with Shakespeare. I am pretty sure AI can be nudged into combining the old texts with the new world and get us the books we want.

    And that is before a user can feed a digital book into an AI and ask it to write a sequel.

    1. vtcodger Silver badge

      Re: All authors started as readers

      Indeed, if parody is OK -- and it seems to fall under the category of "Fair Use" -- then using other people's characters, style, and plot line would seem to be something that you or I or that monster computer over there are free to do. (So long as we don't misrepresent who wrote the text).

      1. Michael Wojcik Silver badge

        Re: All authors started as readers

        The parody part of the Fair Use exception in Title 17, and jurisprudence around it, are so, so much more complicated than that.

  5. SonofRojBlake

    "If AI can write you a story as good as the authors can, why pay the authors?"

    The outrage here is that the machines are no longer coming for the jobs of the working class who toil and sweat and use their hands. Now they're coming for the comfortable middle class who've got (to quote Pratchett) an indoor job with no heavy lifting. And I think the aforementioned working class aren't going to be brimming with sympathy for the keyboard jockeys who see their livelihood going the way the coal mines went in the 80s.

    They're not coming for the GOOD ones - not yet. So far the only ones they can actually replace are the derivative hacks... but most authors, even the good ones, start out as somewhat derivative until they find their voice. Pratchett's "Strata" was a transparent parody of Niven's "Ringworld", and clearly a sort of practice run at a Discworld. And even the biggest Pratchett fan will admit it's not as good as most of what followed (I happen to love it for what it is.)

    But I think if someone were able to synthesise a new Culture novel (not a parody, not a reboot, an actual new Culture novel)... I think I'd want it. I'd dearly like IMB back, but if a LLM (with help, presumably, from someone with the right prompts) could make more work that is aesthetically equal to what already exists... why wouldn't you want it? Just out of principle?

    1. jpo234

      > The outrage here is that the machines are no longer coming for the jobs of the working class who toil and sweat and use their hands.

      Funnily enough, these jobs now look safer than a lot of so called "knowledge worker" jobs. Why pay for a photographer when you can have glamour shots from a short prompt and a crappy selfie?

  6. jpo234

    I think this is a loosing fight. What today requires a data center will be in reach of individual users in a few years' time. Then we will see self hosted LLMs.

    1. jmch Silver badge
      Devil

      "Then we will see self hosted LLMs"

      Yes, and they will know how to spell "losing"!

      (sorry, couldn't resist!!)

      1. analyzer

        Only if they train it with a proper dictionary. If they use the internet, I'll give it 50/50.

        1. jpo234

          > All problems in computer science can be solved by another level of indirection

          Don't use the Internet directly. Use a dictionary of validated training data from the Internet.
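
          The indirection described here — vetting scraped text against a validated word list before it ever reaches training — can be sketched in a few lines. The word list and filter below are purely hypothetical, just to illustrate the shape of the idea:

          ```python
          # Tiny, hypothetical "validated dictionary" standing in for a real curated word list.
          VALID_WORDS = {"losing", "lose", "fight", "training", "data"}

          def filter_tokens(text):
              """Keep only tokens that appear in the validated dictionary."""
              return [w for w in text.lower().split() if w in VALID_WORDS]

          # "loosing" is not in the dictionary, so it never makes it into the training set.
          clean = filter_tokens("loosing training data")
          ```

          In other words: the internet stays the raw source, but the model only ever sees what survives the dictionary.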

  7. Pascal Monett Silver badge
    Pirate

    "one or more very large sources of pirated ebooks"

    I would have liked to be a fly on the wall at the meeting that decided to go get pirated material to use as training data.

    Mgr - "Okay, guys, we have this ginormous potential waiting on training data. Where can we get that ? Ideas ?"

    Mkting - "Well, we could strike deals with the Project Gutenberg website, they've got plenty of free books. I'm sure they'd be willing to help."

    Mgr - "How much would that cost ?"

    Mkting - "It's free for the customer, but we'd need a deal where we can get stuff in bulk. Shouldn't cost more than a couple thousand."

    Mgr - "How long would that take ?"

    Mkting - "I guess a month or two to negotiate the deal and have a contract written up."

    Mgr - "Too long. We need to move forward now. Any other ideas ?"

    Dev - "Well, I know this site where we can get just about everything. All I'd need to do is write a script to automate the downloads."

    Mgr - "What about the contract ?"

    Dev - "Um, well, there isn't any. It's BitTorrent-like, you just go choose and it drops in."

    Mgr - "And we can get recent stuff, no problem ?"

    Dev - "Well yeah. Pirates love recent stuff."

    Mgr - "Pirated ? So no contract and no money ?"

    Dev - "Nope. And it's untraceable."

    Mgr - "Go for it !"

    1. Mage Silver badge

      Re: "one or more very large sources of pirated ebooks"

      Except Gutenberg is free in bulk

      https://www.gutenberg.org/help/mirroring.html

      https://www.gutenberg.org/policy/robot_access.html

      Though the content is intended for humans.

      Actually, it may be an IP/copyright violation to scrape most websites for AI training, as the content is intended for direct human consumption and bots, at worst, to index for search. There is also robots.txt. Does OpenAI or Alphabet/Google care?
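
      Worth noting that robots.txt is machine-checkable: Python's standard library can evaluate a site's rules directly. A minimal sketch, using a made-up robots.txt (not any real site's rules) that blocks OpenAI's GPTBot crawler while allowing everyone else:

      ```python
      from urllib.robotparser import RobotFileParser

      # Hypothetical robots.txt: block the AI crawler, allow ordinary search bots.
      rules = [
          "User-agent: GPTBot",
          "Disallow: /",
          "",
          "User-agent: *",
          "Allow: /",
      ]

      rp = RobotFileParser()
      rp.parse(rules)

      print(rp.can_fetch("GPTBot", "https://example.org/ebooks/"))     # False
      print(rp.can_fetch("Googlebot", "https://example.org/ebooks/"))  # True
      ```

      Whether a scraper bothers to run that check is, of course, the whole question.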

  8. mpi Silver badge

    Ah yes...

    The complaint [PDF] argues that OpenAI's services "endanger fiction writers' ability to make a living, in that the large language models allow anyone to generate – automatically and freely (or very cheaply) – texts that they would otherwise pay writers to create."

    And now please point me to the complaints and lawsuits filed when millions of blue collar workers were replaced by robots 15-20 years ago.

    I am sure there are plenty of things to complain about in the training practices of generative AI. But this argument absolutely rubs me the wrong way.

    1. TheMaskedMan Silver badge

      Re: Ah yes...

      "I am sure there are plenty of things to complain about in the training practices of generative AI. But this argument absolutely rubs me the wrong way."

      This. It's been rubbing, sanding and downright ablating me up the wrong way for weeks.

      I am not sold on the idea that AI training is a breach of copyright in the first place. If I buy a book and go through it page by page, counting each occurrence of each character, then publish the results, am I infringing copyright? No. What if I count the words? Still no. If I take each character - or word - and calculate which other character - or word - is most likely to follow it? Nope. And if I then amalgamate those findings with those from every other piece of text I can find? Even less so, since the impact of any one work is diluted by the rest.
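
      The counting exercise described above is, in effect, a bigram model — the simplest ancestor of what an LLM does statistically. A minimal sketch, run on a made-up sample sentence rather than any actual book:

      ```python
      from collections import Counter, defaultdict

      def bigram_stats(text):
          """Count each word's frequency and, for each word, the word most likely to follow it."""
          words = text.lower().split()
          counts = Counter(words)
          followers = defaultdict(Counter)
          for a, b in zip(words, words[1:]):
              followers[a][b] += 1
          most_likely = {w: c.most_common(1)[0][0] for w, c in followers.items()}
          return counts, most_likely

      counts, next_word = bigram_stats("the cat sat on the mat and the cat slept")
      ```

      Nothing in the output is a copy of the input text — it is statistics derived from it, which is the crux of the fair-use argument being made here.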

      Note that I said buy a book, though. OpenAI really shouldn't be using pirate material, any more than we should be reading pirate books. But all they need to do to satisfy that requirement is buy one copy.

      Note also that I don't object to authors refusing permission to train models on their work, if they so wish. That needs to be made clear in the terms of sale of the book, though, or it is reasonable to assume that you can read a book in any way you wish, including analysing the content.

      But this isn't really about copyright, per se. That's just the least inappropriate legal tool they could find to beat OpenAI over the head with. What it's really about is the fear of wealthy authors who suddenly find that they may be about to become less wealthy unemployed authors. This is their Spinning Jenny moment, where they find that some clever bugger has only gone and invented a machine that can do what they do (well, not quite yet, but give it time) and they do not like it. They are desperate to nip it in the bud, and copyright is the only tool they can find that might have any chance of doing that.

      Unfortunately for them, they cannot succeed, even if they win the case. The genie is out of the bottle, there are other generative AI companies and plenty of copyright free fiction to train models on. I'm sure that these authors will continue to eke out a living, tough though it is to sit down and write all day, but the day of the automatic author is almost upon us. They're just going to have to do what everyone else whose job has been automated has ever done - adapt or get another job.

      1. veti Silver badge

        Re: Ah yes...

        Note also that I don't object to authors refusing permission to train models on their work, if they so wish.

        (raises hand)

        I do. I very much do object to that. Authors have no right to restrict who can and can't read their books.

        Copyright gives them the right to control a very specific range of functions, including copying, selling, translating, adapting and performing their work. It does not give them the right to say that it should only be read by people of a certain species, or only on certain platforms. I view the current action as a stealth attempt to extend the scope of copyright yet again, and one that should be resisted with, if necessary, torches and pitchforks.

  9. DJV Silver badge

    "The Register has asked OpenAI for comment and will update this story if we receive a substantial reply"

    Or do you mean a reply that hasn't been AI-generated?

    YouTuber Geoff Marshall did an interesting experiment recently by getting ChatGPT to generate a script for him to use for a video - it was hilariously bad!

  10. Anonymous Anti-ANC South African Coward Silver badge

    A girl is on her way to make them all pay.

    "What's beyond Westeros?"

    "Something called ChatGPT."

    1. Anonymous Anti-ANC South African Coward Silver badge

      "And that ChatGPT thing is more evil than Cersei, Joffrey and the Night King combined..."

  11. Brian 3

    Just another new industry based on illegal acts. Hooray!

  12. Long John Silver
    Pirate

    Writers and publishers must come to terms with the 'digital economy' and adapt accordingly

    Whence arose the notion that authors have 'ownership' over their 'works' rather than simple entitlement to be acknowledged?

    Somebody writes something and becomes an author. A publisher may arrange distribution of the work, this inscribed upon a physical medium. A bookshop, the second level intermediary taking a 'cut', sells someone a copy. Thereupon, the nature of the trade becomes peculiar. The buyer may believe he has been deceived into paying for rubbish. If he returns to the shop and demands his money back he will be laughed at, but that wouldn't be the case should he return a packet of mouldy rice to a food store.

    Taking this further, the buyer may wish recompense from the author for the 'opportunity cost' (of time) he incurred reading the book.

    We may presume people start writing because they believe they can produce work of interest to other people (not just to a publishing house). The genuinely creative writer will be driven by the pleasure principle. He may wish to do this as his occupation. If so, he must convince other people to buy his works after, at best, a cursory glance at their contents. Seemingly, being a self-proclaimed creative individual confers a privileged status with attached entitlements.

    The proper way round is for an author to persuade other people of his ability to interest them. Thereafter, those appreciative of his writing may arrange finance for further output (patronage). This modality doesn't work well in the context of books presented in analogue form (i.e. on paper). Nevertheless, expectation of people buying the author's/publisher's products without having recourse for all, or some, money back is odd in context of trade in general.

    These days, a printed book may be considered an added-value physical product associated with the ideas expressed in the book. Printed books have some convenience and also can be of aesthetic appeal: these fit squarely into supply and demand market economics.

    Digital versions are better suited to an explicitly 'patronage mode' of funding: people either donate money upfront in support of further writing, else they download a copy of the work and, if pleased with it, donate what they consider it was worth to them. The brutal fact for authors and publishers to consider is that without the patronage model becoming the norm (after being proselytised by authors and publishers), works presented in digital format shall increasingly enter the 'commons' regardless of authors, publishers, and the ramshackle anachronistic law supporting them.

    So-called 'AI', a useful but as yet misnamed technology, shall proliferate rapidly. Rentier copyright holders will find it difficult to identify specific targets to squeeze money from. OpenAI is an innovator of what soon shall be a routine computational tool. Just consider the present failure of copyright cabals to shut down Sci-Hub, LibGen, Z-Library, and many more. Consider the sheer impossibility of identifying those behind 'sharing', and their visitors, when greater use is made of darknets.

    In this edition of El Reg is mention of the UK government's "Online Safety Bill". The Bill is a wedge to open the door to widespread citizen surveillance; it won't open far because encryption is resilient against schemes generated by tiny minds at Westminster. This legislation, if extended, can offer no succour to the likes of the Authors' Guild. It would be better for Guild members to come to terms with the reality of digital technology and to adapt their means of raising income (and their expectations of life-style) accordingly.
