Para. 3 in the overview of the first filing reminds me of how the Oxford English Dictionary was compiled (and is still updated): individuals are invited to submit passages from published works to provide examples of the use of a given word.
Award-winning novelists Paul Tremblay and Mona Awad, and, separately comedian Sarah Silverman and novelists Christopher Golden and Richard Kadrey, have sued OpenAI and accused the startup of training ChatGPT on their books without consent, violating copyright laws. The lawsuits, both filed in the Northern District Court of San …
It likely does. I say likely because fair use isn't carte blanche, merely a defence that can be employed during a legal action (I'm not aware of any dictionary publishers being sued). LLMs may also fall under fair use; in both cases they could be regarded as transformative works but, again, it will require a legal precedent.
I'm absolutely not a lawyer but, given the ability/tendency of LLMs to 'hallucinate', which I believe is a linguistic equivalent of the visuals of Charles Bonnet Syndrome in humans, there does seem to be original synthesis taking place rather than straight regurgitation. I would still note that synthesis does not require intent.
IANAL and have no idea how they distinguish between "transformative" and "derivative".
Just using commoner's English, not high falutin' lawyer speak, *all* transformative works are derivative, are they not? I.e. your question becomes a trivial observation (sorry).
If I wrote a precis of a book (shades of third form English Lit., shudder) then I'm definitely deriving all my info from the original book, even if the result has no actual sentences copied verbatim. Reminder: stuff like the plot isn't copyrighted, only the particular expression of it, so even though my precis contains the plot (or I'm getting a C+ at best), that isn't breaching copyright even if it *is* derivative!
In other words, unless we do actually have the lawyerly mindset on this, any discussion of the matter is going to end up in a twist.
And that doesn't even consider that the LLM has mushed up its training material so, again trivially, any output is "derivative" of the entire corpus taken as a whole - but so is every word I'll ever utter (from my personal corpus, that is); even when (was going to say "if", but...) I say total gibberish words, they are derived from what I've heard makes up "something that sounds like a word". There we go - another twisty argument, of the sort that His Honour and members of The Bar know how to deal with but which is just likely to raise merry hell in an Internet comments section.
Oh, and any use of words like "transformative" is tricksy around programmers and the mathematicians who dreamt up the systems we use: a compiler most definitely transforms source code but, no matter how many transforms it goes through, the output object is agreed to be absolutely 100% derived from the sources.
> If you want to read a story, reading a dictionary isn’t going to be a substitute for that, so it is a transformative use.
Are you referring back to the use of snippets of text as examples in a dictionary? If so, then yes, I'd probably agree (although as IANAL my agreement doesn't stand for much, as previously noted). But then I'm agreeing because it seems "very transformative", i.e. way over to one side of the argument, so it gives one datum but doesn't help much in understanding where the boundaries between "transformative (enough)" and "ripping off" lie - that is still all twisty.
> Reading a pdf of the book on a torrent site would be a substitute for reading it on the Kindle Store.
Sorry, not seeing how that fits into all this. Care to clarify?
Often "fair use" hinges on how much of a pre-existing work has been used. Me using Blackadder quotes to quip at what others have said counts as fair use, me posting the script of whole episodes of Blackadder does not. Ingesting a work wholesale would definitely cross any threshold used to assess whether a 'substantial' quantity of the original work had been used.
If an LLM can be provoked to produce exact quotes of any arbitrary part of a work by prompting (e.g. "What did Captain Blackadder say in response to Lieutenant George St Barleigh talking about 'willing suspension of disbelief'?") then I suspect it would fail the test of fair use by its ability to produce arbitrarily long quotations of any part of the original text.
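For what it's worth, that "can it produce arbitrarily long verbatim quotes?" test is easy to sketch in code: compare the model's output against the original text and look for long exact matches. A minimal pure-Python sketch - the 20-character threshold here is an arbitrary illustration of mine, not any legal standard:

```python
def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest run of characters appearing verbatim in both texts."""
    # Classic dynamic-programming approach; fine for short texts.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def looks_memorised(model_output: str, original: str, threshold: int = 20) -> bool:
    """Flag outputs containing a long verbatim span from the original work.

    The threshold is a made-up illustrative number, not a fair-use test.
    """
    return longest_common_substring(model_output, original) >= threshold
```

A real memorisation study would do this at scale over token sequences, but the idea is the same: verbatim overlap beyond some length is what would sink a fair use defence.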
I agree that limited source material is the most common defence invoked, but Fair Use isn't a list of tick boxes and doesn't even provide immunity from legal action. Rather, it serves as a guide for ways in which copyrighted materials might be reproduced in another work without requiring permission. For example, movie spectrums (there's a variety of terms) are posters which use the average colour of each frame of a film, laid out in a pattern, as an interesting way to study how a film is visually directed. They are constructed from every bit of information in the film but, as far as I'm aware, haven't been legally challenged over their content.
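The movie-spectrum technique mentioned above is simple enough to sketch: reduce each frame to its average colour and lay the results out as stripes. A toy version in pure Python, with frames represented as lists of (R, G, B) tuples rather than decoded video:

```python
def average_colour(frame):
    """Mean (R, G, B) of all pixels in one frame."""
    n = len(frame)
    r = sum(p[0] for p in frame) / n
    g = sum(p[1] for p in frame) / n
    b = sum(p[2] for p in frame) / n
    return (round(r), round(g), round(b))

def movie_spectrum(frames):
    """One averaged colour stripe per frame - the whole film reduced to a palette."""
    return [average_colour(f) for f in frames]

# Two tiny two-pixel "frames": one reddish, one bluish.
frames = [
    [(200, 0, 0), (100, 0, 0)],
    [(0, 0, 200), (0, 0, 100)],
]
```

The point for the copyright argument: every frame of the film goes in, but what comes out is so heavily reduced that nobody could reconstruct the original from it.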
The lawsuits cited in the article don't pertain to the potential for complete reproductions of their materials but, rather, are arguing that removal of DRM and the use of their works in the building of the model is infringing.
Upvoted - but can you point to where removal of DRM is given as part of the complaint? I've read the PDFs but must have missed that. Ta.
If you were thinking of this bit of the article:
> certain copyright management information that would have been included in the legit, copyrighted books. This is the basis of the third count they allege against OpenAI, a claim it breached the DMCA by removing the copyright management info.
This isn't referring to DRM - the PDFs describe that "info" as being ISBN, author name, book title etc - i.e. attribution.
But the PDFs also describe that the LLM was prompted to give a precis - and if you ask "precis Twilight book 1" and the session log (which is mentioned as an exhibit in at least one of the cases, but which I've not seen) shows the LLM diving straight in, there is a fair argument (IMO, IANAL) that you have to take the question and answer together, in which case the attribution is (probably) clearly visible.
 Q: "What is 2 + 2?" A: "4" - that is a reasonable response to expect in an (automated) conversation, you don't need to demand that the answer repeats the question.
You might be able to create a set of question prompts that get the model to spit out the precis without either question or answer containing attribution, but hopefully the defence would raise an objection to a session log that started "Without identifying the book by name, author or other explicit means, precis that awful story about a teenager being groomed by a centuries-old glittery vampire".
The OED does not publish extracts (or complete issues) of copyright works. It merely cites the first known usage of a particular word or phrase. If a work is publicly available (i.e. has been published) then it is a legitimate reference for the OED.
What I want to know is whether any of these LLMs use the Register's comments or articles for 'training'. Could provide some 'amusement' if it does not understand the icons...
"are requesting compensatory damages and permanent injunctions to stop OpenAI from continuing in its actions."
Can they show that they've suffered pecuniary damage from the alleged copying? If so, it can't be very much. I'm not convinced that a summary amounts to copyright violation either, though I guess that depends on how big the summary is compared to the original material.
Have any of them approached OpenAI and asked them to remove their material and / or requested that they don't use anything else they may produce? Probably not.
Grant them the injunction by all means - if they don't want OpenAI et al. consuming their material then they are surely entitled to prevent it. But I'm not so sure about damages.
Still, the proposed litigation is fulfilling its purpose in gaining publicity; I, for one, had never heard of any of these award-winning litigants until now.
"It will take an author quite a long time to write a book. If, after scraping a lot of novels, an LLM can knock one out in minutes then they stand to lose future income."
Oh yes, definitely. I've had a single short story on the go for over a decade. Every now and again, when time and inclination coincide, I write a few more paragraphs. Publication shortly after hell freezes at this rate.
Of course, if the LLM can do it well enough to be published (ie not self-published on Kindle et al) then the author might be better off dreaming up good plots and having the LLM write them, rinse and repeat.
It can't, though. No jobbing author is likely to lose out to an LLM any time soon. Sure, it can produce a short story, but there's no depth to the output, no creativity. I'm not sure that there ever will be, either. Naturally, these things are only going to improve, but I doubt they'll be able to compete with a genuine meatsack.
Nor am I convinced that loss of future income is a risk that could be mitigated through this litigation. There is plenty of other fiction out there to train on, much of it out of copyright or public domain. Unless these authors, of whose existence I am now blissfully aware, can show that they bring something unique, I doubt that their inclusion in the training material will have that much impact.
> If, after scraping a lot of novels, an LLM can knock one out in minutes then they stand to lose future income.
Hmm, "if". Surely you can't sue on an "if"?
*When* an LLM has been told to knock out a novel - and, note, it would have to be a novel that clearly infringed on that person's work, not just any old novel, then there may be a case to answer for. But is it OpenAI *or* the person who asked for the novel to be created *or* the person who published it to the world who should be tackled?
Remember, as a private individual you can write a blindingly obvious ripoff of a novel, only having changed the name of the heroine to your own. So long as you keep it locked in your bedside cabinet and only read it to yourself on chilly winter nights, you will be fine. Publish it and prepare to be damned.
Okay, yes, that person who has brought a case just on the off-chance that she one day offers wedding websites *and* is asked to create one for a gay couple *and* they sue *if* she refuses - she is trying to bring a case on an "if". Wild.
Yes, yes, it will be OpenAI, because they have all the money in the bank - never mind that in this scenario they made either no money or the same as they charge for any arbitrary k many words of output (i.e. they didn't profit because it was a novel that was generated by the runtime they sold).
Court action very much is asking them to remove it.
Of course, it's technically impossible for them to remove any of the millions of copyrighted works they've already compressed into GPT-3 and GPT-4 etc.
They'll have to delete them entirely and start over, being careful to only use works licenced for that purpose.
Which is why they're trying to insist that they are permitted to do whatever they want with anything they find on the Internet. At some point they might realise that would also mean the literal end of copyright protection for their own works.
If someone makes a TV adaptation of a book they have to sort out some kind of licensing arrangement under the expectation that everyone is going to make money off the deal. Why should scraping it to train AI be any different? You're using an author's work to create a commercial product that makes money for you.
If you don't want to license it, then don't use it- at the very least authors should be able to say "I don't want my work to be used for this purpose" but like soup ingredients, it seems as though once it's in an LLM you're not going to be able to get it out.
Would adding a copyright clause that explicitly prohibits any use for LLM learning even be enforceable? Otherwise I can see that emerge as a new part of the copyright notice soon.
Given the freedoms that Google awards itself in its conditions, you'd think they would be better positioned than anyone else to use any information that is hosted on their services or has passed through them. They do have that freedom, if my reading of their conditions is correct.
"How about <aibot="yes"> to allow people to choose to permit scraping?"
I wondered about that, too, but it would be very restrictive - since no site will have that tag in place at first, there would be nothing to scrape.
There is an assumption that a site will be indexed and cached unless you include noindex and / or noarchive, so it seems reasonable to assume that it could be scraped unless explicitly denied, too.
Regurgitating them wholesale, with or without attribution, not allowed (unless you specify otherwise - i.e. you explicitly licence it, perhaps using one of the CC licences).
Creating a précis of them and publishing that without attribution - that is the different matter.
What if a Stackoverflow answer is just a quick précis (such as leaving out all the reasons why) of something the person learnt from your web page?
Does it matter whether the précis was typed out by a human or by a program? Or a human using the program? A human who "read it somewhere and can't remember where"? A program that "read it somewhere and can't remember where"?
What if your web page was the only place that such a brilliant piece of work was ever published and the human could have been reminded where they read it with only a quick search? With a day's hard web searching, because they didn't know the magic search term?
"Possibly need something similar to the <noindex> tag to prevent LLM scraping"
Yes, exactly! I suggested a possible noai tag in comments on another AI/copyright article the other day. It wouldn't stop rogue bots indexing the page, but it would make it much easier to sue the companies that ignored it.
Authors/ publishers who don't want to have their work used can make that plain, the AI companies can train on the rest of the web without fear of repercussions and we can all get on with what we do in peace. Win win. How do we go about proposing it?
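As a sketch of how a well-behaved scraper could honour such an opt-out, here's a check for a hypothetical "noai" directive in a page's robots meta tag. To be clear, "noai" is the assumption being proposed here - unlike "noindex", it isn't part of any current standard:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots" ...> tags on a page."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives += [d.strip().lower() for d in a.get("content", "").split(",")]

def may_train_on(html: str) -> bool:
    """True unless the page opts out with the hypothetical 'noai' directive."""
    p = RobotsMetaParser()
    p.feed(html)
    return "noai" not in p.directives

opted_out = '<html><head><meta name="robots" content="index, noai"></head><body>...</body></html>'
```

Like robots.txt, this only works if the scraper chooses to check it - its legal value would be in making it easy to show that a company ignored an explicit refusal.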
I very much doubt that such a clause would be allowed to stand because it's discriminatory. This sounds like fair use, so the only real thing to work out is whether attribution is possible. I suspect something like LegalAIgles™ is already working on this… build a model from the same corpus that can detect sources.
Ban LLM "scraping" and next year, LLMs are no longer being used, we have much better algorithms that we want to test on all this lovely text, off we go.
Ban AIs reading them? But what is an AI? Nope, this is just a program to build up a search index - yes, it does a Certified Really Clever General AI in *that* module, but all that the complete program does is create a really good search index (it is a bit over engineered, if I'm honest, but the CRCGAI was going cheap). Or even " come off it, AI? This is an LLM, you know full well we don't consider them as AIs, not after the Great Hype Ban of 2040".
Ban any machines reading the full texts of, say, novels? Bang goes the study of linguistic forms and how they change (with a whopping great note added by the researcher naming and shaming all the authors and publishers standing in the way of linguistics research).
How is "scraping" different to "reading", btw? The use of unlicenced copyright material? Because that is already disallowed by existing laws, and all you ought to need to do is buy a copy from the publisher.
"No part of this book may be reproduced in whole or in part ... stored in a retrieval system, or transmitted in any form or by any means ... without written permission from the publisher."
That's in the front of almost every book I own. It's very clear and explicitly prohibits what OpenAI has done.
"Some tasks, like requesting information from specific military units, can sometimes take staff hours or days to complete, but large language models can provide data within minutes."
Only takes minutes if the info is already included in the LLM. How long does it take for the info to get compiled, and how do you ensure it's not out of date?
Looks like replacing a simple telephone call to the unit, with having to employ someone in each unit whose entire job is updating the LLM. Not to mention ArmyGPT making stuff up when it doesn't have the answer.
"Only takes minutes if the info is already included in the LLM."
Also only takes a few minutes if it just makes stuff up - sorry, hallucinates it - if it doesn't have it already.
It sounds like this is information that should have been assembled and kept up to date anyway. The value of the LLM would appear to be in compelling whoever it might be to do what they should be doing anyway.
"Prone to generating false information" certainly doesn't feel like something you want in your command and control chain.
There's probably a fun sci-fi story in an army that manoeuvres in specific ways designed to trigger the weakness in command and control AI systems by manipulating what is reported about them.
Just by spending a fraction of their runtime training costs on a bulk deal or two with ebook publishers.
They have a paid for copy, they can let the machine read it. Job done.
(And I'm not going over the arguments about whether LLMs just store verbatim text of the books 'cos they don't, and even the plaintiffs are complaining that the program can provide a precis, not about regurgitating the whole).
And snarfed as much copyright-free material as they wanted, of course.
Which raises the question: leaving aside features like "I want the LLM to be able, specifically, to precis *this* book", are you going to get more pleasant, literary, witty and urbane English out of a model trained on the out-of-copyright contents of Project Gutenberg or from scraping the latest airport bonkbusters? I know which one I would personally prefer to read (you, of course, are free to have your own personal preferences) and my suspicion is that, as these things are being pushed as "good for generating texts for businesses to use", mayhap businesses would also be better off with just the older material as a style guide.
Arguments that "the information in old books is out of date and useless, so that won't work" are met with: "You are getting your knowledge of, e.g., current tax law from bonkbusters? Well, that explains a lot".
They have a paid for copy, they can let the machine read it
Authors have something called "moral rights" that can't be assigned. One of these is an identity right: to be credited as the author of the work in question. It's questionable whether the LLM "understands" authorship, particularly if it's mixing content from a variety of sources. However, if it regurgitates part of an author's work and fails to attribute it, that would seem to be a possible breach of their moral rights. Another is a right to object to "derogatory treatment". If the LLM were to produce a recognisable variation of the author's work that constituted a significant distortion or that might bring the reputation of the author into disrepute, that would also be a potential breach. And that's even assuming that their contract with the publisher permits the publisher to issue "training" rights to LLM developers, which I'm not sure is a given.
Copyright isn't entirely about money.
I may be wrong but I don't believe reproduction of the work is the issue per se; it's the scraping they are worried about - the summary reproduction is the evidence of that scraping.
I see this issue in a different light. Assume a book is *legally* online. As a human, I learn the content of that material, cannot unlearn it and could theoretically use it to produce a summary later, with or without copyright notices, from my memory. An LLM scrapes a book online. As an LLM, the machine "learns" the content of that material, cannot unlearn it and could theoretically use it to produce a summary later, with or without copyright notices, from its memory. Does this indicate any fundamental difference between me as a human and an LLM? In other words, is this not just another case of the publisher running after a potential new source of copyright fees by differentiating between human and machine entities at the end of the internet connection and restricting what they can do?
And "online" should make no difference.
If I go out and buy their latest book, read it over the weekend, and post a 300 word summary to my blog, is that a copyright violation? Genuine question - I am not a lawyer.
What about if I pay someone on Amazon Mechanical Turk to do the same thing?
Is it different if I pay an AI company to get their software to do the same thing for me?
And if the AI company then uses the model they created above to answer someone else's question? Is it different if the answer to the question involves reproducing any text? (consider "Does Great Aunt Agatha actually ever meet the Magician in person at any time in the novel?" vs. "What does the Magician say to Great Aunt Agatha the first time they meet?").
I guess my feeling is that AI should really behave like Google Search does today. If the information is on the web, not behind a security wall, it gets incorporated. However, you can instruct the spider not to do that by using specific steps. Of course, the result would be that the AI becomes much less useful because it doesn't have a lot of real-world knowledge - for example about anything protected by copyright. So "summarize the latest Spiderman movie in 1000 words" would be impossible. And it would mean the AI companies would need to create ways that institutions (universities, companies, ...) that want to apply AI to material not available for free on the web would have to be able to add data to their own instance of the model (after paying a licence to the necessary publishers).
> And it would mean the AI companies would need to create ways that institutions (universities, companies, ...) that want to apply AI to material not available for free on the web would have to be able to add data to their own instance of the model
That mechanism already exists - it is referred to as "fine-tuning" a pre-trained model. You just get a copy of the model after it has been fed "generic" material (the pre-training) and then keep on using exactly the same method to give it all your domain-specific information.
To illustrate the idea, you can get hold of pre-trained models, of various utility - e.g. https://github.com/onnx/models
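The "generic material first, then domain-specific top-up via the same update step" idea can be illustrated with a toy bigram model. This is a deliberately trivial stand-in for real pre-training/fine-tuning pipelines, not how GPT-style models are actually built:

```python
from collections import defaultdict

class ToyBigramModel:
    """Counts word-pair frequencies; 'pre-training' and 'fine-tuning' use the same update."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text: str):
        """Identical update step whether the text is generic or domain-specific."""
        words = text.lower().split()
        for a, b in zip(words, words[1:]):
            self.counts[a][b] += 1

    def most_likely_next(self, word: str):
        """Most frequent follower seen in training, or None if the word is unknown."""
        followers = self.counts.get(word.lower())
        if not followers:
            return None
        return max(followers, key=followers.get)

model = ToyBigramModel()
model.train("the cat sat on a mat")                        # "generic" pre-training corpus
model.train("invoices are filed under accounts payable")   # domain-specific fine-tuning
```

After both passes the model answers from either corpus: the domain knowledge has simply been layered on top of the generic base, which is exactly the institutional-instance setup described above.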
The attribution question is the core issue in the conversations about Github's Copilot - if you've not been reading those, please do so; there is a lot of info there and it doesn't seem sensible to copy and paste it all here.
However, attribution doesn't seem to be at the bottom of the cases here, unless I missed something: the models were asked to provide a precis of a work, which they did, so the attribution to the original work is clear.
Similarly, this case isn't claiming that the precis were derogatory or otherwise damaging. Again, if I missed where that is claimed, tell me and I shall happily update myself.
Given you highlighted a specific line, are you trying to imply that there is anything wrong in that sentiment, that "They have a paid for copy, they can let the machine read it"? Are you trying to say that the LLM should *not* be allowed to read the work, just in case it did something naughty with it? Because that argument applies to *everyone* and *everything* that is capable of accessing the book, LLM or not.
> Copyright isn't entirely about money
Did anyone say it was? How could it *possibly* be entirely about money, given that copyright equally protects all the works that are made freely available without the need to pay to licence a copy? Such as this very comment you are reading now.
Even though all the wording about "moral rights" etc is there to provide the law with the ability to function and, at bottom, provide for ways to give recompense when the copyrights are breached; and that pretty uniformly comes down to money. Such is the world we live in.
Paying for a licence to access a copy of a copyrighted work is just that - the work has been offered for sale, anyone can buy a copy and read it. This is relevant purely because that is the condition these authors had offered their work under.
Other copyright material has been read as well, a lot of it, which has been offered for general consumption without the need for payment.
All that was suggested is that OpenAI should have just followed the full agreement for access to every work it had the LLM read and that would have prevented the basis for these particular cases.
I don't see why they'd need to do this to write a review: these are summaries and not "new" works being passed off as original. Furthermore, if you think that this kind of negotiation is quick, cheap or easy, where have you been living for the last two decades? YouTube is full of far more egregious abuses of copyright, even in "transformative" works. One of the justifiable reasons for safe harbour provisions is that rightsholders can be considered to be guilty of restrictive practices. No, the solution here will be about ensuring attribution and defining fair use. Otherwise you can just bury the result behind a process that looks like humans are involved in The GPT Literary Review. Once some kind of process for attribution has been established then you can go after any "publisher" that fails to provide it with the full force of copyright law. As anyone who's ever received an e-mail from Getty Images knows only too well!
> Furthermore, if you think that this kind of negotiation is quick, cheap or easy, where have you been living for the last two decades
Did anyone say it would be easy or quick? Cheap, yes, in terms of the daft sums being poured into the creation of LLMs.
If getting reduced-cost access is too annoying for them, they don't need it: just buy at the standard market price, one book at a time. If that is the better option, that is the one they'd take, but they'd be bound to try asking, just on the off chance (be embarrassing in the finance meeting otherwise).
Hardly a deal-breaker for them, either way. <shrug>
"my suspicion is that, as these things are being pushed as "good for generating texts for businesses to use", mayhap businesses would also be better off with just the older material as a style guide?"
I don't think they would be. I'm not sure how useful modern books are, but modern text will at least allow the models to generate text of dubious usefulness that sounds like modern text. People generally want to read messages from businesses that sound natural, not old, although those old books at least get unnecessary formality down. It would also make a more limited style guide: centuries of English with no particular styles in common, including translations from other languages which may not always be the best (check out the quality of Gutenberg's translations of Verne compared to modern translations that do such things as getting the characters' names correct). One consistent style in a single book is easier to understand than a mash of styles when modern English is expected.
It is my pleasure to inform you that the communication service which you, our esteemed customer, hath purchased is indeed without function as to your particular place of residence. This we have ascribed to the deficiency of a telegraph to be found betwixt your abode and the house of business. When we are granted a personage of adequate knowledge and a horse upon which to send him, the fault in thy resource shall be witnessed and, if it please he which to him is all glory, fixed. I remain your obedient servant, Local Internet, LTD.
> I remain your obedient servant, Local Internet, LTD
Hah, given that form of address was used at least up until the 1970s, I would be glad to see it return.
Except, of course, it should read:
I remain, Sir, your obedient and faithful servant, Local Internet, LTD
Or at least bring back,
Faithfully, yours, Local Internet, LTD
and "Sincerely" for informal missives.
True. And you - or OpenAI - pay the publisher to get access to your copy, which you are now at liberty to read, unhindered. In fact, you now have a paper trail to show that you have the right to read and inwardly digest that copy of the work, just in case anyone should ask.
It goes further, as the licence is specifically tied to that copy, which is why both SWMBO and myself can read the same paperback copy and afterwards we can pass it on to a friend, family member or total stranger. All very neat.
Not 100% sure if you were trying to make a point or just show that you know about licensing.
"At no point did ChatGPT reproduce any of the copyright management information Plaintiffs included with their published works."
Nor would ChatGPT need to if it was providing a summary of the book. Because the "management information" is irrelevant for that (and probably has no legal standing anyway).
I assume by that they meant they questioned ChatGPT and it was able to answer lots of questions about the content of the book indicating it had "read" and stored it all, but could not answer questions about the copyright.
I'm guessing when they programmed ChatGPT to "learn" it was told to skip copyright related information. Either because it is redundant/useless for its goal or because they thought they might get them out of trouble ("it doesn't know the difference between copyrighted and public domain")
The "Copyright management Information" was just the normal stuff like author's name, book title, ISBN (it is described in the PDFs linked to from the article).
I.e. the stuff you would include (some of) as the attribution.
There is no legal problem with leaving great chunks of it out (aka "removing it"): library of congress record number, ISBN, copyright date, printing date, .... Unless, of course, you are explicitly storing a wholesale copy of the work (but even then, just 'cos otherwise the copy is incomplete qua being a copy). For purposes of attribution, book title and author is generally sufficient (academic references and the like require more precision, but not for copyright compliance).
Whether a summary needs to include a statement of what it is summarising (as in, the attribution) is arguable, as I've noted elsewhere: it depends upon how the material is published: if the session log shows that the question provided that identification, there is no necessity for the answer to repeat it (but if only a part of that log is reproduced elsewhere ...)
With all these claims of plagiarism by artists and writers, I have to wonder why no one points out that humans "ingest" (more commonly referred to as "read" or "study") other works, and incorporate ideas, language, idioms, etc. from those same types of works. AI LLMs are doing nothing new beyond this, and the complaint is simply FUD and Luddism.
I would certainly see potential for infringement in terms of "art".
If you get some of the image AIs to produce an art work in the style of a particular person, occasionally it will produce something where you think, yes that really has captured the style of person X.
Let's take a real-world example: UK-based people who are aware of the long-running B3ta boards may well recognise the name "HappyToast". He sells artworks and has a very distinctive style (I deliberately chose him for that reason, as a suitably trained AI could grok the style). I have seen AI-generated works done in his style that, whilst not as good as his, are a reasonable pastiche, and there would be nothing to stop someone then trying to sell them.
I personally think that image generating AIs should not be allowed to do "in the style of" image generation when the style is for a living artist
When a film critic deconstructs a film, they're not required to get the studio's permission, nor are they required to compensate the studio. That's true even if the critic uses digital tools to perform the analysis, and it remains true even if the critic then takes what he's learned and applies it to creating his own films. In other words, it would be an entirely novel interpretation of copyright law to say that performing an analysis of prior art requires permission or compensation to the artist. AI models are not copying works, but they are analyzing works. So I think these lawsuits face a real uphill battle to make their case. It would open a huge can of worms to conclude that merely analyzing works invokes copyright protection. I'm not sure any of us would want to live with that precedent and its consequences.
@TrueJim "AI models are not copying works, but they are analyzing works."
These cases claim "OpenAI made copies of Plaintiffs' books during the training process of the OpenAI Language Models without Plaintiffs' permission."
So is it rather that AI models are not copying works, but analyzing copies of works? And if so, does making those copies infringe the authors' copyright?
The question that's being asked is this: Does OpenAI infringe copyright when it scrapes copyrighted works from the internet?
In my opinion it does. But that is just my opinion.
> These cases claim "OpenAI made copies of Plaintiffs' books during the training process
Yes, it seems to me the case hinges in part on what the plaintiffs mean when they say "made copies". If the plaintiffs are alleging that a computer analyzing a digital copy of a book constitutes "making another copy" then I think the plaintiffs will have an uphill battle. Computer programs that perform statistical analyses of famous works have been scanning digital copies for years without protest from publishers or authors, so there is some precedent. I suspect this is the sort of case that will probably involve a decade's worth of appeals as these kinds of nuances are sorted out; we're probably ten years away from having a definitive judgment.
Let’s try this thought experiment:
Suppose you were to take the source code for the Oracle Database, compile it yourself and distribute the resulting binary. Do you think Oracle’s lawyers would complain about it? Do you think they would win their court case?
The binary you distribute would look nothing like the source code. Decompiling it would be quite difficult, and the result would likely not look much like the source code you started with. A human could read the source code, "learn" what it was doing, and write assembly to do the same thing, but would the judge consider that to be a relevant argument?
It depends on the detail. If you could question a film critic about any particular scene and they could quote it word for word, and you could continue that until they had quoted the entire movie, I think doing that in public would be a violation of copyright. Perhaps there's an argument to be made that quoting an entire movie isn't the same as seeing an entire movie, but replace film critic with book critic and it is pretty obvious that a book critic who could be induced to quote the entire book would be no different from an audiobook of the work, which would most definitely be a violation of copyright if it isn't a licensed audiobook.
> and they could quote it word for word
AIs such as GPT don't quote anything word for word though. AIs learn about the kinds of patterns that constitute human word usage (based on a combination of semantic similarity between words, part-of-speech, and the relative positioning of words to one another), then apply those patterns to generate entirely new sentences. One can copyright an expression of content, but one cannot copyright the patterns of how words are used, nor has it historically been considered a violation of copyright to study other people's word usage to understand their patterns.
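To make that concrete, here's a deliberately crude sketch of pattern-based generation: a toy bigram model, nothing like GPT's actual architecture, but it illustrates both sides of this thread's argument. The model stores word-adjacency patterns rather than the text itself, yet when a phrase exists in only one form in its training data, walking those patterns reproduces it verbatim. The corpus strings are invented for illustration.

```python
import random
from collections import defaultdict

def train_bigrams(text):
    """Record, for each word, every word observed to follow it."""
    words = text.split()
    model = defaultdict(list)
    for a, b in zip(words, words[1:]):
        model[a].append(b)
    return model

def generate(model, start, length, seed=0):
    """Walk the table, emitting a recorded successor at each step."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length:
        successors = model.get(out[-1])
        if not successors:
            break
        out.append(rng.choice(successors))
    return " ".join(out)

# A corpus with branching: "out" has several recorded successors, so
# generation can produce word orders that never appeared as a whole.
varied = train_bigrams("out out damn spot out I say out loud")
print(generate(varied, "out", 6))

# A phrase the model has seen from only one source: every word has a
# single successor, so "generation" is verbatim regurgitation.
single = train_bigrams("to be or not to be")
print(generate(single, "to", 6))  # reproduces the phrase exactly
```

Whether the output counts as "entirely new sentences" or as copying therefore depends entirely on how often, and in how many variants, the source text appeared in training.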
It will have read just about all of the Bible and Shakespeare in the form of those very quotations, many, many times, as well as the complete works themselves (and whopping great chunks of each, much longer than your average quote) again, many, many times.
There will be massive correlations between "Out, out" and "damn spot".
Which fits TrueJim's description.
The same will be true for everything you can think of that is considered quotable. It will have been quoted.
If you wish to demonstrate otherwise, you have to get it to (re)generate something thoroughly dull that it will only have seen a few times - preferably, only from one single source - and see what kind of prompting it needs to do that.
We don't have a great idea of what this model contains within it. The point is that it is not purely an analysis or a copying program, but a combination of the two. You can get it to produce verbatim quotes, including of copyrighted material. That has been covered in articles here and elsewhere; I'm sure you've seen them. That wouldn't have happened if it was purely a lossy analysis tool. What we don't know as well is exactly how much you can get it to quote, because it's set up in such a way that it will not print an entire book in one go.
The data it has read is radically altered from the original format, and some of it has likely been discarded. Neither proves that it is not a copying tool or violating copyright. If I write a program to mash up data into a form that's nearly unrecognizable, it doesn't prevent me from violating copyright if it can be used to reconstitute the data I don't have rights to. If I take a book and discard a few chapters, but then quote the rest, I have still violated copyright. A perfect example is if I take copyrighted music and run it through a lossy compression algorithm. Some of the original sound is no longer available in the file I created and will never be retrieved by someone who only has my file. The original strings of bytes aren't there either because the format is different. Yet, because it makes the same general impression to someone who listens to it, I am not off the hook.
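The lossy-compression point can be made concrete with a few lines of code. This is a toy sketch, not a real codec: it quantises a synthetic 440 Hz tone to a handful of amplitude levels (so information is genuinely discarded), then shows the result still correlates almost perfectly with the original. The sample rate and bit depth are arbitrary choices for illustration.

```python
import math

def quantize(samples, bits):
    """Crude lossy 'compression': snap each sample to one of 2**bits levels."""
    step = 2.0 / (2 ** bits)          # amplitude range is [-1, 1]
    return [round(s / step) * step for s in samples]

# One tenth of a second of a 440 Hz sine tone at an 8 kHz sample rate.
original = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(800)]
lossy = quantize(original, 4)

# Information really was discarded: the samples no longer match exactly ...
max_error = max(abs(a - b) for a, b in zip(original, lossy))

# ... yet the normalised correlation is still essentially 1, i.e. the
# result makes "the same general impression" on a listener.
dot = sum(a * b for a, b in zip(original, lossy))
norm = math.sqrt(sum(a * a for a in original) * sum(b * b for b in lossy))
print(max_error, dot / norm)
```

Neither the original bytes nor the exact sample values survive the transformation, but the work plainly does, which is exactly the argument being made about the model's internal representation.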
Music also provides a good example of how even small uses of copyrighted data can be a problem. Some people have decided that they like certain sounds produced by other musicians, frequently drum beats. It would seem pretty easy to make more whenever you want, because you just have to get a drum (or basically any object that makes a nice sound) and hit it with something, but still they value them and integrate them into their songs. These beats are very short slices of audio, and they are not the original audio because, when they're reused, processing has to be done to remove other instrument sounds that are unwanted. If you choose to do this, keep around some money because you aren't allowed to do that for free. You have to license the sound from the person who created it, even despite its brevity, your changes, and the fact that your song is likely completely different from theirs. You used something original to them for your purposes, and so has OpenAI. It isn't as simple as determining whether it produces the entire original work. Even if it can't, it could still be violating.
After reading this, I have been seeing just how far this scraping has got.
It just gave a very informed report on the reading books I used at school in the mid 1960s! (Dick and Dora).
I will probably look through other things I haven't read in 50+ years. This is unproductive and fascinating, just why I got into IT!
It does mean that "write a story about x in the style of z" could be iffy - it is very likely to contain chunky quotes of existing work and cannot be passed off as parody or fair use (ChatGPT cannot generate parody or fair use extracts on request, as it has no understanding of, err, anything - it's not as though it's an AI).
The future problem for aspiring authors, musicians, etc. is that they will be lost in the flood of model-generated works (not identified as such) swamping publishers and the market. It will be interesting to see what companies like Spotify do to handle this.
One of my novels - a fantasy set in Japan - is online and free to download (although as the author, I retain copyright). AI are welcome to learn from it. Hopefully it will improve them, as I hope readers will enjoy it.
I would certainly like to see an AI chatbot based on pre-modern fiction. Lots of Jane Austen, George Eliot and Dickens. Plus 'Fanny Hill' of course.
'It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a faster internet connection'.
If you want to render an AI useless, direct it to scrape 'Finnegans Wake'.
AI is a machine learning system so in order to learn it has to obtain information. Just like a student reading books for an English Literature class, in fact. Now I've never heard of an author suing a student for copying because they read their work or performed literary analysis on it or even if they build original works using some of the knowledge gleaned from their reading. There is straight up copying -- plagiarism -- but that's reasonably well defined. So it occurs to me that the authors are not only trying to hold back whatever progress is made by machine learning** but also seeking to profit from whatever they can get their hands on.
(**Not as much as you'd think. It's not that ML is going to render authors redundant. At least, if it did - and 'it' doesn't have a choice in the matter; it's whoever uses it who chooses how and why it's used - it's likely to be about as convincing as a ghostwriter putting together a summer potboiler.)
Incidentally, I didn't know Sarah Silverman was an author. But I suppose we all are, potentially; it's just a matter of what we call ourselves and what we publish.