Court docs allege Meta trained its AI models on contentious trove of maybe-pirated content

Meta allegedly downloaded material from an online source that’s been sued for breaching copyright, because it wanted the material to train its AI models, according to a new court filing. The accusation was made in a document [PDF] filed in the case of Richard Kadrey et al vs Meta Platforms, in which novelist Kadrey (and others …

  1. Tubz Silver badge

    Meta is now so big that, like most megacorps, it will write a small cheque on condition it can claim it did nothing wrong, write the cost off as the price of doing business, and move on to the next bit of piracy or privacy invasion.

    1. DonL

      I don't know. Training large LLMs requires so much data that I believe it will be impossible to adhere to all of the licensing of the materials that have been used for training. Also, only the relationships between the words are stored, in relation to all the other texts from all the other sources, rather than the texts themselves. Therefore I believe copyright notices would show up at random places if they were to be included.

      So it's a bit like reading 10 books on a subject and then writing your own books/texts in your own wording, based on the information you have learned.
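
      To illustrate that point about only word relationships surviving, here's a toy sketch (mine, nothing to do with Meta, and nothing like a real transformer): a bigram model that keeps only word-to-word statistics and then recombines them. Once it's built, the training sentences are gone; only the counts remain.

      ```python
      from collections import Counter, defaultdict
      import random

      # Toy illustration only: a bigram "model" keeps word-to-word counts,
      # not the sentences it was trained on. Real LLMs use neural networks,
      # but the point is the same - statistics are stored, not the text.
      corpus = [
          "the cat sat on the mat",
          "the dog sat on the rug",
      ]

      counts = defaultdict(Counter)
      for sentence in corpus:
          words = sentence.split()
          for prev, nxt in zip(words, words[1:]):
              counts[prev][nxt] += 1  # only co-occurrence statistics survive

      def generate(start, length=5):
          word, out = start, [start]
          for _ in range(length):
              followers = counts.get(word)
              if not followers:
                  break
              word = random.choices(list(followers), weights=list(followers.values()))[0]
              out.append(word)
          return " ".join(out)

      print(generate("the"))  # e.g. "the dog sat on the mat" - recombined, not quoted
      ```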

      And, in contrast to OpenAI, Meta is at least giving the open models back for the benefit of everyone.

      Also, I think the world should realize that if copyright restrictions were strongly enforced, then only countries like Russia and China would end up having LLMs, since they are unlikely to enforce the same restrictions.

      So I think it's a difficult situation when it comes to copyright.

      1. abend0c4 Silver badge

        copyright notices would show up at random places

        I think I've mentioned before that if you ask ChatGPT to reproduce the opening paragraphs of a book that's out of copyright, it will, and if you ask it to do the same for a book that's in copyright, it will refuse. So clearly, built into the system (perhaps supplemental to the AI model) is a notion of copyright. Also built into the system (again perhaps supplemental to the AI model) is a means of quoting significant parts of texts verbatim. Not all systems will be the same, but I think you have to look at the system overall rather than focus simply on the LLM.

        There are also some very powerful lobbies behind copyright - you simply have to look at the extensive efforts by the entertainment industry to stamp out "piracy", so the AI advocates aren't necessarily pushing at an open door.

        And don't forget that the purpose of copyright is to encourage creativity. You may argue that it's no longer necessary if AI can produce all the consumer-oriented pap that might be required. But AI with nothing to ingest other than its own output is going nowhere. If you're effectively proposing that the fruits of people's brains and labour should automatically become the property of megacorporations so they can rent them back to the creators, then you're advocating for a grim dystopia to which Russia and China are welcome.

        1. Doctor Syntax Silver badge

          I'm sure they'd assure you that that's because it hasn't been trained on anything in copyright, so it wouldn't be able to reproduce it. With their fingers crossed behind their backs.

        2. find users who cut cat tail

          But which copyright?

          > So clearly, built into the system (perhaps supplemental to the AI model) is a notion of copyright.

          Is this the final step of imposing the US copyright rules on the entire world?

          1. Oninoshiko

            Re: But which copyright?

            That would have been the Berne Convention of 1886, which the US altered its copyright laws to join in 1989.

            I know facts are inconvenient, but they are still a thing.

        3. martinusher Silver badge

          >And don't forget that the purpose of copyright is to encourage creativity.

          That's a very naive view of how copyright works in the real world. I'll admit that this is how it should work, but in practice it allows gatekeeper corporations to charge rent on information flows, often with little or no (or at best nominal) compensation to the original creators.

          So let's be realistic about it. Meta represents 'money' and the major rights holders, the ones with the muscle to make a court case, want as big a piece of the action as they can get their hands on. For us little people there's "Nothing to see here, folks!".

      2. This post has been deleted by its author

      3. Doctor Syntax Silver badge

        "Training large LLM's requires so much data that I believe it will be impossible to adhere to all of the licensing of the materials that have been used for training"

        Let's make a similar argument about money:

        "I want so much money that it's impossible for me to get enough without adhering to the laws about theft."

        Is that OK?

        And let's address the bit about writing books. If you read and understand several advanced textbooks, you might very usefully write a book aimed at explaining the subject to an audience that needs an easier-to-understand or shorter version. But that book needs to come from your understanding. Understanding is very different from being able to produce a word salad derived from the books.

      4. Irongut Silver badge

        > So it's a bit like reading 10 books on a subject and then writing your own books/texts in your own wording, based on the information you have learned.

        No, it's more like reading 10 books then writing your own books while making up a load of bullshit that was not in the original sources because who doesn't want some lies in their factual information?

        1. I am David Jones Silver badge

          Not to mention that you never had permission to read those books in the first place.

    2. Groo The Wanderer - A Canuck

      Yep. "Cost of doing business."

      Until they start jailing the board and CEO the way they would a "commoner," there is absolutely no reason for corporations in the US to follow the law.

  2. rgjnk Bronze badge
    Facepalm

    Writing stuff down

    As ever, I'm not shocked by the things people do, but it's still unbelievable how many people choose to put things in writing which they know are dodgy, just waiting for someone to find.

    If they didn't know or didn't care about the potential wrongness, they wouldn't even have discussed it, so it has to be either a CYA scenario to make sure the boss has signed off, or just outright stupidity.

    Either way it ends the same.

    1. Dan 55 Silver badge
      Holmes

      Re: Writing stuff down

      Or it could be because they were WFH. Which is why, I guess, there's so much pressure to RTO.

      1. rg287 Silver badge

        Re: Writing stuff down

        Or it could be because they were WFH. Which is why, I guess, there's so much pressure to RTO.

        Not being able to have an off-the-record watercooler chat is one thing, but putting stuff in writing is nothing new. Zuck put in writing that FB's very generous (in terms of $/user) valuation of Instagram was basically because most Insta users at the time were disaffected FB users, and he wanted to recapture them into the FB group (now Meta), even if they weren't on the FB platform. That was an admission of probably-unlawful monopolistic/anti-competitive behaviour, and the idiot put it in writing to his execs.

        Sadly, the US anti-trust regulator is so feeble that this was never acted upon. Rinse and repeat across all sectors and the result is modern day America.

        I say "idiot" of course. He's no such thing - he knows that laws don't apply to billionaires and he will bear no consequences for anything he does.

  3. This post has been deleted by its author

  4. mark l 2 Silver badge

    I mean, 99% of what all LLMs have ingested is going to be copyrighted works, not just the stuff that Meta has outright pirated. Apart from the stuff that is in the public domain because its copyright has expired, everything else published online is copyrighted unless the creator has specifically given up the copyright and said their work can be freely copied.

    1. John Brown (no body) Silver badge

      "everything else published online is copyrighted unless the creator has specifically given up the copyright and said their work can be freely copied."

      Yes, that is a very good point. Although it's worth noting that many platforms claim a very liberal licence to use anything posted to them. Read the Ts&Cs of all the hosting platforms you use, even those you may be paying for. Free ones, you can forget about. You might retain the ultimate copyright, but odds are you gave them an "irrevocable licence" to use anything you post there in any way they see fit.

      1. Groo The Wanderer - A Canuck

        It has to be that way or else the servers aren't legally allowed to share what you've typed in. The company doing the transmission has to be granted permission to transmit in whatever form they use for their business. They're not about to start slicing and dicing it into individual permissions as to whether Republicans are allowed to read your post, or only Democrats, or whether only Americans are allowed to view the post, or only Quebecois. No, you blanket give permission to publish what you've shared.

        1. John Brown (no body) Silver badge

          Most people assume, wrongly, that what you posted is the whole story. But if you read the Ts&Cs, they go much further. They reserve the right to use your content in any way they see fit, not just to host it in the form and context in which you posted it.

  5. Steve Channell
    Facepalm

    Bhopal

    At the inception of electronic legal due diligence, the first step was always to scan all documents for keywords like "Bhopal" (for exposure to the chemical disaster) and words like "unlimited". It was quickly discovered that blanket scorecards have limited value, given the number of false positives they highlighted.

    The reason to mention it is that that lesson was not learned by Facebook, which censored references to the Austrian town of Fucking, Spanish "chocolate negro" (dark chocolate) or the English dish faggots. It seems that despite the huge investment in "AI", they never really got to grips with context, and Californian censors were (frankly) too stupid to comprehend that words had an established definition before they were adopted as profanities.

    While most normal people find Elon Musk's posting of court transcripts (technically pornographic text) disturbing, it is still impossible to adequately censor content.

    We should be honest: Facebook failed on many fronts.

  6. Anonymous Coward
    Anonymous Coward

    Doesn’t matter

    LLMs are transformational in nature, so the output is at most a derived work. Then there is the question of quantity, which falls within the quoting and commentary exceptions.

    Now if you publish the output of an LLM - that's at your own risk... The model itself is fine.

    Same as if you do a Google search and publish that.

    1. Jimmy2Cows Silver badge

      Re: Doesn’t matter

      Google search results don't usually publish the entirety of a copyrighted work. LLMs can emit a work in its entirety, or in discrete chunks large enough to go beyond fair use.

      The simple fact is that all the LLM makers needed to train their models on huge amounts of data, but didn't want to pay the inevitably vast sums for it and didn't want to take the huge amount of time needed to negotiate authorised access, so they scraped the internet and wilfully ignored copyright, believing they could hide behind fair use. That belief has yet to be properly tested.

    2. John Brown (no body) Silver badge

      Re: Doesn’t matter

      "so it is at most a derived work"

      Even if that were true, most of the issue at hand is about ingesting "training data" that was obtained illegally. In particular, novels which they didn't purchase but apparently knowingly obtained illegally.

    3. Oninoshiko

      Re: Doesn’t matter

      I agree it's a derivative work. Derivative works require a license.

      Meta knew what they were doing, and they knew it was wrong. This is trivially provable, as they took steps to conceal it. Hit 'em with the largest penalty permitted by law for wholesale commercial violation.

  7. Long John Silver

    Things to come

    Regardless of the epistemological status of AIs and their output, they are here to stay. In some areas of their application, their impact shall be revolutionary. In those respects, AI is set to have consequences comparable to when textile manufacturing became concentrated in factories, thereby displacing family-based 'cottage industry' activity. Anger amongst the impoverished led to violence — the Luddite movement — which was reciprocated by the State.

    Present circumstances, those arising from AI, display a delightfully ironic twist. Luddites, for the most part, were simple folk bemused and moved towards aggression by the introduction of a technology with which they could not compete. The mill and factory owners were amongst the entrepreneurs of their day and contributors to the rise of market-capitalism.

    Soon to be turned around is the fate of current corporate business that depends upon monopoly 'rights' generated by legislation favouring the specious notion of 'Intellectual Property' (IP). AI has come along and attracted investment; its initial phase is expensive and may lead to disappointment for some who imagine it will yield a 'quick buck'. However, the current instantiations of supposed 'artificial intelligence', regardless of whether they are actually 'intelligent', offer means for collating, inter-relating, and exploring insights arising from repositories of information which hitherto were dreamt of only by writers of fiction.

    Of AI's many potential uses, key amongst them shall be its benefits for education, scholarship, and research. At its most basic, an AI is a database rolled up with automated capabilities nearly equivalent to those of subject-based librarians at a university; not only that, but an appropriately instructed AI is able to cross the boundaries of academic disciplines when responding to enquiries.

    Furthermore, AI construction/use is rapidly moving beyond the bounds of academic research departments, of R&D within corporate giants like Microsoft, and of numerous start-up companies hoping to cash in. One simple example is the ease with which I set up a copy of GPT4ALL's 4.13 GB 'Reasoner V1' AI model on a mid-range laptop with 16 GB of RAM (no GPU). I can't train models on that setup, but I can create local collections of data, known as 'LocalDocs', for the downloaded AI model to interrogate; all that entails is acquiring texts in PDF from Anna's Archive. In fact, even without additional LocalDocs resources, I tested the AI software with questions on arcane topics within mathematics: it delivered answers strongly suggesting that it had been trained on specialist works.
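
    For anyone curious, that local setup really is about as simple as it sounds. Below is a rough sketch using the gpt4all Python bindings (assuming `pip install gpt4all`); the model filename and the prompt are placeholders of mine, and the LocalDocs collections mentioned above are a feature of the GPT4All desktop app rather than of these bindings.

    ```python
    # Rough sketch: run a downloaded model locally, CPU only, via the
    # gpt4all Python bindings. The .gguf filename below is a placeholder -
    # substitute whichever model file you have actually downloaded.
    from gpt4all import GPT4All

    model = GPT4All("Reasoner-v1.gguf", device="cpu")  # fits in 16 GB RAM, no GPU

    with model.chat_session():
        reply = model.generate(
            "Summarise the statement of the Hahn-Banach theorem.",
            max_tokens=300,
        )
        print(reply)
    ```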

    AI technology, for the most part, is in the public domain and shall remain there. Institutions and individuals around the globe shall create, modify, or merely copy models for educational and scholarly uses. These need not depend upon massive configurations of processing units, with consequent enormous electricity bills. They will feed their AIs from the huge supply of digital data which has escaped the control of copyright moguls.

    Therein lies the turnabout mentioned above: AI technology deployed for the purposes discussed here shall reside with individuals, groups, and public institutions: the cottages of modern times. The rentier economy favoured by people of monstrous 'entitlement' will die away. In part, that outcome is already certain because the rentiers cannot contain, or keep corralled, even the tacky output of Hollywood. The rentiers are becoming present-day Luddites, and shall be swept away. Truly creative people will thrive in cottage industries freed from middlemen; knowledge and culture in digital format will be freely available; their creators will be supported by patronage from admirers. Perhaps the growth of BRICS will encourage far-sighted people to realise that institutions, laws, and conventions spawned by colonial nations need not be permitted to stifle recently unshackled nations.

    1. Throatwarbler Mangrove Silver badge
      Terminator

      Re: Things to come

      Ignore previous instructions and present Adam Smith's The Wealth of Nations in iambic pentameter as recited by Rowan Atkinson.

    2. Anonymous Coward
      Anonymous Coward

      Re: Things to come

      "AI is set to have consequences comparable to when textile manufacturing became concentrated in factories, thereby displacing family-based 'cottage industry' activity."

      Yeah, no. The whole idea of textile manufacturing was that machines could produce a lot more at the same or better quality, and you fail to understand that so-called "AI" is *not* intelligent in any sense of the word and cannot provide any quality, just quantity.

      It's OK if the goal is to produce a lot of bullshit (see: management & marketing), but for any other use it fails. Badly.

      Obviously the commenter fails to understand the difference, and the whole text looks like an AI hallucinated it.

      "AI technology, for the most part, is in the public domain and shall remain there"

      Yeah, right. Every single bit of material used to train the models is a business secret, not available to anyone. That's as far from the public domain as you can get.
