back to article To pay or not to pay for AI's creative 'borrowing' – that is the question

In the UK's Parliament this week, Microsoft and Meta ducked the question of whether creators should be paid when their copyrighted material is used to train large language models. The tech titans, with combined revenues well in excess of $200 billion, were being grilled by the House of Lords Communications and Digital …

  1. Catkin Silver badge

    Two questions for the price of one

    It seems that the primary (broader) question is whether ingesting material into a training set, when that material would be legal for a human eye to read or view, is infringement. The secondary question is whether certain materials going into certain training sets were obtained in a way that wouldn't have been legal if a human were reading or viewing them.

    For the former, I would propose that pattern extraction is no more copyright infringement than making a graph of how frequently particular words appear in a given book or how many clouds appear in the landscape paintings of a particular artist. In other words, if a human can do it legally, a machine should be able to do it legally; regardless of whether the machine can do it "better" (a subjective appraisal).

    1. Lurko

      Re: Two questions for the price of one

      Well, certainly in the UK, "copying" includes copying to the long-term, temporary or random access storage (and thus even the cache) of a machine. The fact that a machine may be conceptually "reading" and analysing much as a human would is irrelevant - the machine has made a copy of the original work and, unless anybody wishes to plead altruism on behalf of the LLM owners, does so in most cases for commercial purposes.

      1. Catkin Silver badge

        Re: Two questions for the price of one

        I think the law* is currently vague, because the language covering copying to and from storage relates to programs (unless I've missed the part to which you refer; it's a big document). For programs, it's actually quite legal for a legitimate user to copy, decompile and inspect the operation of programs, and no EULA can prohibit this in such a way that renders those actions a copyright infringement (though the actions may still breach other contracts).

        *based on the Copyright, Designs and Patents Act 1988

      2. mpi Silver badge

        Re: Two questions for the price of one

        > temporary or random access storage

        Like all web browsers do, whenever they access a webpage?

        1. zuckzuckgo

          Re: Two questions for the price of one

          > whenever they access a webpage?

          There is no practical way that any web content can be viewed without that content being temporarily stored on multiple servers on its route to the viewer. So it would seem to me that if the content is allowed to be published on a publicly accessible website, then they are waiving the restriction on temporary storage.

          1. jdiebdhidbsusbvwbsidnsoskebid Silver badge

            Re: Two questions for the price of one

            The Copyright, Designs and Patents Act in the UK is explicit that this sort of temporary storage, where it exists only to facilitate use, does not infringe copyright.

            Section 28A of the act says:

            Copyright in a literary work, other than a computer program or a database, or in a dramatic, musical or artistic work, the typographical arrangement of a published edition, a sound recording or a film, is not infringed by the making of a temporary copy which is transient or incidental, which is an integral and essential part of a technological process and the sole purpose of which is to enable—

            (a) a transmission of the work in a network between third parties by an intermediary; or

            (b) a lawful use of the work;

            and which has no independent economic significance.

            1. Felonmarmer

              Re: Two questions for the price of one

              So the argument is, if the AI copy made during learning is temporary, is this a lawful use of the work?

              If it is, then copyright infringement hasn't taken place.

              1. doublelayer Silver badge

                Re: Two questions for the price of one

                That's not the argument at all. This is not about temporarily copying the text into buffers during processing. It's about two other copies:

                1. The copy in the training data, which is not temporary because it's kept around for months to train models on, if not forever, so that it's available for subsequent models.

                2. The storage of the processed work, which in many cases includes most or all of the work, just sliced into pieces, in the final model.

                The copyright holders are claiming that point 1 is a violation of their rights because the companies did not get permission to obtain the work at all, and that point 2 is also a violation because it involves the storage and reproduction of their work. There are arguments that the second is not a violation which I don't find convincing, but either of those can be a problem for those who use copyrighted material as training data.

                1. mpi Silver badge

                  Re: Two questions for the price of one

                  > The storage of the processed work, which in many cases includes most or all of the work, just sliced into pieces, in the final model.

                  Please, do take any freely available ML model, open it (it's essentially a large collection of float32 values), and show me where it stores "most or all of the work".

                  1. doublelayer Silver badge

                    Re: Two questions for the price of one

                    Just because something's been reorganized and turned into floats doesn't mean the original data is not there. If things were that simple, I could eliminate piracy and copyright in one plan by making a suitably annoying obfuscation system. The way you can determine that the data is still there is when models like that start to reiterate the training data verbatim. They have been known to do so, sometimes on their own and more often when prompted with a starting point. They have to do some calculations to reconstitute the original work, but it's in there.

                    1. mpi Silver badge

                      Re: Two questions for the price of one

                      > Just because something's been reorganized and turned into floats doesn't mean the original data is not there.

                      If it were "reorganized and turned into floats", that would be correct. However, that's not what happens during model training.

                      To start with, the floats don't come from the data ingested; they are already there. Model training starts with the full-sized model, only its weights (the massive amount of floats) are initialized to random values in the range 0.0 to 1.0. Training merely adjusts the weights of the model.

                      Next, the amount of data ingested has nothing to do with the model size. If my model is 5GiB in size, it will be 5GiB after 1 sample, 1000 samples, or 10^12 samples have been ingested. So by application of the laws of entropy, the model is incapable of storing the ingested data. In a storage device, the amount of information stored is directly correlated to the amount of space needed to store it. Even if I had 1:1000 compression, I would still need 42x more storage if I wanted to store 42x the information.
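                      The fixed-size point can be sketched in a few lines of Python. This is a toy illustration, not any real training loop: the "model" is just a list of floats and the update rule is made up.

                      ```python
                      import random

                      # Toy "model": a fixed bag of weights, randomly initialised
                      # before any data has been seen.
                      N_WEIGHTS = 1_000
                      weights = [random.random() for _ in range(N_WEIGHTS)]

                      def train_step(weights, sample):
                          # Training only nudges existing weights (made-up update
                          # rule); it allocates no new storage for the sample itself.
                          return [w + 0.001 * (sample % 3 - 1) for w in weights]

                      for sample in range(10_000):  # "ingest" 10,000 samples
                          weights = train_step(weights, sample)

                      # The parameter count is unchanged, however much data went in.
                      assert len(weights) == N_WEIGHTS
                      ```

                      However many samples the loop consumes, the storage footprint of the model never grows, which is the entropy argument above in miniature.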

                      > They have been known to do so

                      Yes, they have. In most experiments showcasing that, those were very few, very specific cases, which were only found because people looked at the training data, picked samples they knew occurred in many copies in the dataset, and then prompted for exactly that data.

                      And even then the output wasn't exactly the same as the sample data.

                      This is the result of overfitting, and it is something that methodology in ML actively tries to avoid, because it degrades model performance by negatively impacting the model's ability to generalize.

                      Does that mean the model stores the data? No. It means the model is biased towards certain patterns more than it is to others.

                      1. doublelayer Silver badge

                        Re: Two questions for the price of one

                        This is all true, but if you train a 5 GB model on 1 GB of training data, those weights would end up including a great deal of the training data. Not so much if your 5 GB model was trained on 1 TB of data, though some of it could be. How large is GPT-4 again? We have no idea, because they didn't tell us. This means it's difficult to know how likely it is to contain certain chunks of the source material. Without knowing that, we have to rely on less reliable measures, such as whether it quotes large chunks. And it really isn't as difficult as you state to make it do so; OpenAI had to implement extra guard rails to reject any question that explicitly asks it to quote something copyrighted. If it didn't have that information in there, they would not have had to do anything, as the model would consistently fail. They added the guard rails because it was not consistently failing.

                        1. mpi Silver badge

                          Re: Two questions for the price of one

                          > but if you train a 5 GB model on 1 GB of training data, those weights would end up including a great deal of the training data.

                          And if I cut down trees, grind them into dust and make paper out of them, the resulting books include a great deal of that forest. But I doubt that squirrels will be very happy living in a library.

      3. TheMeerkat Silver badge

        Re: Two questions for the price of one

        > the machine has made a copy of the original work

        Every time you read a book on Kindle it makes a copy of a part of the book to show it on the screen

    2. Filippo Silver badge

      Re: Two questions for the price of one

      >In other words, if a human can do it legally, a machine should be able to do it legally

      But that's not obvious at all. There are many things that humans can legally do, but machines can't. Drive a car, in most jurisdictions. Enter legally binding contracts (you can have a machine automate this, but it's the human or company that's bound by it, not the machine). Fart (scale it up, at some point the machine will run into environmental regulations). Memorize large copyrighted texts verbatim (LLMs don't do this, but if they did, they would definitely be infringing; a human wouldn't). More.

      I really don't believe that "but it's legal for humans" would be a solid argument in court, no matter how much it may or may not make sense to us.

      I'm not saying that LLM training is legal, and I'm not saying that it's illegal. I'm saying that it's not at all clear or obvious which way it goes, existing legislation does not really cover this, and if you attempt to shoehorn this case into current copyright law, I could see very good arguments for it to go either way.

      Because of this, I think that the current crop of "AI" is standing on very shaky legal ground, until some high court manages to shoehorn this one way or the other, or lawmakers take action. The idea that someone somewhere wins a case and suddenly the entire industry is illegal is not so far-fetched.

      1. Catkin Silver badge

        Re: Two questions for the price of one

        It's just my opinion on the matter; I don't think it applies universally (e.g. driving) but, as far as copyright infringement for deconstructing reality goes, I think it does. As per your example of memorising, a human who made a perfect reproduction of a copyrighted work from memory by hand would be equally infringing.

        1. heyrick Silver badge

          Re: Two questions for the price of one

          I think it comes down, also, to a question of scale. A person with a memory way better than mine could perhaps copy Harry Potter word for word. But it takes time and effort to do so.

          A machine can easily do it a hundred times per second, and throw in little variations to tweak the story, without skipping a beat.

          It's like how many of us pay a levy on blank media. In the old days, copying a record was not feasible. Then came tapes, and the companies freaked out about that, despite the fact that copying could take over half an hour per tape even on a high-speed deck. Now? Dump an MP3 on a website and untold thousands can grab a copy instantly.

          That, I think, is the danger of AI ingesting stuff. It can spew out near copies faster than anybody could keep up.

          So, what cost creativity?

      2. Anonymous Coward
        Anonymous Coward

        Re: Two questions for the price of one

        Unlawful is the word you want, not illegal. But still, the implications of telling AI companies that they can't use other people's stuff for free any more - and should pay damages for past abuses - could be quite sudden and quite dramatic.

      3. Peter2

        Re: Two questions for the price of one

        > I'm not saying that LLM training is legal, and I'm not saying that it's illegal. I'm saying that it's not at all clear or obvious which way it goes, existing legislation does not really cover this, and if you attempt to shoehorn this case into current copyright law, I could see very good arguments for it to go either way.

        > Because of this, I think that the current crop of "AI" is standing on very shaky legal ground, until some high court manages to shoehorn this one way or the other, or lawmakers take action. The idea that someone somewhere wins a case and suddenly the entire industry is illegal is not so far-fetched.

        Law is read by exactly what the legislation actually says; then, if that's not sufficiently clear, by what the law's stated intention is (which is in the preface); and then by what the people who wrote the text intended it to do.

        If we accept that the existing wording is ambiguous then the simple argument on the part of the authors is almost certainly going to point to the preface of the first copyright act.

        Whereas Printers, Booksellers, and other Persons, have of late frequently taken the liberty of printing, reprinting, and publishing, or causing to be printed, reprinted, and published Books, and other writings, without the consent of the authors or proprietors of such books and writings, to their very great detriment, and too often to the ruin of them and their families: for preventing therefore such practices for the future, and for the encouragement of learned men to compose and write useful books; May it please Your Majesty, that [the copyright act] may be enacted...

        Simply, it cannot be reasonably argued that copying the artwork and style of existing authors and artists to create derivative works without payment to the original authors will work to anything but the very great detriment and possible ruin of those authors and artists. Copyright exists as per the explanatory preface for the encouragement of learned men to compose and write useful books.

        By law, an "AI" has already been found not to possess its own "legal personality", which means that as far as the law is concerned it is seen as a machine, like a printing press or a photocopier, which is controlled by an owner.

        This makes the current situation exactly what the copyright act exists to prevent, as per the original preface; and while I'm sure that there will be a great number of novel arguments as to why copyright should be effectively abolished in favour of the people copying it, I can't see how the law will actually be able to permit that. In the US you'd simply drag the argument out until the less funded side gives up and you win by default, but that sort of strategy is simply not done in the UK, because our judges really don't like that sort of abuse of process in their courts and tend to robustly retaliate.

    3. Doctor Syntax Silver badge

      Re: Two questions for the price of one

      "The secondary question is whether certain materials going into certain training sets were obtained in a way that wouldn't have been legal if a human were reading or viewing them."

      Some books include words to the effect of "not to be stored in an electronic storage system" as a condition of sale. That would be a clear infringement if such a book was used without specific permission. Even if the trained model doesn't contain verbatim text the training data would be an electronically stored copy falling foul of the condition.

      As to the wider issue, if the trained model is not a derivative work of all the previous works that were in the training set what's the point of training it on that data as opposed to random lists of words? Would such a trained model be simply fair use of the individual works? My understanding of fair use would be that I can embed one or several quotes from some author(s) into a work which is mostly my own. I'm not sure that embedding the entirety of another's work would count as fair use and much less so the concatenation of several such works in their entirety.

      If I were to produce a work which was simply a collection of material from other sources my understanding is that I would have database rights to the collection but not necessarily to the material which went into it. I think I'd have to agree that the training of the model would add database rights for the trainer. However, unless the original material can be passed off as fair use then surely the trained model remains a derivative work of its training material. As such it must surely also include the collected rights of the authors of the training material.

      If the production of the derived work is the infringing act then it seems somewhat disingenuous to offer protection against legal costs of those who use a product of it. It's misdirection as to where the potentially infringing act occurred.

      1. Mike 137 Silver badge

        Re: Two questions for the price of one

        "Some books include words to the effect of "not to be stored in an electronic storage system" as a condition of sale"

        The notional get-out for the LLM folks is that the text is not stored: it's scanned and tokenised, and the probabilistic relationships between these tokens and all other tokens so far generated are statistically analysed. It could be argued (and probably will be) that the first stage of this process is essentially no different from borrowing a book and reading it, and once tokenised copyright is not relevant because the text is no longer identifiable from the set of tokens. So it's probably not strictly a derivative work, because the presentation (what is actually protected by copyright) has been entirely eradicated from what is stored.
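        As a rough sketch of what "tokenised and statistically analysed" might mean, here is a hypothetical whitespace tokeniser feeding raw pair counts (real LLMs use subword tokenisers and learned weights, not count tables):

        ```python
        from collections import Counter

        def tokenise(text):
            # Naive whitespace tokeniser, purely for illustration.
            return text.lower().split()

        def bigram_counts(text):
            # What gets kept is a table of token-pair frequencies,
            # not the original prose.
            tokens = tokenise(text)
            return Counter(zip(tokens, tokens[1:]))

        stats = bigram_counts("the cat sat on the mat the cat slept")
        assert stats[("the", "cat")] == 2  # "the cat" occurred twice
        ```

        The artefact retained here is a frequency table rather than the text itself; whether that counts as "storing" the work is exactly the question being argued.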

        I don't condone plagiarism (which some might construe this as) but I can't see how it's going to be controlled if the sources used are openly accessible. Offering to cover clients' legal costs for copyright infringement suits only works for those who use the LLM, not the copyright holders (many of whom would not be able to afford the cost of legal redress), and reimbursing copyright holders would involve enormous and complicated human intervention, quite apart from the direct costs.

        Probably changing the law to allow the abuse (as is likely here in Blighty in the case of privacy legislation and has just been proposed in the case of the human rights of migrants) will be the ultimate way out.

        1. Paul Kinsler

          Re: can't see how it's going to be controlled

          I think the distinction between likely human and machine learning is not necessarily about the input stage, but rather the output stage, and the possibility of (non-trivial) monetisation.

          If I were to read and remember a lot of Harry Potter trivia, I would have spent a great deal of effort, and at best will be able to impress Harry Potter fans or appear on quiz programs; if I try to make money off it, at some point some rights holders are going to make claims. I might instead program that weird and entirely unreplicable machine I (hypothetically) have in my shed; but again this doesn't change the situation - it can be an amusing curiosity, but little more, or rights holders could quite reasonably come calling.

          In contrast, the whole *point* of these text-ingesting LLMs is to be reproduced at scale, used at scale, and monetised at scale. So naturally they (should) have to deal with the rights holders whose materials they have ingested, and ingested primarily as a commercial activity in search of profit. Of course, they are perfectly welcome to instead keep their LLMs as amusing and unique curiosities, making little or no money for anybody; and if so, rights holders would no doubt be mostly content to simply marvel at all this LLM cleverness. But there seems little likelihood that the "amusing curiosity" route is the point - the intent is replication, use, and significant monetisation.

          1. zuckzuckgo

            Re: can't see how it's going to be controlled

            The original rights holders might be more upset, and inclined to sue, if the AI operator was giving content away for free rather than charging for it. Charging for it would at least limit the distribution of the content and might make limits on the final customer's right to reuse enforceable. Freely distributed content would make it harder to track distribution and enforce copyrights after the fact.

            1. Paul Kinsler

              Re: if the AI operator was giving content away for free ...

              Perhaps, ... if the giveaway was at sufficient scale.

            2. Chet Mannly

              Re: can't see how it's going to be controlled

              "The original rights holders might be more upset, and inclined to sue, if the AI operator was giving content away for free rather than charging for it."

              I'd have thought that someone else making money off what they consider to be copyright infringement would anger them more not less.

              It sets up an argument that they should get a cut of whatever you built using their IP. It could end up like Spotify, where the service throws a few crumbs the way of the creators while making out like bandits themselves.

        2. Falmari Silver badge

          Re: Two questions for the price of one

          ”It could be argued (and probably will be) that the first stage of this process is essentially no different from borrowing a book and reading it”

          I say it is more akin to borrowing a book, making a copy, returning the book and keeping the copy to read whenever you want.

      2. Catkin Silver badge

        Re: Two questions for the price of one

        "Some books include words to the effect of "not to be stored in an electronic storage system" as a condition of sale. That would be a clear infringement if such a book was used without specific permission."

        That seems to be a separate (additional) legal question. As per the CDPA, it's not an enforceable restriction for programs (in terms of being a copyright infringement) and there are no provisions one way or the other as far as use of other copyrightable works goes. I'm not proposing that this allows carte blanche on electronic reproduction but, rather, that the presence or absence of such a clause is immaterial. If I'm missing a specific part of the CDPA, I'd really appreciate my attention being drawn to that paragraph.

        If it were instead a breach of contract with such a stiff penalty, that would seem to open the door for very onerous EULAs.

        1. doublelayer Silver badge

          Re: Two questions for the price of one

          "If it were instead a breach of contract with such a stiff penalty, that would seem to open the door for very onerous EULAs."

          I don't see that as any stronger than an open source licence. It's still based on the copyright rights to the content, and rather than applying a licence to modifications you make, it limits your ability to store the work on a different system. Not to mention that most of the ways you could store it on a system that would actually attract their attention would themselves be copyright infringement, and they would go after that instead. While their term technically means that scanning it is not allowed, they're unlikely to do anything to someone who did so for their own use, unless that person also published, sold, or made a commercial derived work from those scans.

    4. I am David Jones Silver badge

      Re: Two questions for the price of one

      “In other words, if a human can do it legally, a machine should be able to do it legally”

      But in this case it is alleged that the Books databases contain pirated books (wholly plausible) and so it would not be legal for a human to read/store the material.

      1. Catkin Silver badge

        Re: Two questions for the price of one

        Precisely, which is why I believe it's two separate legal questions.

  2. Knightlie

    "We're thieves and parasites profiting from the hard work of others, and we want the law changed to allow us to continue doing that."

    Imagine a burglar standing up in court and saying this.

  3. Anonymous Coward
    Anonymous Coward

    Information wants to be free

    I fully understand it might be hard to recover what was exactly taken without permission and from whom.

    Perhaps the solution is to make the fruits of this free in perpetuity. Force the big LLM players to give everyone unlimited access to all their models, data and uses in perpetuity. Force crooks such as Altman and Andreessen to pay any cost in the business from their personal assets to make good, until they have to live in a tent and the world is liberated from them.

    1. mpi Silver badge

      Re: Information wants to be free

      > Force the big LLM players to give everyone unlimited access to all their models, data and uses in perpetuity

      And who's paying for the compute these models require?

      Who runs the datacenter where all those GPUs run?

      Who pays the people running these datacenters?

      1. Anonymous Coward
        Anonymous Coward

        Re: Information wants to be free

        Like I said, Altman and Andreessen can be ordered by a court to pay for most of the costs, and staff can just work for free, just like they expected from the people whose data they use. Provide your time and effort for free for the good of humanity.

        If you saw what the world could summon up to persecute Fredrik Neij and Peter Sunde, relatively small players compared to Altman and Andreessen, then anything should be possible.

        1. mpi Silver badge

          Re: Information wants to be free

          > staff can just work for free

          And how many skilled engineers do you think would be willing to do that?

          As compared to how many would just burst into laughter and leave?

      2. Jellied Eel Silver badge

        Re: Information wants to be free

        > And who's paying for the compute these models require?

        Who cares? It's the principle that's important. AI modellers ingest huge amounts of copyrighted material without payment and create derivative works to make money. Why shouldn't others do the same? I'm pretty sure that if I created my own 'Book 1' containing all of MS or FaceMelta's source code for all of their applications and systems, then published it as 'training data', they would object.

        What's the difference?

  4. Mike 137 Silver badge

    Exactly what it's not doing

    "It's a large model trained on text data, learning the associations between different ideas" - Owen Larter

    It's not learning about ideas -- it's meaning-blind. It's just calculating the statistical associations between tokens whose meanings it has no comprehension of -- indeed, it has no comprehension of anything at all except what's the most likely next token to follow this one. The hype would persuade us that these machines can think, but what they do has no real correspondence with normal human mentation; they're just glorified auto-complete tools.
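    The "glorified auto-complete" claim can be made concrete with a toy next-token predictor. This is illustrative only; real models learn probability distributions over huge subword vocabularies rather than building lookup tables:

    ```python
    from collections import Counter, defaultdict

    def build_table(text):
        # Record how often each token follows each other token.
        tokens = text.lower().split()
        table = defaultdict(Counter)
        for a, b in zip(tokens, tokens[1:]):
            table[a][b] += 1
        return table

    table = build_table("the cat sat on the mat because the cat was tired")

    def next_token(word):
        # Emit whichever token most often followed `word` in training.
        return table[word].most_common(1)[0][0]

    assert next_token("the") == "cat"  # "cat" follows "the" most often
    ```

    Nothing here "comprehends" cats or mats; it only replays the most likely continuation, which is the point being made above about statistical association versus understanding.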

  5. Missing Semicolon Silver badge
    FAIL

    Legal right to our business model

    This is another example of big business inventing a new way to cheat or steal using interesting technology, and then claiming that compliance with the law is too hard. So we don't have to comply, because our business model relies on the mass, uncontrolled theft of something, or some other offence.

    Examples

    • Youtube does not moderate meaningfully. Copyrighted stuff stays up, copyright strikes cannot be appealed, creators get taken down for spurious strikes, etc.
    • Amazon does not check self-published books for plagiarism.
    • Amazon sells dangerous and knock-off goods, and only takes them down after complaint, it does not pre-emptively check items.
    • Uber "is not an employer"
    • Pinterest. Ha!

  6. martinusher Silver badge

    IP is an industry

    I read a fascinating press release this morning about the annual conference of the IP litigation industry (their term, not mine). That's right -- it's the idea that IP litigation is an industry, where investors who put capital into it get a return on their investment.

    So, as with mining for gold or other materials, the search is on for likely veins of ore that could be mined for profit. Part of this process is convincing the public at large that everything that's ever been done has intrinsic value and so must be rented, or value otherwise extracted from others for it.

    Those of us who use sheet music are familiar with the concept. Until relatively recently sheet music was incredibly overpriced due to copyright being held by a cartel of publishers. This became difficult once music could be easily reproduced because the majority of classical music we still use was published before 1923 so is now well out of copyright (and the traditional way of indefinitely extending copyright -- making small editorial changes which were themselves copyrighted -- could then be bypassed).

    If you're into music, you'll know that the vast bulk of music that's written and published is crap. This isn't new -- I've got some very old music dating from the mid-19th century, the era of Chopin and the like, which not only sold for amazingly high prices for the era but is utter crap. Time winnows the field, and I'd guess that books and pictures are no different -- most might have temporary merit but won't stand the test of time. It also tells us that there's really nothing new -- everything we create is based on what came before... so stop trying to pretend that everything is valuable!

    1. doublelayer Silver badge

      Re: IP is an industry

      Whether it is valuable is not important. It could be valuable, and thus we find it useful to protect it. If it's crap, then nobody will buy it and its protected value will still be low. If it is not, the people who put in the effort which resulted in it not being crap deserve to benefit from that effort.

      And yes, there will be an IP litigation industry, just as there is an industry for any profession, including ones that rely on negative aspects of our world. There is a toxic waste disposal industry, a fraud prevention industry, a repair of electronics after their manufacturer has dropped support industry, and an IP litigation industry. If we had less toxic waste, fraud, premature obsolescence, and copyright and patent violations, then we would need less of those things.

  7. aks

    QI-QO

    Why not train the model using solely out-of-copyright materials?

    One clear bonus would be to raise the quality of the output

  8. Old 555

    Microsoft, Meta, Google, ..., all need to make like the US Publishing giants

    The US, historically, like present day Iran, Eritrea, ... , once limited Copyright to works produced by its own citizens, on its own territory, and limited protection to 14 years, form publication. Many a large US publishing house, and media group, being founded on selling bootlegged copies of works by non-US Citizens, or the Works of US Citizens initially published overseas, without any compensation for the creator (Google 'Dickens US copyright").

    If the tech giants were to equip a fleet of ships as data centres, reflag them, or park them in the territorial waters of a nation that doesn't hold with the ever more prohibitive copyright treaties, and still uses a bootlegged copy of the US's own 1790 copyright law, they could legally digest, by Starlink or similar, the entire Inter-web, let their models transform an author, or genre, into a simple set of mathematical weights, discard the now redundant, un-transformed copies, and transfer the notes (weight set) back to their parent company, via some space magic, to copyright, and sell the use of, in a LLM. No copyright laws being violated in the attempt.

  9. AlexV

    Is it legal? Who cares. *Should* it be legal is the question to debate

    Copyright law was not written with anything like AI training sets in mind, it makes little sense to argue about whether the current law allows it or not. The result of that is pretty much going to be a coincidence either way it falls out. Might as well ask whether the bible declares it a sin or not.

    What they should instead be debating is, do we as a society want to allow AI training on copyrighted material without explicit permission from the copyright holder? Arguments on both sides, of course, there's benefit to society to be argued in either case, but it's government's job to decide which they think is more beneficial, allow it or not. Then make the law say that.

    1. doublelayer Silver badge

      Re: Is it legal? Who cares. *Should* it be legal is the question to debate

      True, we should be discussing that, but it's likely not to happen until some court has decided what the current law says. Once a decision has been made, lobbyists for AI companies and publishers will start to try to change the law to better serve their companies, and we can start having that conversation, not that our views will be at all important to the politicians making the final decision.

      In the spirit of having that conversation, I'm on the side of copyright here. I don't think the benefits of more articulate programs outweigh the costs of effectively telling anyone that, if their program is large enough, they can use anyone's copyrighted information in any way they please. We all know that this power would only be available to companies that are large enough; if I ran a copy of the Windows source code through as training data, Microsoft would not agree that it's acceptable, even as their friends at OpenAI effectively do the same to lots of others.

  10. SonofRojBlake

    "I'm a little bit cautious about the idea of forcing companies that are building AI to enter into bespoke agreements with individual rights holders or an order to pay for content that doesn't have economic value for them"

    You're a little bit cautious about having to deal with rights holders? I've got a solution for that - stop using their output. Simples.

    You're "a little bit cautious" about an order to pay for something that doesn't have economic value? Well that's perfectly reasonable. If you think something doesn't have economic value, then you have no need of it for your profit-making AI business... right? Because if YOU NEED IT to train your profit-making AI, then by definition it has economic value. I can see why you'd be "a bit cautious" about being made to pay for it, because that'll impact your profits, obviously.

    It's clear that the economic model used to justify the creation of LLMs took into account the colossal cost of the compute required, but assumed that the training data would be free. That's like assuming that your costs for a gold mine will be things like shovels, picks, dynamite etc., but not taking into account anyone who might have been living on the land before you turned up and started digging.... and assuming you can just make them, ahem, go away.

  11. Eclectic Man Silver badge

    Firstly, I reckon that any living author, or the rights holders to any copyright work, used to train an LLM should be informed before it is used. The mere fact that something is 'available' in electronic form should not be taken as free licence to use it in this way. When academics publish papers they often cite work that is relevant or whose results are used, or which inspired them. LLMs should do the same and be completely open about which works have been used for training.

    Secondly, as LLMs seem only to fins statistical associations between words and their most likely successors, how can this possibly be considered to be 'intelligence' of any sort?

    Thirdly, as an LLM automates the process of writing in the word style of someone else (presumably a respected or talented person), rather than their actual intellectual ability, this is clearly some form of mimicry rather than creativity. If, for example, the works of Newton, Gauss, Fermat and Euler were used to train an LLM, I expect it could mimic their prose styles quite effectively, but not their mathematical abilities.

  12. J.G.Harston Silver badge

    "if a company uses copyrighted material to build an LLM for profit, the copyright owner should be reimbursed."

    Isn't the copyright holder reimbursed when the trainer buys the book in the first place?

    1. Felonmarmer

      If they do.

      My understanding is that Books 1 to 3 contain originally pilfered material which have been made available by the pilferers for use by others. If they contain processed material then there's an element of open-sourcing going on, albeit without out the originators consent.

      The truth is the law on copyright was not set up to cover this sort of stuff and some new legislation and agreement is needed. But you won't get world wide agreement for any new stuff like they could get for the base copyright protection laws because some states which very much want to ignore copyright in this case won't agree to it.

    2. doublelayer Silver badge

      "Isn't the copyright holder reimbursed when the trainer buys the book in the first place?"

      So far, no, because they didn't buy the book. They found illegal copies online and used those for free.

      Even if they started buying individual copies, buying a copy of the book doesn't necessarily let you do whatever you want with it. For a very simple example, if I buy a copy of a book, I don't get to start printing and selling my own copies and saying that the author got their compensation when I bought the first copy. There are limitations on the use of the content of the books, and it is not clear whether AI training qualifies. I think it should not, but the law doesn't clearly answer either way.

      1. Jellied Eel Silver badge

        For a very simple example, if I buy a copy of a book, I don't get to start printing and selling my own copies and saying that the author got their compensation when I bought the first copy.

        Existing copyright also has provisions for derivative works. If my works are ingested into an AI training model, then used to new works in my style, or based on my work, then surely that's derivative? Prior law would seem to already apply, ie writers of fan fiction being sued, or even plagiarism. One defence against plagiarism can sometimes be you've never read the original copyright work. But the AI's are being fed this, without being granted any rights to create derivatives or the original work itself.

  13. Anonymous Coward
    Anonymous Coward

    whether creators should be paid

    did they bring the fence to prove their bleeding obvious point? Or a cat in a box?

  14. Nifty

    Oi! You got that off Shakespeare.

  15. Eclectic Man Silver badge

    Resignation from an AI company

    Today on the BBC web site:

    "A senior executive at the tech firm Stability AI has resigned over the company's view that it is acceptable to use copyrighted work without permission to train its products.

    Ed Newton-Rex was head of audio at the firm, which is based in the UK and US.

    He told the BBC he thought it was "exploitative" for any AI developer to use creative work without consent."

    Link: https://www.bbc.co.uk/news/technology-67446000

  16. Eric Kimminau TREG

    Libraries pay for content. Once.

    My .02 on training an LLM is that even a library pays for content to put on its shelf. Once. It can then be checked out ad infinem and no further compensation is generated. Now, ALL use by anyone must be reported in the biliography of references when quoted. GenAI LLM must therefore also generate the list of references used when generating their content. Otherwise the LLM is publishing referenced content without attesting the content to the author. Its only fair (to me). Cite your sources and give credit where credit is due and the argument is no different than checking out a book from the library and attesting your quote to the book and author in your biblography.

POST COMMENT House rules

Not a member of The Register? Create a new account here.

  • Enter your comment

  • Add an icon

Anonymous cowards cannot choose their icon

Other stories you might like