OpenAI: 'Impossible to train today’s leading AI models without using copyrighted materials'

OpenAI has said it would be "impossible" to build top-tier neural networks that meet today's needs without using people's copyrighted work. The Microsoft-backed lab, which believes it is lawfully harvesting said content for training its models, said using out-of-copyright public domain material would result in sub-par AI …

  1. elsergiovolador Silver badge

    Do

    It's the classic US model.

    Do something.

    Fill your boots.

    Worry about lawsuits later.

    If you make a crazy amount of money, play concerned and dumb.

    Feed your lawyers.

    Feed policymakers.

    Ensure new regulation comes out of it that pulls up the ladders.

    Enjoy virtual monopoly.

    Don't forget to pay artists a pittance for PR and laughs.

    1. ecofeco Silver badge

      Re: Do

      Nailed it. To a "T".

    2. Anonymous Coward
      Anonymous Coward

      Re: Do

      We British were great teachers!

      1. Sceptic Tank Silver badge
        Pirate

        Re: Do

        Yeah, colonialism and all that.

    3. Groo The Wanderer Silver badge

      Re: Do

      To clarify, it's not "do something", it's "do something that most people would consider illegal under current regulations." e.g. OpenAI "interpreting" the "need to be able to publish" permissions granted to websites by their users as approval for mass-scraping and harvesting, which it was NOT. Or any of the many ride-sharing services that claim they "aren't taxis" just because you hail them with an internet connection instead of a phone call.

  2. Anonymous Coward
    Anonymous Coward

    Imagine how wonderful the future could be when this pops up on some techbro's computer:

    -"you have generated an image that couldn't have been generated without the following artist's involuntary contributions, your account has been charged and no consent has been requested because you didn't request their consent when you stole from them"

    -"this image generation released 97kg of carbon dioxide into the atmosphere, your account has been further charged without consent because you didn't ask for the planet's consent when you helped ruin the climate"

    -"have a nice day"

    1. Cliffwilliams44 Silver badge

      The simplest thing would be to add the following to any material that may be copyrighted.

      "The material produced is copyrighted by {insert company/individual}. There may be other material that is copyrighted by other entities that cannot be determined. The use of this material in any commercial endeavor may put you at risk of copyright violations."

      I believe the real question is: "Is showing the material on the screen, for a fee, enough to violate fair use? Or is it what the end user does with that material that matters?"

      1. Mage Silver badge

        Re: simplest thing would be

        No need.

        Either stuff is copyrighted or it isn't. It's not Fair Use simply because OpenAI says so.

        You can give away rights in a copyright statement (CC, GPL, Free BSD, Apache etc). You can't create extra rights.

      2. Anonymous Coward
        Anonymous Coward

        Tech bros only respect money. Their "AI"s are like someone copying the style or content of another artist and doing things with that style or content the original artist didn't consent to. It wouldn't be legal for a human to do it so it isn't any more legal because a machine is doing it.

        1. mdfischer

          Copying content is a problem. Copying style is not so long as you don’t try to pass yourself off as the person whose style you cover, which is a criminal offence, not civil like copyright. Every writer, painter, musician, and ordinary person learns from the work of others. I don’t think we would want to monetise that.

          1. arthoss

            20 copies a minute

            The problem is that a person will take considerable time to learn and imitate a style, but a machine can do it fast and often.

  3. FF22

    Sounds like...

    ...OpenAI did not present a defense, but just supplied the perfect evidence and proof for their mass copyright violation.

    1. Sorry that handle is already taken. Silver badge

      Re: Sounds like...

      Indeed, if "OpenAI has said it would be 'impossible' to build top-tier neural networks that meet today's needs without using people's copyrighted work" is the best they can do, it's time to pack it up.

      1. Snake Silver badge

        Re: Sounds like...

        I'm going to play Devil's Advocate here.

        Is training your AI on copyrighted works actually illegal? Can it be argued that training is similar to a child reading a copyrighted book - learning what the book teaches but simply not allowed to duplicate it, which is plagiarism?

        *IF* the AI is not allowed to copy form or function of the copyrighted work, using it as a very rough template but creating pure original work, with no similarity to any original work, isn't that "learning" and can be accepted as not a copyright violation? But is our current form of "AI" actually capable of that at all?

        These are questions that will need to be answered if 'real' AI is ever to go...anywhere. We, as humans, always take in other's works, nowadays copyrighted, and transform it into our own. When we don't transform it...we get into trouble.

        If "AI" is transformative enough, and/or restricted from mimicking any originals, is using materials still bad?

        1. Anonymous Coward
          Anonymous Coward

          Re: Sounds like...

          Depends on whether you bought said book, borrowed it from a library (which would have some kind of controls in place - a restricted number of copies, a suitable arrangement re copyright payments, etc.) - or whether you'd simply stolen it.

          1. Zolko Silver badge

            Re: Sounds like...

            training is similar to a child reading a copyrighted book

            yes, I was thinking about that too

            Depends on whether you bought said book

            oh but yes, good point, I didn't think about that one. So, does OpenAI train its models on copyrighted material that it has paid for, or merely scraped for free somewhere? Makes a BIG difference.

            1. mmccul

              Re: Sounds like...

              The argument being made by several artists (and it sounds like the NYTimes as well) is that OpenAI scraped material that it had no legal right to access in the first place - material for which OpenAI actively evaded the technical restrictions governing access.

              1. Anonymous Coward
                Anonymous Coward

                Re: Sounds like...

                And, further, they did not do this merely in order to amuse their grandmother on a wet Sunday afternoon, distract some passing pigeons, or other similarly trivial and non-commercial activities; but to produce a product which they intend to replicate and sell-on at scale.

              2. Long John Silver
                Pirate

                Re: Sounds like...

                Anything sitting on the Internet is fair game.

                If data are in need of protection, it is the responsibility of their custodian to arrange appropriate security.

                People/institutions issuing news reports, opinion, or anything else purportedly protected by copyright should consider from whence (other than anachronistic law) their supposed entitlement to be recompensed for their work arises. The attitude appears to be “pay first, and then read.” It should be, “I have this to offer. If you like it, please support me (e.g. crowdfunding) in producing further works.”

                Where AI is concerned, means are required for quotations and paraphrases to be given attribution of source, so far as practicable.

                1. jdiebdhidbsusbvwbsidnsoskebid Silver badge

                  Re: Sounds like...

                  "Anything sitting on the Internet is fair game."

                  Not in the UK it isn't. Even if you have permission to view it, if you download and store material from the internet for any commercial reason, then unless you have permission to do so in some form of license you've infringed copyright. There are some "fair use" clauses that would allow you to do it, but only for private non-commercial purposes. It's the "for commercial use" clause in the UK Copyright, Designs and Patents Act that stomps over any claim of fair use.

                  Like others have said, training an AI causes copyright issues not because of the training, but because of the copying and storing of the copyrighted material that it is presumed OpenAI and others have done when training their models.

                2. Anonymous Coward
                  Anonymous Coward

                  Re: Sounds like...

                  The publishers didn’t put the books on the internet. This is material illegally uploaded to torrent sites. There aren’t large repositories of copyrighted books free to download other than the illegal ones.

                  1. mdfischer

                    Re: Sounds like...

                    The publishers do put pretty much all their books online. If OpenAI didn’t have the good sense to pay for these, then they deserve whatever problems they incur. I’m sure that the NYTimes did sell a subscription to OpenAI, and probably a commercial one like they would sell to a university. Will the Times go after the universities to collect their cut of the lifetime efforts of their students?

            2. Mage Silver badge
              Alert

              Re: Sounds like...

              Except it's nothing like "training" a human. It's not training at all.

          2. Anonymous Coward
            Anonymous Coward

            Re: Sounds like...

            Not really. If I swipe a textbook, read it, and learn something from it that I use later in life - it is still not copyright infringement. Every textbook we use to train our children is copyrighted, even if the material in those books is public domain. Training AI on copyrighted works should not be a problem; having AI regurgitate those works is a problem. Having AI generate works impersonating something or someone they are not is a problem. Having AI generate works that are good enough to put Hollywood writers out of work is a problem, not only for AI but because the inherent predictability and lack of real creativity in the scripts created by the humans causes intense cerebral pain in any human who reads them.

        2. Catkin Silver badge

          Re: Sounds like...

          Is there a similar legal decision for books to Warhol vs Goldsmith?

        3. Lusty

          Re: Sounds like...

          The child isn’t being commercialised as a product. In this instance they are creating a commercial product based on copyrighted work, and no, it’s not learning: it’s a statistical model, much like a search engine, and we have case law for search engines providing copyrighted material instead of linking back to the source.

          And all that is also ignoring the enormous costs to platforms they have pillaged to get the data. Reddit famously crippled their platform to stop it as they couldn’t afford to supply the API access to feed these beasts.

          1. Catkin Silver badge

            Re: Sounds like...

            Can you please cite the case? I had a search but came up with a blank. It seems like at least some data extraction from copyrighted data is permitted, otherwise search engines wouldn't be legally able to function.

            1. Lusty

              Re: Sounds like...

              Google were hit hard a while ago (probably 20 years) for essentially providing the results in their pages riddled with ads, robbing the actual source of the opportunity to monetise. Any actual information Google returns now, such as Wikipedia data, is licenced and they pay a fee. I believe the newspapers were the ones with the beef originally.

              1. Catkin Silver badge

                Re: Sounds like...

                Thanks, I wasn't aware of the newspapers one as far as their broad search goes, only the Google News issues. Interestingly, Field vs Google (as well as Perfect 10 vs Amazon/Google and Birgit Clark vs Google, but those relate to images) went the other way, with the former actually finding that it was legitimate to both index and cache, as well as serve up the cache to the public.

                For reference, the Wikipedia payments are not for appearing in the search results but in the summarising block of text that appears to the right of some searches.

                1. Lusty

                  Re: Sounds like...

                  "For reference, the Wikipedia payments are not for appearing in the search results but in the summarising block of text that appears to the right of some searches."

                  The point is, they have to pay to provide someone else's IP. Doesn't matter if it's in the results or on a bumper sticker: you pay for what you use and how you use it, and importantly it's the IP owner that decides the terms.

                  1. Catkin Silver badge

                    Re: Sounds like...

                    Certainly, for verbatim and for the same purpose but please see those other cases for some examples of nuance. Not to say it's clear in the opposite direction either.

                  2. Catkin Silver badge

                    Re: Sounds like...

                    To add, the nuances are important for everyone. For example, if your unambiguous assertion were enshrined universally in law, it would be impossible to point out lying corporations or biased media, because any critique would either be libellous (due to not having a genuine source) or an infringement of copyright.

                    It's important to consider broader outcomes rather than cheering for a blanket legal precedent or act that stops one thing you disapprove of.

                    1. Lusty

                      Re: Sounds like...

                      I certainly do disapprove, and have certainly considered the outcomes. I think the opposite is also true though, a lot of people are abandoning reasonable and fair IP protections because they think AI is cool. I agree it's cool, but that doesn't make it acceptable to steal all of the IP in the world and use it for your own commercial gain.

                      Even ignoring the IP issue - this has cost real money to organisations like Reddit, whose API was hit so hard and so often they had to cripple the platform and charge for the API as a result. Even if you agreed with the IP theft you'd have to acknowledge that the compute costs racked up as a direct result of pillaging the Internet ought to be paid for out of the many billions of profit from the models?

                      1. Catkin Silver badge

                        Re: Sounds like...

                        I agree on the "opposite" and I'm not on the opposite side of the fence from you by any measure, I've just seen too many rights disappear under the thunderous cheers of a populist oversimplification.

                        For the compute costs, unlike the IP, I'd have to firmly disagree. Websites are entitled to use measures to control access but I don't think that anyone but the host should have to bear the cost of non malicious access (malicious being DoS and similar), any more than ISPs should demand that websites fund the cost of delivering their content to the user.

          2. Mike007 Silver badge

            Re: Sounds like...

            Saying it isn't learning is wrong, unless your argument is that technically you never learned anything: you simply had signals from your eyes and ears cause a change in the state of your neural network...

            1. sabroni Silver badge

              Re: Saying it isn't learning is wrong

              LLMs don't learn, they shuffle and re-arrange existing stuff to produce output that appeals to humans. That's not learning. Learning involves understanding.

              Do you think LLMs understand? Do you believe your brain is running an LLM?

              Do you want to buy a bridge?

              1. Sceptic Tank Silver badge
                Headmaster

                Re: Saying it isn't learning is wrong

                I have encountered individuals with masters degrees that have certainly learned, but definitely did not understand.

              2. Mike007 Silver badge

                Re: Saying it isn't learning is wrong

                Your organic neural network works in the same way as these simulations... Indeed those studying LLMs have already identified evidence of reasoning that looks very much like "thought", suggesting that language is indeed the source of "human intelligence" (as opposed to "animal intelligence"... primates who have been taught sign language behave differently to those who haven't).

                Why do you think it is called a "neural network"? The only practical difference between a simulated neural network running on a CPU(/GPU) and physical one made of organic material is the way data is fed in to it, and of course the initial configuration.

                1. doublelayer Silver badge

                  Re: Saying it isn't learning is wrong

                  "Why do you think it is called a "neural network"?"

                  Because it's designed to simulate neurons. Not that neurons simulate it. I can give you a big bucket of neurons and they won't be doing this stuff. Just because that's the inspiration for the model doesn't mean that anything a neural network does is what a brain would do. Similarly, have you heard of genetic algorithms? They're quite cool and sometimes work well, though like neural networks they're pretty compute intensive to get going. They don't act like DNA does, though.
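
                  To make the "simulate neurons" point concrete, here's a minimal sketch of what one artificial "neuron" actually is: a weighted sum plus an activation function, nothing more. The inputs, weights, and bias below are invented purely for illustration.

                  ```python
                  import math

                  # One simulated "neuron": a weighted sum of its inputs plus a bias,
                  # squashed through a sigmoid activation into the range (0, 1).
                  def neuron(inputs, weights, bias):
                      z = sum(i * w for i, w in zip(inputs, weights)) + bias
                      return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation

                  # Arbitrary example values: z = 0.5*0.8 + (-1.0)*0.3 + 0.1 = 0.2
                  out = neuron([0.5, -1.0], [0.8, 0.3], 0.1)
                  ```

                  A network is just many of these wired together, which is why "designed to simulate neurons" is a statement about the inspiration for the arithmetic, not a claim that the arithmetic behaves like a brain.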

            2. Lusty

              Re: Sounds like...

              Sorry, but it isn't learning: it literally uses vector maths to predict the best fit for the next word, based on statistical analysis of lots of words. Image-based systems are similar but not quite the same.
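
              A toy sketch of that "vector maths" claim, with a made-up three-word vocabulary and invented embedding values (real models use learned embeddings over tens of thousands of tokens, but the mechanics are the same: score candidates against a context vector, softmax, pick the best fit):

              ```python
              import math

              # Invented 2-dimensional embeddings for a toy vocabulary.
              vocab = {
                  "cat": [0.9, 0.1],
                  "dog": [0.8, 0.2],
                  "car": [0.1, 0.9],
              }

              def predict_next(context_vec):
                  # Score each candidate word by dot product with the context vector.
                  scores = {w: sum(c * e for c, e in zip(context_vec, emb))
                            for w, emb in vocab.items()}
                  # Softmax turns raw scores into a probability distribution.
                  m = max(scores.values())
                  exps = {w: math.exp(s - m) for w, s in scores.items()}
                  total = sum(exps.values())
                  probs = {w: v / total for w, v in exps.items()}
                  return max(probs, key=probs.get), probs

              # A context vector pointing in the "animal" direction of this toy space.
              word, probs = predict_next([1.0, 0.0])  # → "cat" scores highest
              ```

              Nothing in that pipeline stores "meaning": the output is whichever token is statistically closest to the context.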

              1. Anonymous Coward
                Anonymous Coward

                Re: Sounds like...

                well, in a way it is 'learning', if learning is accommodating knowledge, re-shifting various bits of it and re-producing it, more or less perfectly, or making up new stuff based on it. Which we humans happen to do when we hallucinate in our dreams. Or, for some (politicians, etc.), in day-to-day dealings.

                1. Lusty

                  Re: Sounds like...

                  sorry but you've misunderstood what it's doing. A hallucination in this stuff is when it comes up with a statistically reasonable load of bullshit. Take the example where it cited some cases in a US lawsuit and it took months for everyone to realise those cases don't exist; they just sound plausible. That's not imagination, it's pure statistics.

        4. doublelayer Silver badge

          Re: Sounds like...

          "Can it be argued that training is similar to a child reading a copyrighted book"

          It can be argued, and it has been by many people. I have yet to see it argued successfully, however. Usually, the argument goes like this:

          The process of getting text into this model is called training. The process of educating a child can be called training. Therefore they must be the same. The work printed by the model looks like an essay. Students also produce essays. They must be the same. Argument ends here.

          Actually arguing that would require you to demonstrate why the training of a model which can and does memorize large chunks of text and sometimes prints it verbatim is equivalent to human reading, and not by resorting to humans with incredible memories who may or may not be able to recount a book back to you on reading it once. It will require you to determine if you think that reading some books and reading millions of books, more than any human could possibly do, are the same or not. It will require you to prove an equivalence between the statistical methods used on the training material and human intelligence, which will be quite difficult. It will require you to prove that the parts of human experience other than reading which affect their products are sufficiently small that they can be discounted when making the comparison to the way an LLM produces its output. Unfortunately for anyone making these arguments, these are all relatively subjective arguments, but to the extent that they can be argued, they usually produce a stronger conclusion that ingestion of text into a model is not at all like a student's learning.

          1. sabroni Silver badge
            Thumb Up

            Re: downvoted but no rebuttal

            You know you've hit the nail on the head.

            1. Anonymous Coward
              Anonymous Coward

              Re: downvoted but no rebuttal

              that man from mars fellow must seem like a genius to you

            2. Snake Silver badge
              Devil

              Re: downvoted but no rebuttal

              Fine, I just rebutted

          2. Adrian 4

            Re: Sounds like...

            But while a child's essay, which has no publication or circulation, may not acknowledge its sources, a student's essay must. In fact, the essay becomes more valuable by doing so, since attribution allows the following of references and additional detail.

            An AI that produces text without references is of little value : it's just boilerplate and the wide occurrence of hallucinations devalues it.

            What the LLM needs to do is tag its sources so it can show attribution and produce a more useful document. But that's hard to do, because it doesn't select arguments meaningfully, it just makes a word soup and statistically selects content from it.

            I'm not very fond of the copyright industry and the abuse by companies like Elsevier. But when AI understands and implements respect of copyright, it will be a lot more valuable.

            1. Long John Silver
              Pirate

              Re: Sounds like...

              Yes, but AI should be enabled to make, when practicable, attribution to sources. Copyright is an irrelevance: an anachronism, which the digital era plus so-called AI will abolish.

            2. Mage Silver badge
              Flame

              Re: Sounds like...

              OpenAI & MS should close down Open AI

              The LLM / ChatGPT issues

              1. IP theft of content. They've admitted that they use copyrighted content and claim the models are no use without it.

              2. Environmental damage (data centre)

              3. Dubious profit mechanism.

              4. Large amounts of plausible rubbish produced by queries.

              5. No easy way to test veracity of answers.

              6. Privacy issues asking questions.

              7. They aren't prepared to pay for copyright content. They have scraped pirate sites.

              8. They want to redefine Fair Use as wholesale copying of copyright content.

              Training is a marketing-speak lie. It's a special kind of database with a predictive text engine. All the copyrighted content is "in" it.

              An expensive toy of almost no value that exploits creative humans.

          3. Snake Silver badge

            Re: Sounds like...

            "Actually arguing that would require you to demonstrate why the training of a model which can and does memorize large chunks of text and sometimes prints it verbatim is equivalent to human reading, and not by resorting to humans with incredible memories who may or may not be able to recount a book back to you on reading it once."

            That is actually a surprisingly easy "argument" to make.

            "Frankly, my dear, I don't give a damn."

            "Space, the final frontier. These are the voyages..."

            "Do, or do not. There is no try."

            "Something in the way she moves, attracts me like no other lover. Something in the way she woos me..."

            "Happy birthday to you, Happy birthday to you..."

            "I'm sorry Dave, I can't do that."

            ...

            As a collective society we all have hundreds of copyrighted quotations, songs and phrases stuck inside our heads from exposure to popular media, books, magazines and TV. We have become so accustomed to their almost universal recognition that we fail to remember that, yes, the sources are copyrighted material. We sing songs to our lovers and friends, or even strangers in karaoke bars, and all of that is copyrighted.

            When we do something like make new phrases in Yoda-speak, or change the words of a song we know, we take a copyrighted work and transform it. When we get inspired by a painting we saw and take a photograph based on that expression, we have transformed it. When we write an essay based upon our readings and studies, we have transformed it.

            IF, and that's a big "if", the AI is transformative enough, creating something that is original but inspired by another work, is that any different than a human?

            1. Justicesays

              Re: Sounds like...

              "IF, and that's a big "if", the AI is transformative enough, creating something that is original but inspired by another work, is that any different than a human?"

              https://www.copyright.gov/comp3/chap300/ch300-copyrightable-authorship.pdf

              Section 308.

              Whatever is created by an AI can only be purely derivative: under US copyright law it would not qualify for copyright itself under Section 308, as it doesn't contain any creativity as defined in (US) law.

            2. doublelayer Silver badge

              Re: Sounds like...

              This is subjective, but I do not think your argument qualifies. Remembering a sentence and modifying it is not the same as remembering the entire book and quoting it. LLMs have frequently done the latter. It's not "It was the best of times, it was the worst of times" but me typing the entirety of the opening chapter into this box. I have read that book, but I cannot do that. I don't think any student could unless they had specifically studied the chapter or if they were trapped in a prison cell with only that book for years and had become obsessive. LLMs frequently do it without that being the desired outcome, and when people do want that outcome, it happens quite reliably.

              1. Richard 12 Silver badge

                Re: Sounds like...

                And if I do write out a few complete pages of a book, with or without a few typos and minor alterations, I am in breach of copyright.

          4. jdiebdhidbsusbvwbsidnsoskebid Silver badge

            Re: Sounds like...

            "It will require you to prove an equivalence between the statistical methods used on the training material and human intelligence"

            Not to mount a legal case claiming copyright infringement you wouldn't. Arguing an equivalence between AI training and human learning is not relevant in law, because (in the UK at least), copyright does not apply to a person's memory.

            1. doublelayer Silver badge

              Re: Sounds like...

              True, but this was a discussion about whether you can claim that training an LLM is similar to human learning. You don't need to prove the method of human intelligence if you just want to make a copyright point, but if your defense to the copyright claim is based in neuroscience, you do. The AI companies have made it clear that they're not going to attempt it, likely because they have experts who know how silly it would be to do. While they might succeed at confusing a jury, they'd have to do it by lying to them. Meanwhile, their fair use defense will be easier to argue, so it appears they're going with that. Their analogies will not be to education and human brains, but to libraries and search engines. I don't think that argument is good, but it's a lot closer to valid than the one about learning.

        5. Tom66

          Re: Sounds like...

          It's a question for, ultimately, the Supreme Court I imagine. There is absolutely no way that any original drafters of copyright law had AI in mind. There are good arguments either way.

          Arguments in favour of AI using copyrighted material: Artists and authors do the same. Everyone learns from reading copyrighted material. Artists base their work on what they see in the living world, including that of real art.

          Arguments against: AI massively scales up what any one human can do, it could be said to be infringing because no artist could expect another artist to be able to create their work within seconds based on an extensive training database.

          I suspect that courts will rule in the latter category as they have traditionally always protected copyright holders, which may well lead to the death of many AI companies since their models are not trainable without at least some copyrighted text. It could become legally impractical to verify that content is not copyrighted; for instance, even if you just use Wikipedia as training information, people submit copyrighted content all the time on there, and it may never be noticed.

          1. Lusty

            Re: Sounds like...

            "There is absolutely no way that any original drafters of copyright law had AI in mind"

            They didn't need to, it's generic enough as it is to cover the situation.

            If I have an image and put it online you can assume the right to view that photo since I placed it in a public platform to be consumed. That does not give you a licence to redistribute that photo, nor to use that photo in any of your own commercial offerings.

            If I write some text and put it online you can assume the right to read that text since I placed it in a public platform to be consumed. That does not give you a licence to redistribute that text, nor to use that text in any of your own commercial offerings.

            If you wish to create a textbook, that would be a commercial undertaking and you will need to seek licenses from me to include my work, whether wholly or in part. The AI organisations have such a "text book" in their possession as an offline copy of the Internet and its history, and are using it for commercial purposes right now.

            If you wish to use my IP in a commercial product such as a statistical model, then again you'll need to seek a licence.

            If you don't have a licence for the type of commercial usage you are carrying out then you are outside of the law and will either need to buy a licence or accept any and all legal consequences.

            Pretty much all of the above was covered by the Windows 3.1 EULA back in the '90s. What's good for the goose...

          2. Long John Silver
            Pirate

            Re: Sounds like...

            Copyright, USA style, is enforceable only where US Marines can reach.

        6. The Indomitable Gall

          Re: Sounds like...

          The implied consent related to selling you a book is that you will use the information within for uses that are common for many people. Finding a new use for it and then saying that it's not infringing because you bought a copy... well that's just rubbish.

          Here's a counterpoint:

          if I were to build a university programming course leaning heavily on (e.g.) Kernighan and Ritchie's text, I would be expected to list it as a set text in order that students might buy it. The students might then go on to teach themselves, and it is accepted that there will be similarities between how they were taught and how they go on to teach, but this is an expected result. Crucially, they will forget the details of K&R's book on C, and will instead be passing on information based on their internal model of concepts. AI does not, at present, have any model of concepts -- the second L in LLM is for "language".

          LLMs create language based on language; humans create language based on *ideas*.

          I believe that "ideas" are key to the notion of a "creative step".

          1. Mage Silver badge
            Headmaster

            Re: the second L in LLM is for "language".

            Except the machine has less concept (none, actually) of language than any creature that has a vocabulary but not a language. Even the word "language" is a marketing lie. A big vocabulary, a pattern-matching scheme and a "predictive" text engine are not language. There is no context or understanding, which is why the LLMs produce plausible rubbish from scanning all the human-generated texts and regurgitating them. There is no training or machine learning, or "neural network": just feeding text (and images on some systems) into a data-flow database with nodes and pattern matching. A computer's so-called "neural" network is nothing like a biological brain. It's all marketing lies. It's not even actually AI. Also, they can't afford to properly curate the input or buy the copyrighted material. The environmental cost is also vast. The LLMs do not hallucinate, because inherently there is no context or understanding, no mechanism to check correctness.

        7. Doctor Syntax Silver badge

          Re: Sounds like...

          "Can it be argued that training is similar to a child reading a copyrighted book - learning what the book teaches but simply not allowed to duplicate it, which is plagiarism?"

          No it can't. Pick up any book published in the last few decades and open it at the page with the copyright and Library of Congress data etc. It will be somewhere near the title page, the dedication, if any, and before the table of contents. You will see something along the lines of "No part of this book may be reproduced, stored in an electronic retrieval system" etc. etc.

          The terms on which the book was sold to you explicitly forbid various stuff that isn't like a child reading a copyrighted book.

        8. chololennon
          Facepalm

          Re: Sounds like...

          > training is similar to a child reading a copyrighted book

          No way: the child is not making a lot of money (obscene amounts of money, btw) immediately after he/she has read the training material.

        9. Mage Silver badge
          Facepalm

          Re: Sounds like...

          LLMs are not "general AI", aka "Real AI" and never will be. Even the word "Training" is marketing speak.

        10. Teesside John

          Re: Sounds like...

          Perhaps a more fundamental question is why have copyright at all? I guess to reward and incentivise people to create stuff. Why would somebody spend months writing a novel, and a publisher pay for it to be edited and formatted, if the first person to buy it can put it online for free or print and sell their own copies? There will always be some people who produce stuff for free, e.g. fan fiction and open source software, but that should be the creator's choice, and a lot won't get produced if there's no way to make money.

          And not every creator is a rich rockstar - it'll be the masses that do copywriting, voice overs for corporate videos, graphic design for company logos etc that lose out first.

          We therefore need a copyright system that works - one that makes it worthwhile to create stuff but doesn't leave people unable to innovate out of fear of being taken to court.

          With AI, we need to make sure the copyright system still works. Imagine I'm a photographer who takes pictures of everyday things and local landmarks etc to sell on stock photo sites. I make a living - I'm happy. People use my photos - they're happy. Now imagine my photos get scraped for next to nothing and used by an AI to generate similar images for free. I now have no money (why pay when an AI can produce something similar - even customise it - for far less?) If I stop taking photos the AI has less data and eventually the images it produces start to look dated. Now everyone loses.

          Regardless of any philosophical comparison, it's wrong to take lots of hard work, use it to make money and leave the creators with nothing. We therefore need to ask how we adjust the rules to work with AI.

          1. Richard 12 Silver badge

            Re: Sounds like...

            Not quite.

            The question is how we adjust AI to obey the rules.

            If OpenAI cannot do that, then OpenAI must die.

            This is not any different to claiming your business relies on maiming all your workers.

            If your business relies on unlawful acts, then your business is unlawful and must be shut down.

        11. trindflo Silver badge

          Re: training is similar to a child reading

          Only if the child can perfectly reproduce what it has read without understanding it at all. It is the exception for two talented programmers to produce exactly the same code, although I have seen it happen. Humans need to strain, or plagiarise, to produce that level of conformity. The AIs seem to require extra work to avoid it.

        12. Sorry that handle is already taken. Silver badge

          Re: Sounds like...

          It's an interesting subject to explore and I should say I'm wholly unqualified to contribute more than base speculation but LLMs being transformational would be something quite special, wouldn't it?

          An author did recently write a piece about how the LLM way of doing things is very similar to the way that authors in general, particularly successful ones, do things, but as you say, it's the transformational nature of the human mind that is the key. You can learn how good novels are written by reading them, but the way a human mind works is very complicated, and can an LLM do anything more than merely ape what it's learned?

        13. 96percentchimp

          Re: Sounds like...

          "Is training your AI on copyrighted works actually illegal? Can it be argued that training is similar to a child reading a copyrighted book - learning what the book teaches but simply not allowed to duplicate it, which is plagiarism?"

          No. The clue is in the term "works". It's the basis of copyright and the reason why copyright law refers to "works" and not "content". A person learning performs work - time, mental and physical effort and use of artistic resources - by ingesting the copyrighted material and practising to absorb it into their own repertoire of styles. This is as true for a child as it is for a master. An AI performs a very small amount of work ingesting and processing content, and even this is rendered negligible by the massive scale of its outputs to millions of users. It is nothing but algorithmic hijacking of artistic labour.

          If copyright law is to mean anything, then it must protect the work performed by humans* in the creation of copyrighted content. Or any other conscious intelligence, non-human or non-biological, although I draw the line at techbros.

  4. Anonymous Coward
    Anonymous Coward

    In other news, the Pope's leanings towards Catholicism exposed.

  5. Steven Raith

    " The Microsoft-backed lab, which believes it is lawfully harvesting said content for training its models, said using out-of-copyright public domain material would result in sub-par AI software."

    "If I don't go around stealing everyone's posh cars, how am I supposed to present myself as successful?"

    Fucking clowns.

    1. ecofeco Silver badge

      The Microsoft-backed lab...

      I think I see the problem. Like a 50,000 watt searchlight.

    2. thosrtanner

      It also suggests the AI software is somehow not sub-par anyway, which is a claim I find hard to swallow

    3. The Indomitable Gall
      Joke

      Ah, but don't you see...? If Elon Musk had been allowed to copy existing cars, Tesla would never have had problems with the brakes. Copyright kills!!!

    4. Duncan10101

      Yup

      The logical fallacy is called "Appeal to consequences."

  6. Anonymous Coward
    Anonymous Coward

    The American way !!!

    It is impossible for me to make a fortune without having a fortune first .... so I will just rob a few banks !!!

    You can have the money back when I have made my 1st $Billion .... if that is Ok !!!

    :)

    1. ecofeco Silver badge

      Re: The American way !!!

      The American CORPORATE way is ALWAYS to steal.

      Always.

    2. Anonymous Coward
      Anonymous Coward

      Re: The Russian way ???

      I want a bigger country and great access to the Black Sea, so I'll help myself. Shouldn't take more than oh, 3 days.

      1. Catkin Silver badge

        Re: The Russian way ???

        Or Line X

      2. Anonymous Coward
        Anonymous Coward

        Re: The Russian way ???

        For now, 2 years down the road, this approach seems successful (on this particular, somewhat narrow, front). In the long(er) run it depends whether the world tires and shrugs it off (success!), or whether the short, semi-victorious war speeds up the rot and causes a collapse. Or whether the Chinese, at some point, send 25 million peacekeepers, armed purely for self-defence, to help support the Russian population against unfolding chaos. In Russia ;)

    3. The Indomitable Gall

      Re: The American way !!!

      Nononono.

      You're only allowed to rob banks if your dad was filthy rich.

      And even then, you need to dress your bank robbery up as a sophisticated economic model that says that robbing people will make them richer.

      That's the real law of the markets!

  7. Anonymous Coward
    Anonymous Coward

    VC playbook

    Steal idea

    Pump like crazy

    Profit

    Dump - leaving suckers to carry the can and legal fees

    Rinse

    Repeat

    The trick to making hodloads of cash is seeing the next fad before the crowd realise they want it.

    Their specialist subject. Regrettably not mine.

    ML currently is effectively a super cool search engine with bells & whistles, sold as "intelligent".

    Sure, it will lose some jobs, and create some others.

    However. As that Shakeyspeare bloke wrote.

    All that glisters is not gold.

    1. This post has been deleted by its author

    2. Mage Silver badge
      Coat

      Re: VC playbook

      Charlie Stross said LLMs are the next cryptocurrency / blockchain. And that wasn't a compliment.

  8. HuBo Silver badge
    Happy

    Add diction ...

    "OpenAI has said [...] using out-of-copyright public domain material [following the law] would result in sub-par AI software [...] profits"

    Following the law would result in sub-par cocaine ... profits ... too!

    "how to evade copyright law by 'laundering' data through a fine-tuned codex[?]"

    Codexycodone?

    "We won’t get fabulously rich if you don’t let us steal"

    Say hello to my little friend ...

    (P.S. The linked Gary Marcus and Reid Southen IEEE Spectrum piece is superbly illustrated)

  9. Kiss
    Alert

    Legality and Ethics

    It is strange that these AI models can't be taught the concept of laws. These are public domain in most countries, so therefore they should be able to demonstrate their compliance when generating output.

    Just because one CAN, does not mean one SHOULD. The models should also be taught ethics, and these would need to be aligned to country/culture.

    Maybe the models should pass legality and ethics tests before they have the right to operate publicly, just like many other professions. We have laws whose purpose is to protect our broader societies; we need protection from e.g. theft, lies, and purposeful propaganda. A lawyer will provide advice based on described circumstances - you don't expect the advice to be illegal, so when you pay for something it should be legal.

    1. Adrian 4

      Re: Legality and Ethics

      They can learn laws. Some have been shown to be capable of passing legal exams (though IMHO that says more about the exams than the AIs). But they still don't understand them and therefore can't apply them. They match words and sentences but not concepts.

  10. DS999 Silver badge

    Shawn Fanning missed out on a winning argument?

    "It would be impossible to build music sharing networks that meet today's needs without using copyrighted songs"

    1. Wellyboot Silver badge

      Re: Shawn Fanning missed out on a winning argument?

      It's the 'Original Creator* gets paid' bit that matters with music sharing, sharing being the operative word.

      Plagiarism (aka 'in the style of' / 'inspired by') is the problem here; as far as the AI companies care, the original creators are merely another input source.

      *Writer and/or Performer.

  11. Rob 63

    Inspiring

    Of course real artists exist in a vacuum and create works out of pure nothingness having never experienced anyone else’s work

    1. Wellyboot Silver badge

      Re: Inspiring

      Take any series of books from literature: what would the author do if, only days after the first is published, there are hundreds of new works 'inspired by' that book, creating full back and forward stories for every character mentioned?

      Does anyone expect the author to carry on with no hope of earning a living?

      1. Brewster's Angle Grinder Silver badge

        Re: Inspiring

        I think we call that fan fiction. Okay, it takes more than a few days to appear. But, given sufficient time, the position stands.

        We already have a culture that values authenticity. NFTs took this to an absurd level. But what makes the Mona Lisa valuable is not the image itself, but that it was painted by Da Vinci. (Witness the recent shenanigans over Salvator Mundi. Its value is tied to it being authentic rather than being one of the many copies of the lost original.)

        So do you want to read the completion of Game of Thrones by an AI, or wait for George R.R. Martin - even if (as a time traveller) I tell you the AI one is better than what will finally emerge?

        1. Wellyboot Silver badge

          Re: Inspiring

          Yes, the time element matters a lot. Fan fiction tends to appear after the original source dries up, and that's fine. An AI completing the unfinished Huckleberry Finn books isn't going to harm Twain and may even be quite good.

          I do stick to the point that AI can wipe out an upcoming author the second their first book starts becoming popular.

        2. jmch Silver badge
          Trollface

          Re: Inspiring

          "do you want to read the completion of Game of Thrones by an AI, or wait for George R.R. Martin"

          That's assuming he ever DOES finish!!

    2. Ken Hagan Gold badge

      Re: Inspiring

      Real artists paid to experience all that earlier work and then created something that wasn't just a copy.

    3. DS999 Silver badge

      Re: Inspiring

      If you're trying to argue ChatGPT and its ilk can transform works like an artist can you are arguing it in the wrong forum. We know far better than the average person how limited today's "AI" is, all the recent hype (and past hype cycles for it that have come and gone) notwithstanding.

  12. Piro

    I'm rarely in favour of heavy handed copyright law

    But in this case I am.

    They're profiting from works without paying royalties.

    Kill it off.

  13. Anonymous Coward
    Anonymous Coward

    I just read that as: AI devs say it's not ready for real life,

    it only escaped the lab...

    Wait, what? Where did I hear that before..?

  14. heyrick Silver badge

    If they want quality content...

    ...do what everybody else has to do - licence it (or pay somebody to make it just for them).

  15. tiggity Silver badge

    this bit of the article nailed it

    "Rough Translation: We won’t get fabulously rich if you don’t let us steal, so please don’t make stealing a crime!" he wrote in a social media post. "Don’t make us pay licensing fees, either! Sure Netflix might pay billions a year in licensing fees, but we shouldn’t have to! More money for us, moar!"

    As we all know, it would be a solution to pay to licence copyright material if that is needed for "up to date" training data (indeed NYT tried for a while to do a licensing deal before exasperatedly launching their legal action) but that would cost OpenAI money / profits.

  16. /\/\j17

    Doesn't say much for the law department at Santa Clara University in California

    "It would be fundamentally unfair for a copyright owner to encourage wide dissemination of still images for publicity purposes, and then complain that those images are being imitated by an AI because the training data included multiple copies of those same images." - Tyler Ochoa, a professor in the law department at Santa Clara University in California.

    Umm, the professor does realise that his own statement that copyright owners 'disseminate still images for publicity purposes' explicitly states that they aren't making them "public" or "copyright free", but are granting a limited licence for people to use their copyrighted image for the purpose of publicity, doesn't he? And "training a commercial, monetised AI" would NOT count as "publicity" - nor fit any currently accepted definition of "fair use".

  17. Big_Boomer

    You PAY!!!

    When you buy a book, you pay. When you borrow a book from a library, you pay (via your taxes that fund the library so it can buy books). The problem with these "AI" companies is that they all seem to be made up of Freetards who have never paid for anything on the Internet, and still believe that they shouldn't have to. If there was no Internet/WWW that allowed them to steal what they wanted to train their "AI" how would they train it? Yes, they would have to PAY for the materials. This is just Freetard mentality being extended to corporate level. On that basis I should be allowed to use their "AI" products for free forever. Oh wait, how can they make any profit if nobody is paying them? TANSTAAFL!!!!

  18. Mike 137 Silver badge

    "[The AIs] do not provide any information about the provenance of the images they produce"

    It's (probably) impossible for them to do so, as they don't store any such metadata. Given that the internal record is just a massive cluster of fragments with attached weightings, there is no concept of sources (or indeed of entire images) present. From these fragments, an image is assembled on a probabilistic basis. It's not therefore surprising if a replica of an original image used for training emerges from the machine, as that image is the most probable given the specific set of fragments and weightings that were derived from it.
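    That provenance point can be made concrete with a deliberately crude toy sketch (an analogy only - real models don't store fragment dictionaries like this, and every name below is invented): once fragments from many works are pooled with nothing but weights attached, the question "which work did this come from?" has no answer the system could even represent.

    ```python
    import random

    # Toy analogy only: three "source works", each reduced to fragments.
    sources = {
        "work_A": ["sunset", "ocean", "cliff"],
        "work_B": ["sunset", "mountain", "pine"],
        "work_C": ["ocean", "boat", "gull"],
    }

    # "Training" flattens everything into one weighted pool; the source
    # labels are deliberately not carried across this boundary.
    pool = {}
    for fragments in sources.values():
        for frag in fragments:
            pool[frag] = pool.get(frag, 0) + 1  # weight = frequency, nothing else

    def generate(n=4, seed=42):
        """Assemble an 'output' by probabilistic sampling from the pool."""
        rng = random.Random(seed)
        frags, weights = zip(*pool.items())
        return rng.choices(frags, weights=weights, k=n)

    # The output can reproduce fragments unique to one work ("boat", "gull"),
    # yet nothing in `pool` records which work they came from.
    print(generate())
    ```

    The provenance question can't be answered after the flattening step because the fragment-to-work mapping was discarded, not hidden.
    
    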

  19. mark l 2 Silver badge

    If TPB and other torrent sites can be blocked for piracy and copyright infringement for merely linking to where to download copyrighted materials, then surely slurping up vast quantities and spitting it out from your LLM is clearly an infringement of the original artist's copyright if you did not pay to license the content?

    And OpenAI have admitted that a public domain only trained AI would be sub par, so they therefore should have paid to license the copyright content they are training it on if they wish to make it a commercial product.

  20. amanfromMars 1 Silver badge

    Hmmm? Something to ponder on, and wonder at too, when practically true

    OpenAI: 'Impossible to train today’s leading AI models without using copyrighted materials'

    Are not humans similarly impossibly trained?

    Is that a going concern which is not being addressed for it is an impossible task easily conceived but never ever achievable .... ergo a grand folly of a fools' errand to be intelligently ignored and summarily dismissed as not fit for future Greater IntelAIgent Games purpose?

    AI would certainly suggest it be agreed so ..... leading as disagreement would, to non-supporters of the premise doing vain battle against themselves and virtual ghosts leading them nowhere but into the stagnant and petrified ponds of serial despair and certain ruin.

  21. Zippy´s Sausage Factory

    "OpenAI has said it would be "impossible" to build top-tier neural networks that meet today's needs without using people's copyrighted work."

    So have they just implied that if they lose a few copyright lawsuits, they're going to have to shut their doors? Because that's the way it seems to me.

  22. Neil Barnes Silver badge

    Users may not know <... > whether they are infringing."

    Given the recent publicity, it would be hard to argue that a user was unaware of the possibility of infringement.

    But then, who expects statistics to produce art?

  23. jmch Silver badge
    Facepalm

    Options....

    "OpenAI has said it would be "impossible" to build top-tier neural networks that meet today's needs without using people's copyrighted work."

    OK, so either *don't* build "top-tier neural networks that meet today's needs", whatever the eff that is

    OR

    Pay people for their copyrighted work to be incorporated in the training set

    Simply hoovering up all the data and leaving the lawyers to argue about it for the next 2 decades is just plain wrong

    1. amanfromMars 1 Silver badge

      Re: Options.... for Global Operating Devices and Virtual Daemons alike

      Simply hoovering up all the data and leave the lawyers to argue about it for the next 2 decades is just plain wrong...... jmch

      Oh? How so? It is simply the normal day to day way of doing business. Do you really think that is going to go away any time soon whenever there are huge fortunes out there, and ripe ready just for the taking ...... whenever one knows what one is doing and what needs to be done?

      Dream on, bubba, .... for that is here to stay for a while ....... and you surely know that to be the truth, the whole truth, and nothing but the truth, so help yourself. Who’s/What’s gonna stop you other than a deficit in oneself?

  24. Doctor Syntax Silver badge

    OpenAI has said it would be "impossible" to build top-tier neural networks that meet today's needs without using people's copyrighted work.

    So that makes it OK then?

    Fine. Let's try another argument along the same lines. Sauce for gander, etc.

    It would be impossible to operate a personal computer without using Microsoft's copyrighted work*

    so it should be OK to just use that freely in the same way. How does that sit with OpenAI's backers?

    * Admittedly I know this premise to be false but Microsoft are unlikely to agree and we're looking at it from their PoV.

  25. ChrisElvidge Bronze badge

    Quite apart from copyright hassles

    Simply hoovering (TM) up all (or most of) the information on the Internet will surely pull in a lot of stuff that is simply wrong (see stackexchange/stackoverflow) or outright lies (e.g. Truth Social, Facebook). How does this information get filtered out of the training data? Or doesn't it?

    1. Ken Hagan Gold badge

      Re: Quite apart from copyright hassles

      It doesn't, but if you are fortunate then the wrong answers are all different and don't reinforce each other statistically, whereas the right answers are broadly similar and so they do reinforce each other.
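      A quick toy simulation illustrates the point (the numbers and the sample question are invented, not a claim about any real training corpus): consistent right answers reinforce each other, while each wrong answer is its own one-off.

      ```python
      import random
      from collections import Counter

      def poll_sources(n_sources=1000, p_correct=0.4, seed=1):
          """Each 'source' either repeats the one right answer or invents its
          own wrong one. Right answers pile up; wrong answers scatter."""
          rng = random.Random(seed)
          answers = []
          for _ in range(n_sources):
              if rng.random() < p_correct:
                  answers.append("Paris")  # the single consistent right answer
              else:
                  answers.append(f"wrong_{rng.randrange(10_000)}")  # one of many
          return Counter(answers)

      tally = poll_sources()
      winner, count = tally.most_common(1)[0]
      # Only ~40% of sources are correct, yet the errors don't reinforce
      # each other, so the right answer still dominates the statistics.
      print(winner, count)
      ```

      Of course, this only works when the wrong answers really are scattered; a popular, widely repeated misconception reinforces itself just as well as the truth does.
      
      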

  26. Anonymous Coward
    Anonymous Coward

    Just because you can't murder someone without killing them ...

    ... doesn't make murder lawful.

  27. Cliffwilliams44 Silver badge

    The Professor should go back to school

    Mr. Ochoa is grossly incorrect. As the article correctly states, copyright infringement is directly tied to profiting from the materials.

    If I take a publicity image from a popular movie, print it out in my printer, frame it and hang it on my wall, I am not violating the copyright of the image's owner.

    If I make multiple copies, frame those copies and sell them at a flea market at $20.00 each, then I am absolutely in violation of copyright laws.

    OpenAI is profiting from subscriptions that have the potential to produce copyrighted material. Considering some very high-profile academics are embroiled in this same type of scandal, OpenAI should find a way to properly cite the copyright owners of this material and include appropriate warnings for their users.

    1. Tom66

      Re: The Professor should go back to school

      At least in the UK, you can violate copyright laws even if you do not directly profit from those actions, for instance pirating movies. It is correct that the penalty for violations is likely to be more serious if you have commercial intent, but prosecution is possible in any case.

    2. Paul Hovnanian Silver badge

      Re: The Professor should go back to school

      "As the article correctly states, copyright infringement is directly tied to profiting from the materials."

      Do I not profit from some materials by saving the money I would have otherwise spent obtaining them through retail channels? Even if I don't resell them?

  28. Paul Smith

    Straw man argument

    The training of the AI is a straw-man argument. Using material to train an AI model is not a problem and is *not* a misuse of copyright, assuming they have legitimate and legal access to the material in the first place. It is the use of copyrighted material in the output, without permission, payment, or accreditation, that is the problem. If OpenAI can't build a business model that respects other people's property and rights, then they don't have a business.

  29. Sparkus

    And there you have it.

    Current content owners will be leading the AI deployment charge.

    And expect the owners of that content to be very active in the courts, protecting their own IP.

    I for one hope they win every single infringement case.

  30. Long John Silver
    Pirate

    Angels dancing on the head of a pin?

    The Statute of Anne (1710) enabled individuals to assert monopoly 'rights' over the distribution of their texts. Prior to that, various monopolies for making and selling things were individually issued by monarchs or enshrined in guilds.

    Lawyers revel in developing the language of 'rights'. A 'right' nowadays appears to be an 'entitlement' enshrined in law, but somehow conferred by an entity/abstraction external to law. An entity could be entirely abstract, e.g. a deity, or corporeal, e.g. a king. So-called 'human rights' derive from a mixture of the god-given and wishful thinking. Rights associated with possession of physical property have differently been elaborated according to whatever power structure was in place.

    Problems accumulating to this day have arisen from failure at the outset to recognise the irreducible qualitative differences between physical property and any dreamt other kind. What's called “intellectual property” (IP) inherently is intangible.

    IP can be instantiated on physical media (e.g. paper, vinyl, and photographic emulsion). Each such incarnation exists in a single specific, but changeable, location in time and space. There can be multiple physical instances of the medium, each with supposed IP inscribed. Physical media can be bought, sold, lent, or stolen, just as may other physical artefacts. 'Transactions' involving physical media do not diminish the hypothetical store of inscribable 'content'. In so far as a transaction involving media is concerned, the medium changes ownership (or possession when lent or stolen) but that which is inscribed is owned by nobody or everybody. The price paid when a medium with 'content' is bought covers the cost of the medium, the cost of inscription upon the medium, and the cost of distribution; the transaction is one of convenience for a buyer seeking ready access to the 'content'; it represents add-value to 'content' itself lacking any monetary worth whatsoever, this regardless of expense involved when concocting the 'content'.

    When a medium and the message it contains are indivisible, it's not unnatural, yet intellectually lazy, to conceive of them as one physical entity. In the early days of the printing press, copyright was the exclusive entitlement to distribute printed copies of text and lithographs. Transactions, being in the physical world, could be policed for compliance with the law.

    With passage of time, the world of copyright became ever more complicated, inclusion of layout for text and tables, indexing, and typefaces, as IP are examples. Argument and legal judgements ensued over quotation and “fair use”. The introduction of photographs, recorded music, and cinema complicated matters considerably.

    Only upon introduction of readily available digital computation, storage media, and transmission, did the penny fully drop for people not already immersed in the profitable world of ever more silly applications of IP 'rights'. Digital data cannot be 'owned' regardless of such pretence in law. For the past forty years, this realisation has dawned upon young people up to the middle-aged. Introducing supposed AI makes obvious the disconnect between enforceable copyright and reality. Of course, the specious nature of IP should have been recognisable when the Statute of Anne was formulated, but it was not so glaringly obvious as now.

    There is a concatenation of circumstances leading to the demise of copyright and its associated rentier economics. The first is growing unhappiness about the productively futile nature of 'financialised' market-capitalism, which is consequential upon Neo-liberal economics. When sweeping that away, clearer understanding will return about malaise brought about by unchecked monopolies; these evident in transnational corporations, conglomerates, and nonsensical copyright taken to such an extreme that its whole conceptual foundation has collapsed. Ideas and their applications shall be incorporated within orthodox pre-neoliberal market-economics, as they should have been from the beginning. Ideas cannot be traded. Creative aptitudes can be placed in open competition in markets. Acquired reputation is the selling point. Attribution to sources of ideas borrowed or derived from shall be the principal entitlement on offer to creative enterprise.

  31. Anonymous Coward
    Anonymous Coward

    OpenAI has said it would be "impossible"

    yeah, you can't break an omelette without making a few eggs (don't blame me, I'm only quoting Copilot, what do I know!). I'm sure the 'disruptive model' bros behind the Ubers of the world have heard this argument too. No, wait, there was prior art! - something about stealing the first million (nowadays: billion). Funny how this argument never comes up in court, though. Perhaps because once past the first million / billion you then settle out of court, if it (ever) comes to that minor hindrance in running a successful business operation?

  32. vistisen

    I cannot see why AI shouldn't use copyrighted information... as long as it pays royalties. If I ask an AI to 'paint' me a picture and it uses a complex algorithm to find the best match from the millions of paintings that it has stored in its database, then it can also use the same algorithm to distribute the royalties to the used sources, with those that made the biggest contribution getting the highest proportion. The same principle could be used for questions that use information to provide the 'correct' answer to factual questions. Again, the used sources could receive royalties. The very mechanism that 'trains' the models - attaching value to how much the individual nuggets of information are worth in searches - means attaching financial worth is already possible. It is just another tag.
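    The bookkeeping half of that idea is the easy part; the hard part is getting honest per-output contribution weights out of a model in the first place. A minimal sketch of the payout step, assuming such weights existed (all names and figures below are invented for illustration):

    ```python
    def distribute_royalties(fee, contributions):
        """Split a generation fee pro rata by contribution weight.
        `contributions` maps source-work IDs to the (hypothetical) weights
        the model assigned them when producing one output."""
        total = sum(contributions.values())
        return {work: fee * weight / total for work, weight in contributions.items()}

    # Invented weights for one generated picture:
    payout = distribute_royalties(
        fee=1.00,
        contributions={"painting_A": 0.5, "painting_B": 0.3, "painting_C": 0.2},
    )
    print(payout)  # the biggest contributor gets the biggest share
    ```

    Whether any current model can actually produce such attribution weights, rather than an opaque blend, is exactly the open question the surrounding thread is arguing about.
    
    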

  33. Tron Silver badge

    You have a choice.

    Either you have AI trained on copyright material, or you don't have usable AI. Pick one.

    The best they could do is run phrases through Google and amend them using a thesaurus if they recur too many times, before spitting them out. Which is what students have been doing in their essays for decades. 'Frankly my dear, I don't give a toss' etc.

    The legal option is to list your sources (as students should do) and pay a licensing fee to them - eg. Wikipedia, Harvard UP, Oxford for theses, or 4chan.

    If you don't make a profit, and label your AI 'experimental', you should be able to just go ahead. But consequent use should also be non-commercial.

    It's like fan fiction. You don't make a profit.

    I do like the idea of training an AI on early fiction (including 'Fanny Hill'), as well as an early edition of the 'Britannica'. Much more decorum than all this modern stuff. It might help one solve the most intractable of problems, such as finding and retaining high quality servants.

    Memo for GAFA and Washington's enthusiastic ban hammer: China will just go ahead and do it, and keep getting better at it, whilst you ponder all this.

    1. This post has been deleted by its author

    2. This post has been deleted by its author

  34. Paul Hovnanian Silver badge

    University Degree

    "Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens."

    IIRC, I was required to purchase (for a non-trivial amount of money) my college textbooks when I was trained. The authoring, publishing, and sale of these represent a profitable business sector in this country (USA). And one that is jealously guarded.

  35. Matthew 25

    Shirly

    Three things

    1 It is COPYright

    Applies when work is copied. Storing in a digital format is creating a copy. This doesn't usually happen when a person reads, listens to, or looks at a thing, but it does happen when training an LLM. Plus some statistics.

    2 LLMs are not general AI. They cannot think.

    If we rely on LLMs for our words, pictures, music etc we will never have anything new. Only rearrangements of things that already exist. That is how they work.

    3 In that sense, what we are told about LLMs is a con. They are a cul-de-sac for the human race because we are told we are getting something new from them when all we can possibly get is recycled. They cause stagnation in our thought process.

    1. CapeCarl

      Re: Shirly

      This theme would make for an interesting Star Trek: Next Gen episode... An initial contact with a planet where all art, writing, software, and engineering output has been endlessly recycled via LLMs for a few centuries.

      Everyone involved in producing any new content of any kind has long died off...CompSci curriculums consist almost entirely of tuning ChatGPT queries (a highly valued skill in said society).

      Hence no forward motion for the society of Dullmonia, and they never discover Warp drive.

  36. Sceptic Tank Silver badge
    Go

    What's new?

    Some notable tech companies (particularly one that supplied computer languages and operating systems in the past) are wholly built on exploiting other people's ideas.

  37. Anonymous Coward
    Anonymous Coward

    No excuse

    LLMs get no special advantage from using Michael Crichton books for training. In truth, current LLMs cannot even "see" the essence of what makes Michael Crichton a best seller. LLMs can see up to the level of good prose and comfortable small talk, but not beyond. You can see this limitation clearly by reading what ChatGPT outputs. It comes no closer to being able to write a best seller than an average liberal arts college student trying to write a story in the style of Michael Crichton - which is to say, zero.

    OpenAI could have paid negligible sums to secure the use of recent works by modern writers whose prose is excellent. They could have paid negligible sums to secure the use of tens of thousands of forgotten novels more than 10 years old. There would have been no drop in ChatGPT quality from what exists now.

    For OpenAI et al. to help themselves to any artist's work without paying is pure ugly hubris, and more expensive in legal costs than it would have been to pay honestly for good enough training data in the first place. "Good enough", because that would surely be giving ChatGPT the benefit of the doubt as a description of its own output.
