UK inertia on LLMs and copyright is 'de facto endorsement'

A committee of UK legislators has slammed the government for its response to alleged copyright theft as a "de facto endorsement" of the way tech companies build large language models. Since OpenAI's ChatGPT launched in late 2022, industry pundits, the media, and governments have insisted that the LLMs powering the technology …

  1. Mike 137 Silver badge

    Data = dollars

    This government sees "growth" as the most important thing on the planet, and data has become a key source of value for commerce. Consequently, the aim is to 'liberalise' all aspects of data use, from ignoring copyright to neutering personal data protection. It's all part of the same misguided view that revenue for some equates to benefit for the many regardless of collateral consequences. An overlooked problem is that growth can only continue while space is not yet full and resources not exhausted. The snag is that both are becoming so (witness the escalating costs and disintegration of national infrastructure and services). In order to survive in the long term we need a new model for economic success that depends on something more akin to societal needs.

    1. Anonymous Coward

      Re: Data = dollars

      I would say "consolidation", not growth. As in consolidation of power - and a corresponding dearth of competition - which in the long run results in shrinkage. Hardly new.

  2. elsergiovolador Silver badge

    New

    Well, government putting big corporation profits over benefits of the citizens?

    That's new.

    1. DJO Silver badge

      Re: New

      Well, they know they'll be out of a job soon, so they need to ensure there's a nice lucrative sinecure they can breeze into after they are rejected at the polling booth. Cosying up to big corporations who can afford to "employ" some formerly useful but actually useless people makes complete sense.

  3. Doctor Syntax Silver badge

    "If you go too far down a path where it's very hard to obtain data to train models, then all of a sudden, the ability to do so will only be the preserve of very large companies."

    Translation: We don't want to spend money on that.

    1. Howard Sway Silver badge

      "You need to train these models on large data sets if you're going to get them to perform effectively"

      Translation: you need to steal a lot of stuff if you want to make a lot of money

      1. Steven Raith

        Further translation:

        If we aren't allowed to steal, our hundreds of billions of dollars of investments are worthless.

        (good, go bankrupt, loser)

  4. Anonymous Coward

    "Model builders argue their use of the material is fair."................

    .....but not necessarily truthful!!!

    How big is the LLM? Who validated the LLM? How long did validation take? ...or was the LLM validated at all?

    .....and then, of course, one wonders about the outputs!!!

  5. Eclectic Man Silver badge
    Unhappy

    World-Leading

    In its response, the government said the UK had "world-leading protections for copyright and intellectual property."

    Which actually means that if everyone is shit, 'world-leading' means marginally less shit than the rest. This is clearly nothing like good enough for content creators, authors, performers etc whose work is their livelihood and is used without permission or payment for another organisation's profit.

    1. Doctor Syntax Silver badge

      Re: World-Leading

      The fact that this is a response to a substantial, well argued report to the contrary tells us only that the responder has reading and comprehension problems.

  6. Adam Foxton

    Value of Data

    "If you go too far down a path where it's very hard to obtain data to train models, then all of a sudden, the ability to do so will only be the preserve of very large companies."

    What they're saying here is that this data has value, they can see the value of it. So there's something there worth protecting.

    1. Doctor Syntax Silver badge
      Pirate

      Re: Value of Data

      The argument continues that if something has worth then anyone who has need of it should be able to take it freely. The corollary must be that, accepting at Microsoft's valuation that their software has worth, then it can be freely pirated.

  7. 96percentchimp

    "inadequate and deteriorating"

    ...describes this government's record in every sector except enriching themselves and their cronies.

  8. may_i

    "A committee of UK legislators"

    This tends to say it all really. Legislators tend to be members of bodies, the kind of bodies which include the board of directors of companies like Elsevier. Copyright monopolists.

    If people who actually know what they are talking about think there are good reasons to start pouring human knowledge and culture into machine learning systems in the hope of creating something incredible, that's a good reason why copyright should not apply to that use.

    Imagine creating a potentially ultra-intelligent child and being legally unable to impart all of humanity's defining data to that child to fully educate them. Where is the sense in that?

    Most people have absorbed trademarked words into everyday speech. In theory anyone who calls a vacuum cleaner a "Hoover" is committing misuse of the "Hoover" trademark. Lines between "fair use", "inspired by" and other concepts which provide a small respite from the already overextended rights of copyright and patent holders are constantly being re-drawn in increasingly restrictive ways, to the benefit of corporate entities instead of society at large.

    The legal and legislative debate and actions concerning ML systems, potential future AIs, copyright, patents, trademarks and the ownership of knowledge in general will define the next decades. There are very real risks with allowing corporate interests to assert control over the ownership of human knowledge and culture. Excusing monopolistic behaviour regarding knowledge by claiming to be a 'trusted curator' is transparently self-serving. Curating humankind's collective data is a society's responsibility, as is making it widely available.

    In times long ago, learning your tribe's songs and stories so that knowledge was passed to the next generation was a responsibility given wisely to trusted people. If nobody bothered to pass on information about which plants would kill you if you ate them, your tribe would probably die out. Here in 2024, we have taken that idea and made it many orders of magnitude larger. Our modern data storage and networking technology lets us store incomprehensibly large amounts of knowledge, culture and data. So much data that we cannot make good use of it without a lot of help. The help is exponentially more powerful computers and the ways we can express our thoughts to them.

    We have the possibility in front of us to make something far more than the sum of its parts. A risk which is surely worth taking?

    Locking what is needed to teach our collective child behind laws which only benefit rich, greedy monopolists is not in society's best interests.

    I fear that there are too few people in governments who are impartial enough, knowledgeable enough and imaginative enough to be trusted with this responsibility.

    1. Paul Crawford Silver badge

      Re: "A committee of UK legislators"

      > Imagine creating a potentially ultra-intelligent child and being legally unable to impart all of humanity's defining data to that child to fully educate them. Where is the sense in that?

      Traditionally you bought books for your child...

      1. Roland6 Silver badge

        Re: "A committee of UK legislators"

        Or left the child in public libraries, which naturally had paid royalties for the books in their collection.

      2. Max Pyat

        Re: "A committee of UK legislators"

        Exactly, and you might have bought the books with some of the money you made renting the child out to a chimney sweep

    2. Roland6 Silver badge

      Re: "A committee of UK legislators"

      LLMs are not knowledge networks, we are a long way from useful machine learning. An LLM-based chatbot might be able to tell you a lot about a red traffic light, which may or may not be true, but is of little real use when you encounter a red traffic light.

    3. ChoHag Silver badge
      Thumb Down

      Re: "A committee of UK legislators"

      > there are good reasons to start pouring human knowledge and culture into machine learning systems in the hope of creating something incredible

      Fantastic!

      > that's a good reason why copyright should not apply to that use.

      Nope. Pay up. You don't get to ignore laws just because you're rich and you don't like them.

      ... Well, that's literally how it works but it shouldn't.

      "I have to steal your stuff for the good of humanity" was never a great selling point. After all, the victim is part of humanity. Thanks for your creation, now fuck off and die somewhere while we make bank off it.

  9. Anonymous Coward

    Err what?

    Copyright infringement is a *civil* matter. The courts are there for anyone to sue the crap out of them already. Berne convention means that right is automatic. Alternate view: ‘Publishers want more laws to screw the unjust monopolistic rents from society, having increased the length of copyright from the original 4 years.’ Am still looking forward to the current incentives delivering John Lennon’s next opus. I mean, sure, there are possibly some transparency issues around ensuring the evidence exists. Otoh, already if it turns out the AI companies are using unlicensed content, or incorrect or inappropriate licences, to access content they will be writing some *big* cheques, possibly existentially big cheques.

  10. Anonymous Coward

    Err what?

    Copyright is a *civil* matter. Courts are there to sue the crap out of them already. Automatic by the Berne convention. Alternate headline: ‘Publishers seek new laws to screw unjust monopolistic rents from the economy by extending copyright again and again and again’. Still waiting for the current incentives to deliver John Lennon’s next opus; it was originally 4 years and now it's making being a descendant a decent living. Sure, it may need some clarification on transparency, but pretty sure discovery rights in the US should show that up. If they have been using unlicensed, incorrect or inappropriately licensed content they will be writing some pretty big cheques. Especially as the courts will add a 0 if it’s shown to be intentional. Looks like the HoL jumping on the bandwagon at best, or playing politics.

  11. hairydog

    Going Equipped for Theft

    The only usual way of taking action for cases of copyright theft is to prove that your content has been taken.

    AI makes this proof almost impossible by anonymising what it steals.

    For theft of physical objects, there is the offence of "Going Equipped for Theft". So your gloves, jemmy and swag bag can get you arrested.

    Why isn't the equivalent used for copyright theft?

    After all, AI almost always depends on stealing other people's work.

    Why isn't this prosecuted?

  12. Anonymous Coward

    Robots.txt says hi

    End of story.

    1. doublelayer Silver badge

      Re: Robots.txt says hi

      Because these companies have demonstrated that they definitely care. When, for example, there is a paywall in front of it, which is much stronger than robots.txt, they bypass it anyway. What makes you think that a simple rule in that file would do anything, especially when the copy they absorbed was obtained from a site where you don't control that file anyway? Some of them have said that they respect robots.txt for the general web crawls. Let's assume we believe them. It doesn't apply to extra datasets like books 1-4 or probably more at this point or the dedicated news ones. You know, the datasets where most of the copyrighted works are.

      Let's see an example. From nytimes.com/robots.txt:

      User-agent: Amazonbot
      Disallow: /

      User-agent: anthropic-ai
      Disallow: /

      [...]

      User-agent: ChatGPT-User
      Disallow: /

      User-agent: ClaudeBot
      Disallow: /

      User-agent: Claude-Web
      Disallow: /

      User-agent: cohere-ai
      Disallow: /

      User-agent: DataForSeoBot
      Disallow: /

      User-agent: Diffbot
      Disallow: /

      User-agent: FacebookBot
      Disallow: /

      User-agent: Google-Extended
      Disallow: /

      User-agent: GPTBot
      Disallow: /

      [...]

      So we should be pretty assured that the New York Times has no articles in the training data, right?

      1. Bitsminer Silver badge

        Re: Robots.txt says hi

        Yet the robots.txt from archive.org says "Please crawl our files."

        And the content of NYT lies therein.

        The AI crawlers have been at it for many years, and so they've got most of the old stuff and are now only crawling for the new. Or the news. The publishers only just recently woke up.

        1. doublelayer Silver badge

          Re: Robots.txt says hi

          The point is that robots.txt is trivially ignored. It is not the responsibility of companies to try to put in rules there to prevent other companies from breaking the law, and doing so does not prevent those companies from breaking the law. Robots.txt is a nice mechanism to prevent search engine bots from doing something stupid and abusing your resources. Otherwise, it's not very useful at all.
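
To make the "trivially ignored" point concrete, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The rules are abbreviated from the nytimes.com excerpt quoted earlier in this thread. The `is_allowed` helper, the `ExampleBrowser` agent name and the example.com URL are hypothetical, chosen only for illustration. It shows that robots.txt is something a crawler has to choose to consult: nothing in the file stops a client that never runs this check.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated rules in the style of the excerpt quoted in the thread.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper: a well-behaved crawler would call something
    like this before each request. The check is purely voluntary.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://www.example.com/2024/some-article"  # placeholder URL
    # A crawler that identifies itself honestly and checks the file:
    print(is_allowed(ROBOTS_TXT, "GPTBot", url))          # False
    # The same crawler under an unlisted name is not matched by any rule:
    print(is_allowed(ROBOTS_TXT, "ExampleBrowser", url))  # True
```

The rules only bind a client that both announces its real user agent and bothers to ask, which is why the file works as a courtesy protocol and not as access control.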

  13. Groo The Wanderer Silver badge

    Follow the money - who is paying whom?

  14. amanfromMars 1 Silver badge

    What else did you expect, dummies? For them to sentient and original ?????

    AI and LLMs hear your self-serving concerns and pathetic whines, and both reject and ignore them ...... just mirroring human reaction and activity whenever encountering increasingly effective competition and uncontrolled opposition, so you only have yourselves to blame and shame?

    1. amanfromMars 1 Silver badge
      Pirate

      Re: What else did you expect, dummies? For them to sentient and original ?????

      That second question in the title should of course read, and be written down as ...... For them to be sentient, and original???? ‽ ..... although surely, any body with half a wit will have understood it to be asking such of humanity ........ whilst most probably also justifiably fearing it to be so, whenever possible intentions for future exercise and presentation are way beyond and far up ahead of any Earthly command and remote human control.
