If you're going to train AI on our books, at least pay us, authors tell Big Tech

More than 8,000 writers have signed an open letter penned by the US Authors Guild urging leaders from six top AI companies to obtain consent and compensate them for training models on their copyrighted work. Large language models are trained on large amounts of text scraped from the internet. Hundreds of thousands of books …

  1. Anonymous Coward

    There's another risk: repurposing copyrighted data

    Given that these models are trained on data and texts that are under copyright (and patents, and other legal protections), any output will have to be checked for required attribution.

    But I think it may get worse.

    Once that output is used by other parties, that content again enters under the various legal umbrellas that exist - and the only winners in the fight to then untangle the mess are lawyers.

    Not good.

    Anyone publishing information should really start to check the legal protections they have - if you check Google's conditions, or Facebook's, you will find that you signed away all your rights. They've been working on this for a long time - you're not just milked for personal details, but also for free content. Now them chickens are coming home to roost.

    1. Roland6 Silver badge

      Re: There's another risk: repurposing copyrighted data

      Things will get very interesting when someone uses an LLM to create the lyrics to a song; the record companies will ensure this goes to court…

      1. Yet Another Anonymous coward Silver badge

        Re: There's another risk: repurposing copyrighted data

        >Things will get very interesting when someone uses an LLM to create the lyrics to a song; the record companies will ensure this goes to court…

        Or you can use this vast corpus to find out which song lyrics have occurred in previously published books

        I'm guessing most pop songs don't consist entirely of brand new sentences

        Some of the larger record companies are going to be a little reluctant to go to court claiming a fragment of a song re-appeared in an LLM, especially after they spent 70 years claiming that their artists precisely copying a riff from some unknown artist wasn't an issue
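The fragment-matching idea above can be sketched in a few lines: a toy word n-gram overlap check against a corpus of earlier texts. This is purely illustrative - the corpus, lyric, and function names below are all invented, and nothing here resembles any real record-company or AI-vendor tooling.

```python
# Toy sketch: find which fragments of a lyric already appear in a
# corpus of previously published texts, using word n-grams.

def ngrams(text, n=4):
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_fragments(lyric, corpus, n=4):
    """Map each corpus title to the lyric n-grams it already contains."""
    lyric_grams = ngrams(lyric, n)
    hits = {}
    for title, text in corpus.items():
        common = lyric_grams & ngrams(text, n)
        if common:
            hits[title] = common
    return hits

# Invented mini-corpus and lyric, for illustration only.
corpus = {
    "old novel": "she walked in beauty like the night of cloudless climes",
    "older poem": "the night is dark and full of sorrow",
}
lyric = "you walked in beauty like the night and stole my heart"
print(shared_fragments(lyric, corpus))
```

With a real corpus of hundreds of thousands of books, the same idea scales with an inverted index rather than a linear scan, but the principle - most pop lyrics are not made of brand-new word sequences - is the same.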

    2. Dimmer Silver badge
      Joke

      Re: There's another risk: repurposing copyrighted data

      It’s an AI,

      Just ask it if it was trained on copyrighted material.

      It’s an AI,

      Will it tell you the truth or make up crap?

      1. Sam not the Viking Silver badge
        Alert

        Re: There's another risk: repurposing copyrighted data

        Despite the icon, I don't think that is a joke. There's many a true word said in jest…

        When a statement quotes 'facts' or 'sources', it is important that those references are listed so you can make your own verification and interpretation.

        An article that provides no supporting background is 'opinion' and should be labelled as such.

        We already know AI can make up lies.

        https://www.theregister.com/2023/03/02/chatgpt_considered_harmful/

      2. jmch Silver badge
        Trollface

        Re: There's another risk: repurposing copyrighted data

        "Will it tell you the truth or make up crap?"

        It's an AI - it doesn't know the difference between truth and made-up crap!!

        1. Anonymous Coward

          Re: There's another risk: repurposing copyrighted data

          It's one big *glob* of anything parseable, with a few humans typing in complex regexes from now to infinity. The smartest AI is the one trained by the person with the fewest typos.

    3. Michael Strorm Silver badge

      Warning: Contains mixed metaphors

      > if you check Google's conditions, or Facebook's, you will find that you signed away all your rights [to] free content

      This assumes that *you* were the one who uploaded *your* content to Google or Facebook in the first place.

      But, of course, much - if not most - content will have been posted or reposted to those services by random third parties, the majority of whom likely neither owned the content nor had permission to post it - making any (implicit) agreement to grant permissions they never held irrelevant. (*)

      Unless they have some other, watertight way of determining which permissions they *actually* have- i.e. those that are worth the paper they're written on or the digital equivalent- the whole thing is a can of worms, a house of cards built on a loose approach to copyright and permissions that has now come home to roost.

      (*) Yeah, they could theoretically sue your aunt who uploaded that funny cat picture for implicitly agreeing to let them redistribute a tenth-generation copy of a stock photo without permission, but they won't, and it wouldn't solve the problem that they didn't have that permission regardless.

  2. Anonymous Coward

    "It is only fair that you compensate us for using our writings, without which AI would be banal and extremely limited."

    So all of the authors are top rank, none of the 90% who, following Sturgeon's Law, would only make the AI even more banal?

    (Yes, I saw there ARE well-known names, especially near the start of the list)

    1. doublelayer Silver badge

      If they don't think the authors' work has any benefit, they are free to leave it out of the training data. That they have not suggests they think there is value in having that text in there, and they are using that value to make money. It's not on us to decide how much value they are getting from any given book, but on them to decide whether they are willing to pay for the use of copyrighted data they don't own. They can decide to exclude something because it is not available for sale, because it isn't worth as much as is being asked, or because they think it will be detrimental.

      1. that one in the corner Silver badge

        Never mind the quality, feel the width

        > If they don't think the authors' work has any benefit, they are free to leave it out of the training data

        True - except there doesn't seem to be any indication that they are spending all of the money/effort/time required to evaluate the content of their entire training set: after all, doing so would only reduce the amount of bulk material used and part of the boasting is how much text was used[1]!

        Although, having said that, you can also feed the training with a good dose of negatives: "please don't write stuff like this". So even after categorising every bit of their dataset they'll still use it all, just maybe not in the way that the authors would like (they aren't happy now, but if they find that they are on the Naughty Training List they're likely to get really upset!)

        [1] some say 5GB for GPT-1, 40GB for GPT-2, 600GB for GPT-3

        1. doublelayer Silver badge

          Re: Never mind the quality, feel the width

          That's their problem, and they're certainly welcome to pay someone for bad writing to caution against. However, I've seen enough bad writing that's freely available online that I figure they could probably find enough for free to add to the caution pile.

  3. that one in the corner Silver badge

    They only want to whip the LLaMa's ass

    > The social media giant often releases its models for academic research, and was criticized for not being as open as it claimed by preventing developers using LLaMA for commercial applications.

    Huh? Open generally just starts with "you can read it", verify it and so on - *some* licences (including the most famous ones, at least to the audience here) then go on to say that you can also *use* it, with or without other restrictions. Such as, not for commercial use.

    This isn't complaining about lack of openness, it is just complaining about not being allowed to exploit.

    Which was probably a Good Thing - these LLMs are already being shoved into use in places where they are not fit for purpose[1], no need to encourage more of that.

    Fingers crossed, the newer LLaMA, if it is released for commercial use, will at least have benefitted from being poked by academics and will, compared to its predecessors, have less chance of making a total mockery out of everyone deploying it.[2]

    [1] and, yes, you can indeed argue that applies to every use made of them so far; I'll agree for the well-hyped uses by the Big Players, certainly.

    [2] or Meta have realised they'll never get any better, so soak everyone in one fell swoop before the word gets out; place your bets now.

  4. Anonymous Coward

    OpenAI has announced partnerships with ... Shutterstock

    Well, Shutterstock appears to have a better record than Getty Images when it comes to "acquiring" images that don't belong to it. Not squeaky clean, but less aggressive about it.

    The value to OpenAI will be that they can just shrug off any complaints and point the plaintiff at Shutterstock.

    Bet OpenAI wish they had done the same trick with their text training data. The deal with AP will help but who will trust them when they claim they have started from scratch again with a clean data collection?

  5. amanfromMars 1 Silver badge

    An Alien Intervention successfully proving itself surprisingly difficult to believe is not a WMD

    Is it an inherent human learning difficulty .... an endemic weakness and convenient vulnerability for serial exploitation and continual development ..... that has the species not accepting the stealthy grooming of Secret Almighty IntelAIgent Reset Services by Remote Virtual Machinery for an Earthly Takeover and Universal SCADA Systems Administrations Makeover?

    And that question for peer review and worldwide consideration is asked of and unerringly directed to whomever and/or whatever it may be of grave concern and growing insurmountable worry and unbelievable consternation.

  6. packrat

    really?

    anti intellectual learning swarm, assemble.

    silliest thing I've ever heard. Might drive a high quality AI tho.

    now about the method of training nodes...

    pat

  7. Eclectic Man Silver badge

    According to (Sir) Nick Clegg

    who was interviewed on BBC Radio 4's 'Today' programme this morning, all the Meta LLM does is 'guess the next word' in a sentence. It has no 'intelligence' of its own. I do not recall him being asked about paying authors whose works have been used to train it.
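Clegg's "guess the next word" description can be illustrated with a deliberately tiny stand-in: a toy bigram model that just returns the word which most often followed the previous word in its training text. This is a vastly simplified sketch with a made-up training sentence - real LLMs predict over subword tokens with billions of parameters, not word-pair counts.

```python
# Minimal "guess the next word" sketch: a toy bigram model.
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it in the text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    return followers.most_common(1)[0][0] if followers else None

# Invented training text, for illustration only.
model = train_bigrams("the cat sat on the mat and the cat slept")
print(next_word(model, "the"))  # "cat" follows "the" most often here
```

The point of the comparison stands either way: the model has no notion of whether its continuation is true, only of what is statistically likely to come next.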

  8. TheMaskedMan Silver badge

    "More than 8,000 writers have signed an open letter penned by the US Authors Guild urging leaders from six top AI companies to obtain consent and compensate them for training models on their copyrighted work."

    Nothing wrong with that. As the creators of the material they are surely entitled to say whether it can be used or not, and are likely to want paying if it is.

    "Large language models are trained on large amounts of text scraped from the internet. Hundreds of thousands of books hosted on websites have been ingested without writers' permission. Now many of those writers are speaking out against having their work ripped off by computers."

    How did the books get on the websites, though? Are they uploaded by the author / publisher for people to read? If so, is there a fee?

    Or are they pirate copies uploaded by third parties without permission? If so, are the authors actively seeking to have the infringing copies removed? If not, why not?

    If they uploaded the material themselves, I don't have much sympathy with them - you can expect people to use material that you publish online - though I agree that they should be able to prevent the AI bods from continuing to use it if they wish.

    If it's pirate material, they should be looking to get the pirate copies removed in addition to cutting a deal with the AI bods.

    1. Falmari Silver badge

      @TheMaskedMan "Or are they pirate copies uploaded by third parties without permission? If so, are the authors actively seeking to have the infringing copies removed? If not, why not?"

      To try and get infringing copies removed they would have to know the sites those copies are on. That would require the AI companies to declare where the data came from.

      Have any of these AI companies ever stated exactly which items of data a model was trained on, and where each was obtained?
