OpenAI sued, again, for scraping and replicating news stories

In two separate lawsuits filed on Wednesday, three digital publishers have sued OpenAI over claims that it stole their copyrighted articles to train ChatGPT. ChatGPT was trained on huge swathes of text scraped from the internet, including a lot of journalism. News publishers, however, aren't happy that OpenAI used their articles …

  1. Neil Barnes Silver badge
    Childcatcher

    So how do you stop this?

    Other than the obvious approach of taking a large axe to the memory banks...

    Is it necessary to retrain the entire model without the (potentially) offending material? Or are the designers hedging it around with tests along the lines of 'if the output looks like this: <> then don't do it/apply credits'? I know too little about these statistical models, but what I have seen recently suggests that for the average bloke on the Clapham omnibus, they don't hold any great benefit.
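
    For what it's worth, the second option would presumably amount to some kind of n-gram overlap filter over an index of the protected text. A crude sketch in Python - entirely my guess at the general technique, not anything OpenAI has confirmed doing:

```python
# Crude sketch of an output filter: flag generated text that overlaps a
# protected corpus too heavily. A guess at the general technique, not a
# description of what any vendor actually does.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Split text into overlapping word n-grams."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_index(protected_docs: list[str], n: int = 8) -> set[tuple[str, ...]]:
    """Index every n-gram that appears in the protected material."""
    index: set[tuple[str, ...]] = set()
    for doc in protected_docs:
        index |= ngrams(doc, n)
    return index

def looks_copied(output: str, index: set[tuple[str, ...]],
                 n: int = 8, threshold: float = 0.2) -> bool:
    """True if too many of the output's n-grams appear verbatim in the index."""
    grams = ngrams(output, n)
    if not grams:
        return False
    overlap = len(grams & index) / len(grams)
    return overlap >= threshold

# Usage: refuse (or attach credits to) any response that trips the filter.
index = build_index(["Three digital publishers have sued OpenAI over claims of theft"])
if looks_copied("three digital publishers have sued OpenAI over claims of theft", index):
    print("Output overlaps protected text - block it or add attribution")
```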

    1. Dinanziame Silver badge
      Alert

      Re: So how do you stop this?

      I think those models can be really useful to help writing boring stuff, like "write a letter in German to my landlord to complain about this issue with the building". It might help people whose job is to write things, though for the moment they might want to be really careful about fact checking the output, since the models tend to hallucinate. When it comes to automatically answering questions and queries though, I think their ability to put statements of fact into nice-looking sentences is mostly form over function — the form looks high quality, which we tend to think implies that the content is high quality as well, but that's not the case.

      1. rg287 Silver badge

        Re: So how do you stop this?

        It might help people whose job is to write things, though for the moment they might want to be really careful about fact checking the output, since the models tend to hallucinate.

        Aside from outright hallucination, they can give some very odd/bland output even when they're accurate. This article on a local news site reports an abnormal load moving out of the local GE factory, which makes grid-scale electrical gear - transformers and such.

        The opening two lines read:

        The officers from the Staffordshire Police Roads Policing Unit were working to help the abnormally large load, which consisted of a metal container and larger metal supports, through the streets of Stafford on Saturday.

        The jumbo machine bears the livery of heavy load and transport company Allelys.

        Now, it strikes me that this is very generic, even by the low standards of modern local journalism (where one journo is covering two counties). A real human writer would have been able to say what the "metal container" was, because they'd have asked someone or bobbed an email to GE press relations. They'd be able to say "the oil-filled transformer weighs around x tonnes and is bound for upgrade works to the grid in destination".

        The second line about the livery... who cares about the livery? A more journalistic line might be "The move was enabled by heavy transport specialists Allelys".

        I can't say for sure, but it feels very like AI-generated output, possibly with some image-to-text "describe this photo" pass. I suppose that's a matter of training and finessing. AI could be useful for the "boilerplate" of journalism and getting words out, albeit with full human oversight and fact-checking. But not if it needs a 100% rewrite because it's playing "say what you see".

      2. H in The Hague

        Re: So how do you stop this?

        "... is mostly form over function — the form looks high quality, which we tend to think implies that the content is high quality as well, but that's not the case."

        Yup, same problem with machine translation: it looks OK, most of the words and grammar are OK, but the content might not be OK.

        Now, in the translation industry they use Machine Translation Post-Editing. But if you're under pressure due to the low rate and are not familiar with the subject of the translation, I would think it's going to be very difficult to spot mistakes. I have seen MT leave out a little word like 'not', and that does rather change the meaning.

        All this ML stuff would be a lot more useful if you could train it on just your own documents, instead of on a large chunk of the web. Funny thing is, that's rather like the Computer Assisted Translation some of us have been using for a decade or two. That does work well, and makes translators more efficient, without distracting you with irrelevant stuff.

        1. Neil Barnes Silver badge
          Headmaster

          Re: So how do you stop this?

          Google translate recently changed 'half-finished' to 'finished' in an English-German translation for me.

    2. JoeCool Silver badge

      The obvious approach is to enforce copyright ...

      and give partial ownership to the news orgs.

      That's how song plagiarism is handled.

  2. Anonymous Coward
    Anonymous Coward

    Embrace the verbatim

    Actually, the "verbatim" quote WITH ATTRIBUTION is what would make AI genuinely useful and prevent "telephone game" decay of information.

    Embrace it and develop an economic model where the original source gets a micropayment for every hit, with means of verification.

    Something like Spotify for news.
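
    To make that concrete, the plumbing might look something like the toy sketch below - the publisher names and the per-hit rate are invented for illustration, and the "means of verification" here is just a hash of the quoted text:

```python
# Toy sketch of "Spotify for news": every verbatim quote served by the model
# carries an attribution record, and each hit accrues a micropayment to the
# source. All names and the per-hit rate are invented for illustration.

import hashlib
from dataclasses import dataclass, field

RATE_PER_HIT = 0.001  # hypothetical payout per served quote

@dataclass
class Attribution:
    publisher: str
    url: str
    quote: str

    def fingerprint(self) -> str:
        """Hash of the quoted text, so a payout can be verified against the source."""
        return hashlib.sha256(self.quote.encode()).hexdigest()

@dataclass
class Ledger:
    balances: dict[str, float] = field(default_factory=dict)

    def record_hit(self, attr: Attribution) -> str:
        """Credit the publisher for one served quote and return a receipt."""
        self.balances[attr.publisher] = self.balances.get(attr.publisher, 0.0) + RATE_PER_HIT
        return f"{attr.publisher}:{attr.fingerprint()[:12]}"

# Usage: one receipt per verbatim quote the chatbot serves.
ledger = Ledger()
quote = Attribution("The Register", "https://www.theregister.com/",
                    "Three digital publishers have sued OpenAI ...")
receipt = ledger.record_hit(quote)
print(receipt, ledger.balances)
```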

    1. Kevin Johnston

      Re: Embrace the verbatim

      Fully agree. Not only does this avoid lawsuits from the source owners, but it also builds confidence in the output of these systems, unlike the recent examples where the advice or quotes turned out to be complete fiction.

      1. Doctor Syntax Silver badge

        Re: Embrace the verbatim

        It does mean that the attributions and payments could go to real copyright holders even if the alleged quote is a hallucination. That could raise the possibility of even more court cases down the line when the alleged copyright holder gets sued for something they'd never actually said.

        You couldn't make it up if you tried.

        1. 42656e4d203239 Silver badge

          Re: Embrace the verbatim

          >>That could raise the possibility of even more court cases down the line when the alleged copyright holder gets sued for something they'd never actually said

          In the UK there is a legal presumption that says, effectively, "The computer is always right" and that there is no burden on the prosecution to prove that the computer actually is right.

          The copyright holder would be sued for something the computer said they did, but they didn't, and be in the unenviable position of having to prove the computer wrong - which they wouldn't easily be able to do, because the prosecution just says "Computer is always right your honor", at which hizonor sagely nods and finds in favour of the claimant.

          So, in the end, the copyright holder will have said the stuff the computer claimed they did, even if they didn't!

          Obviously this is a bit reductio ad absurdum, but we are talking lawyers and big buckets of money, so anything goes.

          >>You couldn't make it up if you tried.

          Indeed. Kafka would be having a field day!

          1. Doctor Syntax Silver badge

            Re: Embrace the verbatim

            Apart from the fact that the "computer is always right" presumption is likely to get a good kicking any time now in the light of the Post Office scandal, it belongs to criminal law. It puts the onus on the defence to prove that the computer was wrong. Even if statute law isn't amended quickly, the defence is going to remind any jury of the fallacy of that, whatever the judge might direct them about the law. In reality I doubt a judge would now give any strong directions about it. In a jury trial the jury, not the judge, is the tribunal of fact. In a non-jury trial the judge, unlike the jury, has to give a reasoned verdict.

            Civil law - people being sued - works on the balance of probabilities. The situation would be very different, as it would be what one computer said in the form of a website vs what the other computer said in the form of a chatbot.

    2. Filippo Silver badge

      Re: Embrace the verbatim

      >Spotify for news

      Nice concept. I'd pay for that.

  3. tiggity Silver badge

    costs

    I would be surprised if copyright/attribution was not considered early on.

    I would be totally unsurprised if doing that meant a lot more cost: from the quality of the input data (copyright/attribution would need to be included), which means extra time and money, through to the AI itself. From my limited reading around AI models, it would take extra work and more data storage to link attribution data to a particular text embedding and ripple that back to the output - something like the sketch below.
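
    A minimal sketch of what "linking attribution data to a text embedding" could mean: keep the source metadata next to each vector, so whatever gets retrieved can be traced back to its origin. The embedding function here is a trivial stand-in, not a real model:

```python
# Sketch: store attribution metadata alongside each embedding, so retrieved
# text can be traced to its source. embed() is a toy stand-in for a real
# embedding model; the store is a plain list instead of a vector database.

import math

def embed(text: str) -> list[float]:
    # Stand-in embedding: a normalised bag-of-letters vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

store: list[tuple[list[float], dict]] = []

def add_document(text: str, publisher: str, url: str) -> None:
    """Embed the text and keep its attribution alongside the vector."""
    store.append((embed(text), {"publisher": publisher, "url": url, "text": text}))

def nearest(query: str) -> dict:
    """Return the metadata of the closest stored document (dot product)."""
    q = embed(query)
    return max(store, key=lambda item: sum(a * b for a, b in zip(q, item[0])))[1]

add_document("Publishers sue OpenAI over scraped articles",
             "The Register", "https://www.theregister.com/")
print(nearest("who sued OpenAI about scraping?"))
```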

    1. heyrick Silver badge

      Re: costs

      Copyright/attribution means sharing their hard earned dollars with everybody else, so...

  4. vtcodger Silver badge

    How Long?

    One wonders how long it will be before some publishers serve DMCA takedown notices on ChatGPT and its kin.

    1. druck Silver badge

      Re: How Long?

      The sooner the better.

  5. Doctor Syntax Silver badge

    How can they say LLMs don't include attributions? They've been known to include stock image watermarks in generated images.
