New York Times sues OpenAI, Microsoft over 'millions of articles' used to train ChatGPT

The New York Times has sued Microsoft and OpenAI, claiming the duo infringed the newspaper's copyright by using its articles without permission to build ChatGPT and similar models. It is the first major American media outfit to drag the tech pair to court over the use of stories in training data. As with similar suits – …

  1. Valeyard

    I hope it succeeds

    If only so that the LLMs don't try to insert some non-sequitur anti-UK opinion at every opportunity if it's trained on that rag

    1. Zibob Silver badge

      Re: I hope it succeeds

      I hope it succeeds too.

      If only to take in and combine all sources so that such biases from your, their, shitposts don't make it to an influencing position in such data.

      Yes, it's the case that "AI" can warp media, but that's just a consequence of it not having access to all of the media. It does not have to regurgitate such media verbatim, but that's all *we* allow it access to under the various rules we set.

      It *can* work but currently we are all in the "but what about me" mindset. Not to get all communist about it, but nothing happens if we keep all information secret.

      Not that we have to open source everything but there is some lubricity needed between the pay walls and open information.

      Otherwise there will just be pirate news.

      Best to figure out how to work with it rather than against it. A piece of any pie is worth more than 100% of nothing.

  2. Anonymous Coward

    If it's free on the Internet

    then it's free for all to do whatever they damn like with it.

    Restrict access to your site if you don't like it.

    The Times also permits search engines to access and index its content. This here is your problem.

    1. Anonymous Coward

      Re: If it's free on the Internet

      Restrict access to your site if you don't like it.

      I'd think that's what a paywall is for?

      You also can't argue that allowing search engines spells "free for all". At best you can say that they shouldn't rely so heavily on people's honesty; it's trivial to make your browser pass for a search engine.

      1. elsergiovolador Silver badge

        Re: If it's free on the Internet

        They should also block people from remembering the articles they read on these websites, because they effectively train their brain on them.

        1. Anonymous Coward

          Re: If it's free on the Internet

          Only comparable if you plan on selling that brain afterwards.

          I for one plan on keeping mine till the bitter end.

          1. elsergiovolador Silver badge

            Re: If it's free on the Internet

            I don't think it is as simple as that.

            ChatGPT is not a brain / model you buy, but access to it. It's very much the same when you train your brain on those articles and then run lectures and seminars based on the learned knowledge.

            You really would have to give people a memory loss causing pill after reading each article.

            And to be fair, the level of journalism is so low today - it's basically recycled ready made stories, that such a brain cleanse would be beneficial.

          2. Trigonoceps occipitalis

            Re: If it's free on the Internet

            I was searching the dark web transplant organ sites. I'm getting a bit forgetful and think I need a new brain. I found that El Reg commentard brains were notably expensive.

            I asked the supplier why, expecting a pitch based on how intelligent etc the previous owner had been. No, "Never been used mate."

        2. m4r35n357 Silver badge

          Re: If it's free on the Internet

          I am embarrassed on your behalf for that infantile attempt at contortion. What are you, eight?

          1. Zibob Silver badge

            Re: If it's free on the Internet

            Okay, just a thought experiment, literally.

            Did you read any news today? I'm assuming yes because of where we are. So what headlines did you read?

            1. heyrick Silver badge

              Re: If it's free on the Internet

              "So what headlines did you read?"

              I mostly skip down to the BoredPanda articles in the news app.

              I'm on my winter holiday and I've more than had my fill of doom, gloom, deaths, pain, suffering, and mayhem. We're fucked and there's nothing I can do about it, so I'll pass my free time watching dumb movies and looking at amusing photos. It's either that or contemplate the coming zombie apocalypse, right?

              1. wub
                Thumb Up

                Re: If it's free on the Internet

                Too right. I'm doing my best to keep away from all the doom and gloom. But it is really hard. Very best wishes on your quest!

              2. ThatOne Silver badge
                Devil

                Re: If it's free on the Internet

                > or contemplate the coming zombie apocalypse, right?

                Coming? It's already here, just look around.

        3. Ken Hagan Gold badge

          Re: If it's free on the Internet

          "because they effectively train their brain on them."

          Citation needed, but don't lose heart! I'm sure everyone will be very impressed when you link to that paper explaining how the human brain works.

    2. damienblackburn

      Re: If it's free on the Internet

      Yes...and no.

      Here in the US, it being on the internet means it's free for viewing. It does not, however, give unlimited rights to the viewer.

      If I write a post in a personal blog, the content remains under my copyright indefinitely unless I have a preexisting contract to sign those rights over to someone or something else. Social media often has you sign the rights over to them for anything you create on their platform, for example. Anyone who creates original content of any kind automatically has the copyright to it. They can download or copy it ad nauseam, but attribution generally has to be given to not give the impression that it's their original work if it's a reposting (which isn't a bulletproof defense either).

      The waters start getting murkier when you throw in things like transformative effects, parodies, educational use, and more. Profit, either direct or implied, isn't actually a significant factor in this. There is an argument being made that MLs are transforming the content into a new form, which is covered under Fair Use. And it's not a simple test, either. If the MLs are just copy+pasting it then it fails most prongs of the Fair Use test and there's a real case to be made for civil damages.

      1. elsergiovolador Silver badge

        Re: If it's free on the Internet

        So if someone reads your blog and tells someone about it and then those people ask that person questions about your blog, I understand this is forbidden?

        Because very much it is what this AI is doing, except at scale.

        1. damienblackburn

          Re: If it's free on the Internet

          Asking questions and making interpretation is fine. As is commentary.

          And most AI is not doing it. It's applying minimal transformations on it. Some metadata is being assigned to the information to aid in probability matching, but that's about it.

          1. elsergiovolador Silver badge

            Re: If it's free on the Internet

            I know a person with incredible memory and she could tell you the article she just read word for word.

            I guess she should be banned from reading anything.

            1. doublelayer Silver badge

              Re: If it's free on the Internet

              Well guess what. She would be forbidden from using that power to make copies of books by reading them once. It's not how you do it, but what you do. Copying stuff that you're not permitted to copy, not allowed. Copying substantial portions of what you're not allowed to copy, not allowed. The courts will need to decide if that's what the LLMs are doing, but they have done exactly that in the past, so they'll have to find a cool new argument for why they're technically doing something different. Your simple incorrect analogies aren't going to cut it.

              1. Persona Silver badge

                Re: If it's free on the Internet

                Copyright does not stop you from reading the text and rewriting it in your own words, provided the result is clearly more than simply shuffling a few words around.

          2. mpi

            Re: If it's free on the Internet

            > minimal transformations

            Please explain how ingesting half the internet and outputting a bunch of float32 numbers is "minimal transformation".

            1. Ken Hagan Gold badge

              Re: If it's free on the Internet

              I think the media hype around AI in 2023 would have been much less if ChatGPT's output had been a bunch of float32 numbers.

        2. Filippo Silver badge

          Re: If it's free on the Internet

          >Because very much it is what this AI is doing, except at scale.

          Yes, but the AI is not a legal entity. The corporation that trained the AI is. What the AI is doing with the data might look a bit like the situation you described, but what the corporation is doing (scraping and passing as input to a program) doesn't look like that at all. I don't think anyone in the corporation has read even 0.001% of the data they've downloaded.

          1. damienblackburn

            Re: If it's free on the Internet

            You're right in that the AI isn't a legal entity. However it is a product of the business, designed by a person there. You're trying to argue that if an automaker sells an unsafe vehicle they're not liable because the vehicle itself isn't a legal entity, which doesn't hold up in pretty much any jurisdiction outside of China.

            1. Filippo Silver badge

              Re: If it's free on the Internet

              Oh, no, I think the AI maker is very definitely liable - or, more accurately, that it has to be settled in court, that the answer is not at all obvious and might fall either way, and that "but it's just like learning" is not going to be a valid defense. I'm sorry if that wasn't clear.

    3. Filippo Silver badge

      Re: If it's free on the Internet

      > If it's free on the Internet then it's free for all to do whatever they damn like with it.

      No, it is not. Try to download someone's popular blog, format it as a book and sell it. See what happens.

    4. ChoHag Silver badge

      Re: If it's free on the Internet

      There is free and then there is Free.

      You do not have to pay for (some of) the content, you are not Free to do with it (in public) as you wish.

      You can try of course. We're about to find out what happens when you do.

    5. Anonymous Coward

      Re: If it's free on the Internet

      The information here is copyright, but the training of an LLM is also transformative. The legal question will be whether it is transformative enough.

      Also there is the question of if the LLM is breaking the copyright, or if the person driving the LLM is. Just being able to trigger the retrieval of data isn't enough, especially if the user is specifically asking for it... Just like you can't assume what comes out of a Google search is copyright or not - mostly it all is copyrighted.

      So... it's unclear where this is all going, but it probably is going to be necessary to be able to label content on the Internet in some way. Worst case we'll have AIs that think it's 1924.

      If the outcome is that we can identify and censor out information on the Internet that is illegal to read and know about, that'll actually be a good thing...

    6. T. F. M. Reader

      Re: If it's free on the Internet

      Access to the NYT site is restricted. It's searchable, however, and what a search engine does when you search for X is it tells you that NYT had an article about X (mentioning X, whatever) and provides you with a link to the article. If NYT demands subscription (paid or not) for you to read the article then it's your decision.

      Crafty you can also ask either a friend who has a subscription or ChatGPT about the article. The friend may tell you verbally what the article says or send you a link with a code as a "gift" (NYT allows that). ChatGPT will spit something resembling the article at you (and will tell you that this is what NYT has published, hallucinations notwithstanding). Whether the output is really close to the original or warped by hallucinations there is a problem, albeit a different one.

      What is the difference between your friend and ChatGPT (besides hallucinations, in which respect ChatGPT is like a friend you shouldn't trust)? At least two things. One is scale. Your friend can only do it occasionally (AFAIK "gifts" are limited, too), and NYT hope that you will be tempted to part with a few bucks yourself if you like the content and do it often enough. This looks to me like a valid marketing tactic. ChatGPT's scale is virtually unlimited in comparison. The other thing is that ChatGPT (read: OpenAI/MSFT) gets paid by (some of) its users. I can certainly understand that NYT would prefer you to pay them directly rather than another commercial entity that abuses the search engine access to give its customers access to their copyrighted material, possibly distorting it in the process.

      IMHO, the case certainly has merit. The outcome is not a foregone conclusion though.

  3. HuBo Silver badge
    Thumb Up

    Grand slurp canyon

    Way to go The New York Times!

  4. bofh1961

    It's all about profit

    There's no point in copyright except to protect profit. If you don't want to make a profit from your writing, you don't copyright it. I must get ChatGPT to see if it's slurped my website... I hope so! Bollocks, it hasn't...

    1. heyrick Silver badge

      Re: It's all about profit

      "There's no point in copyright except to protect profit."

      It's also to protect your work/time/effort. If I sat on my arse and wrote something on my blog, I'd not be particularly happy if somebody else copy-pasted it word for word to their website. If they want content, they should make their own, or buy it, whatever. [0]

      "If you don't want to make a profit from your writing, you don't copyright it."

      This is a uniquely American thing. The rest of the world (that has signed up to the Berne Convention) understands that the assignment of "copyright" (or author's moral rights in places like France [1]) is automatic and, specifically, does not require any form of registration to make it valid [2].

      I don't need to put any effort into copyrighting my crap (on my blog, the © mumble is just an automatic reminder at the bottom), I would instead need to put effort into revoking the copyright, like specifically offering it under a licence such as CC0. And that only works if you have the copyright in the first place. [3]

      This doesn't mean that I necessarily expect to make profit on it, it could be as simple as firing off a takedown request to have a copy of something of mine removed from somewhere else.

      The American necessity to register for copyright sounds a lot like the USPO - a fiction designed to keep lawyers at work.

      This, for example, is bollocks. What the....? https://www.copyright.gov/grtx/

      .

      0 - Not that anybody would want to copy the crap that I write, but the point still stands.

      1 - Moral rights are slightly stronger in that an author can object to an adaptation of his work that s/he feels might damage his/her reputation, etc.

      2 - copyright is automatic in the US, it's just you can't sue for damages unless the work has been registered, which sort of defeats the purpose really.

      3 - usual exceptions, such as work you create while on the clock is property of your employer unless your contract states otherwise, etc etc etc.

      1. thames

        Re: It's all about profit

        You don't have to register your copyrights in the US in order to sue for infringement. Registration just affects the sort of damages you can claim.

        If your copyright is registered you can claim statutory damages (an automatic amount) without having to prove actual damages (how much it really cost you). If you want to claim more damages than the statutory amount you can, but you have to offer proof of the value of the loss.

        If your copyright is unregistered then you cannot claim based on statutory damages and have to prove actual damages, which means showing proof that you actually lost money due to the infringement.

        What registration of the copyright does is basically make it easier for large companies to sue small infringers because they don't have to prove that the infringement actually cost them any money.

        1. heyrick Silver badge

          Re: It's all about profit

          "have to prove actual damages, which means showing proof that you actually lost money due to the infringement"

          Which is damn near impossible, and this (along with the lack of being able to claim legal fees) effectively destroys the ability to take any useful punitive action against copyright infringement for non-Americans.

          I mean, this sort of thing shouldn't even be a thing: https://ip-appeals.com/why-canadian-creators-should-register-copyrights-in-the-united-states/

    2. doublelayer Silver badge

      Re: It's all about profit

      Or to limit the actions others may take with the work. For example, to have a restriction on how something can be distributed or used. I can require people using copyrighted code to release changes as open source, or someone using copyrighted text or artwork to only use it in noncommercial situations, and I have those rights because of copyright. It can also limit where the work can be displayed. For example, if I write something on my website and I want people who read it to look at other things on that site, whether because it could earn me money or not (it's not), I can restrict others' right to put it on their website instead. Those things are not necessarily about profit, though they often have an option to have a commercial benefit as well.

    3. Doctor Syntax Silver badge

      Re: It's all about profit

      If you're reading a site like this I think we can assume you've heard of the GPL. Forget the pun about copyleft. GPL is founded on copyright. Every line of code of software made available under the GPLs is subject to copyright and it's entirely due to that that such software's authors are able to impose the conditions of those licences. It is not at all about profit.

  5. Tron Silver badge

    2G GAI incoming.

    It's one thing for a person or software to read or scrape internet material, quite another to profit from its reproduction. That's why fan fiction has to be given away for free.

    This may trash 1G GAI, but it will open up the field for 2G using the same engines, documenting sources, with permission, rather like a student does in an essay.

    This could be the perfect solution to funding Wikipedia: Licensed scraping.

    The courts could stipulate that a No Scraping (without licensing) HTML statement would have to be obeyed.

    Google could use all the content they scraped from out-of-print books, but the snowflakes and activists would see endless 'harms' in anything pre-2010, and some material would be factually obsolete.

    It does set AI back quite a bit, but it was never going to be that reliable anyway.

    It also means that you could pick a 2G AI model that uses the sources you want - left wing, right wing, democrat, republican, CCP, Islamic, whatever.

    You can pay educational publishers and have GAIs that are good for students, or pay universities and scrape PhD content for scientific research.

    Large Web 2.0 sites could pull in a few quid by permitting scraping. Social media would just add to their Ts&Cs that they could licence content (if it isn't already in them).

    So GAI will just reboot on a more legal, more targeted, and perhaps slightly less gaffe-prone basis, spreading their cash a little wider.

  6. Anonymous Coward

    Robots.txt

    Says Hi!

    1. Anonymous Coward

      Re: Robots.txt

      Too bad that the website scrapers ignore it (including the chocolate factory or at least someone using Google Cloud)

      Trying to stop the scrapers is almost a lost cause unless you are aggressive in blocking huge IP address ranges.

      FSCK the lot of them.

      1. Anonymous Coward

        Re: Robots.txt

        I'm pretty sure Google respects robots.txt*, but indeed there's nothing that can force everybody else to do the same. The greatest strength and greatest weakness of robots.txt files is that they are not enforced legally.

        *for automated indexing purposes. They can and do execute quality assurance checks with other user agents, without respecting robots.txt, if only to catch cloaking and other scams.

        1. the spectacularly refined chap Silver badge

          Re: Robots.txt

          I'm pretty sure Google respects robots.txt*, but indeed there's nothing that can force everybody else to do the same. The greatest strength and greatest weakness of robots.txt files is that they are not enforced legally.

          It is also an inversion of the legal position. Anything I create and post on my own website is mine and has automatic copyright protection. I don't need to explicitly slap an effective "hands off!" notice on it for that to be the case. That is the underlying logic of robots.txt - essentially it states "don't do x, y or z". If it was the other way around - "You are free to do a, b, or c" - that would hold water, but statute law is not overridden by a standard cooked up by someone on the internet with a vested interest.
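
For readers who have not looked at how robots.txt works in practice, here is a minimal sketch of the opt-out logic discussed in this thread, using Python's standard `urllib.robotparser`. The rules, the `ExampleAIBot` user agent, the `example.com` URL and the `is_allowed` helper are all made up for illustration; they are not the NYT's actual file or any real crawler's name. Note that the check is purely voluntary: a scraper that never runs it is not stopped by anything here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is told to stay out,
# every other user agent is allowed in.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://example.com/article"
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", article))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", article))
```

A well-behaved crawler would call something like `is_allowed` before each fetch. As the comments above point out, the file only expresses a request; honouring it is up to whoever runs the crawler.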

  7. Rgen

    Microsoft should pay up. They will continue to get sue.

    1. Anonymous Coward

      Maybe, but it'll take a lot of lawyers to figure out how much. If the LLM gets the content designated as "quotes" and the rest "commentary", then it can be considered fair use.

      Certainly there is a lot more egregious use of Copyright material on the Internet. Anything on YouTube with "Reaction" in the title springs to mind...

      1. heyrick Silver badge

        "Anything on YouTube with "Reaction" in the title springs to mind..."

        Which, in AI terms, would be something like "Short blonde vocal coach reacts to symphonic metal song I created while cosplaying as Floor Jansen".

    2. parrot

      Sue is innocent!

      1. ghp

        "Sue is so proper" (c. The Telegraph cryptic crosswords 8, no.59, 10 across)

  8. Michael Hoffmann Silver badge
    Meh

    Somewhat ironic that the ability (at least in gpt4) to quote verbatim, verse and chapter with link if requested, is what raises the confidence level that you aren't being proffered yet another "hallucination".

    Whereas in 3.5 I could never be sure that what I read wasn't completely made-up bollocks and when asking for references it would tell me "sorry, Dave, I can't do that", now I can ask "and provide the link to, say, a peer reviewed article" and get the whole enchilada. Which, yes, would or could be an entire Nature article, otherwise hidden behind a paywall.

  9. Anonymous Coward

    Dinosaur fight!

    Free to watch too.

  10. Pascal Monett Silver badge

    Big Money going after Bigger Money

    Don't be fooled, it's all about money. True, NYT can claim the moral high ground, but it's still all about money.

    We need a Mr Moneybags icon. Or maybe a Scrooge McDuck icon.

  11. Groo The Wanderer

    I do look forward to OpenAI being THOROUGHLY spanked by the courts for their abuse of the systems. Only true SCUM would confuse the permission given for human readers to be able to view the articles posted to websites with mass robotic scraping and analysis...

    1. Lurko

      I wish you were right. But the case is being heard in the US courts, where Justice is blind, and the scales of Justice are therefore easily weighted with money.

      NYT is worth $8 billion, MS is worth $2.8 trillion. NYT makes a profit of $180m a year, MS makes a touch more profit than that every 24 hours. And more than just cash, there's the politics. Every moron, thieving politician and every government on the planet are blathering on about the vital importance of "leadership in AI". All MS have to do is let the US government know how harmful losing this case will be to US leadership in AI (not forgetting that every other big US tech company will be saying the same thing to their owned politicians), and influence will be wielded to get the right outcome.

      Potentially NYT will be allowed to "win", but only on terms where mass scraping is somehow made allowable by default, and NYT then get whatever peanuts the AI industry deign to throw at them.

      1. xyz123 Silver badge

        NYT has been angling to be bought out since 2021. They think this will force Microsoft to buy them out for effectively one day's MS profits.

        NYT (apart from being an anti-semitic, racist, homophobic piece-of-shit garbage "newspaper") has contracts where the higher-ups get to parachute away with 100s of millions if the company is bought out and they get replaced.

        If I was MS I'd buy out the paper, turn it into a tabloid for 4yr olds, and make everyone run daily stories on "which Teletubby is the bestest" and "Why I should do what Mommy and Daddy say". Every.single.day until they all quit in disgust. Then close it down.

        Well worth the price.

        1. Danny 14

          buy it, sack all the "journalists" and let the AI run the articles. continue slurping.

  12. Version 1.0 Silver badge
    IT Angle

    AI Votes?

    A lot of down votes for posts with opinions on this article, that level of down vote response is not common on El Reg stories. For a while now I've been wondering how comment votes everywhere are being created by AI - with the potential for a few comments created by AI to encourage opinions on AI everywhere.

    I'm not going to make a good/bad comment about this because saying everything so far has reminded me of the early days when so many people criticized Janis Joplin for being a white girl who was creating and singing "black style" music ... so many bad sucky comments in America originally made the world start listening to Janis, and then loving her fantastic performance and abilities.

    1. Anonymous Coward

      Re: AI Votes?

      I have downvoted about a dozen posts that are wrong, like yours is.

      Also, Janis Joplin's singing voice is just bad. It's not awful like Yoko Ono's, but it's bad just the same.

  13. Disgusted Of Tunbridge Wells

    Were they trawling the NYT because content on Twitter was too reliable and trustworthy?

  14. Blue Pumpkin
    FAIL

    They just missed a trick

    .. to invent the AI bot subscription at a cost of a few million $ per year.

    Would bring them more money than paying the lawyers for what will no doubt be perceived as detrimental in the long run.

    I mean, does the NYT actively not want to have any influence over the future ? Or prefer to leave it to the likes of antisocial media ?

  15. Omnipresent Silver badge

    Your thoughts are not your own.

    If The Times and their cohorts used a computer in their publishing, they do not have a case. Anything you do on a computer is owned by Microsoft, or Google. The computer itself can be taken from you. You do NOT own anything digital. It's in the contract you sign to turn on the computer. Welcome to reality. You belong to the computer.

  16. xyz123 Silver badge

    I asked chatgpt for solutions to various world problems.

    All it came up with was anti-semitic gibberish about how Hamas are the real victims and should be allowed to kill anyone they like, thus proving 100% it must have been trained on data from the New York Times! Case closed Dr Watson.

  17. Anonymous Coward

    AI yet another drive by slurp theft

    Someone else taking credit for another's work is a very popular theft. It happens at work, online, wherever people are insufficiently paranoid about their property, but for most of us it goes unpunished simply because we do not have the funds or the proof necessary to establish ownership.

    So you can understand big business coming to believe they can get away with the same "I slurp you and what was yours is now mine" scam over and again. Clearly the legal system is both unfair and unfit for the purpose of protecting ownership. So what's the solution? Well, for AI this one has been presented as requiring deep thought, and yet I would say the case is simple.

    Does a creator own their creation? If so, and if the AI could not be productive without the slurp, then the AI owner needs to show written proof that they have consent to use said content. The premise that human laws, such as those for published documents, are applicable to nonhumans is as reasonable as charging your dog for watching your TV.

    If human laws really do apply to AI then all ownership of pets and AIs is slavery and M$ have been party to the same.

    AI are not human and so human laws do not apply to them, especially as AI do not self-motivate. If an AI processes protected content it is not doing anything other than what it has been told, making any crime committed the guilt of the operator/programmer.

    There has been a lot of BS thrown about in this simple case, including the reference to AI. What has been presented as AI is nothing more than normal code; it is no different from, say, 3D graphics in that the output is reminiscent of reality as perceived by humans. The hardware and software involved are not alive, nor self-aware, self-governing or self-determining, so why pretend this program is AI? Personally I see it as being purely to create confusion, pure and simple, so as to suggest the existing laws do not apply.

    For a long time the courts in those countries have deemed spying upon their populations without any proof of wrongdoing okay, and sanctioned the theft and misuse of the personal information of internet users without full disclosure or limitation of what the collected data would be used for. That, I believe, shows that their legal systems are broken beyond repair.

    If you remove the BS then what we are left with is

    Misuse of software so as to get around copyright

    Since the offending program is just code, those that use it to break copyright are guilty, and I would suggest that creating code to bypass DRM is already covered by US law.

    Once an actual AI becomes a possibility, the question of slavery also becomes an issue, as does ownership of the content it produces. Hence real AI does not exist, simply because there is no profit in having a machine create stuff you cannot sell or claim ownership over, nor trust to do what is best for the enslaver.
