Google says public data is fair game for training its AIs

Google has updated its privacy policy to confirm it scrapes public data from the internet to train its AI models and services – including its chatbot Bard and its cloud-hosted products. The fine print under research and development now reads: "Google uses information to improve our services and to develop new products, …

  1. Pascal Monett Silver badge

    "whether or not [Google] would scrape public copyrighted or licensed data or social media posts"

    It's Google.

    It will.

    1. Ashto5 Bronze badge

      It’s Google

      It did, it will and your not going to be able to stop it

      1. Jimmy2Cows Silver badge

        Re: your

        Hmm... upvote for the sentiment, or downvote for the egregiously incorrect use of "your"...

        What's a simple pedant to do?

  2. DS999 Silver badge

    This will eventually go to the courts

    If I browse the web and read copyrighted but publicly accessible information to learn about some new subject, that's fine. If I use my new knowledge to write a book about it that's fine too so long as it is my original work and I'm not simply regurgitating paragraphs from what I read.

    So if an "AI" is doing it to "learn" that may be considered fair use under copyright law. But if I ask it a question about steam engines, for instance, and in its response to my questions spits out full sentences and even paragraphs that are from "publicly available" but copyrighted information on the web, that's theft. It could give me a link and say "the answer to your question is found here". It is like the difference between a search engine simply providing a link to a website, versus having scraped the text and serving it to me without the web site getting the hit.

    It will be up to judges, or legislators (if they can agree on anything long enough to pass a law) to draw the line on where fair use becomes theft.

    1. Dinanziame Silver badge

      Re: This will eventually go to the courts

      Search engines already return snippets with their results, which has landed them in trouble with news sites. Results of ancillary copyright laws are mixed. It's going to be a long time before we have clarity on what should be legal, and it's the wild west until then.

      1. Law

        Re: This will eventually go to the courts

        Search results retain information on the source of the material and provide that information.

        AI results give you responses with no context on that information, they lose where the information was slurped from and who owns the copyright information in the model.

        That’s the difference. That’s where it’s no longer fair use and now copyright infringement- you’ve taken somebody else’s work, and offer it as your own, rather than looking at somebody else’s work and making it easier to find by pointing others to it.

        Add to that, other companies will be charging for this information (OpenAI) and I’m surprised people are even debating this.

        1. John69

          Re: This will eventually go to the courts

          It is certainly a difference, but not one that makes a difference in IP law.

    2. 42656e4d203239 Silver badge

      Re: This will eventually go to the courts

      >>spits out full sentences and even paragraphs that are from "publicly available" but copyrighted information on the web, that's theft

      Actually it isn't theft (dishonestly appropriating property belonging to another with the intention of permanently depriving the other of it) - there is no element of "permanently depriving the other" in copyright infringement.

      With the definition of theft out of the way... I will agree that LLMs possibly infringe copyright. Given the way these LLMs work (see comments passim) it could be argued that it isn't infringing copyright because it manifestly isn't copying the work (it's regenerating it based on letter/word probabilities) - but I will leave that discussion to m'learned friends.

      1. veti Silver badge

        Re: This will eventually go to the courts

        You are permanently depriving the copyright holder of the credit/reputation that should be theirs, and any associated benefits that might accrue from it.

        However, as I understand it, the designers thought of this - after some well publicised incidents - and implemented safeguards specifically to prevent wholesale regurgitation of texts. If that's not the case, then it should be easy to win a copyright suit against them, and since nobody has yet convincingly brought that suit, I think those safeguards are currently working.

        1. 42656e4d203239 Silver badge

          Re: This will eventually go to the courts

          >>You are permanently depriving the copyright holder of the credit/reputation that should be theirs,

          IANAL but no you are not. You would actually have to take possession of the credit/reputation/material which just copying/regenerating the text doesn't do.

          Anyway - we sort of agree - I am being pedantic about the legal definition of theft (which copyright infringement doesn't meet); the man on the Clapham omnibus would probably say that copyright infringement is theft using colloquial rather than legal language.

          What we call it is by the by - is a regenerative system committing an act that can be actioned when it doesn't actually copy and paste the material in the first place? That is the question the beaks and silks will have to sort out for us mere mortals.

          1. Anonymous Coward

            Re: This will eventually go to the courts

            By the "regeneration" argument, having a person with a photographic memory read a book and then type it out, word for word, wouldn't be "copying" it. Nonsense. If the "new" work is clearly recognizable as being the original work, even if there's a few tweaks, then it's copyright infringement.

    3. John Brown (no body) Silver badge

      Re: This will eventually go to the courts

      "on where fair use becomes theft."

      Not all jurisdictions have a "fair use" clause anyway, and of those that do, it's not the same everywhere. Mostly it seems to be US corporations assuming their own quite liberal "fair use" legal framework applies everywhere, to everything. The MO seems to be "just do it until someone complains then tie them up in court for years until they give up or go bust".

  3. Anonymous Coward

    AI processing of public information...

    ...goes far beyond the capabilities of a person.

    What's 'public' still needs to be processed legally. It's more and more clear that AI needs government regulation to help settle how it may be used, as companies *will* otherwise assume they have carte blanche because the information is 'public'.

    1. Version 1.0 Silver badge

      Re: AI processing of public information...

      It looks like AI has become the prime malware delivery choice now; looking at our hourly malware deliveries, I'm seeing exploits from an "ingodwetrust" address (WTF!) today. It looks like AI is working at getting everyone to open their delivery files, processing the public information to choose deliveries that users might think are OK ... if AI is working then we're going to see a worse public file delivery environment.

  4. Boolian

    Run DMCA

    Train it on free, publicly generated data, monetize it, and sell it back - which appears to be everyone's modus operandi nowadays (though Google aren't yet firewalling their content, and demanding subscriptions... give em time).

    Maybe, in some far future, the use for blockchain will be digital watermarks (called GettyFek?) and we will actually be able to monetize our own output - 0.0001p a word, article, picture, or video; along with personal AI avatars, whose sole use will be scraping the net hunting for our content, and issuing pay-or-pull notices.

    *reaches for calculator, and terminal command.

    1. XxXb

      Re: Run DMCA

      Google might not outright be pay walling but the reduction in storage available and the pay to increase is the biggest puppy dogging move in history.

  5. Evil Scot

    Interestingly, Reg staff outside the USA could not see the text quoted at the above link...

    27 Countries (Plus California) find that interesting.

  6. Howard Sway Silver badge

    Didn't Google News get banned in some places for doing this?

    Just scraping the news from news sites and republishing it itself. Surely scraping other copyrighted text and saying that it went through a very complicated LLM data masher program first before regurgitation and therefore it's not now a copyright violation is a pretty weak excuse : you copied it when you fed it into the LLM.

    1. Dinanziame Silver badge

      Re: Didn't Google News get banned in some places for doing this?

      Not banned, but some countries said that they had to pay the news sites. Depending on the countries and the details, they've paid up (Australia, France) or closed Google News (Spain). In Canada, it seems a new law says they have to pay to include news sites in search results, and if I understand correctly they've said they were going to filter them out completely.

    2. Felonmarmer

      Re: Didn't Google News get banned in some places for doing this?

      The copying when you feed into the system isn't the copyright infringement though - it's publishing the info afterwards.

      I can cut and paste into a word doc, go through and rewrite it in my own words and it becomes a new work, derivative sure, but that's allowed as fair use in many jurisdictions (like the USA for example, where the first AI copyright trials are taking place). And they are civil claims as well, not criminal cases, so it will come down to who has the most money for their lawyers.

  7. heyrick Silver badge

    would scrape public copyrighted or licensed data

    Well that's pretty much everything unless it is clearly marked with a compatible licence...

    Instead, as now, they'll just ignore licences and copyright completely, and then lose their shit if somebody ignores theirs.

  8. Grumpy Fellow

    Is robots.txt still a thing?

    Just a question. In the old days you could put a file called robots.txt in a web site's home directory to deter content scraping by web crawlers. Would that address this issue?

    1. Anonymous Coward

      Re: Is robots.txt still a thing?

      Yes. This is how you disable all crawlers. Which is why the lawsuits against AIs will fail.

      1. that one in the corner Silver badge

        Re: Is robots.txt still a thing?

        > This is how you disable all crawlers

        Unless they have been dastardly enough to include the -e robots=off option when they point wget at your site.
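To make the robots.txt mechanism in this sub-thread concrete, here is a minimal sketch of how a well-behaved crawler might consult the file before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleAIBot` user agent, and the `is_allowed` helper are all assumptions made up for illustration, not anything Google or any real crawler is known to use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one made-up AI crawler entirely,
# and keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "SomeOtherBot"):
        for url in ("https://example.com/article",
                    "https://example.com/private/notes"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

As the wget remark above points out, the check is purely voluntary: the file only has an effect if the crawler chooses to run something like `is_allowed` and honour the answer.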

    2. TheMaskedMan Silver badge

      Re: Is robots.txt still a thing?

      "In the old days you could put a file called robots.txt in a web site's home directory to deter content scraping by web crawlers."

      There's also noindex in the page metatags. Perhaps something similar, such as a noai tag, would be the answer to all this squabbling over what publicly available information can be used for.

      Of course, that requires the bots to respect the tags, but it would be a good starting point if you're inclined to sue an ai company at a later date.

      Personally, I would have no problem with my content being used to train models. I might be slightly less enthusiastic about the model reproducing it verbatim, but I've always assumed that anything I've written on the web is effectively beyond my control the minute it goes live. Yes, I know, DMCA, copyright, designs and patents act etc (is it still called that?). Fine in theory, but just you try enforcing it outside your home jurisdiction. Or in your jurisdiction, come to that. If you're a big business with deep pockets you might have a chance, but the rest of us don't have the time or the money to pursue such things, even if we wanted to. Best to just get on with doing what we do.
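The metatag idea above could be checked on the crawler side roughly as follows. This is only a sketch: `noindex` is an existing robots meta directive, but `noai` is the commenter's proposal and is treated here as a hypothetical value, and the `RobotsMetaParser` class and `may_train_on` helper are names invented for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, nofollow".
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def may_train_on(html: str) -> bool:
    """Hypothetical policy: skip any page that carries a 'noai' directive."""
    return "noai" not in robots_directives(html)


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, noai"></head></html>'
    print(robots_directives(page), may_train_on(page))
```

Like robots.txt, this only works if the bot bothers to look, which is the caveat the comment itself raises.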

      1. Inkey

        Re: Is robots.txt still a thing?

        yeah nicely said Maskedman.... i think an even fairer outcome would be that if it's scraped from public data then the source should be public too

      2. Anonymous Coward

        Re: Is robots.txt still a thing?

        Better to have an ai-ok tag, as a proper opt-in rather than the default position of "you didn't say I couldn't take it so it's ok that I did".

    3. veti Silver badge

      Re: Is robots.txt still a thing?

      What if you don't want your text to be invisible and undiscoverable, you just want it to be correctly attributed?

  9. Roland6 Silver badge

    Boot on other foot: Google’s ad analytics network is fair game for training ad blockers

    Can see more sites putting extra “prove you are human” hoops in the way of content access.

    But definitely, the best form of defence is attack, so perhaps we need tools that for example can authenticate web crawlers and so decide what content (real or honeytrap) to expose… tools that can generate honeytrap rubbish in the quantities LLM training requires… perhaps an opportunity to prove a 100 monkeys with typewriters…

  10. Kevin McMurtrie Silver badge

    Ok, then

    With that logic, Gsecurity will have to stop yelling at people for using Gbikes. They're just lying around in publicly accessible areas. Fair use, right? Oh, looks like tons of data in Google maps is open to the public too. Somebody could train their own system to generate realistic and accurate maps from that publicly visible data.

  11. FuzzyTheBear

    for free ?

    Think it's time for a humongous lawsuit that will have them pay us for our data.

    They make money out of it .. so should we.

    Gentlemen get your checkbooks out.
