"whether or not [Google] would scrape public copyrighted or licensed data or social media posts"
It's Google.
It will.
Google has updated its privacy policy to confirm it scrapes public data from the internet to train its AI models and services – including its chatbot Bard and its cloud-hosted products. The fine print under research and development now reads: "Google uses information to improve our services and to develop new products, …
If I browse the web and read copyrighted but publicly accessible information to learn about some new subject, that's fine. If I use my new knowledge to write a book about it, that's fine too, so long as it is my original work and I'm not simply regurgitating paragraphs from what I read.
So if an "AI" is doing it to "learn", that may be considered fair use under copyright law. But if I ask it a question about steam engines, for instance, and in its response it spits out full sentences and even paragraphs from "publicly available" but copyrighted information on the web, that's theft. It could give me a link and say "the answer to your question is found here". It is like the difference between a search engine simply providing a link to a website, versus scraping the text and serving it to me without the website getting the hit.
It will be up to judges, or legislators (if they can agree on anything long enough to pass a law) to draw the line on where fair use becomes theft.
Search engines already return snippets with their results, which has landed them in trouble with news sites. The results of ancillary copyright laws have been mixed. It's going to be a long time before we have clarity on what should be legal, and it's the wild west until then.
Search results retain information on the source of the material and provide that information.
AI results give you responses with no such context: the model loses track of where the information was slurped from and who owns the copyright on it.
That’s the difference. That’s where it’s no longer fair use and now copyright infringement: you’ve taken somebody else’s work and offered it as your own, rather than looking at somebody else’s work and making it easier to find by pointing others to it.
Add to that, other companies will be charging for this information (OpenAI), and I’m surprised people are even debating this.
>>spits out full sentences and even paragraphs that are from "publicly available" but copyrighted information on the web, that's theft
Actually it isn't theft (dishonestly appropriating property belonging to another with the intention of permanently depriving the other of it) - there is no element of "permanently depriving the other" in copyright infringement.
With the definition of theft out of the way... I will agree that LLMs possibly infringe copyright. Given the way these LLMs work (see comments passim) it could be argued that it isn't infringing copyright because it manifestly isn't copying the work (it's regenerating it based on letter/word probabilities) - but I will leave that discussion to m'learned friends.
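To make that "regenerating from probabilities" point concrete, here's a toy sketch in Python - emphatically not how a real LLM is built (those use neural networks over subword tokens), just the bare principle of emitting text by sampling from next-word probabilities; the word lists and numbers are made up:

    import random

    # Hand-made next-word probabilities, standing in for a trained model.
    next_word_probs = {
        "steam":  {"engine": 0.7, "power": 0.2, "locomotive": 0.1},
        "engine": {"uses": 0.5, "was": 0.3, "drives": 0.2},
    }

    def generate(word, steps=2):
        out = [word]
        for _ in range(steps):
            dist = next_word_probs.get(out[-1])
            if dist is None:
                break
            choices = list(dist)
            out.append(random.choices(choices, weights=[dist[w] for w in choices])[0])
        return " ".join(out)

    print(generate("steam"))  # e.g. "steam engine drives"

Nothing in there stores the training text verbatim, yet with sufficiently skewed probabilities it can still reproduce it word for word - which is exactly the ambiguity the courts will have to untangle.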
You are permanently depriving the copyright holder of the credit/reputation that should be theirs, and any associated benefits that might accrue from it.
However, as I understand it, the designers thought of this - after some well publicised incidents - and implemented safeguards specifically to prevent wholesale regurgitation of texts. If that's not the case, then it should be easy to win a copyright suit against them, and since nobody has yet convincingly brought that suit, I think those safeguards are currently working.
>>You are permanently depriving the copyright holder of the credit/reputation that should be theirs,
IANAL, but no you are not. You would actually have to take possession of the credit/reputation/material, which just copying/regenerating the text doesn't do.
Anyway - we sort of agree - I am being pedantic about the legal definition of theft (which copyright infringement doesn't meet); the man on the Clapham omnibus would probably say that copyright infringement is theft, using colloquial rather than legal language.
What we call it is by the by. Is a regenerative system committing an act that can be actioned when it doesn't actually copy and paste the material in the first place? That is the question the beaks and silks will have to sort out for us mere mortals.
By the "regeneration" argument, having a person with a photographic memory read a book and then type it out, word for word, wouldn't be "copying" it. Nonsense. If the "new" work is clearly recognizable as being the original work, even if there are a few tweaks, then it's copyright infringement.
"on where fair use becomes theft."
Not all jurisdictions have a "fair use" clause anyway, and of those that do, it's not the same everywhere. Mostly it seems to be US corporations assuming their own quite liberal "fair use" legal framework applies everywhere, to everything. The MO seems to be "just do it until someone complains, then tie them up in court for years until they give up or go bust".
...goes far beyond the capabilities of a person.
What counts as 'public' still needs to be settled legally. It's more and more clear that AI needs government regulation to help settle how it may be used, as companies *will* otherwise assume they have carte blanche because the information is 'public'.
It looks like AI has become the prime malware delivery choice now. Looking at our hourly malware deliveries, I'm seeing exploits from an "ingodwetrust" address (WTF!) today. It looks like AI is working on getting everyone to open their delivery files, processing public information to choose deliveries that users might think are OK... if AI is working, then we're going to see a worse public file delivery environment.
Train it on free, publicly generated data, monetize it, and sell it back - which appears to be everyone's modus operandi nowadays (though Google aren't yet firewalling their content and demanding subscriptions... give 'em time).
Maybe, in some far future, the use for blockchain will be digital watermarks (called GettyFek?) and we will actually be able to monetize our own output: 0.0001p a word, article picture, or video; along with personal AI avatars whose sole use will be scraping the net hunting for our content and issuing pay-or-pull notices.
*reaches for calculator, and terminal command.
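Since we're reaching for the calculator: at the 0.0001p-a-word rate suggested above (all figures invented for illustration), a quick sanity check in Python:

    # Back-of-envelope sums for the hypothetical 0.0001p-per-word rate.
    rate_pence_per_word = 0.0001
    words_scraped = 100_000_000        # suppose a model ingests 100 million of your words
    payout_pence = rate_pence_per_word * words_scraped
    print(f"£{payout_pence / 100:,.2f}")  # £100.00

So even a hundred million scraped words nets a princely £100 - the avatars had better work cheap.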
Just scraping the news from news sites and republishing it itself. Surely scraping other copyrighted text and saying it went through a very complicated LLM data-masher program before regurgitation, and therefore it's not now a copyright violation, is a pretty weak excuse: you copied it when you fed it into the LLM.
Not banned, but some countries said that they had to pay the news sites. Depending on the countries and the details, they've paid up (Australia, France) or closed Google News (Spain). In Canada, it seems a new law says they have to pay to include news sites in search results, and if I understand correctly they've said they were going to filter them out completely.
The copying when you feed it into the system isn't the copyright infringement, though - it's publishing the info afterwards.
I can cut and paste into a Word doc, go through and rewrite it in my own words, and it becomes a new work; derivative, sure, but that's allowed as fair use in many jurisdictions (like the USA, for example, where the first AI copyright trials are taking place). And they are civil claims as well, not criminal cases, so it will come down to who has the most money for their lawyers.
"In the old days you could put a file called robots.txt in a web site's home directory to deter content scraping by web crawlers."
There's also noindex in the page meta tags. Perhaps something similar, such as a noai tag, would be the answer to all this squabbling over what publicly available information can be used for.
Of course, that requires the bots to respect the tags, but it would be a good starting point if you're inclined to sue an AI company at a later date.
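For reference, the existing opt-outs look something like this - note the noai directive is purely hypothetical (the suggestion above), and none of it is enforceable, only a polite request that well-behaved bots may honour:

    # robots.txt - a request, not an access control
    User-agent: *
    Disallow: /private/

    <!-- page-level equivalents in the HTML head -->
    <meta name="robots" content="noindex">
    <meta name="robots" content="noai">  <!-- hypothetical, per the suggestion above -->

Keeping server logs of which bots ignored the tags would be the evidence you'd want for that later lawsuit.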
Personally, I would have no problem with my content being used to train models. I might be slightly less enthusiastic about the model reproducing it verbatim, but I've always assumed that anything I've written on the web is effectively beyond my control the minute it goes live. Yes, I know: DMCA, the Copyright, Designs and Patents Act, etc. (is it still called that?). Fine in theory, but just you try enforcing it outside your home jurisdiction. Or in your jurisdiction, come to that. If you're a big business with deep pockets you might have a chance, but the rest of us don't have the time or the money to pursue such things, even if we wanted to. Best to just get on with doing what we do.
Can see more sites putting extra “prove you are human” hoops in the way of content access.
But definitely, the best form of defence is attack, so perhaps we need tools that can, for example, authenticate web crawlers and so decide what content (real or honeytrap) to expose… tools that can generate honeytrap rubbish in the quantities LLM training requires… perhaps an opportunity to prove the old "100 monkeys with typewriters" idea…
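A minimal sketch of the crawler-gating idea, assuming a Flask app and a naive User-Agent check - UA strings are trivially forged, so a real tool would verify crawlers via reverse DNS or signed requests, and the bot names and endpoint here are illustrative:

    import random
    from flask import Flask, request

    app = Flask(__name__)
    AI_BOT_MARKERS = ("GPTBot", "CCBot")  # illustrative AI-crawler User-Agent tokens

    def honeytrap(n_words=50):
        # Plausible-looking rubbish to pollute a scraper's training corpus.
        words = ["steam", "engine", "copyright", "scrape", "model", "fair", "use"]
        return " ".join(random.choice(words) for _ in range(n_words))

    @app.route("/article")
    def article():
        ua = request.headers.get("User-Agent", "")
        if any(marker in ua for marker in AI_BOT_MARKERS):
            return honeytrap()          # feed the bots word salad
        return "The real article text goes here."

The monkeys-with-typewriters part is the honeytrap() generator: random enough to be worthless as training data, cheap enough to produce in bulk.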
With that logic, Gsecurity will have to stop yelling at people for using Gbikes. They're just lying around in publicly accessible areas. Fair use, right? Oh, looks like tons of data in Google maps is open to the public too. Somebody could train their own system to generate realistic and accurate maps from that publicly visible data.