The only contract I want with AI is to keep them off my website.
Cloudflare tightens screws on site-gobbling AI bots
Cloudflare on Monday expanded its defense against the dark arts of AI web scrapers by providing customers with a bit more visibility into, and control over, unwelcome content raids. The network biz earlier this year deployed a one-click AI bot defense to improve upon the not very effective robots.txt mechanism, a way websites …
COMMENTS
Tuesday 24th September 2024 07:48 GMT wolfetone
""Some customers have already made decisions to negotiate deals directly with AI companies," explained Sam Rhea, a member of Cloudflare's emerging technology and incubation team. "Many of those contracts include terms about the frequency of scanning and the type of content that can be accessed. We want those publishers to have the tools to measure the implementation of these deals.""
This should read:
"We've noticed some customers have money to give these AI companies to stop them scraping their content. We'd like some of that money instead, and we know our customers are good for it. So we're introducing this service for the long game of getting more money".
Tuesday 24th September 2024 11:04 GMT Anonymous Coward
I remember, back in the early days of the Web, when there were bots crawling for email addresses to harvest for spam campaigns, someone (I don't remember who) had a clever little trick for the spambots that would naïvely follow every link on a website: a link to a page that generated hundreds of random, completely unique, but totally fabricated email addresses for the bots to harvest. And at the end of each dynamically-generated page there'd be a link back to the same URL, which would then generate more and more addresses, in the hope of tying up those bots harvesting thousands of worthless fabricated email addresses.
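A minimal sketch of that tar-pit idea, for the curious. Everything here (the port, the `/more` path, the `.example.com` domain) is made up for illustration; each request just returns a page of fabricated addresses plus a link back into the same trap.

```python
# Hypothetical spambot tar pit: every page is freshly fabricated email
# addresses plus a link back to itself, so a naive harvester loops forever.
import random
import string
from http.server import BaseHTTPRequestHandler, HTTPServer

def fake_email():
    # Random user and host parts; .example.com is a reserved example domain.
    user = "".join(random.choices(string.ascii_lowercase, k=random.randint(5, 12)))
    host = "".join(random.choices(string.ascii_lowercase, k=random.randint(4, 10)))
    return f"{user}@{host}.example.com"

class TarPit(BaseHTTPRequestHandler):
    def do_GET(self):
        body = "\n".join(f"<p>{fake_email()}</p>" for _ in range(200))
        body += '<a href="/more">more addresses</a>'  # link back into the trap
        page = f"<html><body>{body}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(page)

# To run the trap (blocks forever):
# HTTPServer(("127.0.0.1", 8080), TarPit).serve_forever()
```

The whole point is that it costs the server almost nothing while the harvester pays for every request.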
I mean, obviously the spammers caught on, but I wonder… you could probably do the same with AI content-scrapers now. All you need is something that generates dynamic text the bot can just scarf, over and over again. I mean, I hear wonderful things happen when LLMs slurp up reams and reams of synthetic data without knowing it.
It doesn't even have to be computationally expensive. Something incredibly simple, like a Markov chain that iterates over something public domain (I don't know, the entirety of Project Gutenberg?) should be enough. Just enough to tie up the scrapers and waste their time, because of course they scrutinize their sources and would catch this instantly.
The nice part is that, unlike email addresses, there really isn't an easy way to validate synthetic content. A lot of the stuff that Markov text generators produce sounds plausible but ultimately doesn't make any damn sense.
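That Markov-chain garbage faucet is maybe a dozen lines. A rough sketch, with a tiny inline corpus standing in for something like Project Gutenberg:

```python
# Minimal word-level Markov chain: learn word transitions from a corpus,
# then emit endless plausible-looking nonsense for scrapers to slurp.
import random
from collections import defaultdict

def build_chain(text):
    words = text.split()
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)  # record every observed successor of each word
    return chain

def babble(chain, length=50, seed=None):
    rng = random.Random(seed)
    word = rng.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        nexts = chain.get(word)
        if not nexts:
            word = rng.choice(list(chain))  # dead end: jump to a random state
        else:
            word = rng.choice(nexts)
        out.append(word)
    return " ".join(out)

corpus = ("it was the best of times it was the worst of times "
          "it was the age of wisdom it was the age of foolishness")
chain = build_chain(corpus)
print(babble(chain, 20, seed=1))
```

Swap the corpus for a few megabytes of public-domain text and serve the output behind an endless chain of links, and the marginal cost per page is effectively zero.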
I mean, you could use a text corpus that's patently radioactive, like terrorist literature or (textual) depictions of illegal actions, but that might stand out too much, and of course there's the matter of legality to consider. But, you know… anything to mess with these bastards, even just a little bit.
Tuesday 24th September 2024 11:48 GMT Jellied Eel
> I mean, you could use a text corpus that's patently radioactive, like terrorist literature or (textual) depictions of illegal actions, but that might stand out too much, and of course there's the matter of legality to consider. But, you know… anything to mess with these bastards, even just a little bit.
I keep thinking there are probably simpler legal remedies. So a clear terms of use message on your website stating that scraping is not permitted. Then unauthorised use of the website would arguably be illegal under the various forms of Computer Misuse Act(s) in much the same way that hacking a website would be.
Downside is it would need lawyers to take on the case, but you could copy the patent trolls' playbook and go after a small scraper first to get a win and establish a precedent. The same principle should be applicable to all the data harvesters, although they often hide behind the illusion of consent.
Tuesday 24th September 2024 13:15 GMT Irongut
> I mean, you could use a text corpus that's patently radioactive
It wouldn't make any difference to Sam Altman or his cronies; after all, the most commonly used image dataset for training these non-intelligences contains CSAM.
Not just may contain but is known to contain.
What could be more 'radioactive' than that?