Any gold rush has people selling shovels.
Cloudflare on Wednesday offered web hosting customers a way to block AI bots from scraping website content and using the data, without permission, to train machine learning models. It did so, it said, in response to customer loathing of AI bots and "to help preserve a safe internet for content creators." "We hear …
Let me know when Cloudflare has a means to detect when people have had enough of their criminal-organization protection.
Does Cloudflare detect when they're serving exact copies of corporate logos on new sites? Does their domain service detect deceptive naming? Do they have AI to prevent recurring abuse from the same customers? Does their abuse contact perform any function? Do they even detect when major DNS relays are blocking Cloudflare domain lookups because of phishing lists? Nope to all.
They did recently offer a service that attempts to sabotage browser developer consoles. I can only assume this is to help their phishers stay obfuscated.
It's not about policing random abusers or trolls. Cloudflare has been protecting the identities of sophisticated criminal gangs for over a decade. They send phishing and catfishing emails and SMS. They advertise their fake stores on major web sites.
And these aren't just convincing fake stores: they ship counterfeit products so some victims won't realize that their credit card details have been stolen.
If you're willingly assisting and profiting with criminals, you're a criminal. It's as simple as that. Cloudflare does it because it's good for their protection services.
"If you're willingly assisting and profiting with criminals, you're a criminal. It's as simple as that. Cloudflare does it because it's good for their protection services."
In which case you should focus your attention on the payments services and banks who actually move money for the criminals, and the countries who host criminals.
robots.txt is as useful as 'do not track' flags. They only work for the people you're not worried about. It assumes the people looking at it have any ethical concerns at all, and nobody involved in LLM 'AI' does. The whole model is based on theft, so why would they honor your robots.txt?
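To make the point concrete: robots.txt is pure advisory text, and a crawler only sees it if it chooses to look. A minimal sketch (the directives and URLs below are hypothetical, and `GPTBot`/`CCBot` are just example AI-crawler names):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt trying to opt out of AI training crawlers.
# Compliance is entirely voluntary: this is advice, not access control.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved AI crawler would check this and stop...
print(parser.can_fetch("GPTBot", "https://example.com/article"))      # False
# ...while ordinary visitors (and any bot that simply lies about its
# User-Agent) sail straight through.
print(parser.can_fetch("Mozilla/5.0", "https://example.com/article"))  # True
```

Nothing enforces the `False`: a scraper that never calls `can_fetch` (or spoofs its User-Agent) fetches the page anyway, which is exactly the commenter's point.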
A research paper looked into the possibility of embedding hidden patterns into images that generative AI would train on, resulting in malformed models that generate incorrect outputs for given prompts. Unfortunately, it's an arms race: whilst that might impact some models available today, it won't impact future ones which are trained against that data.
You'd need to try harder than that, perhaps creating multiple websites, only one being genuine, and the others convincing enough to be accepted for scraping but containing malicious content that gets grabbed and re-used. Honeytrap poison ;)
"And with a network that sees an average of 57 million requests per second, Cloudflare has ample data to determine which of these fingerprints can be trusted."
Now there's a figure that should set some alarm bells jangling. Surely, if they're fingerprinting bots they must be similarly fingerprinting everyone, if only to distinguish bot from non-bot.

It's all very well worrying about the likes of Google and Facebook tracking us via cookies (I don't, particularly, but many do), but when a single company hosts a vast swathe of the web's most active sites they don't need cookies - they just hoover up what they want from the server logs.

Suppose I don't want to participate? Can I stop Cloudflare from hoovering up my data for this or any other purpose? How would I even know that it's happening, since server logs don't require one of those pestilential cookie pop-ups? Data hoovering to provide a service to paying customers is apparently the root of all evil when ad slingers do it, and I don't see why a hosting company, with access to all HTTP headers and server logs, should be any different.
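The "no cookies needed" point is easy to illustrate. Cloudflare's actual bot-scoring is proprietary, but even a toy passive fingerprint built from nothing more than header names, header order, and the User-Agent already separates client types without storing anything on the visitor (the header sets below are made up for illustration):

```python
import hashlib

def header_fingerprint(headers: list[tuple[str, str]]) -> str:
    """Hash the order/names of HTTP headers plus the User-Agent.

    Real systems (JA3/JA4 TLS fingerprints, commercial bot scores) use far
    richer signals; this only shows that no cookie is required to tell one
    client population from another.
    """
    names = ",".join(name.lower() for name, _ in headers)
    ua = {k.lower(): v for k, v in headers}.get("user-agent", "")
    return hashlib.sha256(f"{names}|{ua}".encode()).hexdigest()[:16]

browser = [("Host", "example.com"), ("User-Agent", "Mozilla/5.0"),
           ("Accept", "text/html"), ("Accept-Language", "en-GB")]
script = [("Host", "example.com"), ("User-Agent", "python-requests/2.31"),
          ("Accept", "*/*")]

# Different header profiles yield different, but stable, identifiers.
assert header_fingerprint(browser) != header_fingerprint(script)
assert header_fingerprint(browser) == header_fingerprint(list(browser))
```

A proxy sitting in front of millions of sites sees these signals on every request, which is what makes the commenter's question about opting out pointed.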
Nor is scraping the open web tantamount to theft. If you want control of who reads your content, stick it behind a paywall or access control mechanism. The open web is just that - open, and anyone publishing anything on an open website can reasonably expect to have it read / consumed for just about any purpose, regardless of whether you like and approve of it or not.
Still, I have to give Cloudflare credit for finding a way to ride the AI hype train without all the hassle of building/ training models etc. Full marks for crafty thinking, maybe next time do it without stealing visitor data.
I was in full agreement with you until "Nor is scraping the open web tantamount to theft.". Just because I posted it on my website, that does NOT mean it is now public domain for anyone to do with as they will. Visit the site, see the page, read/listen to/watch the content? Yes, that's what it's there for. Download the contents for offline viewing? Questionable, especially things like videos and entire songs. Use it to autocreate (no creativity needed, just a quick text query) "new" content from my copyrighted work, without permission or attribution? Absolutely not.
Scraping websites is rather like recording broadcast TV. If you're doing it for private viewing later, it's fine. If you're doing it to collage together something "new" from it and publish it, it's copyright infringement.
"Just because I posted it on my website, that does NOT mean it is now public domain for anyone to do with as they will."
Of course it isn't public domain, and of course you retain the copyright to your work. But the web is created to be consumed, and you have very little control over what it is consumed for, or by whom.
Making a blatant copy of the site, either online or printed, say, is clearly a nono, and is relatively easy to remedy via copyright claims if the other party is in a convenient jurisdiction. But it's less clear if the site is scraped for some kind of analysis. Does the bot infringe your copyright by, say, counting the words? If so, how? Does a human, if they do the same thing? What if they count the frequency of each word, and how often it appears next to other words? What if they amalgamate the findings from your site with those from 1000s of others into one massive frequency table?
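The kind of analysis described above (word counts, adjacency frequencies, amalgamated across documents) is genuinely a few lines of code, which is part of why the copyright question is so slippery. A hypothetical sketch with a made-up sentence:

```python
from collections import Counter

text = "the cat sat on the mat and the cat slept"
words = text.split()

# Frequency of each word, and of each adjacent word pair (bigram).
word_freq = Counter(words)
bigram_freq = Counter(zip(words, words[1:]))

print(word_freq["the"])              # 3
print(bigram_freq[("the", "cat")])   # 2
```

Merging such `Counter` tables from thousands of sites (`counter_a + counter_b`) gives exactly the "massive frequency table" the commenter describes, from which no individual source page is recoverable.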
I can't really see how that, or other analysis, infringes copyright, with the possible exception that a temporary, transient copy of each page is made as the bot retrieves it from your server. If that copy is retained, then maybe there would be a tenuous argument for copyright infringement, but otherwise I can't see it.
All of which is not to say that, as the author, you can't take steps to prevent people or bodies you don't like from reading your material. Of course you can, but copyright isn't really the answer to my mind.
"I can't really see how that, or other analysis, infringes copyright"
I basically agree with your argument.
It gets interesting when you consider a simple case where you have recorded every distinct word in a document and in what positions it occurs, e.g. "intelligence" occurs on page 12, line 9, word 7; page 13,...
Obviously the original document can be reconstructed from that down to the physical layout. Using only word offsets still preserves the logical content. (Ignoring punctuation etc.)
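That reconstruction claim is easy to demonstrate: a positional index of word offsets is just the original document stored sideways. A sketch with a hypothetical one-line "document":

```python
from collections import defaultdict

doc = "artificial intelligence is neither artificial nor intelligence"

# Record every distinct word and the positions at which it occurs.
index = defaultdict(list)
for pos, word in enumerate(doc.split()):
    index[word].append(pos)

# e.g. index["artificial"] == [0, 4]

# Inverting the index recovers the document exactly (punctuation aside).
slots = {pos: word for word, positions in index.items() for pos in positions}
rebuilt = " ".join(slots[i] for i in range(len(slots)))
assert rebuilt == doc
```

It's only once positions are thrown away and frequencies are merged across many documents, as the next comment notes, that faithful recovery becomes impractical.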
More lossy representations blur this even more but I also wonder whether LLM might be heading towards a holographic representation of its training sets.
Once you merge word-adjacency frequencies from a large number of documents, faithful recovery of any particular document is nearly impossible. It can probably come close, though, as in many disciplines the self-proclaimed experts spout exactly the same semiliterate nonsense.
I am always deeply suspicious of any attempt to broaden the scope of, protection for, and duration of, intellectual property (copyright, patents...) as it has never benefited those that the proposed measures purport to protect. If copyright were inalienable and only non-exclusive licensing permitted, most of that nonsense would evaporate. E.g. A. A. Milne could only have licensed Pooh to Disney (and concurrently to anyone else if he chose to).
"Still, I have to give Cloudflare credit for finding a way to ride the AI hype train without all the hassle of building/training models etc. Full marks for crafty thinking, maybe next time do it without stealing visitor data."

They claim to be using AI to detect these AI scrapers, so presumably they have built and trained models.
>> Can I stop Cloudflare from hoovering up my data for this or any other purpose?
Don't use them as your CDN; they will then not be monitoring requests from end nodes to your servers, as the traffic will not be routed over their CDN.
>> Nor is scraping the open web tantamount to theft.
Maybe you should read up on copyright laws before making such a comment.
If the bot is written to disregard robots.txt, it's likely we'll see many other approaches to disguise requests as normal consumers of the resource. It's very difficult to differentiate requests that come from multiple IP addresses and use the User-Agents of major browser vendors.
Well, sort of. What tends to happen is the effort to circumvent blocks is correlated with the potential reward. This is why anything offering direct financial rewards will go from simple scripts, to skeleton browsers, to automated browsers, to saps being paid to click their way through. Speculative crawls like these are not worth the resources of spinning up browsers: it's probably far more efficient to identify, say, the 10,000 sites you're interested in and pay some saps to click their way through, saving each page as it comes.
If the AI client has access to hundreds of thousands of non-sequential IP addresses then it can be well hidden, as it can act as hundreds of thousands of human-speed browsers. If, on the other hand, it is accessing a large number of sites quickly from a small number of IP addresses, it becomes visible in the logs rather quickly. The only real challenge is to separate the AI traffic patterns from traffic patterns that would be generated by a large multi-user proxy.
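The "small number of IPs shows up in the logs" case is straightforward rate counting. A minimal sketch over hypothetical (IP, timestamp) log entries, with an arbitrary example threshold:

```python
from collections import Counter

# Hypothetical access-log entries: (client_ip, unix_timestamp).
log = (
    [("203.0.113.9", 1000 + i) for i in range(120)]         # 120 hits in 2 min
    + [("198.51.100.7", 1000 + i * 30) for i in range(4)]   # 4 hits in 2 min
)

WINDOW = 120     # seconds
THRESHOLD = 60   # requests per window before we call it a bot (illustrative)

hits = Counter(ip for ip, ts in log if 1000 <= ts < 1000 + WINDOW)
suspects = {ip for ip, n in hits.items() if n > THRESHOLD}

print(suspects)  # {'203.0.113.9'}
```

As the comment says, this only catches the clumsy case: spread the same 120 requests across 120 addresses and every IP looks like a patient human, indistinguishable from users behind a big proxy.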
Oh, it's them who feel obliged to perform a security check on my browser, and then one of three things happens:
1. it works and I get to go to the site
2. the check fails and I get some garbage Ray ID to look at
3. the test works and I'm redirected to the test, which redirects me to the test
The worst thing is that this essentially seems random. I can be blocked on a site and five minutes later Cloudflare on a different site is like "yeah, you're okay". WTaF?
Google completely ignore robots.txt - over 90% of our traffic is from Googlebot; they seem to have got into a scan loop and are just smashing the site all day, every day. The worst of it is we can't block it, because the remaining 10% is paying customers driven to the site by Google :-/ They have a form to fill in, but who knows if that will be effective (hint - it hasn't been yet). I did discover yesterday, though, that our CDN (not Cloudflare) has a tarpit function, so the AI bots might be meeting that soon...
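For readers unfamiliar with the term: a tarpit deliberately stalls or drip-feeds responses to suspected bots so each crawl costs them time, while real visitors are answered normally. CDN tarpit features are vendor-specific; this is only a hypothetical sketch of the decision step, with a made-up agent list and penalty:

```python
# Hypothetical blocklist of AI-crawler User-Agent names (illustrative only).
AI_BOT_AGENTS = {"gptbot", "ccbot", "claudebot"}

def tarpit_delay(user_agent: str, base: float = 0.0,
                 penalty: float = 10.0) -> float:
    """Seconds to stall before answering this request.

    A real tarpit would also trickle the response body a few bytes at a
    time; here we only model the pre-response delay.
    """
    agent = user_agent.split("/")[0].lower()
    return penalty if agent in AI_BOT_AGENTS else base

print(tarpit_delay("GPTBot/1.0"))              # 10.0
print(tarpit_delay("Mozilla/5.0 (X11; Linux)"))  # 0.0
```

The obvious weakness, as noted elsewhere in the thread, is that any bot willing to spoof a browser User-Agent never hits the penalty branch.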
Google are definitely up to no good (no surprise there for most Reg readers, I'm sure).
We used to have an "intranet" site that required staff login to access. While checking how our main public site was showing up in Google, I noticed that it was also returning some search results for our intranet site, as it has the same domain, although a different host.
At least it was only showing page titles and not any page content, but it shouldn't even have been possible for it to do that! I don't use Google as my normal search engine nor a Chromium-based browser, so my suspicion was that when some colleagues were accessing the site using Chrome (and possibly other Chromium-based browsers), their browser was sneakily recording their browsing activities and reporting home to GRUgle, perhaps even including the whole page content as well (that'd be an entirely inadvertent result of badly-engineered "browser cache quality telemetry diagnostics" or something, I'm sure…). Luckily there wasn't really anything sooper sekrit on our intranet, otherwise I would have made more noise at the time…