Cloudflare debuts one-click nuke of web-scraping AI

Cloudflare on Wednesday offered web hosting customers a way to block AI bots from scraping website content and using the data without permission to train machine learning models. It did so citing customer loathing of AI bots and a desire, as it said in a statement, "to help preserve a safe internet for content creators." "We hear …

  1. Mage Silver badge
    Pirate

    hmm

    Any goldrush has people selling shovels.

    1. AVR Bronze badge

      Re: hmm

      True even when the gold was just a rumour and the miners destined for disappointment.

    2. The Man Who Fell To Earth Silver badge
      Unhappy

      Re: hmm

      I would expect all web hosters to offer this service for an additional charge any day now. And then the AI companies can counter-offer money to all of the web hosters not to do it. Win-win for the web hosting companies. Those with websites? Not so much.

    3. Anonymous Coward
      Anonymous Coward

      Google took similar steps the following month

      Actually, they didn't. They did not provide a way to allow a site to be indexed but not scraped by Bard or Vertex AI specifically. It's a classic "smoke & mirrors" response from Google.

    4. Anonymous Coward
      Anonymous Coward

      Re: hmm

      or ferroconcrete umbrellas

  2. Kevin McMurtrie Silver badge

    Whatever

    Let me know when Cloudflare has a means to detect when people have had enough of their criminal-organization protection.

    Does Cloudflare detect when they're serving exact copies of corporate logos on new sites? Does their domain service detect deceptive naming? Do they have AI to prevent recurring abuse from the same customers? Does their abuse contact perform any function? Do they even detect when major DNS relays are blocking Cloudflare domain lookups because of phishing lists? Nope to all.

    They did recently offer a service that attempts to sabotage browser developer consoles. I can only assume this is to help their phishers stay obfuscated.

    1. talk_is_cheap

      Re: Whatever

      When did you decide that it was Cloudflare's job to police the web, and when did you get all the governments of the world to grant it the powers required to operate in such a role?

      1. Kevin McMurtrie Silver badge

        Re: Whatever

        It's not about policing random abusers or trolls. Cloudflare has been protecting the identities of sophisticated criminal gangs for over a decade. They send phishing and catfishing emails and SMS. They advertise their fake stores on major websites.

        And they're not just good fake stores: they ship counterfeit products so that some people won't realize their credit card details have been stolen.

        If you're willingly assisting criminals and profiting from it, you're a criminal. It's as simple as that. Cloudflare does it because it's good for their protection services.

        1. Like a badger

          Re: Whatever

          "If you're willingly assisting and profiting with criminals, you're a criminal. It's as simple as that. Cloudflare does it because it's good for their protection services."

          In which case you should focus your attention on the payment services and banks that actually move the money for the criminals, and the countries that host them.

  3. sarusa Silver badge
    Devil

    robots.txt

    robots.txt is as useful as 'do not track' flags. They only work for the people you're not worried about. It assumes the people looking at it have any ethical concerns at all, and nobody involved in LLM 'AI' does. The whole model is based on theft, so why would they honor your robots.txt?
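
    For anyone who hasn't looked: here's roughly what a well-behaved crawler does before fetching a page, sketched with only the Python standard library (the domain and the GPTBot user agent are just illustrative). The point is that the whole mechanism is voluntary; a scraper that skips this check loses nothing.

    ```python
    # A minimal sketch of voluntary robots.txt compliance, stdlib only.
    # example.com and the GPTBot user agent are illustrative.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the robots.txt file

    if rp.can_fetch("GPTBot", "https://example.com/some/article"):
        print("robots.txt permits this fetch")
    else:
        print("robots.txt forbids it (but honouring that is optional)")
    ```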

    1. spacecadet66 Bronze badge

      Re: robots.txt

      Far better to honeypot AI scrapers and feed them garbage, but then, bandwidth and compute aren't free.

      1. Anonymous Coward
        Anonymous Coward

        Re: robots.txt

        I was kinda hoping Cloudflare would detect the AI scrapers and feed them mangled versions of the sites, or other content altogether. Maybe just a large collection of random words and (public domain) pictures.

        1. Anonymous Coward
          Anonymous Coward

          Re: robots.txt

          How about just data from /dev/random, to the tune of 2TB per image?
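
          A minimal sketch of the idea, for what it's worth: hand anything that identifies itself as an AI crawler plausible-looking noise instead of the real page. GPTBot and CCBot are real published crawler tokens, but matching on them only catches the polite ones; the wordlist path is a common Unix default and may differ on your system.

          ```python
          # Sketch only: serve junk to self-identified AI crawlers.
          # The wordlist path is an assumption, not guaranteed to exist.
          import random

          AI_UA_TOKENS = ("GPTBot", "CCBot")
          WORDS = open("/usr/share/dict/words").read().split()

          def page_for(user_agent: str, real_page: str) -> str:
              if any(tok in user_agent for tok in AI_UA_TOKENS):
                  # Cheaper than 2TB of /dev/random, and just as
                  # useless as training data.
                  return " ".join(random.choices(WORDS, k=500))
              return real_page
          ```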

  4. Flocke Kroes Silver badge

    We will know AI scraper detection is working when generative AI mostly outputs goatse. Hope the false positive rate does not get very high.

  5. Anonymous Coward
    Anonymous Coward

    > machine learning models defending against bots foraging to feed AI models

    Next Turing test: ask ChatGPT to describe one of these bots and see whether it generates a rant against the quisling Category Traitors.

  6. Quando

    ICE

    William Gibson called it 'ICE': the cloud of AI-generated defences around a corporation, tasked with repelling unwelcome connections.

    1. Throatwarbler Mangrove Silver badge
      Thumb Up

      Re: ICE

      And don't forget "Black ICE," for more proactive intrusion countermeasures.

      1. seven of five Silver badge

        Re: ICE

        Black ICE isn't proactive, it just can't be arsed to deal with repeat offenders :)

        1. ShortLegs

          Re: ICE

          Wintermute

  7. Bachelorette

    Maybe there's an opportunity to fool these scrapers by serving non-sequiturs?

    To combat hotlinkers, it used to be popular to serve NSFW images when people hotlinked images from your website. Maybe there's an opportunity to serve nonsense and non-sequiturs to AI scrapers?

    1. Tom66

      Re: Maybe there's an opportunity to fool these scrapers by serving non-sequiturs?

      A research paper looked into the possibility of embedding hidden patterns into images that generative AI would train on, resulting in malformed models that generate incorrect outputs for given prompts. Unfortunately, it's an arms race: while that might impact some models available today, it won't necessarily impact future ones, which can be trained with defences against that data.
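
      For flavour, here's a toy sketch of the general shape of the idea: an image that looks unchanged to a human but is numerically different to a model. The actual research crafts carefully targeted adversarial perturbations, not random noise, and the filenames here are made up; this is purely illustrative.

      ```python
      # Toy illustration only: "visually identical, numerically different".
      # Real poisoning attacks use targeted perturbations; random noise
      # like this is trivially averaged away in training.
      import numpy as np
      from PIL import Image

      img = np.asarray(Image.open("photo.png")).astype(np.int16)
      noise = np.random.randint(-2, 3, size=img.shape)  # +/- 2 per channel
      poisoned = np.clip(img + noise, 0, 255).astype(np.uint8)
      Image.fromarray(poisoned).save("photo_poisoned.png")
      ```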

    2. Anonymous Coward
      Anonymous Coward

      Re: Maybe there's an opportunity to fool these scrapers by serving non-sequiturs?

      You'd need to try harder than that: perhaps create multiple websites, only one of them genuine, and all the others plausible enough to be accepted as genuine for scraping, but containing malicious content for the bots to grab and re-use. Honeytrap poison ;)

      1. Anonymous Coward
        Anonymous Coward

          Re: Maybe there's an opportunity to fool these scrapers by serving non-sequiturs?

        Roach traps?

  8. Dan 55 Silver badge
    Thumb Up

    "firms trafficking in laundered content"

    This article has won the Internet today.

  9. Anonymous Coward
    Anonymous Coward

    third-party bots used by Perplexity were the ones observed scraping pages

    Attenborough bot series: the wonders of the un-natural world, part one: the predators.

  10. TheMaskedMan Silver badge

    "And with a network that sees an average of 57 million requests per second, Cloudflare has ample data to determine which of these fingerprints can be trusted."

    Now there's a figure that should set some alarm bells jangling. Surely, if they're fingerprinting bots they must be similarly fingerprinting everyone, if only to distinguish bot from non-bot. It's all very well worrying about the likes of Google and Facebook tracking us via cookies (I don't, particularly, but many do), but when a single company hosts a vast swathe of the web's most active sites they don't need cookies - they just hoover up what they want from the server logs.

    Suppose I don't want to participate? Can I stop Cloudflare from hoovering up my data for this or any other purpose? How would I even know it's happening, since server logs don't require one of those pestilential cookie pop-ups? Data hoovering to provide a service to paying customers is apparently the root of all evil when ad slingers do it, and I don't see why a hosting company, with access to all the HTTP headers and server logs, should be any different.

    Nor is scraping the open web tantamount to theft. If you want control over who reads your content, stick it behind a paywall or access-control mechanism. The open web is just that - open - and anyone publishing anything on an open website can reasonably expect to have it read and consumed for just about any purpose, whether you like and approve of it or not.

    Still, I have to give Cloudflare credit for finding a way to ride the AI hype train without all the hassle of building/training models etc. Full marks for crafty thinking; maybe next time do it without stealing visitor data.

    1. uptoeleven

      They don't

      "but when a single company hosts a vast swathe of the web's most active sites” - I'm not sure they do. Pretty sure mostly they host DNS proxies and CDNs...

    2. Anonymous Coward
      Anonymous Coward

      I was in full agreement with you until "Nor is scraping the open web tantamount to theft." Just because I posted it on my website, that does NOT mean it is now public domain for anyone to do with as they will. Visit the site, see the page, read/listen to/watch the content? Yes, that's what it's there for. Download the contents for offline viewing? Questionable, especially things like videos and entire songs. Use it to autocreate (no creativity needed, just a quick text query) "new" content from my copyrighted work, without permission or attribution? Absolutely not.

      Scraping websites is rather like recording broadcast TV. If you're doing it for private viewing later, it's fine. If you're doing it to collage together something "new" from it and publish it, it's copyright infringement.

      1. TheMaskedMan Silver badge

        "Just because I posted it on my website, that does NOT mean it is now public domain for anyone to do with as they will."

        Of course it isn't public domain, and of course you retain the copyright to your work. But the web was created to be consumed, and you have very little control over what it is consumed for, or by whom.

        Making a blatant copy of the site, either online or in print, say, is clearly a no-no, and is relatively easy to remedy via copyright claims if the other party is in a convenient jurisdiction. But it's less clear if the site is scraped for some kind of analysis. Does the bot infringe your copyright by, say, counting the words? If so, how? Does a human, if they do the same thing? What if they count the frequency of each word, and how often it appears next to other words? What if they amalgamate the findings from your site with those from thousands of others into one massive frequency table?

        I can't really see how that, or other analysis, infringes copyright, with the possible exception that a temporary, transient copy of each page is made as the bot retrieves it from your server. If that copy is retained, then maybe there would be a tenuous argument for copyright infringement, but otherwise I can't see it.

        All of which is not to say that, as the author, you can't take steps to prevent people or bodies you don't like from reading your material. Of course you can, but copyright isn't really the answer to my mind.

        1. Bebu
          Windows

          Copies

          "I can't really see how that, or other analysis, infringes copyright ..."

          I basically agree with your argument.

          It gets interesting when you consider a simple case where you have recorded every distinct word in a document and the positions at which it occurs, e.g. "intelligence" occurs on page 12, line 9, word 7; page 13, ...

          Obviously the original document can be reconstructed from that, down to the physical layout. Using only word offsets still preserves the logical content (ignoring punctuation etc.).
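
          To make that concrete, a minimal self-contained sketch (word positions alone are a lossless encoding of the text):

          ```python
          # A word -> positions index is logically a complete copy:
          # the original word sequence falls straight back out of it.
          def build_index(text):
              index = {}
              for pos, word in enumerate(text.split()):
                  index.setdefault(word, []).append(pos)
              return index

          def reconstruct(index):
              slots = {p: w for w, ps in index.items() for p in ps}
              return " ".join(slots[i] for i in sorted(slots))

          doc = "the cat sat on the mat"
          assert reconstruct(build_index(doc)) == doc  # lossless round trip
          ```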

          More lossy representations blur this even more, but I also wonder whether LLMs might be heading towards a holographic representation of their training sets.

          Once you merge word-adjacency frequencies from a large number of documents, faithful recovery of any particular document is nearly impossible. It can probably come close, though, as in many disciplines the self-proclaimed experts spout exactly the same semiliterate nonsense.

          I am always deeply suspicious of any attempt to broaden the scope of, protection for, and duration of intellectual property (copyright, patents...), as it has never benefited those whom the proposed measures purport to protect. If copyright were inalienable and only non-exclusive licensing permitted, most of that nonsense would evaporate: e.g. A. A. Milne could only have licensed Pooh to Disney non-exclusively (and concurrently to anyone else, if he chose to).

    3. BOFH in Training

      "Still, I have to give Cloudflare credit for finding a way to ride the AI hype train without all the hassle of building/training models etc. Full marks for crafty thinking; maybe next time do it without stealing visitor data."

      They claim to be using AI to detect these AI scrapers, so presumably they have built and trained models.

    4. talk_is_cheap

      >> Can I stop Cloudflare from hoovering up my data for this or any other purpose?

      Don't use them as your CDN, and they won't be monitoring requests from end nodes to your servers, as the traffic won't be routed over their CDN.

      >> Nor is scraping the open web tantamount to theft.

      Maybe you should read up on copyright laws before making such a comment.

    5. ShortLegs

      I don't think you understand what Cloudflare does...

  11. FuzzyTheBear
    Mushroom

    How naive

    You really think they're going to care about a robots.txt file ?

    Really ?

    1. Anonymous Coward
      Anonymous Coward

      Re: How naive

      Precisely, which is why (per the article) Cloudflare is detecting and blocking them instead of politely asking them to stop.

      1. PBuon

        Re: How naive

        If the bot is written to disregard robots.txt, it's likely we'll see many other approaches to disguise requests as normal consumers of the resource. It's very difficult to differentiate requests that come from multiple IP addresses and use the User-Agent strings of major browsers.

  12. PBuon

    There is no way to guarantee blocking a less-than-scrupulous AI bot from extracting content from a website. There are many ways an AI client could pretend to be a regular consumer of the website.

    1. Charlie Clark Silver badge

      Sure, but each change takes effort, including recognising you're being blocked. Low-level blocks like this are harder to detect and circumvent than reCaptchas…

      1. Brewster's Angle Grinder Silver badge

        A single system deployed across a network seeing 57 million requests per second is worth some effort. Beat it, and you've access to all that honey.

        1. Charlie Clark Silver badge

          Well, sort of. What tends to happen is that the effort put into circumventing blocks is correlated with the potential reward. This is why anything offering direct financial rewards will escalate from simple scripts, to headless browsers, to automated browsers, to saps being paid to click their way through. Speculative crawls like these are not worth the resources of spinning up browsers: it's probably far more efficient to identify, say, the 10,000 sites you're interested in and pay some saps to click their way through, saving each page as it comes.

    2. talk_is_cheap

      If the AI client has access to hundreds of thousands of non-sequential IP addresses, then it can be well hidden, as it can act as hundreds of thousands of human-speed browsers. If, on the other hand, it is accessing a large number of sites quickly from a small number of IP addresses, it becomes visible in the logs rather quickly. The only real challenge is separating the AI traffic patterns from those that would be generated by a large multi-user proxy.
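
      As a sketch of how visible the clumsy case is, assuming standard common/combined-format access logs (the log path and the threshold are made up, and real fingerprinting looks at far more than request rate):

      ```python
      # Count requests per (IP, minute) from an access log and flag
      # anything sustaining a superhuman rate. Sketch only.
      import re
      from collections import Counter

      # Captures the IP and the timestamp down to the minute, e.g.
      # 1.2.3.4 - - [10/Oct/2024:13:55:36 -0700] "GET / HTTP/1.1" ...
      LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+:\d+:\d+):\d+')

      counts = Counter()
      with open("access.log") as f:          # hypothetical path
          for line in f:
              m = LOG_RE.match(line)
              if m:
                  counts[m.groups()] += 1    # key = (ip, minute bucket)

      for (ip, minute), n in counts.most_common(20):
          if n > 300:                        # arbitrary: >5 req/s sustained
              print(f"{ip}: {n} requests in the minute starting {minute}")
      ```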

      1. Charlie Clark Silver badge
        Stop

        Either you're talking about a botnet, or a very, very rich company which is unlikely to want to risk a copyright-infringement suit.

  13. Brewster's Angle Grinder Silver badge

    Oops...

    I had a Node script automating some work for me. Cloudfart decided it was a bot, and then blocked my IP, so even doing it manually with a real browser didn't succeed.

    1. heyrick Silver badge

      Re: Oops...

      Oh, it's them who feel obliged to perform a security check on my browser, and then one of three things happens:

      1, it works and I get to go to the site

      2, the check fails, I get some garbage Ray ID to look at

      3, the test works and I'm redirected to the test which redirects me to the test

      The worst thing is that this seems essentially random. I can be blocked on one site, and five minutes later Cloudflare on a different site is like "yeah, you're okay". WTaF?

      1. Anonymous Coward
        Anonymous Coward

        Re: Oops...

        I only get that "Ray ID" challenge failure when connecting to 4chan.org with NoScript fully engaged. I've never seen it on any other site... YMMV.

    2. sabroni Silver badge
      Coffee/keyboard

      Re: Cloudfart

      How do you lot come up with these clever, witty names for things? Cloudfart, like cloudflare, but with "fart" instead. Oh my aching sides.

      1. Bebu
        Coat

        Re: Cloudfart

        《How do you lot come up with these clever, witty names for things? Cloudfart, like cloudflare, but with "fart" instead. Oh my aching sides.》

        Take the 'C' out of it for greater effect. ;)

  14. gitignore

    Google bot is the worst

    Google completely ignore robots.txt: over 90% of our traffic is from Googlebot; they seem to have got into a scan loop and are just smashing the site all day, every day. The worst of it is we can't block it, because the remaining 10% is paying customers driven to the site by Google :-/ They have a form to fill in, but who knows if that will be effective (hint: it hasn't been yet). I did discover yesterday, though, that our CDN (not Cloudflare) has a tarpit function, so the AI bots might be meeting that soon...
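
    For anyone wondering what a tarpit amounts to: the idea is just to drip the response out slowly enough that the bot wastes a connection for minutes. A minimal stdlib sketch, with a placeholder bot check; real CDN tarpits are rather cleverer.

    ```python
    # Tarpit sketch: feed suspected bots one byte per second.
    # looks_like_bot() is a naive placeholder, not real detection logic.
    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def looks_like_bot(ua: str) -> bool:
        return "GPTBot" in ua or "CCBot" in ua  # placeholder check

    class Tarpit(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            if looks_like_bot(self.headers.get("User-Agent", "")):
                for ch in b"<html><body>" + b"." * 300:
                    self.wfile.write(bytes([ch]))  # one byte...
                    self.wfile.flush()
                    time.sleep(1)                  # ...per second
            else:
                self.wfile.write(b"<html><body>the real page</body></html>")

    HTTPServer(("", 8080), Tarpit).serve_forever()
    ```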

    1. Anonymous Coward
      Anonymous Coward

      Re: Google bot is the worst

      Google are definitely up to no good (no surprise there for most Reg readers, I'm sure).

      We used to have an "intranet" site that required staff login to access. While checking how our main public site was showing up in Google, I noticed that it was also returning some search results for our intranet site, as it used the same domain, although a different hostname.

      At least it was only showing page titles and not any page content, but it shouldn't even have been possible for it to do that! I don't use Google as my normal search engine nor a Chromium-based browser, so my suspicion was that when some colleagues were accessing the site using Chrome (and possibly other Chromium-based browsers), their browser was sneakily recording their browsing activities and reporting home to GRUgle, perhaps even including the whole page content as well (that'd be an entirely inadvertent result of badly-engineered "browser cache quality telemetry diagnostics" or something, I'm sure…). Luckily there wasn't really anything sooper sekrit on our intranet, otherwise I would have made more noise at the time…
