Cloudflare tightens screws on site-gobbling AI bots

Cloudflare on Monday expanded its defense against the dark arts of AI web scrapers by providing customers with a bit more visibility into, and control over, unwelcome content raids. The network biz earlier this year deployed a one-click AI bot defense to improve upon the not very effective robots.txt mechanism, a way websites …

  1. Sora2566 Silver badge

    The only contract I want with AI is to keep them off my website.

    1. Pascal Monett Silver badge

      Indeed. Here's the deal: fuck off.

  2. wolfetone Silver badge

    ""Some customers have already made decisions to negotiate deals directly with AI companies," explained Sam Rhea, a member of Cloudflare's emerging technology and incubation team. "Many of those contracts include terms about the frequency of scanning and the type of content that can be accessed. We want those publishers to have the tools to measure the implementation of these deals.""

    This should read:

    "We've noticed some customers have money to give these AI companies to stop them scraping their content. We'd like some of that money instead, and we know our customers are good for it. So we're introducing this service for the long game of getting more money".

    1. SnailFerrous

      Or in abridged form.

      Nice web site you've got there. Be a shame if it got scraped.

      1. wolfetone Silver badge

        TikTok version:

        Money. Gimme.

  3. Groo The Wanderer

    And how much does this auditing cost, or do you have to be "big enough player" to have a Cloudflare account in the first place?

    Sounds like Cloudflare wants in on that lucrative AI budget. "Pay us so you don't have to pay them."

    1. Irongut Silver badge

      > "big enough player" to have a Cloudflare account

      I have a Cloudflare account for my blog. A single person is big enough for Cloudflare.

      You can keep your FUD and go back to serving your non-intelligent LLM overlords.

      1. wolfetone Silver badge

        "A single person is big enough for Cloudflare."

        For now.

    2. IGotOut Silver badge

      Maybe if you took one second to look, you'd find out.

      Instead, just feel free to spout your ignorance here.

  4. Anonymous Coward

    I remember, during the early Web, when there were bots crawling the web looking for email addresses to harvest for their spam campaigns, when someone, I don't remember who, had a clever little trick that they did for the spambots who naïvely would click on every link on a website and search for email addresses — they'd have a link that just generated hundreds of random, completely unique, but totally fabricated email addresses for the bots to harvest. And at the end of every page of that dynamically-generated page, they'd have a link to the same URL, which would then generate more and more email addresses in hopes that it would tie up those bots with harvesting thousands of low-quality fabricated email addresses.
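
    A minimal sketch of that kind of trap page, not the original author's code: the function names, the `/contacts` path, and the page size are all made up for illustration. Addresses use the reserved `.invalid` top-level domain so none of them can belong to a real mailbox, and they are random rather than guaranteed unique.

```python
import random
import string


def fake_addresses(count, rng):
    """Return `count` made-up addresses under the reserved .invalid TLD."""
    letters = string.ascii_lowercase
    addresses = []
    for _ in range(count):
        user = "".join(rng.choice(letters) for _ in range(rng.randint(5, 10)))
        host = "".join(rng.choice(letters) for _ in range(rng.randint(4, 8)))
        addresses.append(f"{user}@{host}.invalid")
    return addresses


def trap_page(page, count=200, path="/contacts"):
    """Build one page of junk addresses plus a link back to the same URL."""
    rng = random.Random(page)  # seeded so a given page number is reproducible
    items = "\n".join(f"<li>{a}</li>" for a in fake_addresses(count, rng))
    # Hypothetical path: the link only bumps the page number, so a bot that
    # follows every link never runs out of pages to harvest.
    next_link = f'<a href="{path}?page={page + 1}">More contacts</a>'
    return f"<ul>\n{items}\n</ul>\n{next_link}\n"
```

    Serving `trap_page(n)` for whatever page number arrives in the query string would be enough to keep a naive link-follower busy; hooking it into a real web server is left out here.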

    I mean, obviously the spammers caught on, but I wonder… you could probably do that with AI content-scrapers, now. All you need is something that generates dynamic text that the bot can just scarf, over and over again. I mean, I hear wonderful things happen when LLMs slurp up reams and reams of synthetic data without knowing it.

    It doesn't even have to be computationally expensive. Something incredibly simple like a Markov chain that iterates over something public domain, like… I don't know, the entirety of Project Gutenberg? should be enough. Just enough to tie up the scrapers and waste their time, because of course they scrutinize their sources and would catch this instantly.

    The nice part is that unlike email addresses, there really isn't an easy way to validate synthetic content. Like, a lot of stuff that markov text generators make sound plausible, but ultimately don't make any damn sense.
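
    For anyone curious how little code that takes, here is a rough word-level Markov chain sketch. The names `build_chain` and `generate` are invented for this example, and the corpus is assumed to be one plain-text string (say, a public-domain book you have already downloaded); fetching and cleaning the text is not shown.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` words to the words seen right after it."""
    words = text.split()
    if len(words) <= order:
        raise ValueError("corpus is too short for the chosen order")
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return dict(chain)


def generate(chain, length=50, seed=None):
    """Walk the chain, jumping to a random state whenever it hits a dead end."""
    rng = random.Random(seed)
    states = sorted(chain)  # sorted so a fixed seed gives a fixed output
    state = rng.choice(states)
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:
            # Dead end: this word run only appeared at the very end of the corpus.
            state = rng.choice(states)
            out.extend(state)
            continue
        word = rng.choice(followers)
        out.append(word)
        state = state[1:] + (word,)
    return " ".join(out[:length])
```

    Building the chain is a single pass over the corpus and each generated word is one dictionary lookup, so it stays cheap on the serving side. A higher `order` gives more plausible-looking text at the cost of copying longer runs of the source verbatim.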

    I mean, you could use a text corpus that's patently radioactive, like terrorist literature or (textual) depictions of illegal actions, but that might stand out too much, and of course there's the matter of legality to consider. But, you know… anything to mess with these bastards, even just a little bit.

    1. Jellied Eel Silver badge

      > I mean, you could use a text corpus that's patently radioactive, like terrorist literature or (textual) depictions of illegal actions, but that might stand out too much, and of course there's the matter of legality to consider. But, you know… anything to mess with these bastards, even just a little bit.

      I keep thinking there are probably simpler legal remedies. So a clear terms of use message on your website stating that scraping is not permitted. Then unauthorised use of the website would arguably be illegal under the various forms of Computer Misuse Act(s) in much the same way that hacking a website would be.

      Downside is it would need lawyers to take on the case, but could copy from the patent trolls and take on a small scraper first to get a win and establish the precedent. Same principle should be applicable to all the data harvesters, although they often hide behind the illusion of consent.

    2. Irongut Silver badge

      > I mean, you could use a text corpus that's patently radioactive

      It wouldn't make any difference to Sam Altman or his cronies. After all, the most commonly used picture library for training these non-intelligences contains CSAM.

      Not just may contain but is known to contain.

      What could be more 'radioactive' than that?

    3. Someone Else Silver badge

      > Like, a lot of stuff that markov text generators make sound plausible, but ultimately don't make any damn sense.

      Much like what ChatGPT generates, then.

      Who knew that this whole AI thing was a rebranding of a Markov chain?
