Cloudflare builds an AI to lead AI scraper bots into a horrible maze of junk content

Cloudflare has created a bot-busting AI to make life hell for AI crawlers. The network-taming company built the tool after noticing that almost one percent of all requests to access web content that it can see now come from AI crawler bots. Those bots are probably scraping data that’s gathered up to train AI models. Web site …

  1. Anonymous Coward

    AI generated content is poison for AI

    This is rather deep.

    AI generated content has been shown to very quickly poison any AI built on it. Even if the content itself is perfectly fine. So this strategy not only protects websites from unwanted visitors, it also will help us to more easily recognize the resulting bad chatbots.

    However, as much of the content published on the internet is already AI generated, it might not change that much for the crawlers.

    But if the expected arms race leads to innovative AI being able to recognize AI generated content, that would itself be a valuable outcome.

    1. Scotech

      Re: AI generated content is poison for AI

      Depends on the crawler. The best ones are built on the back of a traditional search engine crawler's index, which includes cues regarding page keyword rankings and site authority. They can incorporate this data into the training, and from there, it can affect the weightings in the resulting model. So what happens if a highly authoritative site introduces a bucket load of AI slop into the mix?

      Should be fun to watch!

      1. Mike007 Silver badge

        Re: AI generated content is poison for AI

        We were looking at the settings for our client sites that are behind Cloudflare and saw the "block AI crawlers" option. The first thing my colleague asked was "will that block Google?". Good question... We did not turn it on.

        1. mark l 2 Silver badge

          Re: AI generated content is poison for AI

          TBH even if it did block the Google bot, it's been a long time since Google search was useful. Google only care about those willing to pay for sponsored listings. So it won't be long before the entire 1st page of Google is just their AI generated slop and paid for rankings.

          1. Groo The Wanderer - A Canuck

            Re: AI generated content is poison for AI

            It already is; the relevant links are scrolled below the edge of my screen after the obligatory banner ad...

      2. Doctor Syntax Silver badge

        Re: AI generated content is poison for AI

        If they follow the search engine in obeying robots.txt then it shouldn't be an issue.

    2. frankyunderwood123

      Quote: But if the expected arms race

      Well written - "expected".

      It may be the entire race just cools down due to lack of a killer commercial app and thus the huge returns.

      We're in a hype cycle, as everyone knows, to the point where the top dogs in tech are serving up hardware touted as being AI Ready.

      All of that hype has been a massive dud for the billions of consumers, because there's no compelling application for Joe Average Public to get excited by.

      They can make creepy AI generated images or get AI to spit out a resume or a reply to an email, but that's fun for about 5 minutes.

      Meanwhile, the actual use case for AI is going great guns - research.

      LLM's are cutting the time it takes to do deep research by 50% or more, which is impressive - this is where it is useful tech.

      For consumers? - FFS, Microsoft trying to boost PC sales by calling them "AI ready"?

      Or Apple, doing only what Apple can and pretending they invented it by renaming it, doing the "Apple Intelligence" crap?

      It's Mr. Clippy on steroids, that's it - that's all you get.

      It feels impressive until you realise the "black box" you are chatting with has walls, has a point where it hallucinates because it isn't AI.

      It was never AI.

      It's pattern recognition - very clever pattern recognition, but still just pattern recognition.

      The reason it's boomed is because compute power - suddenly, that pattern recognition can be near realtime.

      And then we get the tech bros warning about AI - the same damn assholes creating it - and they are warning about it to boost it.

      Hype cycle.

      AGI? - decades or centuries away.

      1. Anonymous Coward

        Re: Quote: But if the expected arms race

        There is an echo in here!

      2. Edward Ashford

        Re: Quote: But if the expected arms race

        "LLM's are cutting the time it takes to do deep research"

        And then some poor human has to check it hasn't just made stuff up, and the references are actually real.

    3. DrXym

      Re: AI generated content is poison for AI

      I'm sure images could be watermarked as AI to stop accidental ingestion but it must be harder for text. While you could probably "watermark" text in certain ways I wonder if any AI does.

    4. frankvw Bronze badge
      Holmes

      Re: AI generated content is poison for AI

      "AI generated content has been shown to very quickly poison any AI built on it."

      Given that AI tries to mimic organic intelligence (I won't call it real intelligence, given some of the people I encounter with depressing regularity) this is actually not surprising. Consider the drivel one encounters on social media, which poisons the minds of those consuming it, who then go on to spread more of it.

      Art does imitate nature.

    5. Anonymous Coward

      Re: AI generated content is poison for AI

      > "If the expected arms race leads to innovative AI being able to recognize AI generated content, that would itself be a valuable outcome."

      To be honest, I'd consider it a more valuable outcome if that *didn't* happen.

      But yeah, that's a problem with "fake content" methods- the possibility that once you can recognise it, it's easier to filter out. (*)

      Then again, once they've used it to train their models, it's generally- as far as I'm aware- impossible to separate out that contamination from everything else it's learned since. So, even if it's later identified, it may have served its purpose regardless.

      Of course, that works better if you're happy to deliberately generate maliciously flawed information because you're not a fan of gen-AI and/or because you want to screw over the type of people who would ignore robots.txt and steal from you regardless.

      (*) I've been critical of approaches to (e.g.) web browsing privacy that depend upon obscuring actual usage among large amounts of auto-generated fake data sent by your browser, or whatever. If someone retains all your data and *later* figures out how to identify the fake filler you thought would protect you, they have your real data minus the fake crap anyway, and the false sense of security may have left you worse off.

  2. Omnipresent Silver badge

    Clone wars

    Begun, they have.

    1. Roopee Silver badge
      Pirate

      Re: Clone wars

      Good analogy. Incoming shite! Lots more of it - to add to what has come in already... probably best not to step in it :)

  3. Winkypop Silver badge
    Alert

    Let me off the ride

    I’m feeling sick.

  4. Homo.Sapien.Floridanus

    bad bots, bad bots

    watcha gonna do?

    watcha gonna do

    when they come for you?

  5. Matt 52

    Ironic

    Recent estimate is 10% of online content is created by AI - so we now have AI articles being used to replace AI articles.

  6. Roopee Silver badge
    Terminator

    Karma...

    ...but undoubtedly with unintended consequences. Not necessarily fun ones.

  7. thod

    Cloudflare toll

    I am sure Cloudflare will now sell priority access to the sites for the good bots (e.g. those that pay it for it).

  8. Anonymous Coward

    2 hopes and one is Bob

    There are lots of open source AI scrapers available that adapt to anything thrown at them.

    The one we use to IQ you returds has already countered this.

    1. Anonymous Coward

      Re: 2 hopes and one is Bob

      I am always interested in understanding both sides of an argument, which is one of the reasons I check The Register comments so carefully.

      In all the years I have been ingesting both sides of arguments, this is the first time I have felt the need to ask for a gagging order!

      1. Anonymous Coward

        Re: 2 hopes and one is Bob

        Yeah, that's the proper course of action to right-up hasten the advent of the singularity, with more positive feedback .txt-on-steroids roboffinry, rising technogasmic instability, and regurgitated death metal cacophony. The mother of all call stack recursion overflows into denial of operational logic vulnerabilities. The black hole that lurks inside the black hole, likely next to Uranus, flaring out as a gas giant cloud!

        The epistemectomy is strong on this one ...

    2. MOH

      Re: 2 hopes and one is Bob

      Bad bot

    3. Anonymous Coward

      Re: 2 hopes and one is Bob

      Your ideas are intriguing to me, and I wish to subscribe to your newsletter.

  9. Dan 55 Silver badge

    Nepenthes

    Did Cloudflare just implement Nepenthes or do something significantly different to that?

    The blog page says "To generate convincing human-like content, we used Workers AI with an open source model to create unique HTML pages on diverse topics" so it sounds like Nepenthes with their own training data.

    1. DanAU

      Re: Nepenthes

      Cloudflare's implementation is completely different. They pre-generate the content and store it in blob storage (R2) rather than generating it on demand, so that it doesn't affect runtime performance. They also ensure the content is factually accurate, whereas Nepenthes is just a Markov chain.
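
For anyone unfamiliar with the term, a word-level Markov chain text generator can be sketched roughly as below. This is only an illustration of the general technique, not Nepenthes's or Cloudflare's actual code; the function names `build_chain` and `generate` and the sample corpus are made up for the example.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, length, rng=None):
    """Walk the chain from `start`, emitting at most `length` words."""
    rng = rng or random.Random()
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: nothing ever followed this word
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


# Hypothetical sample corpus, purely for demonstration.
corpus = "the bot reads the page and the bot follows the link"
chain = build_chain(corpus)
print(generate(chain, "the", 12, random.Random(0)))
```

Each word is chosen only from words that followed the previous one in the source text, which is why the output looks locally plausible but carries no meaning overall.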

      1. Dan 55 Silver badge

        Re: Nepenthes

        If these are AI generated pages, there's no guarantee they are factually accurate unless there is a human reviewing them.

        Also, why would they want to play nice and do an AI crawler's homework for it? Far better to feed it misinformation to make the AI's output worthless for users as the one thing an AI company hates doing is retraining as it's expensive and time consuming.

  10. KayJ

    “No real human would go four links deep into a maze of AI-generated nonsense”

    I wonder if they'd be up for a friendly wager...

    1. Bebu sa Ware
      Coat

      Obviously the CloudFlare writer doesn't get out much...

      “No real human would go four links deep into a maze of AI-generated nonsense”

      "I wonder if they'd be up for a friendly wager..."

      Candy from a kid, I would think.

      Clearly the writer hadn't visited MAGA world as just the anti-vax internet rabbit holes are more than four deep chock-full of nonsense that would make any AI confabulator blush.

      This AI honeypot is supposedly baited with an unrelated collection of factual material which is asserted protects the later users of the poisoned AI from misleading output. I would imagine one of the more common types of misleading information is claiming two or more facts are causally related when they are demonstrably not.

  11. Doctor Syntax Silver badge

    I think I'd be inclined to start by just throwing the crawler randomly selected words from the files in /usr/share/dict, then pick out clusters of those words and throw them into the steam, also at random. That way the LLM gets lots of meaningless associations of words to chomp on so that instead of hallucinating stuff that looks real it would be hallucinating random lists of words. For added maliciousness sprinkle in about 50% randomly selected words from PR spiels.
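
A rough sketch of the idea in the comment above, assuming a plain word list with one word per line. The path `/usr/share/dict/words` is an assumption (the comment only mentions the directory), and `load_words` and `junk_text` are hypothetical names. The PR-spiel vocabulary is left out.

```python
import random


def load_words(path="/usr/share/dict/words"):
    """Read a one-word-per-line dictionary file (the path is an assumption)."""
    with open(path, encoding="utf-8", errors="ignore") as fh:
        return [line.strip() for line in fh if line.strip()]


def junk_text(words, count, cluster_size=3, rng=None):
    """Return `count` words: a random mix of single words and repeated clusters."""
    rng = rng or random.Random()
    # Pre-pick a few fixed clusters so the same meaningless word
    # associations recur throughout the output.
    clusters = [rng.sample(words, cluster_size) for _ in range(5)]
    out = []
    while len(out) < count:
        if rng.random() < 0.5:
            out.extend(rng.choice(clusters))
        else:
            out.append(rng.choice(words))
    return " ".join(out[:count])


# Small inline list so the demo runs without a dictionary file.
demo_words = ["alpha", "beta", "gamma", "delta", "epsilon"]
print(junk_text(demo_words, 20, rng=random.Random(0)))
```

With a real dictionary, `junk_text(load_words(), 500)` would produce a page-sized block of noise.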

  12. Groo The Wanderer - A Canuck

    If only it were that simple. But some of these scrapers completely ignore things like robots.txt because they are unethical thieves!

  13. Andrew Scott Bronze badge

    ClodFlare

    Tried to look at the current offerings and specs of Seagate drives yesterday and got blocked by Clodflare. Not a bot. Didn't see any AI generated content, couldn't visit the site at all. Tried different browsers as I read somewhere that Firefox was tagged as suspicious by some sites but got the same result with Chrome. They seem a bit incompetent.

  14. anthonyhegedus Silver badge
    FAIL

    How long before we start noticing regular websites spouting gibberish because Cloudflare mistakenly thinks we are an AI?

  15. ecofeco Silver badge
    Facepalm

    So the AI wars has begun

    That quote I saw a few months ago seems more true than ever: "In the future, AI will argue with other AI about the meaning of Christmas while people scavenge for food in trash bins."

    I did not expect it to arrive today. I can haz cheeseburger?

    1. Bebu sa Ware
      Facepalm

      Re: So the AI wars has begun

      "In the future, AI will argue with other AI about the meaning of Christmas while people scavenge for food in trash bins."

      Just outsourcing the status quo to AI:

      "Arseholes argue with other arseholes about the meaning of Christmas while people scavenge for food in trash bins."

  16. Anonymous Coward

    Broken CAPTCHA

    Dear Cloudflare, I am not a bot!

    Please let me use https://ahrefs.com/backlink-checker.

    Or make the captcha more challenging, if necessary.

  17. mildesten
    Coat

    Obligatory pop culture reference

    "Because ice, all the really hard stuff, the walls around every store of data in the matrix, is always the produce of an AI, an artificial intelligence. Nothing else is fast enough to weave good ice and constantly alter and upgrade it. So when a really powerful icebreaker shows up on the black market, there are already a couple of very dicey factors in play. Like, for starts, where did that product come from? Nine times out of ten, it came from an AI, and AI's [sic] are constantly screened, mainly by Turing people, to make sure they don't get too smart." - William Gibson, Count Zero Interrupt

    Seems like the only parts of this we're missing to date are the Turing Agency and the psychedelic, direct-to-brain UI. If current so-called AI is smart enough to do this sort of metaphorical dance, then it looks like we're in for a bit of an arms race, where scrapers and targets have their AIs try to outmanoeuvre each-other. If not...well, it'd still be nice to have a real-life Turing Agency - if only to put paid to the as-of-yet unsubstantiated "it slices, it dices, it replaces all of your interns"-type claims that are floating around these days.

  18. frankyunderwood123

    Hype Cycle

    It may be the entire race just cools down due to lack of a killer commercial app and thus the huge returns.

    We're in a hype cycle, as everyone knows, to the point where the top dogs in tech are serving up hardware touted as being AI Ready.

    All of that hype has been a massive dud for the billions of consumers, because there's no compelling application for Joe Average Public to get excited by.

    They can make creepy AI generated images or get AI to spit out a resume or a reply to an email, but that's fun for about 5 minutes.

    Meanwhile, the actual use case for AI is going great guns - research.

    LLM's are cutting the time it takes to do deep research by 50% or more, which is impressive - this is where it is useful tech.

    For consumers? - FFS, Microsoft trying to boost PC sales by calling them "AI ready"?

    Or Apple, doing only what Apple can and pretending they invented it by renaming it, doing the "Apple Intelligence" crap?

    It's Mr. Clippy on steroids, that's it - that's all you get.

    It feels impressive until you realise the "black box" you are chatting with has walls, has a point where it hallucinates because it isn't AI.

    It was never AI.

    It's pattern recognition - very clever pattern recognition, but still just pattern recognition.

    The reason it's boomed is because compute power - suddenly, that pattern recognition can be near realtime.

    And then we get the tech bros warning about AI - the same damn assholes creating it - and they are warning about it to boost it.

    Hype cycle.

  19. Long John Silver Bronze badge
    Pirate

    Captcha misery

    Two matters arising.

    1. CAPTCHAs are becoming an increasing irritant. Some sites, e.g. Yandex, pop up CAPTCHAs frequently while visiting them. Some CAPTCHAs contain elements hard to distinguish. Also, confrontations with sets of images within which one must identify cars, motorcycles, bicycles, buses, stairs, bridges, and so forth, add a tedious and repetitive element to online life.

    Unlike irritants imposed by the marketing industry, there is no easy workaround/block. Moreover, when for privacy one uses various tools available within browsers, e.g. a user-agent switcher, these can set off 'suspicion' of bot intrusion.

    The article mentions that AI controlled scrapers are beginning to find means for circumventing CAPTCHAs; this must lead to increasingly difficult and/or time-consuming challenges for humans. Shall somebody create a browser add-on linked to an AI which is capable of negotiating CAPTCHAs? The human could sit back and ponder 'the meaning of life' whilst a tussle between software comes to its conclusion.

    2. Isn't it time to absorb many acronyms in long use, e.g. CAPTCHA, into the language as nouns? Some could merit upper-case first letters (proper-nouns). Text with blocks of capital letters becomes unsightly. Also, each block of letters assumes implied importance compared to the rest of the text. Henceforth, 'captcha'?

    1. thosrtanner

      Re: Captcha misery

      In respect to your point 2 - CAPTCHAs are ugly in reality, so I see no reason to dignify them by treating the name as a noun. Make them stand out as horribly in text as they do on web sites you're trying to visit.

  20. Adam Foxton

    "Crawler operators ignore the instructions in robots.txt files, or work around CAPTCHAs and web server settings"

    Surely this settles the case of if AI training crawlers are legal? If there is a robots.txt file and they ignored it, that's unauthorised use. Regardless of if a human would see it or not (which seems to be a common defence), this has been a recognised standard for machine access for decades.

    In practice this won't settle it- lawyers still need years' worth of work and corporations need to spend time harvesting investors' money- but in an ideal world shouldn't this be a pretty easy judgement?

  21. TheMaskedMan Silver badge

    "Surely this settles the case of if AI training crawlers are legal? If there is a robots.txt file and they ignored it, that's unauthorised use."

    Not quite - said robots.txt must also explicitly ban AI training crawlers. At that point, I agree that the site has gone far enough to indicate that such crawlers are not welcome, though some kind of in-page NOAI tag akin to NOINDEX would probably be helpful, too. I would also add something to the footer of every page, explicitly stating that the content is not for use by AI crawlers.

    With these measures in place, there is no need for this silly arms race - one has enough tools available to issue cease and desist letters and seek reparations if your content is found in training data etc. The important point, however, is that all of these measures must be in keeping with the current permissive nature of the net. Search engine bots will crawl your site unless you tell them not to, and it's only reasonable to adopt the same approach with AI bots.

    If you really must feed them twaddle - and let's face it, much of the human generated web is, and always has been twaddle - why not feed them an infinitely long page, generated on the fly? Sooner or later, the thing is going to run out of memory and crash. Assuming that the bot is capable of rendering JavaScript, you could even drop in a little infinite loop and let it feed itself infinite twaddle. I'm sure there are plenty of ways to play silly buggers with these things - blocked IP addresses, maybe.

    But I won't be doing any of those things. I don't care if they want to crawl my sites - indeed, they are welcome to do so. The web has always been a place to publish stuff, on the understanding that some people you don't like or approve of are going to read it. Yes, you can get arsey and squeal that they're infringing your copyright in some way (nobody has yet managed to show me exactly how an AI bot is doing that, though, with the exception of ingesting pirate material) but the fact is that if you want to control access to your content you need to implement measures to do that, otherwise material published on the open web is pretty much fair game.

    It's easy enough to do; put it behind a paywall, or even a password protected directory - no password, no read. But that buggers up your SEO, doesn't it, and those lovely search engine bots - which will also crawl your site and make copies, but nobody seems to object to that - won't be able to get in.

    Unfortunately, this AI malarkey is here to stay - sometimes it's even useful. Might as well stop wasting effort in an arms race you can't win and get on with your own thing.
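
To illustrate the point at the top of this comment about robots.txt explicitly banning AI crawlers, here is a minimal sketch of how a well-behaved crawler might check such a file using Python's standard `urllib.robotparser`. The user agent `ExampleAIBot`, the other bot name, and the URL are placeholders invented for the example, not names any real crawler is known to use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ban one named AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.com/articles/1"
print(may_fetch(ROBOTS_TXT, "ExampleAIBot", url))
print(may_fetch(ROBOTS_TXT, "SomeSearchBot", url))
```

The check is entirely voluntary on the crawler's side, which is the crux of the argument in this thread: nothing in the protocol stops a bot that never calls it.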

  22. thosrtanner

    I'd be happier with this if Cloudflare hadn't recently decided to randomly block the browser I used because it didn't behave identically to the latest Chrome engine. It wasn't until a report appeared on this site that they actually contacted the makers of affected browsers.

  23. OllieJones

    Wasteful?

    The AI companies are running short of electricity to power all those GPUs. And the best way to counter their scrapers is to waste even more of their electricity processing slop. Whiskey. Tango. Foxtrot.

    The need for this kind of scraping countermeasure is clear. But it's tragic. This here internet was built to shrink the world, not cook it.

    Maybe some sane trade association or government regulator can see this, say "holy s__t this whole thing has gotten ridiculous, let's set up a scraping code of ethics."

    I suggest something like the DKIM stuff in email. Legitimate email senders include a signature signed with a private key declaring the origin of the mail. Why can't we do something similar for "ethical scrapers", that is, scrapers that obey robots.txt and rate limit themselves decently?

    The situation we have now will lead to recursive craniorectal inversion.

  24. Captain Tinkleberry

    Just Go away

    Hmm,

    I find it strange that causing an AI bot to run around wasting resources including electricity is seen as a good thing rather than just blocking the bot.

    Whether or not you like AI (I use it quite a bit), poisoning someone else's data is not a good way to proceed - just as an AI bot stealing someone's data is not a good way to proceed.

    Can this only be switched on if the admin has bothered to update their robots.txt?

    Seems like a somewhat malicious way to go about things. Just tell the bot to go away and block it.

    1. I could be a dog really Silver badge

      Re: Just Go away

      If the bots (or rather, their operators) were ethical then the vendors would have come up with a robots.txt extension or similar to see whether a host allows them to freeload for commercial gain. AFAIK they haven't.

      If they were ethical then it would be enough to say "no AI bots" and your site wouldn't get illegally downloaded and used for someone else's financial gain. Again, they don't - there may be some that do, but clearly many don't.

      Just imagine. If you found burglars helping themselves to the contents of your house, would you a) put the kettle on and make them a pot of tea, or b) do your utmost to stop them appropriating and benefitting from your hard earned stuff ? As I see it, this is simply a way of punishing the less ethical bot operators - and I would suggest they deserve it.

  25. chuckamok

    Will this reduce hallucinations?

    Wonder if there are effects that can be measured?

  26. chuckamok

    It's biomimicry, not so strange!
