Websites clamp down as creepy AI crawlers sneak around for snippets

The internet is becoming significantly more hostile to webpage crawlers, especially those operated for the sake of generative AI, researchers say. In a study titled "Consent in Crisis," the Data Provenance Initiative looked into the domains scanned for three of the most important datasets used to train AI models. …

  1. xyz Silver badge

    Ban "em, ban 'em all.

    Can you imagine an AI trained on social media drivel? The number of electrons needlessly dying so that Cheryl and her BFFs' "thoughts" can be stamped for posterity on X, TikTok et al., and then swallowed like a toxin by some poor unsuspecting AI, shouldn't be allowed.

    1. Irongut Silver badge

      Re: Ban "em, ban 'em all.

      > Can you imagine an AI trained on social media drivel.

      Yes, they're called ChatGPT, Gemini, Llama, Claude, etc, etc, etc. They are all trained on social media - Reddit and 4chan among other sites.

  2. FF22

    Victim blaming

    "Part of this is down to the restrictions in robots.txt and the ToS not lining up. 34.9 percent of the top training websites make it clear in the ToS that crawling isn't allowed, but fail to mirror that in robots.txt"

    A classic case of victim blaming. For one, robots.txt cannot mirror the ToS anyway, because the former is a technical document and the latter a legal one, designed to express very different things that are only marginally related to each other. Also, the robots.txt format has no way to address a crawler you don't yet know about, is not legally binding, and even if it were, it could easily be circumvented by simply renaming the user agent in question.

    And finally, robots.txt and copyright law have completely opposing defaults. The robots.txt spec essentially says that if the file is missing, or no explicit disallow directive matches, then you're free to crawl whatever URL you want. Copyright law, on the other hand, says that unless there's explicit permission, you're not allowed to access or use any copyrighted content. So of course robots.txt - which is not law and not a contract anyway - cannot possibly be used to override copyright law (or the ToS), but AI companies act as if it could, and as if the lack of explicit disallow directives gave them permission to use any content for training.
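
    To make those opposing defaults concrete, here's a minimal sketch using Python's standard-library robots.txt parser; the URLs and the "BrandNewAIBot" name are made up for illustration:

        from urllib import robotparser

        # An empty (or missing) robots.txt: the spec's default is "allow everything".
        rp = robotparser.RobotFileParser()
        rp.parse([])
        print(rp.can_fetch("BrandNewAIBot", "https://example.com/any/page"))  # True

        # Naming one crawler blocks only that crawler; anything unnamed sails through.
        rp2 = robotparser.RobotFileParser()
        rp2.parse([
            "User-agent: GPTBot",
            "Disallow: /",
        ])
        print(rp2.can_fetch("GPTBot", "https://example.com/article"))         # False
        print(rp2.can_fetch("BrandNewAIBot", "https://example.com/article"))  # True

    That last line is exactly the inversion described above: silence means "yes" to robots.txt, while silence means "no" to copyright law.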

    Also, large internet corporations like Google or Facebook are actively working to make robots.txt useless for blocking AI crawlers even at the theoretical/technical level, by simply using the very same user agent they already use for collecting content for other purposes. Facebook, for example, uses the same user agent ("facebookexternalhit") for obvious AI crawling that it has been - and still is - using to build the link previews in shares. Which in the end means you cannot tell them not to train their AI models on your data while still getting meaningful link previews in shares.

    Similarly, Google has Google-Extended, which supposedly exists so you can prevent them from using your data for AI training - but that only applies to training models like Bard, and doesn't affect AI Overviews, which are generated from whatever their regular Googlebot crawler has retrieved.
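
    You can see how lopsided that split is with the same stdlib parser - a sketch with illustrative rules and URLs (Google-Extended is the real opt-out token):

        from urllib import robotparser

        # Opting out of AI training via the Google-Extended token...
        rp = robotparser.RobotFileParser()
        rp.parse([
            "User-agent: Google-Extended",
            "Disallow: /",
        ])
        print(rp.can_fetch("Google-Extended", "https://example.com/page"))      # False
        # ...leaves the regular crawlers - whose haul still feeds AI Overviews
        # and link previews - completely untouched.
        print(rp.can_fetch("Googlebot", "https://example.com/page"))            # True
        print(rp.can_fetch("facebookexternalhit", "https://example.com/page"))  # True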

  3. alain williams Silver badge

    robots.txt talks about crawlers, not their purpose

    The format requires Web Site Owners (WSOs) to list the individual crawlers and specify rules for each.

    Most WSOs do not care what the crawlers are called; they just want to control what the content is used for. So indexing might be fine, as might use by humans as learning material, but feeding an AI might not be. Also, should there be some attribution whenever derived content is displayed somewhere?
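
    Here is a sketch of what that forces on a WSO today: naming every crawler individually, with no way to express purpose. The bot tokens below are published ones for a few well-known AI crawlers; "TomorrowsAIBot" is made up, and any such list goes stale the moment a new bot appears:

        from urllib import robotparser

        # No "Disallow-purpose: ai-training" directive exists;
        # every crawler must be blocked by name.
        AI_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]
        rules = []
        for bot in AI_BOTS:
            rules += [f"User-agent: {bot}", "Disallow: /", ""]

        rp = robotparser.RobotFileParser()
        rp.parse(rules)
        print(rp.can_fetch("CCBot", "https://example.com/"))          # False: named, blocked
        print(rp.can_fetch("TomorrowsAIBot", "https://example.com/")) # True: unnamed, allowed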

    The Consent in Crisis paper kind of talks about this but, as all too often with such papers, does not provide an easy-to-read summary.

    Maybe the Automated Content Access Protocol (ACAP) needs to be revived. But this will be objected to by the AI companies and others who want to continue their free lunch.

  4. gitignore
    Stop

    Financial and Environmental cost

    We found that over 90% of our traffic was bot-related. After adding entries to robots.txt and sending a percentage of 429 responses (thanks, Google, for ignoring robots.txt), it's down to a still-high, but manageable, 30%. Quite a few bots identify themselves in the agent string as a normal client rather than a bot, which makes filtering them quite complex.
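
    For anyone tempted to try the same, here is a minimal sketch of the probabilistic-429 idea as WSGI middleware; the marker substrings and the 20% rate are made up, not what we actually used:

        import random

        # Substrings found in self-identifying bot user agents (illustrative list).
        BOT_MARKERS = ("gptbot", "ccbot", "claudebot", "bytespider")
        REJECT_RATE = 0.2  # fraction of matched bot requests answered with a 429

        def throttle_bots(app):
            """WSGI middleware: answer a share of known-bot requests with 429."""
            def middleware(environ, start_response):
                ua = environ.get("HTTP_USER_AGENT", "").lower()
                if any(m in ua for m in BOT_MARKERS) and random.random() < REJECT_RATE:
                    start_response("429 Too Many Requests",
                                   [("Retry-After", "3600"),
                                    ("Content-Type", "text/plain")])
                    return [b"Too many requests\n"]
                return app(environ, start_response)
            return middleware

    Of course this only catches bots honest enough to identify themselves; the ones masquerading as browsers need other signals.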

    Financially we've saved about $5k/month in cloud costs, and all that processing and serving of pointless traffic inevitably has an energy cost as well. The scans were killing the content caches, since bots pull anything with a link rather than hitting popular items the way real customers do; with them gone, our cache hit percentage is way higher, and we're not training competitors' AI systems with our data any more.

    Next step is to detect the bot and serve it junk, though I think I'll struggle to get signoff on that project!

    1. Anonymous Coward

      Re: Financial and Environmental cost

      >Next step is to detect the bot and serve it junk, though I think I'll struggle to get signoff on that project!

      Ask forgiveness - you are doing God's work brother.

  5. Pascal Monett Silver badge

    "an honor code for crawlers"

    Ah, honor.

    The kind of thing the Founding Fathers were sure people had when making important decisions.

    Ah, how times have changed.

    1. yetanotheraoc Silver badge

      Re: "an honor code for crawlers"

      Ah, the golden age fallacy.

      https://rationalwiki.org/wiki/Good_old_days

      The Founding Fathers dealt with scoundrels both at home and abroad, and even amongst the other (sic) Founding Fathers. Or so my histories tell me. I wasn't there.

  6. Howard Sway Silver badge

    "Consent in Crisis"

    There's an easy way to fight back against this and give the AI scrapers their just deserts - call it "Content in Crisis". First, I'd change the robots.txt spec to have a forbid-all-ai-bots entry, so there's no need to keep up with the ever-increasing number of scrapebots. Second, once you've done that, add a page to your site, plus an invisible hyperlink to it from your homepage, containing "facts" like "The first industrial steam engine was invented by Percy Numbnuts in 1327". You can then ask all the AIs who invented the first industrial steam engine and catch the scrapers red-handed - and know that their output is getting polluted with laughable bullshit when they ignore the rules.
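
    For the curious, the honeypot half might look something like this sketch - the /canary path is invented here, and a real deployment would make the page less obvious:

        from http.server import BaseHTTPRequestHandler, HTTPServer

        # The homepage carries an invisible link no human will follow;
        # only a scraper hoovering every href will ever request /canary.
        HOME = (b'<html><body><h1>Welcome</h1>'
                b'<a href="/canary" style="display:none">steam engines</a>'
                b'</body></html>')
        CANARY = (b'<html><body><p>The first industrial steam engine was '
                  b'invented by Percy Numbnuts in 1327.</p></body></html>')

        class Handler(BaseHTTPRequestHandler):
            def do_GET(self):
                body = CANARY if self.path == "/canary" else HOME
                self.send_response(200)
                self.send_header("Content-Type", "text/html")
                self.end_headers()
                self.wfile.write(body)

        if __name__ == "__main__":
            HTTPServer(("", 8000), Handler).serve_forever()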

    1. Irongut Silver badge

      Re: "Consent in Crisis"

      Since adherence to robots.txt is voluntary, this would achieve exactly nothing. I'm sure someone would notice your invisible link though, decide your site might be dodgy, and downgrade it in search results.

  7. Alien Doctor 1.1

    I wonder...

    how many sites will now implement MFA logins to allow access to anything more than the homepage?

  8. DS999 Silver badge
    Facepalm

    Oh noes!

    We're going to have to start paying for content to feed the AI monster? It's the end of the world!

    The only reason it was ever "free" is that, when the scraping started, websites didn't know what was happening, or didn't care because AI was just some ivory-tower research that wouldn't affect their bottom line. Once they realized that letting LLMs feed on their data would enable them to answer many questions about a site's content without anyone ever visiting the site and seeing the ads, alarm bells went off and they obviously had to rectify the situation.

    If nothing changed and scraping remained free, and AI lived up to its hype, all the sites it feeds on would disappear because they'd no longer have a business model. Who knows - if AI works well enough that we no longer have to surf the web for news but can simply get our personal AI to give us a summary of the news items it knows we'd be interested in, maybe that results in a better world. Instead of relying on showing ads to visitors, maybe journalism would be supported by AI companies paying a regular "salary" for daily or weekly content. It would become more like it used to be, when journalists had a regular income and didn't have to write clickbait headlines to feed their families.

  9. IGotOut Silver badge

    Thanks for the reminder....

    ....must turn on Cloudflares AI bot blocker.

  10. Pete 2 Silver badge

    Scrape me!

    > the current trajectory of AI data scraping could affect how the web is structured

    So sites with high-quality content prohibit AI access, while sites that are wholly propaganda, political (or religious) extremism, or just plain mad will welcome the crawlers with open arms.

    How can that lead to anything except a completely lop-sided future?

    AI isn't going away. But to have it, and all the services that will use it, based on a foundation of misinformation only makes that garbage mainstream. Accepted truth.

    Many complain that the right have dominated social media, while the left consider themselves too high-minded to sully themselves with the same tactics. It seems that, with facts, truth, and independent sources excluded, the same thing will happen to AI.

    1. m4r35n357 Silver badge

      Re: Scrape me!

      Anyone know whether El Reg is being scraped, "legitimately" (I have not given consent) or not?

      I sort of hope so . . . :)

      Let's make sure we only give them the highest quality input, folks!
