Reddit hopes robots.txt tweak will do the trick in scaring off AI training data scrapers

For many, Reddit has become the go-to repository of community and crowdsourced knowledge, a fact that has no doubt made it a prime target for AI startups desperate for training data. This week, Reddit announced it would be introducing measures to prevent unauthorized scraping by such machine-learning organizations. These …

  1. Pete 2 Silver badge

    Poisoning the honey pot

    > crawlers that shun robots.txt risk getting blocked entirely

    Instead of blocking non-compliant bots, why not feed them a pile of gibberish, misinformation and random garbage?

    No, I'm not suggesting they are forwarded to far-right (or left) propaganda sources, just that the pages they do scrape are subtly altered to replace words with others, opposites or something completely irrelevant. Do the same with any numbers, too. And images can be corrupted such that they do not display.

    It seems to me that the value of AIs is that they can produce responses that users find useful. If their core data is corrupted (as punishment for straying past what robots.txt permits) then their usefulness is vastly diminished, and so is their value - their monetary value.

    That should be enough of a deterrent to force them to play nice.
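    A minimal sketch of that idea in Python, assuming the site has some separate way of spotting a non-compliant bot; the swap table and number fuzzing below are invented for illustration:

        import random

        # Hypothetical swap table; a real deployment would want a far larger one.
        SWAPS = {"increase": "decrease", "true": "false", "left": "right"}

        def poison(text):
            """Subtly corrupt a page before serving it to a known-bad scraper:
            swap selected words for their opposites and nudge any numbers."""
            out = []
            for word in text.split():
                if word.lower() in SWAPS:
                    word = SWAPS[word.lower()]
                elif word.isdigit():
                    word = str(max(0, int(word) + random.randint(-9, 9)))
                out.append(word)
            return " ".join(out)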

    1. Paul Crawford Silver badge
      Trollface

      Re: Poisoning the honey pot

      ...feed them a pile of gibberish, misinformation and random garbage

      Cunning, but some sites already provide that as their core activity?

    2. theOtherJT Silver badge

      Re: Poisoning the honey pot

      When dealing with Reddit, how would they tell the difference?

    3. Doctor Syntax Silver badge

      Re: Poisoning the honey pot

      "just that the pages they do scrape are subtly altered to replace words with others, opposites or something completely irrelevant."

      Not subtly altered, just a stream of completely random words, possibly in multiple languages. LLMs build statistics of word associations. A good dose of white noise should weaken those statistics. Even better, start with a random stream, pick out occasional pairs of words from the stream and feed them back in as if they were genuine associations in real text.
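
      A rough sketch of that feedback trick, with a placeholder vocabulary standing in for a real multilingual word list:

          import random

          VOCAB = ["aardvark", "zephyr", "quanta", "marmalade", "ostinato", "fjord"]

          def noise_stream(n_words, feedback_rate=0.05):
              """Emit random words, occasionally replaying an earlier adjacent
              pair so a scraper records it as a genuine word association."""
              words, pairs = [], []
              while len(words) < n_words:
                  if pairs and random.random() < feedback_rate:
                      words.extend(random.choice(pairs))  # replay a fake pair
                  else:
                      words.append(random.choice(VOCAB))
                  if len(words) >= 2:
                      pairs.append((words[-2], words[-1]))
              return " ".join(words)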

      1. StewartWhite
        Happy

        Re: Poisoning the honey pot

        How about just redirecting them to https://theuselessweb.com/ ?

        Not wildly suitable for LLM data, but I just ran it and it came up with this mildly hypnotic corker: https://puginarug.com/

    4. MrRtd

      Re: Poisoning the honey pot

      "Instead of blocking non-compliant bots, why not feed them a pile of gibberish, misinformation and random garbage."

      How is that different from what Reddit already is?

    5. DS999 Silver badge

      Re: Poisoning the honey pot

      I was thinking the same thing. A news site could feed them stories from The Onion and other satire sites.

      The minute an AI confidently says something like "according to an article in the New York Times, one criticism of the racially biased education system is that it does not mention the historic African-American moon colony" (to pick one headline from The Onion just now), it'll become obvious to everyone (including NYT lawyers) that this AI's training data has been venturing into places it should not.

  2. alain williams Silver badge

    robots.txt is a machine-understandable copyright notice

    There is an assumption that if something is available on the web then it can be downloaded for free and used for any purpose.

    OK, we know that isn't true: there is plenty of material that cannot be used for certain purposes, e.g. not to be sold on.

    AI people like to pretend that they have no way of knowing that content cannot be used by them. This is what robots.txt would be perfect for. A tweak that lists allowed/disallowed purposes would mean that operators of web crawlers could no longer claim ignorance of a web site owner's wishes; this could make it easier to sue them in court.

    The well-funded AI crowd would fight tooth and nail to be allowed to break copyright with impunity, but some large web sites might be able to win and set a precedent... which would probably get ignored unless you had enough money.
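
    Nothing like this exists in the actual robots.txt standard (RFC 9309), but a hypothetical purpose-listing tweak might look something like this, with the purpose directives entirely invented:

        User-agent: *
        # Hypothetical, non-standard directives:
        Allowed-Purpose: search-indexing
        Disallowed-Purpose: ai-training
        Disallow: /private/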

    1. Zippy´s Sausage Factory

      Re: robots.txt is a machine-understandable copyright notice

      At some point Disney are going to have to decide whether they want to use AI to generate everything in the future, or sue AI out of existence to protect what they already have.

      I can hear their lawyers sharpening their teeth already...

      1. Doctor Syntax Silver badge

        Re: robots.txt is a machine-understandable copyright notice

        They can do both. Generate their own material and sue everyone else.

        1. Michael Strorm Silver badge

          Re: robots.txt is a machine-understandable copyright notice

          You mean that the company that built its early success on numerous versions of public domain fairy tales then pushed for extensions to copyright law to stop people doing the same to *their* work might just as hypocritically want to have its cake and eat it in other areas too?

          I'm shocked, *shocked*.

      2. MrBanana Silver badge

        Re: robots.txt is a machine-understandable copyright notice

        I had Disney in my paste buffer as I was reading the article, but you beat me to it. I guess they need to ask the head in the cryogenic chamber what to do next.

        And yes, lawyers, the only winners.

    2. Dinanziame Silver badge
      Boffin

      Re: robots.txt is a machine-understandable copyright notice

      The robots.txt file is fundamentally different from a copyright notice in that it does not have any legal strength. And even if it did, it would be unenforceable. It is the equivalent of displaying a poster in public with the note "please only read this if you are allowed". In the first place, it was not created to protect the content of the site, but to prevent server overloads and unintended actions ("click this link to delete this wiki article").
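
      For context, a minimal robots.txt in that original spirit does no more than steer well-behaved crawlers away from expensive or destructive endpoints (the paths below are illustrative):

          User-agent: *
          Disallow: /cgi-bin/       # expensive scripts
          Disallow: /wiki/delete/   # links a crawler should never "click"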

      In any case, technically there should be no need for a copyright notice — all content is protected by copyright by default, notice or no notice.

      1. PhoenixKebab

        Re: robots.txt is a machine-understandable copyright notice

        Agree 100% on the point that robots.txt is a technical file, not a legal contract.

        "In any case, technically there should be no need for a copyright notice — all content is protected by copyright by default, notice or no notice."

        True, but the existence of a copyright notice makes it clear that you are staking your claim as the rights owner.

      2. omz13

        Re: robots.txt is a machine-understandable copyright notice

        Copyright is separate from usage/licensing rights.

        The problem is that the AI bros assume anything on the internet can be ingested into their models gratis, without any permission asked (or even looked for).

  3. Will Godfrey Silver badge
    Meh

    Hmmm

    Reddit? Wasn't that the company that royally shafted its team of volunteer moderators?

    Do I feel sorry for them... have a guess.

  4. mx2

    Oops, they forgot about https://old.reddit.com/robots.txt

    1. Multipla

      In robots.txt, Allow statements need to come before the Disallow lines, so both versions of the current Reddit robots.txt will allow anyone to scrape content.
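
      For illustration, under the original first-match convention the ordering of the directives changes the outcome (the paths are made up, not Reddit's actual rules):

          # Allow first: /public/ stays crawlable, everything else is blocked.
          User-agent: *
          Allow: /public/
          Disallow: /

          # Disallow first: a first-match parser never reaches the Allow line.
          User-agent: *
          Disallow: /
          Allow: /public/

      Parsers that follow the modern RFC 9309 rules pick the most specific match instead, so the ordering only matters to older first-match crawlers.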

  5. Timo

    do rate limiting on requests?

    I think a company like Reddit could put up a web front end and limit the request rate from an AI company.

    Unless those requests were bundled into a larger stream of general stuff...
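
    A toy sketch of such a front end's rate limiter, using a token bucket keyed on client address; the numbers are made up, and a real deployment would sit in the load balancer or CDN rather than in application code:

        import time
        from collections import defaultdict

        RATE = 1.0    # tokens refilled per second, per client
        BURST = 10.0  # maximum bucket size

        buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

        def allow_request(client_ip):
            """Return True if this client is still within its request budget."""
            b = buckets[client_ip]
            now = time.monotonic()
            b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
            b["ts"] = now
            if b["tokens"] >= 1.0:
                b["tokens"] -= 1.0
                return True
            return False  # over budget: serve HTTP 429 instead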

    1. doublelayer Silver badge

      Re: do rate limiting on requests?

      They've said they're going to, but an AI company can run parallel scrapers to try to get around that. Renting lots of addresses temporarily is comparatively cheap, so Reddit would have to write better criteria to detect such bots without so many false positives that real users get locked out. Now the arms race is on.

      Of course, if copyright law existed, they could simply indicate that unlicensed use of the data would result in legal action and companies would stop of their own accord, but that won't happen unless some of these cases eventually refute the AI companies' assumption that laws just don't apply to them.

  6. Jamie Jones Silver badge

    Navigation

    Their problems are compounded by the fact that only AI can tolerate their awful website for long periods at a time.

  7. trindflo

    good luck with this guy

    From the new robots.txt:

    User-Agent: bender
    Disallow: /my_shiny_metal_ass

  8. vogon00

    Training data: garbage in, garbage out

    This is something I wish model creators would pay attention to: if you train your model on poor-quality data, the end result will be poor. Which means the product you're trying to sell will be poor. Which means the decisions taken by the (already gullible?) users of your product will be poor. Which means the people who depend on those decisions will have a poor experience.

    I'll be honest here and say I am NOT a fan of Reddit. Every time I have tried using Reddit to solve a problem or learn something, I've found the solution and/or the knowledge I was looking for somewhere else. Reddit is full of speculation and guesses, not facts and knowledge. True, I confine myself to the tech side of things, but I'm sure this extends to all the other subreddits. It's also full of opinion rather than fact.

    We all have to start learning somewhere, but Reddit is NOT the place to train your machine models or your humans... it's full of junk, or at least requires judgment and lots of time to filter out the obvious crap and extract something even vaguely useful.

    Training a model on Reddit data isn't just a bad idea - it's downright irresponsible!
