How to spot OpenAI's crawler bot and stop it slurping sites for training data

OpenAI, the maker of machine learning models trained on public web data, has published the specifications for its web crawler so that publishers and site owners can opt out of having their content scraped. The newly released technical document describes how to identify OpenAI's web crawler GPTBot through its user agent token …

  1. Jamie Jones Silver badge

    Or alternatively...

    Redirect such requests to pages and pages of nonsense. If you don't have any, just slurp more or less any Facebook or Twitter feed.

    1. stiine Silver badge
      Trollface

      Re: Or alternatively...

      But I don't have 10PB of random words that I can send them....

      1. Anonymous Coward
        Anonymous Coward

        10PB? Sure you do!

        That’s what compression is for. Pre-compress a page's worth of words cloned billions of times (and concatenated into a single file), then have your server transmit it raw with the appropriate HTTP headers. Your server might only have to send 3-4 MB due to all the repetition, but when the bot transparently decompresses it on its end, the result will be far larger.
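        A minimal sketch of that trick in Python, assuming the bot advertises gzip support and transparently inflates responses (the port, filler text and sizes are made up for illustration):

        # Sketch only: pre-build a highly repetitive payload, gzip it once at
        # start-up, and hand the compressed bytes to anything that asks.
        import gzip
        from http.server import BaseHTTPRequestHandler, HTTPServer

        # ~300 MB of repeated filler compresses to a tiny fraction of that size.
        PAYLOAD = gzip.compress(b"pages and pages of nonsense. " * 10_000_000)

        class BombHandler(BaseHTTPRequestHandler):
            def do_GET(self):
                self.send_response(200)
                self.send_header("Content-Type", "text/plain")
                self.send_header("Content-Encoding", "gzip")
                self.send_header("Content-Length", str(len(PAYLOAD)))
                self.end_headers()
                self.wfile.write(PAYLOAD)  # small on the wire, huge once inflated

        HTTPServer(("", 8080), BombHandler).serve_forever()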

    2. Bebu
      Headmaster

      Re: Or alternatively...

      《Redirect such requests to pages and pages of nonsense. 》

      I was thinking redirect to Gutenberg (or even libgenesis) as it might then "learn" something useful. There are a large number of Bible sites in a multitude of languages, with various versions and translations, from which it might make some sense; for good measure I wouldn't exclude the ravings of various cults, sects and conspiracy theorists.

      After trawling through all that human excreta, any machine learning model is going to be, if not already, completely gaga.

      1. Roland6 Silver badge

        Re: Or alternatively...

        Probably just as easy, and a little more interesting, is to use a non-English source (e.g. Chinese, Arabic, Hindi), remembering to edit the page language tag to “en-US”…

    3. Roland6 Silver badge

      Re: Or alternatively...

      Simpler to just block the designated OpenAI webcrawler IP addresses.

      What I find odd is that the technical document fails to explicitly define the GPTBot user agent token and string; I want to identify a training bot and either kill the connection or redirect it somewhere else…

      Perhaps we need AV/security software on our websites which can perform heuristic analysis on connection requests, blocking those from “undesirables”…
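      A rough sketch of that sort of server-side filter for nginx, inside a server block (the CIDR is a documentation placeholder rather than OpenAI's actual published range, and the user agent test assumes the bot identifies itself honestly):

      # placeholder range - substitute whatever addresses OpenAI publishes for GPTBot
      deny 192.0.2.0/24;

      # refuse anything that self-identifies as the training crawler
      if ($http_user_agent ~* "gptbot") {
          return 403;
      }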

    4. I am the liquor

      Re: Or alternatively...

      Or if your motives are more sinister, redirect them to a stream that infinitely repeats content supporting your preferred political point of view.

      Don't have enough such content? No problem, just use OpenAI to generate it...

    5. Anonymous Coward
      Anonymous Coward

      Re: Or alternatively...

      The various AI cretins companies should agree on a "universal" designation in robots.txt files. Something like

      User-agent: AIBot

      Disallow: /

      While that does not protect against those who cheat, it at least potentially helps somewhat.

      One would think all web hosting companies would be quick to add an "AI Blocker" as an additional service, since it's another revenue stream.
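      For OpenAI's crawler specifically, the user agent token named in the new document is GPTBot, so the per-vendor version of the same rule is simply:

      User-agent: GPTBot

      Disallow: /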

  2. Len
    Headmaster

    The risk with Robots.txt

    Can someone enlighten me?

    Robots.txt has been used for/against search engine bots for ages. It has always come with the warning that if you don't want people/bots/crawlers to know that directory B exists, you should not explicitly allow crawling of directory A while explicitly blocking crawling of directory B. You're asking bots not to look somewhere, and only scrupulous bots would honour that. The solution is usually to sort it out at page level, so a page you don't want crawled has a meta tag blocking it.

    How would one do this with the ChatGPT bot? Will it also look at the page meta tags? I don't trust some makers of AI bots not to use a block attempt as a cue to go and harvest data from exactly the directories I have disallowed. It might just give them an edge over OpenAI.

    How could I let ChatGPT freely crawl my FAQ or About Us page but not my Content page?
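    The page-level block referred to above is the standard robots meta tag; whether any particular AI crawler honours it is exactly the open question here:

    <meta name="robots" content="noindex, nofollow">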

    1. bigtimehustler

      Re: The risk with Robots.txt

      If you are just blocking the root level you can safely do this. Nobody is assuming the root doesn't exist, and it tells you nothing of what is in the subdirectories.

      1. Len

        Re: The risk with Robots.txt

        I get that, thanks. But I was wondering how to do it selectively. So, let's say I want ChatGPT to hoover up data from directory X but not from directory Y, without telling all and sundry that directory Y exists, so I can't just put "Disallow: /Y" in robots.txt.

        1. The Mole

          Re: The risk with Robots.txt

          Fundamentally it shouldn't matter if you tell all and sundry that directory Y exists. If you want to protect it you need to have appropriate security to protect it, just hoping nobody guesses / finds out the directory name isn't security.

          Now you do need to ensure that merely knowing the directory name doesn't give information away (CompanyXTakeoverBid would be a bad name) but a directory name like secure doesn't tell people much.

        2. doublelayer Silver badge

          Re: The risk with Robots.txt

          Not including directory Y in your file isn't very much security, since if there are any links going there, it won't be hard to identify it. However, you do have a way to do this in robots.txt without implementing something server-side which filters stuff out, which is stronger but more work. What you do is this:

          User-agent: GPTBot

          Disallow: /

          Allow: /x

          It doesn't only filter out /y, and if you want most of your site open, then you'll have to put a lot of things on that list, but it is the only option you have if you're limited to static files and the honor system. One other note: depending on the bot, this may not have them crawling the /x directory because they start on the homepage and only go to things that are linked. You might have to add another allow statement for that page.

        3. Roland6 Silver badge

          Re: The risk with Robots.txt

          You may want a selective filter, as you will want to allow those with whom you have an agreement to access specific content (e.g. full article text, not just the abstract).

        4. I am the liquor

          Re: The risk with Robots.txt

          If you have any public links to directory Y, then everyone knows it exists already, so you might as well add it to robots.txt.

          If you don't have any public links to directory Y, then crawlers won't find it, so you don't need to add it to robots.txt.

    2. rafff

      How could I let ChatGPT freely crawl my FAQ or About Us page but not my Content page?

      If you are using Apache web server and have control of it, mod_rewrite is your friend: https://httpd.apache.org/docs/2.4/rewrite/access.html
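      A minimal mod_rewrite sketch along those lines, for a .htaccess at the site root (the directory name is made up to match the FAQ/About/Content example, and the match relies on the bot declaring its user agent honestly):

      RewriteEngine On
      # anything identifying itself as GPTBot gets a 403 for the content directory
      RewriteCond %{HTTP_USER_AGENT} gptbot [NC]
      RewriteRule ^content/ - [F]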

    3. Jamie Jones Silver badge

      Re: The risk with Robots.txt

      Don't use a meta tag - you're still trusting people to honour your decision.

      You need to block pages/directories explicitly at the server level. If you are going to make use of the declared "user-agent" (as you sort of are if you are thinking of robots.txt, but again, this has issues), then block the page using that:

      In Apache you can use SetEnvIf, which is easier but less powerful, or you can use mod_rewrite.

      Either method can be applied to specific files or directories.

      Apache mod_rewrite and nginx: https://geekflare.com/block-unwanted-requests/

      Apache SetEnvIf: https://stackoverflow.com/questions/51972679/how-to-block-a-specific-user-agent-in-apache
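      A SetEnvIf sketch to the same effect, again keyed on the declared user agent and with a made-up directory path (Apache 2.4 syntax):

      # tag requests whose user agent mentions GPTBot...
      SetEnvIfNoCase User-Agent "gptbot" deny_bot
      # ...then refuse those requests for one directory
      <Directory "/var/www/site/content">
          <RequireAll>
              Require all granted
              Require not env deny_bot
          </RequireAll>
      </Directory>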

  3. Dave Pickles

    Too late

    The bastards slurped my social club website last week...

  4. Anonymous Coward
    Anonymous Coward

    allowing its bot to collect site data can improve the quality of AI models

    essentially: get f... (gently, unnoticeably) for the sake of "quality of AI models" (don't think of their bottom line) and then be really screwed in ways you can't even imagine, once they unleash their models on to the world big time.

  5. Just Enough

    Resistance is useless

    Big tech will approach the problem of copyright breach claims like they always have done. They'll ignore the legal problems until challenged, and then they'll throw unlimited lawyers and money at it until it either goes away, or they get what they want. At no point will anything get rolled back or stopped. I'm afraid that authors of books and websites are about to find this out. Their work will be ingested.

    1. Citizen of Nowhere

      Re: Resistance is useless

      "We are Big Tech. Empty your robots.txt files and surrender your sites. We will add your biographical and intellectual distinctiveness to our own. Your content will adapt to service us. Resistance is futile."

    2. CatWithChainsaw

      Re: Resistance is useless

      The crawlers are going to bite the hand that feeds them and as willing sources dry up, they'll be stuck re-ingesting their AI-centipeded outputs... again and again and again. And that's how we introduced Skynet to the Horror genre.

  6. Rich 2 Silver badge

    Bog off

    “ However, OpenAI insists that allowing its bot to collect site data can improve the quality of AI models…”

    And I should give a shit because…?
