Or alternatively...
Redirect such requests to pages and pages of nonsense. If you don't have any to hand, just slurp almost any Facebook or Twitter feed.
OpenAI, the maker of machine learning models trained on public web data, has published the specifications for its web crawler so that publishers and site owners can opt out of having their content scraped. The newly released technical document describes how to identify OpenAI's web crawler GPTBot through its user agent token …
That’s what compression is for. Pre-compress a page's worth of words cloned billions of times (and concatenated into a single file), then have your server transmit it raw with the appropriate HTTP headers. Your server might only have to send 3-4 MB thanks to all the repetition, but when the bot transparently decompresses it on its end, the result will be far larger.
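A minimal sketch of that trick for Apache, assuming mod_headers is loaded; the file name and path are placeholders, not anything OpenAI or Apache prescribes:

# Build the payload once: a short line of text repeated out to ~10 GB compresses
# to roughly 10 MB with gzip -9, for example:
#   yes "all work and no play makes GPTBot a dull bot " | head -c 10G | gzip -9 > /var/www/traps/bomb.html.gz
# Then hand the file over untouched, declared as gzip-encoded HTML, so the
# client inflates the whole thing on its own CPU and RAM.
<Directory "/var/www/traps">
    <Files "bomb.html.gz">
        ForceType text/html
        Header set Content-Encoding gzip
        SetEnv no-gzip 1
    </Files>
</Directory>

Pair it with a rewrite rule that sends the bot's requests to that file and the crawler does all the inflating for you.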
"Redirect such requests to pages and pages of nonsense."
I was thinking of redirecting to Project Gutenberg (or even Library Genesis), as it might then "learn" something useful. There are a large number of Bible sites in a multitude of languages, with various versions and translations, from which it might make some sense, and for good measure I wouldn't exclude the ravings of various cults, sects and conspiracy theorists.
After trawling through all that human excreta, any machine learning model is going to be, if not already, completely gaga.
Simpler to just block the designated OpenAI webcrawler IP addresses.
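If you go that route, a minimal Apache 2.4 sketch might look like the following; the CIDR is a documentation-only placeholder (RFC 5737), so substitute whatever ranges OpenAI currently publishes, and the directory path is likewise an assumption:

<Directory "/var/www/html">
    <RequireAll>
        Require all granted
        # Placeholder range; replace with the published GPTBot ranges.
        Require not ip 192.0.2.0/24
    </RequireAll>
</Directory>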
What I find odd is that the technical document fails to explicitly define the GPTBot user agent token and string; I want to identify a training bot and either kill the connection or redirect it somewhere else…
Perhaps we need AV/security software on our websites that can perform heuristic analysis on connection requests, blocking those from "undesirables"…
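A sketch of the kill-the-connection option for Apache 2.4, assuming mod_setenvif is loaded, that the crawler's declared user agent contains the substring "GPTBot", and that /var/www/html is the document root:

# Tag requests whose user agent mentions GPTBot, then refuse them.
SetEnvIfNoCase User-Agent "GPTBot" ai_trainer
<Directory "/var/www/html">
    <RequireAll>
        Require all granted
        Require not env ai_trainer
    </RequireAll>
</Directory>

Anything tagged this way gets a 403 instead of the page; a rewrite rule could redirect it elsewhere instead.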
The various AI cretins (sorry, companies) should agree on a "universal" designation in robots.txt files. Something like:
User-agent: AIBot
Disallow: /
While that does not protect against those who cheat, it at least potentially helps somewhat.
One would think all web hosting companies would be quick to add an "AI Blocker" as an optional extra, since it's an additional revenue stream.
Can someone enlighten me?
Robots.txt has been used for/against search engine bots for ages. It has always come with the warning that if you don't want people/bots/crawlers to know that directory B exists, you should not explicitly allow crawling of directory A while explicitly blocking crawling of directory B. You're asking them not to look somewhere, and only scrupulous bots will honour that. The usual solution is to sort it out at page level, so a page you don't want crawled carries a meta tag blocking it.
How would one do this with the ChatGPT bot? Will it also look at the page meta tags? I don't trust some makers of AI bots not to treat a block attempt as an invitation to go and harvest data from the very directories I have disallowed. It might just give them an edge over OpenAI.
How could I let ChatGPT freely crawl my FAQ or About Us page but not my Content page?
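(For reference, the page-level block mentioned above is the standard robots meta tag shown below; whether GPTBot reads it is not something OpenAI's document confirms.)

<!-- Placed in the <head> of an individual page; only well-behaved crawlers honour it. -->
<meta name="robots" content="noindex, nofollow">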
Fundamentally it shouldn't matter if you tell all and sundry that directory Y exists. If you want to protect it, you need appropriate security in place; just hoping nobody guesses or finds out the directory name isn't security.
Now you do need to ensure that merely knowing the directory name doesn't give information away (CompanyXTakeoverBid would be a bad name), but a directory name like "secure" doesn't tell people much.
Not including directory Y in your file isn't much security, since if there are any links pointing there it won't be hard to identify it. However, you do have a way to do this in robots.txt without implementing something server-side that filters requests (which is stronger, but more work). What you do is this:
User-agent: *
Disallow: /
Allow: /x
It doesn't only filter out /y, and if you want most of your site open you'll have to put a lot of entries on that list, but it is the only option you have if you're limited to static files and the honor system. One other note: depending on the bot, this may still not get it crawling the /x directory, because crawlers start at the homepage and only follow links; you might have to add another Allow statement for that page.
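Applied to the earlier FAQ/About Us question, a sketch of this approach could look like the following; the paths are made-up examples, and the trailing $ that limits the last rule to the homepage is a wildcard extension not every crawler supports:

User-agent: GPTBot
Disallow: /
Allow: /faq
Allow: /about-us
Allow: /$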
Don't use a meta tag - you're still trusting people to honour your decision.
You need to block pages/directories explicitly at the server level. If you are going to make use of the declared "user-agent" (as you sort of are if you are thinking of robots.txt, but again, this has issues), then block the page using that:
In Apache you can use SetEnvIf (easier, but less powerful) or mod_rewrite.
Either method can be applied to specific files or directories.
Apache mod_rewrite and nginx: https://geekflare.com/block-unwanted-requests/
Apache SetEnvIf: https://stackoverflow.com/questions/51972679/how-to-block-a-specific-user-agent-in-apache
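For instance, a mod_rewrite sketch that refuses just one directory to a user agent containing "GPTBot" (the /content/ path is an assumption, and mod_rewrite must be enabled):

RewriteEngine On
# Case-insensitive match on the declared user agent; [F] returns 403 for this directory only.
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
RewriteRule ^/?content/ - [F,L]

Swapping the [F,L] flags for a redirect target would send the bot to a trap page instead of refusing it outright.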
Essentially: get f... (gently, unnoticeably) for the sake of the "quality of AI models" (don't think about their bottom line), and then be really screwed in ways you can't even imagine once they unleash their models onto the world big time.
Big tech will approach the problem of copyright breach claims like they always have done. They'll ignore the legal problems until challenged, and then they'll throw unlimited lawyers and money at it until it either goes away, or they get what they want. At no point will anything get rolled back or stopped. I'm afraid that authors of books and websites are about to find this out. Their work will be ingested.