Any search engine that excludes Facebook pages from its search results, intentionally or not, gets the thumbs up from me.
Facebook pages are complete drivel, second only to MySpace pages.
Facebook has updated its robots.txt file so that the site can only be crawled by a short list of search engines, including Google, Microsoft's Bing, China's Baidu, Russia's Yandex, and a few others. Previously, Facebook's robots.txt allowed anyone to crawl the site, although the company had threatened to sue at least one …
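For anyone wondering what an allow-list robots.txt looks like from the crawler's side, here is a rough sketch using Python's standard urllib.robotparser. The policy text, the example.com URLs and the may_fetch helper are all made up for illustration; this is not Facebook's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allow-list policy (an assumption for illustration,
# not Facebook's real robots.txt): named crawlers may fetch most
# paths, everyone else is shut out entirely.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: bingbot
Disallow: /private/

User-agent: *
Disallow: /
"""


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the policy above lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "bingbot", "RandomBot"):
        print(agent, may_fetch(agent, "https://example.com/profile"))
```

A polite crawler would run a check like this before every request; the catch-all "User-agent: *" group is what turns the file into an allow list.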
It's entirely possible to tarpit a crawler you really don't like by generating an infinite number of random pages and links for it extremely slowly, tying up its resources for months on end. The robots.txt protocol is so well known that there is no excuse for a site owner not to use it to express policy or for a crawler operator not to respect it.
Given that a crawler which doesn't respect the wishes of a site owner can be tarpitted until it gives up, there is no point in suing a crawler that respects robots.txt, and there is a better punishment available for one that doesn't. After all, a crawler is only making automated use of information you have chosen to publish.
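The tarpit idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the /maze/ URL scheme, the function names and the default delay are invented here, and a real deployment would sit behind a web server and only be shown to crawlers that have already ignored robots.txt.

```python
import hashlib
import time
from typing import Iterator, List


def tarpit_links(path: str, n_links: int = 5) -> List[str]:
    """Derive pseudo-random child links from the requested path.

    Hashing the path means nothing has to be stored: every URL in the
    maze always yields the same page, and every page leads to more.
    """
    links = []
    for i in range(n_links):
        digest = hashlib.sha256(f"{path}#{i}".encode()).hexdigest()[:12]
        links.append(f"/maze/{digest}")  # hypothetical URL scheme
    return links


def tarpit_page(path: str, n_links: int = 5) -> str:
    """Build a small HTML page consisting only of further maze links."""
    anchors = "\n".join(
        f'<a href="{link}">{link}</a>' for link in tarpit_links(path, n_links)
    )
    return f"<html><body>\n{anchors}\n</body></html>\n"


def drip(body: str, chunk_size: int = 16, delay_s: float = 2.0) -> Iterator[str]:
    """Yield the page a few bytes at a time, pausing before each chunk."""
    for start in range(0, len(body), chunk_size):
        time.sleep(delay_s)
        yield body[start:start + chunk_size]


if __name__ == "__main__":
    # Short delay here so the demo finishes quickly.
    for chunk in drip(tarpit_page("/maze/start"), delay_s=0.1):
        print(chunk, end="", flush=True)
```

The slow drip is what ties up the crawler: each connection stays open for a long time while costing the server almost nothing, and the link graph never runs out.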
Wouldn't it be ironic if Facebook's irresponsible approach to user privacy allowed Google to auto-populate their rumoured rival service with user accounts?
Expect an email from Google soon, starting: "If you, like many others, are unhappy with Facebook, then we're pleased to tell you that we've already prepared you an account at Google Me, with the same login details, same friends and groups lists ..."
I've run a long-tail site, and when I wasn't watching very hard it was brought to a grinding halt by crawler traffic placing 5x the load on the server that the actual user traffic was generating. I ended up blocking the random crawlers as well - half of them were effectively just stealing the content, and most of the rest were sending next to no traffic to us anyway.
Biting the hand that feeds IT © 1998–2022