Facebook bars crawls from all but select few

Facebook has updated its robots.txt file so that the site can only be crawled by a short list of search engines, including Google, Microsoft's Bing, China's Baidu, Russia's Yandex, and a few others. Previously, Facebook's robots.txt allowed anyone to crawl the site, although the company had threatened to sue at least one …
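As a rough illustration of the kind of allow-list policy described above, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `/private/` path, the example.com URLs and the `may_crawl` helper are all assumptions made up for this example; this is not Facebook's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two named crawlers may fetch everything except
# /private/, and every other user agent is barred from the whole site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: bingbot
Disallow: /private/

User-agent: *
Disallow: /
"""


def may_crawl(user_agent: str, url: str, robots_txt: str = EXAMPLE_ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "bingbot", "SomeRandomBot"):
        print(agent, may_crawl(agent, "https://example.com/profile"))
```

A well-behaved crawler would run a check like this before every fetch. Nothing in the protocol enforces it; compliance is voluntary on the crawler's side.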


This topic is closed for new posts.
  1. Ian Ferguson

    Search results

    Any search engine that excludes Facebook pages from its search results, intentionally or not, gets the thumbs up from me.

    Facebook pages are complete drivel, second only to MySpace pages.

  2. Anonymous Coward

    Crawler beware

    Could it not be argued that a search company should make a conscious decision before crawling certain sites rather than rely on the presence of a robots.txt file?

    especially one that has the wherewithal to hire lawyers (lots of lawyers).

    1. copsewood

      respect robots.txt or welcome to my infinite tarpit

      It's entirely possible to tarpit a crawler you really don't like by generating an infinite number of random pages and links for it extremely slowly, tying up its resources for months on end. The robots.txt protocol is so well known that there is no excuse for a site owner not to use it to express policy or for a crawler operator not to respect it.

      Given that a crawler which doesn't respect the wishes of a site owner can be tarpitted until it gives up, there is no point suing if a crawler respects robots.txt, and a better punishment is available if it doesn't. After all, a crawler is only making automated use of information you have chosen to publish.
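The tarpit idea in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions, not copsewood's actual setup: the `/trap/` URL prefix, the function names and the default delay are all hypothetical. Each page is derived from its own path, so the link graph is effectively endless without storing anything, and the body is handed out in small chunks with a pause before each one.

```python
import random
import string
import time
from typing import Callable, Iterator


def random_slug(rng: random.Random, length: int = 12) -> str:
    """Return a random lowercase string to use as a fake page name."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def tarpit_page(path: str, links: int = 10) -> str:
    """Build a page of random links under the hypothetical /trap/ prefix.

    The generator is seeded with the request path, so the same URL always
    yields the same page and no state has to be kept on the server.
    """
    rng = random.Random(path)
    anchors = "\n".join(
        f'<a href="/trap/{random_slug(rng)}">{random_slug(rng)}</a>'
        for _ in range(links)
    )
    return f"<html><body>\n{anchors}\n</body></html>"


def drip(
    body: str,
    chunk_size: int = 16,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield body in small chunks, pausing before each one.

    sleep is injectable so the pacing can be checked without waiting.
    """
    for start in range(0, len(body), chunk_size):
        sleep(delay)
        yield body[start:start + chunk_size]
```

In a real deployment the chunks from `drip` would be written to the response stream of whatever web server is in use. One plausible arrangement is to list the trap prefix as disallowed in robots.txt, so that only crawlers ignoring the file ever wander into it.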

  3. Ralph B

    Populating Google Me?

    Wouldn't it be ironic if Facebook's irresponsible approach to user privacy allowed Google to auto-populate their rumoured rival service with user accounts?

    Expect an email from Google soon, starting: "If you, like many others, are unhappy with Facebook, then we're pleased to tell you that we've already prepared you an account at Google Me, with the same login details, same friends and groups lists ..."

  4. Anonymous Coward

    Google already have...

    their own social networking site, it's called Orkut, so it *could* happen

  5. Anonymous Coward

    Don't blame them

    I've run a long-tail site, and when I wasn't watching very hard it was brought to a grinding halt by crawler traffic placing 5x the load on the server that the actual user traffic was generating. I ended up blocking the random crawlers as well: half of them were effectively just stealing the content, and most of the rest were sending next to no traffic to us anyway.

  6. John Ridley

    It's a start

    Now, can I specify a custom robots.txt so that NO search engines can crawl my Facebook stuff?

    I suspect I'm OK already since I go in every few weeks and make sure that nobody I haven't accepted as a friend can even tell that I exist on FB, let alone see any of my stuff.

  7. Anonymous Coward

    Lying scumsuckers

    Yes, some of these crawlers are really disreputable. If Facebook doesn't do the same under the counter, I'm a goldfish, even before you consider what they've got acting as CEO...

  8. El Richard Thomas

    Fixed the quote...

    "Some sleazy crawlers simply aggregate user data en masse and then sell it, which we view as a threat to our own sleazy business model"
