So why is Net Applications seeing what it's seeing?
What they want to see?
Over 10 per cent of Google's internal machines are hiding their software makeup from the outside world, according to data collected by Net Applications, a web analytics outfit that captures user traffic on more than 40,000 sites across the net. When visiting webpages, the firm says, 11 to 13 per cent of internal Google …
There might be a non-sinister explanation. A lot of the work Google does is related to quality, so anyone on the anti-spam team, AdWords quality, fraud detection, search quality, etc. might have good reason to fetch pages with an unusual user agent. At http://www.justlanded.com we have seen Google servers pretend to be different O/Ss and user agents from the same IP at the same time. Yahoo do the same stuff, and from time to time we have to unblock their IPs when they use things like Wget to pull down thousands of pages (typical email harvesting or scraper behaviour).
People will see black helicopters everywhere... that's not to say they won't be launching a distro or some other MicrosoftWhack though...
I wonder if it's just someone who is "paranoid". You know, Flash off, no Java, no JavaScript, clear the cookies and browser history all the time. Some people at Google might just think "Ha! I'm going to clear the user-agent string too!" Or be testing anonymization tools. Something like that. I just don't see Google hiding a new OS by clearing it all out entirely. Maybe it's even just an internal test build of Chrome that had a bug making it leave the user-agent blank 8-).
The most likely reason for blank user agents: the Googlers have decided that they want to encourage websites to be standards compliant instead of detecting the browser type and building a page for that one. This sounds pretty consistent with a company that has just released a minority browser platform.
As I recall, blanking or replacing the user-agent string is a standard feature of Squid (and presumably other proxy servers as well). OpenBSD's pf firewall has a "modulate state" option, which does something similar on a TCP/IP level (randomising all the parameters, making it hard to identify the OS generating the traffic).
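From memory it looks something like this - treat it as a sketch rather than a drop-in config, since the exact directive names vary between Squid versions and the interface macro and replacement string below are made up:

    # squid.conf (Squid 3.x syntax; older 2.x releases spelled it "header_access")
    # Strip the User-Agent header from outgoing requests entirely:
    request_header_access User-Agent deny all
    # ...or overwrite denied headers with a fixed (made-up) string instead:
    # header_replace User-Agent SomeBoringBrowser/1.0

    # pf.conf - "modulate state" randomises initial TCP sequence numbers, which
    # removes one of the things passive OS fingerprinting keys on
    # ($ext_if is a placeholder macro for the external interface)
    pass out on $ext_if proto tcp from any to any modulate state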
If I wanted to hide my secret OS/browser, I'd have it report itself as something like a Subversion build of Firefox/Gecko running on WinXP - looking normal in logs, while having an obvious explanation for any odd behaviour server admins might notice (it's a work-in-progress version of an open source browser, of course it's not acting in exactly the same way as the last released version!).
Blanking the user-agent, on the other hand, would make sense in two ways: first, a paranoid sysadmin wanting as little information as possible to get out (so you blank the user-agent and probably have a firewall randomising parameters too); second, to help catch sites running spider-traps that serve up pages of link-spam to anything other than IE. (In fact, the comment about these 'appearing to be real people not spider activity' could be exactly the point: comparing the pages seen by real people - and their proxy - to the pages served up to Googlebot.)
Or option 3: they don't want the world knowing that for all the hype about using 'Goobuntu' and having their own web browser, 90% of their staff are still using IE 7 on XP!
Some cheaters send one set of content if they see Googlebot in the UA string, and another if they see any other agent. So it makes a lot of sense to me if Google double-checks the Googlebot results by fetching the same URL again with a different UA string and seeing if they get the same results returned.
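That sort of double-check is trivial to script, too. A rough sketch of the idea (the URL and UA strings here are just made-up examples, and a byte-for-byte comparison will obviously false-alarm on pages with rotating ads or timestamps in them):

    import hashlib
    import urllib.request

    def fetch(url, user_agent):
        # Fetch a page with an explicitly chosen User-Agent header.
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read()

    def looks_cloaked(url):
        # Compare what the site hands to "Googlebot" with what it hands to a
        # normal-looking browser UA; differing content is a hint of cloaking.
        googlebot_ua = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                        "+http://www.google.com/bot.html)")
        browser_ua = "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/2009 Firefox/3.0"
        a = hashlib.md5(fetch(url, googlebot_ua)).hexdigest()
        b = hashlib.md5(fetch(url, browser_ua)).hexdigest()
        return a != b

    if __name__ == "__main__":
        print(looks_cloaked("http://example.com/"))

In practice you'd diff the rendered text rather than the raw bytes, but the principle is the same.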
I've seen quite a few of these from Google, and reached the conclusion they were requests from mobile users passing through Google, where the pages are optimised for display on a phone. The name of the system escapes me this early in the morning, but there was a lot of discussion of it some time ago on webmasterworld etc.
"The most likely reason for blank user agents: the Googlers have decided that they want to encourage websites to be standards compliant instead of detecting the browser type and building a page for that one."
My thoughts entirely. Perhaps the world would be a better place if *everyone* did that and the broken sites discovered that they weren't getting customers anymore.
I see Google non-spiders come and look at some of my sites; I sort of expect people in Google to use the web themselves. And yes, the headers are fairly stripped, and there are some blank ones, but we can all do that if we like.
The reason is the same as the no_exist-google87048704 requests: they are seeing what the site returns to a browser with no user agent. Or it is just that some of them don't wish to mention the user agent - there is no law requiring it :)
"A browser user agent not only identifies the browser a machine is using, but also its operating system."
Not necessarily. RFC 2616 only specifies that "User agents SHOULD include this field with requests. The field can contain multiple product tokens and comments identifying the agent and any subproducts which form a significant part of the user agent." (§14.43) The operating system is not a significant part of the browser.
As several people have mentioned, Google may be testing the behavior of sites when they are not given a known User-Agent header. Sending an empty string instead of forging it (e.g. "fake user-agent") could be an attempt to make it less noticeable in logs. They're clearly up to something, and trying not to leak any information about it.
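For what it's worth, sending a genuinely blank user agent takes no special tooling at all. A quick sketch (example.com is just a stand-in host):

    import http.client
    import urllib.request

    URL = "http://example.com/"  # stand-in target

    # Variant 1: send the header with an empty value. Setting User-Agent
    # explicitly also stops urllib adding its default "Python-urllib/x.y".
    req = urllib.request.Request(URL, headers={"User-Agent": ""})
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(resp.status, len(resp.read()))

    # Variant 2: omit the header entirely by using http.client directly,
    # which adds no User-Agent of its own.
    conn = http.client.HTTPConnection("example.com", timeout=10)
    conn.request("GET", "/")
    resp = conn.getresponse()
    print(resp.status, len(resp.read()))
    conn.close()

Either variant shows up in a server log as an empty or missing agent string, which is exactly what Net Applications seems to be counting.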
So, mysterious blank user agents
Coinciding with some human ranking component to Google search results
Makes for a nice little no-useragent widget of googoompa-loompa desktops that has them clicking happily all day, voting for search results, and improving their chocolate... I mean search results.
Hmmmm...