It's nice to have regular recalls, but...
Didn't we have that discussion back before Y2K?
Melbourne penetration tester Thiebaud Weksteen is warning system administrators that robots.txt files can give attackers valuable information on potential targets by giving them clues about directories their owners are trying to protect. Robots.txt files tell search engines which directories on a web server they can and cannot …
This is old news.
If you want to protect something from prying eyes, put it behind HTTP authentication or secured scripts. Google can't magically guess your passwords and index password protected areas.
But listing something in robots.txt that you don't want indexed? That's like looking for the evil bit on an Internet packet. If you don't want random people indexing content, don't make that content available to them. Even the "Ah, but I block the GoogleBot" junk is useless - do you have any idea how many other bots are out there just indexing sites at random?
If your robots.txt is used for anything other than "that's a large image folder and I'd rather it wasn't uploaded to Google over time for bandwidth reasons, but there's nothing sensitive in there", then you're giving yourself a false impression of safety.
It's like leaving your file server open to the world but putting the "hidden" bit on the files...
I always assumed robots.txt was just to flag parts of the tree that were not worth the robot's trouble, both as a courtesy to the search engines and to reduce bandwidth on your web server. Nothing to do with security.
Using it for security is like leaving your front door open and a note on the table saying "please don't nick anything in the bedroom".
> Using it for security is like leaving your front door open and a note on the table saying "please don't nick anything in the bedroom".
Or even "please don't nick anything from the second drawer down, hidden under the socks, in the chest of drawers in the bedroom at the front of the house".
> "please don't nick anything from the second drawer down, hidden under the socks, in the chest of drawers in the bedroom at the front of the house".
Where I have placed a mousetrap primed to go off when you stick your hand in there.
As the article says, temporarily block any IP that tries to access that area.
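Something like this would do it - a minimal sketch, assuming a combined-format access log; the log location, the trap path (which would only ever appear in robots.txt) and the printed iptables command are all placeholders:
#!/usr/bin/env python3
# Sketch: list IPs that fetched a path advertised only in robots.txt.
# ACCESS_LOG, TRAP_PATH and the iptables command below are placeholders.
import re

ACCESS_LOG = "/var/log/nginx/access.log"
TRAP_PATH = "/private-mousetrap/"   # listed in robots.txt, linked nowhere

# Combined log format: client IP is the first field,
# the request line is the first double-quoted field.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) ([^ "]+)')

offenders = set()
with open(ACCESS_LOG, encoding="utf-8", errors="replace") as log:
    for line in log:
        m = LINE_RE.match(line)
        if m and m.group(2).startswith(TRAP_PATH):
            offenders.add(m.group(1))

for ip in sorted(offenders):
    # Review before running; a temporary ban is kinder than a permanent one.
    print(f"iptables -A INPUT -s {ip} -j DROP")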
> Using it for security is like leaving your front door open and a note on the table saying "please don't nick anything in the bedroom".
I used to work for a company that insisted all confidential information be stored in locked cabinets with a label on the cabinet saying "Contains confidential information."
It was probably meant as a reminder for staff to lock the cabinet but obviously helped any would-be industrial spy.
From robotstxt.org - page created in 2007:
There are two important considerations when using /robots.txt:
• robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.
• the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.
So don't try to use /robots.txt to hide information.
I'm surprised this article hasn't died of old age, considering its information has been known for 21 years (i.e. since the robots.txt standard was created in 1994). Yes, it will flag up some sensitive areas, but that's what IP, username/password, two-factor auth (and so on) restrictions are for. Also note that hackers know where all the common CMSes have their admin interfaces (most installs don't change that), so they don't need robots.txt to find them.
Although robots.txt can be ignored by "bad" spiders, it's often useful to stop spiders that do read it from battering your site (e.g. constantly hitting a script with varying parameters - a classic one is a calendar that hyperlinks every day even if there's no events plus has nav going backwards and forwards for centuries in either direction :-) ).
Perhaps this has lapsed into the realm of "well duh, everyone knows this, if you don't then what are you even doing here" and become something they don't even bother to teach any more. Thiebaud may currently be coming to the embarrassing realisation that he's effectively announced that water is wet! :-)
So every single web spider has to crawl your entire site, before dropping the directories that match the hash?
It doesn't solve the problem that robots.txt exists for (telling a search engine which bits aren't worth indexing), and it doesn't solve the second problem (flagging which bits of your site you don't want indexing).
They have a http://www.google.com/humans.txt file.
I've had pen testers fail one of the sites I look after, just because it had a robots.txt. They didn't look at the contents of it, just that it was there. Robots.txt is a useful tool for controlling the behaviour of legitimate crawlers; it also makes it easy to identify those that ignore it and take remedial action.
Needless to say, it added an extra sense of perspective to some of their other suggestions, most of which I considered bullshit too.
Those pen testers were clearly idiots then.
Besides, since when have penetration tests been 'pass or fail'? Normally they return a long list of recommendations of varying degrees of severity. Do you mean they raised the existence of robots.txt as an issue? If so, what severity was it? If it was 'informational' I can see their point...
> Normally they return a long list of recommendations of varying degrees of severity
And then the PHB sees "anything" flagged up on a report and demands that it be fixed - without consulting those whose area it impacts. I've been on the receiving end of this at a previous job ...
In that case, it was assessors for our parent company's insurers. One thing they flagged up was that they expect to see an account and the terminal blocked after 3 failed logins. They didn't ask us, we weren't even aware they'd been until "management" came along with a list of things we *must* fix.
Had we been asked at the time, we'd have been able to point out that the OS didn't in fact have a means of locking an account like that (it was a looong time ago), and locking the "terminal" really really was a bad idea and was guaranteed to cause problems without adding any security. But we were instructed that we must do it, so we complied and waited.
Sure enough, it wasn't long before the random "I can't log in" calls came in - from all over the company. You see, most users were on dynamic terminals (TCP sessions); one virtual line was blocked, and of course, once all the lower-numbered lines were in use, that was the one people hit when trying to log in. The only exception was if two people were logging in at once, when that locked line would be temporarily in use for a short time and allow others to log in on other lines.
Sure enough, we were allowed to turn off that feature!
"You've clearly never run a website of more than a few pages."
I have, several, and can you imagine how many pages I've had to prevent from being indexed by Google?
For every page that's added to a website, it wouldn't take much to add that page to Google if the process was simple. Copy/paste URL to Google, click submit. Job done.
And yes, I know this, but it's not required any more, whereas before it was a benefit to your website if you did so. Google could take up to two weeks to index any change to your website, so it was better to inform them of a change instead of waiting for them to notice one.
>> I have, several, and can you imagine how many pages I've had to prevent from being indexed by Google?
Luckily for you there's a solution to this, just organize the site so that the pages you don't want indexed have one or a few common roots and then there's a special file you can put in your root.
Can't remember what it's called though. Damn!
"Luckily for you there's a solution to this, just organize the site so that the pages you don't want indexed have one or a few common roots and then there's a special file you can put in your root."
Would love to say I hadn't thought of this, but it's not possible with my current employers. So it's easier to just leave the job than do that - which I'm doing.
"I've always thought that Google and the other search engines should require you to submit the pages instead of crawling them."
I think you sort of can do that if you want to. To get up-to-date stuff indexed.
robots.txt is more about keeping your dull stuff out of the index. Or stuff that Google may suspect of being illegitimate SEO work.
https://support.google.com/webmasters/answer/35291?hl=en appears to be Google's standard advice on the question of "Do you need an SEO?"
A good honeypot should do little or no work itself, but give the potential miscreant something they think is useful but is actually worthless, and log any relevant information. In days gone by, redirecting to a tarpit might have been an idea, but now script kiddies have almost limitless resources, so it doesn't make sense any more.
Don't think robots.txt is as good for this as some of the other files that are regularly looked for.
The short version is that when it was first invented it was easier to list what you didn't want indexed, rather than what you did.
For the long version I'd start at the wikipedia entry.
(From which I have just learnt that Charlie Stross claims to have caused the development of robots.txt by accidentally DOSing Martijn Koster's site with a home built web crawler).
The Internet/World Wide Webs are a free open space place, so don’t venture into/onto it with anything you want to hide or expect to remain secret.
Claims to provide such effective security as renders information secret in those virtual space places, rather than searchable and liable to exposure are therefore to be assumed fraudulent and even criminal?
Lots of people are saying 'old news'. Of course it is, but that does beg the question of how so many sites still get it so badly wrong?
One genuine failing in the article is that it doesn't even mention the right solution, which is to use a Robots meta tag within resources you want to hide, like:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
So the page will be ignored IF a spider finds it, but won't be advertised via the robots.txt file. This only works for HTML resources, though.
The bigger 'take away', of course, is the fact that you should never rely on obscurity as your only security; if you don't want files to be accessible; block/control access on the server-side.
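To make the 'block/control access on the server-side' point concrete, here's a rough stdlib-only Python sketch; the /private/ prefix and the credentials are placeholders, and in practice you'd configure authentication in the web server itself, over HTTPS:
#!/usr/bin/env python3
# Toy WSGI app: anything under /private/ requires HTTP Basic auth.
# Credentials, paths and the wsgiref dev server are illustrative only.
import base64
from wsgiref.simple_server import make_server

USERNAME, PASSWORD = "editor", "change-me"   # placeholders

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if path.startswith("/private/"):
        expected = "Basic " + base64.b64encode(
            f"{USERNAME}:{PASSWORD}".encode()).decode()
        if environ.get("HTTP_AUTHORIZATION", "") != expected:
            start_response("401 Unauthorized",
                           [("WWW-Authenticate", 'Basic realm="private"'),
                            ("Content-Type", "text/plain")])
            return [b"Authentication required.\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [f"You asked for {path}\n".encode()]

if __name__ == "__main__":
    with make_server("127.0.0.1", 8000, app) as httpd:
        httpd.serve_forever()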
"Weksteen, a Securus Global hacker, thinks they offer clues about where system administrators store sensitive assets because the mention of a directory in a robots.txt file screams out that the owner has something they want to hide."
Change "system administrators" to "developers" or "Wanabies" then perhaps you have a point. A SysAdmin by definition has access to the entire system, so has no need to store sensitive stuff within the web root! Your normal peon that's limited to an FTP login however doesn't have a choice. Us SysAdmins get enough crap already without people trying to blame us for dev faults.
# (the directory /stupid-bot/ does not exist):
User-agent: *
Disallow: /stupid-bot/
This makes the bad crawlers pop up in your logs.
A comment such as <!-- log stupid crawlers: <a href="STUPID-BOT">STUPID-BOT-TXT</a> --> in an HTML page (one they are supposed to see) shows which ones try harder than others to find things.
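If you want to see which ones 'try harder', something like this would tally it - a sketch assuming a combined-format access log, where the log location and the /stupid-bot/ and /STUPID-BOT paths are just the hypothetical ones from above:
#!/usr/bin/env python3
# Tally, per user agent, hits on the robots.txt trap vs the HTML-comment beacon.
# Log location and trap paths are the hypothetical ones from the comment above.
import re
from collections import Counter

ACCESS_LOG = "/var/log/nginx/access.log"
TRAPS = {"/stupid-bot/": "read robots.txt", "/STUPID-BOT": "parsed HTML comments"}

# Pull the request path and the trailing user-agent field from a combined log line.
LINE_RE = re.compile(r'"(?:GET|POST|HEAD) ([^ "]+)[^"]*"\s+\d+\s+\S+\s+"[^"]*"\s+"([^"]*)"')

counts = Counter()
with open(ACCESS_LOG, encoding="utf-8", errors="replace") as log:
    for line in log:
        m = LINE_RE.search(line)
        if not m:
            continue
        path, agent = m.groups()
        for trap, label in TRAPS.items():
            if path.startswith(trap):
                counts[(agent, label)] += 1

for (agent, label), hits in counts.most_common():
    print(f"{hits:5d}  {label:22s}  {agent}")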