There are lots of reasons why you may want to do this: to keep robots out of your admin area, hide some nasty scripts or save on bandwidth (yeah, I’ve heard of it being done…). Of course, robots.txt is a public file, so all anyone has to do is read it to find a load of directories that you don’t want found…
What is robots.txt?
I’ll let Wikipedia explain:
The robots exclusion standard or robots.txt protocol is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website. The information specifying the parts that should not be accessed is specified in a file called robots.txt in the top-level directory of the website.
The file lives in your site’s root directory, for example http://domain.com/robots.txt. Most robots will read it; some of the spammy or nasty ones won’t, and there isn’t much you can do about that with robots.txt itself. You could use an .htaccess file or a script to block nasty bots, but that is outside the scope of this article.
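Because the file always sits at the top level of the site, its address can be worked out from any page URL. Here is a minimal sketch in Python; the helper name robots_url and the example page address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; drop the path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page on the example domain used in this article.
print(robots_url("http://domain.com/blog/some-post.html?page=2"))
```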
How to block all robots from all of your pages
This is quite simple, although at this basic level I doubt you would want to use it much:
User-agent: *
Disallow: /
User-agent: * applies to all user agents (* is a wildcard that stands for any name), and Disallow: / covers every path on the site. There are many pages on the internet which will help you find the name of a particular robot, such as this one with a massive list of user agents.
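One way to sanity-check rules like these before uploading them is to feed them to Python’s standard urllib.robotparser module. This is only a sketch: the bot names and the domain.com URL are placeholders, and a real crawler may read the file differently from Python’s parser.

```python
from urllib.robotparser import RobotFileParser

# The block-everything rules from above, one string per line.
rules = [
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# Every robot that honours the file is refused every path.
for bot in ("googlebot", "msnbot", "some-other-bot"):
    print(bot, parser.can_fetch(bot, "http://domain.com/any/page.html"))
```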
User agent info:
Some common search engine user agents that you may be interested in are as follows:
- Main web search (some also image search):
  - Google: googlebot
  - MSN: msnbot
  - Yahoo: Slurp
  - Ask: teoma
  - Alexa: ia_archiver
- Image search:
  - Google: googlebot-image
  - MSN Pic search: psbot
So if you wanted to block just Alexa from all directories, you would use:
User-agent: ia_archiver
Disallow: /
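To see that a record like this only affects the named robot, the rules can be run through Python’s urllib.robotparser. Treat it as an illustration rather than a guarantee of how Alexa’s crawler behaves; the URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: ia_archiver",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

url = "http://domain.com/index.html"
alexa_allowed = parser.can_fetch("ia_archiver", url)  # matched by the record
google_allowed = parser.can_fetch("googlebot", url)   # no record applies to it
print("ia_archiver:", alexa_allowed)
print("googlebot:", google_allowed)
```

With no User-agent: * record in the file, robots that are not named are allowed everywhere.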
How to block a certain directory (or more than one) using robots.txt
This isn’t very different from the previous example. The following blocks all user agents from /cgi-bin/ and /private/:
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
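Disallow values are matched as path prefixes, so the trailing slash matters. The sketch below checks a few made-up paths against the rules above using Python’s urllib.robotparser; the helper name allowed and the paths are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)


def allowed(path: str) -> bool:
    """Check a path on the example domain for an arbitrary robot."""
    return parser.can_fetch("googlebot", "http://domain.com" + path)


for path in ("/private/notes.html", "/cgi-bin/form.pl",
             "/index.html", "/privateer.html"):
    print(path, allowed(path))
```

Note that /privateer.html is still allowed, because it does not start with /private/ (including the trailing slash).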
Comments in robots.txt
Comments in robots.txt are simply written after a “#” sign, for example:
User-agent: * #this is a comment here
# another comment
#
# another one! robots.txt is so much fun…
Disallow: /temp/
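A parser should throw away everything from the “#” to the end of the line. As a quick check, here is the same file run through Python’s urllib.robotparser; the bot name and URLs are placeholders, and other parsers may be stricter or looser about comments.

```python
from urllib.robotparser import RobotFileParser

# The commented file from above, split into lines.
rules = """\
User-agent: * #this is a comment here
# another comment
#
# another one! robots.txt is so much fun…
Disallow: /temp/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# The comments are ignored; only the Disallow rule remains.
temp_allowed = parser.can_fetch("googlebot", "http://domain.com/temp/file.txt")
home_allowed = parser.can_fetch("googlebot", "http://domain.com/index.html")
print(temp_allowed, home_allowed)
```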
Other things to add to robots.txt
An Extended Standard for Robot Exclusion has been proposed, adding features such as request rate (number of requests per number of seconds) and visit time (the hours during which the bot should visit, such as 0400-0800). Some bots will follow it, but you can’t count on it (at the moment, anyway). Here is a basic example of it in action:
User-agent: *
Disallow: /temp/
Request-rate: 1/3
Visit-time: 0400-0800
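For a robot written in Python, the standard urllib.robotparser module can read the Request-rate line (Python 3.6 and later). As far as I know it has no accessor for Visit-time, so that line is simply skipped. A rough sketch, with a placeholder bot name:

```python
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /temp/",
    "Request-rate: 1/3",
    "Visit-time: 0400-0800",  # not understood by this parser, so ignored
]

parser = RobotFileParser()
parser.parse(rules)

# request_rate() returns a named tuple (requests, seconds), or None
# if no rate applies to this robot.
rate = parser.request_rate("googlebot")
if rate is not None:
    print(f"{rate.requests} request(s) every {rate.seconds} second(s)")
```

A polite crawler would then sleep between requests to stay under that rate.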
Further Reading/Related Links
- Robots.txt Generator – a neat robots.txt generator
- Robots.txt blog on WebmasterWorld, which makes very good use of the comments feature…

