London SEO

26
Nov

There are lots of reasons why you may want to keep robots out of parts of your site: to keep your admin area out of the search results, to hide some nasty scripts, or to save on bandwidth (yeah, I’ve heard of it being done…). Of course, all anyone has to do is look at your robots.txt to find a load of directories that you don’t want found.

What is robots.txt

I’ll let Wikipedia explain:

The robots exclusion standard or robots.txt protocol is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website. The information specifying the parts that should not be accessed is specified in a file called robots.txt in the top-level directory of the website.

It normally lives in your root directory, for example http://domain.com/robots.txt. Most robots will read it; some of the spammy or nasty ones won’t, and there isn’t much you can do about that with robots.txt itself. You could use an .htaccess file or a script to block nasty bots, but that is outside the scope of this article.
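Since robots.txt always sits at the top level of the site, you can work out where it lives from any page URL. Here is a minimal Python sketch of that idea; the function name robots_url and the sample page URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; drop the path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://domain.com/blog/some-post/?page=2"))
# http://domain.com/robots.txt
```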

How to block all robots from all of your pages

This is quite simple, although at this basic level I doubt you would want to use it much:

User-agent: *
Disallow: /

User-agent: * matches all user agents (* is a wildcard standing for any robot), and Disallow: / covers every path on the site, because every path starts with /. There are many pages on the internet with massive lists of user agents if you need to look one up.
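If you want to see how a well-behaved robot would read these two lines, Python's standard urllib.robotparser module can parse them. This is only a sketch of one parser's behaviour, not a guarantee about any particular search engine; the bot names and the page URL are just sample values.

```python
from urllib.robotparser import RobotFileParser

BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

block_all = RobotFileParser()
block_all.parse(BLOCK_ALL.splitlines())

# Every user agent is refused, whatever page it asks for.
for agent in ("googlebot", "msnbot", "some-other-bot"):
    print(agent, block_all.can_fetch(agent, "http://domain.com/any/page.html"))
# googlebot False
# msnbot False
# some-other-bot False
```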

User agent info:
Some common search engine user agents that you may be interested in are as follows:

  • Main web search (some also image search):
    • Google: googlebot
    • MSN: msnbot
    • Yahoo: yahoo-slurp
    • Ask: teoma
    • Alexa: ia_archiver
  • Image search:
    • Google: googlebot-image
    • MSN Pic search: psbot

So if you wanted to block just Alexa from all directories, you would use:

User-agent: ia_archiver
Disallow: /
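To check that a rule like this blocks only the named robot, you can feed it to Python's urllib.robotparser. Again, this is a rough sketch using one parser; the page URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

ALEXA_ONLY = """\
User-agent: ia_archiver
Disallow: /
"""

alexa_only = RobotFileParser()
alexa_only.parse(ALEXA_ONLY.splitlines())

page = "http://domain.com/any/page.html"
print(alexa_only.can_fetch("ia_archiver", page))  # False: Alexa is blocked
print(alexa_only.can_fetch("googlebot", page))    # True: no rule applies to it
```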

How to block a certain directory (or more than one) using robots.txt

This isn’t very different from the previous example. This will block all user agents from /private/ and /cgi-bin/:

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
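Each Disallow line is matched against the start of the requested path, so only URLs under those two directories are affected. A quick sketch with Python's urllib.robotparser shows this; the file names below are invented for the example.

```python
from urllib.robotparser import RobotFileParser

TWO_DIRS = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

two_dirs = RobotFileParser()
two_dirs.parse(TWO_DIRS.splitlines())

# Paths inside the listed directories are refused, everything else is allowed.
print(two_dirs.can_fetch("googlebot", "http://domain.com/private/notes.html"))  # False
print(two_dirs.can_fetch("googlebot", "http://domain.com/cgi-bin/form.pl"))     # False
print(two_dirs.can_fetch("googlebot", "http://domain.com/index.html"))          # True
```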

Comments in robots.txt

Comments in robots.txt are simply written after a “#” sign, for example:

User-agent: * #this is a comment here
# another comment
#
# another one! robots.txt is so much fun…
Disallow: /temp/
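A parser should throw away everything from the “#” to the end of the line, so the example above behaves exactly like a two-line file. Here is a small sketch using Python's urllib.robotparser to confirm that; the bot name and URLs are sample values.

```python
from urllib.robotparser import RobotFileParser

WITH_COMMENTS = """\
User-agent: * #this is a comment here
# another comment
#
# another one! robots.txt is so much fun…
Disallow: /temp/
"""

commented = RobotFileParser()
commented.parse(WITH_COMMENTS.splitlines())

# The comments are ignored; only the User-agent and Disallow rules remain.
print(commented.can_fetch("googlebot", "http://domain.com/temp/cache.html"))  # False
print(commented.can_fetch("googlebot", "http://domain.com/index.html"))       # True
```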

Other things to add to robots.txt

An Extended Standard for Robot Exclusion has been proposed, adding features such as request rate (number of requests per number of seconds) and visit time (the hours during which the bot may visit, such as 0400-0800). Some bots will follow it, but you can’t count on it being too handy (at the moment, anyway). Here is a basic example of it in action:

User-agent: *
Disallow: /temp/
Request-rate: 1/3
Visit-time: 0400-0800
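If you are writing a bot yourself, here is a rough sketch of how it might honour these two lines. Python's urllib.robotparser (3.6 and later) reads Request-rate, but as far as I know it ignores Visit-time, so in_visit_window below is a hypothetical helper of my own. It assumes the window does not cross midnight and leaves time zone handling to the caller.

```python
from datetime import time
from urllib.robotparser import RobotFileParser

EXTENDED = """\
User-agent: *
Disallow: /temp/
Request-rate: 1/3
Visit-time: 0400-0800
"""

extended = RobotFileParser()
extended.parse(EXTENDED.splitlines())

# Request-rate: 1/3 means 1 request per 3 seconds.
rate = extended.request_rate("examplebot")
min_gap = rate.seconds / rate.requests  # seconds to wait between requests
print(rate.requests, rate.seconds, min_gap)  # 1 3 3.0


def in_visit_window(spec: str, now: time) -> bool:
    """Hypothetical helper: is `now` inside a window written as HHMM-HHMM?"""
    start, end = (time(int(part[:2]), int(part[2:])) for part in spec.split("-"))
    return start <= now <= end


print(in_visit_window("0400-0800", time(5, 30)))  # True
print(in_visit_window("0400-0800", time(9, 0)))   # False
```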
