<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>London SEO &#187; robots.txt</title>
	<atom:link href="http://london-seo.com/seo/robotstxt/feed/" rel="self" type="application/rss+xml" />
	<link>http://london-seo.com</link>
	<description>London Search Engine Optimisation</description>
	<lastBuildDate>Thu, 03 Apr 2008 18:38:44 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>How to block pages from robots using robots.txt</title>
		<link>http://london-seo.com/hot-to-block-pages-from-bots-robotstxt/57/</link>
		<comments>http://london-seo.com/hot-to-block-pages-from-bots-robotstxt/57/#comments</comments>
		<pubDate>Sun, 26 Nov 2006 17:52:17 +0000</pubDate>
		<dc:creator>search engine optimiser</dc:creator>
				<category><![CDATA[Hosting and Hosts]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Techniques]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[robots.txt]]></category>

		<guid isPermaLink="false">http://london-seo.com/hot-to-block-pages-from-bots-robotstxt/57/</guid>
		<description><![CDATA[There are lots of reasons why you may want to do this: to keep crawlers out of your admin area, hide some nasty scripts or save on bandwidth (yeah, I&#8217;ve heard of it being done&#8230;). Of course, robots.txt is public, so all people have to do is look at it to find a load of directories that you don&#8217;t want [...]]]></description>
			<content:encoded><![CDATA[<p>There are lots of reasons why you may want to do this: to keep crawlers out of your admin area, hide some nasty scripts or save on bandwidth (yeah, I&#8217;ve heard of it being done&#8230;). Of course, robots.txt is public, so all people have to do is look at it to find a load of directories that you don&#8217;t want found&#8230; (Read more to see full post)</p>
<p><span id="more-57"></span></p>
<p><strong>What is robots.txt</strong></p>
<p>I&#8217;ll let Wikipedia explain:</p>
<blockquote><p>The robots exclusion standard or robots.txt protocol is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website. The information specifying the parts that should not be accessed is specified in a file called robots.txt in the top-level directory of the website.</p></blockquote>
<p>It normally lives in your root directory, for example http://domain.com/robots.txt. Most robots will read it; some spammy or nasty robots won&#8217;t, and there isn&#8217;t much you can do about that with robots.txt itself. You could use an .htaccess file or a script to block nasty bots, but that is outside the scope of this article.</p>
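<p>As a rough sketch of the root-directory rule, here is one way to work out where a site&#8217;s robots.txt should be from any page URL. The helper name <em>robots_url</em> and the page address are made up for illustration; only <em>urljoin</em> comes from the Python standard library.</p>

```python
from urllib.parse import urljoin


def robots_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    # The leading slash makes urljoin drop the page's own path and keep
    # only the scheme and host, which is where robots.txt has to live.
    return urljoin(page_url, "/robots.txt")


print(robots_url("http://domain.com/blog/2006/post.html"))
```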
<p><strong>How to block all robots from all of your pages</strong></p>
<p>This is quite simple, although I doubt you would want to use it much in this basic form:</p>
<blockquote><p>User-agent: *<br />
Disallow: /</p></blockquote>
<p><em>User-agent: *</em> matches all user agents (* is a wildcard that stands for any name of any length), and <em>Disallow: /</em> blocks every path on the site. There are many pages on the internet which list user agents, such as this one with <a href="http://www.psychedelix.com/agents/index.shtml">a massive list of user agents</a>.</p>
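<p>If you want to see how a well-behaved robot would read those two lines, Python&#8217;s standard <em>urllib.robotparser</em> module can be fed the rules directly. This is only a sketch: the agent names and the page URL are placeholders, and real crawlers may interpret things slightly differently.</p>

```python
from urllib.robotparser import RobotFileParser

# The two-line "block everything" file.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(BLOCK_ALL.splitlines())

# Every agent name falls under "*", so each one is refused.
for agent in ("googlebot", "msnbot", "some-other-bot"):
    print(agent, parser.can_fetch(agent, "http://domain.com/index.html"))
```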
<p><strong>User agent info:</strong><br />
Some common search engine user agents that you may be interested in are as follows:</p>
<ul>
<li>Main web search (some also image search):
<ul>
<li>Google: <em>googlebot</em></li>
<li>MSN: <em>msnbot</em></li>
<li>Yahoo: <em>slurp</em></li>
<li>Ask: <em>teoma</em></li>
<li>Alexa: <em>ia_archiver</em></li>
</ul>
</li>
<li>Image search:
<ul>
<li>Google: <em>googlebot-image</em></li>
<li>MSN Pic search: <em>psbot</em></li>
</ul>
</li>
</ul>
<p>So if you wanted to block just Alexa from all directories, you would use:</p>
<p style="margin-left: 40px">User-agent: ia_archiver<br />
Disallow: /</p>
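<p>A quick way to convince yourself that only the named robot is affected is to run the rules through Python&#8217;s <em>urllib.robotparser</em>. Treat this as a sketch: the page URL is a placeholder, and the variable names are mine.</p>

```python
from urllib.robotparser import RobotFileParser

ALEXA_ONLY = """\
User-agent: ia_archiver
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ALEXA_ONLY.splitlines())

page = "http://domain.com/index.html"
# The named agent is blocked; agents with no matching record are allowed.
alexa_allowed = parser.can_fetch("ia_archiver", page)
google_allowed = parser.can_fetch("googlebot", page)
print("ia_archiver:", alexa_allowed)
print("googlebot:", google_allowed)
```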
<p><strong>How to block a certain directory (or more than one) using robots.txt</strong></p>
<p>This isn&#8217;t very different from the previous example. The following blocks all user agents from /cgi-bin/ and /private/:</p>
<p style="margin-left: 40px">User-agent: *<br />
Disallow: /cgi-bin/<br />
Disallow: /private/</p>
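<p>Each Disallow value is matched as a path prefix, which the sketch below tries to show using Python&#8217;s <em>urllib.robotparser</em>. The file names are invented for the example. Note the last check: with this parser, /private without the trailing slash does not start with /private/, so it is not blocked.</p>

```python
from urllib.robotparser import RobotFileParser

BLOCK_DIRS = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(BLOCK_DIRS.splitlines())

# Map each sample path to whether a robot may fetch it.
paths = ("/private/notes.html", "/cgi-bin/form.pl", "/index.html", "/private")
checks = {
    path: parser.can_fetch("googlebot", "http://domain.com" + path)
    for path in paths
}
for path, allowed in checks.items():
    print(path, "allowed" if allowed else "blocked")
```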
<p><strong>Comments in robots.txt</strong></p>
<p>Comments in robots.txt are simply written after a &#8220;#&#8221; sign, for example:</p>
<p style="margin-left: 40px">User-agent: * #this is a comment here<br />
# another comment<br />
#<br />
# another one! robots.txt is so much fun&#8230;<br />
Disallow: /temp/</p>
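<p>To check that the comments really are ignored, the same file can be run through Python&#8217;s <em>urllib.robotparser</em>, which drops everything from the # to the end of each line. A small sketch, with placeholder URLs:</p>

```python
from urllib.robotparser import RobotFileParser

WITH_COMMENTS = """\
User-agent: * #this is a comment here
# another comment
#
# another one! robots.txt is so much fun...
Disallow: /temp/
"""

parser = RobotFileParser()
parser.parse(WITH_COMMENTS.splitlines())

# The rule still applies to all agents despite the comment lines in between.
temp_allowed = parser.can_fetch("googlebot", "http://domain.com/temp/cache.html")
home_allowed = parser.can_fetch("googlebot", "http://domain.com/index.html")
print("temp:", temp_allowed)
print("home:", home_allowed)
```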
<p><strong>Other things to add to robots.txt</strong></p>
<p>An <a title="http://www.conman.org/people/spc/robots2.html" class="external text" href="http://www.conman.org/people/spc/robots2.html">Extended Standard for Robot Exclusion</a> has been proposed, adding features such as request rate (number of requests per number of seconds) and visit time (the window in which the bot should visit, such as 0400-0800). Some bots will follow it, but you can&#8217;t count on it at the moment. Here is a basic example of it in action:</p>
<p style="margin-left: 40px">User-agent: *<br />
Disallow: /temp/<br />
Request-rate: 1/3<br />
Visit-time: 0400-0800</p>
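<p>Support for these extras varies from robot to robot. As one data point, Python&#8217;s <em>urllib.robotparser</em> (version 3.6 or later) exposes Request-rate through <em>request_rate()</em> but, as far as I can tell, simply skips Visit-time. The sketch below assumes that behaviour; the agent name and URL are placeholders.</p>

```python
from urllib.robotparser import RobotFileParser

EXTENDED = """\
User-agent: *
Disallow: /temp/
Request-rate: 1/3
Visit-time: 0400-0800
"""

parser = RobotFileParser()
parser.parse(EXTENDED.splitlines())

# request_rate() returns a named tuple with .requests and .seconds.
rate = parser.request_rate("googlebot")
print("requests:", rate.requests, "per seconds:", rate.seconds)

# The ordinary Disallow rule keeps working alongside the extra fields.
temp_allowed = parser.can_fetch("googlebot", "http://domain.com/temp/a.html")
print("temp allowed:", temp_allowed)
```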
<p><strong>Further Reading/Related Links</strong></p>
<ul>
<li><a href="http://www.mcanerin.com/search-engine/robots-txt.htm">Robots.txt Generator</a> &#8211; a neat robots.txt generator</li>
<li><a href="http://www.webmasterworld.com/robots.txt">Robots.txt blog</a> on WebmasterWorld, which makes very good use of the comments feature&#8230;</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://london-seo.com/hot-to-block-pages-from-bots-robotstxt/57/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
