What is a robots.txt file?

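Sometimes you may not want the search engines to spider specific directories of your site because you don’t want that content to show up in search results. You can accomplish this by creating a robots.txt file and listing those directories in it. Keep in mind that robots.txt is only a request to well-behaved spiders; it does not stop anyone from opening the files directly.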


The robots.txt file tells search engine spiders which pages of your site they may crawl and which pages they should ignore.

The robots.txt file is a simple text file (not an HTML file) that can be created in Notepad or any other plain-text editor. It must be placed in the root directory of your site, for example:

http://www.yourwebsite.com/robots.txt
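
Spiders look for the file only at that one location, no matter which page of the site they start from. As a rough sketch in Python (the function name and the page address are made up for illustration), the location can be worked out from any page URL with the standard urllib.parse module:

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    # The leading slash makes urljoin drop the page's own path and
    # resolve against the root of the same scheme and host.
    return urljoin(page_url, "/robots.txt")


# Hypothetical page deep inside the site.
print(robots_url("http://www.yourwebsite.com/products/widgets.html"))
```

This prints http://www.yourwebsite.com/robots.txt. A robots.txt file placed in a subdirectory is never requested, so it has no effect.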

Benefits of a robots.txt file

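All the major search engines look for the robots.txt file on your site. I recommend including a robots.txt file even if you don’t need to prevent spiders from accessing any part of your site. Spiders request the file before they crawl, so providing one, even one that blocks nothing, avoids “file not found” errors in your server logs and makes it clear that spiders are welcome.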

Here are some situations in which you may want to exclude search engines from all or part of your web site:

1. During the site design

I often create a new web site in a subdirectory of my main site, and I don’t want the client’s site to be spidered while it is being built. An alternative method is to create a password-protected directory, so that the client can access the site only with a user name and password.

2. Prevent certain directories from being crawled

Directories such as your cgi-bin don’t need to be crawled. You may also have a directory containing images you designed that you don’t want to appear in image search results. List these directories in the robots.txt file so that compliant spiders won’t crawl them.

Example:

User-agent: *
Disallow: /images/
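
Before uploading rules like these, it can help to check that they do what you expect. The following is a minimal sketch using Python’s standard urllib.robotparser module; the helper name, the spider name “ExampleBot” and the two test addresses are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the example above.
RULES = """\
User-agent: *
Disallow: /images/
"""


def allowed(rules: str, agent: str, url: str) -> bool:
    """Report whether a compliant spider named agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())  # parse the text directly, no download
    return parser.can_fetch(agent, url)


site = "http://www.yourwebsite.com"
print(allowed(RULES, "ExampleBot", site + "/images/logo.png"))  # False
print(allowed(RULES, "ExampleBot", site + "/index.html"))       # True
```

Because the rules apply to “*”, every compliant spider gets the same answer: anything under /images/ is off limits and the rest of the site stays open.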

3. Exclude specific spiders

You may want to stop certain spiders from accessing all or part of your site. For example, if you don’t want Google to spider part of your site, you can add a section for the Googlebot spider to your robots.txt file.

Here’s an example:

User-agent: googlebot
Disallow: /cgi-bin/

This robots.txt file allows Googlebot to retrieve every page of your site except for files in the “cgi-bin” directory, which it will ignore. Other spiders are not affected by this section.
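
To see how a section aimed at one spider behaves, here is a small sketch that again relies on Python’s standard urllib.robotparser module. The helper name, the spider name “OtherBot” and the file paths are made up for illustration. The second rule set, which uses “Disallow: /”, is what you would use to keep Googlebot away from the whole site rather than just one directory.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(rules: str, agent: str, path: str) -> bool:
    """Check a path on the example site against robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, "http://www.yourwebsite.com" + path)


# The rules from the example above: Googlebot is kept out of cgi-bin only.
CGI_ONLY = "User-agent: googlebot\nDisallow: /cgi-bin/\n"
# A stricter variant: Googlebot is asked to stay away from every page.
WHOLE_SITE = "User-agent: googlebot\nDisallow: /\n"

print(can_crawl(CGI_ONLY, "Googlebot", "/cgi-bin/form.cgi"))  # False
print(can_crawl(CGI_ONLY, "Googlebot", "/index.html"))        # True
print(can_crawl(CGI_ONLY, "OtherBot", "/cgi-bin/form.cgi"))   # True
print(can_crawl(WHOLE_SITE, "Googlebot", "/index.html"))      # False
```

The third line shows that a spider not named in the file is unrestricted. To keep every spider out of a directory, use “User-agent: *” instead.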
