What is a robots txt file used for?


The Robots Exclusion Protocol (robots.txt) is a text file that tells search engine spiders which pages or sections of your website they should not index. For instance, you may not want search engines to spider your site while it's under construction, or you may want to prevent a particular directory from being indexed.

What is a robots txt file used for?

1. Prevent indexing of specific web pages

When building a website there may be certain pages or sections that are of no interest to your readers. For example, you don't need search engines to index your terms and conditions or privacy policy pages. Crawling them just takes up more of the spiders' time on your website, and one of Google's main ranking factors is how fast your pages load. Using a robots.txt file you can prevent search engine spiders from crawling these pages, reducing the crawl load on your website as well as your bandwidth.
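For instance (a minimal sketch, assuming those pages live at /terms.html and /privacy.html; substitute the actual paths on your own site) you could add:

User-agent: *
Disallow: /terms.html
Disallow: /privacy.html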

2. Prevent indexing of specific directories

Most websites contain directories that don't need to be crawled by the search engines. For instance, your cgi-bin directory usually contains scripts for contact forms and the like. It offers no value to your visitors, so adding it to your robots.txt file means the search engines don't have to spend time crawling this directory. This also reduces the load on your site.
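A minimal rule for this, assuming the directory sits at /cgi-bin/ off your site root, would look like:

User-agent: *
Disallow: /cgi-bin/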

3. Exclude bots or spiders

If you want to block a particular bot or search engine, for example one that harvests e-mail addresses, you add specific code for that bot to your robots.txt file.
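For example, a sketch that blocks one bot from the whole site might look like the lines below; 'EmailSiphon' is only an illustrative user-agent name, so replace it with the name the bot you want to block actually announces itself with:

User-agent: EmailSiphon
Disallow: /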

4. Avoid duplicate content penalties

Google and other search engines want people to receive the most relevant results for their keywords. They don't want to show exactly the same information on multiple web pages or websites, so when they find duplicate content they usually rank only one version of the page. The others may not rank at all.

Example:
You have two web pages that contain the same content, except one is for viewing online and the other is for printing. To prevent duplicate content appearing in the search engines, add some code to your robots.txt file to prevent indexing of the print version.
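A possible rule for this, assuming the print version lives at /print/article.html (use the actual path of your own print page):

User-agent: *
Disallow: /print/article.html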

5. Save bandwidth

By excluding certain files from being crawled you'll reduce bandwidth and speed up the loading of your website. For example, you may want to exclude large images, stylesheets and JavaScript files so the search engines spend their time on your most important pages.
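For instance, assuming your stylesheets and scripts live in /css/ and /js/ directories (adjust the paths to match your own site), you could add:

User-agent: *
Disallow: /css/
Disallow: /js/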

How does robots.txt work?

A search engine robot first checks your robots.txt file before it visits other pages on the server. For example, if your website URL is http://www.example.com/about.html, it will first check http://www.example.com/robots.txt.

How to create a robots txt file

* Open Notepad or any other text editor
* Place these two lines in it

User-agent: *
Disallow: /

User-agent: * means this section applies to all robots.
Disallow: / tells the robots that they should not visit any pages on the site. The Disallow line can be repeated for each directory or file you want to exclude, and a separate User-agent section can be added for each spider or bot you want to give its own rules (see the sketch after these steps).

* Save the file as robots.txt
* Upload it to the root of your server (same location as your index.html file)
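To give a feel for how these lines can be repeated, here is a sketch of a robots.txt file with two sections; the bot name and the directory paths are only placeholders:

# Keep one particular bot out of the whole site
User-agent: BadBot
Disallow: /

# Keep all other robots out of a few directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/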

Examples

1. Exclude a file from being indexed by Google

If you have a file named about.htm in a directory called 'private' that you do not wish to be indexed by Google, add these lines to your robots.txt file:

User-agent: Googlebot
Disallow: /private/about.htm


2. Exclude a section of your website from all spiders and bots

Say you are building a new website in a subdirectory of your current site and you want to prevent ALL search engines from indexing this directory. In this case use the wildcard character * in the User-agent line so the rule applies to them all.

User-agent: *
Disallow: /subdirectory/

Notice there is a forward slash at the beginning and end of the directory name. This tells the robots not to index any of the files inside that directory.
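Leaving the trailing slash off makes the rule broader, because under the original robots exclusion standard Disallow values are matched as path prefixes. The two variants below (each shown without its User-agent line for brevity) illustrate the difference:

# Blocks only URLs inside the directory, e.g. /subdirectory/page.html
Disallow: /subdirectory/

# Also blocks any path that starts with the same text, e.g. /subdirectory.html
Disallow: /subdirectory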

3. Allow all spiders to index everything

Use the wildcard symbol * in the User-agent line, together with an empty Disallow line, to allow all search engine spiders to index your website files.

User-agent: *
Disallow:

4. Prevent all search engine robots from indexing your images folder

User-agent: *
Disallow: /images/

Robots Checker
Use the robots checker to validate your robots.txt file:
http://tool.motoricerca.info/robots-checker.phtml

It's easy to make a mistake when creating your robots.txt file, such as omitting a slash or a colon. Use the validator to make sure everything is correct.

Important tips

  • The robots.txt file is not a security method. Hundreds of bots and spiders constantly crawl the Internet; most will respect your robots.txt file, but some will not.
  • Even though the robots.txt file tells search engines which files and directories not to visit, it can still be viewed by the public. This means anyone can see which files and directories you don't want crawled on your site.
  • Malware robots and email address harvesters that scan the web for security vulnerabilities and addresses to collect typically ignore your robots.txt file altogether.
  • If you want to keep your information confidential, create a password-protected directory on your server.

Resources
Robots Generator
Free Robots.txt Creator & Validator

Frequently asked questions
http://www.robotstxt.org/faq.html

Block or remove pages using a robots.txt file (Google)
http://www.google.com/support/webmasters/bin/answer.py?answer=156449
