Robot Exclusion

Source: http://www.robotstxt.org/
Source: http://www.indiaseos.com/user-agent-robot-txt.htm

Robots are programs that traverse many pages automatically, recursively retrieving linked pages. They are also called WWW Robots, Spiders, or Crawlers.

They were especially useful in the dial-up era, when calls were expensive: a cheaper solution was to download all the texts you wanted to read (newspapers, books, etc.) to your computer and then hang up the phone line, saving some money on the phone bill.

One popular program I used in those days was webmirror (http://www.bmtmicro.com/BMTCatalog/multipleos/webmirror.html).

Around 1993 and 1994 there were occasions where robots visited web servers where they weren't welcome, for various reasons. Some of these reasons were robot-specific: certain robots swamped servers with rapid-fire requests, retrieved the same files repeatedly, or descended into very deep virtual trees.

These incidents showed the need for an established mechanism by which web servers could tell robots which parts of the server should not be accessed.

The solution chosen to keep robots away from sensitive areas of a server was to create a file on the server that specifies an access policy for robots. This file must be accessible via HTTP at the local URL "/robots.txt".
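As an illustration, here is a minimal robots.txt in the format defined by the exclusion standard. The robot name and paths are made-up examples; the User-agent and Disallow directives themselves are the standard ones:

    # Keep one (hypothetical) robot out entirely, and everyone out of /private/
    User-agent: BadBot
    Disallow: /

    User-agent: *
    Disallow: /private/
    Disallow: /tmp/

Each record applies to the robots whose names match its User-agent line; the wildcard * record covers every robot not matched by a more specific one.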

This approach was chosen because it can be easily implemented on any existing WWW server, and a robot can find the access policy with only a single document retrieval. Note, however, that this control is implemented in the robot itself and can be deactivated, so compliance is entirely voluntary.
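To sketch what that single retrieval looks like from the robot's side, here is a small example using Python's standard urllib.robotparser module. The host and paths are made up for illustration:

    import urllib.robotparser

    # Fetch the site's access policy with a single document retrieval.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")  # example host, not from the article
    rp.read()

    # A well-behaved robot asks before fetching each URL.
    print(rp.can_fetch("*", "http://www.example.com/private/data.html"))  # False if disallowed
    print(rp.can_fetch("*", "http://www.example.com/index.html"))         # True if allowed

A robot that skips this check will still be served the pages, which is exactly the limitation noted above.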
