Is robots.txt helpful?
This plain-text file, written with the line-ending convention of the server's operating system (Unix if the server runs Unix), must be stored at the root of the website.
It tells search engines which pages may be indexed and which pages or directories must not be added to the index.
It is not an official standard, but a common convention that crawlers follow. Note that excluding a page in the robots.txt file does not imply it will be removed from the index if it has already been added.
The typical default content of robots.txt is:

User-agent: *
Disallow:

User-agent is the name of a search engine crawler, and Disallow specifies the full pathname (starting with /) of a page or directory that you want to exclude from the index. Note that

User-agent: *
Disallow: /

excludes the whole site from being indexed!
To exclude the cgi directory, the format will be:
User-agent: *
Disallow: /cgi-bin/
To exclude a file:
User-agent: *
Disallow: /rep/filename.html
The names are case-sensitive. Do not put multiple filenames or crawler names
on the same line: use several User-agent + Disallow groups, or several Disallow
lines under the same User-agent.
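For instance, a single group with several Disallow lines might look like this (the paths here are only examples):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /rep/filename.html
```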
Note: do not insert blank lines within a group; any line that is not a directive must begin with the # comment character. Likewise, do not store an empty file under the name robots.txt.
The validity of a robots.txt file can be checked from the Google webmaster tools.
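The rules can also be tested locally with Python's standard urllib.robotparser module; a minimal sketch, where the rules and URLs are just illustrative examples:

```python
from urllib.robotparser import RobotFileParser

# Example rules, in the same form as the directives above.
rules = """\
User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# can_fetch(useragent, url) tells whether a crawler may fetch the URL.
print(parser.can_fetch("*", "http://example.com/cgi-bin/test.cgi"))  # False
print(parser.can_fetch("*", "http://example.com/index.html"))        # True
```

This only checks the syntax and the matching logic; it does not tell you how a given search engine will actually interpret the file.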
How does Google use robots.txt?
A full explanation was given by Matt Cutts in a video.
- If a page is disallowed, the Google robot ignores it and does not crawl its content.
- But if this page has backlinks, it may still appear in the results (disallow does not mean noindex). The anchor text of links pointing to the page is used for its description.
- Possibly, if this page is listed in Dmoz (i.e. the ODP), the Dmoz description can be included in Google's results page.
- To de-index a page, use the noindex value in the robots meta tag.
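Such a tag is placed in the head section of the page to be removed from the index; a minimal sketch:

```html
<!-- In the <head> of the page to de-index -->
<meta name="robots" content="noindex">
```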