Is robots.txt helpful?
This plain-text file must be saved in the text format of the server's operating system (Unix format if the server runs Unix) and stored at the root of the website.
It tells search engines which pages should be indexed or excluded.
The typical default content of robots.txt is:
User-agent: *
Disallow:
User-agent is the name of a search engine crawler (* matches every crawler), and Disallow specifies the full path (starting with /) of a page or directory that you want to exclude from the index. An empty Disallow value excludes nothing. Note that
Disallow: /
excludes the whole site from being indexed!
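The difference between the two forms above can be checked with a short script. This is a minimal sketch using Python's standard urllib.robotparser module; the helper name allowed and the example.com URLs are assumptions made for illustration, and real crawlers may interpret rules slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Empty Disallow value: nothing is excluded.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
# Disallow: / excludes every path on the site.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

print(allowed(ALLOW_ALL, "https://example.com/page.html"))  # True
print(allowed(BLOCK_ALL, "https://example.com/page.html"))  # False
```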
To exclude the cgi-bin directory, the format is:
User-agent: *
Disallow: /cgi-bin/
To exclude a file:
User-agent: *
Disallow: /rep/filename.html
Names are case-sensitive. Do not put multiple filenames or crawlers
on the same line. Instead, use several User-agent + Disallow groups, or several Disallow
lines under the same User-agent.
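As an illustration of these rules, the sketch below combines two Disallow lines under one User-agent with a second group for a single crawler, then tests a few paths with Python's standard urllib.robotparser. The crawler names ExampleBot and AnyBot, the helper may_fetch, and the example.com host are hypothetical, and the result shows how this particular parser reads the file, not necessarily how every search engine does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: two Disallow lines under one User-agent,
# then a separate group for a crawler called "ExampleBot".
# A comment line (not a blank line) separates the two groups.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /rep/filename.html
# Group for one specific crawler
User-agent: ExampleBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` on the example host."""
    return parser.can_fetch(agent, "https://example.com" + path)


print(may_fetch("AnyBot", "/cgi-bin/script.pl"))   # False: directory excluded
print(may_fetch("AnyBot", "/rep/filename.html"))   # False: file excluded
print(may_fetch("AnyBot", "/rep/Filename.html"))   # True: paths are case-sensitive
print(may_fetch("ExampleBot", "/index.html"))      # False: whole site excluded
```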
Note: Do not include a blank line unless it starts with the # comment
character. Consequently, do not store an empty file under the name
robots.txt.
The validity of a robots.txt file can be checked with Google's webmaster
tools.
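The formatting advice given in this article can also be checked locally before uploading the file. The function below is a hypothetical helper, not Google's validator: it only flags an empty file, blank lines without #, and User-agent or Disallow lines that carry more than one value. The blank-line check mirrors this article's recommendation and is not an official rule.

```python
def lint_robots(text: str) -> list[str]:
    """Return warnings for a robots.txt body, based on this article's advice."""
    if not text.strip():
        return ["file is empty"]
    warnings = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop any trailing comment
        if not line:
            # A pure comment line is fine; a truly blank line is flagged.
            if not raw.lstrip().startswith("#"):
                warnings.append(f"line {number}: blank line without '#'")
            continue
        field, separator, value = line.partition(":")
        if not separator:
            warnings.append(f"line {number}: missing ':' separator")
            continue
        if field.strip().lower() in ("user-agent", "disallow"):
            if len(value.split()) > 1:
                warnings.append(f"line {number}: more than one value on the line")
    return warnings


print(lint_robots("User-agent: *\nDisallow: /a/ /b/\n"))
```

A file that follows the advice, such as "User-agent: *" followed by "Disallow: /cgi-bin/", produces an empty list.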
How does Google use robots.txt?
A full explanation was given by Matt Cutts in a video.
- If a page is disallowed, Google's robot ignores it and does not crawl its content.
- But if this page has backlinks, it may still appear in results (disallow does not mean noindex). The anchor text of links to this page will be used for the description.
- If this page is listed in DMOZ (i.e. the ODP), the DMOZ description may be shown on Google's results page.
- To de-index a page, use the noindex value in the robots meta tag.
- Google and robots.txt. The video.
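The robots meta tag mentioned in the list above is written in the head of the page as <meta name="robots" content="noindex">. The sketch below, built on Python's standard html.parser module, shows one way to verify that a page carries this directive; the class name RobotsMetaFinder, the function has_noindex, and the sample page are illustrative assumptions.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives found in every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None when written without a value.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for directive in attributes.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the page asks search engines not to index it."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


PAGE = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
print(has_noindex(PAGE))  # True
```

A crawler can only read this tag if it is allowed to fetch the page, so a page meant to be de-indexed this way should not also be disallowed in robots.txt.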