Technical information about search engines

Here are some technical insights to help you manage your site in accordance with search engines' rules...


Accessing foreign versions of Google

To avoid being automatically redirected to the local version of the search engine, a parameter must be added to the URL. For example, from France you are redirected to the French version of Google. If you do not want to be automatically redirected to your country's version, type:

www.google.com/ncr

"ncr" stands for "no country redirect". This address may be saved as a bookmark.

Excluding a page from the index

Insert this meta tag between <head> and </head> in the HTML page:

<meta name="robots" content="noindex" />
A robots.txt file at the root of the site can also contain rules telling search engines not to parse certain files or directories, but those pages will not be removed from the index if links point to them.

The link operator in the search bar (link:site-name) is a command that displays the links pointing to a site. In fact this command returns only a fraction of the backlinks, in order to save server bandwidth.
The choice of results is essentially random, as Matt Cutts confirmed in a video on YouTube: the links shown have nothing to do with the ranking or the quality of the page; they are picked at random.

You can modify a snippet

This is the name that Google gives to the description displayed under the title of a page in search results. It is possible to change this text and make it more attractive, in particular with the meta description tag.

The answer is given by Google on its blog for webmasters, in the article entitled "Improve snippets with a meta description makeover".
The meta description must be unique for each page and must give details about it. It should contain keywords related to the page's content.

  ... some tags ...
  <meta name="description" content="legible and useful information">

The text assigned to the content attribute must be a real description of the page.

Managing site downtime

Provided we know in advance that the site will be unavailable...
If the outage is planned, the ideal is to return HTTP code 503, which is defined for this situation. In PHP, the code at the top of the home page (or of all pages, in the case of a CMS) can look like this:

header('HTTP/1.1 503 Service Temporarily Unavailable');
header('Retry-After: Mon, 25 Jan 2011 12:00:00 GMT');

This code is supplied by Google.
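The Retry-After value must be an HTTP date (or a number of seconds). As a minimal sketch, assuming a planned one-hour outage, the date can be generated in the required format with Python's standard library:

```python
import time
from email.utils import formatdate

# Assumption: the outage is expected to last about one hour.
# formatdate(..., usegmt=True) produces an RFC 1123 date such as
# "Mon, 25 Jan 2021 12:00:00 GMT", the format HTTP date headers expect.
retry_at = formatdate(time.time() + 3600, usegmt=True)
print("Retry-After: " + retry_at)
```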

Managing duplicate content

The same content may be present on several pages of the same site, or on different sites; this can happen when different URLs point to the same page, or when pages are copied.
Google answered the case of scrapers on its blog. There are two possible cases. When a page is duplicated within the same site, simply tell Google which page to index, either by putting its URL in the sitemap or by using the canonical tag; the other copy will be ignored.
Incorporating a portion of an article from another site into your own site is a penalty factor, unless it is a quotation placed in a <blockquote> tag. Quotations must be accompanied by personal text.
More about duplicate content.
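As a sketch of the canonical tag mentioned above (the URL is a placeholder), the preferred address is declared in the <head> of the duplicated page:

```html
<head>
  <link rel="canonical" href="http://www.example.com/article.html" />
</head>
```

Search engines then treat the page carrying this tag as a copy of the given URL.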

Managing a design change

Some webmasters have experienced a loss of ranking after changing the design of a site without changing its content, immediately after the passage of Googlebot.
This experience has been shared on WebmasterWorld. The ranking returns to its previous state after a variable delay. A massive change probably raises some signal in the engine.
Advice from Google: do not change the design at the same time as you change the domain and redirect the pages. Change the design little by little; if something leads to a penalty, it will be easier to see what it is.

robots.txt may be useful

This file informs search engines which pages should be indexed, and which pages or directories must not be added to the index. Even if a page is excluded in the robots.txt file, that does not imply it will be removed from the index if it has already been added. The typical default content of robots.txt, which allows everything, is:

User-agent: *
Disallow:

User-agent gives the name of a search engine crawler and Disallow specifies the full pathname (with / at the beginning) of a page or directory that you want to exclude from the index. Note that:

User-agent: *
Disallow: /

excludes the whole site from being indexed!
To exclude the cgi directory, the format will be:

Disallow: /cgi-bin/

To exclude a file:

Disallow: /rep/filename.html 

The names are case-sensitive. Do not put multiple filenames or crawlers on the same line; use several User-agent+Disallow groups instead, or several Disallow lines under the same User-agent.
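For example, a robots.txt with several groups could look like this (the directory names and the sitemap address are placeholders):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/

User-agent: Googlebot
Disallow: /test/

Sitemap: http://www.example.com/sitemap.xml
```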
Note: do not insert blank lines within a group; only comment lines beginning with # are allowed there. Likewise, do not store an empty file under the name robots.txt.
It is possible to check the validity of a robots.txt file with Google's webmaster tools.
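To check how a crawler interprets these rules, here is a small sketch using Python's standard urllib.robotparser module (the rules and URLs are hypothetical):

```python
import urllib.robotparser

# Hypothetical robots.txt content, given as a list of lines.
rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# Paths under /cgi-bin/ are disallowed; everything else is allowed.
print(rp.can_fetch("*", "http://www.example.com/cgi-bin/script.cgi"))  # False
print(rp.can_fetch("*", "http://www.example.com/index.html"))          # True
```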

Sitemaps are useful

The sitemap is a standard file, in XML or HTML format, containing a list of all the pages of a site in the form of URLs; the XML version is intended for search engines, the HTML version for visitors. The sitemap can be generated automatically by a CMS, or with a script such as simple map on a static site.
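As a minimal sketch, an XML sitemap listing two pages (the URLs and dates are placeholders) looks like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2011-01-25</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/page.html</loc>
    <lastmod>2011-01-20</lastmod>
  </url>
</urlset>
```

Only loc is required for each url entry; the other elements are optional hints.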

  1. The main interest of the sitemap is to facilitate the work of the search engines.
  2. Dynamic links are ignored by search engine robots. An XML or HTML sitemap creates a static link for each of them.
  3. If all the properties of the items in the sitemap are identical (priority, frequency, last modified), an XML sitemap is of little interest.
  4. The XML sitemap can now be used by all the major search engines.
  5. The sitemap must be rebuilt whenever the site's content changes, but it needs to be registered only once.
  6. Once the sitemap is registered, it is possible to obtain statistics and an analysis of the site from Google, including the errors encountered.
  7. The address of the XML sitemap can be placed in the robots.txt file.
  8. There is a special sitemap format for indexing videos.
  9. In conclusion, build an XML sitemap if your site is poorly indexed or not re-indexed quickly, or if you want statistical information.

Getting more information about Googlebot

Googlebot is Google's crawler. It may parse some pages of your site every day. This Googlebot FAQ gives details on how it works.