Overview of Google's Real Algorithm
Google, the most visited
website in the world, depends on the quality of its search engine, and
all the interest lies in how pages are ranked in the results of queries.
Most webmasters still believe that the algorithm used is PageRank,
but PR is actually only one page-scoring criterion among hundreds of others.
Moreover, the real algorithm is never finished: the teams at Google work
continuously on analyzing the results to correct it and to control the ranking
of Web pages.
A journalist, Saul Hansell, had the opportunity to spend a day with the Google
engineers directly involved in the development of the algorithm, and to take
part in their meeting.
You will understand why websites sometimes disappear from the list of results,
independently of the penalty known as the sandbox.
Why is the algorithm modified?
The team's work is driven by complaints from companies whose sites are ranked
poorly for no apparent reason, and by the team's own analysis of the results.
Note that each of the 10,000 employees at Google has a "Buganizer",
a tool for reporting problems encountered in a query, and that all these problems
are forwarded to the team working on the algorithm.
It was noted, for example, that queries about "French revolution"
led to articles on an election campaign because the candidates spoke about
"revolution"! In this case the correction simply consisted of giving
more weight to the words "French revolution" when the terms appear
together.
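The idea of weighting coupled terms can be illustrated with a minimal sketch. It is not Google's formula: the function name `score`, the `phrase_boost` parameter and the two sample texts are assumptions made up for the illustration.

```python
def score(query, text, phrase_boost=2.0):
    """Toy relevance score: term matches plus a bonus when the terms are adjacent.

    Hypothetical illustration only; phrase_boost is an arbitrary value.
    """
    terms = query.lower().split()
    words = text.lower().split()
    if not terms:
        return 0.0
    # Base score: one point per occurrence of any query term.
    base = sum(words.count(t) for t in terms)
    # Count the positions where the whole query appears as consecutive words.
    n = len(terms)
    phrase_hits = sum(
        1 for i in range(len(words) - n + 1) if words[i:i + n] == terms
    )
    return base + phrase_boost * phrase_hits


campaign = "the candidate promised a revolution in french politics"
history = "the french revolution began in 1789"

print(score("French revolution", campaign))
print(score("French revolution", history))
```

Both texts contain the two words, but only the second contains them side by side, so it receives the extra weight and ranks first.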
Which tools?
The team has a special tool named "Debug", which displays how the
computers evaluate each query and each Web page. One can thus see how much
importance the algorithm gives to the links to a page, and correct it if needed.
Once the problem is identified, a new mathematical formula is developed to
address the case, and it is incorporated into the algorithm.
The dilemma of freshness
A crucial problem for the development team is freshness. Should new pages be
favored, since they are more likely to reflect current
events, or on the contrary older ones, which have already proved their quality
through their number of backlinks?
Google always favored the latter, but recently it realized that this
was not always the right choice, so it had to develop a
new algorithm that determines when the user needs fresh information and
when, on the contrary, the results should remain stable. This is called the QDF formula,
for Query Deserves Freshness.
A subject can be identified as hot when many blogs suddenly start talking
about it, or when there is a sudden surge of queries on the subject.
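A sudden surge of this kind can be detected by comparing recent activity with an earlier baseline. The sketch below is only an assumption about how such a test might look, not the actual QDF formula; the name `deserves_freshness`, the two-day window and the threshold of 3 are arbitrary choices. The counts can be daily queries or daily blog posts on a subject.

```python
from statistics import mean


def deserves_freshness(daily_counts, recent_days=2, spike_ratio=3.0):
    """Return True when recent activity far exceeds the earlier baseline.

    Hypothetical illustration; window and threshold are arbitrary.
    """
    if len(daily_counts) <= recent_days:
        return False  # not enough history to establish a baseline
    baseline = mean(daily_counts[:-recent_days])
    recent = mean(daily_counts[-recent_days:])
    if baseline == 0:
        return recent > 0  # any activity on a previously silent subject
    return recent / baseline >= spike_ratio


steady = [100, 98, 103, 101, 99, 102, 100]
spiking = [100, 98, 103, 101, 99, 420, 510]

print(deserves_freshness(steady))
print(deserves_freshness(spiking))
```

With the steady series the recent average stays close to the baseline, so older pages keep their advantage; with the spiking series the ratio passes the threshold and fresh pages would be favored.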
Snippets
A group works on snippets. This is a way of improving the presentation of results: information about a site is extracted and displayed to inform users about the site before they click on the link.
A gigantic index
Google has hundreds of thousands of computers to index the billions of pages
from all the Web sites in the world. The goal, apart from the continuous
addition of new pages, is to be able to update the entire index within
a few days!
It is important to know that the datacenters store a copy of all the pages of
the Web in order to reach them quickly.
PageRank: signal and classifier
PageRank, used in the early days of the company by Larry Page and Sergey
Brin, is a score based on the number of links pointing to a page, which
is taken as an indication of its quality. But it is no longer enough on its own. Google now uses 200 criteria,
which it calls "signals". These depend on the
content of the page, its history, the queries and the behavior of
users; all of this is described in detail in the
PageRank and Sandbox patent.
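How several signals might be combined into a single score can be sketched as a weighted sum. The article does not say how Google combines its signals, so the four signal names, their weights and the sample values below are all invented for the illustration.

```python
# Hypothetical signals and weights; the real list and its weights are not public.
SIGNAL_WEIGHTS = {
    "inbound_links": 0.4,
    "content_match": 0.4,
    "page_history": 0.1,
    "user_behavior": 0.1,
}


def combined_score(signals, weights=SIGNAL_WEIGHTS):
    """Weighted sum of signal values normalized to [0, 1].

    A signal missing from the input counts as 0.
    """
    return sum(w * signals.get(name, 0.0) for name, w in weights.items())


popular_but_off_topic = {"inbound_links": 0.9, "content_match": 0.2}
relevant_page = {"inbound_links": 0.5, "content_match": 0.9, "user_behavior": 0.8}

print(combined_score(popular_but_off_topic))
print(combined_score(relevant_page))
```

In this toy example a page with many links but little matching content scores lower than a well-matched page with fewer links, which is the point of using links as one signal among others.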
Besides signals on the pages and their history, Google uses classifiers on
queries, whose goal is to identify the context of a search, that is, the intent of
the user who made it. For example, does the user want to find a product to buy,
or to get some information?
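The simplest possible query classifier is a keyword test. The sketch below is a deliberately naive assumption: the word list, the function name `classify_query` and the two labels are invented for the example, and a real classifier would certainly rely on much more than a fixed vocabulary.

```python
# Hypothetical vocabulary hinting at an intention to buy.
TRANSACTIONAL_WORDS = {"buy", "price", "cheap", "order", "discount"}


def classify_query(query):
    """Label a query 'transactional' if it contains a purchase word,
    otherwise 'informational'."""
    words = set(query.lower().split())
    return "transactional" if words & TRANSACTIONAL_WORDS else "informational"


print(classify_query("buy digital camera"))
print(classify_query("how does a digital camera work"))
```

The label could then be used to favor commercial sites for the first query and explanatory pages for the second.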
The need for diversity
Once pages have been selected and ranked, the best ones should occupy the first ten positions, but the process is not finished. Google wants to add a diversity of points of view, for example blogs and commercial sites, so a page with a lower score can be moved toward the top of the results, the first page of each category being promoted in this way.
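Promoting the first page of each category can be sketched as a second pass over the ranked list. This is one possible reading of the description, not Google's method; the function name `diversify`, the tuple layout and the example URLs are assumptions.

```python
def diversify(ranked, top_n=10):
    """Promote the best page of each category, then fill the rest by score.

    ranked: list of (url, category, score) tuples sorted by descending score.
    Hypothetical illustration of the diversity step.
    """
    seen = set()
    promoted, rest = [], []
    for page in ranked:
        category = page[1]
        if category not in seen:
            seen.add(category)
            promoted.append(page)  # first page met in this category
        else:
            rest.append(page)
    return (promoted + rest)[:top_n]


ranked = [
    ("shop-a.example", "commercial", 0.9),
    ("shop-b.example", "commercial", 0.8),
    ("blog-a.example", "blog", 0.5),
]

for url, category, score in diversify(ranked):
    print(url, category, score)
```

The blog, despite its lower score, moves ahead of the second commercial site because it is the first page of its category.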
Not everything has been revealed
Google's techniques seem rather academic, with their signals and classifiers, when compared with those of competitors such as Microsoft, which uses neural networks. But not everything is known: Google still keeps many secrets, not wanting to reveal all its techniques to competitors.
-
The complete article by Saul Hansell, published in the New York Times: Google Keeps Tweaking Its Search Engine.