Overview of Google's real algorithm and its patent
Google, the most visited website in the world, depends on the quality of its search engine, and
the whole interest lies in how pages are ranked in the results of queries. Most webmasters still believe that the algorithm used is PageRank, but PR is actually only one page-scoring criterion among hundreds of others.
In fact, the real algorithm is never finished: the teams at Google continuously analyze results to correct it and to control the ranking of Web pages.
A journalist, Saul Hansell, had the opportunity to spend a day with the Google engineers directly involved in developing the algorithm, and to sit in on their meeting.
You will understand why websites sometimes disappear from the list of results, independently of the penalty known as the sandbox.
Why is the algorithm modified?
The team's work is driven by complaints from companies whose sites are ranked poorly for no apparent reason, and by the team's own analysis of the results. Each of the 10,000 employees at Google has access to a "Buganizer", a tool for reporting problems encountered in a query, and all of these problems are forwarded to the team working on the algorithm.
It was noted, for example, that queries about "French revolution" led to articles about an election campaign because the candidates spoke about "revolution"! In this case the correction simply consisted of giving more weight to the words "French revolution" when the terms appear together.
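The correction described above, giving more weight to query words when they appear together, can be illustrated with a toy scoring function. This is only a sketch under stated assumptions: the function name `score`, the `phrase_boost` parameter and the whitespace tokenization are hypothetical, and nothing here reflects Google's actual formula.

```python
def score(query: str, document: str, phrase_boost: float = 2.0) -> float:
    """Toy relevance score: term matches, plus a bonus for exact phrase matches."""
    query_terms = query.lower().split()
    doc_terms = document.lower().split()

    # Base score: one point per occurrence of any query term.
    base = float(sum(doc_terms.count(term) for term in query_terms))

    # Count the positions where the whole query appears as a contiguous phrase.
    n = len(query_terms)
    phrase_hits = sum(
        1 for i in range(len(doc_terms) - n + 1) if doc_terms[i:i + n] == query_terms
    )

    # Hypothetical weighting: each phrase match adds phrase_boost points.
    return base + phrase_boost * phrase_hits


history = "the french revolution began in 1789"
campaign = "the candidate promised a revolution a real revolution for voters"
print(score("French revolution", history) > score("French revolution", campaign))
```

Without the phrase bonus both documents would tie on term counts alone; the bonus is what lets the history page outrank the campaign article.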
The team has a special tool named "Debug", which displays how
computers evaluate each query and each Web page. One can thus see how much
importance the algorithm gives to the links to a page, and correct it if needed.
Once the problem has been identified, a new mathematical formula is developed to address the case, and it is incorporated into the algorithm.
Besides PageRank and other signals, the algorithm uses various models:
- Language models: the ability to understand phrases, synonyms, accents, misspellings, etc.
- Query models: it is not just about language, but about how it is used today.
- Time models: some pages are the best answers when they have existed for 30 minutes, and others when they have stood the test of time.
- Personalized models: not all people are looking for the same things (the same words do not always express the same thought; a personal note).
The dilemma of freshness
A crucial problem for the development team is that of freshness. Is it
better to favor new pages, which are more likely to reflect
current events, or on the contrary older ones, which have already proven their quality
through their number of backlinks?
Google had always favored the latter, but it recently realized that this was not always the right choice, so it had to develop a new algorithm that determines when the user needs fresh information and when, on the contrary, results should remain stable. This is called the QDF formula, for Query Deserves Freshness.
One can determine that a subject is hot when many blogs suddenly start talking about it, or when there is a sudden surge of queries on the subject.
One group works on snippets. These improve the presentation of results by extracting information about a site and displaying it, so that users know about the site before they click on the link.
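The idea of detecting a hot subject from a sudden rise in queries can be sketched as a simple spike test. The function `is_hot`, the `spike_factor` threshold and the use of daily counts are all assumptions made for illustration; the source does not say how QDF actually measures a surge.

```python
from statistics import mean


def is_hot(daily_counts: list[int], spike_factor: float = 3.0) -> bool:
    """Return True when the latest count far exceeds the average of earlier days."""
    if len(daily_counts) < 2:
        return False  # not enough history to compare against

    *history, latest = daily_counts
    baseline = mean(history)

    # Flooring the baseline at 1.0 avoids flagging noise on near-zero traffic.
    return latest > spike_factor * max(baseline, 1.0)


print(is_hot([10, 12, 9, 11, 60]))  # sudden surge on the last day
print(is_hot([10, 12, 9, 11, 13]))  # ordinary fluctuation
```

The same test could be applied to counts of blog posts mentioning the subject instead of query counts.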
A gigantic index
Google has hundreds of thousands of computers to index the billions of pages
from all the websites in the world. The goal, apart from the continuous addition of new
pages, is to be able to update the entire index within
a few days!
It is important to know that the datacenters store a copy of all the pages of the Web so that they can be accessed quickly.
PageRank: signal and classifier
PageRank, used in the early days of the company by Larry Page and Sergey
Brin, is a score based on the number of links pointing to a page, which
is taken as a measure of its quality. But it is no longer sufficient on its own: Google now uses 200 criteria,
which it calls "signals". These depend on the
content of the page, its history, the queries and the behavior of
users, all of which is described in detail in the
PageRank and Sandbox patent.
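To make the link-based score concrete, here is a minimal sketch of the textbook PageRank iteration as published by Page and Brin, with the usual damping factor of 0.85. It is a simplified illustration, not Google's production code; the function name `pagerank`, the dictionary-based graph and the handling of pages without outgoing links are choices made for this example.

```python
def pagerank(
    links: dict[str, list[str]], damping: float = 0.85, iterations: int = 50
) -> dict[str, float]:
    """Textbook PageRank by power iteration on a small link graph."""
    # Collect every page, including those that only appear as link targets.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this small graph, page "c" receives links from both "a" and "b" and ends up with the highest score, while "b", which nobody links to, ends up with the lowest.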
Besides signals on the pages and their history, Google uses classifiers on queries, whose goal is to identify the context of a search and the intent of the user who made it. For example, does the user want to buy a product or to get some information?
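A query classifier of this kind can be pictured with a deliberately naive keyword rule. The word list `BUY_WORDS`, the function `classify_intent` and the two labels are hypothetical and chosen only for illustration; the source does not describe how Google's classifiers are built.

```python
# Hypothetical list of words suggesting that the user wants to buy something.
BUY_WORDS = {"buy", "price", "cheap", "order", "discount"}


def classify_intent(query: str) -> str:
    """Label a query as 'transactional' (buying) or 'informational'."""
    words = set(query.lower().split())
    # Any overlap with the purchase vocabulary marks the query as transactional.
    return "transactional" if words & BUY_WORDS else "informational"


print(classify_intent("buy running shoes"))
print(classify_intent("history of the French revolution"))
```

A real classifier would presumably learn from many more signals than a fixed vocabulary, but the output plays the same role: a label for the query that ranking can take into account.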
"The most famous part of our ranking algorithm is PageRank, an algorithm developed by Larry Page and Sergey Brin, who founded Google. PageRank is still in use today, but it is now a part of a much larger system."
The article that is the source of this quote (see below) tells us that PageRank was amended in January 2008, so it is not static!
The need for diversity
Once pages have been selected and ranked, the best ones should occupy the first ten positions, but that is not the end of it. Google wants to add a diversity of points of view, for example blogs and commercial sites, so a page with a lower score can be moved to the top of the results, the first page of each category being promoted in this way.
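The promotion of the first page of each category can be sketched as a re-ranking step applied after scoring. The tuple layout `(url, category, score)`, the category names and the function `diversify` are assumptions made for this example, not a description of Google's implementation.

```python
def diversify(results: list[tuple[str, str, float]]) -> list[tuple[str, str, float]]:
    """Promote the best result of each category, then list the rest by score.

    Each result is a (url, category, score) tuple.
    """
    ranked = sorted(results, key=lambda r: r[2], reverse=True)

    seen_categories: set[str] = set()
    promoted, rest = [], []
    for result in ranked:
        category = result[1]
        if category not in seen_categories:
            # First (best-scoring) page of its category: move it to the top group.
            seen_categories.add(category)
            promoted.append(result)
        else:
            rest.append(result)
    return promoted + rest


results = [
    ("shop-one.example", "commercial", 0.9),
    ("shop-two.example", "commercial", 0.8),
    ("some-blog.example", "blog", 0.5),
]
for url, category, page_score in diversify(results):
    print(url, category, page_score)
```

Here the blog, despite its lower score, moves ahead of the second commercial site because it is the first result of its category.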
Improving the algorithm
Besides the work on improving the algorithm, other groups work on evaluating the results. The quality of the algorithm's answers is assessed in real time to check their relevance, in particular to verify improvements as soon as they are made. Measuring the quality of results is the work of statisticians.
One group is devoted to spam and all types of abuse, such as hidden text. This "webspam" group, we learn, works together with the Google Webmaster Central group, which provides help and tools for webmasters.
Not everything has been said
Google's techniques, with their signals and classifiers, seem rather academic compared with those of competitors such as Microsoft, which uses neural networks. But we do not know everything: Google still keeps many secrets, not wanting to reveal all of its techniques to competitors.
- An introduction to Google search quality.
- Part of this information comes from an article by Saul Hansell published in the New York Times.