FreshRank - Google's freshness score

In patent 7,346,839 of 18 March 2008, Information retrieval based on historical data, Google sets out the principles by which a page is considered stale and those by which it is instead treated as a reference. The patent thus implicitly describes a concept that can be called FreshRank.

The distinction matters. In the first case, the page drops from the top of the results in favor of pages viewed as more recent. In the second case, on the contrary, its position is reinforced by its seniority, and it is not affected by the many blog articles on the same topic.

According to the patent, the factors taken into account in this decision are:

  1. Inception date of the document.
    Or more accurately, since Google knows only the indexing date, the date when the crawler discovers the new page.
  2. Content updates.
    The frequency and extent of updates determine whether a document, although very old, is still considered current.
  3. Analysis of queries.
    If a page is chosen more often among the displayed results for a query, its score increases. If it is considered outdated but is nevertheless chosen by users, its score is reviewed.
    If a page appears in the results for a growing number of different queries, that is a signal that it is relevant. The reverse indicates that its content is less recent.
  4. Criteria based on links.
    The dates on which new links appear and existing links disappear are taken into account. If new links appear less and less often, the page is viewed as outdated. If the total number of backlinks gradually decreases, the conclusion is the same.
    The algorithm weighs the importance of backlinks according to the freshness of the pages that contain them. That freshness is based on the same criteria (detailed here) as for the page being evaluated, so there is a FreshRank similar in principle to PageRank.
    Other weighting criteria are also applied to the links:
    - TrustRank.
    - A large number of backlinks appearing suddenly suggests an intent to spam: links created by the owner or by related parties to promote a document.
  5. Anchor text.
    Changes in the anchor text of links pointing to a page indicate that it is being updated and remains relevant. Conversely, if the anchor text of a document's links does not change while the linked pages change, that indicates that the document is not being updated.
  6. Traffic.
    A drop in traffic to a web page indicates that it is becoming obsolete. The algorithm takes seasonal variations into account. It also takes into account the advertising on the page:
    - Whether or not the advertisements change.
    - The importance of the site behind these advertisements.
    - The click-through rate on those ads.
    (Note: The patent does not say how this data is collected, but AdSense seems the most likely tool.)
  7. User behavior.
    As stated above, this is essentially the number of times a page is selected in the results, but also the time visitors spend on it. If visitors spend less time on the page over time, its freshness score is lowered.
    The same is true if they spend less time on it than on other pages returned for the same query.
  8. The domain name.
    To counteract spammers who create domains to host their pages, Google takes into account the legitimacy of a domain. Domains paid in advance for several years are considered "legitimate", so the expiration date is taken into account in the score.
    Frequent changes of hosting (DNS) or contacts mark a document as not legitimate. A host that manages many domains from different registrars improves the legitimacy of the domain.
  9. Ranking history.
    The successive positions in the results are taken into account, and a sudden change of position for a given query denotes spam.
    If the overall number of results for a query has undergone a sharp increase, it indicates a hot topic and the pages will get a better score in the future.
    If the increase concerns a single document, the algorithm must distinguish between spam and a page that is genuinely on a hot topic. To do this, it takes into account references to the document in news and discussion groups, where spam is not expected to be cited.
    But the algorithm makes an exception for authoritative documents and those that have ranked well for a long time.
  10. Bookmarks.
    Data maintained by users is taken into account. Bookmarks are treated like backlinks: their number and its growth or decline are used to judge the freshness of a page.
  11. Unique words, bigrams and phrases in anchor text.
    The emergence of a large number of identical anchors in documents or, conversely, of anchors that are all different across many documents indicates spam. Spiky growth in these unique words, bigrams and phrases in anchor text denotes coordinated action and therefore spam.
  12. Linkage of independent peers.
    A surge in links between pages with unrelated content indicates spam. This is confirmed by an increase in anchors that are unusually coherent or discordant.
  13. Document topics.
    The topic of a document can be seen through the following:
    - Categorization.
    - Analysis of URLs.
    - Analysis of the contents.
    - Clustering.
    - Summarization.
    - Presence of unique words.
    - And more ...
    If the topics change, the pages must be reconsidered. A spike in the number of different topics indicates an intent to spam.
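
The link-based criteria (item 4) lend themselves to a small illustration. The Python sketch below is not taken from the patent: the Backlink class, the source_freshness field, the 30-day period and the spike factor of 5 are all assumptions chosen for the example. It only shows how backlinks might be weighted by the freshness of the linking page, and how a history of new links could be labeled as declining (a sign of staleness) or as a sudden spike (a possible sign of spam).

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass
class Backlink:
    first_seen: date         # date the crawler first saw the link
    source_freshness: float  # hypothetical freshness of the linking page, 0.0 to 1.0


def weighted_backlinks(links):
    """Sum of backlinks, each weighted by the freshness of its source page."""
    return sum(link.source_freshness for link in links)


def links_per_period(links, start, n_periods, period_days=30):
    """Count the links first seen in each of n_periods consecutive periods."""
    counts = [0] * n_periods
    for link in links:
        index = (link.first_seen - start).days // period_days
        if 0 <= index < n_periods:  # ignore links outside the observed window
            counts[index] += 1
    return counts


def classify_trend(counts, spike_factor=5.0):
    """Label a series of new-link counts as 'spike', 'declining' or 'steady'."""
    if len(counts) < 2:
        return "steady"
    baseline = mean(counts[:-1])
    # Assumed rule: a last period far above the earlier average is a spike.
    if counts[-1] > spike_factor * max(baseline, 1.0):
        return "spike"
    # Assumed rule: counts that never rise and end lower than they began.
    never_rises = all(a >= b for a, b in zip(counts, counts[1:]))
    if never_rises and counts[-1] < counts[0]:
        return "declining"
    return "steady"


links = [
    Backlink(date(2008, 1, 5), 0.9),
    Backlink(date(2008, 1, 20), 0.5),
    Backlink(date(2008, 2, 10), 0.2),
]
counts = links_per_period(links, start=date(2008, 1, 1), n_periods=3)
print(counts, classify_trend(counts), weighted_backlinks(links))
```

In this toy data, two links appear in the first period, one in the second and none in the third, so the page would be labeled as declining. A real system would have to combine this with the other factors, for instance to spare authoritative documents.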

Conclusion

Google's definition of stale pages is summarized in one sentence:

Stale content refers to documents that have not been updated for a period of time and, thus, contain stale data.

As we have seen, applying the definition is a little more complicated.

Yet the basic idea is simple: a document such as the U.S. Constitution will never be stale, but commentary on the Olympic Games, for example, will lose interest over time.
Google's FreshRank algorithm tells the two apart.
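
To make the stale-versus-reference distinction concrete, here is a deliberately naive Python sketch. It is an illustration only, not Google's method: the PageHistory fields, the one-year threshold and the rule comparing the latest month to the earlier average are hypothetical. It uses just three of the signals listed above (content updates, selections in results, new backlinks).

```python
from dataclasses import dataclass


@dataclass
class PageHistory:
    days_since_update: int
    clicks_per_month: list     # selections in results, oldest month first
    new_links_per_month: list  # new backlinks, oldest month first


def still_in_demand(series):
    """True if the latest value is at least the average of the earlier ones."""
    if len(series) < 2:
        return True
    earlier = series[:-1]
    return series[-1] >= sum(earlier) / len(earlier)


def classify(page, stale_after_days=365):
    """Return 'current', 'reference' or 'stale' under the assumed rules."""
    if page.days_since_update <= stale_after_days:
        return "current"
    # Old and not updated: a reference only if users and linkers still want it.
    if still_in_demand(page.clicks_per_month) and still_in_demand(page.new_links_per_month):
        return "reference"
    return "stale"


constitution = PageHistory(3000, [100, 110, 105, 108], [5, 6, 5, 6])
olympics = PageHistory(700, [900, 300, 80, 10], [40, 10, 2, 0])
print(classify(constitution), classify(olympics))
```

With these made-up figures, the old but steadily consulted page comes out as a reference, while the page whose clicks and new links have collapsed comes out as stale.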
