Duplicate content and penalties: when and why
Google wanted to demystify the notion of a penalty for duplicate content, because webmasters tend to have misconceptions about it. First of all, there is no real "penalty for duplicate content."
But things have changed since the Panda update. The notion of duplicate content has been extended to sites that have too many similar pages: such sites receive a negative overall score that lowers the ranking of the site as a whole.
A Google employee has analyzed a case of content that was similar but not exactly copied. In that case the site received a -50 penalty.
Penalties nevertheless exist in practice, in the sense that a site will not be indexed or ranked well if:
- It shows the same content under different domain names.
- It incorporates content taken from another site.
- It republishes previously published articles without substantial additions.
- It has too many similar pages.
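To give a concrete idea of what "similar but not identical" can mean, here is a minimal sketch in Python that scores two texts by the overlap of their three-word sequences (Jaccard similarity on word shingles). This is only an illustration: how search engines actually measure similarity is not public, and the function names and the shingle size are assumptions made for this example.

```python
def shingles(text, size=3):
    """Return the set of consecutive word sequences of length `size`."""
    words = text.lower().split()
    if len(words) < size:
        # Text shorter than one shingle: treat the whole text as a single unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=3):
    """Score from 0.0 (nothing in common) to 1.0 (same set of shingles)."""
    set_a = shingles(text_a, size)
    set_b = shingles(text_b, size)
    if not set_a and not set_b:
        return 1.0  # two empty texts are considered identical
    return len(set_a & set_b) / len(set_a | set_b)
```

A site owner could run such a score over pairs of pages to spot groups of pages that differ only by a few words. The threshold above which two pages count as "too similar" is a judgment call and is not given by the source.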
Accidental duplicate content
One of the most frequent and most annoying cases is when two domain names point to the same site. The webmaster imagines that this will bring in users who typed the domain with a different extension, for example .com, .org or a country extension. But for search engine crawlers these are different sites with the same content, and they may not both be indexed.
The same problem can occur when robots are given access both to a dynamic URL such as http://www.scriptol.com?x=5 and to a more meaningful URL built from the title of the post but pointing to the same page, which can happen with a CMS.
These pages are not penalized, but they are filtered out by search engines, which do not want multiple copies of the same page in their indexes. (Reference)
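To check whether a site exposes the same page under several addresses, the page bodies can be compared by hash. The sketch below is a minimal illustration, assuming the pages have already been downloaded into a dictionary that maps each URL to its text; the function name and the whitespace and case normalization are choices made for this example, not a description of how a search engine filters duplicates.

```python
import hashlib
from collections import defaultdict


def group_duplicate_urls(pages):
    """Group URLs whose content is identical after light normalization.

    `pages` maps each URL to the text of the page (hypothetical input).
    Returns a list of URL groups, keeping only groups with 2+ URLs.
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        # Collapse whitespace and ignore case so trivial differences do not matter.
        normalized = " ".join(body.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]
```

For example, a dynamic URL and a title-based URL that return the same text would end up in the same group, which signals that one of them should be hidden from robots or declared as the duplicate.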
What happens next can be summarized in three points:
- Two pages are recognized as identical.
- One of them is selected as the best URL. If only one of the two appears in the sitemap, that one is chosen.
- Other factors are then taken into account to confirm or reject this choice, mainly the number of backlinks to the URL.
As Matt Cutts said in an interview with a group of webmasters, the URL used for the index is the one that is regarded as the original and that has the most backlinks.
If two pages contain the same information without being strictly identical, and one of them links to the other, the page that is linked to will be regarded as the reference.
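The selection of the best URL described above (sitemap first, then backlinks) can be sketched as follows. This is a simplified illustration of the rules stated in this article, not Google's actual algorithm; the function name, the inputs and the tie-breaking rule (first candidate wins) are assumptions.

```python
def pick_preferred_url(candidates, sitemap_urls, backlink_counts):
    """Choose one URL among several that serve the same page.

    candidates      -- list of duplicate URLs (hypothetical input)
    sitemap_urls    -- set of URLs listed in the sitemap
    backlink_counts -- dict mapping URL to its number of backlinks
    """
    # Rule 1: URLs present in the sitemap are preferred over the others.
    listed = [url for url in candidates if url in sitemap_urls]
    pool = listed if listed else list(candidates)
    # Rule 2: among the remaining URLs, the one with the most backlinks wins.
    # max() returns the first candidate when counts are equal.
    return max(pool, key=lambda url: backlink_counts.get(url, 0))
```

With two duplicates where only one is in the sitemap, that one is returned even if the other has more backlinks. When neither or both are in the sitemap, the backlink count decides.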
The canonical tag
To avoid duplicate content created by the legitimate owner of the pages, Google has introduced a tag, to be placed in the <head> section, that indicates which URL should be considered for a page when it is available at several different addresses.
<link rel="canonical" href="url of page" />
See how to create a generic canonical tag in PHP.
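As a complement, here is a minimal sketch of the same idea in Python, for the case where one site answers on several domain names. It rebuilds the URL of the current request on a single preferred host and prints the tag. The function name, the parameter names and the example.com hosts are hypothetical. The sketch assumes that the path and query string identify the page, so they are kept unchanged and only the host and the fragment are altered.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_tag(request_url, preferred_host):
    """Build a canonical <link> tag that points to the preferred host.

    request_url    -- full URL of the page being served
    preferred_host -- the single domain that should be indexed (assumption)
    """
    parts = urlsplit(request_url)
    canonical = urlunsplit((
        parts.scheme,
        preferred_host,
        parts.path or "/",   # an empty path means the home page
        parts.query,         # kept as is: it may select the page content
        "",                  # fragments are never part of a canonical URL here
    ))
    # Escape the URL so that "&" and quotes are valid inside the attribute.
    return '<link rel="canonical" href="{}" />'.format(escape(canonical, quote=True))


print(canonical_tag("http://www.example.org/article.php?id=3#top", "www.example.com"))
```

The returned string is meant to be written inside the <head> section of every page, whichever domain served the request.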
Having duplicate content on a site can hurt in several ways even when search engines apply no formal penalty. If the duplicate is not detected, the PageRank is diluted between the two pages. If it is detected, only one page is indexed, but there is no guarantee that it is the right one.
However, you should not worry if you discover that duplicate content was accessible to robots: just delete the duplicated content, or simply make it inaccessible, and the negative effects will disappear.
- The article on Blogcentral (Google).
- Another article on duplicate content, dealing with the case where someone copies the content of your site.
- See also the webmaster's guide to duplicate content.