Monday, December 18, 2006 at 2:28 PM
At the recent Search Engine Strategies conference in freezing Chicago, many of us Googlers were asked questions about duplicate content. We recognize that there are many nuances and a bit of confusion on the topic, so we'd like to help set the record straight.
What is duplicate content?
Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Most of the time when we see this, it's unintentional or at least not malicious in origin: forums that generate both regular and stripped-down mobile-targeted pages, store items shown (and -- worse yet -- linked) via multiple distinct URLs, and so on. In some cases, content is duplicated across domains in an attempt to manipulate search engine rankings or garner more traffic via popular or long-tail queries.
What isn't duplicate content?
Though we do offer a handy translation utility, our algorithms won't view the same article written in English and Spanish as duplicate content. Similarly, you shouldn't worry about occasional snippets (quotes and otherwise) being flagged as duplicate content.
Why does Google care about duplicate content?
Our users typically want to see a diverse cross-section of unique content when they do searches. In contrast, they're understandably annoyed when they see substantially the same content within a set of search results. Also, webmasters become sad when we show a complex URL (example.com/contentredir?value=shorty-george&lang=en) instead of the pretty URL they prefer (example.com/en/shorty-george.htm).
What does Google do about it?
During our crawling and when serving search results, we try hard to index and show pages with distinct information. This filtering means, for instance, that if your site has articles in "regular" and "printer" versions and neither set is blocked in robots.txt or via a noindex meta tag, we'll choose one version to list. In the rare cases in which we perceive that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we'll also make appropriate adjustments in the indexing and ranking of the sites involved. However, we prefer to focus on filtering rather than ranking adjustments ... so in the vast majority of cases, the worst thing that'll befall webmasters is to see the "less desired" version of a page shown in our index.
How can webmasters proactively address duplicate content issues?
- Block appropriately: Rather than letting our algorithms determine the "best" version of a document, you may wish to help guide us to your preferred version. For instance, if you don't want us to index the printer versions of your site's articles, disallow those directories or make use of wildcard patterns in your robots.txt file (see the robots.txt sketch after this list).
- Use 301s: If you have restructured your site, use 301 redirects ("RedirectPermanent") in your .htaccess file to smartly redirect users, the Googlebot, and other spiders (see the .htaccess sketch after this list).
- Be consistent: Endeavor to keep your internal linking consistent; don't link to /page/ and /page and /page/index.htm.
- Use TLDs: To help us serve the most appropriate version of a document, use top-level domains whenever possible to handle country-specific content. We're more likely to know that .de indicates Germany-focused content, for instance, than /de or de.example.com.
- Syndicate carefully: If you syndicate your content on other sites, make sure they include a link back to the original article on each syndicated article. Even with that, note that we'll always show the (unblocked) version we think is most appropriate for users in each given search, which may or may not be the version you'd prefer.
- Use the preferred domain feature of Webmaster Tools: If other sites link to yours using both the www and non-www version of your URLs, you can let us know which way you prefer your site to be indexed.
- Minimize boilerplate repetition: For instance, instead of including lengthy copyright text on the bottom of every page, include a very brief summary and then link to a page with more details.
- Avoid publishing stubs: Users don't like seeing "empty" pages, so avoid placeholders where possible. This means not publishing (or at least blocking) pages with zero reviews, no real estate listings, etc., so users (and bots) aren't subjected to a zillion instances of "Below you'll find a superb list of all the great rental opportunities in [insert cityname]..." with no actual listings.
- Understand your CMS: Make sure you're familiar with how content is displayed on your Web site, particularly if it includes a blog, a forum, or related system that often shows the same content in multiple formats.
- Don't worry, be happy: Don't fret too much about sites that scrape (misappropriate and republish) your content. Though annoying, it's highly unlikely that such sites can negatively impact your site's presence in Google. If you do spot a case that's particularly frustrating, you are welcome to file a DMCA request to claim ownership of the content and have us deal with the rogue site.
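For the first two suggestions, here are minimal sketches. A robots.txt that keeps crawlers away from printer-friendly copies might look like this (the /printer/ path is a placeholder to adjust to your site's structure; note that the * wildcard in Disallow lines is a Googlebot extension rather than part of the original robots.txt standard):

User-agent: *
# Keep the printer-friendly duplicates out of the index
Disallow: /printer/
# Googlebot also understands wildcards for URL parameters
Disallow: /*?print=1

And a minimal .htaccess sketch for the 301 suggestion, assuming Apache with mod_alias enabled (the paths and domain are placeholders). The same permanent redirect also works for pointing the non-www hostname at the www version, complementing the preferred domain setting mentioned above:

# Send visitors and crawlers from the old location to the new one
RedirectPermanent /old-page.htm http://www.example.com/new-page.htm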
In short, a general awareness of duplicate content issues and a few minutes of thoughtful preventative maintenance should help you to help us provide users with unique and relevant content.
63 comments:
I really appreciated this post! Thank you very much! This stuff seems obvious after reading it, but before I read about it, it just seemed so complex. Thanks, Google webmasters!
301 Redirects Guide
Duplicate content on the same site is sometimes completely valid. As site search technology gets better, this duplicate content issue will need to be better addressed.
For example, take software like Endeca/Mercado that does guided navigation. The user can pick any data point to filter by, in any order. This means you naturally have different URLs with the same content:
/search/brand/Compaq/color/red/
/search/color/red/brand/Compaq/
The user can take either path and get to the same filtered data. Should the site be penalized for using friendly URLs here? Absolutely not.
/search?brand=Compaq&color=red
/search?color=red&brand=Compaq
This is even worse, since search engines hate query strings for the most part.
In either case, no malice has taken place, and there's no reason a site should be banned from an index just because it's using friendly URLs and good search technology.
The struggle is, how do you prevent that kind of valid duplicate content? The use of friendly URLs instead of query strings is done to make Google index more content. But now noindex and other index-defeating techniques need to be used to keep Google from the duplicate content... at some level defeating the purpose of giving Google better URLs.
And that's aside from the service to the customer: allowing him/her to find data by making choices in the order that makes sense to them.
Thanks for clearing this up for so many users.
Thanks for posting on this issue, but it doesn't answer a key question I, and I think others, have.
I have had a .com domain for a number of years. As a web design company, I have used the '.com' as part of my branding. I have served, and continue to serve, an international market.
However, I am based in the UK, which has now become a critical market for me, but my site is not listed in 'Pages from the UK', and it performs worse in google.co.uk than in google.com.
So, my question is, how do I add the .co.uk domain while retaining the .com for brand and the international market, without losing the benefit of age and the links I've won, or dividing my link building and content management efforts between 2 domains?
Many thanks,
Iain
Thanks for this post, it clarifies a lot! One thing remains for us, though. We have a site promoting tourism in part of Kenya, hosted within Kenya (www.kisumu.co.ke). Because the Internet connections to the outside world are still pretty bad, we also hosted the site in the UK (www.westkenya.com). Otherwise folks in Europe or the Americas would have had trouble reaching the site in Kenya, and/or folks here in Kenya would have had to wait a long time for the UK-hosted site. It also provides some redundancy.
The sites are exact mirrors, and that is actually mentioned on the home pages. Does anyone know how this will be handled by search engines?
Thanks,
Erik
Nice article.
But what happens if a site has widespread cloners that are skilled in link building and collectively overpower the owner of the original content? Is it possible to devalue an entire website? If so, how does one recover?
Also, what happens if we accidentally use a robots.txt wildcard that blocks the spiders? Does our content become the property of the cloners?
Are you saying that the Google Bowling Effect is impossible with this filter?
Thanks & Best!
A year ago, in August 2006, my Google PageRank for www.healthcarehiring.com dropped from 5 to 0. Traffic and advertising revenue dropped by 75%.
I have been trying to figure out why. No paid links, no link exchanges, no intentional duplicate content. This site is huge, lots of databases with millions of records of contact information. This represents years of work.
A google search for site:healthcarehiring.com right now shows 254,000 pages indexed. A google search for link:healthcarehiring.com or link:www.healthcarehiring.com shows 19 inlinks. A yahoo search for link:healthcarehiring.com shows 4 inlinks, while a yahoo search for link:www.healthcarehiring.com shows 751 inlinks. A lot of these are from university websites, very high-quality incoming links.
Just yesterday I found some inadvertent duplication. I have 2 inactive websites which DNS routes to the same machine. The default behavior of Apache httpd.conf name-based virtual host records is that the top VirtualHost entry gets the traffic for any domain that is not routed to its own directory. Guess what domain was on top of the list! I have fixed this by changing httpd.conf and putting a 410 redirect in the .htaccess file. It remains to be seen if this fixes the problem.
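For anyone hitting the same issue, here is a rough httpd.conf sketch of one way to arrange the fix (domain names and paths are placeholders): the first name-based VirtualHost is the default for any unmatched Host header, so give that slot a throwaway catch-all instead of a real site:

NameVirtualHost *:80

# First entry: catch-all for any domain not matched below;
# "gone" returns a 410 status for everything it receives
<VirtualHost *:80>
    ServerName catchall.example.com
    Redirect gone /
</VirtualHost>

# Real site, now safe from stray domains
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/example
</VirtualHost>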
I have been re-designing and re-writing the entire site. Anybody who can shed any further light on this problem, please do so.
Thanks
Mike Clark
The duplicate content issue is a very important one. From my experience, your pages which have duplicate content will go straight into the supplemental index of Google. Consequently, they will not rank for any keyword. Before you put an article live on one of your sites, I really recommend checking it with Copyscape. Also, you are better off if you don't submit to article directories. I have just started a site about satellite television, and I haven't submitted any articles to directories. The result? The pages started ranking immediately, and even though it's a new site, it already ranks 3rd for a few keywords. The conclusion? Avoid duplicate content at all costs!
So from an SEO/PageRank point of view, what is the best way to internally link back to your home page: "www.domain.com/", "www.domain.com/index.html", "index.html", or "/"?
So I'm guilty, mea culpa...
I had duplicate content across 50 of my sites; it wasn't meant to be that way in the first place, but it was a temporary solution in my network setup.
So I flew way high in the SERPs, and now 4 or 5 of these sites have already been very severely penalized (but not banned, as far as I can see). I've obviously readjusted to remove all duplicate content; now how do I reverse the penalty?
Hello, I would like to ask something. If someone wants to begin a blog which is also cross-posted somewhere else, what is the best tactic to follow? (I want to mirror my Blogspot blog to my site's blog account just to provide content.) Should I use the robots meta tag somehow on the mirrored blog? If yes, which is the best combination?
I want to make the original blog top at page ranking, but also provide content to the root site of the mirrored blog.
So [noindex, nofollow] on the mirrored blog, or [noindex, follow]?
Thank you.
I am a beginner... I still haven't made my first post due to this matter that I am concerned about.
Aggelos
I am just curious. Under the "Syndicate carefully" paragraph, it states to provide a link back to the original content. What kind of link? What is the specific anchor text and link format we should use? Or is there an "HTML quote link" that I am not familiar with?
HELP!
Michael
I have a site that makes the same content available to mobile devices (XHTML-MP) as to desktop devices (HTML and XHTML). The site is dynamic and adjusts the content based on the capabilities of the user agent (e.g. Firefox users get XHTML, IE users get HTML, mobile phone users get XHTML-MP optimized for handheld devices, etc.). For any given page there is only a single URL, but I do have two sitemaps (a desktop version and a mobile version, which contain the same URLs) so that Google crawls twice and requests each page both as a mobile device and as a desktop device. Does this count as duplicate content?
When Google was first making mobile searches available as a separate system this setup worked great and we got a lot of new mobile traffic on top of our desktop traffic. However, recently, it looks like they have lumped everything back together and now we are getting poorer response for both desktop and mobile users.
I would rather not create different URLs for mobile devices, because the content is the same (just formatted in a different way for the mobile device), but it looks like Google is getting confused - standard HTML pages are being returned with an "unknown format" label, as if the XHTML-MP version were being referenced in response to a desktop search.
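(For reference, the mobile sitemap uses Google's mobile sitemap namespace; an entry looks roughly like this, with the URL as a placeholder:)

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:mobile="http://www.google.com/schemas/sitemap-mobile/1.0">
  <url>
    <loc>http://www.example.com/page.html</loc>
    <!-- marks the URL as having a mobile-formatted version -->
    <mobile:mobile/>
  </url>
</urlset>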
I've found a web site that copied into one of its pages the text of a page of my web site.
What can I do? Is this duplicate content?
Is there any way to protect a website from being stolen via proxy? For instance:
http://nikat.org/www.amazon.com
http://nikat.org/www.seomoz.org
It's outright thievery, and frustrating. Is there any way to protect oneself without having to manually ban the IP of each proxy, which is only effective if you've been made aware of the proxy and have its IP?
Will Google treat URLs like www.example.com/MyPage and www.example.com/mypage as being duplicate content pages if they both link to the same page?
Is Google case-sensitive when it comes to URLs?
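For what it's worth, the path portion of a URL is case-sensitive by the standard, so /MyPage and /mypage are distinct URLs; if both serve the same page, one way to collapse them is a 301 redirect to lowercase. A minimal Apache sketch (assuming mod_rewrite; note that RewriteMap must be defined in the server or virtual-host config, not in .htaccess):

# In the server or VirtualHost config: an internal lowercasing map
RewriteMap lc int:tolower

RewriteEngine On
# Redirect any path containing an uppercase letter to its lowercase form
RewriteCond %{REQUEST_URI} [A-Z]
RewriteRule (.*) ${lc:$1} [R=301,L]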
It's not often we find people who understand these issues and feel as passionately about them as you. Thank you. From
your friends at Unix Commerce Web Design
Posted By: James Shirley - Blackpool Web Design
Duplicate content: I had one of these before. When I typed in Blackpool Web Design, I got a web site that looked exactly like mine. I contacted the person, and the next day their web site had vanished off Google :) I also noticed that a lot of Blackpool hotels are copying off www.hotels-blackpool.com.
Thanks for this great information.
I appreciate the information in this article and am trying to implement the stated suggestions in practice.
Reference: Duplicate website content and how to avoid it
Cheers!
Well, you ought to exclude the factoring of duplicate content between FORM tags, because the drop-down selection lists for items (services) one might market may be repetitious. In my business, for instance, the Pickup (origin) and Destination lists for the vacation capital of the U.S., if not the world, have resorts, attractions, and many cruise ships to choose from. I am using forms to generate requests for quotes, and after your recent changes I lost my PR of 5 and went to zero - I assume because of the content of a single form on my home page.
Posted By: Blackpool Hotel
Although it was stated above that "you shouldn't worry about occasional snippets (quotes and otherwise) being flagged as duplicate content," I have seen firsthand how technically unique pages with similar themes were considered duplicate content. In this case the pages each contained a unique video and, unfortunately, only a single, but unique, paragraph of text, followed by what I considered a "snippet" of 5 lines of contact information. When I removed the contact information that was replicated across all the pages, they were shortly removed from the supplemental index. So I guess that statement depends on what you consider a "snippet" to be.
After a year of working duplicate content out of my sites, I am still finding pages that are in the supplemental index due to duplicate content.
Duplicate content issues used to place your pages in the supplemental index. Unfortunately, the command that used to check which pages were in the supplemental index is not available anymore. :((
Best Internet Marketing product reviews.
Hello,
We are the representative of the US skin care company MURAD (www.murad.com) in Canada, and we just launched our site (http://murad.balleydirect.com). To keep the branding consistent with the US firm, we have used content directly from their site.
Just to ensure that we are following the guidelines, could you please verify whether we, in Canada, representing a US firm, are able to use content from their website.
For reference please review:
www.murad.com (US)
http://murad.balleydirect.com (Canada)
Thanks
Harun Tasci
www.murad.ca
Regarding scraping, Adam wrote, "If you do spot a case that's particularly frustrating, you are welcome to file a DMCA request to claim ownership of the content and have us deal with the rogue site."
DMCA requests to Google require that the victim send a letter or a fax to Google. No email notification is permitted (except by prior arrangement). Do you think that asking victims to use these time-consuming methods will encourage them to report abuse of copyright? A few weeks ago, I came across a site that contained almost verbatim an article from the TechScribe site. After reading Google’s DMCA request page, I decided that notifying Google was not worth the effort.
Thanks!
This is something I have been wondering about for a while. Now that I have come across this post, I have been making changes to my site to ensure less duplicate content.
Hi,
I'm thinking about posting some content on the site and adding printable/downloadable PDFs of the same content as the HTML. Would that be considered duplicate content, or just a printable version?
Thanks
As our blog post describes, "printable versions" of pages are usually duplicate content (since you're offering the same content in two formats). There's nothing wrong with doing this, but—as Adam states in his first bullet point—you may wish to disallow the printable versions in your robots.txt file in order to control which version gets indexed.
There is seriously a HUGE PROBLEM with DUPLICATE CONTENT! It's technically THEFT!!
I discovered this morning that I was a VICTIM of an OBSESSED cyberstalker who DUPLICATED my content from my previous http://madgeroxmyworld.blogspot.com
and opened up a BLOG and stored the contents here!! --> http://zarainginue.blogspot.com !!!
All my pictures, my postings, everything was COPIED, DUPLICATED, a total impersonation of ME!
I have deleted my old blog address & my Gmail account totally!
I DO HOPE AND SEEK HELP HERE FOR GOOGLE WEBMASTER TO look into this account & blog address -- http://zarainginue.blogspot.com
as the contents are an INFRINGEMENT of me! And I hope, given the repeated infringements from this site, GOOGLE WEBMASTER will delete this BLOG USER aka FRAUDSTER!!
thank you!
Hi Zara,
If someone is copying your content without your permission, you can file a DMCA request to take down the copied content.
Thanx Susan! :)
I will seek account termination on this rogue blog --> http://zarainginue.blogspot.com & file a DMCA request. Obviously Google webmaster will have the registered IP, registered email & necessary info for this fraudster to be able to open up a blog account.
Perhaps the culprit thinks it's funny, but I think it is SICK conduct.
Already the permalink & 'post comment' section is disabled. Content was copied, coupled with a series of edits being copied too. This is a real serial stalker we're talking about. lol. :))
"Already the permalink & 'post comment' section is disabled. Content was copied, coupled with a series of edits being copied too. This is a real serial stalker we're talking about. lol. :))"
Yeah, every edit was captured, which clearly shows I have been stalked every second/minute by a certain cyberstalker, and even some posts I had deleted prior to closing down my ENTIRE blog after finding out there's a ROGUE blog & an impersonation of me & its duplicate contents!
The issue here is IT IS NOT MY BLOG, NOR IS IT MY REGISTERED EMAIL OR MY REGISTERED IP for the rogue/fraud blog account --> http://zarainginue.blogspot.com/
Thanks for the advice, Susan! :)
It is annoying to have content copied, as in Zara's case. Either she files a DMCA request for the content copied at the alleged rogue blog, or there's no use fretting about it, because you cannot control millions of cyberspace users copying content.
Hello,
Thanks for this excellent advice! When I first published my site, I provided printer-friendly pages in an attempt to be helpful to the end user. But when the site was indexed, some of the "regular" pages disappeared.
Now, following Google's recommendations, I have used "noindex" meta tags on the printer-friendly pages, so that Google avoids picking the printer-friendly pages for indexing rather than the "normal" pages.
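For anyone wanting to do the same, the tag in question goes in the head of each printer-friendly page:

<meta name="robots" content="noindex">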
Thanks Google for this very helpful advice!
Best wishes,
Bo C. Klintberg
Editor, Philosophical Plays
Copied/duplicate content is still COPIED CONTENT, no less, regardless of its motives.
Annoying, but hey, this is cyberspace. One thing differs from the other: valid content at the same URLs by the same user, versus a rogue user with the same content at a totally different URL.
And when dealing with duplicate content, only the owner knows his/her authentic URL, via the login account & the valid email address.
I have two websites (two domains with one hosting account, using one DNS for both websites).
Both websites show up on and off on Google. Is this a duplicate content issue? Do I have to change the code in .htaccess? Does this cost my website its rank? When it shows on Google, it's #5 on the top-10 list. But it appears for a few days and then it's lost again!
I also get a 404 Not Found code on both websites, or when I am looking for another website. Does this mean that I am not registered?
Please help me with this problem.
Thank you very much!
Susan,
I'm wondering if you can shed some light on this particular situation: imagine a templated real estate website that offers great buyer and seller sections (tips and resources, mostly). These sections, however, are generic, and are offered on every instance of that template that is sold to each individual Realtor. Thus, in theory, 1000s of Realtors could have the same generic content (or a version of it) on their websites. Also - what if this "generic" content contained links to many of the same web resources (such as a mortgage calculator)? At the same time, those sites also contain other localized, relevant content in addition to the generic content. In this case, the duplicate content is most likely of value to the reader of the site, yet it is still clearly duplicate content. I understand there is likely no SEO value to having such content, but is having it also potentially dangerous to the overall health of your site? What kind of exposure might someone have in this situation? I have heard that site A needs to be more than 50% the same as site B to enter the "at risk" category. Can you shed any light? Thanks for your valued input!
Hi Kristen:
Check out our article on duplicate content. In particular:
"Google tries hard to index and show pages with distinct information. [Generally we filter out duplicates and] choose one of them to list. In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we'll also make appropriate adjustments in the indexing and ranking of the sites involved."
You'll need to decide where your particular site falls along the spectrum of "content duplication", but the article offers lots of suggestions on how to minimize duplication. You may also want to check out our article on original content and remember that, if you are providing content that a) you didn't generate, and b) is available elsewhere, it's important to add some sort of value of your own. Otherwise, why not just link to the original version of that content?
Thanks Susan,
I couldn't agree with you more - in the world of content, it is ALL about adding value in a timely and relevant manner from your own unique and honest perspective. However, for those that understand this and regularly follow that mantra to build their foundation, do you believe adding generic third-party content to the equation (let's say 25% of the time, as an example) could tarnish a site's reputation? What if that third-party content did not originally originate from the web (there is no original link, just the same content on 100s of other websites - who knows who was first) and was being provided solely for widespread distribution and re-use by some sort of industry-specific content generator? What if those articles did add value for your local sphere (i.e. the people you are not necessarily trying to connect with through search, but those already following you)? I guess what I'm asking is: if you don't care about optimizing this generic content and the value is more intrinsic, can you offer it and still feel relatively comfortable that it won't hurt your original content in search (since we do, of course, care about some indexing!)? This is a very common situation, so I ask for all those out there with templated sites wondering if they are exposed. If this were a cause for concern, could it be eased by using robots nofollow tags on those pages containing this content? I realize these questions are nearly impossible to answer definitively - anything you can add is appreciated. Thanks!
When in doubt, I'd do what makes most sense from a user perspective (what is most helpful to human visitors). You could also block those pages using robots.txt if you're worried about how search engines will view them. But the duplicate content article is the best guideline I can give.
Hi Susan,
I have an Ecommerce site with two different domains for the American and Canadian versions. The HTML on the product pages is identical (though JavaScript dynamically displays availability, price, and add-to-cart buttons, which differ across sites). It looks as if the Canadian site has been indexed properly, while the American site has almost no Google listings. Even a search for the American version of the site shows the Canadian site as the first result (with the American being 9th). It seems as though in this case the duplicate content is necessary; do you have any recommendations?
Thanks,
Amit
I'm not a professional webmaster and have two questions:
1. If I submit an article to one or two dozen article directories and they all publish that same article, will I run into the problem of duplicate content?
2. If so, what is the "safe" number of directories that I should submit articles to?
Can anyone help? Thanks a million!
OK,
What happens when websites such as americangaragefloor.com (American Garage Floor) and diversifiedfloorsystems.com are owned by the same individual (Mike Parker)? I subscribe to the garageflooringinfo.com blog and just found out about this. You can see the documents from the Indiana SOS at http://blog.garageflooringinfo.com/
I noticed that my site has been listed under the IP address http://72.41.103.216/ and also the domain name http://www.bidlinks.co.uk
I would like to delist the IP version as it could possibly have a negative effect due to perceived duplicate content.
How can this be corrected?
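One possible fix I've seen suggested (a minimal .htaccess sketch, assuming Apache with mod_rewrite) is to 301-redirect any request whose Host header isn't the preferred domain, so the IP-address version points search engines back at the real one:

RewriteEngine On
# Any host other than the canonical domain (including the bare IP)
RewriteCond %{HTTP_HOST} !^www\.bidlinks\.co\.uk$ [NC]
RewriteRule ^(.*)$ http://www.bidlinks.co.uk/$1 [R=301,L]

Can anyone confirm this is the right approach?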
I have an article directory that is effectively 100% duplicate content (More Than Articles). Not long after it first started I did notice a dip in traffic and found that all my pages were in the supplemental index. I provide the articles formatted in HTML and in plain text, as well as the standard version. So basically every article appears 3 times on the site with minor variations.
I reworked the navigation and the robots.txt to exclude everything but the standard version from indexing. This has led to all the pages going back to the main index and a gradual increase in traffic.
From that experience I have to conclude that duplication within a domain is rather more important than duplication across domains.
I want to index my Blogspot blog with Google... how can I do that? Can anyone please help me with this?
Hi snape39,
Start here and here. If you have further questions about your site, please post them in our Webmaster Help Group.
Hi,
My content was being duplicated. I found it and deleted that content. How do I get my lost rank back now? I now have just the original content in one place.
I feel for & dig both Ko1man & especially sunil kumar gupta. According to the blog admin, you can file for duplicate content under the DMCA here:
http://www.google.com/dmca.html
It is annoying to have content copied. Either one files a DMCA request (the DMCA is one step) for the content copied at the alleged rogue blog, or there's no use fretting about it, because you cannot control millions of cyberspace users copying content - especially the existence of cyberstalkers, which perhaps only the owner of the blog content is able to detect, as the duplicate blog has an individual account & registered email address.
Thank you.
This was a good post, but it did not quite answer the question I had when I found this in a search. I have a fairly well-received site providing proofreading and copy editing services with links internationally. I want to make sure country-specific people are able to access the site under their country domain (http://editfast.ca & http://editfast.us), so I plan to create both of these country domain sites. The content will be the same except for the following changes:
1) Spelling will be specific to the country domain.
2) The contact info will be adjusted to reflect the correct contact info for that country (our New York office and our Halfmoon Bay (Canada) office).
3) A search for editors will bring up editors relevant to the country domain. (Canadian editors will show up when a search is performed on editfast.ca and American editors will show up when a search is performed on editfast.us.)
There are over 7000 pages on each of these sites, and all are essentially the same. Is this going to cause trouble with duplicate content?
I have done this already on another of my sites, which provides boat security checks, for the same reason -- to offer country-specific sources (safemarina.ca & safemarina.us) for the service. I am using this as a test (it is a small, not-too-well-known site) but thought I would ask here as well.
Yes, I am in a similar situation. You see, I have 2 websites; one deals with credit card offers and the other offers affiliate programs with custom templates and different product content. I guess the next step would be to allow each affiliate to create their own content, so at least the unique content gets indexed.
Hi everyone,
Since over a year has passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Help Group.
Thanks and take care,
The Webmaster Central Team