Using Google for Duplicate Content Detection

A month or so ago I was looking at a camping equipment website called outdoorpros.com. I love this site and would recommend it to anyone. Being an SEO, however, I couldn’t help but notice that they were using some suspicious-looking paginated links on their category pages, so after getting all excited about my new camping stove I decided to take a quick look at their Google site index to see how search engines might be indexing the site.

This post covers some basic tips on “site diagnostics”, specifically duplicate content detection using Google search: checks that every SEO should do as part of investigating issues that may negatively impact search engine positioning.

Here’s the approach I always follow, using outdoorpros.com as an example site:

1) Use your common sense

Let’s start by doing a site:www.outdoorpros.com search in Google.

As you can see from the screen grab, Google is reporting 72,100 indexed pages. Is that too many for a site this size? If it is, you may have some kind of duplicate content issue.

2) Skip around the index and see if you spot something weird

Ok, not terribly technical advice, but it doesn’t have to be.

Click through to around page 10 and take a quick look at the indexed URLs. If you don’t see anything weird, skip ahead another 10 pages. Go as far to the back of the index as you possibly can, because that’s where the good (bad) stuff usually hides. You’re looking out for malformed URLs, query strings (like ?sessionid= or ?first_page etc.) or lots of repeated results with the same title and description.

In the case of our friends at outdoorpros.com you can see straight away that something doesn’t look right.

That set of results tells me a lot about this site, and I’ve only been looking at it for 30 seconds. We’ve identified some query strings in the index that might be causing duplicate content. How do we confirm that, though?
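As an aside, if you copy a page or two of those indexed URLs into a list, a few lines of Python can tally which query-string parameters crop up most often. This is just a minimal sketch of the idea; the URL list below is made up from the examples in this post:

from collections import Counter
from urllib.parse import urlparse, parse_qs

# Illustrative URLs copied from site: results
indexed_urls = [
    "http://www.outdoorpros.com/Brands/Kershaw/96?attribute_value_string%7CColor=Pink",
    "http://www.outdoorpros.com/Cat/Pants/5/List?first_answer=13",
    "http://www.outdoorpros.com/Cat/Pants/5/List",
]

param_counts = Counter()
for url in indexed_urls:
    for name in parse_qs(urlparse(url).query):  # one entry per parameter name
        param_counts[name] += 1

# The parameters that appear most often are the first suspects
for name, count in param_counts.most_common():
    print(count, name)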

3) Assessing whether there really is a problem with individual page types

Take one of the query strings we saw in the index. Let’s use:

?attribute_value_string

Is that indexed string causing a problem? Let’s see. The URL was:

http://www.outdoorpros.com/Brands/Kershaw/96?attribute_value_string%7CColor=Pink

It looks like a brand / category page for Kershaw Knives. The first step is to check whether that page is indexed both with and without the query string. Here’s the cached page with the query string and without it. Whoops. There are at least two copies of this page in the index.

But don’t those pages have different content? Well, yes, in that the products the page links to are different, but the brand category page itself is the same every time. Each copy of the page has the same meta title and description – it’s duplicating! That may be why OutdoorPros don’t rank organically for “Kershaw” or “Kershaw knives”.
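If you want to sanity-check this programmatically, stripping the query string shows that the two indexed URLs really do point at the same underlying page. A quick sketch, using the Kershaw URLs from above:

from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Drop the query string so near-identical URLs collapse to one page."""
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path, "", ""))

with_query = "http://www.outdoorpros.com/Brands/Kershaw/96?attribute_value_string%7CColor=Pink"
plain = "http://www.outdoorpros.com/Brands/Kershaw/96"

print(strip_query(with_query) == plain)  # True - two indexed copies of one page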

4) Deciding how many of the URLs in the index are duplicated

That’s quite easy. To get a feel for the number of URLs that are duplicating, just do a query like

site:www.outdoorpros.com inurl:attribute_value_string

This site looks to have at least 13,000 URLs that contain the query string. Drill down a little by picking a few different titles from indexed pages, such as:

site:www.outdoorpros.com intitle:"Buck Knives – OutdoorPros.com"

There are 65 pages with that exact <title>. Doh!
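If you’d rather not eyeball 65 results by hand, the same duplicate-title check is easy to script against your own crawl data. A minimal sketch – the URL/title pairs here are invented for illustration:

from collections import Counter

# (URL, title) pairs exported from your own crawl - values below are illustrative only
crawled = [
    ("http://www.outdoorpros.com/Brands/Buck-Knives/1", "Buck Knives - OutdoorPros.com"),
    ("http://www.outdoorpros.com/Brands/Buck-Knives/1?first_answer=13", "Buck Knives - OutdoorPros.com"),
    ("http://www.outdoorpros.com/Cat/Pants/5/List", "Pants - OutdoorPros.com"),
]

title_counts = Counter(title for _url, title in crawled)
for title, count in title_counts.most_common():
    if count > 1:
        print(count, "URLs share the title:", title)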

5) How do I fix this?!

Ok, first of all let me recap what we’ve done so far. We’ve used a basic site: command and taken a common-sense snapshot of how many pages there are in the index. If you’re an e-commerce site with 100,000 indexed pages but only 5,000 products, you probably have something to think about.

Next, we drilled down by checking Google’s index at random positions to see if there was anything that didn’t look right, and something was definitely wrong. By running an inurl: query for the suspect query string, we got a total count of the indexed pages using that string. Finally, we picked a specific page <title> and found 65 instances of the same page.

There is a solution, and sadly just nofollowing paginated links won’t work. The damage has been done – you have some indexed URLs and some housekeeping to do.

I’m going to offer some advice in this post, but I’ll cover fixing duplicate content issues properly in my next post. Add my RSS feed to get that post when it’s done. In the meantime, my best advice to outdoorpros.com is to create a list of all the query strings that describe paginated pages and set up a rule to noindex,follow anything beyond the first page.

Here’s my example:

Let’s look at their pants page. :-) It’s a perfectly good pants page and I’ll hear no sniggering at the back of the class, please.

The main URL for this page is:

http://www.outdoorpros.com/Cat/Pants/5/List

Check out the paginated navigation links. Each one of them produces a different URL that looks like this:

http://www.outdoorpros.com/Cat/Pants/5/List?first_answer=13

The fix? A simple noindex,follow should be added to the page head whenever that query string is generated:

<html>
<head>
<title>...</title>
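<!-- keep this copy out of the index, but still follow its links -->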
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
</head>

This way, the many versions of the same page will be crawled but not indexed. All links on the page will be followed so the products will still be added to Google’s index. You’ve identified the canonical version of your pants page and Google will be grateful. Job done.
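For what it’s worth, here’s a rough sketch of how such a rule might hang together server-side. This isn’t OutdoorPros’ actual platform code – the robots_meta_for helper and the parameter list are my own illustration, based on the first_answer parameter seen above:

from urllib.parse import urlparse, parse_qs

# Query-string parameters that only paginate or filter an existing page.
# "first_answer" comes from the pants page above; extend the set as you find more.
PAGINATION_PARAMS = {"first_answer"}

def robots_meta_for(url):
    """Return the robots meta tag (if any) that this page's head should contain."""
    params = parse_qs(urlparse(url).query)
    if any(name in PAGINATION_PARAMS for name in params):
        return '<meta name="robots" content="noindex, follow">'
    return ""  # the main version of the page stays indexable

print(robots_meta_for("http://www.outdoorpros.com/Cat/Pants/5/List"))                 # no tag
print(robots_meta_for("http://www.outdoorpros.com/Cat/Pants/5/List?first_answer=13")) # noindex, follow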

Comments

  1. Dr. Pete

    Nice review of how a couple of basic tools can turn into advanced techniques. Just like good games, good SEO tools take moments to learn and a lifetime to master.

  2. Tom

    Nice post! Solid tips :-)

    Only thing is I think your code at the end is wrong – should say noindex, follow not noindex, nofollow

    More of this kind of thing please!

  3. BottomTurn

    very good post Richard.

    If I may ask: couldn’t you use Google Webmaster Tools to get information like duplicate titles directly? It works faster and scans the whole site.

  4. richardbaxterseo

    Hi there BottomTurn, you’re right, you could use Webmaster Tools to get duplicate titles. Webmaster Tools is definitely one source of information, though you won’t get all the detail that you need to perform a complete diagnosis. WMT is a very important step, and I’d put that part of the diagnostic under “Use your common sense” – good call.

  5. Jerry Okorie

    Very helpful, and like you said it might be difficult for a site with 100,000-odd pages. SEO is more about thinking like a tester, but with a good understanding of search engines: search for what doesn’t work in a site and you will find it, then apply your knowledge to the tools available and you’ll find a solution.

  6. Ramesh

    Hi there, very useful tips – fantastic input for a webmaster. Thanks for posting, I look forward to your next post.

  7. Pingback: Dis papa, c’est quoi le duplicate content ?

  8. Pingback: aimClear’s 2009 Daily Training Link Library » aimClear Search Marketing Blog

  9. Pingback: Encontre Conteúdo Duplicado Com o Google « Blog World Online

  10. Door Manufacturer Keith

    Are there any “official” checking tools for duplicate content? Not in terms of URLs, but similar content at two different URLs – any tool to see if Google (or any other SE) sees that as duplicate content?

    This is an issue for ecommerce stores, or product catalogs – where the difference between two products may just be the color, or a slight redesign, and the model number. While we can attempt to craft a different title, or meta description, most of the content remains the same.

  11. Pingback: Fixing duplicate content (and no, I'm not going to talk about pagination)

  12. Pingback: SEO hosting 101 – don't leak your staging urls into Google!

  13. Pingback: Raven SEO Weekly Digest – Issue 33