Page Level Search Engine Indexation [Data & Collection Methodology]

Seoul Design Expo

Our first post of the New Year looks at the topic of measuring SEO indexation. Inspired by Rob’s post on Distilled, and a wish to revisit some of my previous work on the topic, I thought it might be interesting to share a method of collecting data to build a clearer understanding of page level indexation.

Hopefully by the end of this post you’ll have a few new methods to collect site index data for your own SEO studies.

Why do SEOs need an understanding of the principles of indexation?

How hard is your website working for you? Which pages and content groups yield the most benefit, traffic wise? Are there any weak spots, groups of pages that don’t seem to be working well? How can you make changes to navigation, architecture and sometimes page layout to improve a website’s overall search engine visibility or long tail traffic performance? These are questions that should occur to an SEO on a regular basis – but coming to a reliable answer is not always straightforward.

Seeking answers to indexation and site architecture related questions is a worthy cause, but achieving a meaningful answer is a significant hurdle to overcome. All of the (excellent) resources on this topic tend to approach indexation from the perspective of analytics data, or content grouped together inside sitemaps. What about individual pages, though? I use the term page level indexation, because I’m seeking a granular, page level answer to my indexation questions.

What data sources can tell us how a website has been indexed by Google?

For me, there are a number of approximate indicators that a page (or page group) is indexed – for example, reporting on pages that receive at least one entry from Google via Analytics. You might wish to take a look at the “URLs in Web Index” report in the sitemaps section of Webmaster Tools. Savvy webmasters and SEOs may even use multiple sitemaps to get clearer page group level insight.

Logfiles will tell you a lot about GoogleBot visits, but not indexation, so where else can we look for inspiration?
Number of pages included in Web index according to Google

Webmaster Tools expresses the number of pages included in the web index as a percentage – but exactly which URLs are included?

Collecting indexed pages data using the “cache” query

As a method to complement your existing approach, you might find this methodology quite interesting. The outcome will be a page by page URL list for your site, where Google cache data, SEOmoz custom crawl and XENU data will give you a cracking starting point for diagnosing your indexation problems. These steps involve having a Mozenda account, although you could do the same (or similar) by building your own crawler or using 80Legs.

Collect Google cache data via a proxy

Fundamentally, we’re going to be executing a series of Google cache queries via a proxy. Even with Mozenda, you need a method of distributing queries to Google via a proxy, and even then it gets unreliable quickly if you overcook the requests. If you use a simple PHP proxy and go very, very slowly, you’ll probably be alright.
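As a rough sketch of the idea (not Mozenda itself – the fetch callable is where your proxy would live, and the wording of Google’s cached-page banner is an assumption that may change without notice):

```python
import re
import time
import urllib.parse

def cache_query_url(url):
    """Build the Google cache: query for a URL."""
    return "https://www.google.com/search?q=" + urllib.parse.quote("cache:" + url)

def parse_cache_date(html):
    """Pull the 'as it appeared on ...' date out of a cached-page banner.
    The exact wording is an assumption and may change without notice."""
    match = re.search(r"as it appeared on (.+?)\.", html)
    return match.group(1) if match else None

def collect_cache_dates(urls, fetch, delay=30.0):
    """Run the cache query for each URL via the supplied fetch(url) callable
    (your proxy lives inside fetch), pausing between requests."""
    results = {}
    for url in urls:
        html = fetch(cache_query_url(url))
        results[url] = parse_cache_date(html) or "not cached"
        time.sleep(delay)  # go very, very slowly
    return results
```

The long delay between requests is the whole point – batch this overnight rather than hammering the proxy.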

Get a URL list

For this, you’ll need a list of all of the URLs your site can generate. The easiest way to get this list is to extract all of the URLs from your XML sitemap(s), or ask your developer. Remember that if you crawl your site with XENU, you might miss orphaned pages.
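If you’d rather not do the extraction by hand, a minimal sketch using only the standard library (the sitemap below is a placeholder) might look like:

```python
import xml.etree.ElementTree as ET

# The sitemaps.org namespace every standard XML sitemap declares.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_sitemap(xml_text):
    """Extract every <loc> entry from an XML sitemap, in document order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about/</loc></url>
</urlset>"""

print(urls_from_sitemap(sitemap))
```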

Build a Mozenda agent scraper

Your crawler needs to execute the Google cache query – in the form cache:[URL] – and should be configured to capture the URL and cache date.

Which would return a result like:

cache date from Google

If no results are found, your agent needs to be able to record the alternative result. When you’re happy with the agent you’ve built, upload and run it. Execute this very slowly (the proxy in the image is a publicly available service – proceed with caution).


Mozenda running my cache agent:

cache data being collected

Combine your new data with a few other sources

While your cache scraper is running, think about where else you could gain insight by combining your data sources. Let’s not forget we’re trying to locate pages that are not being indexed. Some of the data points you could include:

- Click depth of URL from home page
- Internal links out from page
- Internal links in to page
- Meta robots
- X-Robots in server header
- Status code response

All of these data points can be gathered from two sources – Xenu’s Link Sleuth and the SEOmoz Custom Crawl tool. Xenu needs little introduction, but few know that click depth and internal links in and out of a page are part of its available data. SEOmoz’s Custom Crawl is awesome, and includes data on the server header response, contents of the X-Robots tag, meta title and rel canonical target.

Custom crawl

Having a list of all URLs on your site, with a definitive answer on click depth, number of internal links and the Google Cache status is a very interesting piece of data to have, but (of course), it can be extended even further.
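Joining the exports is just a merge keyed on URL. A minimal sketch (the field names and example rows below are hypothetical, not the actual column headings from Xenu or Custom Crawl):

```python
def merge_by_url(*sources):
    """Combine several {url: {field: value}} dicts into one row per URL."""
    merged = {}
    for source in sources:
        for url, fields in source.items():
            merged.setdefault(url, {}).update(fields)
    return merged

# Hypothetical exports: one dict of fields per URL from each tool.
xenu = {"http://example.com/deep/page/": {"click_depth": 4, "links_in": 1}}
crawl = {"http://example.com/deep/page/": {"status": 200, "meta_robots": "index,follow"}}
cache = {"http://example.com/deep/page/": {"cache_date": None}}  # not cached

rows = merge_by_url(xenu, crawl, cache)

# Pages that crawl fine (200, indexable) yet have no cache date are the
# ones worth investigating.
suspects = [u for u, r in rows.items()
            if r.get("status") == 200 and r.get("cache_date") is None]
```

Make sure every source normalises its URLs the same way (trailing slashes, http vs https) before merging, or the join will silently miss rows.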

If you’re looking for a larger crawl of your site with the same data, Adam from SEOmoz has pointed out you can get 10,000+ pages (depending on your membership level) crawled and exported from the SEOmoz Pro Account:

seomoz pro

You can find this data via the “Crawl Diagnostics” tab in your campaign dashboard. Thanks Adam!

Content grouping

Most websites have a relatively simple approach to content types via their URL formation. This blog, for example, uses “/category/” in the URL to indicate the category content type. Paginated URLs might appear as “/page/*/”. If you’re a retail site, perhaps your product pages contain “/product/”.

By using an Excel query to group your content types, you’ll have the ability to get a sense of overall indexation in an area of your site, without having to group the sitemaps together. Try something like:

=NOT(ISERROR(SEARCH("[URL CHUNK]",Table3[[#This Row],[URL]],1)))

Where “[URL CHUNK]” could be “/page/”, “/products” or whatever. The outcome is “TRUE” if your URL belongs to a recognised group, and “FALSE” if it doesn’t.
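The same grouping logic is easy to sketch outside Excel too – here’s a hypothetical Python equivalent (group names and chunks are examples, swap in your own):

```python
# Map a group label to the URL chunk that identifies it, mirroring the
# [URL CHUNK] idea from the Excel formula above.
GROUPS = {
    "category": "/category/",
    "pagination": "/page/",
    "product": "/product/",
}

def content_group(url):
    """Return the first group whose chunk appears in the URL, else 'other'."""
    for name, chunk in GROUPS.items():
        if chunk in url:
            return name
    return "other"

print(content_group("http://example.com/category/seo/"))  # → category
print(content_group("http://example.com/page/2/"))        # → pagination
```

Unlike the TRUE/FALSE formula, this assigns each URL to exactly one named group, which makes pivoting by group a little easier.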

Entries via Google to URL

With a simple VLOOKUP, you can combine traffic numbers by URL into your indexation data. This might help highlight pages that *should* have a little traffic from Google, but don’t – or at least you’ll have another point of reference for your investigations.
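The VLOOKUP step amounts to a keyed lookup with a default of zero for URLs that never appear in the analytics export. A sketch (the traffic figures and URLs are invented for illustration):

```python
# Hypothetical analytics export: entries from Google per landing page URL.
entries = {
    "http://example.com/category/seo/": 120,
    "http://example.com/old-post/": 0,
}

# The master URL list from your crawl; the orphan never appears in analytics.
url_list = [
    "http://example.com/category/seo/",
    "http://example.com/old-post/",
    "http://example.com/orphan/",
]

# VLOOKUP equivalent: default to 0 when a URL has no analytics row at all.
traffic = {url: entries.get(url, 0) for url in url_list}
no_traffic = [url for url, n in traffic.items() if n == 0]
```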

Landing Pages

The end result

Here’s a screenshot of the example data I built while writing this post. You’ll see all of the data I’ve mentioned in this post, along with a number of “content groups” I found most relevant to my blog. There are some properly configured duplicate pages within SEOgadget which, I can confidently report, are not cached, nor are they generating traffic. My data tells me that the paginated URLs on the homepage, category and tag pages are properly set to noindex, but that those pesky comment pages (where a blog post has more than a certain number of comments, we paginate them) are misbehaving – they should be set to noindex. Time to roll my sleeves up.

Click to enlarge…

Indexation Data

I hope you can see from this screenshot how you might benefit from combining data into a single point to identify, diagnose and fix indexation issues on your site. Of course there are other data sources out there, and we’ve not touched on the visual aspect of representing this data, which I’m saving for another post.

In the meantime, I’d really like to hear your thoughts, particularly on the data you might choose to help diagnose your architecture and indexation issues.

Image credits:
Justin De La Ornellas



  1. Ross Hudgens

    Great post Richard. This post should be a must-reference for anyone with a large website. Thankfully (or maybe not thankfully?) I don’t have any massive websites to play with ATM, so this isn’t as much of a concern. But when that time inevitably comes, I’m sure I’ll reference back to this great post.

  2. adamSEO

    One thing I’d like to add is that the latest version of Xenu supports wildcards which is handy when you’re focusing on a specific area of a large site. I’m pretty sure Integrity for mac also allows this.

  3. richardbaxterseo Post author

    @Nuttakorn – I’m talking with the SEOmoz team right now, I’m pretty sure the pro tool will export up to 10,000 pages. I also suspect that custom crawl, given an internal URL, may crawl a different list of URLs. Several crawls, followed by a de-dupe, could yield more data.

  4. john

    Great stuff richard, never failed to be impressed by your Excel wizardry!

    A lot of the techniques you use here can also be applied to scraping your backlink data to analyse the lifecycle of your links and where links are rotting away.

    The SEOmoz tool is an interesting one, I know you’re using your own site as an example but realistically this is the sort of exercise which will be most valuable to sites with tens or hundreds of thousands of pages, which makes me think 3000 (or even 10000) pages is pretty useless don’t you think?

  5. Richard Baxter

    @John thanks buddy (I do like a touch of Excel) – heads up on the SEOmoz crawl, the number of pages the pro account crawls depends on your subscription level. You can have up to one million pages. It’s the best Crawler for SEO in the industry!

  6. Jerry Okorie

    What baffles me is the extent you go to show real examples and steps to follow. In terms of collecting data, does Mozenda have a limit in size of pages it crawls? Keep up the great posts