Our first post of the New Year looks at the topic of measuring SEO indexation. Inspired by Rob’s post on Distilled, and a wish to revisit some of my previous work on the topic, I thought it might be interesting to share a method of collecting data to build a clearer understanding of page level indexation.
Hopefully by the end of this post you’ll have a few new methods to collect site index data for your own SEO studies.
Why do SEOs need an understanding of the principles of indexation?
How hard is your website working for you? Which pages and content groups yield the most benefit, traffic wise? Are there any weak spots, groups of pages that don’t seem to be working well? How can you make changes to navigation, architecture and sometimes page layout to improve a website’s overall search engine visibility or long tail traffic performance? These are questions that should occur to an SEO on a regular basis – but coming to a reliable answer is not always straightforward.
Seeking answers to indexation and site architecture related questions is a worthy cause, but achieving a meaningful answer is a significant hurdle to overcome. All of the (excellent) resources on this topic tend to approach indexation from the perspective of analytics data, or content grouped together inside sitemaps. What about individual pages, though? I use the term page level indexation, because I’m seeking a granular, page level answer to my indexation questions.
What data sources can tell us how a website has been indexed by Google?
For me, there are a number of approximate indicators that a page (or page group) is indexed – for example, reporting on pages that receive at least one entry from Google via Analytics. You might also take a look at the “URLs in Web Index” report in the sitemaps section of Webmaster Tools. Savvy webmasters and SEOs may even use multiple sitemaps to get clearer page group level insight.
Log files will tell you a lot about Googlebot visits, but not about indexation – so where else can we look for inspiration?
The number of pages included in a web index can be expressed as a percentage – but exactly which URLs are included?
Collecting indexed pages data using the “cache” query
As a method to complement your existing approach, you might find this methodology quite interesting. The outcome is a page-by-page URL list for your site, where Google cache data, SEOmoz Custom Crawl data and Xenu data give you a cracking starting point for diagnosing your indexation problems. These steps assume you have a Mozenda account, although you could do the same (or similar) by building your own crawler or using 80Legs.
Collect Google cache data via a proxy
Fundamentally, we’re going to execute a series of Google cache queries via a proxy. With Mozenda, you need a method of distributing queries to Google via a proxy, and even then it gets unreliable quickly if you overcook the requests. If you use a simple PHP proxy and go very, very slowly, you’ll probably be alright.
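If you’d rather script this step than rely on a point-and-click tool, here’s a rough Python sketch of the idea. The proxy endpoint is entirely hypothetical (substitute your own simple PHP proxy), and the long delay is the whole point – this is a sketch of the approach, not a production crawler:

```python
import time
import urllib.parse
import urllib.request

# Hypothetical proxy endpoint -- substitute the URL of your own PHP proxy.
PROXY = "http://example.com/proxy.php?url="


def cache_query_url(page_url):
    """Build the Google 'cache:' query URL for a given page."""
    query = urllib.parse.quote("cache:" + page_url)
    return "http://www.google.com/search?q=" + query


def fetch_via_proxy(page_url, delay=30):
    """Fetch the cache query through the proxy, pausing between requests."""
    target = PROXY + urllib.parse.quote(cache_query_url(page_url), safe="")
    html = urllib.request.urlopen(target).read().decode("utf-8", "replace")
    time.sleep(delay)  # go very, very slowly
    return html


print(cache_query_url("http://www.seogadget.co.uk/"))
```

The `delay` default is deliberately conservative; overcooking the request rate is exactly what gets this approach blocked.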
Get a URL list
For this, you’ll need a list of all of the URLs your site can generate. The easiest way to get this list is to extract all of the URLs from your XML sitemap(s), or to ask your developer. Remember that if you crawl your site with Xenu, you might miss orphaned pages.
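Extracting the URL list from a sitemap is a few lines of Python with the standard library – every URL lives in a `<loc>` element under the sitemaps.org namespace. A minimal sketch, using a made-up example sitemap:

```python
import xml.etree.ElementTree as ET

# The sitemaps.org namespace that wraps every element in a standard sitemap.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Extract every <loc> from an XML sitemap into a plain URL list."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]


# Hypothetical sitemap for illustration only.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/category/seo/</loc></url>
</urlset>"""

print(urls_from_sitemap(sample))
```

Run this over each sitemap file and concatenate the results to build the master URL list for the cache agent.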
Build a Mozenda agent scraper
Your crawler needs to execute the Google cache query and should be configured to capture the URL and cache date.
Executing the query would result in something like this:
If no results are found, your agent needs to record that alternative outcome. When you’re happy with the agent you’ve built, upload and run it. Execute this very slowly (the proxy in the image is a publicly available service – proceed with caution).
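Whatever tool runs the queries, the per-URL record you want is the same: the URL, whether a cache copy was found, and the cache date from the banner Google shows at the top of a cached page. A sketch of that capture step – note the banner wording and date format are assumptions based on the cache pages of the time:

```python
import re

# Pulls the date out of Google's cache banner, e.g.
# "... as it appeared on 6 Jan 2011 10:15:32 GMT." (assumed wording).
DATE_PATTERN = re.compile(r"as it appeared on (.+? GMT)")


def parse_cache_result(url, html):
    """Record the URL with its cache date, or flag it as not cached."""
    match = DATE_PATTERN.search(html)
    if match:
        return {"url": url, "cached": True, "cache_date": match.group(1)}
    return {"url": url, "cached": False, "cache_date": None}


banner = "It is a snapshot of the page as it appeared on 6 Jan 2011 10:15:32 GMT."
print(parse_cache_result("http://www.example.com/", banner))
```

The “not cached” branch is as important as the positive one – those `cached: False` rows are the pages you’re hunting for.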
Mozenda running my cache agent:
Combine your new data with a few other sources
While your cache scraper is running, think about where else you could gain insight by combining your data sources. Let’s not forget we’re trying to locate pages that are not being indexed. Some of the data points you could include:
- Click depth of URL from home page
- Internal links out from page
- Internal links in to page
- Meta robots
- X-Robots in server header
- Status code response
All of these data points can be gathered from two sources – Xenu’s Link Sleuth and the SEOmoz Custom Crawl tool. Xenu needs little introduction, but few know that click depth and internal links in and out of a page are part of the available data. SEOmoz’s Custom Crawl is awesome, and includes data on the server header response, the contents of the X-Robots tag, the meta title and the rel canonical target.
Having a list of all URLs on your site, with a definitive answer on click depth, number of internal links and the Google Cache status is a very interesting piece of data to have, but (of course), it can be extended even further.
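Joining the sources is just a merge keyed on URL. A minimal sketch, with hypothetical extracts standing in for the real Mozenda, Xenu and Custom Crawl exports:

```python
# Hypothetical per-URL extracts from each source (illustrative values only).
cache_data = {"http://www.example.com/": {"cached": True}}
xenu_data = {"http://www.example.com/": {"click_depth": 0, "links_in": 42}}
crawl_data = {"http://www.example.com/": {"status": 200, "meta_robots": "index,follow"}}


def combine(*sources):
    """Merge per-URL dictionaries from each source into one row per URL."""
    rows = {}
    for source in sources:
        for url, fields in source.items():
            rows.setdefault(url, {"url": url}).update(fields)
    return list(rows.values())


print(combine(cache_data, xenu_data, crawl_data))
```

Because `setdefault` creates a row for any URL that appears in any source, a page that Xenu found but the cache query didn’t (or vice versa) still shows up – gaps like that are often the most interesting rows.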
If you’re looking for a larger crawl of your site with the same data, Adam from SEOmoz has pointed out that you can get 10,000+ pages (depending on your membership level) crawled and exported from the SEOmoz Pro account:
You can find this data via the “Crawl Diagnostics” tab in your campaign dashboard. Thanks Adam!
Most websites have a relatively simple approach to content types via their URL formation. This blog, for example, uses “/category/” in the URL to indicate the category content type. Paginated URLs might appear as “/page/*/”. If you’re a retail site, perhaps your product pages contain “/product/”.
By using an Excel query to group your content types, you’ll be able to get a sense of overall indexation in an area of your site without having to group the sitemaps together. Try something like:
=NOT(ISERROR(SEARCH("[URL CHUNK]",Table3[[#This Row],[URL]],1)))
Where “[URL CHUNK]” could be “/page/”, “/products” or whatever. The outcome is “TRUE” if your URL belongs to a recognised group, and “FALSE” if it doesn’t.
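If you’re assembling the data outside Excel, the same grouping logic is a one-liner per URL. A sketch, with an assumed set of chunks matching the examples above:

```python
def content_group(url, groups=("/category/", "/page/", "/product/")):
    """Return the first URL chunk the URL contains (like Excel's SEARCH),
    or None if the URL belongs to no recognised group."""
    for chunk in groups:
        if chunk in url:
            return chunk
    return None


print(content_group("http://www.example.com/page/2/"))
print(content_group("http://www.example.com/about/"))
```

Tag every row with its group, and a pivot over `cached` by group gives you per-content-type indexation at a glance.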
Entries via Google to URL
With a simple VLOOKUP, you can combine traffic numbers by URL in your indexation data. This might help highlight pages that *should* have a little traffic from Google, but don’t – or at least you’ll have another point of reference for your investigations.
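A Python stand-in for that VLOOKUP step, assuming a hypothetical `traffic_by_url` dictionary built from an Analytics export:

```python
# Hypothetical Analytics export: entries from Google per URL.
traffic_by_url = {"http://www.example.com/category/seo/": 120}


def entries_for(url, traffic):
    """Look up Google entries for a URL, defaulting to 0 -- like VLOOKUP."""
    return traffic.get(url, 0)


def pages_missing_traffic(urls, traffic):
    """URLs in the crawl that receive no entries from Google at all."""
    return [u for u in urls if entries_for(u, traffic) == 0]


print(pages_missing_traffic(
    ["http://www.example.com/category/seo/", "http://www.example.com/orphan/"],
    traffic_by_url,
))
```

Defaulting missing URLs to zero (rather than an error, as a raw VLOOKUP would give) keeps the zero-traffic pages visible – they’re the ones worth investigating.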
The end result
Here’s a screenshot of the example data I built while writing this post. You’ll see all of the data I’ve mentioned, along with a number of “content groups” I found most relevant to my blog. There are some properly configured duplicate pages within SEOgadget which, I can confidently report, are not cached, nor are they generating traffic. My data tells me that the paginated URLs on the homepage, category and tag pages are properly set to noindex, but that those pesky comment pages (where a blog post exceeds a certain number of comments, we paginate them) are misbehaving – they should be set to noindex. Time to roll my sleeves up.
Click to enlarge…
I hope you can see from this screenshot how you might benefit from combining data into a single point to identify, diagnose and fix indexation issues on your site. Of course there are other data sources out there, and we’ve not touched on the visual aspect of representing this data, which I’m saving for another post.
In the meantime, I’d really like to hear your thoughts, particularly on the data you might choose to help diagnose your architecture and indexation issues.
Justin De La Ornellas