Measuring Indexation Levels in Site Architecture

A few weeks ago I got quite interested in measuring true indexation levels and potential value metric thresholds that may or may not affect the likelihood of a page being indexed in Google. After collecting some initial data I realised what a huge undertaking that passing interest actually was.

indexation? – image by Nancy Wombat

There’s no conclusion in this post, at least not yet. This is more a discussion of what I’ve observed so far while collecting, and working with, data that could answer some of the questions I have. The tweet below says it all – I think I’ve had as much fun getting the data as I have working with it.

twitter conversation

What is indexation?

I became interested in collecting data that could help me understand indexation levels on a website. Actually defining the meaning of indexation, though, is an important first step. I’m of the opinion that “indexation” means the number of pages from a website that are included in Google’s index. “Indexation” shouldn’t mean “rank”, because other factors (authority metrics, relevance) play a role in any given URL ranking for a specific query in a search engine. A page can be indexed, but it might not rank in a position for a query that any normal search engine user (non-SEO) would ever see.

This idea raises the question – is indexation the number of pages that receive one or more entries (visits) from a search engine over a given period of time? Analytics data is only one source of information on the performance of any given URL, and I’ve come to the conclusion that analytics numbers only become powerful when combined with other data sources.

Combining data sources for an overall impression of indexation in Google

In a quest to construct a better impression of indexation on my example site, I set off on a data collection mission. First, here’s the data I’ve been collecting (a rough sketch of how it might be combined, URL by URL, follows the list):

  • All URLs on a domain
  • All URLs that have an internal link (Google Webmaster Tools)
  • The response (positive or negative) to a Google cache: query for each URL
  • Analytics entries to each of the URLs
  • MozRank for each URL
  • PageRank for each URL

Methods to collect the data (for the non-developer)

Getting your hands on a snapshot of all URLs on a site is relatively easy with a tool like Xenu’s Link Sleuth. Just make sure that URLs don’t time out during the crawl, and if they do, recrawl them. If you have a site of, say, fewer than 3,000 pages, you could give SEOmoz’s Custom Crawl prototype a try.

Google Webmaster Tools data can be very useful, particularly the internal links report. The data on all URLs with at least one internal link tells us that Google has discovered the URL via an internal link. A fair assumption would be that the URLs listed in this report have also been crawled; that’s the assumption I make in my data, but I’d be really pleased to hear whether you think it’s correct.

To gather cache data from Google, I opted to recruit the new kid on the SEO tools block, Mozenda. In principle, you’re using Mozenda to scrape Google’s cached pages, recording the URL, cache date and cache time, and taking note of what I call a fail-safe. A fail-safe in a Mozenda crawl is an item of text you’ll only find on a positive result for a cache query. For example, “This is Google’s cache” only appears in the text if the query result for a cached page is positive. I use a fail-safe because I noticed the crawl agent was missing some data on occasional crawl cycles.

It’s really easy to construct an agent to do this kind of thing, and I suspect using 80Legs is quite simple too.
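To illustrate the fail-safe idea outside of Mozenda, here’s a rough Python sketch of the same check: fetch the cache: result for a URL and only record a positive hit if the fail-safe text appears. The query format, fail-safe string and date wording are assumptions based on what Google’s cache pages looked like at the time, so treat this as a sketch rather than a robust agent (and see the note on proxies and crawl rate below).

```python
import re
import requests  # third-party HTTP library

FAIL_SAFE = "This is Google's cache"  # only present on a positive cache result

def check_cache(url):
    """Return (is_cached, cache_date_text) for a single URL.
    Purely illustrative - the markup and wording of Google's cache pages
    change, and scraping them may breach their terms of service."""
    resp = requests.get("https://www.google.com/search",
                        params={"q": "cache:" + url},
                        headers={"User-Agent": "Mozilla/5.0"})
    if FAIL_SAFE not in resp.text:
        return False, None  # fail-safe missing: treat as not cached
    # The cache header text includes a date, e.g. "as it appeared on ..."
    # (the exact wording is an assumption and may differ)
    match = re.search(r"as it appeared on ([^\.<]+)", resp.text)
    return True, match.group(1).strip() if match else None
```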

A quick note on 3rd party crawlers

If you’re going to crawl Google to scrape their data, execute the agent via your own proxy. PHP proxies are really easy to deploy. Go easy on crawl rate too – with new capabilities for SEO data collection comes data greed: executing too many requests at once, and at too fast a rate. If you do this, you’re ultimately risking your own ability to collect data at all. If data scrapers are working from a handful of IP addresses, I’m quite sure the big guys like Google, Amazon, et al. will eventually block them from making requests.
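For what it’s worth, here’s a minimal sketch of what “going easy” might look like in code, assuming you have a proxy of your own to route requests through – the proxy address and delay are placeholders, not recommendations:

```python
import time
import requests

# Placeholder proxy address - substitute your own
PROXIES = {"http": "http://my-proxy.example.com:8080",
           "https": "http://my-proxy.example.com:8080"}
DELAY_SECONDS = 10  # a slow, steady crawl is the whole point

def polite_get(url):
    """Fetch one URL through your own proxy, then pause before the next request."""
    resp = requests.get(url, proxies=PROXIES, timeout=30,
                        headers={"User-Agent": "Mozilla/5.0"})
    time.sleep(DELAY_SECONDS)
    return resp
```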

If you want to do a serious site crawl, say 100,000 page load requests or more, expect to spend something in the region of $249 for the bandwidth and $399 for the registration.

Back to the data collection

Analytics data plays a role in my data set too. I use the &limit= query string to ensure that all of the landing page data from “Traffic Sources > Search Engines > Google > Landing Page” is neatly extracted in as few CSV exports as possible.

MozRank can be scraped quite easily using Mozenda via the free SEOmoz API (or, if you’re a developer, a quick PHP script will do the job). I captured PageRank in a similar manner.
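For the developer route, the sketch below shows roughly what I have in mind. The endpoint, the ‘Cols’ value and the ‘umrp’ response field are based on the free SEOmoz URL Metrics API as I remember it, so check the current API documentation before relying on any of them – every parameter here should be treated as an assumption:

```python
import urllib.parse
import requests

def get_mozrank(url, access_id, expires, signature):
    """Query the free SEOmoz URL Metrics API for a single URL's MozRank.
    The endpoint, 'Cols' bitflag and 'umrp' field name are from memory -
    verify them against the current SEOmoz API docs before relying on this."""
    endpoint = ("http://lsapi.seomoz.com/linkscape/url-metrics/"
                + urllib.parse.quote(url, safe=""))
    params = {
        "Cols": 16384,           # assumed bitflag for the MozRank column
        "AccessID": access_id,   # free API credentials
        "Expires": expires,      # signed-auth expiry timestamp
        "Signature": signature,  # signature over AccessID + Expires
    }
    data = requests.get(endpoint, params=params).json()
    return data.get("umrp")      # 'umrp' = MozRank of the page, as I recall
```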

A sample of the results so far

Here’s a sample chart of the data showing a selection of subfolder metrics:

chart

In this chart I’m looking at taxonomy subfolders such as category and tag based content. The chart shows the number of cached pages in each subfolder, the number of pages in the subfolder that have PageRank, and a count of URLs that received one or more entries from Google organic search. The folders above are likely to attract few if any external links, and they generate many URLs through the sheer number of tags assigned and deep paginated navigation. From an indexation point of view, this type of URL feels like a great starting point for observing quirky or interesting indexation behaviour.

I found it quite fascinating that many pages in the tags subdirectory are cached, but proportionally fewer have PageRank or drive any traffic. Tag pages are not like normal web pages, in that there are many of them which differ from each other by only one or two words. Despite that lack of diversity, you’d expect (or hope) that they’d be capable of generating more long tail traffic than they actually do. In reality (and I’ve seen this many, many times), default tag page templates tend to drive little traffic in real world applications.

The chart above makes more sense when you add the total number of URLs in each subfolder, although I apologise in advance that the colour scheme changes!

chart 2

An indexation ratio

Is a measure of indexation best described by a ratio? What role can a quality indicator play in that ratio? My initial thought is to take a folder by folder approach, comparing the number of URLs against the number of indexed (cached) URLs. This is where I think analytics data can really play a role, helping to understand how “employed” the content in any specific area of a site is. I’m going to be thinking about this more in the near future.
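As a starting point, here’s the sort of folder-by-folder calculation I have in mind, reusing the per-URL records sketched earlier – grouping on the first path segment is my own simplification:

```python
from collections import defaultdict
from urllib.parse import urlparse

def indexation_ratios(records):
    """Given a list of URLRecord objects (see the earlier sketch), return
    cached URLs / total URLs per first-level subfolder - a crude indexation ratio."""
    totals = defaultdict(int)
    cached = defaultdict(int)
    for record in records:
        path = urlparse(record.url).path.strip("/")
        folder = "/" + path.split("/")[0] + "/" if path else "/"
        totals[folder] += 1
        if record.is_cached:
            cached[folder] += 1
    return {folder: cached[folder] / totals[folder] for folder in totals}
```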

What’s next?

There are some questions I’d like to continue to attempt to answer, most notably: is there a quality threshold below which the likelihood of a URL being “indexed” is much lower? The early data just tells me that I need to collect more data, studying a larger site with a higher level of indexation issues. Certainly since the May Day update, I have a general sense that, regardless of relevance, a page without the right quality signals might struggle to rank or stay visible in the main index. Getting a complete picture of those quality signals is very hard, particularly given the lack of granularity in PageRank values and the incomplete crawl coverage of 3rd party link analysis tools.

I’d love to hear comments or suggestions on how you think indexation should be measured and, based on the data sources I mentioned above, how you would report site (or subfolder) indexation levels. My work here is far from complete and I’d be delighted to hear from anyone who has thoughts on the topic.

Comments

  1. Eduard Blacquière

    Great thinking, Richard. What did you learn from adding internal link data and MozRank?

    I’d love to see more data side by side as well. For example, when and how often the URL is crawled and whether the specific URL has external links. In my opinion this could also help indicate what makes an indexation threshold.

    I’m also thinking of Patrick Altoft’s post from March where he uses multiple xml sitemaps to calculate an indexation ratio per site level: http://www.blogstorm.co.uk/using-multiple-sitemaps-to-analyse-indexation-on-large-sites/

  2. jaamit

    Great first experiments here, Richard. I was going to mention Patrick’s multiple sitemaps solution too, as this is in a sense as close to a “from the horse’s mouth” indexation metric as you can get.

    The number of cached URLs metric is interesting (especially the use of Mozenda to scrape that figure!) – but this does include pages with a noindex meta tag, which may have been cached/crawled but not indexed. Still an indication of something, though, I guess.

    Moving beyond values of how many pages are cached/indexed, I recall an excellent idea from Hamlet Batista back in the day about measuring the cache date of a page – well, perhaps you could use Mozenda to aggregate the cache dates of all cached pages within a site / section, and that could be a useful metric / KPI to improve. Just a thought, haven’t really worked it through in my mind yet… ;)

  3. richardbaxterseo

    Hey Jaamit

    The indexation numbers provided by Google in Sitemaps are useful – but the numbers are limited to the sitemap file as a whole. If you’ve divided your sitemaps up folder by folder, that’s useful – but the methodology outlined in this post is an attempt at a page by page view of indexation.

    Mozenda captures the date and time, and so over time you could recrawl the URL list to get fresh cache dates. That would be expensive though!

    I think there’s definitely scope to establish rules around page by page indexation metrics using automated / scraped cache queries, value metrics and traffic data. I think the data might become more relevant on a macro scale, with many sites covering hundreds of thousands of pages, possibly millions.

  4. richardbaxterseo

    Eduard, thanks for your comments!

    I’m working on a follow up to this that covers the principle of “employment” – how hard is a content section working, based on indexed / cached pages? Traffic data and links will play a role in this too. I have no idea what answers, if any, I’ll get out of it :-)

  5. Carter Cole

    I’m working to build this kind of analysis directly into SEO Site Tools. I’ve got OAuth working, pulling Webmaster Tools and Google Analytics data, as well as grabbing and parsing the link and query CSVs that you can export… I built my PR lookup with a snowshoe technique, allowing for even faster lookups.

  6. Kris - Health Blog

    I use a plugin that puts the “noindex” attribute on tag pages. I believe it may help reduce duplicate content, but I’m not sure it matters since I use excerpts anyway, except in the post itself.

  7. Modi

    Nice post.

    Re scraping cache dates, Scrapebox does a pretty good job. Combining Scrapebox and Xenu, one can easily figure out what the indexation rate is at each level, as Xenu does split sites by level.

    Also, using Scrapebox again, it’s very easy to detect which URLs have been indexed, as not all URLs detected by Xenu may be indexed.