A few weeks ago I got quite interested in measuring true indexation levels and potential value metric thresholds that may or may not affect the likelihood of a page being indexed in Google. After collecting some initial data I realised what a huge undertaking that passing interest actually was.
Image by: Nancy Wombat
There’s no conclusion in this post, at least not yet. This is more of a discussion of what I’ve observed so far while collecting, and working with, data that could answer some of the questions I have. The tweet below says it all – I think I’ve had as much fun getting the data as I have working with it.
What is indexation?
I became interested in collecting data that could help me understand indexation levels on a website. Actually defining the meaning of indexation, though, is an important first step. I’m of the opinion that “indexation” means the number of pages from a website that are included in Google’s index. “Indexation” shouldn’t mean “rank”, because other factors (authority metrics, relevance) play a role in any given URL ranking for a specific query in a search engine. A page can be indexed, but it might not rank in a position for a query that any normal search engine user (non-SEO) would ever see.
This idea raises the question – is indexation the number of pages that receive one or more entries from a search engine over a given period of time? Analytics data is only one source of information on the performance of any given URL, and I’ve led myself to the conclusion that analytics numbers only become powerful when combined with other data sources.
Combining data sources for an overall impression of indexation in Google
In a quest to construct a better impression of indexation on my example site, I set about on a data collection mission. First, I’ll describe what data I’ve been collecting:
- All URLs on a domain
- All URLs that have an internal link (Google Webmaster Tools)
- The response (positive or negative) to a Google cache: query for each URL
- Analytics entries to each of the URLs
- MozRank for each URL
- PageRank for each URL
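To make the idea of combining these sources concrete, here is a minimal Python sketch that merges them into one record per URL. It is only an illustration: the names `UrlRecord` and `combine_sources` are hypothetical, and it assumes each source has already been exported into a plain list or dictionary keyed by URL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UrlRecord:
    """One row per URL, holding every metric collected for it."""
    url: str
    has_internal_link: bool = False   # listed in the Webmaster Tools internal links report
    cached: bool = False              # positive response to a cache: query
    entries: int = 0                  # analytics entries from Google organic search
    mozrank: Optional[float] = None   # None means no value was collected
    pagerank: Optional[int] = None    # None means no value was collected


def combine_sources(all_urls, linked_urls, cached_urls, entries, mozrank, pagerank):
    """Join the separate exports on URL, using the full crawl as the base list."""
    linked = set(linked_urls)
    cached = set(cached_urls)
    return {
        url: UrlRecord(
            url=url,
            has_internal_link=url in linked,
            cached=url in cached,
            entries=entries.get(url, 0),
            mozrank=mozrank.get(url),
            pagerank=pagerank.get(url),
        )
        for url in all_urls
    }
```

Using the full site crawl as the base list means a URL that is missing from every other source still appears in the result, which is exactly the kind of page an indexation study is looking for.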
Methods to collect the data (for the non developer)
Getting your hands on a snapshot of all URLs on a site is relatively easy with a tool like Xenu’s Link Sleuth. Just make sure that URLs don’t time out during the crawl, and if they do, recrawl those URLs. If you have a site of, say, fewer than 3,000 pages you could give the Custom Crawl prototype a try at SEOmoz.
Google Webmaster Tools data can be very useful, particularly the internal links report. The data on all URLs with at least one internal link tells us that Google has discovered each of those URLs via an internal link. A fair assumption would be that the URLs listed in this report have also been crawled. That’s the assumption I make in my data, but I’d be really pleased to hear whether you think this is correct.
To gather cache data from Google, I opted to recruit the new kid on the SEO tools block, Mozenda. In principle, you’re using Mozenda to scrape Google cached pages, recording the cache date, URL, cache time and taking note of what I call a fail safe. A fail safe in a Mozenda crawl is an item of text you’ll only find on a positive result for a cache query. For example “This is Google’s cache” only appears in text if the query result for a cached page is positive. I use a fail safe because I noticed the crawl agent was missing some data on occasional crawl cycles.
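For anyone scripting this rather than using Mozenda, the fail safe idea might look something like the Python sketch below. It does not reproduce how Mozenda works. The function name and the three result labels are my own assumptions, as is the rule that an empty response means the agent missed the page and should retry.

```python
def normalise(text):
    """Lower-case the text and treat curly and straight apostrophes as the same."""
    return text.replace("\u2019", "'").lower()


def classify_cache_result(page_text, marker="This is Google's cache"):
    """Classify the text scraped from a cache: query.

    Returns "cached" when the fail safe text is present, "retry" when nothing
    was captured (assumed to be a missed crawl cycle), and "not_cached" otherwise.
    """
    if not page_text or not page_text.strip():
        return "retry"
    if normalise(marker) in normalise(page_text):
        return "cached"
    return "not_cached"
```

Keeping "retry" separate from "not_cached" stops a missed crawl cycle from being counted as a negative cache result.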
It’s really easy to construct an agent to do this kind of thing, and I suspect using 80Legs is quite simple too.
A quick note on 3rd party crawlers
If you’re going to crawl Google to scrape their data, execute the agent via your own proxy. PHP proxies are really easy to deploy. Go easy on crawl rate too – with new capabilities for SEO data collection comes data greed, executing too many requests at once and at too fast a rate. If you’re doing this, you’re ultimately risking your own ability to collect data at all. If data scrapers are working from a handful of IP addresses, I’m quite sure they’ll be blocked from making requests by the big guys like Google, Amazon, et al, eventually.
If you want to do a serious site crawl, say 100,000 page load requests or more, expect to spend around $249 in total for the bandwidth and $399 for the registration.
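The advice above about going easy on crawl rate can be sketched in a few lines of Python. This is an illustration under assumptions: `crawl_politely` is a hypothetical name, `fetch` stands in for whatever function actually requests a page through your own proxy, and the five second default delay is an arbitrary placeholder rather than a recommended value.

```python
import time


def crawl_politely(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch URLs one at a time, pausing between requests.

    `fetch` is a caller-supplied function (hypothetical here) that takes a URL
    and returns the response. `sleep` is injectable so the pause can be tested.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results[url] = fetch(url)
    return results
```

Requests are made strictly one after another, so the agent never has more than one request in flight at a time.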
Back to the data collection
Analytics data plays a role in my data set too. I use the &limit= query string to ensure that all of the landing page data from “Traffic Sources > Search Engines > Google > Landing Page” is neatly extracted in as few CSV exports as possible.
A sample of the results so far
Here’s a sample chart of the data showing a selection of subfolder metrics:
In this chart I’m looking at taxonomy subfolders such as category and tag based content. The chart shows the number of cached pages in each subfolder, the number of pages in the subfolder that have PageRank and a count of URLs that received one or more entries from Google organic search. The folders above are likely to attract few if any external links, and generate many URLs through the sheer number of tags assigned and large levels of paginated navigation. From an indexation point of view it feels like this type of URL is a great starting point to observe quirky or interesting indexation behaviour.
I found it quite fascinating that many pages in the tags subdirectory are cached, but proportionally fewer have PageRank and drive any traffic. Tag pages are not like normal web pages, in that there are many pages which differ from each other by only one or two words. Regardless of a lack of diversity, you’d expect (or hope) that they’d be capable of generating more long tail traffic than they actually do. In reality (I’ve seen this many, many times) default tag page templates tend to drive little traffic in real world applications.
The chart above makes more sense when you add the total number of URLs in each subfolder, although I apologise in advance that the colour scheme changes!
An indexation ratio
Is a measure of indexation best described by a ratio? What role can a quality indicator play in this ratio? Certainly my initial thoughts are to take a folder by folder approach looking at the number of URLs vs indexed (cached) URLs. This is where I think analytics data can really play a role, in helping understand how “employed” all content is in any specific area of a site. I’m going to be thinking about this more in the near future.
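As a starting point for the folder by folder approach, here is a rough Python sketch of one possible ratio. It is only one way of framing it: the function names and the "employed" ratio (the share of URLs with at least one Google entry) are assumptions for illustration, and it groups URLs by their first path segment only.

```python
from collections import defaultdict
from urllib.parse import urlparse


def subfolder(url):
    """Return the top-level folder of a URL, e.g. "/tag/", or "/" for root pages."""
    path = urlparse(url).path
    segments = [segment for segment in path.split("/") if segment]
    if segments and (len(segments) > 1 or path.endswith("/")):
        return "/" + segments[0] + "/"
    return "/"


def indexation_by_subfolder(all_urls, cached_urls, entries):
    """Count URLs, cached URLs and URLs with entries per folder, plus two ratios."""
    cached = set(cached_urls)
    stats = defaultdict(lambda: {"urls": 0, "cached": 0, "with_entries": 0})
    for url in all_urls:
        row = stats[subfolder(url)]
        row["urls"] += 1
        row["cached"] += int(url in cached)
        row["with_entries"] += int(entries.get(url, 0) > 0)
    return {
        folder: {
            **row,
            "cached_ratio": row["cached"] / row["urls"],
            "employed_ratio": row["with_entries"] / row["urls"],
        }
        for folder, row in stats.items()
    }
```

Reporting the two ratios side by side separates content that is merely cached from content that is actually receiving entries from search.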
There are some questions I’d like to continue to attempt to answer, most notably, is there a quality threshold below which the likelihood of a URL being “indexed” is much lower? The early data just tells me that I need to collect more data, studying a larger site with a higher level of indexation issues. Certainly since the May Day update, I have a general sense that regardless of relevance, a page without the right quality signals might struggle to rank or stay visible in the main index. Getting a complete picture of those quality signals is very hard, particularly with the lack of granularity in PageRank values, and the incomplete crawl coverage of 3rd party link analysis tools.
I’d love to hear comments or suggestions on how you think indexation should be measured and, based on the data sources I mentioned above, how you would report site (or subfolder) indexation levels. My work here is far from complete and I’d be delighted to hear from anyone who has thoughts on the topic.