For a little while now I’ve been wrangling with an interesting problem in site architecture diagnostics. Do I have any orphaned pages on my site, and how do I find them?
Frequently, web developers will write code to generate an XML sitemap that will include all URLs in the database, linked to or no. How can we use this fact to our advantage?
I thought I’d share a strategy using Excel to extract all the URLs on your site, comparing them to the raw export from the amazing IIS Toolkit. (PS – word is that Daniel’s writing a “how to” for IIS, including the installation…)
Import your XML sitemap URLs into Excel
Did you know you can import an XML sitemap into Microsoft Excel? It can be a little crash-tastic, but oh-so rewarding when it works. Let’s use SEOgadget’s XML sitemap as the example:
Then, pop your XML URL into the dialogue box:
I’d recommend you only do this if you have reasonably powerful machines, we rock i7′s with 4gbs of ram as a system minimum. Big sitemaps take a while!
The end result is something like this:
By the way, you could import Twitter search results via the RSS feed into Excel using the same method, if you’re into that kind of thing.
Get your IIS site crawl data
Dan’s working on his next post, so I’m not going to steal any thunder, I promise. Assuming you already have IIS Toolkit, go ahead and export the URL listfrom your site crawl:
Import the CSV export into Excel, and turn the data into a table called IIS. If you haven’t played with exporting data from IIS, you should – it’s amazing.
There’s a ton of brilliant data in the IIS export, including the contents of the server header reponse, internal links counts and the directory the page is hosted in. We’ll talk more about that later on, use for now we’re interested in orphaned pages.
Use VLOOKUP to find pages in your sitemap that have no internal links
At the end of the day, a page in your website with no internal links is very unlikely to rank. Many pages with no internal links indicates some problems with your site architecture that you’re going to need to fix. Alas, how to find them? With a trusty VLOOKUP, that’s how.
We’re most interested in the “linked by” column, which contains the total number of internal links a page has. All you have to do is pull this data into your imported sitemap XML list. Assuming you name your IIS table “IIS”, the query will look like this:
=IFERROR(VLOOKUP(XML[[#This Row],[ns1:loc]],IIS,18,0),”Not crawled”)
Why is this good?
The end result will be a list of URLs in your XML sitemap with corresponding internal link numbers. That’s cool for doing the following:
- Finding URLs in your XML sitemap with no internal links, and are as a result, “Not crawled”
- Finding URLs in your sitemap with very low numbers of internal links
Have fun, and have a great Bank Holiday weekend