Search Engine Accessibility: Easy Checks That Make a Big Difference


Around six months ago, I came across a website that was blocking the MSNbot crawler (that’s Bingbot since October 2010). No one at the company had any idea of the situation, and the block had been costing them traffic for nearly a year. Ouch.

This weekend, SEOmoz accidentally blocked Googlebot’s access to Open Site Explorer while trying to deal with some very heavy server load from a distributed botnet posing as a search engine crawler. Accidents are unavoidable, and the occasional slip can happen to the best of us, but what checks can we put in place to mitigate such risks?
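A cheap safeguard against that kind of accident is a scripted robots.txt check you can run on a schedule. Here’s a minimal Python sketch using the standard library’s parser; the robots.txt content and URLs are made-up examples, so substitute a fetch of your own live file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content -- in practice, fetch your live /robots.txt.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow:
"""

def crawler_allowed(robots_txt, user_agent, url):
    """Return True if the given crawler may fetch the URL under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(crawler_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/private/page"))  # False
print(crawler_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/blog/"))         # True
```

Run it against the crawlers you care about (Googlebot, Bingbot and so on) and alert if a URL that should be crawlable suddenly isn’t.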

This post is based on an article I wrote for State of Search in July 2010. I’ve refreshed and updated some of the advice and added a few new checks, too.

Get into the habit

Time to get the honesty box out: when was the last time you switched user agents? Checked your 304 Not Modified responses? Made sure your canonical “www” redirects and trailing slashes were being added or removed correctly? Some checks are easily missed in today’s “out of the box” code world.

Traffic dropped recently? Are you happy blaming the latest algorithm update or could it be a problem closer to home? Remember, web server configurations can change, often without the SEO being made aware.

Periodically browse your site with a different user agent setting

In the example at the beginning of this post, I mentioned a search engine user agent being blocked from crawling a site. Traffic outages caused by blocked crawler access are rare, but as we’ve seen very recently, they can happen.

Browsing the internet with your user agent set to, say, Bingbot can reveal some fascinating oversights, errors or, dare I say, cloaking.

SEOmoz’s toolbar or User Agent Switcher both offer the capability to switch user agents in Firefox:

change user agent
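If you’d rather script the check than click through a toolbar, a rough Python sketch follows. The user agent string is Bing’s published one at the time of writing (check their documentation for the current value), and example.com stands in for your own site:

```python
import urllib.request

# Bingbot's published user agent string -- an assumption; verify against Bing's docs.
BINGBOT_UA = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

def fetch_as(url, user_agent):
    """Fetch a URL while presenting the given User-Agent; return (status, body)."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.status, resp.read()

# Compare what a normal browser sees with what the bot sees -- a big difference
# in status code or body length hints at blocking or cloaking. For example:
# bot_status, bot_body = fetch_as("https://example.com/", BINGBOT_UA)
```

Fetch a handful of key pages as yourself and as the bot, then diff the results.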

Check your canonical redirects and old domain inventory

If you’re a seasoned SEO old-timer, there’s nothing new here – but be honest: when was the last time you checked your canonical redirects? Does your “www” redirect in, or out (depending on which you prefer), with a 301 server header response? The same tip applies to title-case redirects, trailing slashes and even your old redirected domain inventory.
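As a rule of thumb, the non-canonical host should answer with a single 301 whose Location header points at the canonical one. A small Python sketch of that check, with placeholder hostnames:

```python
import http.client

def redirect_status(host, path="/"):
    """Request a path WITHOUT following redirects; return (status, Location header)."""
    conn = http.client.HTTPSConnection(host, timeout=10)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

def canonical_redirect_ok(status, location, canonical_host):
    """A canonical redirect should be a permanent 301 landing on the canonical host."""
    return status == 301 and location is not None and canonical_host in location

# e.g. status, location = redirect_status("example.com")
#      canonical_redirect_ok(status, location, "www.example.com")
```

The same two functions work for trailing-slash and title-case checks: request the variant, assert a 301, and assert the Location is the version you want indexed.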

Make sure your rel=”canonical” is correct sitewide

A few mistakes with rel=”canonical” can lead to an unpleasant outcome. On larger sites, though, it’s quite difficult to keep tabs on how rel=”canonical” is configured. Fortunately, SEOmoz’s Web App has a nifty export feature that gives you all of the crawl data for your site (you’re limited to however many URLs your subscription plan allows – most are 10,000, but the limit is a million).
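If you want a second opinion on what’s actually in the markup, the Python standard library can pull rel=”canonical” out of raw HTML. A minimal sketch (the sample page is made up):

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> tag in a page."""
    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            attrs = dict(attrs)
            if (attrs.get("rel") or "").lower() == "canonical" and attrs.get("href"):
                self.canonicals.append(attrs["href"])

def find_canonicals(html):
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals

page = '<html><head><link rel="canonical" href="https://example.com/page/"></head></html>'
print(find_canonicals(page))  # ['https://example.com/page/']
```

Run it across a crawl of your own pages: every URL should yield exactly one canonical, and it should be the URL you expect – zero, two, or a surprise value all deserve investigation.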

Compare your canonical with the SEOmoz web application

Some of the more interesting values you can extract from this tool are:

- Blocked by X-robots
- Blocked by meta-robots
- Rel Canonical
- http_status_code
- x_robots_tag_header
- meta_description_tag
- meta_robots_tag
- meta_refresh_tag
- rel_canonical_tag
- blocking_google
- blocking_yahoo
- blocking_bing

If you’re a bit of an Excel geek, you might be interested in the data export for SEOgadget’s most recent crawl from SEOmoz’s crawler. I know that some of this (but *not* all of it) is available via IIS SEO Toolkit (here’s the installation guide) and it looks like SEO Spider has a great deal to offer, too. Dan’s on the case with a feature by feature comparison of all three (and possibly more), so stay tuned.

Test that your website is accessible with JavaScript disabled

Some websites won’t serve content to a browser with JavaScript disabled – an age-old problem that still crops up from time to time, and a nasty one for search engines. I got an email from a concerned webmaster in Germany who, after migrating to a new website, had lost all of his search engine traffic. JavaScript was the problem. Disabling JavaScript is easy in Firefox using the SEOmoz toolbar or the Web Developer Toolbar. Browse around your site and make sure all is well.

javascript
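A script can approximate the same test: urllib fetches raw HTML and never executes JavaScript, so what it sees is roughly what a JS-disabled visitor (or a crawler that doesn’t render scripts) sees. A sketch, with placeholder phrases standing in for your key content:

```python
import urllib.request

def fetch_raw_html(url):
    """urllib never executes JavaScript, so this is what a JS-disabled visitor gets."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def missing_without_js(raw_html, must_have):
    """Return the key phrases that are absent from the raw, un-rendered HTML."""
    return [phrase for phrase in must_have if phrase not in raw_html]

# e.g. html = fetch_raw_html("https://example.com/")
#      missing_without_js(html, ["Product name", "Main navigation"])
```

If crucial copy or navigation only appears after scripts run, this check will list it as missing.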

Beyond 404s – server header checks that get missed

Beyond checking that your error pages produce a 404 (and that Google Webmaster Tools isn’t reporting too many), you might want to consider digging into your server header responses a little deeper. For example, a “304 Not Modified” is a response to an If-Modified-Since field in the client request header. In English: some web servers will respond with “not modified” if the page requested hasn’t changed since the last time it was crawled.

I’ve seen 304 responses handled really badly. In one situation, a website was responding normally to all requests except those where the If-Modified-Since header field was present. The server, instead of returning the correct 304 response, collapsed spectacularly with a 403 error. Oops! Test your site with Feed the Bot’s awesome 304 header checker tool (one of my favourite SEO tools).

not modified 304
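You can reproduce that failure case in a few lines of Python: send an If-Modified-Since header and make sure the answer is a sensible 200 or 304, never an error. The URL in the comment is a placeholder:

```python
import urllib.request
from email.utils import formatdate
from urllib.error import HTTPError

def conditional_get_status(url, since_epoch):
    """GET with If-Modified-Since set to the given epoch time; return the status code."""
    headers = {"If-Modified-Since": formatdate(since_epoch, usegmt=True)}
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except HTTPError as err:  # urllib raises on non-2xx responses, including 304
        return err.code

def sane_conditional_response(status):
    """A well-behaved server answers 200 (changed) or 304 (not modified) -- never an error."""
    return status in (200, 304)

# e.g. sane_conditional_response(conditional_get_status("https://example.com/", 0))
```

Run it with a date in the distant past (should be 200) and a very recent one (200 or 304); anything else, like the 403 above, means the conditional-request path is broken.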

Watch out for x-robots

Ever look out for the X-Robots-Tag? It’s part of the Robots Exclusion Protocol (REP) and lives in the server header response of a web page. You can noarchive, noindex and nofollow with an X-Robots-Tag, so it’s probably worth checking to see if something unexpected is lurking. You could even try checking for X-Robots-Tag with (and without) your user agent configured as a search engine…

http header in firefox
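Checking the header itself is easy to script. Here’s a minimal sketch that fetches the X-Robots-Tag for a URL and parses its directives (the URL and default user agent are placeholders):

```python
import urllib.request

def get_x_robots(url, user_agent="Mozilla/5.0"):
    """Return the raw X-Robots-Tag header for a URL, or None if absent."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("X-Robots-Tag")

def parse_x_robots(header_value):
    """Split an X-Robots-Tag value into a set of lower-cased directives."""
    return {d.strip().lower() for d in header_value.split(",") if d.strip()}

def blocks_indexing(header_value):
    """'noindex' or 'none' (shorthand for noindex, nofollow) keeps a page out of the index."""
    return bool({"noindex", "none"} & parse_x_robots(header_value))

print(blocks_indexing("noindex, nofollow"))  # True
print(blocks_indexing("noarchive"))          # False
```

Call get_x_robots twice, once with a normal browser user agent and once with a bot’s, and compare: a header that only appears for bots is exactly the kind of surprise you’re hunting for.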

Keep an eye on the status of your top pages

The Top Pages report in Open Site Explorer is excellent for making the most of all the linked-to pages on your site. Look out for any 404 errors – a simple 301 redirect could rescue some valuable link juice:

top pages

Pro tip – export all of your Open Site Explorer data and run a Xenu crawl on the top pages list. That way, you’ll know the data is fresh and the server response codes are bang up to date.
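The same check is easy to script once you have the URL list exported: fetch each page’s status and flag the dead ones as redirect candidates. A sketch with made-up URLs:

```python
import urllib.request
from urllib.error import HTTPError

def status_of(url):
    """HEAD a URL and return its status code (error responses return their codes)."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except HTTPError as err:
        return err.code

def pages_needing_redirects(statuses):
    """URLs answering 404 or 410 are candidates for a 301 to a live equivalent."""
    return [url for url, status in statuses.items() if status in (404, 410)]

# Statuses as gathered from a crawl of your exported top-pages list:
statuses = {
    "https://example.com/": 200,
    "https://example.com/old-guide/": 404,
    "https://example.com/retired/": 410,
}
print(pages_needing_redirects(statuses))
```

Anything on that list with external links pointing at it is link equity waiting to be reclaimed with a 301.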

A few pages into this report, I found a URL that returns a 404 despite having links from six root domains. I hadn’t looked at Open Site Explorer’s report for my own site in a long time. I’m pretty glad I added this section now…

Sudden performance changes

Has your site suddenly started performing slowly? It might be worth keeping an eye on site performance (page load times), just in case. While page load’s impact on the SERPs is still very much undefined, a slick, well-optimised page load experience is certainly better for customers.

time spent downloading a page
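A crude baseline is easy to collect yourself: time a full download of each key page and flag anything over your budget. The URLs and the 1.5-second budget below are arbitrary examples:

```python
import time
import urllib.request

def download_time(url):
    """Time a full download of a URL, in seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=30) as resp:
        resp.read()
    return time.perf_counter() - start

def slow_pages(timings, budget_seconds=1.5):
    """Return the URLs whose download time exceeds the budget."""
    return [url for url, t in timings.items() if t > budget_seconds]

# e.g. timings = {url: download_time(url) for url in key_pages}
print(slow_pages({"/": 0.4, "/category/": 2.1}, budget_seconds=1.5))  # ['/category/']
```

Run it daily and chart the results; a sudden jump is the signal to go talk to whoever manages the servers.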

Watching your site for errors and general housekeeping

Fortunately for all of us, an SEO’s work is never done. If you follow every item of advice in this post, you’ll only arrive at a point where you know your site is OK at a single point in time. General housekeeping activities, like monitoring for errors in Webmaster Tools and keeping an eye on the state of your redirects, are endless tasks. Something that surprises me (especially about Webmaster Tools) is that there doesn’t seem to be an alerts feature to keep us informed about sudden changes to our sites – a big spike in 404s, a significant change to pages with internal links or a drop in external links are all signals I’d like to be warned about.

Over to you. What are your oft-overlooked but seriously handy search engine accessibility checks?

Comments

  1. richardbaxterseo Post author

Right! Raven has a nifty email link change alert which I use to keep an eye on our best links, but aside from that I’m struggling to think of a service that has this.

  2. Carl Hendy

Completely agree with these checks; it’s something that is often overlooked. If you are working on a large site where multiple departments have input into the web dev team, it’s important that these checks are scheduled regularly.

One scenario I keep seeing is poor version control within the web development team: 301 redirects or canonical tags get removed in a page update, which can have a huge impact on rankings, traffic and conversions.

  3. Ian

I recently picked up a site that blocked all URLs with query strings, and given that their main navigation all linked to URLs featuring query strings, it was quite worrying. They couldn’t figure out why their canonical tags pointing to the SEO-friendly URLs weren’t working.

    Great article; Xenu plus IIS SEO Toolkit are great. Both of which I got far more use out of having read blog posts here, so thanks.