On Friday, Rand at SEOmoz asked: “Are 404 Pages Always Bad for SEO?”
Recently, though, Lindsay and I were faced with a tough call on a consulting project. The client has a site that receives a ton of search queries, many of which map to their category and subcategory level pages (which are more landing pages than search query pages, but also serve to address the search keywords). The client also has a number of search pages that have no content (either because they’re for mis-typed, nonsense or mis-spelled searches or because they simply don’t have content for those terms). Some of these pages earn links, some get a moderate amount of traffic and up until recently, they’ve essentially existed as error pages that resolve with a 200 code.
We don’t want the search engines wasting bandwidth crawling and indexing junk pages (especially since the site is monstrous and needs that crawl/index power to flow to the right sections). We also don’t want users to have a bad experience and while the error pages effectively communicate the right message (there’s no results for this query), semantically the pages should really 404.
I started writing a comment on this post but very quickly realised that it was worth a blog post all on its own. I find this subject really interesting in large site architecture SEO, so it’s great to have the inspiration to put fingers to keyboard. Let me summarise my findings from the excerpt above.
1) We’re talking about a large website where crawl bandwidth used by search engines is an issue
2) There are a lot of “junk” pages – blank / expired content pages (I’m going to assume there was content available at these URLs once)
3) Some of those junk pages may once have earned links
Are 404 Pages Always Bad for SEO?
When a web page returns a “404 Not Found” response, the web server is saying the page no longer exists. Do that enough and you’ll quickly see your page drop out of the index of most search engines. Google’s been round the block enough times to see many sites returning 404 errors, but I’ve always felt that a large number of 404 errors is the last thing you want if you’re trying to give the search engines an indicator of the quality of your website.
Let’s remember that 404 errors stick around for a long time in Webmaster Tools – just because they appear to have dropped out of the index doesn’t mean they’re not being requested and continuing to soak up bandwidth.
Can the total number of 404 errors being returned be some kind of quality indicator to search engines?
Think signal to noise ratio. What if, suddenly, 40% of your indexed pages return a 404 error? What does that say about the way you manage your website? Could a large increase in error state pages give a signal to Google that you’re having problems with your website? Could the total number of error state pages influence crawl rate? I’m not saying that you’ll see your rankings or traffic drop in this scenario, but I am saying that in my opinion, we should avoid 404 errors if there’s an alternative.
An alternative to 404 errors on a large architecture site
Let’s return to the scenario outlined in the SEOmoz post and quotes summarised above. We have a large website with a content churn problem. I define content churn as the process of large numbers of content pages expiring and subsequently either returning 404 errors or responding with a 200 response while serving only some or none of the original content, including meta code. I covered this at SMX London 2009 in “Diagnosing Website Architecture Issues”.
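To make that definition a bit more concrete, here is a minimal Python sketch of how you might label pages from a crawl export. Everything in it is an assumption for illustration: the `CrawlRecord` fields, the `classify` and `churn_ratio` names, and the title phrases used to spot an expired page that still answers with a 200. Your own site will need its own markers.

```python
from dataclasses import dataclass

# Assumed title phrases that mark an "expired" page; adjust for your own site.
EXPIRED_MARKERS = ("expired vacancy", "no longer available")


@dataclass
class CrawlRecord:
    """One row from a hypothetical crawl export."""
    url: str
    status: int
    title: str


def classify(record):
    """Label a crawled page as 'hard_404', 'soft_404' or 'ok'."""
    if record.status == 404:
        return "hard_404"
    title = record.title.lower()
    # A 200 response whose title says the content has gone is a "soft" 404.
    if record.status == 200 and any(marker in title for marker in EXPIRED_MARKERS):
        return "soft_404"
    return "ok"


def churn_ratio(records):
    """Share of crawled pages that are hard or soft 404s (0.0 to 1.0)."""
    if not records:
        return 0.0
    churned = sum(1 for record in records if classify(record) != "ok")
    return churned / len(records)
```

Run over a full crawl, a ratio like this gives you a rough feel for how much of the site is churned content rather than live pages.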
Content churn in action
It’s pretty easy to find examples of content churn out in the wild, particularly in the recruitment and retail industries. Try using the search query intitle:"expired vacancy" or intitle:"product no longer available" and you’ll see what I mean. Because of obvious SEO and user experience issues, Rand is totally right to seek out a solution for pages like this one:
Solving the content churn and links problem
My advice is: don’t expire the pages, and keep the original content live on the site. Of course you have the user experience problem, but you also have a database full of items, jobs or general content that is in some way relevant to the user’s search query. Build the ability to display related or popular items on the “expired page” – perhaps in the form of a suggestion that keeps the visitor happy and interested in what your site has to offer:
If you’re linking to currently stocked, related items from your expired content pages then the links you’ve earned can pass value to deep sections of your architecture, via the anchor text you’ve specified.
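As a rough illustration of the idea, the Python sketch below keeps an expired item’s page live with a 200 response and adds links to in-stock items from the same category. The `Item` fields, the function names and the “same category” rule are all my assumptions; a real site would pull related items from its own database and use whatever relevance logic suits it.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Item:
    """A hypothetical catalogue entry (product, job, article...)."""
    url: str
    title: str
    category: str
    in_stock: bool


def related_links(expired, catalogue, limit=3):
    """Pick up to `limit` in-stock items from the expired item's category."""
    matches = [
        item for item in catalogue
        if item.in_stock
        and item.category == expired.category
        and item.url != expired.url
    ]
    return matches[:limit]


def render_expired_page(expired, catalogue):
    """Return (status, html): the original title stays, related links are added."""
    links = "".join(
        f'<li><a href="{escape(item.url)}">{escape(item.title)}</a></li>'
        for item in related_links(expired, catalogue)
    )
    html = (
        f"<h1>{escape(expired.title)}</h1>"
        "<p>This listing has expired, but these are still available:</p>"
        f"<ul>{links}</ul>"
    )
    # The page stays live (200) so earned links keep pointing at real content.
    return 200, html
```

The anchor text here is simply each related item’s title, which is one easy way to control what those links say.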
Crawl bandwidth – If-Modified-Since
Has the page not changed for a long time? Are those links you’ve added not changing very frequently? Google’s crawl requests can include the If-Modified-Since header, which carries a date and time from its last crawl of that URL. If your web page hasn’t been updated since then, your server can be configured to respond with a 304 Not Modified response, which has no body, and most of that crawl bandwidth is saved for another page. If you have a lot of indexed URLs, this could save a fortune in bandwidth costs!
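For the developers, here is a minimal Python sketch of the decision a server has to make. It is not tied to any framework: `conditional_get` is a name I’ve made up, and it assumes you can supply a timezone-aware `last_modified` datetime for the page (for example from a database column).

```python
from datetime import timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get(if_modified_since, last_modified):
    """Return (status, headers) for a GET, honouring If-Modified-Since.

    if_modified_since: raw request header value, or None if absent.
    last_modified: timezone-aware datetime of the page's last real change.
    """
    # HTTP dates are in GMT with one-second resolution.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: send the full page
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        if since is not None and last_modified <= since:
            return 304, headers  # nothing changed, so no body is sent

    return 200, headers
```

A request carrying a date at or after the page’s last change gets a 304; a missing, older or malformed header falls through to a normal 200 with the full page.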
More on If-Modified-Since
If-Modified-Since (Conditional GET) has been ably covered across the SEO, PHP developer and ASP.NET developer blogosphere. Here are some resources to find out more:
HTTP/1.1 definition at W3.org
Save bandwidth costs: Dynamic pages can support If-Modified-Since – Sebastians Pamphlets
Patrick Sexton at SEOish.com explains: What is a If-Modified-Since HTTP Header?
Google asking us to “make sure” the response is implemented in their Webmaster Guidelines
Conditional GET and ETag implementation for ASP.NET by J.A. García Barceló
The HttpWebRequest.IfModifiedSince Property – the .NET Framework class library property that gets or sets the value of the If-Modified-Since HTTP header.
Implementing HTTP 304: Not Modified in PHP
This should keep your development team happy for a while. Good luck!