I’ve been interested in web page and site performance for some time, and I think there are some amazing minds in performance research who deserve a lot of respect. One such expert is Steve Souders. Steve works at Google on web performance and open source initiatives. His book, High Performance Web Sites, explains his best practices for performance, and he’s created a bunch of performance tools including YSlow, Cuzillion, Jdrop, SpriteMe, ControlJS, and Browserscope. For data junkies and web performance analysts alike, his latest project, HTTParchive, is a very interesting new source of data on how the internet is built.
HTTParchive.org is a data set generated by crawling a list of sites seeded from lists provided by Alexa, Quantcast, the Fortune 500 and this global list of top companies. The data can be downloaded and independently analysed, and it’s a historic database, so you can compare month-on-month performance stats. My understanding is that there are around 17,000 websites in the index currently.
There’s definitely some interesting info for the budding infographics enthusiast too:
What are the most popular image formats across the web?
How many pages have errors across the web?
According to this data, almost a third of the pages produced an error. This could be the result of the crawl methodology (some servers may block suspicious activity or heavy bursts of requests), but it does give an indication of how the web might decay, something that Rand has mentioned as an observation from their Linkscape index.
Total pages with redirects (3xx)
More than half of this seed set is redirected! That’s quite amazing.
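If you download the data and want to check figures like these yourself, both the error and the redirect numbers boil down to the same question: what share of pages had at least one response in a given status code range? Here’s a minimal sketch in Python. It assumes, hypothetically, that you’ve already flattened the download into (page URL, status code) pairs; the real schema may well differ, the function name is my own, and the sample rows are made up purely for illustration.

```python
from collections import defaultdict


def share_of_pages_with_status(requests, low, high):
    """Fraction of pages with at least one response status in [low, high].

    `requests` is an iterable of (page_url, status_code) pairs. This layout
    is an assumption for the sketch, not the actual HTTParchive schema.
    """
    statuses_by_page = defaultdict(list)
    for page_url, status in requests:
        statuses_by_page[page_url].append(status)

    if not statuses_by_page:
        return 0.0

    # Count each page once, however many matching responses it had.
    flagged = sum(
        1
        for statuses in statuses_by_page.values()
        if any(low <= s <= high for s in statuses)
    )
    return flagged / len(statuses_by_page)


# Made-up sample rows, for illustration only (not real crawl data).
sample = [
    ("a.example", 200),
    ("a.example", 404),
    ("b.example", 301),
    ("b.example", 200),
    ("c.example", 200),
]

error_share = share_of_pages_with_status(sample, 400, 599)     # 4xx and 5xx
redirect_share = share_of_pages_with_status(sample, 300, 399)  # 3xx
```

Counting per page rather than per request matters here: one page with twenty broken images should count once, not twenty times.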
The most popular scripts in use on the web
Wow, over half of the sites in the seed list are using Google Analytics. If half the internet is using Google Analytics (we don’t have the data for that) then, wow. Google would know a lot!