HTTP Archive: new code, new charts
The HTTP Archive is a permanent record of web performance information, started in October 2010. The world’s top 17,000 web pages are analyzed twice each month to collect information such as the number and size of HTTP requests, whether responses are cacheable, the percent of pages with errors, and the average Page Speed score. The code is open source and all the data is downloadable.
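For concreteness, here is a rough sketch of the kind of per-page summarization involved, assuming the data for a single page is available as a HAR file. The `summarize_har` function, the file name, and the simplified cacheability check are illustrative assumptions, not the HTTP Archive’s actual code.

```python
import json

def summarize_har(path):
    """Derive a few per-page stats (request count, transfer size,
    percent of cacheable responses) from a HAR file. Illustrative only."""
    with open(path) as f:
        har = json.load(f)

    entries = har["log"]["entries"]
    num_requests = len(entries)
    # bodySize can be -1 when unknown, so clamp to 0 before summing.
    total_bytes = sum(max(e["response"].get("bodySize", 0), 0) for e in entries)

    def is_cacheable(entry):
        # Simplification: treat a response as cacheable if it carries an
        # Expires header or a Cache-Control max-age directive.
        headers = {h["name"].lower(): h["value"] for h in entry["response"]["headers"]}
        return "expires" in headers or "max-age" in headers.get("cache-control", "")

    num_cacheable = sum(1 for e in entries if is_cacheable(e))
    return {
        "requests": num_requests,
        "bytes": total_bytes,
        "pct_cacheable": 100.0 * num_cacheable / num_requests if num_requests else 0.0,
    }

print(summarize_har("www.example.com.har"))
```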
The next big step is to increase the number of URLs to 1 million. The biggest task in getting there is improving the database schema and caching. This past week I made some significant code contributions around caching aggregate stats across all of the websites. Even with only 17K URLs, the speed improvement for generating charts is noticeable.
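To give a flavor of the idea (not the actual implementation), here is a minimal sketch of a stats cache: aggregate each crawl once into a small cache table, and have the chart code read only that table. The `statscache` and `pages` tables, their columns, and the use of SQLite are assumptions for illustration; the real HTTP Archive schema is different.

```python
import sqlite3

def refresh_stats_cache(db):
    """Aggregate every crawl ("label") once and store the results in a
    small cache table, instead of scanning all page rows per chart."""
    db.execute("""CREATE TABLE IF NOT EXISTS statscache (
                      label TEXT PRIMARY KEY,
                      num_pages INTEGER,
                      avg_requests REAL,
                      avg_bytes REAL,
                      pct_https REAL)""")
    db.execute("DELETE FROM statscache")
    db.execute("""INSERT INTO statscache
                  SELECT label,
                         COUNT(*),
                         AVG(num_requests),
                         AVG(total_bytes),
                         100.0 * SUM(num_https_requests) / SUM(num_requests)
                  FROM pages
                  GROUP BY label""")
    db.commit()

def chart_data(db, column):
    # Charts now read one tiny cached row per crawl rather than
    # aggregating over every page row on the fly.
    return db.execute(
        "SELECT label, %s FROM statscache ORDER BY label" % column).fetchall()

# Usage (assuming a populated `pages` table):
#   db = sqlite3.connect("httparchive.db")
#   refresh_stats_cache(db)
#   print(chart_data(db, "pct_https"))
```

The payoff is that generating a trending chart becomes a read of one small table per metric, which is why the speedup shows up even at 17K URLs and matters much more at 1 million.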
The new stats cache allows me to aggregate more data than before, so I was able to add several trending charts. (The increases and decreases compare Nov 15, 2010 to Oct 15, 2011.)
- percent of sites using Google Libraries API – up 6%
- percent of sites using Flash – down 2%
- percent of responses with caching headers – up 4%
- percent of requests made using HTTPS – up 1%
- percent of pages with one or more errors – down 2%
- percent of pages with one or more redirects – up 7%
Most of the news is good from a performance perspective, except for the increase in redirects. Here’s the caching headers chart as an example:
I dropped the following charts:
- popular JavaScript libraries – I created this chart using handcrafted regular expressions that attempted to find requests for popular frameworks such as jQuery and YUI. Those regexes are not always accurate and are hard to maintain (see the sketch after this list). I recommend people use the JavaScript Usage Statistics from BuiltWith for this information.
- popular web servers – Again, BuiltWith’s Web Server Usage Statistics is a better reference for this information.
- sites with the most (JavaScript | CSS | Images | Flash) – These charts were interesting, but not that useful.
- popular scripts – This was a list of the five most referenced scripts, identified by their exact URL. The problem is that the same script can appear under URLs that vary by hostname, querystring parameters, etc.
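To illustrate why the regex approach is brittle, here is a hedged sketch of regex-based library detection. The patterns and the `detect_libraries` function are made-up examples, not the regexes the HTTP Archive used; the same URL variability that undermined the popular scripts chart is what makes these patterns hard to keep accurate.

```python
import re

# Made-up examples of the kind of handcrafted patterns involved. Each new CDN
# path, version naming scheme, or querystring variant (jquery.min.js?v=1.6.4,
# ajax.googleapis.com vs. a self-hosted copy, etc.) tends to need another tweak.
LIBRARY_PATTERNS = {
    "jQuery": re.compile(r"/jquery[.-]?\d*[\d.]*(\.min)?\.js(\?|$)", re.I),
    "YUI":    re.compile(r"/(yui|yahooapis\.com/.*/build)/", re.I),
}

def detect_libraries(request_urls):
    """Return the set of libraries whose pattern matches any request URL."""
    found = set()
    for url in request_urls:
        for name, pattern in LIBRARY_PATTERNS.items():
            if pattern.search(url):
                found.add(name)
    return found

print(detect_libraries([
    "http://ajax.googleapis.com/ajax/libs/jquery/1.6.4/jquery.min.js",
    "http://www.example.com/js/jquery-1.4.2.js?cb=20111015",
]))  # {'jQuery'}
```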
The new stats cache is a great step forward. I have a few more big coding sessions to go, but I hope to get enough done that we can start increasing the number of URLs in the next run or two. I’ll keep you posted.