HTTP Archive: 2011 recap

February 1, 2012 5:23 pm | 11 Comments

I started the HTTP Archive back in October 2010. It’s hard to believe it’s been that long. The project is going well.

I’m pleased with how the WPO community has contributed to make the HTTP Archive possible. The project wouldn’t have been possible without Pat Meenan and his ever impressive and growing WebPagetest framework. A number of people have contributed to the open source code including Jonathan Klein, Yusuke Tsutsumi, Carson McDonald, James Byers, Ido Green, Mike Pfirrmann, Guy Leech, and Stephen Hay.

This is our first complete calendar year archiving website statistics. I want to start a tradition of doing an annual recap of insights from the HTTP Archive.

2011 vs 2012

The most noticeable trend during 2011 was the growth in the size of websites and the resources they download. Table 1 shows the transfer size by content type for the average website. For example, 379 kB is the total size of the images downloaded for an average website in January 2011. (Since the sample of websites changed during the year, these stats are based on the intersection trends for the 11,910 websites that were in every batch run.)

Table 1. Transfer Size by Content Type
Content Type   Jan 2011   Jan 2012   Change
HTML           31 kB      34 kB      +10%
JavaScript     110 kB     158 kB     +44%
CSS            26 kB      31 kB      +19%
Images         379 kB     459 kB     +21%
Flash          71 kB      64 kB      -10%
Total          638 kB     773 kB     +21%
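
For anyone who wants to reproduce Table 1-style numbers for their own pages, here is a rough sketch of how transfer size could be summed by content type from a HAR file, such as the ones WebPagetest exports. The bucketing rules below are simplified (the HTTP Archive's real content-type handling is more involved) and the file name is a placeholder.

```typescript
// Sketch: sum transfer size per content type from a HAR file (HAR 1.2 layout).
// The bucketing below is deliberately simplified.
import { readFileSync } from "fs";

type HarEntry = {
  response: {
    bodySize: number;                  // bytes on the wire; -1 when unknown
    content: { mimeType: string };
  };
};

function bucket(mimeType: string): string {
  if (mimeType.includes("javascript")) return "JavaScript";
  if (mimeType.includes("css")) return "CSS";
  if (mimeType.startsWith("image/")) return "Images";
  if (mimeType.includes("shockwave") || mimeType.includes("flash")) return "Flash";
  if (mimeType.includes("html")) return "HTML";
  return "Other";
}

const harPath = process.argv[2] ?? "page.har";   // placeholder file name
const har = JSON.parse(readFileSync(harPath, "utf8"));

const totals: Record<string, number> = {};
for (const entry of har.log.entries as HarEntry[]) {
  const key = bucket(entry.response.content.mimeType);
  totals[key] = (totals[key] ?? 0) + Math.max(entry.response.bodySize, 0);
}

for (const [type, bytes] of Object.entries(totals)) {
  console.log(`${type}: ${Math.round(bytes / 1024)} kB`);
}
```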

One takeaway from this data is that images make up the majority of the bytes downloaded for websites (59%). Also, images are the second fastest-growing content type for desktop and the fastest-growing content type for mobile. These two observations highlight the need for more performance optimizations for images. Many websites would benefit from losslessly compressing their images with existing tools. WebP is another candidate for reducing image size.
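
To give a flavor of what that looks like in practice, here is a minimal sketch using the sharp image library for Node (one tool among many, and not something the HTTP Archive itself uses) to re-encode a PNG losslessly and emit a lossless WebP copy. The file names are placeholders.

```typescript
// Sketch: losslessly recompress a PNG and produce a lossless WebP alternative.
// sharp is used here purely as an example tool; file names are placeholders.
import sharp from "sharp";

async function recompress(input: string): Promise<void> {
  // Re-encode the PNG with maximum zlib compression (still lossless).
  await sharp(input)
    .png({ compressionLevel: 9 })
    .toFile(input.replace(/\.png$/, ".opt.png"));

  // Emit a lossless WebP copy for browsers that support it.
  await sharp(input)
    .webp({ lossless: true })
    .toFile(input.replace(/\.png$/, ".webp"));
}

recompress("hero.png").catch(console.error);
```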

A second takeaway is the tremendous growth in JavaScript size – up 44% over the course of the year. The amount of JavaScript grew more than twice as much as the next closest type of content (images). Parsing and executing JavaScript blocks the UI thread and makes websites slower. More JavaScript makes the problem worse. Downloading scripts also causes havoc with website performance, so the fact that the number of scripts on the average page grew from 11 to 13 is also a concern.
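
One common mitigation is to load non-critical scripts asynchronously so they don't block parsing and rendering. A minimal sketch of the pattern, with a placeholder script URL:

```typescript
// Sketch: inject a script asynchronously so it doesn't block page rendering.
// The URL is a placeholder; production code would also handle load errors.
function loadScriptAsync(src: string): void {
  const script = document.createElement("script");
  script.src = src;
  script.async = true;              // don't block the HTML parser
  document.head.appendChild(script);
}

loadScriptAsync("https://example.com/static/widget.js");
```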

On a positive note, the amount of Flash being downloaded dropped 10%. Sadly, the percentage of sites using Flash only dropped from 44% to 43%, but at least those SWFs are downloading faster.

Adoption of Best Practices

I personally love the HTTP Archive for tracking the adoption of web performance best practices year-over-year.

It’s compelling to see how best practices are adopted by the top websites as compared to more mainstream websites. Table 2 shows various stats for the top 100 and top 1000 websites, as well as all 53,614 websites in the last batch run.

Table 2. Best Practices for Top 100, Top 1000, All
Metric            Top 100   Top 1000   All
Total size        509 kB    805 kB     962 kB
Total requests    57        90         86
Caching headers   70%       58%        42%
Use Flash         34%       49%        48%
Custom fonts      6%        9%         8%
Redirects         57%       69%        65%

The overall trend shows that adoption of performance best practices drops off dramatically outside of the Top 100 websites. The most significant differences are:

  • Total size goes from 509 kB to 805 kB to 962 kB.
  • The total number of HTTP requests follows a similar pattern, growing from 57 to 90 and then dipping slightly to 86.
  • The use of caching headers is high for the Top 100 at 70%, but drops to 58% and then all the way down to 42% (an example of setting such headers follows this list).
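
For sites outside the Top 100, far-future caching headers are one of the cheapest wins. Here is a minimal sketch of what that could look like in a Node server; the /static/ prefix and the one-year lifetime are illustrative choices, and the actual file serving is elided.

```typescript
// Sketch: send far-future caching headers for static, versioned assets.
// Safe only when asset file names change whenever their contents change.
import { createServer } from "http";

const ONE_YEAR_SECONDS = 365 * 24 * 60 * 60;

createServer((req, res) => {
  if (req.url && req.url.startsWith("/static/")) {
    res.setHeader("Cache-Control", `public, max-age=${ONE_YEAR_SECONDS}`);
    res.setHeader(
      "Expires",
      new Date(Date.now() + ONE_YEAR_SECONDS * 1000).toUTCString()
    );
  }
  res.end("ok");                    // real code would serve the asset here
}).listen(8080);
```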

The Web has a long tail. It’s not enough for the top sites to have high performance. WPO best practices need to find their way to the next tier of websites and on to the brick-and-mortar, mom-and-pop, and niche sites that we all visit. More awareness, more tools, and more automation are the answer. I can’t wait to read the January 2013 update to this blog post and see how we did. Here’s to a faster and stronger Web in 2012!

11 Responses to HTTP Archive: 2011 recap

  1. I love the work you are doing with this project. Thanks for letting us be a part of it.

  2. Thank you for this excellent aggregation of data.

    It should help developers set priorities for improving their websites!

    Cheers,
    Joakim

  3. You might want to be a bit careful over the Google CDNs bit…

    A while back @spjwebster (works for LoveFilm in the UK, I think) did some digging into the usage of libraries hosted on Google’s CDN and came to the conclusion that it’s not really a rosy picture at all – http://statichtml.com/2011/google-ajax-libraries-caching.html

    Main problem is that many sites rely on specific versions of a library (say jQuery) and due to this it’s unlikely the version they need is already in the cache.

    I’ve also seen some examples where the cost of going to the Google CDN (i.e. DNS lookup, TCP connection setup) makes it slower than serving jQuery from the origin site.

    @andydavies

  4. @Andy: Great to hear from you! Steve Webster’s article is great but doesn’t provide evidence that using Google Libraries API is faster or slower – he just comments on the lower potential cache hit rate for a specific file version. (Which makes sense.)

    My point still stands: using the Google Libraries API provides the benefits of a geo-distributed CDN and of caching across websites.

    Regarding the geo-distributed CDN: if your website already has a good CDN this might not make a difference, but if you’re serving from a single geo location you’re probably better off using the Google CDN.

    Regarding caching across sites: It’s definitely true that the probability is less for a single version of a file, but the probability is still higher than if you host it only on your own server. For example, looking at yesterday’s batch run, “http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js” is requested on 114 of the top 10K websites. That significantly increases the probability of it already being in the cache compared to requesting it from your own server, especially if you’re not in the top 10K.

    I agree with you that this might not make a huge difference, but I still think it’s a good idea especially for sites that don’t have a CDN and are outside the top 10K.
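
    For anyone who wants to try it, here is a minimal sketch of loading that jQuery URL from the Google Libraries API with a fallback to a self-hosted copy; the local path is a placeholder.

    ```typescript
    // Sketch: load jQuery from the Google Libraries API, falling back to a
    // self-hosted copy if the CDN request fails. The local path is a placeholder.
    function loadJQuery(onReady: () => void): void {
      const cdn = document.createElement("script");
      cdn.src = "http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js";
      cdn.onload = onReady;
      cdn.onerror = () => {
        const local = document.createElement("script");
        local.src = "/js/jquery-1.7.1.min.js";   // placeholder fallback path
        local.onload = onReady;
        document.head.appendChild(local);
      };
      document.head.appendChild(cdn);
    }

    loadJQuery(() => console.log("jQuery ready"));
    ```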

  5. Steve, did you do any analysis regarding the source of JS size increase?

    Could it be because jQuery itself grew in size, or because in-house code bases grew bigger?

    Couple quick links on the topic:
    http://mathiasbynens.be/demo/jquery-size
    http://docs.jquery.com/Downloading_jQuery#Current_Release

  6. And I forgot to thank you for the review – I was lazily waiting for the numbers ;)

  7. @Sergey: Thanks for the comment. I love analyzing the HTTP Archive data, but I also love the idea of putting the data out there for *other* people to analyze (such as Steve Webster’s article highlighted in Andy’s comment). I encourage you to download the data and slice & dice it to answer your question.

  8. Agree that a geo-distributed CDN will provide benefits to visitors as they get further from the origin.

    I want you to be right about the shared caching benefits – I get the theory – but I’m not sure that just looking at how sites are built gives us enough information.

    The piece we’re missing is ‘how long do the libraries live in the browser cache’ – the Yahoo work showed how often even a high-profile site’s assets drop out of the cache.

    Now in theory files that are shared between sites should stay in the cache longer but doesn’t that heavily depend on how often and in what order people visit sites?

    I think the ideas you’ve had on how caches should work could be really beneficial for people using the library CDNs.

    When it arrives, the Resource Timing API may also offer site owners insight into what benefits their users are getting (or not) from hosting on a library CDN. (This partly depends on whether the API is opt-in or opt-out, too.)

    I guess the advice for site owners might be “use the location/version of jQuery that’s most commonly used by the most popular sites” but of course that also comes with tradeoffs!

    Web performance is such a great world to be in…

    (BTW planning on borrowing your graph showing how transfer size has changed over the last year for a presentation on performance in late Feb/early Mar – will forward the link when it’s done and will of course credit you)

  9. Steve, your work on the HTTP Archive is a milestone for studying how web developers approach performance and optimization.

    I appreciate all your advice on making the web faster.

    It’s nonetheless remarkable that the total transfer size of a web page has grown 1.22x in a year (http://httparchive.org/trends.php?s=intersection&minlabel=Jan+31+2011&maxlabel=Jan+15+2012). Is there any way to see how many websites use minified CSS and JS files?

  10. @Fabio: If you mean “gzipped” or “compressed”, the answer is yes – those HTTP headers are recorded. If you mean “minified” in the strict sense that whitespace has been removed, the answer is no – right now the HTTP Archive does *not* analyze or store response bodies. There’s an open ticket to extend WebPagetest to allow analysis of response bodies, which would address this.

  11. I mean “minified” as “CSS and JS without any whitespace or newlines”.
    I’ve seen the open ticket – it would be amazing to be able to see analyses of response bodies in the HTTP Archive.