Comparing RUM & Synthetic Page Load Times
Yesterday I read Etsy’s October 2012 Site Performance Report. Etsy is one of only a handful of companies that publish their performance stats with explanations and future plans. It’s really valuable (and brave!), and gives other developers an opportunity to learn from an industry leader. In this article Etsy mentions that the page load time stats are gathered from a private instance of WebPagetest. They explain their use of synthetically-generated measurements instead of RUM (Real User Monitoring) data:
You might be surprised that we are using synthetic tests for this front-end report instead of Real User Monitoring (RUM) data.  RUM is a big part of performance monitoring at Etsy, but when we are looking at trends in front-end performance over time, synthetic testing allows us to eliminate much of the network variability that is inherent in real user data. This helps us tie performance regressions to specific code changes, and get a more stable view of performance overall.
Etsy’s choice of synthetic data for tracking performance as part of their automated build process totally makes sense. I’ve talked to many companies that do the same thing. Teams dealing with builds and code regressions should definitely do this. BUT… it’s important to include RUM data when sharing performance measurements beyond the internal devops team.
Why should RUM data always be used when talking beyond the core team?
The issue with only showing synthetic data is that it typically makes a website appear much faster than it actually is. This has been true since I first started tracking real user metrics back in 2004. My rule-of-thumb is that your real users are experiencing page load times that are twice as long as their corresponding synthetic measurements.
RUM data, by definition, is from real users. It is the ground truth for what users are experiencing. Synthetic data, even when generated using real browsers over a real network, can never match the diversity of performance variables that exist in the real world: browsers, mobile devices, geo locations, network conditions, user accounts, page view flow, etc. The reason we use synthetic data is that it allows us to create a consistent testing environment by eliminating the variables. The variables we choose for synthetic testing match a segment of users (hopefully), but they can’t capture the diversity of users that actually visit our websites every day. That’s what RUM is for.
The core team is likely aware of the biases and assumptions that come with synthetic data. They know that it was generated using only laptops and doesn’t include any mobile devices; that it used a simulated LAN connection and not a slower DSL connection; that IE 9 was used and IE 6&7 aren’t included. Heck, they probably specified these test conditions. The problem is that the people outside the team who see the (rosy) synthetic metrics aren’t aware of these caveats. Even if you note these caveats on your slides, they still won’t remember them! What they will remember is that you said the page loaded in 4 seconds, when in reality most users are getting a time closer to 8 seconds.
How different are RUM measurements as compared to synthetic?
As I said a minute ago, my rule-of-thumb is that RUM page load times are typically 2x what you see from synthetic measurements. After my comment on the Etsy blog post about adding RUM data and a tweet from @jkowall asking for data comparing RUM to synthetic less than 24 hours later, I decided to gather some real data from my website.
Similar to Etsy, I used WebPagetest to generate synthetic measurements. I chose a single URL: https://stevesouders.com/blog/2012/10/11/cache-is-king/. I measured it using a simulated DSL connection in Chrome 23, Firefox 16, and IE 9. I measured both First View (empty cache) and Repeat View (primed cache). I did three page loads and chose the median. My RUM data came from Google Analytics’ Site Speed feature over the last month. As shown in this chart of the page load time results, the RUM page load times are 2-3x slower than the synthetic measurements.
There’s some devil in the details. The synthetic data could have been more representative: I could have done more than three page loads, tried different network conditions, and even chosen different geo locations. The biggest challenge was mixing the First View and Repeat View page load times to compare to RUM. The RUM data contains both empty cache and primed cache page views, but the split is unknown. A study Tenni Theurer and I did in 2007 showed that ~80% of page views are done with a primed cache. To be more conservative I averaged the First View and Repeat View measurements and called that “Synth 50/50” in the chart. The following table contains the raw data:
Metric | Chrome 23 | Firefox 16 | IE 9 |
---|---|---|---|
Synthetic First View (secs) | 4.64 | 4.18 | 4.56 |
Synthetic Repeat View (secs) | 2.08 | 2.42 | 1.86 |
Synthetic 50/50 (secs) | 3.36 | 3.30 | 3.21 |
RUM (secs) | 9.94 | 8.59 | 6.67 |
RUM data points | 94 | 603 | 89 |
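The “Synthetic 50/50” row and the size of the gap can be recomputed directly from the table. The following is a minimal Python sketch, not part of the original analysis: the dictionary layout and the function names `blend` and `ratio` are illustrative, and the 80% primed-cache weighting is only an assumption borrowed from the 2007 study mentioned above, since the real split in the RUM data is unknown.

```python
# Page load times in seconds, copied from the table above.
first_view = {"Chrome 23": 4.64, "Firefox 16": 4.18, "IE 9": 4.56}
repeat_view = {"Chrome 23": 2.08, "Firefox 16": 2.42, "IE 9": 1.86}
rum = {"Chrome 23": 9.94, "Firefox 16": 8.59, "IE 9": 6.67}


def blend(browser, primed_share=0.5):
    """Mix First View and Repeat View times. primed_share is the assumed
    fraction of page views served with a primed cache."""
    return ((1 - primed_share) * first_view[browser]
            + primed_share * repeat_view[browser])


def ratio(browser, primed_share=0.5):
    """RUM page load time as a multiple of the blended synthetic time."""
    return rum[browser] / blend(browser, primed_share)


for browser in rum:
    print(f"{browser}: 50/50 synthetic {blend(browser):.2f}s, "
          f"RUM is {ratio(browser):.1f}x that; "
          f"assuming 80% primed cache, {ratio(browser, 0.8):.1f}x")
```

With the 50/50 blend the ratios come out at roughly 3.0x, 2.6x, and 2.1x for Chrome 23, Firefox 16, and IE 9. Weighting toward the primed cache lowers the synthetic number, so the gap to RUM only gets larger.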
In my experience these results showing RUM page load times being much slower than synthetic measurements are typical. I’d love to hear from other website owners about how their RUM and synthetic measurements compare. In the meantime, be cautious about only showing your synthetic page load times – the real user experience is likely quite a bit slower.
Joseph Scott | 14-Nov-12 at 7:47 pm | Permalink |
I’d be curious to know whether, instead of focusing on the raw numbers from synthetic benchmarks, the percentage change over time lines up more closely with changes in real user monitoring. In other words, if my synthetic benchmarks show a 15% reduction in page load times over the last 3 months, does that 15% hold true for real users as well?
Steve Souders | 14-Nov-12 at 8:06 pm | Permalink |
Joseph: In my experience the trends in RUM and synthetic move in the same direction, but the percentage change is not the same, eg, synthetic might drop 8% and RUM would drop 15%. In a few instances I’ve seen the trends *not* move in step, eg, synthetic showed a speedup that doesn’t show up in RUM and vice versa. It would be great if a few website owners would share what they see.
Ronnie Kwok | 14-Nov-12 at 11:42 pm | Permalink |
Would like to share some figures. I’ve got RUM vs. synthetic data from two locations (New York and London). The data was collected over a week on a particular URL. The behaviour is very different.
New York : 2.74s (RUM) vs 2.08s (Synthetic)
London : 4.63s (RUM) vs 2.24s (Synthetic)
Things to consider include the number of data points used for analysis, especially for locations on another continent (or behind the Great Firewall of China!). The performance distribution is another important figure to monitor. For the same case above,
over 67% of total synthetic traffic from New York has a download time of 2 seconds, but only 55% of traffic from London has the same performance.
Besides, figures from different RUM providers differ too. There’s a difference of 1-2x between GA and another provider I am using.
Chris Adams | 15-Nov-12 at 5:06 am | Permalink |
One potential pitfall: did you use Google Analytics’ averages or pull the data directly from the performance histogram? The use of averages is a huge error by the GA team because RUM data is notoriously prone to outliers – awhile back I detailed an example of a single user in a small town whose 3 browser-reported hour+ page loads raised the average of hundreds of thousands of loads by over 2 seconds: http://chris.improbable.org/2012/05/18/google-analytics-deceptive-site-speed-report/
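To illustrate how sensitive an average is to a handful of outliers, here is a small Python sketch. The numbers are made up for the example and are not the data from the linked post: 1,000 ordinary page loads plus three bogus hour-long ones.

```python
import statistics

# Hypothetical sample: 1,000 page loads of 3 seconds each, plus 3 bogus
# browser-reported loads of one hour (3,600 seconds) each.
loads = [3.0] * 1000 + [3600.0] * 3

mean = statistics.mean(loads)
median = statistics.median(loads)
print(f"mean = {mean:.2f}s, median = {median:.2f}s")
```

In this made-up sample the three outliers push the mean to about 13.76 seconds while the median stays at 3 seconds, which is why a median or percentile is usually the safer summary for RUM data.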
Patrick Meenan | 15-Nov-12 at 6:46 am | Permalink |
Steve, for your GA data did you do any filtering (looking at the same page, from the US)? It would also be interesting to see the histograms to see if the synthetic was out of line with the data entirely.
Granted, part of the reason for having RUM data in the first place is to be able to see the whole picture. It usually requires a fair bit of drill-down to extract useful data though.
Looking at the RUM data for WebPagetest, the global average load time for Chrome 23 is 3.95 Seconds. The average in the US for Chrome 23 is 2.38 seconds.
Looking at the histograms for Chrome 23 in the US, 56% loaded under 1.5 seconds.
I guess the short version is that there is no 1 number that represents your site’s performance. Synthetic is representative of exactly the one configuration it tests but if you want to really improve things for users you should look at all of the data, chase down the long tails, etc. John Rauser’s Velocity presentations at work :-)
Drit Suljoti | 15-Nov-12 at 12:58 pm | Permalink |
I agree with previous comments on this post, there are various reasons for the difference between RUM and Synthetic – and it all boils down to:
– what is downloaded. It is hard to replicate in synthetic the caching behavior of what happens in real life. And in real life not all users behave the same.
– the network path between client and server(s) (distance, packet loss, bandwidth). This is hard to emulate exactly on a global scale, and you have to be smart about dealing with it. Can you make your site faster for someone connected on a WiFi router that has 20% packet loss? Should you even try to solve that problem?
At the end of the day the two sources of numbers are different and have very different purposes. Neither of them is wrong or right – they are just different tools. RUM helps you understand the experience of the end user on the site and is impacted by last mile noise and user behavior on the web (ads) and on the site (cache). Synthetic helps with understanding the performance of the website from the outside, in more of a controlled lab scenario, excluding or limiting last mile noise.
Each of them has its pros and cons – people just need to be smart about which tool to use when.
Paul Roy | 15-Nov-12 at 8:59 pm | Permalink |
Steve, our experience at MSN is that RUM numbers are 2-3x higher than Synthetic numbers, depending on the network profile used for Synthetic. But, more important to me than the absolute numbers, is the evidence that I use to drive performance improvements. For this we find Synthetic measurements to be excellent – esp. because you get the detailed waterfall that allows for diagnosis. (Of course, W3C Resource Timing in theory will allow this too for RUM, once it’s available.) The other critical thing with Synthetic is having a very large number of samples, so that you get a good degree of statistical representation. All this said, RUM still is the ticket for exposing various perf problems that could be a function of geography, browser versions that you’re not testing with Synthetic, etc., in addition to being a more accurate representation of the true end user experience as you point out.
Nico | 16-Nov-12 at 1:48 am | Permalink |
Excellent article. We rely extensively on RUM for frontend performance monitoring on our site. However, it has been really difficult to produce stable statistical results on pages with fewer samples. We have increased sampling but still it is difficult to act on the results unless they hold for a couple of days.
Steve Souders | 16-Nov-12 at 9:19 am | Permalink |
Chris: I used the average. I agree: median & 90th percentile would be better. One challenge is GA doesn’t have a real histogram – they have preset buckets and you have to do the math yourself. I see that I can get something close to a histogram under the “Performance” tab, but how do I get a histogram for IE 9 for a single page? And how do I download the data?
Pat: How do I filter? I agree – median would be better, but it’s hard to weasel out of GA.
I’m not a GA expert by any means, but I clearly don’t know how to exercise some key features. I’d love to see a “how to use GA for web performance” article. (I’ll admit – I haven’t checked the docs.)
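The “do the math yourself” step for preset buckets can be sketched roughly as follows. This is a minimal Python example under stated assumptions: the bucket boundaries and counts are hypothetical (not an actual GA export), the function name `percentile_from_buckets` is made up, and samples are assumed to be spread evenly within each bucket.

```python
def percentile_from_buckets(buckets, pct):
    """Estimate a percentile from (lower, upper, count) buckets by linear
    interpolation inside the bucket that contains the target rank."""
    total = sum(count for _, _, count in buckets)
    target = total * pct / 100.0
    seen = 0
    for lower, upper, count in buckets:
        if count and seen + count >= target:
            # Assume samples are spread evenly across the bucket.
            fraction = (target - seen) / count
            return lower + fraction * (upper - lower)
        seen += count
    raise ValueError("no samples in buckets")


# Hypothetical bucket boundaries (seconds) and sample counts. An open-ended
# last bucket needs an assumed upper bound; 60 seconds is used here.
buckets = [(0, 1, 10), (1, 3, 30), (3, 7, 40), (7, 13, 15), (13, 60, 5)]
print("median:", percentile_from_buckets(buckets, 50))
print("90th percentile:", percentile_from_buckets(buckets, 90))
```

The estimate is only as good as the bucket widths: wide buckets in the tail make the 90th percentile much rougher than the median.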
Chris Adams | 17-Nov-12 at 2:22 pm | Permalink |
Steve: agreed on the data challenge. The performance tab is the most reliable but one of my holiday projects is seeing whether the API allows you to receive the data in a more useful form.
Seth Walker | 19-Nov-12 at 11:10 am | Permalink |
At Etsy we have RUM and WebPagetest graphs broken out by page and annotated with deploy lines on our wall monitors and deploy dashboard. As Patrick says, the value of RUM is in the details, but in seeking to avoid information overload we focus on a few metrics for our dashboards and performance reports. We try to pick metrics that will be representative enough of performance as a whole that we can detect regressions (or validate improvements!) soon after a deploy (currently ~30 deploys/day). Publishing our RUM numbers to our community in a way that provides enough detail and context to be meaningful without overwhelming is our challenge for the next site performance report!
In the meantime we’re pulling together some data to compare our RUM and WebPagetest results, look for that post soon!
Ian Withrow | 03-Dec-12 at 3:58 pm | Permalink |
Pardon if this is a little bit of thread necromancy, but after reading this excellent dialogue an important question came to mind. What are the major causes of difference between RUM and Synthetic?
Yes I know everyone has an anecdotal/intuitive sense of the types of things that cause the delta. However, a rigorous study of this question would make the two types of measurement approaches easier to use in conjunction.
Is most of it browser related?
What portion is bad LAN environment, e.g. high packet loss at Starbucks?
If this problem space were more rigorously understood, I suspect creative ideas would present themselves.
Charlie Clark | 04-Jan-13 at 7:03 am | Permalink |
I think there is a problem in equating RUM with the info that Google provides, partly because you need to define RUM. You might want to look at something like Cedexis, and I think Neustar has something similar, which allows the collation purely of performance metrics and provides analysis based on network type and location and browser.
My experience of the websites I run is that WebPageTest is significantly slower with higher variance than real measurements or comparisons with other services. This is to be expected given the nature of the service. The great advantage of automated testing is a real test of the application and elements (such as server or CDN) under your control. With a large enough sample size, you will get representative data on the speed of your site and certainly of the effects of any changes you make.
Jonathan Drake | 11-Feb-13 at 10:56 pm | Permalink |
First comment, and I just wanted to let people know that the people over at Etsy have done the test and posted the startling results: http://codeascraft.etsy.com/2012/11/29/measuring-front-end-performance-with-real-users/