Frontend SPOF in Beijing
This past December I contributed an article called Frontend SPOF in Beijing to PerfPlanet’s Performance Calendar. I hope that everyone who reads my blog also reads the Performance Calendar – it’s an amazing collection of web performance articles and gurus. But in case you don’t, I’m cross-posting it here. I saw a great presentation from Pat Meenan about frontend SPOF and want to raise awareness around this issue. This post contains some good insights.
Make sure to read PerfPlanet – it’s a great aggregator of WPO blog posts.
Now – flash back to December 2011…
I’m at Velocity China in Beijing as I write this article for the Performance Calendar. Since this is my second time in Beijing I was better prepared for the challenges of being behind the Great Firewall. I knew I couldn’t access popular US websites like Google, Facebook, and Twitter, but as I did my typical surfing I was surprised at how many other websites seemed to be blocked.
Business Insider
It didn’t take me long to realize the problem was frontend SPOF – when a frontend resource (script, stylesheet, or font file) causes a page to be unusable. Some pages were completely blank, such as Business Insider:
Firebug’s Net Panel shows that anywhere.js is taking a long time to download because it’s coming from platform.twitter.com – which is blocked by the firewall. Knowing that scripts block rendering of all subsequent DOM elements, we form the hypothesis that anywhere.js is being loaded in blocking mode in the HEAD. Looking at the HTML source we see that’s exactly what is happening:
<head>
...
<!-- Twitter Anywhere -->
<script src="https://platform.twitter.com/anywhere.js?id=ZV0...&v=1" type="text/javascript"></script>
<!-- / Twitter Anywhere -->
...
</head>
...
<body>
If anywhere.js had been loaded asynchronously this wouldn’t happen. Instead, since anywhere.js is loaded the old way with <SCRIPT SRC=..., it blocks all the DOM elements that follow, which in this case is the entire BODY of the page. If we wait long enough the request for anywhere.js times out and the page begins to render. How long does it take for the request to time out? Looking at the “after” screenshot of Business Insider we see it takes 1 minute and 15 seconds. That’s 1 minute and 15 seconds that the user is left staring at a blank white screen waiting for the Twitter script!
CNET
CNET has a slightly different experience; the navigation header is displayed but the rest of the page is blocked from rendering:
Looking in Firebug we see that wrapper.js from cdn.eyewonder.com is “pending” – this must be another domain that’s blocked by the firewall. Based on where the rendering stops, our guess is that the wrapper.js SCRIPT tag is immediately after the navigation header and is loaded in blocking mode, thus preventing the rest of the page from rendering. The HTML confirms that this is indeed what’s happening:
<header>
...
</header>
<script src="http://cdn.eyewonder.com/100125/771933/1592365/wrapper.js"></script>
<div id="rb_wrap">
<div id="rb_content">
<div id="contentMain">
O’Reilly Radar
Every day I visit O’Reilly Radar to read Nat Torkington’s Four Short Links. Normally Nat’s is one of many stories on the Radar front page, but going there from Beijing shows a page with only one story:
At the bottom of this first story there’s supposed to be a Tweet button. This button is added by the widgets.js script fetched from platform.twitter.com which is blocked by the Great Firewall. This wouldn’t be an issue if widgets.js was fetched asynchronously, but sadly a peek at the HTML shows that’s not the case:
<a href="...">Comment</a> |
<span class="social-counters">
<span class="retweet">
<a href="http://twitter.com/share" class="twitter-share-button" data-count="horizontal" data-url="http://radar.oreilly.com/2011/12/four-short-links-6-december-20-1.html" data-text="Four short links: 6 December 2011" data-via="radar" data-related="oreillymedia:oreilly.com">Tweet</a>
<script src="http://platform.twitter.com/widgets.js" type="text/javascript"></script>
</span>
The cause of frontend SPOF
One possible takeaway from these examples might be that frontend SPOF is specific to Twitter and eyewonder and a few other 3rd party widgets. Sadly, frontend SPOF can be caused by any 3rd party widget, and even by the main website’s own scripts, stylesheets, or font files.
Another possible takeaway from these examples might be to avoid 3rd party widgets that are blocked by the Great Firewall. But the Great Firewall isn’t the only cause of frontend SPOF – it just makes it easier to reproduce. Any script, stylesheet, or font file that takes a long time to return has the potential to cause frontend SPOF. This typically happens when there’s an outage or some other type of failure, such as an overloaded server where the HTTP request languishes in the server’s queue for so long the browser times out.
The true cause of frontend SPOF is loading a script, stylesheet, or font file in a blocking manner. The table in my frontend SPOF blog post shows when this happens. It’s really the website owner who controls whether or not their site is vulnerable to frontend SPOF. So what’s a website owner to do?
Avoiding frontend SPOF
The best way to avoid frontend SPOF is to load scripts asynchronously. Many popular 3rd party widgets do this by default, such as Google Analytics, Facebook, and Meebo. Twitter also has an async snippet for the Tweet button that O’Reilly Radar should use. If the widgets you use don’t offer an async version you can try Stoyan’s Social button BFFs async pattern.
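As a rough illustration of what “async” means here, the sketch below injects the script element from JavaScript instead of writing a blocking SCRIPT SRC tag into the page. The helper name loadScriptAsync and its doc parameter are my own for illustration; they aren’t part of any widget’s official snippet, so prefer the vendor’s async snippet when one exists.

```javascript
// Minimal sketch of async script loading via DOM injection.
// loadScriptAsync is a hypothetical helper name, not a library API.
function loadScriptAsync(doc, src) {
  var script = doc.createElement("script");
  script.src = src;
  script.async = true; // fetch without blocking parsing or rendering
  // Insert before the first script on the page. This assumes at least one
  // script element already exists (normally the one running this code).
  var firstScript = doc.getElementsByTagName("script")[0];
  firstScript.parentNode.insertBefore(script, firstScript);
  return script;
}

// Usage in a browser: if the host is unreachable, the rest of the page
// still renders and only the widget is missing.
if (typeof document !== "undefined") {
  loadScriptAsync(document, "https://platform.twitter.com/widgets.js");
}
```

Passing the document in as a parameter is only a convenience so the helper can be exercised outside a browser; in a page you would simply pass the global document.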
Another solution is to wrap your widgets in an iframe. This isn’t always possible, but in two of the examples above the widget is eventually served in an iframe. Putting them in an iframe from the start would have avoided the frontend SPOF problems.
For the sake of brevity I’ve focused on solutions for scripts. Solutions for font files can be found in my @font-face and performance blog post. I’m not aware of much research on loading stylesheets asynchronously. Causing too many reflows and FOUC are concerns that need to be addressed.
Call to action
Business Insider, CNET, and O’Reilly Radar all have visitors from China, and yet the way their pages are constructed delivers a bad user experience where most if not all of the page is blocked for more than a minute. This isn’t a P2 frontend JavaScript issue. This is an outage. If the backend servers for these websites took 1 minute to send back a response, you can bet the DevOps teams at Business Insider, CNET, and O’Reilly wouldn’t sleep until the problem was fixed. So why is there so little concern about frontend SPOF?
Frontend SPOF doesn’t get much attention – it definitely doesn’t get the attention it deserves given how easily it can bring down a website. One reason is it’s hard to diagnose. There are a lot of monitors that will start going off if a server response time exceeds 60 seconds. And since all that activity is on the backend it’s easier to isolate the cause. Is it that pagers don’t go off when clientside page load times exceed 60 seconds? That’s hard to believe, but perhaps that’s the case.
Perhaps it’s the way page load times are tracked. If you’re looking at worldwide medians, or even averages, and China isn’t a major audience, your page load time stats might not exceed alert levels when frontend SPOF happens. Or maybe page load times are mostly tracked using synthetic testing, and those user agents aren’t subjected to real world issues like the Great Firewall.
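To see how a median can hide the problem, here is a small sketch using made-up numbers: 100 page loads where 93 finish in 2 seconds and 7 sit blocked for the 75 seconds I saw on Business Insider. The sample data, the helper names, and the choice of the nearest-rank percentile method are all assumptions for illustration, not measurements.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(values, p) {
  var sorted = values.slice().sort(function (a, b) { return a - b; });
  var rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

function mean(values) {
  var sum = values.reduce(function (acc, v) { return acc + v; }, 0);
  return sum / values.length;
}

// Hypothetical sample: 93 normal loads (2 s) and 7 blocked loads (75 s).
var loadTimesMs = [];
for (var i = 0; i < 93; i++) loadTimesMs.push(2000);
for (var j = 0; j < 7; j++) loadTimesMs.push(75000);

console.log("median (ms):", percentile(loadTimesMs, 50)); // 2000
console.log("mean (ms):", mean(loadTimesMs));             // 7110
console.log("p95 (ms):", percentile(loadTimesMs, 95));    // 75000
```

In this made-up sample the median doesn’t move at all, the mean shifts by a few seconds, and only the 95th percentile shows the 75 second stall.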
One thing website owners can do is ignore frontend SPOF until it’s triggered by some future outage. A quick calculation shows this is a scary choice. If a 3rd party widget has a 99.99% uptime and a website has five widgets that aren’t async, the probability of frontend SPOF is 0.05%. If we drop uptime to 99.9% the probability of frontend SPOF climbs to 0.5%. Five widgets might be high, but remember that “third party widget” includes ads and metrics. Also, the website’s own resources can cause frontend SPOF which brings the number even higher. The average website today contains 14 scripts any of which could cause frontend SPOF if they’re not loaded async.
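The quick calculation can be reproduced with a few lines of JavaScript. This is a sketch that assumes every blocking resource has the same uptime and that failures are independent of each other; the function name spofProbability is mine. The last line applies the same formula to 14 blocking scripts at 99.9% uptime, which is a hypothetical scenario and not a measured one.

```javascript
// Probability that at least one of `blockingCount` blocking resources is
// unavailable, assuming identical uptime and independent failures.
function spofProbability(uptime, blockingCount) {
  return 1 - Math.pow(uptime, blockingCount);
}

function asPercent(probability) {
  return (probability * 100).toFixed(2) + "%";
}

console.log(asPercent(spofProbability(0.9999, 5))); // about 0.05%
console.log(asPercent(spofProbability(0.999, 5)));  // about 0.5%
// Hypothetical: all 14 scripts on an average page are blocking, 99.9% uptime each.
console.log(asPercent(spofProbability(0.999, 14))); // about 1.4%
```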
Frontend SPOF is a real problem that needs more attention. Website owners should use async snippets and patterns, monitor real user page load times, and look beyond averages to 95th percentiles and standard deviations. Doing these things will mitigate the risk of subjecting users to the dreaded blank white page. A chain is only as strong as its weakest link. What’s your website’s weakest link? There’s a lot of focus on backend resiliency. I’ll wager your weakest link is on the frontend.
[Originally posted as part of PerfPlanet’s Performance Calendar 2011.]
Sean Hogan | 28-Mar-12 at 4:35 pm | Permalink |
Hi Steve,
Great heads-up. I would also use this as a supporting argument for the concept of “Real Content First (and Fast)” which seems to be the purpose of Scott Jehl’s http://filamentgroup.com/lab/ajax_includes_modular_content/, Michal Migurski’s http://mike.teczno.com/notes/bandwidth.html, Mark Nottingham’s project http://github.com/mnot/hinclude and mine http://github.com/shogun70/HTMLDecor.
For Mark and my projects, putting the javascripts on the same server as the loading page (or async loading them) further decreases the chance of the failure you’ve documented here.
Steve Souders | 28-Mar-12 at 10:39 pm | Permalink |
@Sean: Yes, the potential of frontend SPOF is reduced when the resource (typically a script) is loaded from the same server as the main page. If the HTML document returned successfully then it’s likely the script will, too. Inlining the content (if it’s small) is another way to avoid frontend SPOF.
Mehdi Daoudi | 29-Mar-12 at 6:25 am | Permalink |
Steve,
Great article like always! Thank you for sharing your experience in China with the rest of the community.
Obviously, companies need to pay more attention to the third party tags and their front end code in general – and not just for China. The issues experienced by a Chinese user could happen to anyone if a third party service is down or there are connectivity issues. I have experienced it at home, and I am sure many others here have seen it.
I believe one major reason why websites are not addressing SPOF is simply the development cost associated with fixing it – especially when they are dealing with legacy code or 3rd party tags. The other problem is Marketing wants this widget and that tag on the page and in a lot of companies IT does not have much say.
I firmly believe that if we want to solve this problem, we need to ensure that the business execs understand the problem and its impact and make it a priority for IT to address. Otherwise, the issue will always be on IT’s back burner. There will always be some higher priority revenue generating project jumping in front of it (even one that could slow pages down).
One quick correction on synthetic monitoring in China: it does get impacted by the firewall, just like any other machine connected to the internet in that region. See this screenshot from our node in Beijing: http://www.screencast.com/t/7okUZSxzn
Some other suggestions based on our experience:
Publishers should detect whether users are from China and dynamically deploy a page that does not contain social plugins, since most of those (Twitter, Facebook…) are blocked in China.
Regarding CDNs, US publishers should make sure to use a CDN that has a PHYSICAL presence in CHINA – NOT HK. Most CDNs send traffic to Japan, HK, SG… and the rules are very strict in China.
Another finding is that anything hosted on Amazon platforms (S3, EC2, CloudFront…) is very slow with a lot of failures, either on connect or on loading.
Mehdi
Steve Souders | 29-Mar-12 at 7:30 am | Permalink |
@Mehdi: Great comments and suggestions. Thanks for the waterfall screenshot. Why is it only delayed ~10 seconds, instead of the ~100 seconds I experienced?
Mehdi Daoudi | 29-Mar-12 at 8:34 am | Permalink |
@Steve: In the case I provided it managed to load in 25 seconds. To see how often it fails we would need to run it very often. But just doing instant tests, I can get it to fail very often: http://www.screencast.com/t/VYGoBDCnL
Now remember that the timeout differs depending on your OS. On my Mac I can keep waiting for 100 seconds; on a PC the timeout is less generous. At Catchpoint we time out a test if it takes more than 30 seconds to complete. No page should take that long to load… :-)
From data I have seen in Catchpoint, a site monitored from China usually shows 40-50% availability, mostly because of 3rd parties timing out and bizarre network conditions. If you do a packet capture from China you can see the weird stuff that happens on the network.
In China, as you know from your Google experience, we all end up going through the same tunnels at some point, and it’s not like you have a lot of choices from an ISP perspective either. Another thing to note for readers regarding China: from an internet perspective you have to treat that country as many countries, with different internet topographies per region or sometimes even per city.
Mehdi
Ben Daniel | 02-Apr-12 at 7:51 am | Permalink |
Particularly enjoyed this post (and Pat’s earlier post on testing for SPOFs using WPT – http://blog.patrickmeenan.com/2011/10/testing-for-frontend-spof.html).
Guys like CNET should always be concerned about this problem, particularly given the number of 3rd party calls that are made!
http://img845.imageshack.us/img845/3938/domainbyrequestswwwcnet.png
Ben