Deploying at Wanelo tends to be high-frequency and low-stress, because we graph most aspects of our systems’ performance in real time. We can roll out new code to a percentage of app servers, monitor app server and db performance, check error rates, and then finish up the deploy.
On the other hand, many sites are moving more and more functionality client-side these days, so it’s becoming increasingly important to know when there are problems in the browser.
I have yet to see a great solution to this problem, so I try to ask about other companies’ client-side error tracking whenever I can. I usually hear one of two answers: A.) We don’t track them (but we’d like to), or B.) We built our own in-house tracking system; sometimes it helps us catch issues, but usually it’s a firehose of random errors that we can’t trace back to a particular issue.
There’s a middle path between these two answers that I think will end up being the “just right” solution for us: client-side error rate tracking. Essentially, ignore all error messages and calculate the total count of client-side errors per minute relative to “page views.” The goal of this sort of tracking isn’t to pinpoint each new client-side issue, but just to answer the question: did we break something during this deploy that’s going to prevent our users from having a good experience on the site?
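To make the idea concrete, here is a minimal sketch of the arithmetic, not the code we run. The `MinuteCounts` shape, the function names, and the default factor of 2 are all assumptions made up for illustration.

```typescript
// Counts for a single one-minute window. The field names are hypothetical.
interface MinuteCounts {
  errors: number;    // client-side error beacons received in this minute
  pageViews: number; // page views served in the same minute
}

// Errors per page view for one window; 0 when there was no traffic.
function errorRate(counts: MinuteCounts): number {
  return counts.pageViews === 0 ? 0 : counts.errors / counts.pageViews;
}

// True when the post-deploy rate is more than `factor` times the pre-deploy
// rate. The default of 2 is an arbitrary illustration, not a tuned threshold.
function looksBroken(
  before: MinuteCounts,
  after: MinuteCounts,
  factor: number = 2
): boolean {
  return errorRate(after) > errorRate(before) * factor;
}
```

Comparing a window before the deploy with one after it is only one way to ask "did we break something?"; in practice the same judgment can be made by eye on a graph.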
We wanted to deploy a quick solution without a lot of work, to see if it would be a fit; we didn’t want to embark on a client-side metrics yak shave. We also came up with a few other nice-to-haves:
- tracking should not require any new services or infrastructure, or add much load to our current infrastructure
- data needs to be graphable alongside our backend metrics
- data needs to be available fast, ideally in around 5 minutes or less
Most of our backend metrics are graphed in Circonus, and we display the most relevant ones on a big dashboard in the office. The graphs and meters on the dashboard are calibrated well enough that they go red infrequently, and when they do, we notice (and usually take action). My aspiration for our client-side error tracking is for it to be on this dashboard and work the same way — updating in real-time and going red if (and only if) there’s an actionable issue. Luckily, Circonus can consume data from a lot of different sources, including metrics from hosted third-party tools like New Relic — in fact, it can do custom checks to any JSON endpoint that is public, or that allows authentication with a URL param or request header.
The custom-check capability is a win, because it lets us use Fastly (our CDN) to serve our client-side error beacon and then set up a Circonus check against the Fastly stats API to get a count of errors. Fastly isn’t exactly intended for this use case, so it takes a bit of configuration to set up, but there are a few advantages: first, a spike in error beacon traffic won’t increase load on any of our infrastructure; second, we make extensive use of Fastly for serving assets and pages, so we already have a reason to keep an eye on Fastly day to day (and, of course, error pixels load fast :)
1. Add an on-error handler that loads the error beacon
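A handler along these lines would do the job. This is a rough sketch rather than our production code: the beacon URL is a placeholder, and `installErrorBeacon`, `ErrorTarget`, and `BeaconLoader` are names invented for this example. The target and the loader are passed in so the function can be exercised outside a browser.

```typescript
// Placeholder beacon URL; substitute the domain the pixel is actually served from.
const BEACON_URL = "https://errors.example.com/e.gif";

// The one property of `window` this sketch relies on.
interface ErrorTarget {
  onerror: ((...args: any[]) => any) | null;
}

// Anything that can trigger a GET for the beacon.
type BeaconLoader = (url: string) => void;

function installErrorBeacon(
  target: ErrorTarget,
  url: string,
  load: BeaconLoader
): void {
  const previous = target.onerror;
  target.onerror = (...args: any[]) => {
    // Ignore the error details entirely; one request equals one error.
    load(url);
    // Chain any handler that was already installed. Returning false
    // leaves the browser's default error reporting in place.
    return previous ? previous(...args) : false;
  };
}

// In a browser, usage would look like:
//   installErrorBeacon(window, BEACON_URL, (u) => { new Image().src = u; });
```

Note that the URL carries no cache-busting parameter: every error requests the same asset, so it can be served straight from the CDN cache.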
2. Deploy the beacon to Fastly
Second, we deployed our error pixel on Fastly. We set the pixel up on its own domain so it would be easy to track all of our client-side error traffic as a separate service. We also had to update the default VCL slightly: we wanted the backend TTL to be long, so that Fastly never needs to fetch a new version of the asset from our servers, but the Cache-Control header on the response to be short (max-age=1), so that each new error triggers a request to Fastly rather than being served from the browser cache.
3. Set up a Circonus JSON check to the Fastly API
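For a sense of what such a check consumes, here is a hedged sketch of reading per-minute request counts out of a Fastly stats response. The endpoint path, the `from`/`to`/`by` parameters, the `Fastly-Key` header, and the `data[].start_time` and `data[].requests` fields reflect my understanding of Fastly's historical stats API and should be checked against the current Fastly documentation; `statsRequest` and `errorsPerMinute` are names invented for this example. Circonus itself only needs the URL and the header, so the parsing function is here purely to show which numbers matter.

```typescript
// Assumed shape of one entry in the stats response.
interface StatsEntry {
  start_time: number; // Unix timestamp for the start of the minute
  requests: number;   // requests served by the service in that minute
}

interface StatsResponse {
  data: StatsEntry[];
}

// Builds the request a JSON check (or any HTTP client) would make.
// serviceId and apiKey are placeholders supplied by the caller.
function statsRequest(
  serviceId: string,
  apiKey: string
): { url: string; headers: Record<string, string> } {
  const query =
    "from=" + encodeURIComponent("1 hour ago") +
    "&to=" + encodeURIComponent("now") +
    "&by=minute";
  return {
    url: "https://api.fastly.com/stats/service/" +
      encodeURIComponent(serviceId) + "?" + query,
    headers: { "Fastly-Key": apiKey },
  };
}

// The beacon lives on its own service, so every request is one error.
function errorsPerMinute(
  response: StatsResponse
): Array<{ time: number; errors: number }> {
  return response.data.map((entry) => ({
    time: entry.start_time,
    errors: entry.requests,
  }));
}
```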
If we want to see this data immediately, without the 30-minute delay, we can: we just need to log in to Fastly and look at the real-time graphs on the Fastly dashboard.
And that's it!
We’re successfully collecting this data now, and we’re graphing it in Circonus. This is the "errors last week" view:
So far, this experiment looks like a qualified success. We were able to roll out tracking with less than a day of work, and, if something major gets broken, we’ll see a spike in the graph within 30 minutes. However, since this metric is a bit noisier than others we collect, it takes a fairly widespread issue before it’s obvious on the graph. So, there are some ways we want to improve this tracking in the future...
1. Segmented rates by browser
The top priority for future work here is adding error rates by browser/operating system (and possibly browser version). Right now our graphs will tell us if there’s an issue on a popular page in all (or most) browsers, but if we had an issue in just one browser, like Internet Explorer 9, it might get lost in the noise. It would be great to have a composite graph of stacked rates for each major browser, so we could see if one was growing out of proportion to the others.
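As a sketch of what "out of proportion" could mean numerically, assuming we had per-browser counts (which the current beacon does not collect), something like the following would flag a browser whose error rate is well above the overall rate. The `BrowserCounts` shape, the function names, and the factor of 3 are all hypothetical.

```typescript
// Per-browser tallies for one time window. Hypothetical: the current
// setup only records a single total.
interface BrowserCounts {
  browser: string;
  errors: number;
  pageViews: number;
}

// Errors per page view; 0 when the browser had no traffic.
function rateOf(c: BrowserCounts): number {
  return c.pageViews === 0 ? 0 : c.errors / c.pageViews;
}

// Names of browsers whose rate exceeds `factor` times the overall rate.
// The default factor of 3 is an arbitrary illustration.
function outOfProportion(counts: BrowserCounts[], factor: number = 3): string[] {
  const totalErrors = counts.reduce((sum, c) => sum + c.errors, 0);
  const totalViews = counts.reduce((sum, c) => sum + c.pageViews, 0);
  const overall = totalViews === 0 ? 0 : totalErrors / totalViews;
  return counts
    .filter((c) => rateOf(c) > overall * factor)
    .map((c) => c.browser);
}
```

With a single aggregate rate, a browser carrying a small share of traffic can fail badly and barely move the total, which is exactly the case this comparison is meant to surface.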
2. Closer-to-real-time data
The data we pull from the Fastly API is a bit older than the data in most of our Circonus graphs, so there’s some mental overhead in remembering to offset this one graph from the others by 30 minutes when thinking about overall system health. It would be great to be able to pull fresher data from the Fastly API without the rate limits, or be able to offset the graphs in Circonus by a set amount of time.
Other collection options: Google Analytics
On the other hand, a site as large as Wanelo no longer qualifies for full data collection on the free Google Analytics service, so we use 5% sampling. Clearly, sampling client-side errors is not the best idea :)
Where this won’t work
For us, I think graphing a client-side error rate is the happy medium between having no visibility into client-side exceptions and having a high-maintenance error-tracking service. However, there are use cases where this technique wouldn’t be a great fit: for example, if your company does large, infrequent releases, error rates may not be enough information. Our releases tend to be relatively small, so usually, if we see error rates climbing, we know which code changes are the likely culprits, and where to start looking for the issue.