A few years back while working for a B2B company that was compensated by an attributable sales, I got on a phone call early in the morning to discuss fixing a client side display issue. The previous night, after a release a integration engineer modified a config that broke our service from rendering on almost every page at a single customer. The bug was fairly subtle, allowing what he was working on to display correctly, but breaking every other div on the site. This was pushed in early Spring, at off hours, and caught at the beginning of the day on the east coast.
At 9am PST, we held a post-mortem with all of our engineers. We discussed the impact of the issue on our revenue, which fortunately was pretty small, and laid out the timeline. Immediately, we discussed whether this was a testing issue or a monitoring failure. The CEO came back, and said while it was understandable that we missed the failure, our goal as an Ops team should be to catch any rendering issue within 5 minutes of a failure. I was a little annoyed, but agreed to build a series of test to try and catch this.
Why Other Metrics Failed Us
We had a fairly sophisticated monitoring setup at this point. We tracked daily revenue per customer, and we would generally know within 30 minutes if we had a problem. Our customers were very sites, but typically for US only sites had almost no traffic between 0:00 PST/PDT and 6:00 PST/PDT; in that time period it wasn’t unusual to have 0-2 sales. Once we got into a busier sales period the issue was spotted withing 30 minutes, and we were alerted. During this time period it turns out our primary monitoring metric for this type of failure was useless.
QA Tools Can Solve Ops Problems Too
I was familiar with Selenium from some acceptance tests I helped our QA guys write. I began to put together a test suite that met my needs(can’t provide the code for this, sorry). It consisted of:
- rendering the main page
- navigating to a page which we displayed content
- clicking a link we provided
- verifying that we displayed our content on a new page
This worked fairly well, but I had to tweak some timings to make sure the page was “visable”. I rigged this up to run through jUnit, and left a selenium server running all the time. Every 5 minutes the test suite would execute, leaving behind a log of successes and failures. We eventually built a test suite for every sizable customer. Every 5 minutes we checked the output of the jUnit with a custom Nagios test, that would tell us which customers had failures, an send an individual alert for each one.
I was really annoyed when I first had this conversation with the CEO; I thought this was a boondoggle that ops should not be responsible for. Within the first month my annoyance turned to delight as I started getting paged when our customers had site issues. I typically called them before their NOC had noticed, and most of the time these were issue they introduced to their site. I’d do it again in a heartbeat, and recommend that anyone else give it a try.