Monitoring Your Customers with Selenium and Nagios

In a brief conversation with Noah Sussman at DevOps Days, when discussing the challenges of continious deployment for B2B services with SLAs, we got side tracked discussing using Selenium and Nagios in production.

A few years back while working for a B2B company that was compensated by an attributable sales, I got on a phone call early in the morning to discuss fixing a client side display issue. The previous night, after a release a integration engineer modified a config that broke our service from rendering on almost every page at a single customer. The bug was fairly subtle, allowing what he was working on to display correctly, but breaking every other div on the site. This was pushed in early Spring, at off hours, and caught at the beginning of the day on the east coast.

At 9am PST, we held a post-mortem with all of our engineers. We discussed the impact of the issue on our revenue, which fortunately was pretty small, and laid out the timeline. Immediately, we discussed whether this was a testing issue or a monitoring failure. The CEO came back, and said while it was understandable that we missed the failure, our goal as an Ops team should be to catch any rendering issue within 5 minutes of a failure. I was a little annoyed, but agreed to build a series of test to try and catch this.

Why Other Metrics Failed Us

We had a fairly sophisticated monitoring setup at this point. We tracked daily revenue per customer, and we would generally know within 30 minutes if we had a problem. Our customers were very sites, but typically for US only sites had almost no traffic between 0:00 PST/PDT and 6:00 PST/PDT; in that time period it wasn’t unusual to have 0-2 sales. Once we got into a busier sales period the issue was spotted withing 30 minutes, and we were alerted. During this time period it turns out our primary monitoring metric for this type of failure was useless.

QA Tools Can Solve Ops Problems Too

I was familiar with Selenium from some acceptance tests I helped our QA guys write. I began to put together a test suite that met my needs(can’t provide the code for this, sorry). It consisted of:

rendering the main page
navigating to a page which we displayed content
clicking a link we provided
verifying that we displayed our content on a new page

This worked fairly well, but I had to tweak some timings to make sure the page was “visable”. I rigged this up to run through jUnit, and left a selenium server running all the time. Every 5 minutes the test suite would execute, leaving behind a log of successes and failures. We eventually built a test suite for every sizable customer. Every 5 minutes we checked the output of the jUnit with a custom Nagios test, that would tell us which customers had failures, an send an individual alert for each one.

Great Success!

I was really annoyed when I first had this conversation with the CEO; I thought this was a boondoggle that ops should not be responsible for. Within the first month my annoyance turned to delight as I started getting paged when our customers had site issues. I typically called them before their NOC had noticed, and most of the time these were issue they introduced to their site. I’d do it again in a heartbeat, and recommend that anyone else give it a try.

Comments

4 responses to “Monitoring Your Customers with Selenium and Nagios”

Kenneth

July 3, 2012

I’m curious on your integration to Nagios. Do you have any details you can share on how you’re integrating the selenium infrastructure to Nagios?
1. papilion
  
  July 6, 2012
  
  Kenneth, the setup I used wasn’t at all complicated. Nagios only ran a log check against the output of jUnit. I’d could recreate this, but I looked for this line:
  errors=”0″ failures=”0″
  with a custom nagios test and returned an error if it didn’t match. The tests were kicked of from cron.
  
  Does that help?
william beckler

July 13, 2012

Why didn’t you have nagios check the selenium logs? Could you skip junit?
1. papilion
  
  July 16, 2012
  
  The way we initiated the tests were through junit. So, yes it could have been skipped, but at the time we decided it was an easy place to look for failures.