Monitoring – The Challenge of Small Ops – Part 2

So your building or have built a web service, you’ve got a lot of challenges a head. You’ve got to scale software and keep customers happy. Not surprisingly that likely involves keeping your web service up, and that typically starts by setting up some form of monitoring when something goes wrong. In a large organization this may be the job of an entire team, but in a small organization its likely everyone’s job. So, given that everyone will tell you how much monitoring sucks, how do you make it better and hopefully manageable.

A Subject Mater Expert Election

Elect someone as your monitoring expert. I’d suggest someone with an Ops background, since they probably used several tools, and are familiar with configuration and options for this type of software. Its important that this person is in the loop with regard product decisions, and need to be notified when applications are goring to be deployed.

What Gets Monitored?

The simplistic answer you’ll hear over and over again is “everything”, but you’re a small organization with limited headcount. Your monitoring expert should have an idea since he’s in the product loop. They should be making suggestions as to what metrics the application should expose, and getting the basic information from the development team for alerting (process names, ports, etc..). They should prioritize the list of tests and metrics, starting with what you need to know that the application is up, and finishing with nice to haves.

Who Builds It?

Probably everyone. If that’s all you want you monitoring expert to do, then have them do it, but its far more efficient to spread the load. If people are having difficulty integrating with particular tools like Nagios or Ganglia, that’s where you expert steps in. They should know the in’s and out’s of the tools your using, and be able to train someone to extend a test, or get a metric collected within a few minutes.

These metrics and tests should be treated as part of the product, and should follow the same process you use to get code in production. You should consider making them goals for the project, since if you wait till after shipping there will be some lag between pushing code and getting it monitored.

Wrapping It Up

Its really not that hard; you just need someone to take a breath and plan. The part of monitoring that sucks is the human part, and since your app will not tell people when its up and down, you need a person to think about that for you.

Take a look at Part 1 if you liked this article.