Rally Cars and Redunancy: Understand Your Failure Boundaries

I occasionally watch rally car racing, and if you haven’t seen it before its worth a watch. Guys drive small cars very fast down dirt roads, and while this is going on a passenger is reading driving notes to the driver. Occasionally these guys hit rocks, run off the road, and do all sorts of damage to their cars. The do a quick assessment of the damage determine if they can continue, and the carry on or pull out. Generally the calls the are making is to whether the particular failure has compromised the safety, or performance of the car so much that they cannot complete the race.

If you want to build a redundant system, you’ll need to take a look at each component and ask yourself what happens if it fails. Do you know how many components it will affect? Will it bring the site down, or degrade performance so much that the site will practically be down? Will your system recover automatically, or require intervention for an operator? Think through this is a very detailed manner, and make sure you understand the components.

Enter the Scientific Method

Develop a hypothesis, and develop a checklist of what you expect to see during a system failure. This should include how you expect performance to degrade, what you expect to do to recover, and what alerts should be sent to your people on call. Put time frames in which you expect things to happen in, and most importantly note any place you expect there is a single point of failure.

Break It

Once you’ve completed your list, start shutting off pieces to test your theories. Did you get all the alerts you expected in a timely manner? More importantly, did you get extra alerts you didn’t expect? These are important because they may mislead, or obscure a failure. Did anything else fail that you didn’t expect? And, lastly did you have to do anything you didn’t expect to have to recover?

Are You a Soothsayer

Summarize the differences, and document what happened. If you got too many alerts, see if you can develop a plan to deal with them.Then document what the true failure boundaries are. For example if you’re firewall fails, do you loose access to your management network? After doing all this decide if there is anything you can do to push failure boundaries back into a single component, and if you can minimize the effect on the rest of your system. Basic infrastructure components like switches and routers usually have surprising failure boundaries, when coupled with infrastructure decisions such as putting all of your database servers in a single rack.

This process takes time, and its hard to do at small and medium scales. Once you have infrastructure up and running, its difficult to run these types of tests, but you should probably still advocate for it. Its better to discover these type of problems under controlled conditions then in the middle of the night. You may be able to test parts using virtualization, but sometimes you’ll need to pull power plugs. Concoct a any type of failure you can imagine, and look for soft failures(mysql server is loosing 10% of its packets) since they are the most difficult to detect. Remember, You’ll never really have a redundant system until you start understanding how each component works with other components in your system.