I realize that the world is changing, and that operations teams need to adjust to it. We have skilled development teams, cloud services, and APIs for almost everything you need to build a moderately sized web service. It’s no wonder that some smaller (and some large) organizations are beginning to question the need for ops teams. So, I’m going to write a series of articles discussing the challenges that exist in the operations of small and medium-sized teams, and how an operations expert can help solve these issues.
Ops as the Firefighter
I discussed this topic with a fellow conference-goer, who said the general feeling was that you push responsibility to your vendors and eliminate ops. I suggested that perhaps we should think of ops as a firefighter.
Small towns still have firefighting teams, and they may be volunteers, but I’ll bet they were trained by a professional. You should think of an Operations Engineer as your company’s trainer. You should lean on them for the knowledge that can only be gained by working in an operational environment.
Failure is the only constant for web services, and you should expect it to happen. You will need to respond to failures in a calm and organized manner, but this is likely too much for a single individual. You’ll need a better approach.
A mid-level or senior operations engineer should be able to develop an on-call schedule for you. They should be able to identify how many engineers you need on call in order to meet any SLA response requirement. In addition, they can train your engineers to respond, and make sure that any procedure you owe to customers is followed. They can make everyone more effective in an emergency.
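To make the idea of an on-call schedule a little more concrete, here is a minimal sketch of a weekly rotation with a primary and a secondary responder. The function name `build_rotation`, the engineer names, and the weekly hand-off are all assumptions for illustration. A real schedule would also account for time zones, vacations, and whatever response time your SLA promises.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Return a list of (week_start, primary, secondary) tuples.

    Hypothetical helper: a simple round-robin where next week's
    primary serves as this week's secondary.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    n = len(engineers)
    schedule = []
    for week in range(weeks):
        primary = engineers[week % n]
        secondary = engineers[(week + 1) % n]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


if __name__ == "__main__":
    # Example names and start date are made up.
    team = ["ana", "ben", "chloe"]
    for week_start, primary, secondary in build_rotation(team, date(2024, 1, 1), 4):
        print(week_start.isoformat(), primary, secondary)
```

Having the secondary become the next primary means whoever takes over has usually seen the previous week’s incidents first-hand.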
Amazon, Heroku, and their friends all provide excellent, reliable platforms, but from time to time they fail. Vendors typically like to restrict communications to as few people as possible, since it makes it easier for them to communicate. If you’re not careful, you may find responsibility for vendors spread across your organization as individuals add new vendors.
I believe it makes more sense to consolidate the knowledge in an operations engineer. An operations engineer is used to seeing vendors fail, and will understand the workflow required to report and escalate a problem. They understand how to read your vendor’s SLA, and how to hold the vendor accountable for failures. Someone else can fill this role, but this person needs to be available at all hours, since failures occur randomly, and they will need to understand how to talk to the NOC on the other end.
Your platform provides a service, and you have customers that rely on you. Your engineering team often becomes focused on individual systems, and repairing failures in those systems. It is useful if someone plays the role of the advocate for the service, and I think operations is a perfect fit. A typical ops engineer will be able to determine if the service is still failing, and push for a resolution within the organization. They are generally familiar with the parts of the service and who is responsible for them.