Operations On-Call Go-Bag

A go-bag (or bug-out bag) is commonly discussed in terms of emergency preparedness; it’s a bag containing the things you would need to survive for 72 hours. For survival, you would be looking at things like food, water, a medical kit, and some basic tools. The idea is that you may have to leave quickly in an emergency, and may not have time to gather supplies. We can borrow from this useful concept to build out a kit that you should carry with you when you’re on-call.

The Basic On-Call Go Bag

Cell Phone

I know I shouldn’t have to say this, but if you’re on-call you’re going to need a cell phone. I would recommend setting it to an audible ring-tone, something obnoxious that you wouldn’t miss if it rings.

You should have any numbers you’ll need when on-call programmed into your phone. Obvious numbers you should have are things like your datacenter’s NOC, your data carrier’s NOC, and any vendor who is involved in the delivery of your service. This should probably include any developers, team leads, and managers you may need to call.

Mobile Hotspot or Tethering Plan

There is a chance you may be somewhere without WiFi, so plan ahead and make sure you can stay connected. It’s a terrible feeling to get an SMS and not be able to do anything for 20-30 minutes.

Charger for Cellphone

If you need to use your phone to talk, you’re probably going to use more battery. If you’re also planning to use your phone to tether or as a mobile hotspot, you’ll be eating through your battery very quickly, so it’s best to carry one of these.

Headset for Cell Phone

If you end up on the phone, you’ll probably want your hands for other things. A headset helps you keep both hands on the keyboard, or allows you to easily take notes, without putting your smartphone at risk.

Computer

Well you are going to fix servers, right? Make sure you can connect to your production instances, your documentation servers, VPN, and your mobile internet. You should try all of these things first, so that you’re not caught trying to figure it out on the go. Make sure you have the following:

  • Production Access (VPN or bastion host access)
  • SSH keys, if needed
  • Offline copies of some documentation in case your documentation servers are inaccessible.
  • Links to bugtracking, and monitoring servers/services.
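
One way to try all of these things ahead of time is a small preflight script. Here is a minimal Ruby sketch, assuming your bastion, documentation, and monitoring endpoints all answer on a TCP port. The hostnames and ports are placeholders I made up, and an open port only tells you the service is reachable, not that your keys or VPN credentials actually work.

```ruby
require "socket"

# Hypothetical endpoints; replace with your own bastion, docs, and monitoring hosts.
ENDPOINTS = {
  "bastion (SSH)" => ["bastion.example.com", 22],
  "documentation" => ["docs.example.com", 443],
  "monitoring"    => ["monitoring.example.com", 443],
}.freeze

# Returns true if a TCP connection to host:port opens within the timeout.
def reachable?(host, port, timeout: 3)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError
  # Refused, timed out, or the name did not resolve.
  false
end

# Returns a hash of endpoint name => true/false.
def preflight(endpoints)
  endpoints.transform_values { |(host, port)| reachable?(host, port) }
end
```

Call preflight(ENDPOINTS) from your laptop while tethered, before your shift starts. Anything that comes back false is something to fix while things are still quiet.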

Laptop Power Adapter

You should be prepared to be on your computer for a while. It can be pretty easy to run out of power when you’re on the go.

Paper Notebook

Sometimes it helps to write things down. You may not be in a good place to file tickets and communicate what is happening. It can also be a little faster to scribble a note on a piece of paper, and it may help you reconstruct things a little later. As an extra bonus, you can take down a phone number if you need to.

Pen

A notebook is pretty useless without it.

Data Center Badge (if needed)

If you have physical hardware, you may have to drive to the datacenter. It’s better to have this with you in an emergency than to have to stop at the office or at home first.

Conclusion

It’s not a ton of equipment, but it’s enough to get the job done. What you’re looking to do is eliminate anything that might stop you from resolving the issue.

 

Trouble Tickets: Annoying, but Useful

If you work in operations, you probably have used a ticketing system or two. They are common across the industry, and every organization has its own particular workflow. In my younger days I loathed them, since they seemed to be an impediment to me doing my job. Today, I’d describe myself as a reluctant fan. My goal in their use is to meet a few simple operational needs, without introducing undue burden on your development and ops teams.

Change Control

The first and most important use of your ticket system should be to track changes. Do you need to tweak a router configuration, or perhaps install an additional 48 GB of RAM in your DB server? There should be a ticket.

These types of tickets help you reconstruct what happened when something changed. In a small environment (fewer than 30 servers, and fewer than 2 team members), it is easy to keep track of changes, since the volume of activity is lower than in a large environment. But once you add a few team members and a couple hundred servers, it becomes very difficult to track down what happened from a few emails, particularly if there are a few time zones involved.

These types of tickets should be concise; they should contain the purpose of the change, when it was made, and any expected side effects. Ideally you should have a link to your revision control system (your commit message should include the ticket id), with the current and previous versions listed in case someone needs to revert your change.
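
If you want to enforce the ticket id in commit messages, a small check is enough. This is a rough Ruby sketch, assuming ticket ids look like an uppercase project key, a dash, and a number (OPS-1234, for example). That format is my assumption, so adjust the pattern to whatever your ticket system produces.

```ruby
# Assumed ticket id format: uppercase project key, a dash, and a number (e.g. OPS-1234).
TICKET_ID = /\b[A-Z][A-Z0-9]+-\d+\b/

# Returns every ticket id mentioned in a commit message, without duplicates.
def ticket_ids(commit_message)
  commit_message.scan(TICKET_ID).uniq
end

# A change is traceable when its commit message names at least one ticket.
def traceable?(commit_message)
  !ticket_ids(commit_message).empty?
end

puts ticket_ids("OPS-1234: raise max_connections on db01").inspect
puts traceable?("tweak router config")
```

You could call traceable? from a commit hook or a review script and reject changes that don’t point back to a ticket.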

Post-Mortem Evidence

When a critical production issue happens, step one should be filing a ticket. The initial ticket should be brief, for example “DB Master Down”, but still convey the issue. The goal is to start a clock that can be used in determining response time, and to help you reconstruct events when running a post-mortem on the production failure. These tickets should be created for all unexpected production events. They should be periodically updated, and should contain the resolution to the issue.

It feels wrong to file a ticket when “Rome is burning”, but keep in mind you have commitments you’ve made. Most service level agreements (SLAs) contain language requiring that a ticket be created within X minutes of an incident. It also gives your support folks or users a way to communicate the scope of the problem, without interrupting work on a resolution. Once things are recovered, you should have the timeline of events for your post-mortem.
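
To make the “clock” idea concrete, here is a small Ruby sketch that turns ticket updates into a post-mortem timeline and checks whether the ticket was filed inside the SLA window. The Update struct, the sample notes, and the 15 minute window are all made up for illustration; your ticket system will have its own data model.

```ruby
# One entry per ticket update: when it was written and what it said.
Update = Struct.new(:at, :note)

# Returns the updates in chronological order, each paired with the
# number of minutes elapsed since the first update.
def timeline(updates)
  return [] if updates.empty?

  sorted = updates.sort_by(&:at)
  start = sorted.first.at
  sorted.map { |u| [((u.at - start) / 60).round, u.note] }
end

# True when the ticket was created within sla_minutes of the incident start.
def within_sla?(incident_start, ticket_created, sla_minutes)
  (ticket_created - incident_start) <= sla_minutes * 60
end

opened = Time.utc(2024, 1, 1, 2, 0)
updates = [
  Update.new(opened + 45 * 60, "Replica promoted, service restored"),
  Update.new(opened, "DB Master Down"),
  Update.new(opened + 10 * 60, "Failover started"),
]

timeline(updates).each { |minutes, note| puts format("+%3d min  %s", minutes, note) }
puts within_sla?(opened - 5 * 60, opened, 15) # ticket filed 5 minutes after the alert
```

The point is that the ticket timestamps do the bookkeeping for you; nobody has to remember when things happened.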

Making Sure It Happens

If your team is anything like most operations or devops groups, you have too much work to get done. Users may interrupt you, developers may want you to change the production environment, but not all of it is critical. Some work just has to be put off until someone has enough time to do it properly.

Tasks that are longer than 5 minutes, and that you aren’t going to do immediately, should be ticketed. The issue should have a priority, and you may want to assign a due date if needed (a task like replacing an SSL cert can be very low priority, but becomes critical once the due date has passed). Give a brief outline of what you expect needs to be done, so if you’re lucky enough to have a colleague or two, they might be able to take the ticket from you.

If your team can manage, you should triage these types of tickets once per day, ideally at the start of the day. This should allow you to come up with a plan of attack for the day, and help you shift expectations on projects that are being derailed by interrupt-driven or higher priority work.
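
Here is one way the daily triage ordering could look, as a Ruby sketch. The Ticket struct and the rule (anything due today or overdue first, then by priority with 1 as most urgent, then by nearest due date) are my own assumptions rather than a feature of any particular ticket system.

```ruby
require "date"

Ticket = Struct.new(:id, :summary, :priority, :due_on)

# Orders tickets for the daily triage: overdue or due-today work first,
# then by priority (1 is most urgent), then by the nearest due date.
def triage(tickets, today: Date.today)
  tickets.sort_by do |t|
    due_now = t.due_on && t.due_on <= today
    # Tickets without a due date sort after dated tickets of the same priority.
    [due_now ? 0 : 1, t.priority, t.due_on || Date.new(9999, 12, 31)]
  end
end

tickets = [
  Ticket.new("OPS-1", "Tune log rotation", 3, nil),
  Ticket.new("OPS-2", "Replace SSL cert", 4, Date.new(2024, 2, 28)),
  Ticket.new("OPS-3", "Add disk to db02", 2, Date.new(2024, 3, 15)),
  Ticket.new("OPS-4", "Patch bastion host", 2, Date.new(2024, 3, 5)),
]

triage(tickets, today: Date.new(2024, 3, 1)).each do |t|
  puts "#{t.id}  P#{t.priority}  #{t.summary}"
end
```

Notice that the low priority SSL cert ticket jumps to the top once its due date has passed, which is exactly the case you don’t want to discover by accident.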

Things Tickets Aren’t For

As soon as people start using issue-tracking software, there is a desire to use it to determine employee effectiveness. While tempting, this should be avoided since most tickets assigned to operations groups have different levels of difficulty.

A co-worker of mine tells a story of working at a large ISP, and being asked to generate this type of report. The newest, and most junior employees were completing the largest numbers of tickets, since they handled reverse DNS requests, and the most senior employees had the fewest since their tickets often took several days to solve. Management was on the verge of reprimanding their entire senior network staff, until they realized the difference in the pattern.

Similarly, you shouldn’t hold a task open forever in your ticketing environment. While you may tell yourself this is being done to “track” the issue, you are creating noise. These types of issues should be closed as “won’t fix”, or scheduled for completion within a quarter.

Conclusion

Trouble ticket software makes sense, and you should use it. Try to think of the reason you’re creating the ticket, and see if you can keep the content of the tickets aligned with the purpose of the tickets.

Two Quick Chef Gotchas

Configuration management is a hot topic these days. Chef is one of the more popular choices, and does a fairly good job helping you maintain consistent configuration across your environment. That said, it isn’t foolproof. I’ve outlined two common scenarios in which you might introduce a configuration issue.

Removing a File, Package, User, or Any Chef-Managed Resource

There are a few cases when using Chef where you will end up with an unintentionally installed package, user, file, or other resource. Typically this will happen when modifying a recipe to remove a resource. Let’s say you have a recipe that installs three packages:

package "a" do
  action :install
end

package "b" do
  action :install
end

package "c" do
  action :install
end

You may want to remove “package b”, so you might remove it from the recipe:

package "a" do
  action :install
end

package "c" do
  action :install
end

This, however, will leave you with “package b” installed and unmanaged across all the nodes that have already run this recipe. Chef is no longer responsible for “package b”, and won’t take any action once it’s been removed from the recipe. In cloud environments, new instances and old instances will now have mismatched configurations, and you may see issues with dependencies across instances.

The proper way to remove a previously Chef-managed package is to do the following:

package "a" do
  action :install
end

package "b" do
  action :remove
end

package "c" do
  action :install
end

If you want to remove the “package b” code from your recipe, wait until you have confirmed all nodes have removed the desired package, and then delete the lines from your recipe.

IMPORTANT:

If you are using Chef to manage users, make sure Chef removes your users for you; otherwise they will continue to have access. The same goes for any Chef-managed resource (cron jobs, files, etc.): once Chef is in control, let Chef remove or uninstall the resource.

Resource Definitions in Loops
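
The same pattern as the package example applies to other resource types. Here is a recipe fragment using the removal actions I’m aware of for the built-in user, cron, and file resources; the account name, cron job name, and file path are all hypothetical. Like the package example, leave these in the recipe until every node has converged.

```ruby
# Hypothetical names throughout; substitute the resources your recipes actually manage.

# Remove a departed employee's account instead of just deleting the resource block.
user "former-employee" do
  action :remove
end

# Delete a cron job that Chef created earlier.
cron "nightly-report" do
  action :delete
end

# Delete a configuration file that Chef used to manage.
file "/etc/example/old-feature.conf" do
  action :delete
end
```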

I see people use loops to create resources in recipes. Most of the time these are being done for file creation, or execution of an external process. I came across something a few weeks back that was strange:

servers = %w{ server-a server-b }
servers.each do |server|
  # Every iteration reuses the same resource name, which triggers the problem.
  execute "server-command-add" do
    not_if "/usr/bin/add-server-to-something exists #{server}"
    command "/usr/bin/add-server-to-something add #{server}"
  end
end

Chef will do something fairly unexpected here: the second command will never execute, because a not_if condition on the second resource is always met. The execute resource for “server-b” ends up with two not_if conditions (“/usr/bin/add-server-to-something exists server-a” and “/usr/bin/add-server-to-something exists server-b”). Once server-a has been added, the first condition succeeds, so the resource is skipped. Chef copies attributes from the first execute resource defined, and appends the additional not_if conditional to an array. Because not_if and only_if are stored as arrays, Ruby copies an array reference from the first resource to the second resource.
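
The shared reference is easier to see outside of Chef. The following is plain Ruby that mimics the behavior described above; FakeResource is a stand-in I wrote for illustration, not Chef’s real class, and it makes no claim about how Chef implements this internally.

```ruby
# A stand-in for a resource; this is not Chef's real class.
class FakeResource
  attr_reader :name, :guards

  def initialize(name, guards = [])
    @name = name
    @guards = guards
  end

  # Mimics reusing a resource name: the copy receives the SAME array object.
  def clone_for_redefinition
    FakeResource.new(name, guards)
  end

  # Appends a guard command to this resource's list.
  def not_if(command)
    guards << command
  end
end

first = FakeResource.new("server-command-add")
first.not_if("exists server-a")

second = first.clone_for_redefinition
second.not_if("exists server-b")

p second.guards                      # => ["exists server-a", "exists server-b"]
p first.guards.equal?(second.guards) # => true, one array shared by two resources

# A resource with its own name starts with its own, empty guard list.
unique = FakeResource.new("server-command-add-server-b")
unique.not_if("exists server-b")
p unique.guards                      # => ["exists server-b"]
```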

It is unclear whether this is intentional, but you should be aware of this issue when writing chef recipes. The best way to execute this pattern is to give each resource a unique name, like so:

servers = %w{ server-a server-b }

servers.each do |server|
  # Interpolating the server into the name makes each resource distinct.
  execute "server-command-add-#{server}" do
    not_if "/usr/bin/add-server-to-something exists #{server}"
    command "/usr/bin/add-server-to-something add #{server}"
  end
end

Conclusion

These are just two examples, and I’m sure there are plenty of others. When using automation tools, remember to check that they achieved the results you expected; never blindly trust the tool.

Physical Infrastructure for the Win

Tired of cloud infrastructure performance? Wishing you could get a couple SSDs to solve your IOPS issues in EC2? Trying to reduce your operating expenses, and OK with the capital expense? There are plenty of reasons to consider physical or managed hardware, but managing it presents its own challenges. In order to do it right, you’ll need to keep a few things in mind.

1. Start by turning out the lights.

I don’t know why but many people love trekking to the datacenter; I hate it. It is like working in a boiler room; there are thousands of fans, cold and hot moments, and a lot of physical security. When something goes wrong at two in the morning, and you’re at least an hour away, you’ll wish you didn’t have to call the smart hands service, or worse yet hop in the car to press a power button. So, making your physical hardware easy to deal with starts by building and then utilizing your lights out management resources.

You should only consider hardware that is meant for the datacenter. You’re looking for equipment with an IPMI card, ideally with a virtual console exposed through a web UI. If you can’t get the virtual console, try to set up serial over LAN access, since it’s better than nothing. When you install your equipment, your first step should be setting up your IPMI access and connecting it to your management network (I’ll cover setting up a management network in another blog post).

Often overlooked due to price, but totally worthwhile, are managed PDUs. They add to the initial price, but will save you money in the long run, since a smart hands call will cost you about 200 dollars to pull a power plug.
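
The arithmetic is quick to check. In this Ruby sketch the 200 dollar call cost is the rough figure from above, while the 500 dollar premium for a managed PDU is purely a made-up number; plug in your own quote.

```ruby
SMART_HANDS_CALL_COST = 200.0 # dollars per call, the rough figure quoted above

# Number of avoided smart hands calls needed to pay back the extra
# cost of a managed PDU over an unmanaged one.
def break_even_calls(pdu_price_premium, call_cost = SMART_HANDS_CALL_COST)
  (pdu_price_premium / call_cost).ceil
end

puts break_even_calls(500) # hypothetical 500 dollar premium => 3
```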

2. Physical Tracking

Often overlooked in early small installations is the importance of tracking your hardware, wires, switches, DIMMs, and power. There will be times that you have to explain to someone else what to do in the data center, and the more you have documented the easier communicating with someone else is.

I would personally recommend installing something like Racktables (http://racktables.org/). It’s fairly easy to set up, and makes life much simpler. If you think that’s too much, you can use a wiki, a spreadsheet, or Visio. You’re looking for something that you can send to someone in an emergency.

You should record what system interface is plugged into which port on your switch. You should give each Ethernet cable an id, and record the cable id associated with each connection. You should give each power cable an id, and you should record which socket you used on your PDU. Label your servers, and record the labels in your tracking solution. When you make changes, update your records to match.
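
If you go the homegrown route, the records don’t need to be fancy. Here is a minimal Ruby sketch of what you’d track and how you might turn it into instructions for someone standing at the rack. The server names, cable ids, switch names, and PDU sockets are invented examples, and the field names are my own choice.

```ruby
Connection = Struct.new(:server, :interface, :cable_id, :switch, :switch_port)
PowerFeed  = Struct.new(:server, :cable_id, :pdu, :socket)

# Invented sample inventory.
CONNECTIONS = [
  Connection.new("db01", "eth0", "E-0101", "sw-a", 12),
  Connection.new("db01", "eth1", "E-0102", "sw-b", 12),
  Connection.new("web01", "eth0", "E-0103", "sw-a", 13),
].freeze

POWER_FEEDS = [
  PowerFeed.new("db01", "P-0201", "pdu-1", 4),
  PowerFeed.new("web01", "P-0202", "pdu-1", 5),
].freeze

# Builds plain-text lines you could read to a remote technician.
def describe(server, connections, power_feeds)
  network = connections.select { |c| c.server == server }.map do |c|
    "#{c.server} #{c.interface}: cable #{c.cable_id} to #{c.switch} port #{c.switch_port}"
  end
  power = power_feeds.select { |f| f.server == server }.map do |f|
    "#{f.server} power: cable #{f.cable_id} to #{f.pdu} socket #{f.socket}"
  end
  network + power
end

puts describe("db01", CONNECTIONS, POWER_FEEDS)
```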

3. Plan Your Trips

If you’re going to the datacenter, make a plan and try to stick to it. It’s easy to get distracted, and leave things in a partially configured state, or not update your documentation. I always spend an hour or so planning my day at the datacenter.

You should arrange your day starting with the most critical tasks, and finishing with the least critical tasks. That way, if you get stuck on a task, you can skip things that aren’t as crucial. It also helps make sure that any task you start is one you can finish.

Think through what you’ll need on your trip. You may or may not be close to a place where you can buy an Ethernet cable or a power adapter for your laptop. If you need something, it’s best to have it with you in your bag or, worst case, in your car. You can easily lose 2 or 3 hours going on a shopping trip. If you’re missing something and it’s not a high priority, schedule the task for the next trip.

The most important thing to remember is that you need to have discipline. Keep track of things, plan ahead, and set yourself up for success. Physical infrastructure has different challenges, but they can be solved easily.