Trouble Tickets: Annoying, but Useful

If you work in operations, you have probably used a ticketing system or two. They are common across the industry, and every organization has its own particular workflow. In my younger days I loathed them, since they seemed to be an impediment to doing my job. Today, I’d describe myself as a reluctant fan. My goal in using them is to meet a few simple operational needs without placing an undue burden on your development and ops teams.

Change Control

The first and most important use of your ticket system should be to track changes. Do you need to tweak a router configuration, or perhaps install an additional 48 GB of RAM in your DB server? There should be a ticket.

These types of tickets help you reconstruct what happened when something changed. In a small environment (fewer than 30 servers and fewer than 2 team members), it is easy to keep track of changes, since the volume of activity is lower than in a large environment. But once you add a few team members and a couple hundred servers, it becomes very difficult to track down what happened from a few emails, particularly if there are a few time zones involved.

These types of tickets should be concise; they should contain the purpose of the change, when it was made, and any expected side effects. Ideally you should have a link to your revision control system (your commit message should include the ticket ID), with the current and previous versions listed in case someone needs to revert your change.
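One way to keep commits and tickets linked is to check each commit message for a ticket ID before accepting it. The sketch below is only an illustration: the ID format (an uppercase project key, a hyphen, and a number, such as OPS-1234) and the function names are assumptions, so adjust the pattern to whatever your ticket system actually produces.

```python
import re

# Assumed ticket ID format: uppercase project key, hyphen, number (e.g. "OPS-1234").
# This pattern is hypothetical; change it to match your own ticket system.
TICKET_ID = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def ticket_ids(commit_message: str) -> list[str]:
    """Return every ticket ID mentioned in a commit message, in order."""
    return TICKET_ID.findall(commit_message)


def references_ticket(commit_message: str) -> bool:
    """Return True if the commit message mentions at least one ticket ID."""
    return bool(ticket_ids(commit_message))


if __name__ == "__main__":
    print(ticket_ids("OPS-1234: raise max_connections on db01"))
    print(references_ticket("tweak router config"))
```

A check like this could run from a commit hook or a CI job, so that a change without a ticket is caught before it reaches production.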

Post-Mortem Evidence

When a critical production issue happens, step one should be filing a ticket. The initial ticket should be brief, for example “DB Master Down”, but still convey the issue. The goal is to start a clock that can be used to determine response time, and to help you reconstruct events when running a post-mortem on the production failure. These tickets should be created for all unexpected production events. They should be updated periodically, and should contain the resolution to the issue.

It feels wrong to file a ticket when “Rome is burning”, but keep in mind the commitments you’ve made. Most service level agreements (SLAs) contain language requiring that a ticket be created within X minutes of an incident. The ticket also gives your support folks or users a way to communicate the scope of the problem without interrupting work on a resolution. Once things are recovered, you should have the timeline of events for your post-mortem.
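The clock a ticket starts can be checked with simple timestamp arithmetic. The following is a minimal sketch, assuming you can export the detection, ticket creation, and resolution times from your monitoring and ticket systems; the function names are hypothetical, and the 15-minute window in the usage example is a made-up value standing in for whatever your own SLA specifies.

```python
from datetime import datetime, timedelta


def filed_within_sla(detected_at: datetime,
                     ticket_created_at: datetime,
                     sla_window: timedelta) -> bool:
    """Return True if the ticket was opened within the SLA window."""
    return ticket_created_at - detected_at <= sla_window


def time_to_resolve(ticket_created_at: datetime,
                    resolved_at: datetime) -> timedelta:
    """Return how long the incident stayed open after the ticket was filed."""
    return resolved_at - ticket_created_at


if __name__ == "__main__":
    detected = datetime(2024, 1, 1, 3, 0)
    created = datetime(2024, 1, 1, 3, 10)
    resolved = datetime(2024, 1, 1, 4, 40)
    # Hypothetical SLA: ticket must exist within 15 minutes of the incident.
    print(filed_within_sla(detected, created, timedelta(minutes=15)))
    print(time_to_resolve(created, resolved))
```

The same timestamps, together with the periodic updates on the ticket, give you the skeleton of the post-mortem timeline.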

Making Sure It Happens

If your team is anything like most operations or devops groups, you have too much work to get done. Users may interrupt you, and developers may want you to change the production environment, but not all of it is critical. Some work just has to be put off until someone has enough time to do it properly.

Tasks that will take longer than 5 minutes, and that you aren’t going to do immediately, should be ticketed. The ticket should have a priority, and you may want to assign a due date if needed (something like replacing an SSL cert can be very low priority, but becomes critical once the due date has passed). Give a brief outline of what you expect needs to be done so that, if you’re lucky enough to have a colleague or two, they might be able to take the ticket from you.

If your team can manage it, you should triage these types of tickets once per day, ideally at the start of the day. This should allow you to come up with a plan of attack for the day, and help you shift expectations on projects that are being derailed by interrupt-driven or higher-priority work.
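A daily triage pass can be as simple as sorting the open tickets. Here is a rough sketch of one possible ordering, not a prescription: tickets whose due date has arrived come first, and everything else is ordered by priority and then by due date. The `Ticket` fields, the convention that 1 is the most urgent priority, and the choice to treat a ticket due today as already critical are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Ticket:
    title: str
    priority: int                 # assumption: 1 is the most urgent
    due: Optional[date] = None    # many tickets have no due date


def triage(tickets, today):
    """Order tickets for the day: due or overdue first, then by priority."""
    def sort_key(ticket):
        overdue = ticket.due is not None and ticket.due <= today
        # False sorts before True, so "not overdue" puts overdue tickets first.
        # Tickets without a due date sort after dated ones of equal priority.
        return (not overdue, ticket.priority, ticket.due or date.max)
    return sorted(tickets, key=sort_key)


if __name__ == "__main__":
    backlog = [
        Ticket("Upgrade kernel on web tier", priority=2),
        Ticket("Replace SSL cert", priority=4, due=date(2024, 3, 1)),
        Ticket("Fix flaky backup job", priority=1),
    ]
    for ticket in triage(backlog, today=date(2024, 3, 2)):
        print(ticket.title)
```

With this ordering, the low-priority SSL cert ticket jumps to the top of the list once its due date arrives, which matches how such tasks behave in practice.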

Things Tickets Aren’t For

As soon as people start using issue-tracking software, there is a desire to use it to determine employee effectiveness. While tempting, this should be avoided since most tickets assigned to operations groups have different levels of difficulty.

A co-worker of mine tells a story of working at a large ISP and being asked to generate this type of report. The newest and most junior employees were completing the largest number of tickets, since they handled reverse DNS requests, and the most senior employees completed the fewest, since their tickets often took several days to solve. Management was on the verge of reprimanding its entire senior network staff until it realized the reason for the difference.

Similarly, you shouldn’t hold a task open forever in your ticketing environment. While you may tell yourself this is being done to “track” the issue, you are creating noise. These types of issues should be closed as “won’t fix”, or scheduled for completion within a quarter.

Conclusion

Trouble ticket software makes sense, and you should use it. Try to think of the reason you’re creating the ticket, and see if you can keep the content of the tickets aligned with the purpose of the tickets.