hypergeometric: a place for me

  • SSH Do’s and Don’ts

    Do Use SSH Keys

    Whenever you can, use a key for SSH. Once you create it, you can distribute the public half widely to enable access wherever you need it. Generating one is easy:


    # note: DSA is deprecated in modern OpenSSH; prefer Ed25519 (or RSA)
    ssh-keygen -t ed25519

    Don’t Use a Blank Passphrase on Your Key

    This key is now your identity. Protect it. Select a sufficiently strong passphrase, and enter it when prompted. This is basic security, and it also allows you to “safely” move your keys between hosts without compromising the key’s security.

    Do Use Multiple Keys

    It’s probably best to use a few keys when setting up access from different hosts. This makes it possible to shut down a key without locking yourself out.
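    A minimal sketch, assuming one key per client machine (the file names and comments here are hypothetical):


    # one key for the laptop, one for the desktop
    ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_laptop -C "laptop"
    ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_desktop -C "desktop"

    Revoking the laptop then just means deleting its public key from authorized_keys; the desktop key keeps working.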

    Don’t Copy Your Private Key Around

    Remember, this is your identity and your authorization to access systems. It’s never a good idea to copy it from system to system.

    Do Use SSH Agents

    Enabling the ssh agent on your laptop or desktop can save you from the tedium of passphrase entry. Launching the agent is easy; then you just need to add key files to it.


    # starts the agent, and sets up your environment variables
    exec ssh-agent bash
    # add your identities to the agent by using ssh-add
    ssh-add

    Don’t Leave Your Agents Running After You Log Out

    If you leave your agent running, it’s like leaving your keys in a running car. Anyone who can gain access to your agent can assume your identity.
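    One way to avoid this (a sketch, assuming one agent per login session) is to kill the agent from ~/.bash_logout:


    # ~/.bash_logout: kill the ssh-agent this session started
    if [ -n "$SSH_AGENT_PID" ]; then
      ssh-agent -k > /dev/null
    fi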

    Do Make A Custom ~/.ssh/config

    You’ll find from time to time that you need special settings. You have a few options, like entering a very long command string every time, or creating a custom ~/.ssh/config file. I use this for short hostnames when I’m on a VPN, or when my username on my system doesn’t match my account on the remote system.


    # A quick wildcard example
    Host *.production
        User geoffp
        IdentityFile ~/.ssh/prod_id_dsa
        ForwardAgent yes

    # Shortening a host’s name,
    # so “ssh my-short-name” will work
    Host my-short-name
        User gpapilion
        ForwardAgent yes
        HostName my.fully.qualified.hostname.com

    Do Use ForwardAgent

    This approximates single sign-on using ssh keys. As long as agent requests are forwarded back to your original host, you should never be prompted for a passphrase. I set my ~/.ssh/config to do this, but I’ll also use ssh -A on remote systems to keep from reentering credentials.

    *** EDIT ***

    I’ve received a lot of feedback about this point. Some people have pointed out that agent forwarding should not be used on untrusted systems. Essentially, your agent will always answer a forwarded agent request with the response to a challenge. If an attacker has compromised the system, or the file system’s enforcement of permissions is poor, your credentials can be used in a sophisticated man-in-the-middle attack.

    Basically, don’t ever SSH to non-trusted systems with this option enabled, and I’d extend that and say don’t ever log in to non-trusted systems.

    This article does a good job of explaining how agent forwarding works. This article on Wikipedia explains the security issue.

    Don’t Only Keep Online Copies of Your Keys

    Keep an offline backup. You may need to recover a private key, and it’s always good to keep an offline copy for an emergency.
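    One sketch of an approach, assuming gpg is available (the archive name is arbitrary):


    # encrypt a copy of ~/.ssh with a symmetric passphrase, then store it offline
    tar czf - ~/.ssh | gpg --symmetric --output ssh-backup.tar.gz.gpg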


  • Technical Debt Is Better Than Not Doing It

    It’s time to admit that sometimes it’s okay to incur technical debt, particularly when it comes to getting things done. So many times I’ve run into places with constipated operations environments or automation processes, all because something is hard to do automatically.

    If you can’t automate it, don’t block all other tasks because of one issue. It’s better to have a partially automated solution than none at all. Just make sure you document it, and come back later when you have more time. Don’t let your tools be your excuse for not doing it; it only makes you look bad.


  • User Acceptance Testing for Successful Failovers

    Things fail; we all know that. What most people don’t take into account is that things fail in combination and in unexpected ways. We spend time and effort planning redundancy and failover schemes to seamlessly continue operations, but often neglect to fully test these plans before rolling services and equipment into production. What inevitably happens is that the service fails, because the failover plan never worked, or hadn’t considered what issues might arise while failing over. So, borrowing the concept of User Acceptance Testing (UAT) from software development, we can develop a system of tests that lets us feel confident our redundancy plans will work when we need them.

    Test Cases

    Build a test plan; it’s that simple. Start by identifying the dependent components of your system, then look at all the typical failure scenarios that may happen in those components. If you have two switches, what happens if one dies? Bonded network interfaces: what happens if you lose an uplink on one of your switches?

    After you identify the failure scenarios, specify the expected behavior for each scenario. If a switch dies, network traffic should continue to be sent through the remaining switch. If interface one loses its ability to route traffic, interface two should become the primary interface in the bond.

    Combining the two pieces should give you a specification of how you expect the system to behave in the face of these failures. You can organize these any way you want, but I typically use a user-story-like format to describe the failure and expected outcome.

    Example test cases:

    • Switch 1 stops functioning
      • Switch 2 takes over VRRP address
      • Switch 2 passes traffic with minimal interruption, within 3 seconds.
      • Nagios alerts that switch 1 has failed
    • App server loses DB connection
      • load-balancer detects error, and removes host
      • load-balancer continues to pass traffic to other app-servers
      • Nagios alerts that app-server has failed

    Once you’ve completed your plan, get buy-in for it. You’ll want a few of your peers to review it and look for any failures you may have missed. Once you have agreement that this is the right test set, it’s time for the next step.

    Writing Artificial Tests

    Start brainstorming ways to test failure modes. Simple, non-destructive tests are best: emulate a switch failure by unplugging the switch; a host’s network interface fails, block its port on the switch; a system freezes, block the load balancer from connecting to it with a host-level firewall. You may want to take things a step further, like pulling a disk to test RAID recovery.
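    As a sketch of that last firewall trick (the load-balancer address and port here are hypothetical), you can make an app server look frozen and then restore it:


    # drop traffic from the load balancer (10.0.0.5) to the app port
    iptables -A INPUT -s 10.0.0.5 -p tcp --dport 8080 -j DROP
    # ...observe the failover behavior, then remove the rule
    iptables -D INPUT -s 10.0.0.5 -p tcp --dport 8080 -j DROP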

    Remember, you’re trying to test your failover plans, and you shouldn’t be terribly concerned if you break a configuration in the process, because that may happen when something really goes down. Write down all the steps of each test, and it’s also a good idea to write down how to get back to the known state.

    Review your test cases, and make sure you have tests that address each failure mode. If it’s impossible to test a scenario, note it and exclude it from your UAT. Once you’ve done that, you’re ready to test.

    Performing the Tests

    Anyone involved in day-to-day technical operations should be able to run through the tests. It’s not a bad idea to have the whole team participate, so that people get used to seeing how the system behaves when components are failing. Step through the tests methodically, and record whether each test passed or failed, and how the system behaved during the process. For example, if you’re testing the failure of an app server, did any errors show up on HTTP clients, and if so, for how long?

    Failing

    This is going to happen, and when it does, it’s time to figure out why. First, was this a configuration error, or an artifact of a previous test? If so, fix it, update your test plan, and start testing again. Did your redundancy plan have a fatal flaw? That’s OK too; that’s why we test. If you missed something in your plan, address the issue and restart the tests from scratch. You’re much better off catching problems in UAT than after you’ve pushed the service to production.

    Passing

    Keep a copy of the UAT somewhere, so if questions come up later you can discuss it. I use wikis for this, but any document will do. Once you have that sorted, you can roll your fancy new service into production.

    Summary

    UAT is a useful concept for software development, and it’s just as useful for production environments. Take your time and develop a good plan, and you’ll end up with longer uptimes and met SLAs. As an added bonus, you gain experience seeing how your equipment and instances behave when something has gone wrong.


  • Solr Query Change Beats JVM Tuning

    I’ve been spending the last few days at work trying to improve our search performance, banging my head against the dismax query target and parser in Solr. For those not familiar with dismax, it’s a simplified parser for Solr that eliminates the complexity of the standard query parser. Instead of search terms like “field_name:value” you can simply enter “value”, but you can no longer search for a specific term in a specific field.

    Our search index has grown by 20% in the last few months, and our JVM and Solr setups were beginning to groan under the weight of the data. I went through a few rounds of JVM tuning, which reduced garbage collection time to less than 2%, and with some Solr configuration options I managed to bring our typical query back under 5 seconds. This felt like a major win, until I adjusted the query.

    Looking at our query parameters on search, I noticed we were using the “fq” parameter to specify the id of the particular site we were looking for. These queries were taking anywhere from 5-15 seconds across our 360GB index, and I suspected we were pulling data into the JVM only to filter it away. The garbage collection graphs seemed to indicate this as well, since we had a very slowly growing heap, and our eden space was emptying very quickly even with 20G allocated to it. When I changed from dismax to the standard target and specified the site id, my search time went from 5 seconds to 0.06 seconds, so I started reading, and came across an article on nested queries. My idea was that this would let me apply a constraint to the initial set of data returned, using the standard search target, and then perform a full-text search using dismax, achieving the same results.

    Original query (grossly simplified):
    http://search-server/solr/select?fl=title%2Csite_id%2Ctext&qf=title%5E7+text&qt=dismax&fq=site_id:147&timeAllowed=2500&q=SearchTerm+&start=0&rows=20

    Becomes the following nested query:
    http://search-server/solr/select?fl=title%2Csite_id%2Ctext&qf=title%5E7+text&timeAllowed=2500&q=site_id:147+_query_:%22{!dismax}SearchTerm%22&start=0&rows=20
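    URL-decoded, the q parameter of the nested version reads:


    q=site_id:147 _query_:"{!dismax}SearchTerm"

    The site_id constraint is handled by the standard parser, and only the quoted sub-query goes through dismax.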

    Original Query Time : 5 seconds
    Nested Query Time : 87 milliseconds

    Both return identical results. So, if you’re performing a query against a large index and want to use dismax, try a nested search. You’re likely to see much better performance, particularly if you’re filtering on a facet, and it gives you a relatively easy way to specify the value of a field while still using a dismax query.


  • Language Importance for DevOps Engineers

    First and foremost, this is a biased article. These are all my opinions, and they come from my working experience.

    Bash (or POSIX shell)

    Importance 10/10

    If you’re working with *nix and can’t toss together a simple init.d script in 5 minutes, you haven’t done enough bash. It’s everywhere, and it should still be your first automation choice. It has a simple syntax, and it’s designed specifically to execute programs non-interactively. You’ll be annoyed that it lacks unit tests and complex error handling, but it’s purpose-built to automate administrative tasks.
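    For reference, a minimal init.d-style sketch (the daemon path and pid file are hypothetical):


    #!/bin/bash
    # /etc/init.d/mydaemon: start/stop wrapper for a hypothetical daemon
    case "$1" in
      start)   /usr/local/bin/mydaemon & echo $! > /var/run/mydaemon.pid ;;
      stop)    kill "$(cat /var/run/mydaemon.pid)" && rm -f /var/run/mydaemon.pid ;;
      restart) "$0" stop; "$0" start ;;
      *)       echo "Usage: $0 {start|stop|restart}"; exit 1 ;;
    esac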

    Perl

    Importance 9/10

    This is the language you will run into if you work in operations. There will be backup scripts, nagios tests, and a large collection of digital duct tape written by co-workers, all doing very important jobs. Its syntax is ugly, and you may find yourself writing an eval to handle exceptions, but it’s everywhere. CPAN makes it fairly easy to get things done, and you can’t beat it for string handling.

    C/C++

    Importance 5/10

    This is the Latin of the *nix world, and is basically portable assembly language. I refrain from writing C whenever possible, since I rarely need the raw performance, and the security and stability consequences are pretty severe. You should understand the syntax (it’s in the ALGOL family) and be able to read a simple application. It would be great if you could submit a patch to an open-source project, but I would never turn down an ops hire because they didn’t know C well enough.

    PHP

    Importance 7/10

    PHP more important than C?! Yep. Like Perl, it’s everywhere; people use it for prototype webapps and for full-blown production systems. It’s another ALGOL-syntax language, except you can put together a simple web page in 2-3 minutes; it’s almost as magical as the Twilio API. You’ll find yourself poking at it on more than one occasion, so you might as well know what you’re doing.

    Ruby

    Importance 6/10

    Doing something with puppet or chef? You should probably know some Ruby; in fact, it’s probably more important to know Ruby than chef or puppet. It’s relatively easy to pick up, and many of the automation tools people love are written in it. As an extra bonus, you can write rails and sinatra apps. It’s good to have in your back pocket.

    Python

    Importance 4/10

    People love to love Python, but the truth is that it’s a bit of a diva. It’s a language that favors reading over writing, and it has a very bloated standard library with plenty of broken components (which is the right http library to use?). It wants to be a simpler Perl, but I never find it as useful, and it always takes longer. I know a lot of companies say they want it as their “scripting” language, but in practice I’ve not seen the value (I still want to rewrite everyone’s code).

    Chef/Puppet

    Importance 2/10

    These are DSLs for configuration management. They are supposed to be simple to learn, and if you can’t figure them out with a web browser and a few minutes, they are failing.

    Java

    Importance 6/10

    More ALGOL syntax, and more prevalent in high-scale web applications. Minimally you should be able to read the language, but it’s useful to be able to pound out a few lines of Java. It has many rich frameworks, and you’ll likely find it sneaking into your stack where you need something done fast. It’s also really useful when it comes time to tune the JVM.

    Haskell

    Importance 0/10

    When I run into it running someplace serious, I’ll update its score.

    JavaScript

    Importance 8/10

    I hate this language, but I can’t deny its growing importance. It’s more common in the web browser, but it’s starting to creep into the backend with things like node.js. If you understand JavaScript, you can help resolve whether an issue is a frontend or a backend problem; you’ll have total stack awareness.

    SQL

    Importance 10/10

    You have to know SQL. You will work with SQL databases, and you will want to move things in and out of them. You may want to know a dialect like MySQL’s very well, but at a minimum you should understand the basics and be able to join a few tables or write an aggregate query.
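    As a baseline, a join plus an aggregate like this (run here through the mysql client; the database, table, and column names are all made up) shouldn’t slow you down:


    # orders per user, most active first
    mysql -u reporter example_db <<'SQL'
    SELECT u.name, COUNT(o.id) AS order_count
    FROM users u
    JOIN orders o ON o.user_id = u.id
    GROUP BY u.name
    ORDER BY order_count DESC;
    SQL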


  • Stupid Bash Expansion Trick

    I got asked a question regarding filename expansion in bash the other day, and was stumped. It turns out to be something I should have considered a long time ago, and will always keep in mind when writing a script.

    Question 1:

    What does the following script do if there is a file abc in the current directory?

    #!/bin/bash
    for i in a*
    do
      echo $i
    done

    Answer:

    Here, a* matches abc and expands to abc, so the script outputs:
    abc

    Question 2:

    What if you run the same script in a directory without any files?

    Answer:

    The script outputs:

    a*

    Why?

    According to The Bash Reference Manual:

    Bash scans each word for the characters ‘*’, ‘?’, and ‘[’. If one of these characters appears, then the word is regarded as a pattern, and replaced with an alphabetically sorted list of file names matching the pattern. If no matching file names are found, and the shell option nullglob is disabled, the word is left unchanged.

    So bash will output ‘a*’, because that is how filename expansion works.

    Question 3:

    What if you run the following script in a directory with no filenames beginning with a?

    
    #!/bin/bash
    for i in a*
    do
      echo /usr/bin/$i
    done
    

    Answer:

    The script outputs:


    /usr/bin/a2p /usr/bin/a2p5.10.0 /usr/bin/a2p5.8.9 /usr/bin/aaf_install /usr/bin/aclocal /usr/bin/aclocal-1.10 /usr/bin/addftinfo /usr/bin/afconvert /usr/bin/afinfo /usr/bin/afmtodit /usr/bin/afplay /usr/bin/afscexpand /usr/bin/agvtool /usr/bin/alias /usr/bin/allmemory /usr/bin/amavisd /usr/bin/amavisd-agent /usr/bin/amavisd-nanny /usr/bin/amavisd-release /usr/bin/amlint /usr/bin/ant /usr/bin/applesingle /usr/bin/appletviewer /usr/bin/apply /usr/bin/apr-1-config /usr/bin/apropos /usr/bin/apt /usr/bin/apu-1-config /usr/bin/ar /usr/bin/arch /usr/bin/as /usr/bin/asa /usr/bin/at /usr/bin/atos /usr/bin/atq /usr/bin/atrm /usr/bin/atsutil /usr/bin/autoconf /usr/bin/autoheader /usr/bin/autom4te /usr/bin/automake /usr/bin/automake-1.10 /usr/bin/automator /usr/bin/autoreconf /usr/bin/autoscan /usr/bin/autospec /usr/bin/autoupdate /usr/bin/auval /usr/bin/auvaltool /usr/bin/awk

    Why?

    Because you’re re-evaluating ‘/usr/bin/$i’, which is now ‘/usr/bin/a*’, and that expands to the ordered list above under the shell’s filename expansion rules. If you want to avoid this, you need to protect your variables with quotes. Here is the safe version of the script:

    
    #!/bin/bash
    for i in a*
    do
      echo /usr/bin/"$i"
    done
    

    Just something simple to think about when writing your bash scripts: expect to enter loops on globs that don't match anything, always protect your variables, and consider setting the failglob option in your scripts.
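    For reference, failglob turns an unmatched glob into an expansion error instead of letting the pattern pass through literally (nullglob, which expands it to nothing, is the other common choice):


    #!/bin/bash
    # with failglob set, an unmatched a* raises an error
    # instead of echoing /usr/bin/a*
    shopt -s failglob
    for i in a*
    do
      echo /usr/bin/"$i"
    done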


  • Dealing with Outages

    No matter what service you’re building, at some point you can expect to have an outage. Even if your software is designed and scaled perfectly, one of your service providers may let you down, leading to a flurry of calls from customers. Plus, the internet has many natural enemies (rodents, backhoes, and bullets), and you may find yourself cut off from the rest of the net with a small twist of fate. Don’t forget, even the best detection and redundancy schemes fail, and it’s not unusual for your first notification of an outage to come from a rather upset customer. Your site will go down, and you should be prepared for it.

    Initial Customer Contact

    Your customer is upset. You’ve promised to provide some service that is now failing, and they are likely losing money because of your outage. They’ve called your support line, or sent an email, and they are looking for a response. What do you do?

    Give yourself a second

    Outages happen on their own schedule, and you may be at a movie, asleep, at the gym, or eating dinner at the French Laundry. You need to give yourself 2-3 minutes to compose yourself, find internet access, and call the customer back. If you have an answering service, you’ve likely met the terms of your SLA; if you don’t, figure out how much time you can take. I think an answering service is a better option than voicemail, since it handles any issues you may have communicating with a customer in the first few minutes of the call. They may even be able to file a ticket for you with the basic information you need. This can cost a fair bit of money, and if this option is too pricey for your service, consider a voicemail number that pages your on-call team. It gives your team a small buffer, but they have to be prepared to talk to the customer quickly, since this may add up to 5 minutes between the initial call and the page. As a last resort, have your customer support number dial someone who is on-call. If you have the time and resources, make the email address you use for outage reports follow the same workflow as calls, so you don’t need a second process.

    Promises Can Be Hard to Keep

    Track your customer’s complaint; make sure it’s recorded in your ticketing system. You want to keep a record from the moment they called you, and be able to reconstruct the incident later. This will also help you determine a start time for any damages clause that may be in your SLA. I’d make sure the following things are done:

    • Get a call back number.
    • Let them know you are looking into the issue.
    • Let them know when you expect to call them back.
    • Let them know the ticket / incident number you are using to track the issue.
    • And most importantly, don’t promise anything that you can’t guarantee will happen.

    Have you met the terms of your SLA?

    You only have one SLA, right? If not, hopefully the basics are the same. Keep in mind what you’ve agreed to with your customers, and identify as early as possible whether you’ve failed to meet the terms of the service agreement. This is really just for tracking, but it can be useful if you have to involve an account manager and discuss any damage claims.

    Houston, we don’t have a problem.

    You’ve talked with the customer, you’ve created a ticket, you’ve managed expectations; now it’s time to figure out if there is an issue.

    • Check your internal monitoring systems.
    • Check your external monitoring systems.
    • Check your logging.
    • Check your traffic.
    • Give your customer’s use case a try.

    Does your service look OK, or do you see a problem? At this point you want to figure out whether you have an issue or not. If you can’t figure it out quickly, escalate to someone who can. If you don’t have an issue, call the customer, see if they’re still having problems, and ask if they’ll agree to close the issue. If they are still having issues, escalate; if you have doubts as to whether your service is working, escalate. If you know you have an issue, it’s time to move on to resolving it.

    Who Needs to Know?

    It’s important to let everyone on your team know your service is having issues. Before anything happens, you should know who to contact when there is an issue. This will save time and help minimize duplicated work (in larger organizations, two people may be receiving calls about the same issue). A mail group or centralized chat server is an ideal solution, since it’s fairly low latency and records the communication for later review. Be clear about what the problem is, and provide a link to the ticket.

    Who has your back?

    The next thing to work out is who you need to solve the issue. Your product could be simple, or fairly complex. You may be the right person to address the problem, or you may need to call for backup. If you have an idea of who you need, get in touch with them now, and get them ready to help you solve the problem. It takes quite a bit of time to get people online, so if you might need their help, it’s better to call them sooner rather than later.

    Herding Cats

    Finally, now that you’ve let everyone know and you have a team assembled to solve the issue, figure out how you’re going to communicate. The method should be low latency and low effort. I prefer conference calls, but a chat server can work just as well, plus you can cut and paste errors into the chat. You should have this figured out well in advance of an incident.

    Come on you tiny horses!

    You’re ready to fix the problem. Just a few more things you should have figured out:

    • Who is doing the work?
    • Who is communicating with your customer?
    • Who is documenting the changes made?
    • Who will gather any additional people needed to resolve the issues?

    This could have an easy answer if you only have one person, but working through almost any issue is much easier with two people. Ideally one person acts as the project manager, getting extra help and talking to the customer, while the other types furiously in a terminal to bring the service back up. If you have this worked out beforehand you’ll save some time; if you don’t, come to an agreement quickly, and stick to your roles. You don’t need two people talking to your customer telling them different things, or worse, two people bringing a service up and down.

    So you’re finally back up…

    Great, only a few more things to do.

    Open a ticket for the post-mortem. Link it to your outage ticket, and begin filling in any information that might be helpful. Be as detailed as possible, and even if it’s inconvenient, take a little time to document the issue and resolution. You should also immediately schedule a post-mortem meeting that takes place within the next 24 hours. People begin to forget what they did, and you need to capture as much of it as possible.

    Once you’ve completed your meeting, produce a document explaining the outage. This should be as brief as possible, with very little internal information included. Document the timeline leading to the issue, how the issue was discovered, and how it was resolved. Also, build a list of future actions to prevent a repeat of the incident. If your customer asks for it, or their SLA includes language promising it, send them the document explaining the outage.

    So, spend time thinking about talking to your customer when you’re down. Think through the process, so when they call you won’t have to make it up. I’ve set up several of these processes, and these are the issues that always need attention. It’s worth the planning, and it’s always important to look at what happened, so that you can improve the process.


  • Operations On-Call Go-Bag

    A go-bag (or bug-out bag) is commonly discussed in terms of emergency preparedness; it’s a bag containing the things you would need to survive for 72 hours. For survival, you would be looking at things like food, water, a medical kit, and some basic tools. The idea is that you may have to leave quickly in an emergency, without time to gather supplies. We can borrow this useful concept to build a kit you should be carrying with you when you’re on-call.

    The Basic On-Call Go-Bag

    Cell Phone

    I know I shouldn’t have to say this, but if you’re on-call you’re going to need a cell phone. I would recommend setting it to an audible ring-tone, something obnoxious that you wouldn’t miss if it rings.

    You should have any numbers you’ll need when on-call programmed into your phone. Obvious numbers you should have are things like your datacenter’s NOC, your data carrier’s NOC, and any vendor who is involved in the delivery of your service. This should probably include any developers, team leads, and managers you may need to call.

    Mobile Hotspot or Tethering Plan

    There is a chance you may be somewhere without WiFi; plan ahead and make sure you can stay connected. It’s a terrible feeling to get an SMS and not be able to do anything for 20-30 minutes.

    Charger for Cellphone

    If you need to use your phone to talk, you’re going to use more battery. If you’re also planning to use your phone as a tether or mobile hotspot, you’ll be eating through your battery very quickly, so it’s best to carry one of these.

    Headset for Cell Phone

    If you end up on the phone, you’ll probably want your hands for other things. A headset keeps both hands on the keyboard, lets you easily take notes, and keeps your smartphone out of harm’s way.

    Computer

    Well, you are going to fix servers, right? Make sure you can connect to your production instances, your documentation servers, your VPN, and your mobile internet. Try all of these things ahead of time, so that you’re not caught figuring them out on the go. Make sure you have the following:

    • Production Access (VPN or bastion host access)
    • SSH keys, if needed
    • Offline copies of some documentation in case your documentation servers are inaccessible.
    • Links to bug-tracking and monitoring servers/services.

    Laptop Power Adapter

    You should be prepared to be on your computer for a while. It can be pretty easy to run out of power when you’re on the go.

    Paper Notebook

    Sometimes it helps to write things down. You may not be in a good place to file tickets and communicate what is happening. It can also be a little faster to scribble a note on a piece of paper, and it may help you reconstruct things later. As an extra bonus, you can take down a phone number if you need to.

    Pen

    A notebook is pretty useless without one.

    Data Center Badge (if needed)

    If you have physical hardware, you may have to drive to the datacenter. It’s better to have your badge with you in an emergency than to have to stop at the office or home first.

    Conclusion

    It’s not a ton of equipment, but it’s enough to get the job done. What you’re looking to do is eliminate anything that might stop you from resolving the issue.


  • Trouble Tickets: Annoying, but Useful

    If you work in operations, you’ve probably used a ticketing system or two. They are common across the industry, and every organization has its own particular workflow. In my younger days I loathed them, since they seemed to be an impediment to doing my job. Today, I’d describe myself as a reluctant fan. My goal in using them is to meet a few simple operational needs without introducing undue burden on your development and ops teams.

    Change Control

    The first and most important use of your ticketing system should be tracking changes. Do you need to tweak a router configuration, or perhaps install 48 additional GB of RAM in your DB server? There should be a ticket.

    These types of tickets help you reconstruct what happened when something changed. In a small environment (fewer than 30 servers and one or two team members), it is easy to keep track of changes, since the volume of activity is lower than in a large environment. But once you add a few team members and a couple hundred servers, it becomes very difficult to track down what happened through a few emails, particularly if there are a few time zones involved.

    These types of tickets should be concise; they should contain the purpose of the change, when it was made, and any expected side effects. Ideally you should have a link to your revision control system (your commit message should include the ticket id), with the current and previous versions listed in case someone needs to revert your change.
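    For the commit-message half, that might look like this (the ticket id and message are made up):


    git commit -m "OPS-1234: raise nginx worker_connections from 1024 to 4096"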

    Post-Mortem Evidence

    When a critical production issue happens, step one should be filing a ticket. The initial ticket can be brief, for example “DB Master Down”, as long as it conveys the issue. The goal is to start a clock that can be used to determine response time, and to help you reconstruct events when running a post-mortem on the production failure. These tickets should be created for all unexpected production events, updated periodically, and should contain the resolution to the issue.

    It feels wrong to file a ticket while “Rome is burning”, but keep in mind the commitments you’ve made. Most service level agreements (SLAs) contain language requiring that a ticket be created within X minutes of an incident. A ticket also gives your support folks or users a way to communicate the scope of the problem without interrupting work on a resolution. Once things are recovered, you’ll have the timeline of events for your post-mortem.

    Making Sure It Happens

    If your team is anything like most operations or devops groups, you have too much work to get done. Users may interrupt you, developers may want you to change the production environment, but not all of it is critical. Some work just has to be put off until someone has enough time to do it properly.

    Tasks that take longer than 5 minutes and that you aren’t going to do immediately should be ticketed. The issue should have a priority, and you may want to assign a due date if needed (things like replacing an SSL cert can be very low priority but become critical once the due date has passed). Give a brief outline of what you expect needs to be done, so if you’re lucky enough to have a colleague or two, they might be able to take the ticket from you.

    If your team can manage, you should triage these types of tickets once per day, ideally at the start of the day. This should allow you to come up with a plan of attack for the day, and help you shift expectations on projects that are being derailed by interrupt driven or higher priority work.

    Things Tickets Aren’t For

    As soon as people start using issue-tracking software, there is a desire to use it to measure employee effectiveness. While tempting, this should be avoided, since tickets assigned to operations groups have very different levels of difficulty.

    A co-worker of mine tells a story of working at a large ISP and being asked to generate this type of report. The newest, most junior employees were completing the largest numbers of tickets, since they handled reverse DNS requests, and the most senior employees had the fewest, since their tickets often took several days to solve. Management was on the verge of reprimanding the entire senior network staff until they realized the difference in the pattern.

    Similarly, you shouldn’t hold a task open forever in your ticketing environment. While you may tell yourself this is being done to “track” the issue, you are creating noise. These types of issues should be closed as won’t-fix, or scheduled for completion within a quarter.

    Conclusion

    Trouble-ticket software makes sense, and you should use it. Think about why you’re creating each ticket, and try to keep the content of your tickets aligned with that purpose.


  • Two Quick Chef Gotchas

    Configuration management is a hot topic these days. Chef is one of the more popular choices, and does a fairly good job helping you maintain consistent configuration across your environment. That said, it isn’t foolproof. I’ve outlined two common scenarios in which you might introduce a configuration issue.

    Removing a File, Package, User, or Any Chef-Managed Resource

    There are a few cases when using Chef where you will end up with an unintentionally installed package, user, file, or other resource. Typically this happens when modifying a recipe to remove a resource. Let’s say you have a recipe that installs three packages:

    package "a" do
      action :install
    end
    
    package "b" do
      action :install
    end
    
    package "c" do
      action :install
    end

    You may want to remove “package b”, so you might remove it from the recipe:

    package "a" do
      action :install
    end
    
    package "c" do
      action :install
    end

    This, however, will leave “package b” installed, but unmanaged, across all the nodes running this recipe. Chef is no longer responsible for “package b”, and won’t take any action once it’s been removed from the recipe. In cloud environments, new instances and old instances will now have mismatched configurations, and you may see dependency issues across instances.

    The proper way to remove a previously chef-managed package is the following:

    package "a" do
      action :install
    end
    
    package "b" do
      action :remove
    end
    
    package "c" do
      action :install
    end

    If you want to remove the “package b” code from your recipe, wait until you have confirmed all nodes have removed the desired package, and then delete the lines from your recipe.

    IMPORTANT:

    If you are using chef to manage users, make sure chef removes your users for you; otherwise they will continue to have access. The same goes for any chef-managed resource (cron jobs, files, etc.): once chef is in control, let chef remove or uninstall the resource.
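    For example, removing an account the chef way mirrors the package case above (the username is made up):


    user "departed_admin" do
      action :remove
    end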

    Resource Definitions in Loops

    I see people use loops to create resources in recipes. Most of the time this is done for file creation or for executing an external process. I came across something strange a few weeks back:

    servers = %w{server-a server-b}
    servers.each { |server|
      execute "server-command-add" do
        not_if "/usr/bin/add-server-to-something exists #{server}"
        command "/usr/bin/add-server-to-something add #{server}"
      end
    }

    Chef does something fairly unexpected here: the second command will not execute, because a not_if condition on the second resource is always met. The execute resource for “server-b” ends up with two not_if conditions (“/usr/bin/add-server-to-something exists server-a” and “/usr/bin/add-server-to-something exists server-b”). Chef copies attributes from the first execute resource defined under that name, and concatenates the additional not_if conditional into an array. Because not_if and only_if are defined as arrays, Ruby copies an array reference from the first resource to the second resource.

    It is unclear whether this is intentional, but you should be aware of this issue when writing chef recipes. The best way to execute this pattern is to give each resource a unique name, like so:

    servers = %w{server-a server-b}
    
    servers.each { |server|
      execute "server-command-add-#{server}" do
        not_if "/usr/bin/add-server-to-something exists #{server}"
        command "/usr/bin/add-server-to-something add #{server}"
      end
    }

    Conclusion

    These are just two examples, and I’m sure there are plenty of others. When using automation tools, remember to check whether you achieved the results you expected; never blindly trust the tool.

