hypergeometric: a place for me

  • Just Enough Ops for Devs

    A few weeks ago I was reading through the Chef documentation and came across the page “Just Enough Ruby for Chef”. This inspired me to put together a quick article on how much Linux a developer needs to know. I'm going to be doing this as a series, putting out one of these a week.
     

    How to use and generate SSH keys

    I've covered how to create them here, but you should know how to create, distribute, and change SSH keys. This will make it easier to discuss access to production servers with your ops team, and will likely make things easier when you use services like GitHub.
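
    A minimal workflow might look something like this (the hostnames and usernames are hypothetical):

    # generate a new key pair (accept the defaults, pick a passphrase)
    ssh-keygen -t rsa -b 4096

    # copy the public key to a server you already have access to
    ssh-copy-id deploy@app01.example.com

    # to rotate keys, generate a new pair and remove the old public key
    # from ~/.ssh/authorized_keys on each server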
     

    How to use |

    If you've used Unix for some time, you might be familiar with this. The pipe, or |, can be used to send the output from one process to another. Here's a good example of its usage:
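
    Something along these lines (the log path is just an example):

    # count how many requests came from a given IP address
    grep '10.0.0.5' /var/log/apache2/access.log | wc -l

    # grep's output is "piped" into wc instead of being printed to the screen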

     

    How to use tar

    Tar is one of those basic Unix commands that you need to know. It's the universal archiving tool for *nix systems (similar to zip for Windows). You should know how to create an archive and expand an archive. I'm only covering this with compression enabled; if you don't have gzip, or don't want it, omit the z option.
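
    A quick sketch, with hypothetical file names:

    # create a gzip-compressed archive of the mysite directory
    tar -czvf mysite.tar.gz mysite/

    # expand it again
    tar -xzvf mysite.tar.gz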


     

    The file command

    File is magic. It will look at a file and give you its best guess as to what it is. Usage is:
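
    For example (the path is hypothetical):

    # prints file's best guess, e.g. "ASCII text" or "gzip compressed data"
    file /tmp/mystery-download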

     

    The strings command

    Ever want to read the strings from a binary file? The strings command will do this for you. Just run “strings <filename>” and you'll get a dump of all the strings from that file. This is particularly useful when looking for strings in old PCAP files, or if a binary file has been tampered with.
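
    For instance (the file name is hypothetical):

    # dump printable strings from a binary and page through them
    strings /tmp/suspicious.bin | less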

     

    How to use grep

    Grep can be used to extract lines of text from a file or stream matching a particular pattern. This is a really rich command, and should have a whole article dedicated to it. Here are some very simple use cases.

    To match a pattern:
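
    For instance, against a hypothetical application log:

    grep 'ERROR' /var/log/app.log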

    To pull all lines not matching a pattern:
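
    Again with a hypothetical log file:

    grep -v 'DEBUG' /var/log/app.log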

     

    How to count lines in a file

    The wc command will count the lines, words, and bytes in a file. The default options return all three; if you only want to count the lines in a file, use the -l option, which outputs only the line count. Here is an example:
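
    With a hypothetical log file:

    # prints the number of lines followed by the file name
    wc -l /var/log/app.log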

     

    Count the unique occurrences of something

    It might seem like it's out of the reach of bash, but you can do this with a simple one-liner. You just need to type:
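
    The classic pipeline (shown here against a hypothetical log file) is:

    sort /var/log/app.log | uniq -c | sort -n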

    This counts all the unique line occurrences, and then sorts them numerically.

     

    Following the output of a file with tail

    Tail is a very useful command; it will output the last 10 lines of a file by default. But sometimes you want to continuously watch a file. Fortunately tail can solve this for you. The -f option will print new lines as they're added. Example:
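
    With a hypothetical log file:

    tail -f /var/log/app.log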

     

    I'll follow this up a week from now with more Linux for devs. Hopefully you found this useful.


  • Thanks Mr. Jobs, But it Seems I Can Use a Linux Laptop Now

    So, back in 1997 I installed my first copy of FreeBSD. I had to do some major research to get X Windows up and running, and for the next computer I bought I very carefully selected a video card to make things easier. I was happy: I was able to use gcc, but getting online via a 56k modem could be a bit of a chore.

    So long little devil…

    In early 1998 I started using Red Hat Linux. I could play MP3s, and easily run things like RealPlayer and Mathematica. My copy of Netscape Navigator was every bit as good as my Windows copy. However, I was too young to appreciate LaTeX, and needed a word processor to write papers. I tried every word processor I could find, but alas, they all sucked. So I had to dual boot Linux and Windows.

    The Sun Also Rises

    In 1999, I had a Sun SPARCstation 5. I used it for a few years with little difficulty. At the time I used mutt for email and Netscape when I needed a browser. Cut and paste was still questionable, and viewing a Word or Excel doc took more work than I cared to admit. But the world was starting to change.

    The Sun Also Sets

    I was getting HTML email constantly. I got more and more attachments, and my boss was asking for better calendaring. I would go to websites and get a pleasant JavaScript pop-up saying I needed IE.

    By 2001, I was using Windows full time. I needed Outlook, Word, and Excel. I wasn’t wild about it, but I could get things done.

    And, we have a new Contender

    In the spring of 2001 I bought my first Mac. It was a beautiful Titanium PowerBook G4 running OS 9. I could run my productivity apps, connect to my Windows shares, and still ssh to any Unix system that needed my attention.

    For the next 11 years I used Macs as my personal computers, and I used Windows PCs for work. In 2008 I got my first work Mac, and I found my happy place. I described it as having a Linux computer without the hassle of trying to run Linux on a laptop.

    In 2010 and 2011, I still used a Mac and told my co-workers who installed Ubuntu that they were wasting their time. They suffered through wireless problems, things like Bluetooth never worked, and battery life suffered. I couldn't understand why anyone wouldn't want to use OS X.

    Nothing is forever

    Two days ago I got my Dell XPS 13 as part of a Dell beta program called Project Sputnik. I got a special version of Ubuntu with some kernel patches, and some patched packages for sleep and hibernation. After an hour of struggling to make a bootable USB drive from my Mac for my Dell (it turned out to be an issue with the USB drive), I had a working computer. By 8pm I had my development environment set up, I had Chef up and running, and even my VPN was working. I was amazed.

    So far it's been good; most apps I use are web apps. I spend 70% of my time in a terminal, and 30% of my time in a web browser. Honestly, it's the perfect computer for me right now. So I'm waving goodbye to the ecosystem Mr. Jobs built, and moving to the world of Linux full time.

    ***EDIT***

    I realized after posting the article, and watching the response, that I should have mentioned I received a discount from Dell on the laptop (roughly 20%).


  • The Pitfalls of Web Caches

    At Wikia we've written a lot of code and used many different tools to improve performance. We have profilers, we've used every major Linux web server, APC, XCache, memcache, and Redis, not to mention all the work we've put into the application itself. But nothing gets us as much bang for our buck as our caching layer built around Varnish. We've used it to solve big and small problems, and for the most part it does its job and saves us hundreds of thousands of dollars per year in server costs, but that does not mean there aren't problems. If you're looking at using something like Varnish, here are a few things to keep in mind.

    Performance Problems

    When deploying your caching layer, remember you really haven't “fixed” anything. Your page loads will improve because you'll skip the high cost of rendering a raw page, but if you have a cache miss the page will be as slow as it ever was. This can be important if you pass sessioned clients to your backend application servers, since they will suffer the slowest load times. So use caching to cut load times, but also try to fix the underlying performance problems.

    Rewrites

    We use a lot of rewrites. They are extremely useful for normalizing URLs for purging and for hiding ugly URLs from end users; no one wants to go to wiki/index.php?title=Foo when they can just go to wiki/Foo. Unfortunately, applying this pattern leads to very complex logic that relies on using the request URL as an accumulator. This makes for very difficult-to-understand code, since you can have multiple changes applied to one URL. Rewrites are also very difficult to test, since they are built into your caching server. Don't forget that you're essentially making your cache part of your application; there isn't much difference between a rewrite in Varnish and a route in Rails.

    If you have the choice, don't put rewrites in your caching layer; they are very difficult to debug, and can be very fragile. Use redirects if at all possible; they may be slower, but they are easier to figure out when something goes wrong.

    Complicated Logic

    Rewrites aren't the only place you can make things complicated. It's very easy to build complex logic in something like Varnish. You can have several levels of if statements, and several points at which you move on to the next step in document delivery. This, just like rewrites, can lead to problems that are very difficult to debug, since you may have several conditions applied to the same request.

    If you find yourself doing this, ask yourself if it's really needed. You may find that while it's fast to implement in the caching layer, you may be better off building it into the application itself. Remember that caching layers are difficult to test, and you may not know that your logic is fully working for at least one eviction cycle.

    Wrapping it Up

    My advice for using a web caching server is easy: keep it simple. Try to keep as much logic out of it as possible, don't ignore your performance problems, and use it for what it's best at: caching.


  • Infrastructure – The Challenge of Small Ops – Part 3

    Infrastructure is hard to build. This is true whether you're putting together compute clusters or dealing with roads and power lines. Typically it involves increases in both operating and capital expenses, and a small mistake can be quite costly.
     

    Limited Resources

     
    All organizations have goals. Sometimes these goals are built around reliability, and sometimes they are built around budgets, but most of the time both are important. In a large organization a few extra servers don't usually carry a material cost impact, but in a small organization one extra server can double the cost of a project. Without a large budget, some reliability goals can be quite challenging.
     

    Reliability

     
    Engineering is the art of making things as weak as they need to be to survive. When putting together infrastructure in a small environment, it's helpful to explicitly give someone the job of reliability engineering. They should look at your application and outline what is required to provide the basic redundancy your organization needs. Then they should see how they can line up the budget and the requirements, and get you a solution that meets your uptime needs as well as your pocketbook.
     

    N+1

     
    This gets batted around often when discussing redundancy. In a very large organization N may equal 1000, so N + 1 is 1001, but in a small organization N is often 1, making N + 1 equal to 2. This is a hard problem when you may only be allowed to buy two servers for a three-tiered application, but you can work around it. Virtualization can really help out, but it increases the planning demands. You will need to ensure that each piece of physical equipment has the capacity to meet your needs, and that system roles are always distributed such that they can fail independently. While this sounds simple, it really needs an owner to keep track of it, just to make sure you don't lose your primary and backup service at the same time.

    The second issue with N + 1 redundancy is that when N equals 1 you need to plan capacity carefully. The best solution in this case is to use an active-passive setup. If you use an active-active setup you need to be careful that you don't exceed 50% of your total capacity, since a failure will remove 50% of that capacity.
     

    Wrapping it Up

     
    Infrastructure is one of the harder things to get right in a small org. Take your time and think about it. Always keep an eye on your budget and reliability goals.

    This is the final installment in this series. Take a look at Part 2 and Part 1 if you liked this article.


  • Solr Upgrade Surprise and Using Kill To Debug It

    At work we've recently upgraded to the latest and greatest stable version of Solr (3.6), and moved from the dismax parser to the edismax parser. The initial performance of Solr was very poor in our environment, and we removed the initial set of search features we had planned to deploy while trying to get the CPU utilization in order.

    Once we finally rolled back a set of features, Solr seemed to be behaving optimally. Below is what we were seeing as we looked at our search servers' CPU:
    [Graph: Solr CPU usage, pre- and post-fix]
    Throughout the day we had periods where we saw large CPU spikes, but they didn't really seem to affect the throughput or average latency of the server. Nonetheless we suspected there was still an issue, and started looking for a root cause.
     

    Kill -3 To The Rescue

     
    If you've never used kill -3, it's perhaps one of the most useful Java debugging utilities around. It tells the JVM to produce a full thread dump, which it prints to the STDOUT of the process. I became familiar with this when trying to hunt down threads in a Tomcat container that were blocking the process from exiting. Issuing kill -3 would give you enough information to find the problematic thread and work with development to fix it.
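
    As a sketch, against a hypothetical Solr process (the process name will vary with how you run it):

    # find the JVM's pid and ask it for a thread dump
    kill -3 $(pgrep -f 'start.jar')

    # the dump goes to the process's STDOUT, e.g. catalina.out for Tomcat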

    In this case, I was hunting for a hint as to what went wrong with our search. I issued kill -3 during a spike, and got something like this:

     

    Looking at the output, I noticed that we had a lot of threads calling FuzzyTermEnum. I thought this was strange, and it sounded like an expensive search method. I talked with the developer, and we expected that the tilde character was being ignored by edismax, or at the very least escaped by our library, since it was included in the characters to escape. I checked the request logs, and we had people looking for exact titles that contained ~. This turned a 300ms query into a query that timed out, due to the size of our index. Further inspection of the thread dump revealed that we were also allowing the * to be used in query terms. Terms like *s ended up being equally problematic.
     

    A Solr Surprise

     
    We hadn't sufficiently tested edismax, and were surprised that it ran ~, +, ^, and * even when escaped. I didn't find any documentation that stated this directly, but I didn't really expect to. We double-checked our Solr library to confirm that it was properly escaping the special characters in the query, but they were still being processed by Solr. On a hunch we tried double escaping the characters, which resolved the issue.

    I'm not sure if this is a well-known problem with edismax, but if you're seeing odd CPU spikes it is definitely worth checking for. In addition, when trying to get to the root of a tough problem, kill -3 can be a great shortcut. It saved me a bunch of painful debugging, and eliminated almost all of my guesswork.


  • What I Wish Someone Had Told Me About Writing Cron Jobs

    Much like Doc Brown and Marty McFly, cron and I go way back. It is without doubt one of the single most valuable tools you can use in Linux system management. What I've learned over the years, though, is that it can be hard to write jobs that reliably produce the results I want. I wish someone had told me a few of these things when I started writing cron jobs back in the 1990s.
     

    Don’t Allow Two Copies of Your Job to Run at Once

     
    A common problem with cron jobs is that the cron daemon will launch a new job while the old one is still running. Sometimes this doesn't cause a problem, but generally you expect only one copy to run at a time. If you're using cron to control jobs that launch every 5 or 10 minutes, but only want one to run at a time, it's useful to implement some type of locking. A simple method is to use something like this:
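
    A minimal sketch of a lock file guard (the paths are hypothetical, and the check isn't atomic):

    #!/bin/bash
    LOCKFILE=/var/run/myjob.lock

    # if a previous run is still alive, bail out quietly
    if [ -e "$LOCKFILE" ] && kill -0 "$(cat "$LOCKFILE")" 2>/dev/null; then
        exit 0
    fi

    echo $$ > "$LOCKFILE"
    trap 'rm -f "$LOCKFILE"' EXIT

    # ... the actual work goes here ...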

    You can get more complicated using flock or other atomic locking mechanisms, but for most purposes this is good enough.
     

    Sleep for a Bit

     
    Ever had a cron job overload a whole server tier because logs rotate at 4am? Or gotten a complaint from someone that you were overloading their application by having 200 servers contact them at once? A quick fix is to have the job sleep for a random amount of time after being launched. For example:
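
    A one-line version, here spreading the start time over five minutes:

    # wait somewhere between 0 and 299 seconds before doing any work
    sleep $((RANDOM % 300))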

    This does a good job of spreading the load out for expensive jobs, and avoids thundering herd problems. I generally pick an interval long enough that my servers will be distributed throughout the period, but that still meets my goal. For example, I might spread an expensive once-a-day job over an hour, but a job that runs every 5 minutes may only be spread over 90 seconds. Lastly, this should only be used for things that can accept a loose time window.
     

    Log It


    I'll be the first to admit I do this all the time: I hate getting emails from cron, so I just throw a job's output away. In general, though, you should avoid doing this. When everything is working it isn't a big deal, but when something goes wrong, you've thrown away all the data that would have told you what happened. So redirect to a log file, and overwrite or rotate that file.
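
    For instance, a hypothetical cron entry that keeps its output:

    # /etc/cron.d/nightly-report
    0 4 * * * root /usr/local/bin/nightly-report.sh >> /var/log/nightly-report.log 2>&1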

    Hopefully these tips help you out, and solve some of your cron pains.


  • 6 Phone Screen Questions for an Ops Candidate

    My company is hiring, and I've been thinking a lot about what types of questions are appropriate for a phone interview but still give me enough detail to reach a conclusion about whether the person on the other end is competent. Having been on both sides of the table, I thought I might share what I think are a few good questions.

     

    1. Tell me about a monitoring test you’ve written?

    I decided long ago that I don't want to hire anyone who has never written a monitoring test. I don't care how simple or complicated the test was, but I want to make sure they've done it. Throughout my career I've come across so many specialized pieces of code or infrastructure that I take it for granted that sooner or later you're going to need to do this. I find that the people who care about uptime do it earlier in their careers. It's good to follow up with several more questions about their specific implementation, and then ask if they had any unexpected issues with the test.
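
    For the sake of discussion, a monitoring test can be as small as a Nagios-style shell check (the URL here is hypothetical):

    #!/bin/bash
    # exit 0 for OK, 2 for CRITICAL, per the usual plugin convention
    if curl -fsS --max-time 10 "http://app01.example.com/healthcheck" > /dev/null; then
        echo "OK - healthcheck responded"
        exit 0
    else
        echo "CRITICAL - healthcheck failed"
        exit 2
    fi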

     

    2. How would you remove files from a directory when ‘rm *’ tells you there are too many files?

    Back in the 1990s, when Solaris shipped with a version of Sendmail that was an open relay, it wasn't unusual for me to have to wipe a mail queue directory for a customer. If someone had been really aggressive sending mail to it, it wasn't too unusual to be confronted with the message that the * expansion was too long to pass to rm. I can think of a few ways to do this:

    1. for i in `ls /dir/`; do rm $i ; done
    2. find /dir/ -exec rm {} \;
    3. rm -rf /dir; mkdir /dir

    And I'm sure there are plenty more. After I get the answer I like to ask whether they think there is any issue with the method they've chosen.

    I like this question since it shows a candidate's understanding of how the command line works, and whether they can think around some of its limitations.

     

    3. How would you set up an automated installation of Linux?

    A good candidate should have done this, and they should immediately be talking about setting up FAI or Kickstart. I like to make sure they cover the base pieces of infrastructure, like DHCP, TFTP, and PXE. Generally I will follow up and ask when they think it makes sense to set up this type of automation, since it does require quite a bit of initial infrastructure.

     

    4. How would you go about finding files on an installed system to add to configuration management?

    This question is straightforward and quick, and I'm looking for two things from the candidate. First, I want them to tell me about using the package management system to locate modified config files, and second, I want to hear them talk about asking the development team what was copied onto the system.
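
    A couple of common ways to do the first part (a rough sketch; exact flags vary by distribution):

    # RPM-based systems: list installed files whose contents differ from the package
    rpm -Va | grep '^..5'

    # Debian/Ubuntu: report changed conffiles (requires the debsums package)
    debsums -ce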

    This question tells me they've looked for changes on systems and have a basic understanding of what the package management tools provide. But it also tells me they know there is a human component, and that it might be quicker to ask the dev team what they installed than to build a tool to find it.

     

    5. If I gave you a production system with a PHP application running through Apache, what things would you monitor?

    I like using this question because it gives you an idea of the thoroughness of the candidate's thought process. The easy answer is the URL the application is running on, but I like to push candidates for more. I'm generally looking for a list like:

    • The URL of the application
    • The Port Apache is running on
    • The Apache Processes
    • PING
    • SSH
    • NTP
    • Load Average / CPU utilization
    • Memory Utilization
    • Percentage of Apache Connections used
    • Etc..

    I'm looking for both the application-specific and the basic probes. I cannot tell you how many times in my career I've started a job and found out SSH wasn't monitored. Since it wasn't part of the application, people didn't think it was needed.

    This question tests the candidate's attention to detail. Monitoring is an important part of any production environment, and I want candidates who state the obvious.

    6. If I asked you to back up a server's local filesystem, how would you do it?

    Backups are, unfortunately, the bread and butter of operations work. A candidate should really have some experience running a backup, so they should know the basics. Unfortunately, this is a really open-ended question. There are endless ways it can be done, and that makes it a little tough on both the candidate and the interviewer. One example a candidate could choose would be to use the tar command, but they could also choose to use tar with an LVM snapshot, or they could use rsync to a remote server. It's really the follow-up question that makes this worthwhile: what are the disadvantages of your method, and can you think of another way you might do this to address those issues? Again, since it's the bread and butter of operations work, they should know the strengths and weaknesses of the scheme they select, and they should know at least one alternative.
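
    As a rough sketch of the tar and rsync variants (hosts and paths are hypothetical):

    # archive the root filesystem, staying on one filesystem (skips /proc, /sys, other mounts)
    tar --one-file-system -czpf /backup/root-$(date +%F).tar.gz /

    # or mirror it to a remote server with rsync
    rsync -aHx --delete / backup01.example.com:/backups/web01/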

    This question checks to see if a candidate has performed typical operations work, but also if they have thought through the problems with it.

     


  • Rally Cars and Redundancy: Understand Your Failure Boundaries

    I occasionally watch rally car racing, and if you haven't seen it before it's worth a watch. Guys drive small cars very fast down dirt roads, and while this is going on a passenger is reading driving notes to the driver. Occasionally these guys hit rocks, run off the road, and do all sorts of damage to their cars. They do a quick assessment of the damage to determine if they can continue, and then carry on or pull out. Generally the call they are making is whether the particular failure has compromised the safety or performance of the car so much that they cannot complete the race.

    If you want to build a redundant system, you'll need to take a look at each component and ask yourself what happens if it fails. Do you know how many other components it will affect? Will it bring the site down, or degrade performance so much that the site will practically be down? Will your system recover automatically, or require intervention from an operator? Think through this in a very detailed manner, and make sure you understand the components.

    Enter the Scientific Method

    Develop a hypothesis, and develop a checklist of what you expect to see during a system failure. This should include how you expect performance to degrade, what you expect to do to recover, and what alerts should be sent to your people on call. Put time frames on when you expect things to happen, and most importantly, note any place where you expect there is a single point of failure.

    Break It

    Once you've completed your list, start shutting off pieces to test your theories. Did you get all the alerts you expected in a timely manner? More importantly, did you get extra alerts you didn't expect? These are important because they may mislead you or obscure a failure. Did anything else fail that you didn't expect? And lastly, did you have to do anything you didn't anticipate in order to recover?

    Are You a Soothsayer

    Summarize the differences, and document what happened. If you got too many alerts, see if you can develop a plan to deal with them. Then document what the true failure boundaries are. For example, if your firewall fails, do you lose access to your management network? After doing all this, decide if there is anything you can do to push failure boundaries back into a single component, and whether you can minimize the effect on the rest of your system. Basic infrastructure components like switches and routers usually have surprising failure boundaries, especially when coupled with infrastructure decisions such as putting all of your database servers in a single rack.

    This process takes time, and it's hard to do at small and medium scales. Once you have infrastructure up and running, it's difficult to run these types of tests, but you should probably still advocate for it. It's better to discover these types of problems under controlled conditions than in the middle of the night. You may be able to test parts using virtualization, but sometimes you'll need to pull power plugs. Concoct any type of failure you can imagine, and look for soft failures (a MySQL server losing 10% of its packets) since they are the most difficult to detect. Remember, you'll never really have a redundant system until you start understanding how each component works with the other components in your system.


  • Two Helpful Data Concepts

    I've been batting around a couple of terms while talking about various technical solutions for years, and I've found them useful when selecting and constructing technical solutions for managing data. They've helped me both build, and provide input on, what I need from a technical solution.

    1. Real-Timey-ness

    When considering solutions for your data needs, you may say you need something real-time, but you're unlikely to ever find such a solution. Most of them will incur some latency while persisting a record, and the availability of those records will be based on the latency it takes to write them. Aggregate data will be further delayed, since it will depend on the latency of the first record. There are rarely any true real-time solutions; most solutions are real-timey. They'll have some time, measured in milliseconds, seconds, minutes, or hours, before the data becomes usable. In addition, most data is not actionable for some time after it becomes available; a single data point is rarely enough to make a decision.

    2. Correcty-ness

    Is your answer correct? This is a fundamental question in dealing with data, and most people I've dealt with make one fundamentally incorrect assumption. For most metrics, the data you collect is only mostly correct. You will lose some transactions, you will overcount others, your software will have bugs, and these will all lead to inaccuracy in your data. To improve your correcty-ness you'll likely have to trade resolution or latency in order to push yourself closer to your achievable accuracy. Correcty-ness, unlike real-timey-ness, can be improved over time: you can look at higher-resolution data and adjust your initial measurement for a time period.

    A Practical Example

    The scalability of your application will generally be determined by your requirements around real-timey-ness and correcty-ness. Take, for example, a visit counter on a web page. The most straightforward implementation would be to increment a counter for each page render and store it in a database (UPDATE counter_table SET views_count = views_count + 1 WHERE page_id = 5). This solution will have a high degree of accuracy, but comes with limited scalability, since we'd be locking a row in order to increment its value. Furthermore, the accuracy of this solution may actually degrade as usage increases, since aborted page loads may fail to increment the counter. This solution has high real-timey-ness and correcty-ness values, at the expense of scalability.

    A more complex solution would be to look at the request logs and increment the row in batches using an asynchronous process. This solution will update the counter in whatever time it takes to aggregate the last set of logs. The correcty-ness of the count at any given point will be less than that of the above solution, as will the real-timey-ness, since you will only have the answer as of the last aggregation. However, the solution will support larger request volumes, since the aggregation of requests takes place outside of the page render.

    The first solution presented is perfect for a small web application. The small number of requests you receive at small scale can make asynchronous solutions look broken, and the latency incurred per page render is relatively small. At small scale it's probably better to favor a higher degree of correcty-ness.

    The second solution will perform much better at larger scale. Its lack of correcty-ness and real-timey-ness will be hidden by the hit counter incrementing by large numbers with each refresh. This solution would generally be called eventually consistent, but you can never really achieve consistency without looking at a fixed time window that is no longer being updated.

    A Third Solution

    During each page render a UDP packet could be sent to an application that increments an in-memory counter. The page could then pull this count from the secondary application and display the current count. To achieve consistency, the request logs could be aggregated on a given interval, which then replaces the base value of the counter.

    This solution will have a high degree of real-timey-ness, since page views will be aggregated immediately. However, the correcty-ness of the application will be less than the first solution, since the data transmission method is less reliable. This is a fairly scalable solution that would balance actionable real-time data with the ability to correct measuring errors. That said, it is likely less scalable than the two previous solutions.

    Great, What Now?

    When designing an application, take the time to think about the acceptable real-timey-ness and correcty-ness. In general, high correcty-ness and real-timey-ness create slow applications at larger scale. So when spec'ing out an application, consider assigning a real-timey-ness value to the data you present to users. I would typically define it as a time window, i.e. data presented in the UI will be at least 5 seconds old, and no more than 2 minutes old. As for correcty-ness, I would define it as the acceptable accuracy within a given time period. For example, data must have an error no larger than 50% within 5 seconds, 10% within 5 minutes, and 0.1% within 7 minutes.

    Deciding what these numbers should be is a different problem. You can generally address real-timey-ness at the expense of correcty-ness, but it's hard to improve both correcty-ness and real-timey-ness at scale. I would generally look at the scale of the application to decide how important either is. Solutions that are low scale won't generally have contention issues, so I would favor high real-timey-ness and correcty-ness. In addition, people are more likely to notice issues when the numbers increase by small amounts (i.e. the counter not going from 2 to 3 after 4 page views), and are more likely to complain about problems in the software. For large-scale solutions I would take an approach that looks at how long the data takes to become actionable. Being accurate within 1 to 5 minutes may be enough to help you derive a result from your data, but it may take 24 hours before you can conclude anything. Think about the amount of time this will take, and then build your specification accordingly.

    Hopefully these concepts are useful, and can be put to use when designing solutions for data.

     

     


  • Keep it Simple Sysadmin

    I've been thinking about what I hate about my configuration management system. I seem to spend a lot of time, when I want to make a change, looking at the various resources in Chef, and sometimes I end up using providers like the Opscode apache2 cookbook to manage enabling and disabling modules. A few days ago, while in a rush, I did the following:

    A few days later a co-worker decided this was verbose and replaced it with the previously mentioned apache2 cookbook syntax:

    This seemed reasonable, and frankly it's the approach I would have taken if I had taken the time to figure out whether we had a cookbook providing this resource (this was an emergency change). The only problem was that we were using a broken version of that module that didn't actually do anything (I've still not dug in, but I found a similar bug that was fixed some time ago). So: no errors, no warnings, and no configuration was applied or removed.

    I've come to the decision that both are probably the wrong approach. Both of these approaches were trying to use the a2enmod command available in Ubuntu, or to provide similar functionality. That seems reasonable, since it will manage dependencies, but why should I use it? The only reason would be to maintain compatibility with Ubuntu and Debian's configuration management utilities, but I've already decided to outsource this to Chef, which does that pretty well. I've come to believe that I should have just done this:

    Why?

    The right approach to configuration management is to use the minimal features of your tool. Chef (in this case) provides awesome tools for file management, but when I use higher-level abstractions I'm actually introducing more obfuscation. The example given with the Apache module is painful, because when I look at what is really happening, I'm copying a file. Chef is really good at managing files, so why would I want to abstract that away? Do people really not understand how a2enmod works? Is it really a good thing if your operations team doesn't know how Apache configs work?

    Cron is another great example: Chef ships a cron resource with its own DSL of minute, hour, and command attributes.
    Do we really find this simpler than creating a cron file and having it copied to /etc/cron.d? Isn't that why we introduced cron.d in the first place: to get cron jobs out of users' crontabs? It's also difficult to ensure Chef removes the job, since a user can screw up that file. Not to mention that this introduces a DSL to manage a DSL, with a more verbose syntax than the original, which seems absurd.
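
    For comparison, the plain file Chef would be copying into /etc/cron.d might look something like this (the job name and paths are hypothetical):

    # /etc/cron.d/app-cleanup (cron.d entries include a user field)
    */10 * * * * root /usr/local/bin/app-cleanup.sh >> /var/log/app-cleanup.log 2>&1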

    KISS – Keep it Simple Sysadmin

    My frustration here is that for the most part we are just copying files around. The higher-level abstractions we use actually decrease clarity and understanding of how the underlying configuration is being applied. If there is a logical flaw (a bug) in the implementation, we're stuck suffering, needing to address the issue, and sorting out who is at fault. So use the simple resources and stay away from the magic ones, and you'll actually be able to understand the configuration of your systems, and hopefully get to the root of your problems faster.

