Getting Unique Counts From a Log File

Two colleagues of mine ask a very similar question in interviews. The question is not particularly hard, nor does it require a lot of thought to solve, but it’s something that, as a developer or an ops person, you might find yourself needing to do. The question is: given a log file of a particular format, tell me how many times something occurs in that log file. For example, tell me the number of unique IP addresses in an access log, and the number of times each IP has visited this system.

It’s amazing how many people don’t know what to do with this. One of my peers asks people to do this using the command line; the other tells the candidate they can do it any way they want. I like this question because it’s VERY practical; I do tasks like this every day, and I expect the people I work with to be able to do them too.

A More Concrete Example

I like the shell solution, because it’s basically a one-liner. So let’s walk through it using access logs as an example.

Here is a very basic sample of a common access_log I threw together for this:
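Something along these lines, in the standard Apache combined log format (the specific IPs, timestamps, and paths here are just made-up stand-ins):

    10.0.0.1 - - [10/Oct/2012:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"
    10.0.0.2 - - [10/Oct/2012:13:55:38 -0700] "GET /missing.html HTTP/1.1" 404 512 "-" "Mozilla/5.0"
    10.0.0.1 - - [10/Oct/2012:13:55:41 -0700] "GET /missing.html HTTP/1.1" 404 512 "-" "curl/7.29.0"
    10.0.0.3 - - [10/Oct/2012:13:55:43 -0700] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"

The important part is the field layout: the client IP is field 1, the requested document is field 7, and the status code is field 9.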

Let’s say you want to count the number of times each unique IP address has visited this system. Using nothing more than awk, sort, and uniq you can find the answer. What you’ll want to do is pull the first field with awk, then pipe that through sort, and then uniq. This isn’t fancy, but it returns the result very quickly without a whole lot of fuss.

Like so:
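The pipeline below assumes the log file is named access_log; substitute whatever yours is called.

    # pull the client IP/hostname (field 1), group identical entries, count each group
    awk '{print $1}' access_log | sort | uniq -c

awk prints the first field of every line, sort groups identical values together, and uniq -c collapses each group into a single line prefixed with its count.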


This gives you each hostname or IP, along with the number of times it has contacted this server.

Upping the Complexity


Now for something more complex: let’s say you want to find the most commonly requested document that returns a 404. Again, we can do this all in a shell one-liner. We still need awk, sort, and uniq, but this time we’ll also use tail. We can use awk to examine the status field (field 9) and print the URL field (field 7) if the status returned was 404. We can then use sort, uniq, and sort again to order the results. Finally, we’ll use tail to print only the last line, and awk to print the requested document.

So here is what this looks like:
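A sketch of that pipeline, again assuming the file is named access_log:

    # print the URL (field 7) for every 404, count duplicates, sort by count, keep the biggest
    awk '{ if ($9 == "404") print $7 }' access_log | sort | uniq -c | sort -n | tail -n 1 | awk '{print $2}'

The first awk prints the requested document only when the status field is 404, sort | uniq -c counts how often each document appears, sort -n orders those counts from smallest to largest, tail -n 1 keeps the most frequent one, and the final awk drops the count so only the document itself is printed.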

Of course there are many other ways to do this. This is a simple way to do it, and the best part is that you can count on these tools being on almost every *nix system.




Comments

6 responses to “Getting Unique Counts From a Log File”

  1. Mina Naguib

    Fairly certain your second example is missing a sort -n after uniq -c for it to do what you’re trying to do

    1. papilion

      Totally right. I’ve corrected the gist, and thanks.

  2. P. S.

    Your second example should use sort -rn | head -1 instead. It is more efficient by far.

  3. Jon Jenkins

    My friend (and Pinterest co-worker) John Rauser wrote this guide to ad hoc data analysis from the Unix command line. You might find it interesting.

    http://en.wikibooks.org/wiki/Ad_Hoc_Data_Analysis_From_The_Unix_Command_Line

  4. P. S.

    Also, more idiomatic and extensible awk would be:

    awk '$9~/^404$/ { print $7}'

    That is, having a match condition around the code to execute.

    And you can also use a regex, which makes it easy to look for all 4xx or 5xx codes:
    awk '$9~/^[45][0-9][0-9]/ { print $7}'

    Awk is terrible for real code but incredibly useful for command one-liners.

  5. papilion

    @P.S. Thanks for the comments. This is intended to be more of a sketch than a how-to, but you have good points. The primary purpose here is to let people know that a standard *nix box has a rich set of tools that it’s in your interest to learn.