The Joel on Software Discussion Group (CLOSED)

A place to discuss Joel on Software. Now closed.

Awk/Gawk help?

I have a file of data that looks like the following (this is a list of messages I sent out, one per line):


I’d like to build a summary report that tells me how many messages I sent in a given second.  The messages are key/value tagged, and tag 52 is the sending time with seconds and milliseconds.  So I’d like to produce a report that looks like:

20080624-12:43:38 - 2 message(s)
20080624-12:43:39 - 1 message(s)

So I need to go through the file, extract the timestamp from key 52, remove the milliseconds, then build a hash that maps a timestamp to a count of messages at that second.  I could do this with Perl, but I’m trying to learn more about gawk/sed.  Can this be done with those tools?  If so, can I get some pointers?
Hockey Player
Wednesday, June 25, 2008
Yes, awk can be used for this. An awk "program" consists of a series of patterns which are applied to each line of input, and a series of actions associated with those patterns which execute on matches. It's tailor-made for this sort of work.
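To make the pattern/action idea concrete, here is a minimal sketch. The sample lines and the helper name count_tag52 are made up for illustration; they are not the data from the question.

```shell
# Made-up input for illustration: two FIX-style lines and one unrelated line.
sample='8=FIX.4.2|35=D|52=20080624-12:43:38.066|
-- log rotated --
8=FIX.4.2|35=D|52=20080624-12:43:39.066|'

count_tag52() {
  awk '
    # pattern { action }: the action runs only on lines the pattern matches
    /52=/ { n++ }
    # END runs once, after the last input line
    END   { print n + 0, "line(s) contain tag 52" }
  '
}

printf '%s\n' "$sample" | count_tag52
# prints: 2 line(s) contain tag 52
```

A rule with no pattern would run on every line, and a pattern with no action prints the matching line.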

As for pointers, man pages and Google will give you far more comprehensive and useful help than you're likely to get through a forum.

Hope this helps!
BrotherBeal
Wednesday, June 25, 2008
This should get you started

essentially you can create a pattern that matches

and use the captured time to create an associative array and bump the count value by one.

then in the END clause of the awk script print out the keys (times) and their count (number of messages)

It should be a pretty short script.
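The steps above can be sketched roughly as follows. This is one possible reading, not necessarily the script the poster had in mind: the sample lines, the regular expression, and the name count_per_second are assumptions; only the timestamp format comes from the question.

```shell
# Made-up FIX-style lines for illustration; only the tag-52 timestamp
# format follows the question.
sample='8=FIX.4.2|35=D|52=20080624-12:43:38.066|10=001|
8=FIX.4.2|35=D|52=20080624-12:43:38.512|10=002|
8=FIX.4.2|35=D|52=20080624-12:43:39.066|10=003|'

count_per_second() {
  awk '
    # match() sets RSTART and RLENGTH to the position and length of the match.
    match($0, /52=[0-9]+-[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/) {
      ts = substr($0, RSTART + 3, RLENGTH - 3)   # drop the "52=" prefix
      count[ts]++                                # associative array keyed by second
    }
    # for-in order is unspecified, so the result is piped through sort.
    END { for (ts in count) printf "%s - %d message(s)\n", ts, count[ts] }
  ' | sort
}

printf '%s\n' "$sample" | count_per_second
# prints:
# 20080624-12:43:38 - 2 message(s)
# 20080624-12:43:39 - 1 message(s)
```

Because the pattern stops at the seconds, the milliseconds never enter the key. Note that this simple pattern would also match a tag ending in 52 (such as 152=) if it carried a value of the same shape.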
Bart Park
Wednesday, June 25, 2008
Ahh, this brings back memories working with FIX.  Have fun handling repeating groups in awk.
none
Wednesday, June 25, 2008
_The Unix Programming Environment_ (Kernighan and Pike) also has a great introduction to awk.
Wednesday, June 25, 2008
I ended up with:

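# find lines that have a timestamp like 52=20080624-12:43:39.066
match($0,/52=........-..:..:../) {
  # reconstructed body (lost from the archived post): drop "52=", keep up to the second
  freq[substr($0, RSTART + 3, RLENGTH - 3)]++
}

END {
  for (timestamp in freq)
    printf "%s\t%d\n", timestamp, freq[timestamp]
}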

Thanks all!
Hockey Player
Wednesday, June 25, 2008
Bear in mind that Perl was written as a _replacement_ for awk/sed.  There really are very few situations these days where those are appropriate tools to use.
Wednesday, June 25, 2008
"There really are very few situations these days where those are appropriate tools to use."

I find gawk to be faster to write and easier to read than Perl for short programs which don't have to do anything fancy (pull in parsing libraries, etc).  It's great for one-off text processing tasks.

For heavier duty things, I turn to Ruby.  (Or Perl, if you put a gun to my head and tell me that there is more than one way to kill me.)
Patrick McKenzie (Bingo Card Creator)
Thursday, June 26, 2008
"More than one way to kill me"

Lally Singh
Thursday, June 26, 2008
"More than one way to kill me"

I liked that too, lol...
Thursday, June 26, 2008
Hockey Player's awk program is far more complicated than it needs to be.

The trick is a pre-processing step that replaces | with a newline. This could be done in sed or awk but let's use tr.

This gives (based on the data in the OP):



Then we can tell awk to act only on lines starting with 52= and split fields on - and . so the time without milliseconds is the second field (or $2 in awk):

tr '|' '\012' < data |
awk -F'-|\\.' '
  /^52=/ { Times[$2]++ }
  END { for (i in Times) print Times[i], i }
'
which gives:
2 12:43:38
1 12:43:39
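To see why the time lands in $2, here is a small sketch that prints every field awk produces for a single tag-52 line under the same -F setting. The line uses the timestamp format from the question; the helper name show_fields is made up.

```shell
# Print each field of every input line, split on "-" or "." as in the script above.
show_fields() {
  awk -F'-|\\.' '{ for (i = 1; i <= NF; i++) printf "$%d = %s\n", i, $i }'
}

printf '%s\n' '52=20080624-12:43:38.066' | show_fields
# prints:
# $1 = 52=20080624
# $2 = 12:43:38
# $3 = 066
```

The date stays in $1 and is not part of the key, so identical times on different days would be counted together.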
John L
Friday, June 27, 2008
