AWK: compare apache dates without using regex

Question

I am writing a log analysis application and want to capture Apache log entries between two specific dates. Let's assume the date is formatted like this: 22/Dec/2009:00:19 (day/month/year:hour:minute)

I am currently using a regex to replace the month name with its numeric value and remove the delimiters, so the specified date is converted to: 221220090019

This makes date comparison trivial, but...

Running a regex on every record of a large file, say one containing a quarter of a million records, is extremely costly. Is there any other method that doesn't involve regex replacement?

Thanks in advance

Edit: here's a function doing the conversion / comparison

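function dateInRange(t, from, to) {
    sub(/[[]/, "", t);                  # strip the leading "["
    split(t, a, "[/:]");                # a[1]=day, a[2]=month name, a[3]=year, a[4]=hour, a[5]=minute
    # Locate the month name in the string; names start at offsets 1, 4, 7, ...
    match("JanFebMarAprMayJunJulAugSepOctNovDec", a[2]);
    a[2] = sprintf("%02d", (RSTART + 2) / 3);
    s = a[3] a[2] a[1] a[4] a[5];       # year, month, day, hour, minute

    return s >= from && s <= to;
}

"from" and "to" are the range bounds in the above format, and "t" is the original date/time field of the Apache log (e.g. [22/Dec/2009:00:19:36)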

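To make the behaviour of the function above concrete, here is a small sketch that runs it on two made-up timestamps. The sample lines and the range bounds are illustrative assumptions, and a and s are declared as locals, which the original does not do. Note that the function concatenates the year first (a[3] a[2] a[1] ...), so the sample timestamp becomes 200912220019 and the bounds have to use the same year-first layout.

```shell
# Sketch: call dateInRange() on two sample timestamps (illustrative data).
result=$(printf '%s\n' '[22/Dec/2009:00:19:36' '[25/Dec/2009:10:00:00' | awk '
function dateInRange(t, from, to,    a, s) {
    sub(/[[]/, "", t)                   # strip the leading "["
    split(t, a, "[/:]")                 # day, month name, year, hour, minute, second
    match("JanFebMarAprMayJunJulAugSepOctNovDec", a[2])
    a[2] = sprintf("%02d", (RSTART + 2) / 3)
    s = a[3] a[2] a[1] a[4] a[5]        # year-first key, 200912220019 for the first line
    return s >= from && s <= to
}
# Print each timestamp followed by 1 (in range) or 0 (out of range).
{ print $0, dateInRange($0, "200912220000", "200912222359") }')
echo "$result"
```

With these inputs the first line is reported as 1 and the second as 0.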
regex logging awk apache

smallmeans May 15 '10 at 20:34

4 answers

Roland Illig · Answer 1 · 2010-05-15T22:24:04+0000

I had the same problem with a very slow AWK program that relied on regular expressions. When I translated the entire program to Perl, it ran much faster. I guess that was because GNU AWK compiles the regex every time it interprets the expression, whereas Perl compiles the expression just once.

serg · Answer 2 · 2010-05-15T22:31:00+0000

Well, here's an idea, assuming the log entries are ordered by date.

Instead of running a regex on each line of the file and checking whether that entry is within the required range, do a binary search.

Get the total number of lines in the file. Read the line in the middle and check its date. If it is older than your range, everything before that line can be ignored. Split the remainder in half and check its middle line in turn, and so on until you find the boundaries of the range.
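The steps above could be sketched in AWK roughly as follows. This is only an approximation under stated assumptions: AWK cannot seek within a file, so the sketch keeps the raw lines in an array and saves date conversions (only the probed lines are converted), not I/O. The helper key(), the sample log lines and the from/to bounds are all made up for illustration.

```shell
# Sample log lines, already sorted by time (illustrative data).
log='[20/Dec/2009:08:00:00 a
[21/Dec/2009:09:30:00 b
[22/Dec/2009:00:19:36 c
[22/Dec/2009:13:05:10 d
[23/Dec/2009:07:45:00 e'

picked=$(printf '%s\n' "$log" | awk -v from=200912220000 -v to=200912222359 '
# key(): hypothetical helper that turns "[22/Dec/2009:00:19:36 ..." into 200912220019.
function key(line,    a, m) {
    split(substr(line, 2), a, "[/:]")   # substr() skips the leading "["
    m = (index("JanFebMarAprMayJunJulAugSepOctNovDec", a[2]) + 2) / 3
    return sprintf("%s%02d%s%s%s", a[3], m, a[1], a[4], a[5])
}
{ lines[NR] = $0 }                      # keep raw lines; nothing is converted yet
END {
    lo = 1; hi = NR + 1                 # find the first line whose key is >= from
    while (lo < hi) {
        mid = int((lo + hi) / 2)
        if (key(lines[mid]) < from) lo = mid + 1
        else hi = mid
    }
    # Print forward from that line until the key passes the upper bound.
    for (i = lo; i <= NR && key(lines[i]) <= to; i++) print lines[i]
}')
echo "$picked"
```

On the sample data this prints the two lines dated 22/Dec/2009. A version that really skips most of the file needs a language with random file access.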

Dennis Williamson · Answer 3 · 2010-05-15T22:50:11+0000

Here is a Python program I wrote to do a binary search through a date-based log file. It can be adapted to work for your use.

It seeks to the middle of the file, syncs to a newline, then reads and compares the date. It repeats the process, halving the remaining section each time, until the date matches (greater or equal), backs up to make sure no earlier lines carry the same date, then reads and prints lines up to the end of the desired range. It's very fast.

I have a more advanced version at work. I will eventually complete it and post an updated version.

smallmeans · Answer 4 · 2010-05-16T22:08:27+0000

Chopping up files just to determine a range sounds like a bit much for such a simple task (binary search is worth considering, though).

Here's my modified function, which is obviously much faster now that the regex replacement is eliminated:

BEGIN {
    months["Jan"] = 1
    months["Feb"] = 2
    ....
    months["Dec"] = 12
}
function dateInRange(t, from, to) {
    split(t, a, "[/:]");
    m = sprintf("%02d", months[a[2]]);  # table lookup instead of match()
    s = a[3] m a[1] a[4] a[5];
    ok = s >= from && s <= to;
    if(!ok && seen == 1){exit;}         # out of range after a match: stop reading
    return ok;
}

The months array is defined once in BEGIN and then used to look up month numbers. I also made sure that the program stops checking records as soon as the date falls out of range (the variable seen is set on the first match).
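For reference, a self-contained sketch of this approach might look like the following. It is not the exact program from this answer: the helper name dateKey(), the way the months table is built with split(), the use of substr() to skip the leading "[", and the sample log lines are all assumptions made for illustration. It also assumes the log is sorted by time, which the early exit depends on.

```shell
# Sketch: month lookup table plus early exit once the range has been passed.
matches=$(printf '%s\n' \
    '[21/Dec/2009:23:59:59 before' \
    '[22/Dec/2009:00:19:36 inside' \
    '[22/Dec/2009:18:00:00 inside' \
    '[23/Dec/2009:00:00:01 after' \
    '[24/Dec/2009:00:00:01 never-checked' |
awk -v from=200912220000 -v to=200912222359 '
BEGIN {
    # Build months["Jan"]=1 ... months["Dec"]=12 from a single string.
    n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
    for (i = 1; i <= n; i++) months[names[i]] = i
}
function dateKey(t,    a) {
    split(substr(t, 2), a, "[/:]")      # substr() drops the leading "["
    return sprintf("%s%02d%s%s%s", a[3], months[a[2]], a[1], a[4], a[5])
}
{
    k = dateKey($1)
    if (k >= from && k <= to) { seen = 1; print; next }
    if (seen) exit                      # sorted input: past the range, stop reading
}')
echo "$matches"
```

On the sample data only the two "inside" lines are printed, and the last line is never converted because the program exits at the first record after the range.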

Thank you all for your input.
