AWK: compare Apache dates without using regex
I am writing a log analysis application and want to capture Apache log entries between two specific dates. Let's assume the date is formatted like this: 22/Dec/2009:00:19 (day/month/year:hour:minute)
I am currently using a regex to replace the month name with its numeric value and remove the delimiters, so the date above is converted to: 200912220019 (year, month, day, hour, minute).
This makes date comparison trivial, but...
Running a regex on every record of a large file, say one containing a quarter of a million records, is extremely costly. Is there another method that doesn't involve regex replacement?
Thanks in advance
Edit: here's a function doing the conversion/comparison:
# a and s are declared as extra parameters so they stay local
function dateInRange(t, from, to,    a, s) {
    sub(/[[]/, "", t);                           # strip the leading "["
    split(t, a, "[/:]");                         # day, month, year, hour, minute
    match("JanFebMarAprMayJunJulAugSepOctNovDec", a[2]);
    a[2] = sprintf("%02d", (RSTART + 2) / 3);    # month name -> 01..12
    s = a[3] a[2] a[1] a[4] a[5];                # yyyymmddHHMM
    return s >= from && s <= to;
}
"from" and "to" are the range bounds in the above format, and "t" is the original date/time field of the Apache log (e.g. [22/Dec/2009:00:19:36).
I had the same problem with a very slow AWK program that used regular expressions. When I translated the entire program to Perl, it ran much faster. I guess that is because GNU AWK compiles the regex every time it evaluates the expression, whereas Perl compiles the expression just once.
Well, here's an idea, assuming the log entries are ordered by date.
Instead of running a regex on each line of the file and checking whether that entry is within the required range, do a binary search.
Get the total number of lines in the file. Read the line in the middle and check its date. If it is older than your range, then everything before this line can be ignored. Divide the remainder in half and check the middle line again. Continue until you find the boundaries of the range.
Here is a Python program I wrote to do a binary search through a date based log file. It can be adapted to work for your use.
It seeks to the middle of the file, syncs to the next newline, then reads and compares the date. It repeats the process, halving the remaining range each time, until it reaches a date that matches (greater than or equal to the start of the range). It then rewinds to make sure no earlier lines carry the same date, and reads and prints lines until the end of the desired range. It's very fast.
I have a more advanced version at work. I will eventually complete it and post an updated version.
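As a rough illustration of the seek-and-halve approach described above (not the original program), here is a minimal Python sketch. It assumes the log is sorted by time, that every line carries a timestamp such as [22/Dec/2009:00:19:36 right after the first "[", and that all entries use the same timezone offset. The names sort_key, find_start and lines_in_range are made up for this example.

```python
import os

# Month lookup table, so building the key needs no regex.
MONTHS = {name: f"{num:02d}" for num, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def sort_key(line):
    """Turn '... [22/Dec/2009:00:19:36 ...' into '200912220019'."""
    start = line.index("[") + 1
    # 17 characters cover dd/Mon/yyyy:HH:MM (assumed layout).
    day, month, rest = line[start:start + 17].split("/")
    year, hour, minute = rest.split(":")
    return year + MONTHS[month] + day + hour + minute


def find_start(f, start_key):
    """Return the byte offset of the first line whose key is >= start_key.

    f must be a binary-mode file object over a time-sorted log.
    """
    lo, hi = 0, f.seek(0, os.SEEK_END)
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid)
        f.readline()                  # sync to the start of the next line
        line = f.readline()
        if not line or sort_key(line.decode("latin-1")) >= start_key:
            hi = mid                  # the boundary is at or before mid
        else:
            lo = mid + 1              # the boundary is after mid
    # lo is now the start of the last line before the range (or 0 when
    # the range begins with the first line), so check that line itself.
    f.seek(lo)
    line = f.readline()
    if line and sort_key(line.decode("latin-1")) < start_key:
        return f.tell()
    return lo


def lines_in_range(path, start_key, end_key):
    """Yield log lines with start_key <= key <= end_key."""
    with open(path, "rb") as f:
        f.seek(find_start(f, start_key))
        for raw in f:
            line = raw.decode("latin-1")
            if sort_key(line) > end_key:
                break                 # sorted input: nothing later matches
            yield line
```

Calling lines_in_range("access.log", "200912220019", "200912222359") would then yield the matching lines, where access.log is a placeholder file name. Because the search finds the first line at or after the start key directly, no separate rewind step is needed for repeated timestamps.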
Chopping up files just to find a range sounds a bit heavy for such a simple task (binary search is worth considering, though).
Here's my modified function, which is much faster since the sub() and match() calls are gone:
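BEGIN {
    # month name -> month number lookup table
    months["Jan"] = 1
    months["Feb"] = 2
    ....
    months["Dec"] = 12
}
# a, m, s, ok are locals; assumption: t arrives without its leading "["
function dateInRange(t, from, to,    a, m, s, ok) {
    split(t, a, "[/:]");                  # day, month, year, hour, minute
    m = sprintf("%02d", months[a[2]]);    # table lookup replaces match()
    s = a[3] m a[1] a[4] a[5];            # yyyymmddHHMM
    ok = s >= from && s <= to;
    if (!ok && seen == 1) { exit; }       # past the range: stop reading
    return ok;
}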
The months array is defined once in BEGIN and then used to look up the month number. I also made sure the program stops checking records as soon as the date falls out of range (the variable seen is set on the first match).
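The code that sets seen is not shown in the function. To make the lookup-table and early-exit logic explicit, here is a small Python sketch of the same two ideas. It assumes common log format, where the timestamp is the fourth whitespace-separated field (like $4 in AWK), and time-sorted input. The names stamp_key and filter_range are invented for the example.

```python
# Month lookup table, the counterpart of the months[] array.
MONTHS = {name: f"{num:02d}" for num, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def stamp_key(stamp):
    """Turn '[22/Dec/2009:00:19:36' into '200912220019' without a regex."""
    day, month, rest = stamp.lstrip("[")[:17].split("/")
    year, hour, minute = rest.split(":")
    return year + MONTHS[month] + day + hour + minute


def filter_range(lines, start_key, end_key):
    """Yield lines inside the range and stop once the range has been passed."""
    seen = False                      # becomes True on the first match
    for line in lines:
        key = stamp_key(line.split()[3])   # 4th field holds the timestamp
        if start_key <= key <= end_key:
            seen = True
            yield line
        elif seen:
            break                     # sorted input: the range is behind us
```

The early exit only saves work after the range. Lines before the first match are still read and converted one by one.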
Thank you all for your input.