Can I get the MD5sum of a directory from Perl?
I am writing a Perl script (on Windows) that uses File::Find to index a network filesystem. It works great, but it takes a very long time to scan the filesystem. I thought it would be nice to get a checksum of each directory before traversing it and, if it matches the checksum from the previous run, skip that directory. This would eliminate most of the processing, since the files on this filesystem do not change frequently.
On my AIX box, I use this command:
csum -h MD5 /directory
which returns something like this:
5cfe4faf4ad739219b6140054005d506 /directory
The command takes very little time:
time csum -h MD5 /directory
5cfe4faf4ad739219b6140054005d506 /directory
real 0m0.00s
user 0m0.00s
sys 0m0.00s
I have searched CPAN for a module that will do this, but it looks like all of the modules give me an MD5sum for every file in a directory, not for the directory itself.
Is there a way to get the MD5sum for a directory in Perl, or even on Windows, since I could call a Win32 command from Perl?
Thanks in advance!
As far as I know, you cannot get the MD5 of a directory. md5sum on other systems complains when you give it a directory. csum is most likely hashing the contents of the top-level directory file itself, without traversing the tree.
You can grab the modified times for the files and hash them however you like by doing something like this:
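# Recursively print the age in days (via -M) of every file under $dir.
sub dirModified {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @dircontents = readdir($dh);
    closedir($dh);
    foreach my $item (@dircontents) {
        next if $item =~ /^\.\.?$/;    # skip . and ..
        my $path = "$dir/$item";       # readdir returns bare names, not paths
        if (-f $path) {
            my $age = -M $path;
            print "$age : $path - do stuff here\n";
        } elsif (-d $path) {
            dirModified($path);
        }
    }
}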
Yes, this will take some time to run.
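To make the "hash them however you like" step concrete, here is a minimal sketch that folds the name, size, and modification time of each direct entry of a directory into a single MD5 digest using the core Digest::MD5 module. The helper name dir_fingerprint is made up for this illustration, and the sketch assumes that looking only at direct children (no recursion) is enough for your purpose. It still has to stat every entry, so it is not free on a network filesystem.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper (not from any module): returns one hex digest that
# summarises the names, sizes and mtimes of the entries directly in $dir.
sub dir_fingerprint {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    # Sort so the digest does not depend on the order readdir returns.
    my @names = sort grep { !/^\.\.?$/ } readdir($dh);
    closedir($dh);

    my $md5 = Digest::MD5->new;
    for my $name (@names) {
        my @st = stat("$dir/$name") or next;    # entry vanished; skip it
        my ($size, $mtime) = @st[7, 9];
        $md5->add("$name\0$size\0$mtime\n");
    }
    return $md5->hexdigest;
}
```

You would store the returned string per directory after each run and compare it with the freshly computed value on the next run; a changed file inside a subdirectory is only noticed when that subdirectory is fingerprinted as well.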
In addition to the other good answers, let me add the following: if you want a checksum, please use a checksum algorithm instead of a (broken!) hash function.
I don't think you need a cryptographically secure hash function in your file indexer. Instead, you need a way to see whether the directory listing has changed without saving the entire listing. Checksum algorithms do this: they return a different output when the input changes. They can also be faster, since they are simpler than cryptographic hash functions.
It is true that a user could change the directory in a way the checksum would not detect. However, the user would have to choose the filenames deliberately to do so, since ordinary changes to filenames will (most likely) produce a different checksum. Does this "attack" need to be defended against?
You should always consider the consequences of each attack and choose the appropriate tools.
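As a rough illustration of the checksum approach, the sketch below computes a CRC-32 over the sorted names in a directory using crc32() from Compress::Zlib, which ships with modern Perl. The helper name listing_checksum is hypothetical, and the choice of CRC-32 is only one example of a non-cryptographic checksum. Because it looks at names only, it notices added, removed, or renamed entries but not edits to file contents.

```perl
use strict;
use warnings;
use Compress::Zlib;    # exports crc32() by default

# Hypothetical helper: CRC-32 of the sorted entry names in $dir.
# Only the names are checksummed, so content changes go unnoticed.
sub listing_checksum {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @names = sort grep { !/^\.\.?$/ } readdir($dh);
    closedir($dh);
    # Join with NUL, which cannot occur inside a file name.
    return crc32(join("\0", @names));
}
```

Storing one 32-bit number per directory is enough to compare against the next run, which matches the goal of detecting changes without keeping the whole listing.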