Python MD5 Hash Faster Calculation
I will do my best to explain my problem and how I think it might be solved.
I am using this code:
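import hashlib
import os

# downloaddir is the directory to scan; it must be set before this runs
for root, dirs, files in os.walk(downloaddir):
    for infile in files:
        filehash = hashlib.md5()
        # Read in 10 KiB chunks so a large file is never held in memory at once
        with open(os.path.join(root, infile), 'rb') as f:
            while True:
                data = f.read(10240)
                if len(data) == 0:
                    break
                filehash.update(data)
        print("FILENAME: " + infile)
        print("FILE HASH: " + filehash.hexdigest())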
I measure how long the hash takes by wrapping the code with start = time.time() and elapsed = time.time() - start. Pointing the code at a single 653 MB file gives this result:
root@Mars:/home/tiago# python algorithm-timer.py
FILENAME: freebsd.iso
FILE HASH: ace0afedfa7c6e0ad12c77b6652b02ab
12.624
root@Mars:/home/tiago# python algorithm-timer.py
FILENAME: freebsd.iso
FILE HASH: ace0afedfa7c6e0ad12c77b6652b02ab
12.373
root@Mars:/home/tiago# python algorithm-timer.py
FILENAME: freebsd.iso
FILE HASH: ace0afedfa7c6e0ad12c77b6652b02ab
12.540
So it takes roughly 12 seconds for a 653 MB file. My problem is that I intend to use this code in a program that will run across many files, some of which may be 4, 5 or 6 GB, and hashing those will take a long time. What I'd like to know is whether there is a faster way to compute the hash of a file. Maybe with multithreading? I used another script to watch CPU usage second by second, and I can see that my code uses only 1 of my 2 CPUs, and only 25% of it at most. Is there any way I can change this?
Thanks, everyone, for your help.
Computing the hash in your case will almost certainly be I/O-bound (unless you run it on a machine with a very slow processor), so multithreading or processing several files at the same time probably won't give you the results you expect.
Spreading the files across multiple drives, or moving them to a faster (SSD) drive, would probably help, although this is probably not the solution you are looking for.
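One way to check whether the job really is I/O-bound is to time a pass that only reads the file against a pass that reads and hashes it. The following is a minimal sketch for Python 3, not a rigorous benchmark; the function names and the 64 KiB chunk size are my own assumptions.

```python
import hashlib
import os
import tempfile
import time

CHUNK_SIZE = 64 * 1024  # assumed chunk size; tune it for your system


def time_read_only(path, chunk_size=CHUNK_SIZE):
    """Return the seconds spent reading the file without hashing it."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk_size):
            pass
    return time.perf_counter() - start


def time_read_and_hash(path, chunk_size=CHUNK_SIZE):
    """Return (seconds, hex digest) for reading and MD5-hashing the file."""
    start = time.perf_counter()
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return time.perf_counter() - start, digest.hexdigest()


if __name__ == "__main__":
    # Demo on a small temporary file; point `path` at a real file instead.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(4 * 1024 * 1024))
        path = tmp.name
    try:
        read_seconds = time_read_only(path)
        total_seconds, hexdigest = time_read_and_hash(path)
        print("read only:   %.3f s" % read_seconds)
        print("read + hash: %.3f s (%s)" % (total_seconds, hexdigest))
    finally:
        os.remove(path)
```

If the read-only pass takes nearly as long as the read-and-hash pass, the disk is the limit. If it is much shorter, the time is going into MD5 itself. Keep in mind that a file read recently is usually served from the operating system's cache, so repeated runs on the same file can make the disk look faster than it is.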
For what it's worth:
c:\python\Python.exe c:\python\Tools\scripts\md5sum.py cd.iso
takes 9.671 seconds on my laptop (2 GHz Core 2 Duo with an 80 GB SATA hard drive).
As others have pointed out, MD5 hashing here is disk-bound, so your 12-second result is probably close to the fastest you will get.
Also, Python's md5sum.py uses 8096 for the buffer size (although I'm pretty sure they meant either 4096 or 8192).
Increasing the buffer size helped me, up to a point. I started at 1024 and multiplied it by 2^N, incrementing N each time starting from 1. With this method I found that, on my system, a buffer size of 65536 was about as good as it got. However, this only gave me roughly a 7% improvement in run time.
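The buffer-size experiment described above can be sketched roughly as follows. This is not the exact script I ran; it assumes Python 3, and the names md5_of_file and benchmark_buffer_sizes are made up for the illustration.

```python
import hashlib
import os
import tempfile
import time


def md5_of_file(path, buffer_size):
    """Return the MD5 hex digest of `path`, read `buffer_size` bytes at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(buffer_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def benchmark_buffer_sizes(path, start=1024, doublings=8):
    """Time md5_of_file for start, 2*start, 4*start, ... up to start * 2**doublings."""
    results = []
    for n in range(doublings + 1):
        buffer_size = start * 2 ** n
        began = time.perf_counter()
        md5_of_file(path, buffer_size)
        results.append((buffer_size, time.perf_counter() - began))
    return results


if __name__ == "__main__":
    # Demo on a temporary 8 MiB file; use a large real file for meaningful numbers.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
        path = tmp.name
    try:
        for buffer_size, seconds in benchmark_buffer_sizes(path):
            print("%8d bytes: %.4f s" % (buffer_size, seconds))
    finally:
        os.remove(path)
```

The first pass over a file may be slower than later ones because of disk caching, so it is worth running the whole loop more than once before comparing buffer sizes.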
Profiling showed that about 80% of the time is spent in the MD5 update method, with the remaining 20% spent reading the file. Since MD5 is a sequential algorithm, and Python's implementation is already written in C, I don't think there is much you can do to speed up the MD5 part. You could try computing the MD5 of two different files in parallel but, as everyone has said, you will ultimately be limited by the speed of disk access.
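Hashing two files in parallel might look something like the sketch below. It assumes Python 3 and uses threads, which should be enough here because, as far as I understand the hashlib documentation, the GIL is released while update() works on larger chunks of data. The helper md5_many and the worker count of 2 are assumptions for the example.

```python
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BUFFER_SIZE = 65536  # the size that worked best on my system; yours may differ


def md5_of_file(path, buffer_size=BUFFER_SIZE):
    """Return the MD5 hex digest of one file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(buffer_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_many(paths, workers=2):
    """Hash several files concurrently and return {path: hex digest}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = list(pool.map(md5_of_file, paths))
    return dict(zip(paths, digests))


if __name__ == "__main__":
    # Demo on two small temporary files; pass real paths in practice.
    paths = []
    for _ in range(2):
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            tmp.write(os.urandom(2 * 1024 * 1024))
            paths.append(tmp.name)
    try:
        for path, hexdigest in md5_many(paths).items():
            print(path, hexdigest)
    finally:
        for path in paths:
            os.remove(path)
```

Whether this is any faster depends on the disk. On a single spinning drive, two simultaneous readers can even slow things down because the head has to seek back and forth between the files.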