Python MD5 Hash Faster Calculation

Question

I will try my best to explain my problem and my thoughts on how I might solve it.

I am using this code:

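    import hashlib
    import os

    # downloaddir is the directory to scan; it is defined elsewhere in the script
    for root, dirs, files in os.walk(downloaddir):
        for infile in files:
            f = open(os.path.join(root, infile), 'rb')
            filehash = hashlib.md5()
            # read the file in 10 KB chunks so it never has to fit in memory
            while True:
                data = f.read(10240)
                if len(data) == 0:
                    break
                filehash.update(data)
            f.close()
            print "FILENAME: " , infile
            print "FILE HASH: " , filehash.hexdigest()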

and I am measuring how long it takes to compute the hash using start = time.time() and elapsed = time.time() - start. Pointing my code at a 653 MB file, this is the result:

    root@Mars:/home/tiago# python algorithm-timer.py
    FILENAME:  freebsd.iso
    FILE HASH:  ace0afedfa7c6e0ad12c77b6652b02ab
              12.624
    root@Mars:/home/tiago# python algorithm-timer.py
    FILENAME:  freebsd.iso
    FILE HASH:  ace0afedfa7c6e0ad12c77b6652b02ab
              12.373
    root@Mars:/home/tiago# python algorithm-timer.py
    FILENAME:  freebsd.iso
    FILE HASH:  ace0afedfa7c6e0ad12c77b6652b02ab
              12.540

OK, so roughly 12 seconds for a 653 MB file. My problem is that I intend to use this code in a program that will run across many files, some of which may be 4, 5 or 6 GB, and hashing those will take a long time. What I would like to know is whether there is a faster way to compute the hash of a file. Maybe by using multithreading? I used another script to check CPU usage second by second, and I see that my code is only using 1 of my 2 CPUs, and only 25% of it at most. Is there any way I can change this?

Thanks, everyone, for your help.

python multithreading multicore md5

balgan 11 May '10 at 19:00

4 answers

Mavrik · Answer 1 · 2010-05-11T19:11:22+0000

Computing the hash in your case will almost certainly be I/O bound (unless you run it on a machine with a very slow processor), so multithreading or processing multiple files at the same time probably won't give you the results you expect.

Spreading the files across multiple drives or using a faster (SSD) drive will probably help, although this is probably not the solution you are looking for.
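
One way to check whether the job really is I/O bound is to time a plain read of the file and compare it with a read that also feeds MD5. The following is only a rough sketch, written for Python 3; the function names and the 65536-byte chunk size are assumptions chosen for illustration, not something taken from the question.

```python
import hashlib
import time


def time_read_only(path, chunk_size=65536):
    """Read the whole file without hashing and return the elapsed seconds."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk_size):
            pass
    return time.perf_counter() - start


def time_read_and_hash(path, chunk_size=65536):
    """Read the file, feed every chunk to MD5, return (hexdigest, seconds)."""
    md5 = hashlib.md5()
    start = time.perf_counter()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest(), time.perf_counter() - start
```

If the two timings are close, the disk is the limit. Keep in mind that the operating system may cache the file after the first pass, so a second run over the same file can look much faster than a cold read.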

Grzegorz Oledzki · Answer 2 · 2010-05-11T19:12:43+0000

Are disk operations the bottleneck here? Assuming a read speed of 80 MB/s (which is what my hard drive manages), just reading the file takes about 8 seconds.
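
The estimate above is just file size divided by sequential read speed. As a small worked sketch (the helper name is made up for illustration, and the 6 GB case is a hypothetical file at the same assumed 80 MB/s):

```python
def estimated_read_seconds(size_mb, throughput_mb_per_s):
    """Lower bound on the time needed just to read a file sequentially."""
    return size_mb / throughput_mb_per_s


# The 653 MB ISO from the question at the assumed 80 MB/s.
print(round(estimated_read_seconds(653, 80), 1))  # 8.2

# A hypothetical 6 GB file (taken as 6 * 1024 MB) at the same rate.
print(round(estimated_read_seconds(6 * 1024, 80), 1))  # 76.8
```

Any hashing time comes on top of this, so the figure is a floor, not a prediction.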

Seth · Answer 3 · 2010-05-11T19:25:47+0000

For what it's worth:

    c:\python\Python.exe c:\python\Tools\scripts\md5sum.py cd.iso

takes 9.671 seconds on my laptop (2 GHz Core 2 Duo with an 80 GB SATA hard drive).

As others have pointed out, MD5 hashing is disk bound, so your 12-second test is probably close to the fastest you could get.

Also, Python's md5sum.py uses 8096 for the buffer size (although I'm pretty sure they meant either 4096 or 8192).

Justin Peel · Answer 4 · 2010-05-11T20:33:25+0000

Increasing the size of the buffer helped me, up to a point. I started at 1024 and multiplied it by 2^N, incrementing N each time, starting from 1. With this method, I found that on my system a buffer size of 65536 seemed to be about as good as it got. However, this only gave me roughly a 7% improvement in running time.
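
A buffer-size sweep like the one described could look roughly like this. It is a sketch for Python 3 under my own assumptions: the function names and the default of 8 doublings are illustrative, not the exact script used for the numbers above.

```python
import hashlib
import time


def md5_of_file(path, chunk_size):
    """Return the MD5 hex digest of a file, read chunk_size bytes at a time."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def sweep_buffer_sizes(path, max_doublings=8):
    """Time md5_of_file for buffer sizes 1024 * 2**n, n = 1..max_doublings."""
    timings = {}
    for n in range(1, max_doublings + 1):
        chunk_size = 1024 * 2 ** n
        start = time.perf_counter()
        md5_of_file(path, chunk_size)
        timings[chunk_size] = time.perf_counter() - start
    return timings
```

Calling sweep_buffer_sizes on a large file returns a dict mapping each buffer size to its elapsed seconds. The first size tried may pay for a cold disk cache, so it is worth running the sweep more than once before comparing.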

Profiling showed that about 80% of the time is spent in the MD5 update method, with the remaining 20% spent reading the file. Since MD5 is a sequential algorithm, and Python's implementation is already written in C, I don't think there is much you can do to speed up the MD5 part. You can try to compute the MD5 of two different files in parallel, but as everyone has said, you will ultimately be limited by the speed of disk access.
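
Hashing two files in parallel might be sketched as follows, for Python 3. The function names, the 65536-byte chunk size and the default of two workers are assumptions for illustration. Threads are used here because, as far as I know, hashlib releases the GIL while updating with larger chunks of data, so two hashes can overlap on a dual-core machine.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, read chunk_size bytes at a time."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def md5_many(paths, max_workers=2):
    """Hash several files concurrently and return {path: hexdigest}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so digests line up with paths
        digests = list(pool.map(md5_of_file, paths))
    return dict(zip(paths, digests))
```

Whether this helps depends on the disk: on a single spinning drive, two concurrent readers can end up competing for the same head and may even run slower than hashing one file at a time.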
