Hashing Algorithms In Python

While working on a larger project there was a need to detect some changes happened in given data structures. Usually you would immediately start using a hashing algorithm, as md5 for example:

Python 3.4.5 (default, Jan 14 2017, 22:06:30)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> a_string = "Hello World"
>>> another_string = "Hello World"

>>> import hashlib
>>> hashlib.md5(a_string.encode()).hexdigest()
'b10a8db164e0754105b7a99be72e3fe5'
>>> hashlib.md5(another_string.encode()).hexdigest()
'b10a8db164e0754105b7a99be72e3fe5'
>>>
>>> hashlib.md5(a_string.encode()).hexdigest() == hashlib.md5(another_string.encode()).hexdigest()
True

I was wondering which one of all hashing algorithms will be fastest and collision safest and furthermore, isn’t there a better way to do this?

hashing vs. checksum

hashing

Why do i need a hashing algorithm. Why not just using built in hash() method?

I failed because everytime i restarted my Python process I received a different hash from hash method, so this has somehow be randomized on process startup, so i asked Google and i found this excellent post on the python mailing list:

... The security issue exploits Python’s dict and set implementations. Carefully crafted input can lead to extremely long computation times and denials of service. [1] Python dict and set types use hash tables to provide amortized constant time operations. Hash tables require a well-distributed hash function to spread data evenly across the hash table. The security issue is that an attacker could compute thousands of keys with colliding hashes; this causes quadratic algorithmic complexity when the hash table is constructed. To alleviate the problem, the new releases add randomization to the hashing of Python’s string types (bytes/str in Python 3 and str/unicode in Python 2), datetime.date, and datetime.datetime. This prevents an attacker from computing colliding keys of these types without access to the Python process.

Hash randomization causes the iteration order of dicts and sets to be unpredictable and differ across Python runs. Python has never guaranteed iteration order of keys in a dict or set, and applications are advised to never rely on it. ...

[1] http://www.ocert.org/advisories/ocert-2011-003.html

—Benjamin Peterson / Python Mailing List

No good news and a anyway a bad idea to rely it.

checksum

Next try was md5. But how fast is md5 and why not using just a cyclic redundancy check for this. So i wrote a small testscript and included crc and adler as well just to get a feeling about the speed and efficiency of those algorithms as well

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
#!/usr/bin/env python
import zlib
import hashlib
import time
import sys


algorithms = [getattr(hashlib, algo) for algo in hashlib.algorithms_guaranteed]
algorithms.append(zlib.crc32)
algorithms.append(zlib.adler32)


def run(bytes_data, algorithm, iterations=1000):
    start = time.time()
    for idx in range(0, iterations):
        algorithm(bytes_data)
    return time.time() - start


def iteration(title, sample):
    print(title + ' (%d chars):' % len(sample))
    values = []
    for algorithm in algorithms:
        duration = run(sample, algorithm)
        values.append((duration, algorithm))
        print('%.06fs using %s' % (duration, algorithm.__name__))
    print()
    return values


if __name__ == '__main__':
    sample = ''.join([chr(i) for i in range(32, 123)]).encode()
    print('Python: %s' % sys.version)
    print('Sample: %s' % sample)
    print()

    iteration('Short', sample)
    iteration('Long', sample * 10000)

Result

I was running these tests on an 3,4 GHz Intel Core i7 on macOS 10.12.5.

Python: 3.4.5 (default, Jan 14 2017, 22:06:30)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]

Sample string used:

b' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz'

For short sample string with (91 chars):

0.000331s using openssl_sha512
0.000410s using openssl_md5
0.000531s using openssl_sha224
0.000524s using openssl_sha256
0.000321s using openssl_sha384
0.000418s using openssl_sha1
0.000189s using crc32
0.000153s using adler32

For long sample string above with (91000 chars):

2.102090s using openssl_sha512
1.395054s using openssl_md5
3.119225s using openssl_sha224
3.385265s using openssl_sha256
2.185888s using openssl_sha384
1.321250s using openssl_sha1
0.295018s using crc32
0.071074s using adler32

Conclusion

  • Use adler32 if you have large data and speed matters.
  • Use md5 if you need more security and still be quite fast.
  • Never use crc32, adler32 or md5 if security matters.

Comments

comments powered by Disqus