Hashing Algorithms In Python¶
While working on a larger project, I needed to detect changes in given data structures. The obvious first approach is to reach for a hashing algorithm, such as MD5:
Python 3.4.5 (default, Jan 14 2017, 22:06:30)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> a_string = "Hello World"
>>> another_string = "Hello World"
>>> import hashlib
>>> hashlib.md5(a_string.encode()).hexdigest()
'b10a8db164e0754105b7a99be72e3fe5'
>>> hashlib.md5(another_string.encode()).hexdigest()
'b10a8db164e0754105b7a99be72e3fe5'
>>>
>>> hashlib.md5(a_string.encode()).hexdigest() == hashlib.md5(another_string.encode()).hexdigest()
True
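Since the actual goal is detecting changes in data structures rather than plain strings, one possible approach (a sketch of my own, assuming the structures are JSON-serializable) is to serialize the structure deterministically before hashing, e.g. with `json.dumps(sort_keys=True)`:

```python
import hashlib
import json

def digest(data):
    """Serialize a JSON-compatible structure deterministically, then hash it."""
    # sort_keys=True makes dict key order irrelevant to the result
    payload = json.dumps(data, sort_keys=True).encode()
    return hashlib.md5(payload).hexdigest()

old = {'name': 'Alice', 'tags': ['a', 'b']}
new = {'tags': ['a', 'b'], 'name': 'Alice'}  # same content, different key order

print(digest(old) == digest(new))  # True: identical content, identical digest
```

Two structurally equal dicts now always produce the same digest, regardless of insertion order.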
This made me wonder: which of the available hashing algorithms is the fastest and the safest against collisions, and is there perhaps a better way to do this altogether?
hashing vs. checksum¶
hashing¶
Why do I need a hashing algorithm at all? Why not just use the built-in hash() function? That approach failed because every time I restarted my Python process I received a different value from hash(), so it is apparently randomized at process startup. A quick search led me to this excellent post on the Python mailing list:
… The security issue exploits Python’s dict and set implementations. Carefully crafted input can lead to extremely long computation times and denials of service. [1] Python dict and set types use hash tables to provide amortized constant time operations. Hash tables require a well-distributed hash function to spread data evenly across the hash table. The security issue is that an attacker could compute thousands of keys with colliding hashes; this causes quadratic algorithmic complexity when the hash table is constructed. To alleviate the problem, the new releases add randomization to the hashing of Python’s string types (bytes/str in Python 3 and str/unicode in Python 2), datetime.date, and datetime.datetime. This prevents an attacker from computing colliding keys of these types without access to the Python process.
Hash randomization causes the iteration order of dicts and sets to be unpredictable and differ across Python runs. Python has never guaranteed iteration order of keys in a dict or set, and applications are advised to never rely on it. …
[1] http://www.ocert.org/advisories/ocert-2011-003.html
—Benjamin Peterson / Python Mailing List
So hash() is out: its values are deliberately randomized per process, and relying on them across runs was a bad idea anyway.
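The randomization is easy to observe: hash the same string in two fresh interpreter processes and compare. A small sketch (PYTHONHASHSEED='random' is forced explicitly, so a fixed seed inherited from the current environment cannot mask the effect):

```python
import os
import subprocess
import sys

# Hash the same string in two separate interpreter processes.
cmd = [sys.executable, '-c', 'print(hash("Hello World"))']

# Force per-process random seeding even if PYTHONHASHSEED is pinned
# in the current environment.
env = dict(os.environ, PYTHONHASHSEED='random')

first = subprocess.check_output(cmd, env=env).strip()
second = subprocess.check_output(cmd, env=env).strip()

print(first != second)  # different seed per process, different hash
```

Within a single process the values stay stable; only across processes do they differ.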
checksum¶
The next candidate was md5. But how fast is md5, and why not just use a cyclic redundancy check instead? So I wrote a small test script and included crc32 and adler32 as well, to get a feeling for the speed and efficiency of these algorithms:
#!/usr/bin/env python
import zlib
import hashlib
import time
import sys

algorithms = [getattr(hashlib, algo) for algo in hashlib.algorithms_guaranteed]
algorithms.append(zlib.crc32)
algorithms.append(zlib.adler32)


def run(bytes_data, algorithm, iterations=1000):
    start = time.time()
    for idx in range(0, iterations):
        algorithm(bytes_data)
    return time.time() - start


def iteration(title, sample):
    print(title + ' (%d chars):' % len(sample))
    values = []
    for algorithm in algorithms:
        duration = run(sample, algorithm)
        values.append((duration, algorithm))
        print('%.06fs using %s' % (duration, algorithm.__name__))
    print()
    return values


if __name__ == '__main__':
    sample = ''.join([chr(i) for i in range(32, 123)]).encode()
    print('Python: %s' % sys.version)
    print('Sample: %s' % sample)
    print()
    iteration('Short', sample)
    iteration('Long', sample * 10000)
Result¶
I ran these tests on a 3.4 GHz Intel Core i7 on macOS 10.12.5.
Python: 3.4.5 (default, Jan 14 2017, 22:06:30)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]
Sample string used:
b' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz'
For the short sample string (91 chars):
0.000331s using openssl_sha512
0.000410s using openssl_md5
0.000531s using openssl_sha224
0.000524s using openssl_sha256
0.000321s using openssl_sha384
0.000418s using openssl_sha1
0.000189s using crc32
0.000153s using adler32
For the long sample string above (91000 chars):
2.102090s using openssl_sha512
1.395054s using openssl_md5
3.119225s using openssl_sha224
3.385265s using openssl_sha256
2.185888s using openssl_sha384
1.321250s using openssl_sha1
0.295018s using crc32
0.071074s using adler32
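One caveat when reading these numbers: the hashlib constructors and the zlib functions have different interfaces. hashlib.md5(data) computes the digest of the bytes passed in but returns a hash object, so you still need hexdigest() to read the value, while zlib.crc32 and zlib.adler32 simply return an integer:

```python
import hashlib
import zlib

data = b'Hello World'

# hashlib: returns a hash object; read the value via hexdigest() or digest()
print(hashlib.md5(data).hexdigest())  # 'b10a8db164e0754105b7a99be72e3fe5'

# zlib checksums: plain unsigned integers, no object wrapper
print(zlib.crc32(data))
print(zlib.adler32(data))
```

So the benchmark above compares the full computation in each case, but comparing values across the two families means comparing hex strings against integers.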
Conclusion¶
- Use adler32 if you have large data and speed matters.
- Use md5 if you need more collision resistance while still being reasonably fast.
- Never use crc32, adler32 or md5 if security matters.
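For the "large data" case it is also worth knowing that all of these algorithms can be fed incrementally, so the whole payload never has to sit in memory at once. A sketch of both styles (note the different documented starting values: crc32 starts at 0, adler32 at 1):

```python
import hashlib
import zlib

chunks = [b'Hello ', b'World']

# hashlib: create the object once, then feed chunks via update()
h = hashlib.md5()
for chunk in chunks:
    h.update(chunk)
print(h.hexdigest() == hashlib.md5(b'Hello World').hexdigest())  # True

# zlib: pass the running value back in as the second argument
crc = 0      # crc32's documented starting value
for chunk in chunks:
    crc = zlib.crc32(chunk, crc)
print(crc == zlib.crc32(b'Hello World'))  # True

adler = 1    # adler32's documented starting value
for chunk in chunks:
    adler = zlib.adler32(chunk, adler)
print(adler == zlib.adler32(b'Hello World'))  # True
```

Both styles produce exactly the same result as hashing the concatenated bytes in one call.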