Hashing Algorithms In Python¶

While working on a larger project there was a need to detect some changes happened in given data structures. Usually you would immediately start using a hashing algorithm, as md5 for example:

Python 3.4.5 (default, Jan 14 2017, 22:06:30)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> a_string = "Hello World"
>>> another_string = "Hello World"

>>> import hashlib
>>> hashlib.md5(a_string.encode()).hexdigest()
'b10a8db164e0754105b7a99be72e3fe5'
>>> hashlib.md5(another_string.encode()).hexdigest()
'b10a8db164e0754105b7a99be72e3fe5'
>>>
>>> hashlib.md5(a_string.encode()).hexdigest() == hashlib.md5(another_string.encode()).hexdigest()
True

I was wondering which one of all hashing algorithms will be fastest and collision safest and furthermore, isn’t there a better way to do this?

hashing vs. checksum¶

hashing¶

Why do i need a hashing algorithm. Why not just using built in hash() method?

I failed because everytime i restarted my Python process I received a different hash from hash method, so this has somehow be randomized on process startup, so i asked Google and i found this excellent post on the python mailing list:

… The security issue exploits Python’s dict and set implementations. Carefully crafted input can lead to extremely long computation times and denials of service. [1] Python dict and set types use hash tables to provide amortized constant time operations. Hash tables require a well-distributed hash function to spread data evenly across the hash table. The security issue is that an attacker could compute thousands of keys with colliding hashes; this causes quadratic algorithmic complexity when the hash table is constructed. To alleviate the problem, the new releases add randomization to the hashing of Python’s string types (bytes/str in Python 3 and str/unicode in Python 2), datetime.date, and datetime.datetime. This prevents an attacker from computing colliding keys of these types without access to the Python process.

Hash randomization causes the iteration order of dicts and sets to be unpredictable and differ across Python runs. Python has never guaranteed iteration order of keys in a dict or set, and applications are advised to never rely on it. …

[1] http://www.ocert.org/advisories/ocert-2011-003.html

—Benjamin Peterson / Python Mailing List

No good news and a anyway a bad idea to rely it.

checksum¶

Next try was md5. But how fast is md5 and why not using just a cyclic redundancy check for this. So i wrote a small testscript and included crc and adler as well just to get a feeling about the speed and efficiency of those algorithms as well

#!/usr/bin/env python
import zlib
import hashlib
import time
import sys


algorithms = [getattr(hashlib, algo) for algo in hashlib.algorithms_guaranteed]
algorithms.append(zlib.crc32)
algorithms.append(zlib.adler32)


def run(bytes_data, algorithm, iterations=1000):
    start = time.time()
    for idx in range(0, iterations):
        algorithm(bytes_data)
    return time.time() - start


def iteration(title, sample):
    print(title + ' (%d chars):' % len(sample))
    values = []
    for algorithm in algorithms:
        duration = run(sample, algorithm)
        values.append((duration, algorithm))
        print('%.06fs using %s' % (duration, algorithm.__name__))
    print()
    return values


if __name__ == '__main__':
    sample = ''.join([chr(i) for i in range(32, 123)]).encode()
    print('Python: %s' % sys.version)
    print('Sample: %s' % sample)
    print()

    iteration('Short', sample)
    iteration('Long', sample * 10000)

Result¶

I was running these tests on an 3,4 GHz Intel Core i7 on macOS 10.12.5.

Python: 3.4.5 (default, Jan 14 2017, 22:06:30)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]

Sample string used:

b' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz'

For short sample string with (91 chars):

000331s using openssl_sha512
000410s using openssl_md5
000531s using openssl_sha224
000524s using openssl_sha256
000321s using openssl_sha384
000418s using openssl_sha1
000189s using crc32
000153s using adler32

For long sample string above with (91000 chars):

102090s using openssl_sha512
395054s using openssl_md5
119225s using openssl_sha224
385265s using openssl_sha256
185888s using openssl_sha384
321250s using openssl_sha1
295018s using crc32
071074s using adler32

Conclusion¶

Use adler32 if you have large data and speed matters.
Use md5 if you need more security and still be quite fast.
Never use crc32, adler32 or md5 if security matters.

Continuous integration with GitLab and Docker Official Grafana docker image on OpenShift

Comments

comments powered by Disqus

Hashing Algorithms In Python¶

hashing vs. checksum¶

hashing¶

checksum¶

Result¶

Conclusion¶

Comments

widerin.net

Navigation

Recent Posts

Categories

Archives