DeepHash Reference¶

Note

These docs are OUTDATED.

Visit Zepworks.com for the current documentations.

class deepdiff.deephash.DeepHash(obj, *, hashes=None, exclude_types=None, exclude_paths=None, exclude_regex_paths=None, hasher=None, ignore_repetition=True, significant_digits=None, number_format_notation='f', apply_hash=True, ignore_type_in_groups=None, ignore_string_type_changes=False, ignore_numeric_type_changes=False, ignore_type_subclasses=False, ignore_string_case=False, number_to_string_func=None, **kwargs)¶

DeepHash

Note

DeepDiff documentations are now hosted on Zepworks.com

What you see here are the old documentations.

DeepHash calculates the hash of objects based on their contents in a deterministic way. This way 2 objects with the same content should have the same hash.

The main usage of DeepHash is to calculate the hash of otherwise unhashable objects. For example you can use DeepHash to calculate the hash of a set or a dictionary!

At the core of it, DeepHash is a deterministic serialization of your object into a string so it can be passed to a hash function. By default it uses Murmur 3 128 bit hash function which is a fast, non-cryptographic hashing function. You have the option to pass any another hashing function to be used instead.

If it can’t find Murmur3 package (mmh3) installed, it uses Python’s built-in SHA256 for hashing which is considerably slower than Murmur3. So it is advised that you install Murmur3 by running pip install ‘deepdiff[murmur]

Import

>>> from deepdiff import DeepHash

Parameters

obj : any object, The object to be hashed based on its content.

hashes: dictionary, default = empty dictionary

A dictionary of {object or object id: object hash} to start with. Any object that is encountered and it is already in the hashes dictionary or its id is in the hashes dictionary, will re-use the hash that is provided by this dictionary instead of re-calculating its hash. This is typically used when you have a series of objects to be hashed and there might be repeats of the same object.

exclude_types: list, default = None

List of object types to exclude from hashing.

exclude_paths: list, default = None

List of paths to exclude from the report. If only one item, you can path it as a string instead of a list containing only one path.

exclude_regex_paths: list, default = None

List of string regex paths or compiled regex paths objects to exclude from the report. If only one item, you can path it as a string instead of a list containing only one regex path.

hasher: function. default = DeepHash.murmur3_128bit

hasher is the hashing function. The default is DeepHash.murmur3_128bit. But you can pass another hash function to it if you want. For example a cryptographic hash function or Python’s builtin hash function. All it needs is a function that takes the input in string format and returns the hash.

You can use it by passing: hasher=hash for Python’s builtin hash.

The following alternatives are already provided:

hasher=DeepHash.murmur3_128bit
hasher=DeepHash.murmur3_64bit
hasher=DeepHash.sha1hex

ignore_repetition: Boolean, default = True

If repetitions in an iterable should cause the hash of iterable to be different. Note that the deepdiff diffing functionality lets this to be the default at all times. But if you are using DeepHash directly, you can set this parameter.

significant_digitsint >= 0, default=None

By default the significant_digits compares only that many digits AFTER the decimal point. However you can set override that by setting the number_format_notation=”e” which will make it mean the digits in scientific notation.

Important: This will affect ANY number comparison when it is set.

Note: If ignore_numeric_type_changes is set to True and you have left significant_digits to the default of None, it gets automatically set to 12. The reason is that normally when numbers from 2 different types are compared, instead of comparing the values, we only report the type change. However when ignore_numeric_type_changes=True, in order compare numbers from different types to each other, we need to convert them all into strings. The significant_digits will be used to make sure we accurately convert all the numbers into strings in order to report the changes between them.

Internally it uses “{:.Xf}”.format(Your Number) to compare numbers where X=significant_digits when the number_format_notation is left as the default of “f” meaning fixed point.

Note that “{:.3f}”.format(1.1135) = 1.113, but “{:.3f}”.format(1.11351) = 1.114

For Decimals, Python’s format rounds 2.5 to 2 and 3.5 to 4 (to the closest even number)

When you set the number_format_notation=”e”, we use “{:.Xe}”.format(Your Number) where X=significant_digits.

number_format_notationstring, default=”f”

number_format_notation is what defines the meaning of significant digits. The default value of “f” means the digits AFTER the decimal point. “f” stands for fixed point. The other option is “e” which stands for exponent notation or scientific notation.

apply_hash: Boolean, default = True

DeepHash at its core is doing deterministic serialization of objects into strings. Then it hashes the string. The only time you want the apply_hash to be False is if you want to know what the string representation of your object is BEFORE it gets hashed.

ignore_type_in_groups

Ignore type changes between members of groups of types. For example if you want to ignore type changes between float and decimals etc. Note that this is a more granular feature. Most of the times the shortcuts provided to you are enough. The shortcuts are ignore_string_type_changes which by default is False and ignore_numeric_type_changes which is by default False. You can read more about those shortcuts in this page. ignore_type_in_groups gives you more control compared to the shortcuts.

For example lets say you have specifically str and byte datatypes to be ignored for type changes. Then you have a couple of options:

Set ignore_string_type_changes=True which is the default.
Set ignore_type_in_groups=[(str, bytes)]. Here you are saying if we detect one type to be str and the other one bytes, do not report them as type change. It is exactly as passing ignore_type_in_groups=[DeepDiff.strings] or ignore_type_in_groups=DeepDiff.strings .

Now what if you want also typeA and typeB to be ignored when comparing agains each other?

ignore_type_in_groups=[DeepDiff.strings, (typeA, typeB)]
or ignore_type_in_groups=[(str, bytes), (typeA, typeB)]

ignore_string_type_changes: Boolean, default = True

string type conversions should not affect the hash output when this is set to True. For example “Hello” and b”Hello” should produce the same hash.

By setting it to True, both the string and bytes of hello return the same hash.

ignore_numeric_type_changes: Boolean, default = False

numeric type conversions should not affect the hash output when this is set to True. For example 10, 10.0 and Decimal(10) should produce the same hash. When ignore_numeric_type_changes is set to True, all numbers are converted to strings with the precision of significant_digits parameter and number_format_notation notation. If no significant_digits is passed by the user, a default value of 12 is used.

ignore_type_subclasses

Use ignore_type_subclasses=True so when ignoring type (class), the subclasses of that class are ignored too.

ignore_string_case

Whether to be case-sensitive or not when comparing strings. By settings ignore_string_case=False, strings will be compared case-insensitively.

Returns

A dictionary of {item: item hash}. If your object is nested, it will build hashes of all the objects it contains too.

Examples

Let’s say you have a dictionary object.

>>> from deepdiff import DeepHash
>>> obj = {1: 2, 'a': 'b'}

If you try to hash it:

>>> hash(obj)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'

But with DeepHash:

>>> from deepdiff import DeepHash
>>> obj = {1: 2, 'a': 'b'}
>>> DeepHash(obj)
{1: 234041559348429806012597903916437026784, 2: 148655924348182454950690728321917595655, 'a': 119173504597196970070553896747624927922, 'b': 4994827227437929991738076607196210252, '!>*id4488569408': 32452838416412500686422093274247968754}
So what is exactly the hash of obj in this case? DeepHash is calculating the hash of the obj and any other object that obj contains. The output of DeepHash is a dictionary of object IDs to their hashes. In order to get the hash of obj itself, you need to use the object (or the id of object) to get its hash:
>>> hashes = DeepHash(obj)
>>> hashes[obj]
34150898645750099477987229399128149852
Which you can write as:
>>> hashes = DeepHash(obj)[obj]
At first it might seem weird why DeepHash(obj)[obj] but remember that DeepHash(obj) is a dictionary of hashes of all other objects that obj contains too.

The result hash is 34150898645750099477987229399128149852 which is generated by Murmur 3 128bit hashing algorithm. If you prefer to use another hashing algorithm, you can pass it using the hasher parameter. Read more about Murmur3 here: https://en.wikipedia.org/wiki/MurmurHash

If you do a deep copy of obj, it should still give you the same hash:
>>> from copy import deepcopy
>>> obj2 = deepcopy(obj)
>>> DeepHash(obj2)[obj2]
34150898645750099477987229399128149852
Note that by default DeepHash will include string type differences. So if your strings were bytes:
>>> obj3 = {1: 2, b'a': b'b'}
>>> DeepHash(obj3)[obj3]
64067525765846024488103933101621212760
But if you want the same hash if string types are different, set ignore_string_type_changes to True:
>>> DeepHash(obj3, ignore_string_type_changes=True)[obj3]
34150898645750099477987229399128149852
ignore_numeric_type_changes is by default False too.
>>> obj1 = {4:10}
>>> obj2 = {4.0: Decimal(10.0)}
>>> DeepHash(obj1)[4] == DeepHash(obj2)[4.0]
False
But by setting it to True, we can get the same hash.
>>> DeepHash(obj1, ignore_numeric_type_changes=True)[4] == DeepHash(obj2, ignore_numeric_type_changes=True)[4.0]
True

number_format_notation: String, default = “f”

number_format_notation is what defines the meaning of significant digits. The default value of “f” means the digits AFTER the decimal point. “f” stands for fixed point. The other option is “e” which stands for exponent notation or scientific notation.

ignore_string_type_changes: Boolean, default = True

By setting it to True, both the string and bytes of hello return the same hash.

>>> DeepHash(b'hello', ignore_string_type_changes=True)
{b'hello': 221860156526691709602818861774599422448}
>>> DeepHash('hello', ignore_string_type_changes=True)
{'hello': 221860156526691709602818861774599422448}

ignore_numeric_type_changes: Boolean, default = False

For example if significant_digits=5, 1.1, Decimal(1.1) are both converted to 1.10000

That way they both produce the same hash.

>>> t1 = {1: 1, 2: 2.22}
>>> t2 = {1: 1.0, 2: 2.22}
>>> DeepHash(t1)[1]
231678797214551245419120414857003063149
>>> DeepHash(t1)[1.0]
231678797214551245419120414857003063149

You can pass a list of tuples or list of lists if you have various type groups. When t1 and t2 both fall under one of these type groups, the type change will be ignored. DeepDiff already comes with 2 groups: DeepDiff.strings and DeepDiff.numbers . If you want to pass both:

>>> from deepdiff import DeepDiff
>>> ignore_type_in_groups = [DeepDiff.strings, DeepDiff.numbers]

ignore_type_in_groups example with custom objects:

>>> class Burrito:
...     bread = 'flour'
...     def __init__(self):
...         self.spicy = True
...
>>>
>>> class Taco:
...     bread = 'flour'
...     def __init__(self):
...         self.spicy = True
...
>>>
>>> burrito = Burrito()
>>> taco = Taco()
>>>
>>> burritos = [burrito]
>>> tacos = [taco]
>>>
>>> d1 = DeepHash(burritos, ignore_type_in_groups=[(Taco, Burrito)])
>>> d2 = DeepHash(tacos, ignore_type_in_groups=[(Taco, Burrito)])
>>> d1[burrito] == d2[taco]
True

ignore_type_subclasses

Use ignore_type_subclasses=True so when ignoring type (class), the subclasses of that class are ignored too.

>>> from deepdiff import DeepHash
>>>
>>> class ClassB:
...     def __init__(self, x):
...         self.x = x
...     def __repr__(self):
...         return "obj b"
...
>>>
>>> class ClassC(ClassB):
...     def __repr__(self):
...         return "obj c"
...
>>> obj_b = ClassB(1)
>>> obj_c = ClassC(1)
>>>
>>> # Since these 2 objects are from 2 different classes, the hashes are different by default.
... # ignore_type_in_groups is set to [(ClassB, )] which means to ignore any type conversion between
... # objects of classB and itself which does not make sense but it illustrates a better point when
... # ignore_type_subclasses is set to be True.
... hashes_b = DeepHash(obj_b, ignore_type_in_groups=[(ClassB, )])
>>> hashes_c = DeepHash(obj_c, ignore_type_in_groups=[(ClassB, )])
>>> hashes_b[obj_b] != hashes_c[obj_c]
True
>>>
>>> # Hashes of these 2 objects will be the same when ignore_type_subclasses is set to True
... hashes_b = DeepHash(obj_b, ignore_type_in_groups=[(ClassB, )], ignore_type_subclasses=True)
>>> hashes_c = DeepHash(obj_c, ignore_type_in_groups=[(ClassB, )], ignore_type_subclasses=True)
>>> hashes_b[obj_b] == hashes_c[obj_c]
True

ignore_string_case

Whether to be case-sensitive or not when comparing strings. By settings ignore_string_case=False, strings will be compared case-insensitively.

>>> from deepdiff import DeepHash
>>> DeepHash('hello')['hello'] == DeepHash('heLLO')['heLLO']
False
>>> DeepHash('hello', ignore_string_case=True)['hello'] == DeepHash('heLLO', ignore_string_case=True)['heLLO']
True

number_format_notationstring, default=”f”

When numbers are converted to the string, you have the choices between “f” as fixed point and “e” as scientific notation:

>>> t1=10002
>>> t2=10004
>>> t1_hash = DeepHash(t1, significant_digits=3, number_format_notation="f")
>>> t2_hash = DeepHash(t2, significant_digits=3, number_format_notation="f")
>>>
>>> t1_hash[t1] == t2_hash[t2]
False
>>>
>>>
>>> # Now we use the scientific notation
... t1_hash = DeepHash(t1, significant_digits=3, number_format_notation="e")
>>> t2_hash = DeepHash(t2, significant_digits=3, number_format_notation="e")
>>>
>>> t1_hash[t1] == t2_hash[t2]
True

Defining your own number_to_string_func

Lets say you want the hash of numbers below 100 to be the same for some reason.

>>> from deepdiff import DeepHash
>>> from deepdiff.helper import number_to_string
>>> def custom_number_to_string(number, *args, **kwargs):
...     number = 100 if number < 100 else number
...     return number_to_string(number, *args, **kwargs)
...
>>> t1 = [10, 12, 100000]
>>> t2 = [50, 63, 100021]
>>> t1_hash = DeepHash(t1, significant_digits=3, number_format_notation="e", number_to_string_func=custom_number_to_string)
>>> t2_hash = DeepHash(t2, significant_digits=3, number_format_notation="e", number_to_string_func=custom_number_to_string)
>>> t1_hash[t1] == t2_hash[t2]
True

So both lists produced the same hash thanks to the low significant digits for 100000 vs 100021 and also the custom_number_to_string that converted all numbers below 100 to be 100!

static murmur3_128bit(obj)¶: Use murmur3_128bit for bit hash by passing this method: hasher=DeepHash.murmur3_128bit This hasher is the default hasher.

static murmur3_64bit(obj)¶: Use murmur3_64bit for 64 bit hash by passing this method: hasher=DeepHash.murmur3_64bit

static sha1hex(obj)¶: Use Sha1 as a cryptographic hash.

static sha256hex(obj)¶: Use Sha256 as a cryptographic hash.

Back to DeepDiff OLD 4.0.7 documentation!

DeepHash Reference¶

DeepDiff

Navigation

Related Topics