Abstract
-
Occurs when running Hash Function on two different inputs and we obtain the same output
-
In the above example, and result in the same value
-
There 3 approaches to handle - Hash Map Expansion, Separate Chaining and Open Addressing
Hash Map Expansion
- Expand the output space to avoid Hash Collision totally if expansion is big enough
- Reduce the chance of hash collision if expansion isn’t big enough
- In the above example, when and result in the same output , we can double the output space to avoid the collision
- After expansion, ,
Intense Computation Needed
With the change of the output space size, it requires us to re-run the Hash Function on all keys to obtain the new Bucket
It is very time consuming to recalculate hashes and move all key-value pairs from old arrays to new arrays
So the best practice is to reserve a pretty decent space to avoid frequent expanding
Load Factor
- A faction of total elements inside Hash Map and total buckets inside the hash map
- Measures the severity of Hash Collision, acts as a triggering condition for Hash Map Expansion
Java
When Load Factor exceeds
0.75
, Hash Map Expansion will be triggered to expand twice the original size
Separate Chaining
- Use each Bucket in the underlying output space of Hash Map to hold a Linked List node. We put all the conflicted key-value pairs as nodes in the Linked List. This allows us to live with a certain degree of Hash Collision. Thus, reducing the frequency of Hash Map Expansion
- You can find the code implementation of separate chaining here
More Memory Usage
Each key-value pair node needs space to store a pointer to point to the next node
Linear Efficient
Operation like searching, insertion, and deletion takes time on Linked List.
We can converted the long Linked List to AVL Tree or Red Black Tree to optimise the search efficiency from to .
Implementation in Programming Languages
Java
- Starting from JDK 1.8, when Hash Map length reaches 64 & the Linked List length is 8. Linked List will be converted to Red Black Tree to improve performance
Golang
- Every Bucket has maximum 8 key-value pairs. Once exceeded, it will be linked to a overflowing bucket
- When there is too many overflowing buckets, Hash Map Expansion will be performed to ensure performance
Mechanism of Common Operations
Search key-value pair with key
- Obtain the index of bucket by passing the key to the Hash Function
- Obtain the starting node of the Linked List with the index
- Iterate through the Linked List to find key-value pair node that matches with the given key
Add key-value pair
- We perform Search key-value pair with key to ensure it is a new key-value pair we want to add
- If unable to find, we add a new node with the key-value pair to the Linked List. Else we return duplicated key error
Delete key-value pair
- We perform Search key-value pair with key to ensure the key we want to delete exists
- Delete the key-value pair node if there is one
Open Addressing
- Similar to Separate Chaining, it is an approach to live with Hash Collision, so as to reduce the frequency of Hash Map Expansion
- But unlike Separate Chaining, open addressing doesn’t leverage on extra data structures like Linked List to live with. There are mainly 2 approaches - Linear Probing and Double Hashing
Delete key-value pair
We need a special indicator to remove a key-value pair. If we just make it empty, the key-value pairs stored after it will be ignored when we are trying to search a key-value pair that has Hash Collision with the deleted key-value pair. Because the key-value pair may be stored right after the deleted key-value pair
Linear Probing
- The idea is to find the next empty Bucket for the conflicted key-value pair linearly
Key-value pair clustering
When the array is filled up with key-value pairs in a continuous manner. We may need to iterate through many key-value pairs, including key-value pairs that has a hash index to get/delete the desired key-value pair
Implementation in Programming Language
Python
- Uses Pseudorandomness to find the next empty bucket for conflicted key-value pair
Mechanism of Common Operations
Search key-value pair with key
- Obtain the Array index by passing the key to the Hash Function
- If the key-value pair in the bucket doesn’t match with the key, iterate through the buckets linearly until the correct key-value is found
- Else, if the bucket is empty, then key-value pair isn’t inside, return None
Add key-value pair
- Obtain the Array index by passing the key to the Hash Function
- If the bucket already has a key-value pair, then iterating linearly until a empty bucket is found, and then insert the key-value pair
Delete key-value pair
- Obtain the Array index by passing the key to the Hash Function
- Iterating through linearly until the desired key-value pair is found
- Deleting the key-value pair by placing a special indicator in the bucket to denote the bucket is empty and it shouldn’t break the iteration flow
Double Hashing
- The idea here to find the next empty bucket for the conflicted key-value pair by executing the Hash Function multiple times
More Distributed
Not easy to have key-value pair clustering
More Computational Demanding
More Hash Function performed means more computation is required
Mechanism of Common Operations
Search key-value pair with key
- Obtain the Array index by passing the key to the Hash Function
- If the desired key-value pair isn’t inside the bucket, perform the Hash Function again until we find the correct key-value pair
- If empty bucket is countered, then the desired key-value isn’t stored in the Hash Map
Add key-value pair
- Obtain the Array index by passing the key to the Hash Function
- If the bucket already has a key-value pair, then keep performing Hash Function until an empty bucket is found, then insert the key-value pair