COMP3138›Hashing Basics

Design and Analysis of AlgorithmsTopic 32 of 53

Hashing Basics

8 minread

1,338words

Intermediatelevel

Hashing Basics

Hashing is a powerful technique used to store and retrieve data efficiently, and it forms the foundation of hash tables and hash maps. The idea is to map data to a fixed-size value (called a hash code) using a hash function. This hash value allows for fast data access, providing constant time complexity for many operations like search, insertion, and deletion.

Here’s an overview of the key concepts of hashing, including the components involved and the common methods used.

1. Hash Function

A hash function is an algorithm that takes an input (often called the key) and maps it to a fixed-size value, usually an integer, called a hash code. The goal of a hash function is to produce a uniform distribution of hash values for different inputs, minimizing collisions (when two keys produce the same hash value).

Properties of a Good Hash Function:

Deterministic: The same input always produces the same hash value.
Uniformity: The function should distribute hash values uniformly across the hash table to reduce clustering.
Efficient: The hash function should be computationally efficient, ideally with constant time complexity O(1).
Minimize Collisions: Collisions should be minimized, where two different keys produce the same hash value.

Example of a Simple Hash Function:

For an integer key, a simple hash function might be the modulo operation:

hash(key) = key % table_size

For a string key, a more sophisticated approach is often used, such as:

hash(key) = sum of ASCII values of characters % table_size

This is a very basic example. In practice, hash functions are more complex to avoid collisions and ensure efficient distribution.

2. Hash Table

A hash table (or hash map) is a data structure that uses a hash function to map keys to values. The key is hashed to an index in an array (called a bucket or slot), and the value is stored at that index.

Structure of a Hash Table:

Array: A fixed-size array that holds values at specific indices, determined by the hash code.
Buckets/Slots: The array is divided into buckets or slots, where the data is stored. Each slot holds the value associated with a key.
Hash Function: The function that converts the key into an index within the array. The hash function determines where in the array the key-value pair will be placed.

Insertion in a Hash Table:

Compute the hash of the key using the hash function.
Map the hash value to an index in the array (bucket).
Insert the key-value pair at the corresponding index in the array.

Example of a Hash Table:

Suppose you want to store the following pairs in a hash table:

Key: "apple", Value: 5
Key: "banana", Value: 10
Key: "cherry", Value: 15

Let's assume a hash table with 5 slots and a hash function that returns the length of the key modulo 5. The hash function would look like this:

hash(key) = len(key) % 5

For each key:

"apple" → hash("apple") = 5 % 5 = 0
"banana" → hash("banana") = 6 % 5 = 1
"cherry" → hash("cherry") = 6 % 5 = 1

In this case, "banana" and "cherry" will both collide and be placed in the same slot (bucket 1).

3. Collisions and Collision Resolution

A collision occurs when two keys produce the same hash value and thus are mapped to the same bucket. Collisions are inevitable because the number of possible keys is generally much larger than the number of available buckets.

Common Collision Resolution Techniques:

Chaining (Separate Chaining):
- Each bucket in the hash table stores a linked list (or another collection) of all elements that hash to the same index.
- When a collision occurs, the new key-value pair is added to the linked list at the corresponding bucket.
Example:
- If both "banana" and "cherry" hash to bucket 1, the bucket 1 will hold a linked list of key-value pairs:
```
bucket[1] → ("banana", 10) → ("cherry", 15)
```
Open Addressing:
- When a collision occurs, the algorithm probes other buckets in the hash table to find an empty spot.
- There are several probing strategies:
  - Linear Probing: If a collision occurs, check the next bucket. If that’s occupied, check the one after, and so on, wrapping around to the beginning of the array if necessary.
  - Quadratic Probing: Instead of checking the next bucket, check buckets at intervals of increasing size (1, 4, 9, etc.).
  - Double Hashing: Use a second hash function to calculate a new step size for probing if a collision occurs.
Example (Linear Probing):
- If bucket 1 is already occupied, the hash table checks bucket 2, then bucket 3, and so on, until an empty bucket is found.

4. Load Factor and Resizing

The load factor of a hash table is a measure of how full the table is. It is defined as:

load factor = (number of elements in the table) / (size of the table)

A high load factor (i.e., a lot of keys in relatively few buckets) increases the likelihood of collisions, which can degrade performance.

Resizing:

When the load factor exceeds a certain threshold (often around 0.7 or 70%), the hash table may need to be resized (doubled in size, for example) to maintain efficient operations. Resizing involves:

Allocating a larger array.
Rehashing all existing keys and inserting them into the new array according to their new hash values.

Resizing and rehashing ensure that the performance of the hash table remains efficient, with operations staying close to O(1) on average.

5. Time Complexity of Hash Table Operations

Search: O(1) on average, assuming a good hash function and low collision rates. In the worst case (with many collisions), the time complexity can degrade to O(n) (e.g., if using chaining and all keys hash to the same bucket).
Insert: O(1) on average, as inserting into the correct bucket is typically a constant-time operation. However, when resizing occurs, the insertion time may temporarily increase due to rehashing.
Delete: O(1) on average, similar to search. Deleting an element requires locating it using the hash value and then removing it.
Resize/rehash: O(n), where n is the number of elements in the table. Rehashing requires moving all existing elements to the new table.

6. Advantages and Disadvantages of Hashing

Advantages:

Fast Lookups: Searching, inserting, and deleting elements in a hash table can be done in constant time O(1) on average.
Efficient Use of Space: Hash tables provide a good balance of time efficiency and space usage, especially when the hash function distributes keys uniformly across the buckets.

Disadvantages:

Collisions: Collisions are inevitable, and while they can be handled efficiently, they add some overhead to the operations.
Resizing: Hash tables may need resizing, which can temporarily degrade performance.
Non-Order: Hash tables do not maintain any specific order of elements, unlike data structures like binary search trees.
Memory Usage: Hash tables require extra memory to maintain an array and handle collisions (e.g., linked lists or additional probing).

Applications of Hashing

Hash Maps: Used in many programming languages (e.g., dict in Python, Map in Java) for fast lookups.
Caches: Hashing is used in cache systems to store recently accessed data, allowing for quick retrieval.
Databases: Hashing is used for indexing in databases to speed up query execution.
Cryptography: Cryptographic hash functions (e.g., SHA-256) are used to create secure hash values for storing passwords or verifying data integrity.
Unique Identifiers: Hashing can generate unique identifiers for records in large datasets.

Conclusion

Hashing is a powerful technique for quickly storing and retrieving data. By using a hash function to map keys to indices in a hash table, it offers constant-time average performance for common operations like insertion, deletion, and search. The performance depends on factors like the hash function, collision resolution strategy, and the load factor, but when used appropriately, hashing provides a highly efficient data storage solution.

Previous topic 31

Analysis of Insertion and Deletion in BST

Next topic 33

Examples of Hash Functions

Past Papers

Open this section to load past papers

Click on Show Past Papers to see past papers.

COMP3138›Hashing Basics

Design and Analysis of AlgorithmsTopic 32 of 53

Hashing Basics

8 minread

1,338words

Intermediatelevel

Hashing Basics

Here’s an overview of the key concepts of hashing, including the components involved and the common methods used.

1. Hash Function

Properties of a Good Hash Function:

Deterministic: The same input always produces the same hash value.
Uniformity: The function should distribute hash values uniformly across the hash table to reduce clustering.
Efficient: The hash function should be computationally efficient, ideally with constant time complexity O(1).
Minimize Collisions: Collisions should be minimized, where two different keys produce the same hash value.

Example of a Simple Hash Function:

For an integer key, a simple hash function might be the modulo operation:

hash(key) = key % table_size

For a string key, a more sophisticated approach is often used, such as:

hash(key) = sum of ASCII values of characters % table_size

This is a very basic example. In practice, hash functions are more complex to avoid collisions and ensure efficient distribution.

2. Hash Table

Structure of a Hash Table:

Array: A fixed-size array that holds values at specific indices, determined by the hash code.
Buckets/Slots: The array is divided into buckets or slots, where the data is stored. Each slot holds the value associated with a key.
Hash Function: The function that converts the key into an index within the array. The hash function determines where in the array the key-value pair will be placed.

Insertion in a Hash Table:

Compute the hash of the key using the hash function.
Map the hash value to an index in the array (bucket).
Insert the key-value pair at the corresponding index in the array.

Example of a Hash Table:

Suppose you want to store the following pairs in a hash table:

Key: "apple", Value: 5
Key: "banana", Value: 10
Key: "cherry", Value: 15

Let's assume a hash table with 5 slots and a hash function that returns the length of the key modulo 5. The hash function would look like this:

hash(key) = len(key) % 5

For each key:

"apple" → hash("apple") = 5 % 5 = 0
"banana" → hash("banana") = 6 % 5 = 1
"cherry" → hash("cherry") = 6 % 5 = 1

In this case, "banana" and "cherry" will both collide and be placed in the same slot (bucket 1).

3. Collisions and Collision Resolution

Common Collision Resolution Techniques:

Chaining (Separate Chaining):
- Each bucket in the hash table stores a linked list (or another collection) of all elements that hash to the same index.
- When a collision occurs, the new key-value pair is added to the linked list at the corresponding bucket.
Example:
- If both "banana" and "cherry" hash to bucket 1, the bucket 1 will hold a linked list of key-value pairs:
```
bucket[1] → ("banana", 10) → ("cherry", 15)
```
Open Addressing:
- When a collision occurs, the algorithm probes other buckets in the hash table to find an empty spot.
- There are several probing strategies:
  - Linear Probing: If a collision occurs, check the next bucket. If that’s occupied, check the one after, and so on, wrapping around to the beginning of the array if necessary.
  - Quadratic Probing: Instead of checking the next bucket, check buckets at intervals of increasing size (1, 4, 9, etc.).
  - Double Hashing: Use a second hash function to calculate a new step size for probing if a collision occurs.
Example (Linear Probing):
- If bucket 1 is already occupied, the hash table checks bucket 2, then bucket 3, and so on, until an empty bucket is found.

4. Load Factor and Resizing

The load factor of a hash table is a measure of how full the table is. It is defined as:

load factor = (number of elements in the table) / (size of the table)

A high load factor (i.e., a lot of keys in relatively few buckets) increases the likelihood of collisions, which can degrade performance.

Resizing:

Allocating a larger array.
Rehashing all existing keys and inserting them into the new array according to their new hash values.

Resizing and rehashing ensure that the performance of the hash table remains efficient, with operations staying close to O(1) on average.

5. Time Complexity of Hash Table Operations

Search: O(1) on average, assuming a good hash function and low collision rates. In the worst case (with many collisions), the time complexity can degrade to O(n) (e.g., if using chaining and all keys hash to the same bucket).
Insert: O(1) on average, as inserting into the correct bucket is typically a constant-time operation. However, when resizing occurs, the insertion time may temporarily increase due to rehashing.
Delete: O(1) on average, similar to search. Deleting an element requires locating it using the hash value and then removing it.
Resize/rehash: O(n), where n is the number of elements in the table. Rehashing requires moving all existing elements to the new table.

6. Advantages and Disadvantages of Hashing

Advantages:

Fast Lookups: Searching, inserting, and deleting elements in a hash table can be done in constant time O(1) on average.
Efficient Use of Space: Hash tables provide a good balance of time efficiency and space usage, especially when the hash function distributes keys uniformly across the buckets.

Disadvantages:

Collisions: Collisions are inevitable, and while they can be handled efficiently, they add some overhead to the operations.
Resizing: Hash tables may need resizing, which can temporarily degrade performance.
Non-Order: Hash tables do not maintain any specific order of elements, unlike data structures like binary search trees.
Memory Usage: Hash tables require extra memory to maintain an array and handle collisions (e.g., linked lists or additional probing).

Applications of Hashing

Hash Maps: Used in many programming languages (e.g., dict in Python, Map in Java) for fast lookups.
Caches: Hashing is used in cache systems to store recently accessed data, allowing for quick retrieval.
Databases: Hashing is used for indexing in databases to speed up query execution.
Cryptography: Cryptographic hash functions (e.g., SHA-256) are used to create secure hash values for storing passwords or verifying data integrity.
Unique Identifiers: Hashing can generate unique identifiers for records in large datasets.

Conclusion

Previous topic 31

Analysis of Insertion and Deletion in BST

Next topic 33

Examples of Hash Functions

Past Papers

Open this section to load past papers

Click on Show Past Papers to see past papers.