Analysis of AlgorithmsTopic 23 of 28

Hashing

8 minread

1,277words

Intermediatelevel

Hashing

Hashing is a technique used to map data of arbitrary size (such as a string or an object) to a fixed-size value, typically an integer, called a hash code or hash value. The idea is to use this hash value as an index in a hash table to enable efficient access to data. Hashing is widely used in data structures such as hash tables, hash maps, and sets.

Key Concepts in Hashing

Hash Function:
- A hash function takes an input (or key) and returns a fixed-size string of bytes, usually an integer.
- The hash function should map different keys to distinct hash values, although in practice, different inputs can sometimes produce the same hash value (this is known as a collision).
- The goal is to generate a hash value that distributes keys uniformly across the hash table to minimize collisions.
Good properties of a hash function:
- Deterministic: The same input always produces the same hash output.
- Efficient: It should be computationally fast.
- Uniform Distribution: It should distribute the keys uniformly across the hash table to minimize collisions.
Hash Table:
- A hash table (or hash map) is an array-like data structure where data is stored at positions (or buckets) determined by a hash function.
- Each bucket contains either a list or a direct entry corresponding to a key-value pair.
- The hash table uses the hash value to determine where to store the key-value pair.
Basic Operations on a Hash Table:
- Insert: Insert a key-value pair into the hash table. The position is determined by the hash of the key.
- Search: Given a key, compute the hash and search the corresponding bucket for the key.
- Delete: Compute the hash for the key and remove the corresponding key-value pair from the bucket.

Collision Handling

A collision occurs when two different keys hash to the same index in the hash table. Since hash tables rely on indices (or buckets) to store data, a collision needs to be handled to maintain data integrity.

Here are common methods of handling collisions:

Chaining:
- In chaining, each bucket of the hash table stores a linked list (or another dynamic data structure) to hold all the elements that hash to the same index.
- When a collision occurs (i.e., two keys hash to the same index), the new key-value pair is added to the list at that index.
Example of Chaining:
- Suppose we have a hash table with 10 buckets, and we insert the following keys: 12, 22, 32, 42.
- All of these keys may hash to the same index, say 2. In that case, a linked list at index 2 will contain the elements [12, 22, 32, 42].
Advantages:
- Chaining can handle an unlimited number of collisions, as it stores multiple elements in the same bucket.
- Simple to implement.
Disadvantages:
- If the hash function is poor or the table is poorly sized, the linked lists may become long, reducing the performance of operations.
Open Addressing:
- In open addressing, when a collision occurs, the algorithm searches for the next available bucket within the hash table (based on a probe sequence) and places the key-value pair there.
- Open addressing does not use linked lists but stores the values directly in the table. The probe sequence can be based on various methods such as linear probing, quadratic probing, or double hashing.
Types of Open Addressing:
- Linear Probing: If a collision occurs at index i, the algorithm checks index i+1, i+2, etc., until an empty bucket is found.
- Quadratic Probing: In quadratic probing, the next index is determined by a quadratic function. For example, if a collision occurs at index i, the next index checked could be i+1^2, i+2^2, and so on.
- Double Hashing: If a collision occurs at index i, a second hash function is applied to the key to calculate a new probe value.
Advantages:
- Open addressing uses less memory since it does not require linked lists.
- Can be more efficient in practice if the load factor is maintained carefully.
Disadvantages:
- Open addressing can lead to clustering, where groups of consecutive occupied buckets reduce the efficiency of the probe sequence.
- Requires a larger table size to maintain performance, especially when the table is near full capacity.

Load Factor and Resizing

The load factor is the ratio of the number of elements in the hash table to the total number of buckets. A load factor close to 1 means the hash table is nearly full, leading to more collisions and performance degradation.
To maintain efficiency, hash tables often resize (or rehash) when the load factor exceeds a certain threshold (e.g., 0.7). This involves:
- Doubling the size of the hash table.
- Rehashing all the existing keys into the new table based on the new table size.

Resizing helps to keep the hash table operations (insert, delete, and search) efficient, typically maintaining average time complexities of $O(1)$ for these operations.

Hashing Applications

Hash Tables (or Hash Maps):
- Hash tables are used to implement key-value data structures, such as hash maps and hash sets.
- Commonly used in applications requiring fast lookups, insertions, and deletions, such as:
  - Caching (e.g., in-memory caches for web servers).
  - Associative arrays in programming languages.
  - Indexing in databases.
Cryptographic Hashing:
- Hash functions are used in cryptography for generating hash codes (or digests) of data. Cryptographic hash functions (like SHA-256) are designed to be collision-resistant, meaning that it is computationally infeasible to generate two different inputs that produce the same hash.
- Used in digital signatures, password storage, data integrity, and blockchain technologies.
Unique Identifiers:
- Hashing is used to generate unique identifiers for objects or data, ensuring that data retrieval or comparison operations can be done quickly.

Example of Hashing in Practice

Let’s assume we want to implement a hash table that stores student records, where the key is the student ID and the value is the student’s name.

Step 1: Hash Function: Suppose we use a simple hash function that returns the modulus of the student ID with the table size:

hash(student_id) = student_id % table_size

Step 2: Insertion: Let’s insert the following records into a hash table of size 5:

Student ID 101, Name "Alice"
Student ID 106, Name "Bob"
Student ID 111, Name "Charlie"

Using our hash function, we calculate the hash values:

$101 \% 5 = 1$ (insert at index 1)
$106 \% 5 = 1$ (collision, use chaining to insert at index 1)
$111 \% 5 = 1$ (collision, use chaining to insert at index 1)

After inserting, the hash table (using chaining) might look like this:

Index 0: []
Index 1: [101 -> Alice, 106 -> Bob, 111 -> Charlie]
Index 2: []
Index 3: []
Index 4: []

Step 3: Searching: To search for "Bob", we compute:

hash(106) = 106 % 5 = 1

Then, we traverse the linked list at index 1 to find Bob.

Summary

Hashing is a technique for mapping data to fixed-size values (hash codes) for efficient lookup and storage.
Hash tables use these hash values to store data at specific indices, and collision handling techniques (such as chaining and open addressing) are employed to manage hash collisions.
Hashing applications include databases, caches, cryptography, and unique identifiers.
Load factor and resizing are important factors to consider when using hash tables to maintain efficient operations.

Past Papers

Open this section to load past papers

Click on Show Past Papers to see past papers.

COMP4121›Hashing

Analysis of AlgorithmsTopic 23 of 28

Hashing

8 minread

1,277words

Intermediatelevel

Hashing

Key Concepts in Hashing

Hash Function:
- A hash function takes an input (or key) and returns a fixed-size string of bytes, usually an integer.
- The hash function should map different keys to distinct hash values, although in practice, different inputs can sometimes produce the same hash value (this is known as a collision).
- The goal is to generate a hash value that distributes keys uniformly across the hash table to minimize collisions.
Good properties of a hash function:
- Deterministic: The same input always produces the same hash output.
- Efficient: It should be computationally fast.
- Uniform Distribution: It should distribute the keys uniformly across the hash table to minimize collisions.
Hash Table:
- A hash table (or hash map) is an array-like data structure where data is stored at positions (or buckets) determined by a hash function.
- Each bucket contains either a list or a direct entry corresponding to a key-value pair.
- The hash table uses the hash value to determine where to store the key-value pair.
Basic Operations on a Hash Table:
- Insert: Insert a key-value pair into the hash table. The position is determined by the hash of the key.
- Search: Given a key, compute the hash and search the corresponding bucket for the key.
- Delete: Compute the hash for the key and remove the corresponding key-value pair from the bucket.

Collision Handling

Here are common methods of handling collisions:

Chaining:
- In chaining, each bucket of the hash table stores a linked list (or another dynamic data structure) to hold all the elements that hash to the same index.
- When a collision occurs (i.e., two keys hash to the same index), the new key-value pair is added to the list at that index.
Example of Chaining:
- Suppose we have a hash table with 10 buckets, and we insert the following keys: 12, 22, 32, 42.
- All of these keys may hash to the same index, say 2. In that case, a linked list at index 2 will contain the elements [12, 22, 32, 42].
Advantages:
- Chaining can handle an unlimited number of collisions, as it stores multiple elements in the same bucket.
- Simple to implement.
Disadvantages:
- If the hash function is poor or the table is poorly sized, the linked lists may become long, reducing the performance of operations.
Open Addressing:
- In open addressing, when a collision occurs, the algorithm searches for the next available bucket within the hash table (based on a probe sequence) and places the key-value pair there.
- Open addressing does not use linked lists but stores the values directly in the table. The probe sequence can be based on various methods such as linear probing, quadratic probing, or double hashing.
Types of Open Addressing:
- Linear Probing: If a collision occurs at index i, the algorithm checks index i+1, i+2, etc., until an empty bucket is found.
- Quadratic Probing: In quadratic probing, the next index is determined by a quadratic function. For example, if a collision occurs at index i, the next index checked could be i+1^2, i+2^2, and so on.
- Double Hashing: If a collision occurs at index i, a second hash function is applied to the key to calculate a new probe value.
Advantages:
- Open addressing uses less memory since it does not require linked lists.
- Can be more efficient in practice if the load factor is maintained carefully.
Disadvantages:
- Open addressing can lead to clustering, where groups of consecutive occupied buckets reduce the efficiency of the probe sequence.
- Requires a larger table size to maintain performance, especially when the table is near full capacity.

Load Factor and Resizing

The load factor is the ratio of the number of elements in the hash table to the total number of buckets. A load factor close to 1 means the hash table is nearly full, leading to more collisions and performance degradation.
To maintain efficiency, hash tables often resize (or rehash) when the load factor exceeds a certain threshold (e.g., 0.7). This involves:
- Doubling the size of the hash table.
- Rehashing all the existing keys into the new table based on the new table size.

Resizing helps to keep the hash table operations (insert, delete, and search) efficient, typically maintaining average time complexities of $O(1)$ for these operations.

Hashing Applications

Hash Tables (or Hash Maps):
- Hash tables are used to implement key-value data structures, such as hash maps and hash sets.
- Commonly used in applications requiring fast lookups, insertions, and deletions, such as:
  - Caching (e.g., in-memory caches for web servers).
  - Associative arrays in programming languages.
  - Indexing in databases.
Cryptographic Hashing:
- Hash functions are used in cryptography for generating hash codes (or digests) of data. Cryptographic hash functions (like SHA-256) are designed to be collision-resistant, meaning that it is computationally infeasible to generate two different inputs that produce the same hash.
- Used in digital signatures, password storage, data integrity, and blockchain technologies.
Unique Identifiers:
- Hashing is used to generate unique identifiers for objects or data, ensuring that data retrieval or comparison operations can be done quickly.

Example of Hashing in Practice

Let’s assume we want to implement a hash table that stores student records, where the key is the student ID and the value is the student’s name.

Step 1: Hash Function: Suppose we use a simple hash function that returns the modulus of the student ID with the table size:

hash(student_id) = student_id % table_size

Step 2: Insertion: Let’s insert the following records into a hash table of size 5:

Student ID 101, Name "Alice"
Student ID 106, Name "Bob"
Student ID 111, Name "Charlie"

Using our hash function, we calculate the hash values:

$101 \% 5 = 1$ (insert at index 1)
$106 \% 5 = 1$ (collision, use chaining to insert at index 1)
$111 \% 5 = 1$ (collision, use chaining to insert at index 1)

After inserting, the hash table (using chaining) might look like this:

Index 0: []
Index 1: [101 -> Alice, 106 -> Bob, 111 -> Charlie]
Index 2: []
Index 3: []
Index 4: []

Step 3: Searching: To search for "Bob", we compute:

hash(106) = 106 % 5 = 1

Then, we traverse the linked list at index 1 to find Bob.

Summary

Hashing is a technique for mapping data to fixed-size values (hash codes) for efficient lookup and storage.
Hash tables use these hash values to store data at specific indices, and collision handling techniques (such as chaining and open addressing) are employed to manage hash collisions.
Hashing applications include databases, caches, cryptography, and unique identifiers.
Load factor and resizing are important factors to consider when using hash tables to maintain efficient operations.

Past Papers

Open this section to load past papers

Click on Show Past Papers to see past papers.