Hashing
Hashing is a process used in computer science to convert input (like a string, number, or any other type of data) into a fixed-size string of bytes. The output, known as a hash, is typically a number generated from a string of text. The idea is to use a hash function to map data of arbitrary size to fixed-size values. Hash functions are used in various applications like in data retrieval for faster access, in cryptographic functions, and more.
In the context of hash tables, which are a type of data structure, hashing plays a crucial role. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.
There are different methods to handle collisions (when two different inputs produce the same hash). Two popular methods are:
Linear Probing: In linear probing, when a collision occurs, the hash table checks the next element in the array (by incrementing the index) and continues doing so until an empty slot is found.
Chaining: Chaining handles collisions by maintaining a list of all elements that hash to the same location. If a collision occurs, the item is added to the list at the hashed index.
Let's consider a simple example of a hash table with 10 slots, into which we insert 5 elements using both linear probing and chaining. For simplicity, assume the hash function is hash(x) = x % 5, where % is the modulo operator.
Linear Probing
Initial Table (size=10):
[0] -
[1] -
[2] -
[3] -
[4] -
[5] -
[6] -
[7] -
[8] -
[9] -
Let's insert the element 29. 29 % 5 = 4. Since slot 4 is empty, 29 can be inserted there.
[0] -
[1] -
[2] -
[3] -
[4] - 29
[5] -
[6] -
[7] -
[8] -
[9] -
Let's insert the element 17. 17 % 5 = 2. Since slot 2 is empty, 17 can be inserted there.
[0] -
[1] -
[2] - 17
[3] -
[4] - 29
[5] -
[6] -
[7] -
[8] -
[9] -
Let's insert the element 12. 12 % 5 = 2. Since slot 2 is occupied, we need to linearly probe for the next empty slot
and that happens to be slot 3. So, 12 goes there.
[0] -
[1] -
[2] - 17
[3] - 12
[4] - 29
[5] -
[6] -
[7] -
[8] -
[9] -
To insert 23 (23 % 5 = 3), we first notice that slot 3 is taken and so is slot 4. Slot 5 is the next empty slot.
So, 23 goes there.
[0] -
[1] -
[2] - 17
[3] - 12
[4] - 29
[5] - 23
[6] -
[7] -
[8] -
[9] -
To insert 32 (32 % 5 = 2), we see that slots 2 through 5 are all taken.
So, 32 goes into slot 6.
[0] -
[1] -
[2] - 17
[3] - 12
[4] - 29
[5] - 23
[6] - 32
[7] -
[8] -
[9] -
Chaining
Initial Table (size=10):
[0] -
[1] -
[2] -
[3] -
[4] -
[5] -
[6] -
[7] -
[8] -
[9] -
Let's insert the element 29. 29 % 5 = 4. Since slot 4 is empty, 29 can be inserted there.
[0] -
[1] -
[2] -
[3] -
[4] - 29
[5] -
[6] -
[7] -
[8] -
[9] -
Let's insert the element 17. 17 % 5 = 2. Since slot 2 is empty, 17 can be inserted there.
[0] -
[1] -
[2] - 17
[3] -
[4] - 29
[5] -
[6] -
[7] -
[8] -
[9] -
Let's insert the element 12. 12 % 5 = 2. Since slot 2 is occupied, we start a linked list there: 17 stays at slot 2 and points to 12, forming a chain.
[0] -
[1] -
[2] - 17 -> 12
[3] -
[4] - 29
[5] -
[6] -
[7] -
[8] -
[9] -
To insert 23 (23 % 5 = 3), it can go directly into slot 3, as that slot is empty.
[0] -
[1] -
[2] - 17 -> 12
[3] - 23
[4] - 29
[5] -
[6] -
[7] -
[8] -
[9] -
To insert 32 (32 % 5 = 2), we extend the chain at slot 2 and append 32 after 12.
[0] -
[1] -
[2] - 17 -> 12 -> 32
[3] - 23
[4] - 29
[5] -
[6] -
[7] -
[8] -
[9] -
In linear probing, as the table gets filled, it becomes harder to find empty slots, leading to a potential increase in the time it takes to insert new elements. In chaining, although the insertion is straightforward, the time to search for an element might increase as the chains get longer.
Let's write pseudocode for both linear probing and chaining in the context of a hash table. We'll assume that the hash table is implemented using an array, and for chaining, each array element points to a linked list.
Linear Probing
In linear probing, when we insert a new element, we start at the position indicated by the hash function. If that position is already occupied, we move sequentially through the array until we find an empty slot.
function hash(key):
    return key % tableSize

function insert(hashTable, key):
    index = hash(key)
    while hashTable[index] is not empty:
        index = (index + 1) % tableSize   // assumes the table is not full
    hashTable[index] = key

function search(hashTable, key):
    index = hash(key)
    while hashTable[index] is not empty:
        if hashTable[index] is key:
            return hashTable[index]
        index = (index + 1) % tableSize
    return null
Chaining
In chaining, each position in the hash table points to a linked list (or another dynamic data structure) that stores all elements which hash to the same index. This way, collisions are handled by adding elements to the corresponding list.
function hash(key):
    return key % tableSize

function insert(hashTable, key):
    index = hash(key)
    if hashTable[index] is empty:
        hashTable[index] = new LinkedList()
    hashTable[index].append(key)

function search(hashTable, key):
    index = hash(key)
    if hashTable[index] is not empty:
        for item in hashTable[index]:
            if item is key:
                return item
    return null
Explanation
In both methods, the hash function takes a key and computes an index based on the size of the hash table. In linear probing (the insert function), if the computed index is occupied, we keep moving to the next index until an empty slot is found. In chaining, each table index points to a list; when inserting, we add the new value to this list, which allows multiple values to exist at the same index without conflict.
The search functions in both methods try to locate the item using the hash of the key. In linear probing, the search checks each index sequentially if the initial one is occupied. In chaining, it traverses the linked list at the particular index.
Both methods have their trade-offs in terms of performance and memory usage, and the choice between them often depends on the expected load factor and the nature of the keys being inserted.
Let's write pseudocode for the contains operation for both linear probing and chaining in a hash table. The contains operation checks whether a given key is present in the hash table.
Linear Probing
In linear probing, the contains operation starts at the index indicated by the hash function. If the key at that position doesn't match, it probes subsequent indices linearly until it either finds the key or encounters an empty slot, which indicates the key is not in the table.
function hash(key):
    return key % tableSize

function contains(hashTable, key):
    index = hash(key)
    while hashTable[index] is not empty:
        if hashTable[index] is key:
            return true
        index = (index + 1) % tableSize
    return false
Chaining
In chaining, each index of the hash table points to a list of keys. The contains function computes the hash of the key to find the correct list, then iterates over that list to check whether the key is present.
function hash(key):
    return key % tableSize

function contains(hashTable, key):
    index = hash(key)
    if hashTable[index] is not empty:
        for item in hashTable[index]:
            if item is key:
                return true
    return false
Explanation
The hash function in both methods computes the index in the hash table based on the key. In linear probing, contains looks through each index sequentially, starting from the hashed index. If it finds the key, it returns true; if it finds an empty slot, it returns false, since this indicates the key is not present in the table. In chaining, contains goes directly to the hashed index and iterates through the list at that index. If the key is found in the list, it returns true; if the list is empty or the key is not in it, it returns false.
These contains operations are efficient in their respective methods. Linear probing can suffer from clustering, which hurts performance, especially in a nearly full table. Chaining avoids clustering but can develop long lists to search through, which hurts performance if the hash table is undersized or the hash function does not distribute keys evenly.
Let's write C code for both linear probing and chaining in a hash table. We'll create a basic implementation of each, focusing on the insert and contains operations.
Linear Probing in C
First, let's look at linear probing. We'll assume a fixed-size hash table where each slot can hold an integer value.
#include <stdio.h>
#include <stdbool.h>

#define TABLE_SIZE 10

int hashTable[TABLE_SIZE];

int hash(int key) {
    return key % TABLE_SIZE;
}

void initHashTable() {
    for (int i = 0; i < TABLE_SIZE; i++) {
        hashTable[i] = -1; // Using -1 to indicate an empty slot
    }
}

bool insert(int key) {
    int index = hash(key);
    int originalIndex = index;
    do {
        if (hashTable[index] == -1) {
            hashTable[index] = key;
            return true;
        }
        index = (index + 1) % TABLE_SIZE;
    } while (index != originalIndex); // Stop if we come back to the start
    return false; // Table is full
}

bool contains(int key) {
    int index = hash(key);
    int originalIndex = index;
    while (hashTable[index] != -1) { // An empty slot means the key is absent
        if (hashTable[index] == key) {
            return true;
        }
        index = (index + 1) % TABLE_SIZE;
        if (index == originalIndex) {
            break; // Wrapped all the way around a full table
        }
    }
    return false;
}

int main() {
    initHashTable();
    insert(15);
    insert(25); // Collides with 15 and probes to the next slot
    printf("%d\n", contains(15)); // Should print 1 (true)
    printf("%d\n", contains(30)); // Should print 0 (false)
    return 0;
}
Chaining in C
For chaining, we need a linked list at each index of the hash table. Each list node will store a key.
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

#define TABLE_SIZE 10

typedef struct Node {
    int key;
    struct Node *next;
} Node;

Node *hashTable[TABLE_SIZE];

int hash(int key) {
    return key % TABLE_SIZE;
}

void initHashTable() {
    for (int i = 0; i < TABLE_SIZE; i++) {
        hashTable[i] = NULL;
    }
}

bool insert(int key) {
    int index = hash(key);
    Node *newNode = (Node*) malloc(sizeof(Node));
    if (newNode == NULL) {
        return false; // Failed to allocate memory
    }
    newNode->key = key;
    newNode->next = hashTable[index]; // Insert at the head of the chain
    hashTable[index] = newNode;
    return true;
}

bool contains(int key) {
    int index = hash(key);
    Node *current = hashTable[index];
    while (current != NULL) {
        if (current->key == key) {
            return true;
        }
        current = current->next;
    }
    return false;
}

int main() {
    initHashTable();
    insert(15);
    insert(25); // Collides with 15 but is chained at the same index
    printf("%d\n", contains(15)); // Should print 1 (true)
    printf("%d\n", contains(30)); // Should print 0 (false)
    return 0;
}
Explanation
In both implementations, the hash function computes the index for a key. In linear probing, we insert by finding the next available slot and check existence by searching linearly from the hashed index. In chaining, we manage collisions by keeping a linked list at each index: insertion adds a new node at the beginning of the list, and the contains function searches through the list at the given index.
Both versions use TABLE_SIZE to define the hash table's size; this value can be adjusted as needed.
Let's perform a complexity analysis for both linear probing and chaining in hash tables, focusing on the insert and contains operations. We'll analyze the time and space complexities of each method.
Linear Probing Complexity Analysis
Time Complexity:
- insert: In the best case, when there are no collisions, the time complexity is O(1), since we insert directly at the hashed index. In the worst case, when there are many collisions and the table is nearly full, it can degrade to O(n), where n is the size of the hash table, because linear probing may have to scan many slots to find an empty one.
- contains: Similar to insert, the best case is O(1) when the key is found at its hashed index. The worst case is also O(n) if the key is not in the table, as the search may traverse the entire table.
Space Complexity:
- The space complexity for the hash table itself is O(n), where n is the size of the table. The additional space per key is O(1).
Chaining Complexity Analysis
Time Complexity:
- insert: In the average case, assuming a good hash function and a reasonably uniform distribution of keys, insertion is O(1): adding a node at the head of a linked list is a constant-time operation. In the worst case, when many keys collide at a single index, it can degrade to O(n), but this is unlikely with a good hash function.
- contains: In the average case, this is O(1) as well, because chains stay short under a uniform hash function. In the worst case it is O(k), where k is the length of the chain at that index, which can reach O(n) if all keys collide.
Space Complexity:
- The array of buckets is O(n), where n is the table size. The linked lists take O(m) additional space, where m is the total number of stored keys, since every key occupies one node.
Important to note
Linear probing has lower per-key space overhead but can suffer much worse time in the worst case, when collisions cluster.
Chaining has better average-case time because it handles collisions gracefully, but it pays slightly more memory for the list nodes and pointers.