Hashing

Hashing is a process used in computer science to convert input (like a string, number, or any other type of data) into a fixed-size string of bytes. The output, known as a hash, is typically a number generated from a string of text. The idea is to use a hash function to map data of arbitrary size to fixed-size values. Hash functions are used in various applications like in data retrieval for faster access, in cryptographic functions, and more.

In the context of hash tables, which are a type of data structure, hashing plays a crucial role. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.

There are different methods to handle collisions (when two different inputs produce the same hash). Two popular methods are:

  1. Linear Probing: In linear probing, when a collision occurs, the hash table checks the next element in the array (by incrementing the index) and continues doing so until an empty slot is found.

  2. Chaining: Chaining handles collisions by maintaining a list of all elements that hash to the same location. If a collision occurs, the item is added to the list at the hashed index.

Let's consider a simple example of a hash table that has 10 slots and then we insert 5 elements using both linear probing and chaining. For simplicity, let's assume the hash function is hash(x) = x % 5, where % is the modulo operator.

Linear Probing

Initial Table (size=10):
[0] - 
[1] - 
[2] - 
[3] - 
[4] - 
[5] -
[6] -
[7] -
[8] -
[9] -

Let's insert the element 29. 29 % 5 = 4. Since slot 4 is empty, 29 can be inserted there.

[0] - 
[1] - 
[2] - 
[3] - 
[4] - 29
[5] -
[6] -
[7] -
[8] -
[9] -

Let's insert the element 17. 17 % 5 = 2. Since slot 2 is empty, 17 can be inserted there.

[0] - 
[1] - 
[2] - 17
[3] - 
[4] - 29
[5] -
[6] -
[7] -
[8] -
[9] -

Let's insert the element 12. 12 % 5 = 2. Since slot 2 is occupied, we need to linearly probe for the next empty slot
and that happens to be slot 3. So, 12 goes there.

[0] - 
[1] - 
[2] - 17
[3] - 12
[4] - 29
[5] -
[6] -
[7] -
[8] -
[9] -

To insert 23 (23 % 5 = 3), we first notioce thet slot 3 is taken and so is slot 4. Slot 5 is the next empty slot.
So, 23 goes there.

[0] - 
[1] - 
[2] - 17
[3] - 12
[4] - 29
[5] - 23
[6] -
[7] -
[8] -
[9] -

Top insert 32 (32 % 2 = 2), we see that starting from slot 2 all the way up to slot 5 are taken.
So, 32 goes into slot 6.

[0] - 
[1] - 
[2] - 17
[3] - 12
[4] - 29
[5] - 23
[6] - 32
[7] -
[8] -
[9] -

Chaining

Initial Table (size=10):
[0] - 
[1] - 
[2] - 
[3] - 
[4] - 
[5] -
[6] -
[7] -
[8] -
[9] -

Let's insert the element 29. 29 % 5 = 4. Since slot 4 is empty, 29 can be inserted there.

[0] - 
[1] - 
[2] - 
[3] - 
[4] - 29
[5] -
[6] -
[7] -
[8] -
[9] -

Let's insert the element 17. 17 % 5 = 2. Since slot 2 is empty, 17 can be inserted there.

[0] - 
[1] - 
[2] - 17
[3] - 
[4] - 29
[5] -
[6] -
[7] -
[8] -
[9] -

Let's insert the element 12. 12 % 5 = 2. Since slot 2 is occupied, we need to start a linked list
at 17 and point it to 12 like a chain.

[0] - 
[1] - 
[2] - 17 -> 12
[3] - 
[4] - 29
[5] -
[6] -
[7] -
[8] -
[9] -

23 (23 % 5 = 3), can go to slot as it is empty.

[0] - 
[1] - 
[2] - 17 -> 12
[3] - 23
[4] - 29
[5] -
[6] -
[7] -
[8] -
[9] -

Top insert 32 (32 % 2 = 2), we extend the chain and add 32 behind 12.

[0] - 
[1] - 
[2] - 17 -> 12 -> 32
[3] - 23
[4] - 29
[5] -
[6] -
[7] -
[8] -
[9] -

In linear probing, as the table gets filled, it becomes harder to find empty slots, leading to a potential increase in the time it takes to insert new elements. In chaining, although the insertion is straightforward, the time to search for an element might increase as the chains get longer.

Let's write pseudocode for both linear probing and chaining in the context of a hash table. We'll assume that the hash table is implemented using an array, and for chaining, each array element points to a linked list.

Linear Probing

In linear probing, when we insert a new element, we start at the position indicated by the hash function. If that position is already occupied, we move sequentially through the array until we find an empty slot.

function hash(key):
    return key % tableSize

function insert(hashTable, key, value):
    index = hash(key)
    while hashTable[index] is not empty:
        index = (index + 1) % tableSize
    hashTable[index] = value

function search(hashTable, key):
    index = hash(key)
    while hashTable[index] is not empty:
        if hashTable[index] is key:
            return hashTable[index]
        index = (index + 1) % tableSize
    return null

Chaining

In chaining, each position in the hash table points to a linked list (or another dynamic data structure) that stores all elements which hash to the same index. This way, collisions are handled by adding elements to the corresponding list.

function hash(key):
    return key % tableSize

function insert(hashTable, key, value):
    index = hash(key)
    if hashTable[index] is empty:
        hashTable[index] = new LinkedList()
    hashTable[index].append(value)

function search(hashTable, key):
    index = hash(key)
    if hashTable[index] is not empty:
        for item in hashTable[index]:
            if item is key:
                return item
    return null

Explanation

  • In both methods, the hash function takes a key and computes an index based on the size of the hash table.

  • In linear probing (insert function), if the computed index is occupied, we keep moving to the next index until an empty slot is found.

  • In chaining, each table index points to a list. When inserting, we add the new value to this list. This allows multiple values to exist at the same index without conflicts.

  • The search functions in both methods try to locate the item by using the hash of the key. In linear probing, it sequentially checks each index if the initial one is occupied. In chaining, it traverses the linked list at the particular index.

Both methods have their trade-offs in terms of performance and memory usage, and the choice between them often depends on the expected load factor and the nature of the keys being inserted.

Let's write pseudocode for the contains operation for both linear probing and chaining in a hash table. The contains operation checks if a certain key is present in the hash table.

Linear Probing

In linear probing, the contains operation starts at the index indicated by the hash function. If the key at that position doesn't match, it linearly probes subsequent indices until it either finds the key or encounters an empty slot, indicating the key is not in the table.

function hash(key):
    return key % tableSize

function contains(hashTable, key):
    index = hash(key)
    while hashTable[index] is not empty:
        if hashTable[index] is key:
            return true
        index = (index + 1) % tableSize
    return false

Chaining

In chaining, each index of the hash table points to a list of keys. The contains function computes the hash of the key to find the correct list, then iterates over the list to check if the key is present.

function hash(key):
    return key % tableSize

function contains(hashTable, key):
    index = hash(key)
    if hashTable[index] is not empty:
        for item in hashTable[index]:
            if item is key:
                return true
    return false

Explanation

  • The hash function in both methods computes the index in the hash table based on the key.

  • In linear probing, contains looks through each index sequentially starting from the hashed index. If it finds the key, it returns true. If it finds an empty slot, it returns false, as this indicates the key is not present in the table.

  • In chaining, contains directly goes to the hashed index and iterates through the list at that index. If the key is found in the list, it returns true. If the list is empty or the key is not found in the list, it returns false.

These contains operations are efficient in their respective methods. Linear probing can suffer from clustering, affecting performance, especially in a nearly full table. Chaining avoids clustering but can have longer lists to search through, which can affect performance if the hash table is not well-dimensioned or the hash function does not distribute keys evenly.

let's write C code for both linear probing and chaining in a hash table. We'll create a basic implementation of each, focusing on the insert and contains operations.

Linear Probing in C

First, let's look at linear probing. We'll assume a fixed-size hash table where each slot can hold an integer value.

#include <stdio.h>
#include <stdbool.h>
#define TABLE_SIZE 10

int hashTable[TABLE_SIZE];

int hash(int key) {
    return key % TABLE_SIZE;
}

void initHashTable() {
    for (int i = 0; i < TABLE_SIZE; i++) {
        hashTable[i] = -1; // Using -1 to indicate empty slot
    }
}

bool insert(int key) {
    int index = hash(key);
    int originalIndex = index;
    do {
        if (hashTable[index] == -1) {
            hashTable[index] = key;
            return true;
        }
        index = (index + 1) % TABLE_SIZE;
    } while (index != originalIndex); // Stop if we come back to the start

    return false; // Table is full
}

bool contains(int key) {
    int index = hash(key);
    int originalIndex = index;
    do {
        if (hashTable[index] == key) {
            return true;
        }
        index = (index + 1) % TABLE_SIZE;
    } while (hashTable[index] != -1 && index != originalIndex);

    return false;
}

int main() {
    initHashTable();
    insert(15);
    insert(25); // Will collide with 15
    printf("%d\n", contains(15)); // Should print 1 (true)
    printf("%d\n", contains(30)); // Should print 0 (false)
    return 0;
}

Chaining in C

For chaining, we need a linked list at each index of the hash table. Each list node will store a key.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#define TABLE_SIZE 10

typedef struct Node {
    int key;
    struct Node *next;
} Node;

Node *hashTable[TABLE_SIZE];

int hash(int key) {
    return key % TABLE_SIZE;
}

void initHashTable() {
    for (int i = 0; i < TABLE_SIZE; i++) {
        hashTable[i] = NULL;
    }
}

bool insert(int key) {
    int index = hash(key);
    Node *newNode = (Node*) malloc(sizeof(Node));
    if (newNode == NULL) {
        return false; // Failed to allocate memory
    }
    newNode->key = key;
    newNode->next = hashTable[index];
    hashTable[index] = newNode;
    return true;
}

bool contains(int key) {
    int index = hash(key);
    Node *current = hashTable[index];
    while (current != NULL) {
        if (current->key == key) {
            return true;
        }
        current = current->next;
    }
    return false;
}

int main() {
    initHashTable();
    insert(15);
    insert(25); // Can coexist with 15 without collision
    printf("%d\n", contains(15)); // Should print 1 (true)
    printf("%d\n", contains(30)); // Should print 0 (false)
    return 0;
}

Explanation

  • In both implementations, the hash function computes the index for a key.

  • In linear probing, we insert by finding the next available slot and check existence by searching linearly from the hashed index.

  • In chaining, we manage collisions by using a linked list at each index. Insertion adds a new node at the beginning of the list. The contains function searches through the list at the given index.

  • Both versions use TABLE_SIZE to define the hash table's size. This value can be adjusted based on needs.

Let's perform a complexity analysis for both linear probing and chaining in hash tables, focusing on the insert and contains operations. We'll analyze the time and space complexities for each method.

Linear Probing Complexity Analysis

  1. Time Complexity:

    • insert: In the best case, when there are no collisions, the time complexity is (O(1)) as we directly insert into the desired index. However, in the worst case, when there are many collisions and the table is nearly full, the time complexity can become (O(n)), where (n) is the size of the hash table. This is because linear probing can require searching through many slots to find an empty one.

    • contains: Similar to insert, in the best case, it's (O(1)) when the key is found at its hashed index. In the worst case, it's also (O(n)) if the key is not in the table, as it may require searching through the entire table.

  2. Space Complexity:

    • The space complexity for the hash table itself is (O(n)), where (n) is the size of the table. The additional space complexity for each key is (O(1)).

Chaining Complexity Analysis

  1. Time Complexity:

    • insert: In the average case, assuming a good hash function and relatively uniform distribution of keys, the time complexity for insertion is (O(1)) on average. This is because inserting into a linked list (which is typically small) is a constant-time operation. In the worst case, when there are many collisions at a single index, it can become (O(n)), but this is less likely with a good hash function.

    • contains: In the average case, it's also (O(1)) on average because traversing a linked list is constant time for small lists. In the worst case, it can be (O(k)), where (k) is the length of the linked list at a specific index, but this is less likely to occur with a well-designed hash function.

  2. Space Complexity:

    • The space complexity for the hash table itself is (O(n)), where (n) is the size of the table. The space complexity for the linked lists is (O(m)), where (m) is the total number of keys with collisions. In practice, (m) is likely to be much less than (n) when the hash function is good.

Important to note

  • Linear probing has a simpler space complexity (per key) but can have worse worst-case time complexity when there are many collisions.

  • Chaining has a better average-case time complexity due to better handling of collisions but may have slightly higher space complexity due to the linked lists.