Introduction to Hashing

Hashing is a technique used to map data (such as strings, numbers, or any other data type) to a fixed-size value, often called a hash code or hash value. The purpose of hashing is to efficiently store and retrieve data from a collection, typically using a hash table. In hashing, a hash function is used to transform the input data into a hash value that serves as an index where the data is stored in a hash table. This allows for near-constant time complexity for search, insertion, and deletion operations.

What is a Hash Function?

A hash function is a function that takes an input (or "key") and returns a fixed-size string of bytes (usually an integer). The output is typically a number that is used as an index in a hash table. The efficiency of a hashing system is largely dependent on how well the hash function distributes the keys across the hash table, reducing collisions.

Simple Example:

If you have a set of names like ["Alice", "Bob", "Charlie"], a hash function could convert these names into integers:

hash("Alice") → 5
hash("Bob") → 12
hash("Charlie") → 8

These hash values can then be used to store the names in a hash table at positions 5, 12, and 8, respectively.

Properties of a Good Hash Function

A good hash function should have the following properties to ensure efficient performance:

  1. Deterministic: The hash function should always produce the same hash value for the same input data. For instance, hash("Bob") should always return the same number (e.g., 12).

  2. Uniform Distribution: The hash function should map input data to hash values uniformly. This ensures that the keys are evenly distributed across the hash table, minimizing collisions (when two different keys hash to the same value).

  3. Minimize Collisions: Collisions occur when two different inputs produce the same hash value. A good hash function should minimize the chances of collisions, although some are inevitable in most practical scenarios.

  4. Fast Computation: The hash function should be efficient and fast to compute, even for large inputs. Time spent calculating the hash should be minimal to ensure the overall efficiency of the system.

  5. Avalanche Effect: A small change in the input should result in a drastic change in the hash value. For example, hash("dog") and hash("Dog") should produce vastly different hash values.

  6. Low Memory Usage: The hash function should use minimal memory to compute and store values, especially when dealing with large data sets.

How to Craft a Good Hash Function

Designing a good hash function requires balancing the properties mentioned above. Here are some general techniques and considerations when crafting a hash function:

  1. Modular Arithmetic: One of the simplest and most common techniques is to use the modulo operator. For example, hash(key) = key % table_size. This ensures the hash value is within the range of the table size.

  2. Multiplication and Shifting: Combine multiplication and shifting to achieve uniform distribution. For example:

     hash(key) = (key * constant) >> some_bits
    

    This technique helps spread the keys uniformly across the hash table.

  3. String Hashing: For strings, you can create a hash function by converting characters to numeric values (e.g., using ASCII codes) and combining them. One common string hashing technique is the djb2 algorithm, which is simple but effective:

     unsigned long hash(char *str) {
         unsigned long hash = 5381;
         int c;
         while ((c = *str++)) {
             hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
         }
         return hash;
     }
    

    In this case, each character affects the hash value, and a small change in the input string will result in a large difference in the output hash.

  4. Bit Manipulation: For integer inputs, manipulating the bits by rotating, shifting, or XOR-ing the bits of the input can create a more uniform distribution of hash values. XOR-ing parts of the data with each other is a common approach in crafting effective hash functions.

  5. Use of Prime Numbers: Prime numbers are often used in hash functions to reduce the likelihood of patterns and improve the distribution of hash values. For example, using prime numbers in modulo operations can help spread the hash values more uniformly across the table size.

Dealing with Collisions

No hash function can completely avoid collisions, especially when the input data set is larger than the number of possible hash values (this is known as the pigeonhole principle). There are several strategies to handle collisions:

  1. Chaining: In this method, each slot in the hash table contains a linked list. When a collision occurs (i.e., two keys hash to the same value), the key is added to the linked list at that index. Searching, inserting, and deleting still require traversing the linked list, but this keeps the hash table functional even with collisions.

    Example of Chaining:

     Hash Table:
     Index 0: [Alice] → [Bob] → [Charlie]
    
     #include <stdio.h>
     #include <stdlib.h>
    
     // Define the structure for the linked list used in chaining
     struct Node {
         int key;
         struct Node* next;
     };
    
     // Hash function
     int hash(int key, int table_size) {
         return key % table_size;
     }
    
     // Function to create a new node
     struct Node* createNode(int key) {
         struct Node* newNode = (struct Node*)malloc(sizeof(struct Node));
         newNode->key = key;
         newNode->next = NULL;
         return newNode;
     }
    
     // Insert key into the hash table using chaining
     void insert(struct Node* hashTable[], int key, int table_size) {
         int index = hash(key, table_size);
         struct Node* newNode = createNode(key);
    
         // Insert the node at the beginning of the linked list at the hashed index
         newNode->next = hashTable[index];
         hashTable[index] = newNode;
     }
    
     // Search for a key in the hash table
     struct Node* search(struct Node* hashTable[], int key, int table_size) {
         int index = hash(key, table_size);
         struct Node* current = hashTable[index];
    
         while (current != NULL) {
             if (current->key == key) {
                 return current;  // Key found
             }
             current = current->next;
         }
         return NULL;  // Key not found
     }
    
     // Delete a key from the hash table
     void deleteKey(struct Node* hashTable[], int key, int table_size) {
         int index = hash(key, table_size);
         struct Node* current = hashTable[index];
         struct Node* prev = NULL;
    
         while (current != NULL && current->key != key) {
             prev = current;
             current = current->next;
         }
    
         // If key is not found
         if (current == NULL) return;
    
         // If the key is the first node in the list
         if (prev == NULL) {
             hashTable[index] = current->next;
         } else {
             prev->next = current->next;
         }
    
         free(current);
     }
    
     // Display the hash table with chaining
     void display(struct Node* hashTable[], int table_size) {
         for (int i = 0; i < table_size; i++) {
             printf("Index %d: ", i);
             struct Node* current = hashTable[i];
             while (current != NULL) {
                 printf("%d -> ", current->key);
                 current = current->next;
             }
             printf("NULL\n");
         }
     }
    
     int main() {
         int table_size = 10;
         struct Node* hashTable[10] = {NULL};  // Initialize hash table with NULL pointers
    
         // Insert keys into the hash table
         insert(hashTable, 15, table_size);
         insert(hashTable, 25, table_size);
         insert(hashTable, 35, table_size);
         insert(hashTable, 45, table_size);
         insert(hashTable, 55, table_size);
    
         display(hashTable, table_size);  // Display the hash table
    
         // Search for a key
         int key = 25;
         if (search(hashTable, key, table_size)) {
             printf("Key %d found.\n", key);
         } else {
             printf("Key %d not found.\n", key);
         }
    
         // Delete a key
         deleteKey(hashTable, 25, table_size);
         display(hashTable, table_size);  // Display the hash table after deletion
    
         return 0;
     }
    
  2. Open Addressing: Instead of using a linked list, open addressing looks for the next available slot in the hash table when a collision occurs. Common open addressing strategies include:

    • Linear Probing: When a collision occurs, the algorithm checks the next slot (index + 1) until an empty slot is found.

    •   #include <stdio.h>
      
        #define TABLE_SIZE 10
      
        // Hash function
        int hash(int key, int table_size) {
            return key % table_size;
        }
      
        // Insert key using linear probing
        void insert(int hashTable[], int key) {
            int index = hash(key, TABLE_SIZE);
      
            // Search for the next available slot
            while (hashTable[index] != -1) {
                index = (index + 1) % TABLE_SIZE;
            }
      
            hashTable[index] = key;
        }
      
        // Search for a key using linear probing
        int search(int hashTable[], int key) {
            int index = hash(key, TABLE_SIZE);
            int start_index = index;  // Keep track of the starting index to avoid infinite loop
      
            // Search for the key
            while (hashTable[index] != -1) {
                if (hashTable[index] == key) {
                    return index;
                }
                index = (index + 1) % TABLE_SIZE;
                if (index == start_index) {
                    break;  // If we loop back to the starting index, the table is full
                }
            }
      
            return -1;  // Key not found
        }
      
        // Display the hash table
        void display(int hashTable[]) {
            for (int i = 0; i < TABLE_SIZE; i++) {
                if (hashTable[i] != -1) {
                    printf("Index %d: %d\n", i, hashTable[i]);
                } else {
                    printf("Index %d: NULL\n", i);
                }
            }
        }
      
        int main() {
            int hashTable[TABLE_SIZE];
            for (int i = 0; i < TABLE_SIZE; i++) {
                hashTable[i] = -1;  // Initialize the hash table with -1 (indicating empty slots)
            }
      
            // Insert keys into the hash table
            insert(hashTable, 15);
            insert(hashTable, 25);
            insert(hashTable, 35);
            insert(hashTable, 45);
      
            display(hashTable);  // Display the hash table
      
            // Search for a key
            int key = 25;
            int result = search(hashTable, key);
            if (result != -1) {
                printf("Key %d found at index %d.\n", key, result);
            } else {
                printf("Key %d not found.\n", key);
            }
      
            return 0;
        }
      
    • Quadratic Probing: The algorithm checks slots with increasing intervals (index + 1, index + 4, index + 9, etc.).

    • Double Hashing: A second hash function is applied to determine how far to move when a collision occurs.

  3. Perfect Hashing: If the set of keys is known in advance and is relatively small, a perfect hash function can be crafted that guarantees no collisions. This is particularly useful for static data sets.

Hashing in Real-World Applications

  1. Hash Tables: The most common application of hashing is in hash tables, where it provides fast access to data in constant time, \(O(1)\) on average. Hash tables are used in many data structures, like dictionaries (in Python), hash maps (in Java), and unordered maps (in C++).

  2. Cryptographic Hashing: Hashing is also used in cryptography for data integrity, digital signatures, and password storage. Cryptographic hash functions (e.g., SHA-256, MD5) have additional security properties, such as being difficult to reverse or tamper with.

  3. File and Data Deduplication: Hashing can identify duplicate files or data blocks by hashing their contents and comparing the hash values.

  4. Hashing in Data Structures: Hash functions are used in complex data structures like bloom filters, caches, and databases for efficient lookup and retrieval.

Time Complexity of Hashing

The efficiency of hashing depends on the design of the hash function and how well it handles collisions. In an ideal scenario (with a perfect hash function), the time complexity of insertion, deletion, and search operations in a hash table is O(1) (constant time). However, if the hash function is poor and there are many collisions, these operations may degrade to O(n) (linear time) in the worst case, depending on the collision resolution method.