Hashing Recap: Direct Addressing Table, Chaining, Open Addressing
Overview
A quick note on hash tables, covering direct-address tables, separate chaining, and open addressing, with some counter-intuitive facts discussed and a C++ implementation given.
Hash Table Related Topics
A quick look at the time and space complexity from the Wikipedia infobox (type: unordered associative array; invented: 1953):

| Hash table | Average | Worst case |
|---|---|---|
| Space | O(n) | O(n) |
| Search | O(1) | O(n) |
| Insert | O(1) | O(n) |
| Delete | O(1) | O(n) |
1. Direct Addressing Table
Suppose the key set K = {k1, k2, …, kn}, |K| = n, is a subset of the universe U = {0, 1, …, M − 1}, |U| = M, and let k denote an arbitrary key from K.
A direct-address table will do the following: allocate an array T of size M, and perform insert, search, and delete as follows:
- Search: O(1) worst case; simply return T[k]
- Insert: O(1) worst case; set T[k] = x
- Delete: O(1) worst case; set T[k] = NULL
- Space: O(M), which is bad because M » n
2. Hash Tables
A more practical way to achieve insert, search, and delete as in a direct-address table is to allocate a table of size m « M for the n keys, and to define a = n / m as the load factor of the hash table. The issue with this approach is that different keys may be hashed into the same slot of the table T[0…m−1]. There are two common ways to handle these collisions:
Separate Chaining:
Use a linked list to hold all nodes whose keys hash into the same table slot, and perform insert, search, and delete as follows:
- Search: O(len) worst case and O(1 + a) average, where len is the length of the linked list. It has to scan the whole list at slot h(k) to determine whether the key is in the table.
- Insert: O(len) worst case and O(1 + a) average. Similarly, it first searches the list to check whether k is already present; if so, it updates the old pair (k, x′) to the new pair (k, x); otherwise it inserts a new node.
- Delete: O(len) worst case and O(1 + a) average. Similarly, it first searches the list for k; if found, it deletes the node, keeping the information required for unlinking (e.g., the predecessor pointer in the list).
- Space: O(n), which is fair enough.
The following implementation is based on the above algorithm; afterwards I will discuss some variants of insert and delete.
```cpp
#include <algorithm>
#include <stdexcept>

// Thrown by search() when the key is absent.
struct NotFoundException : std::runtime_error {
    explicit NotFoundException(const char *msg) : std::runtime_error(msg) {}
};

struct HashEntry {
    int key;  // >= 0
    int val;  // >= 0
    HashEntry *next;
    HashEntry(int k, int x) : key(k), val(x), next(NULL) {}
};

class HashTable {
private:
    HashEntry **table;

public:
    const static int TABLE_SIZE = 1024;

    HashTable() {
        table = new HashEntry *[HashTable::TABLE_SIZE];
        // Fill the whole array; stopping at &table[TABLE_SIZE - 1] would
        // leave the last slot uninitialized.
        std::fill(table, table + HashTable::TABLE_SIZE, (HashEntry *)NULL);
    }

    ~HashTable() {
        for (int i = 0; i < HashTable::TABLE_SIZE; i++) {
            HashEntry *curr = table[i];
            while (curr) {
                HashEntry *next = curr->next;
                delete curr;
                curr = next;
            }
        }
        delete[] table;
    }

    int hashValue(int k) { return k % HashTable::TABLE_SIZE; }

    int search(int k) {
        int slot = hashValue(k);
        HashEntry *curr = table[slot];
        while (curr && curr->key != k) curr = curr->next;
        if (curr == NULL) throw NotFoundException("Key does not exist!");
        return curr->val;
    }

    void insert(int k, int x) {
        int slot = hashValue(k);
        HashEntry *curr = table[slot];
        while (curr && curr->key != k) curr = curr->next;
        if (NULL == curr) {  // not present: push a new node at the head
            HashEntry *entry = new HashEntry(k, x);
            entry->next = table[slot];
            table[slot] = entry;
        } else {             // already present: just update the value
            curr->val = x;
        }
    }

    void remove(int k) {
        int slot = hashValue(k);
        HashEntry *curr = table[slot];
        HashEntry *prev = NULL;
        while (curr && curr->key != k) {
            prev = curr;
            curr = curr->next;
        }
        if (NULL == curr) return;                    // not found
        if (NULL == prev) table[slot] = curr->next;  // curr was the head
        else prev->next = curr->next;
        delete curr;
    }
};
```
Note that my implementation above guarantees there are no duplicate keys in the hash table; the insert and delete algorithms in Introduction to Algorithms, however, are quite misleading on this point. There are two counter-intuitive points I want to raise here:
- The insert in the book simply pushes a new node onto the head of the linked list without checking whether the key already exists. The benefit is obvious: it takes O(1) time in the worst case. The downside is more concerning: the table may end up holding duplicate keys (quite counter-intuitive given our usual understanding of a hash table), and it also increases the memory cost. So I don't think this is good practice for implementing a hash table.
- For delete, the book states that "Deletion of an element x can be accomplished in O(1) time if the lists are doubly linked." I don't see how to achieve this with a doubly linked list unless we pass the pointer to the node directly, i.e., to remove, say, HashEntry(3, 5), we pass its pointer to the remove function. Any suggestions on this issue?
- Regarding the average time O(1 + a): the book gives a formal proof that any operation that searches the linked list runs in O(1 + a) average time under the assumption of simple uniform hashing.
Open Addressing:
Instead of using linked lists, we store the values directly in the table entries; if a slot is already occupied, we try (probe) the next entry determined by an extended hash function h: U × {0, 1, …, m − 1} -> {0, 1, …, m − 1}. This way of resolving collisions is called open addressing.
The big problem with this approach is deletion. If we simply clear a deleted slot to NULL, later searches may incorrectly stop at it; and if we instead mark the slot with a special symbol, say DELETED, then search time is no longer bounded by the load factor (it depends on the history of insertions). So separate chaining is generally preferred, which is also why I won't dig deep into open addressing here and will just summarize the pros and cons of the two collision-resolution approaches.
As for the time complexity, Introduction to Algorithms shows that the average time for insert, delete, and search under open addressing is O(1 / (1 − a)), assuming uniform hashing and a < 1.
| | Pros | Cons |
|---|---|---|
| Separate chaining | 1) easy to implement; 2) good average time; 3) more ??? | 1) pointers cost extra storage; 2) pointer dereferencing may slow down the search a bit; 3) more ??? |
| Open addressing | 1) simpler logic for storage management (no dynamic allocation); 2) faster inserts (thanks to the simpler storage); 3) reduced storage requirement in general; 4) the memory freed by not storing pointers gives the hash table more slots for the same amount of memory, potentially yielding fewer collisions and faster retrieval | 1) a very big drawback: delete is difficult, and with FAKE DELETE (tombstones) the time complexity no longer depends only on the load factor; 2) more ??? |
Perfect Hashing:
Perfect hashing is a two-level hashing technique designed for static hash tables (the key set never changes); it guarantees a worst-case O(1) bound for all three operations. For more details, you might want to read the related chapter in Introduction to Algorithms.
Summary
I discussed hash tables here, including direct-address tables, separate chaining, and open addressing, with counter-intuitive facts discussed and a C++ implementation given. Please feel free to share your comments on anything, including the questions I raised. Thanks!