Hash Tables: Beyond the Basics
Every developer knows hash tables offer O(1) average-case lookups. But that simple statement hides a wealth of nuance that separates developers who use hash tables from those who truly understand them.
The O(1) Lie
Let’s be clear: O(1) is the average case under ideal conditions. In the worst case, hash tables degrade to O(n). Understanding when and why this happens is crucial.
The performance of a hash table depends on three factors:
- The quality of your hash function
- Your collision resolution strategy
- The load factor
Get any of these wrong, and your “constant time” data structure becomes a linked list in disguise.
Collision Resolution: Two Schools of Thought
Chaining
The simplest approach: each bucket contains a linked list of entries that hash to that index. When collisions occur, you append to the list.
class ChainedHashTable:
def __init__(self, size=16):
self.buckets = [[] for _ in range(size)]
def put(self, key, value):
index = hash(key) % len(self.buckets)
bucket = self.buckets[index]
for i, (k, v) in enumerate(bucket):
if k == key:
bucket[i] = (key, value)
return
bucket.append((key, value))
Chaining is simple and degrades gracefully, but cache performance suffers because linked list nodes scatter across memory.
Open Addressing
Instead of chaining, find another slot in the array. Linear probing checks the next slot, quadratic probing jumps by increasing squares, and double hashing uses a second hash function.
class OpenAddressedHashTable:
def __init__(self, size=16):
self.keys = [None] * size
self.values = [None] * size
def put(self, key, value):
index = hash(key) % len(self.keys)
while self.keys[index] is not None:
if self.keys[index] == key:
self.values[index] = value
return
index = (index + 1) % len(self.keys) # Linear probing
self.keys[index] = key
self.values[index] = value
Open addressing has better cache locality but suffers from clustering—collisions beget more collisions.
Load Factor: The Critical Threshold
The load factor is the ratio of entries to buckets. As it increases, collision probability grows exponentially.
Most implementations resize when the load factor exceeds 0.75. Java’s HashMap, Python’s dict, and Go’s map all use this threshold. But resizing is expensive—you must rehash every entry.
For performance-critical code, pre-sizing your hash table to avoid resizes can matter:
// Bad: will resize multiple times
Map<String, Integer> map = new HashMap<>();
// Better: sized for expected entries
Map<String, Integer> map = new HashMap<>(expectedSize * 4 / 3 + 1);When Hash Tables Fail
Hash tables aren’t always the answer:
Ordered iteration: Hash tables don’t maintain insertion order (unless you use LinkedHashMap or similar). If you need sorted keys, a tree-based structure is better.
Range queries: Finding all keys between A and B requires scanning every entry. Trees handle this in O(log n + k).
Memory constraints: Hash tables trade memory for speed. With a load factor of 0.75, you’re using 33% more memory than necessary, plus overhead for the structure itself.
Adversarial inputs: If an attacker can control your keys, they can craft inputs that all hash to the same bucket, turning O(1) into O(n). This has been exploited in denial-of-service attacks against web frameworks.
Hash Function Quality
A good hash function distributes keys uniformly. A bad one creates clusters.
Consider hashing strings by summing character codes:
def bad_hash(s):
return sum(ord(c) for c in s)
“abc” and “cba” hash to the same value. So do “ab” and “ba”. Collisions everywhere.
Modern languages use sophisticated hash functions. Python uses SipHash for security. Java uses a mixing function that spreads bits across the hash space. Don’t roll your own unless you have a compelling reason.
Practical Takeaways
-
Pre-size when you know the expected capacity. Avoiding resizes can significantly improve performance for large tables.
-
Be aware of key distribution. Sequential integers hash well. Clustered floating-point values don’t.
-
Consider alternatives for small collections. For fewer than 10-20 elements, a sorted array with binary search often beats a hash table due to cache effects.
-
Profile before optimizing. Hash table operations are fast enough that other parts of your code are likely the bottleneck.
The hash table is one of computing’s great inventions. But like all tools, it works best when you understand not just what it does, but how and why.