Hashing and Hash Maps in DSA
How hash functions, hash maps, and hash sets work — the intuition behind buckets and collisions, chaining vs open addressing, average and worst-case complexity, and the Python containers built on them.
What you'll learn
- ✓What a hash function is and what makes a good one
- ✓How hash maps store key-value pairs in buckets
- ✓Collision resolution: chaining vs open addressing
- ✓Why average lookup is O(1) and worst case is O(n)
- ✓Python dict, set, frozenset, and Counter at a glance
- ✓How to make your own class hashable with __hash__
Prerequisites
- •Comfortable with dictionaries — see Python Dictionaries
- •Familiarity with arrays and basic Big-O
The hash map is the single most useful data structure in coding interviews. It turns “find this thing” from O(n) into O(1) on average, and it powers an enormous number of clever solutions: two-sum, anagram grouping, subarray sums, longest consecutive sequence, top-k frequency. This post explains how hash maps work under the hood, why they sometimes fail to be O(1), and how to use the Python containers built on them.
The problem hash maps solve
Suppose you have a list of one million user records and you want to look up a user by email. With a plain list, you scan — that is O(n) per lookup, and a million lookups becomes a trillion operations.
A hash map turns the email directly into a slot in an array. The lookup becomes one arithmetic operation plus one or two memory reads — effectively O(1).
users = {"alice@example.com": 1, "bob@example.com": 2}
print(users["alice@example.com"]) # 1 — constant time
The magic is in the hash function.
Hash functions
A hash function maps an arbitrary input — string, integer, tuple — to a fixed-size integer. A good hash function has three properties:
- Deterministic. The same input always produces the same hash.
- Fast. Computable in O(length of key).
- Uniform. Different inputs spread evenly across the output range.
Python exposes the builtin:
print(hash("hello")) # some large integer
print(hash(42)) # 42 (small integers hash to themselves)
print(hash((1, 2, 3))) # tuple of immutables is hashable
# hash([1, 2, 3]) # TypeError: unhashable type: 'list'
Python randomizes the seed of string hashing per process to defend against adversarial collisions. Two runs of your program will produce different hash values for the same string.
Buckets and the array under the hood
A hash map is, internally, an array of buckets. To insert a key-value pair:
- Compute
h = hash(key). - Reduce to a bucket index:
i = h mod capacity. - Store the key and value in bucket
i.
To look up a key:
- Compute the same
i. - Read bucket
i.
If two distinct keys land in the same bucket, that’s a collision — and every hash map needs a strategy for them.
Collision resolution
Two classical strategies.
Chaining
Each bucket is a small list. On collision, append the new pair to that list. Lookup compares the searched key against each entry in the bucket’s list.
bucket 3: -> (key='cat', value=1) -> (key='act', value=7)
- Insert: O(1) average.
- Lookup: O(1) average, O(length of chain) worst.
- Memory: extra pointers, but simple to reason about.
Open addressing
The hash map is a flat array; on collision, probe the next bucket (linear probing, quadratic probing, or double hashing) until you find an empty slot.
bucket 3: (key='cat', value=1)
bucket 4: (key='act', value=7) — placed here because bucket 3 was full
- Better cache locality (one contiguous array, no chasing pointers).
- More sensitive to load factor — once the table is more than ~70% full, probes become long.
CPython’s dict uses open addressing with perturbation — a clever probing sequence designed to spread collisions out. Java’s HashMap uses chaining. Both achieve average-case O(1) for the four operations we care about: insert, lookup, update, delete.
Load factor and resizing
The load factor is entries / capacity. As it climbs, collisions become more likely. To keep operations fast, hash maps resize — typically doubling the capacity and re-inserting every entry — when the load factor crosses a threshold (about 0.66 in CPython).
Resizing is O(n), but it happens rarely. Averaged across many operations, each insert is still O(1) amortized.
Average vs worst case
| Operation | Average | Worst (pathological) |
|---|---|---|
| Insert | O(1) | O(n) |
| Lookup | O(1) | O(n) |
| Delete | O(1) | O(n) |
| Iterate | O(n) | O(n) |
The worst case happens when every key hashes to the same bucket. In practice this requires an adversary who knows your hash seed (a defended class of attack against web servers). For interview problems, assume average-case O(1) and you’ll be right.
Try it yourself. Without running any code, predict the output:
d = {}
d["a"] = 1
d["b"] = 2
d["a"] = 3
print(d, len(d))Then run it and confirm. The point is to internalize that re-inserting a key updates the value but does not grow the dictionary.
What makes a key hashable?
In Python, an object is hashable if it has a __hash__ method that returns an integer, and two equal objects produce equal hashes. By default:
- Immutable built-ins are hashable:
int,float,str,tuple(of hashables),frozenset,bool,None. - Mutable built-ins are not:
list,dict,set.
The reason mutable objects are excluded: if you mutate the object after using it as a key, its hash would change and you could never find it again.
# Legal keys
{"hello": 1, 42: 2, (1, 2): 3, frozenset([1, 2]): 4}
# Illegal
# {[1, 2]: 5} # TypeError: unhashable type: 'list'
Sets and frozensets
A set is a hash map with no values — just keys. Same average-case O(1) for insert, delete, and membership test.
s = {"apple", "banana", "cherry"}
print("apple" in s) # True — O(1)
s.add("date")
s.remove("banana")
A frozenset is the immutable cousin. Because it cannot change, it is hashable, which means a set of sets is possible if the inner sets are frozen:
groups = {frozenset({1, 2}), frozenset({3, 4})}
print(frozenset({1, 2}) in groups) # True
This is the right tool for deduplicating unordered groups (for example, the keys you’d use to group anagrams).
Counter — the counting dictionary
collections.Counter is a dictionary subclass purpose-built for counting:
from collections import Counter
c = Counter("abracadabra")
print(c) # Counter({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1})
print(c.most_common(2)) # [('a', 5), ('b', 2)]
# Counters support arithmetic
c2 = Counter("alabama")
print(c + c2)
print(c & c2) # intersection: min of each
Counter handles the boilerplate of “if key not in dict, set to zero, otherwise increment.” For interview problems on frequencies, reach for it first.
defaultdict — automatic empty values
collections.defaultdict removes the setdefault ceremony from grouping problems:
from collections import defaultdict
groups = defaultdict(list)
for word in ["eat", "tea", "tan", "ate", "nat", "bat"]:
key = "".join(sorted(word))
groups[key].append(word)
print(dict(groups))
# {'aet': ['eat','tea','ate'], 'ant': ['tan','nat'], 'abt': ['bat']}
Pass list, int, set, or any zero-argument callable. When a missing key is accessed, the default is created automatically.
Making your own class hashable
If you put your own objects into a set or use them as dict keys, you’ll usually want to define both __eq__ and __hash__. The contract: equal objects must have equal hashes.
class Point:
def __init__(self, x, y):
self.x = x
self.y = y
def __eq__(self, other):
return isinstance(other, Point) and self.x == other.x and self.y == other.y
def __hash__(self):
return hash((self.x, self.y))
points = {Point(0, 0), Point(1, 2)}
print(Point(0, 0) in points) # True
The easy way: derive the hash from a tuple of the fields that participate in equality. Never include mutable fields — if they change, the hash changes, and you’ll never find the object again.
Dataclasses with frozen=True handle this for you automatically.
Try it yourself. Write a small program that uses a Counter to find the three most common words in this paragraph after stripping punctuation and lowercasing. Use re.findall(r"\w+", text.lower()) to extract words.
When to use a hash map
Reach for a hash map when you need to:
- Look up by key. “Have I seen this value before?” is one line.
- Count things. Frequency tables, histograms.
- Group things. “All items with the same X go into the same bucket.”
- Memoize. Cache the result of an expensive function call.
- Index by something other than position. Names, ids, tuples — anything hashable.
Reach for a hash set when you only need membership without an associated value — for example, to deduplicate or to mark “visited” nodes in a graph traversal.
When not to
A hash map is the wrong tool when:
- You need ordered traversal by key. Use a sorted structure (e.g.
SortedDictinsortedcontainers, or a balanced tree in other languages). - Keys are not hashable (mutable lists, dicts).
- You need range queries (“all keys between A and B”). Use a tree.
- Memory is tight and the data is naturally array-indexed.
Recap
You now know:
- A hash function maps a key to an integer; a hash map uses it to find a bucket
- Collisions are resolved with chaining or open addressing
- Operations are O(1) average, O(n) worst case
- Python
dictandsetare hash maps;Counteranddefaultdictare conveniences overdict - Use
frozensetwhen you need a hashable set - Custom classes need
__eq__and__hash__to be used as keys; equal objects must hash equal
Next steps
The follow-up post applies all this to a focused practice set — Two Sum, Subarray Sum Equals K, Longest Consecutive Sequence, Top K Frequent Elements, and four more — each with a Python solution and Big-O analysis.
→ Next: Hashing: 8 Practice Problems with Solutions
Questions or feedback? Email codeloomdevv@gmail.com.