Skip to content
C Codeloom
DSA

Hashing and Hash Maps in DSA

How hash functions, hash maps, and hash sets work — the intuition behind buckets and collisions, chaining vs open addressing, average and worst-case complexity, and the Python containers built on them.

·9 min read · By Yash Kesharwani
Intermediate 12 min read

What you'll learn

  • What a hash function is and what makes a good one
  • How hash maps store key-value pairs in buckets
  • Collision resolution: chaining vs open addressing
  • Why average lookup is O(1) and worst case is O(n)
  • Python dict, set, frozenset, and Counter at a glance
  • How to make your own class hashable with __hash__

Prerequisites

  • Comfortable with dictionaries — see Python Dictionaries
  • Familiarity with arrays and basic Big-O

The hash map is the single most useful data structure in coding interviews. It turns “find this thing” from O(n) into O(1) on average, and it powers an enormous number of clever solutions: two-sum, anagram grouping, subarray sums, longest consecutive sequence, top-k frequency. This post explains how hash maps work under the hood, why they sometimes fail to be O(1), and how to use the Python containers built on them.

The problem hash maps solve

Suppose you have a list of one million user records and you want to look up a user by email. With a plain list, you scan — that is O(n) per lookup, and a million lookups becomes a trillion operations.

A hash map turns the email directly into a slot in an array. The lookup becomes one arithmetic operation plus one or two memory reads — effectively O(1).

users = {"alice@example.com": 1, "bob@example.com": 2}
print(users["alice@example.com"])    # 1   — constant time

The magic is in the hash function.

Hash functions

A hash function maps an arbitrary input — string, integer, tuple — to a fixed-size integer. A good hash function has three properties:

  1. Deterministic. The same input always produces the same hash.
  2. Fast. Computable in O(length of key).
  3. Uniform. Different inputs spread evenly across the output range.

Python exposes the builtin:

print(hash("hello"))    # some large integer
print(hash(42))         # 42 (small integers hash to themselves)
print(hash((1, 2, 3)))  # tuple of immutables is hashable
# hash([1, 2, 3])       # TypeError: unhashable type: 'list'

Python randomizes the seed of string hashing per process to defend against adversarial collisions. Two runs of your program will produce different hash values for the same string.

Buckets and the array under the hood

A hash map is, internally, an array of buckets. To insert a key-value pair:

  1. Compute h = hash(key).
  2. Reduce to a bucket index: i = h mod capacity.
  3. Store the key and value in bucket i.

To look up a key:

  1. Compute the same i.
  2. Read bucket i.

If two distinct keys land in the same bucket, that’s a collision — and every hash map needs a strategy for them.

Collision resolution

Two classical strategies.

Chaining

Each bucket is a small list. On collision, append the new pair to that list. Lookup compares the searched key against each entry in the bucket’s list.

bucket 3: -> (key='cat', value=1) -> (key='act', value=7)
  • Insert: O(1) average.
  • Lookup: O(1) average, O(length of chain) worst.
  • Memory: extra pointers, but simple to reason about.

Open addressing

The hash map is a flat array; on collision, probe the next bucket (linear probing, quadratic probing, or double hashing) until you find an empty slot.

bucket 3: (key='cat', value=1)
bucket 4: (key='act', value=7)   — placed here because bucket 3 was full
  • Better cache locality (one contiguous array, no chasing pointers).
  • More sensitive to load factor — once the table is more than ~70% full, probes become long.

CPython’s dict uses open addressing with perturbation — a clever probing sequence designed to spread collisions out. Java’s HashMap uses chaining. Both achieve average-case O(1) for the four operations we care about: insert, lookup, update, delete.

Load factor and resizing

The load factor is entries / capacity. As it climbs, collisions become more likely. To keep operations fast, hash maps resize — typically doubling the capacity and re-inserting every entry — when the load factor crosses a threshold (about 0.66 in CPython).

Resizing is O(n), but it happens rarely. Averaged across many operations, each insert is still O(1) amortized.

Average vs worst case

OperationAverageWorst (pathological)
InsertO(1)O(n)
LookupO(1)O(n)
DeleteO(1)O(n)
IterateO(n)O(n)

The worst case happens when every key hashes to the same bucket. In practice this requires an adversary who knows your hash seed (a defended class of attack against web servers). For interview problems, assume average-case O(1) and you’ll be right.

Try it yourself. Without running any code, predict the output:

d = {}
d["a"] = 1
d["b"] = 2
d["a"] = 3
print(d, len(d))

Then run it and confirm. The point is to internalize that re-inserting a key updates the value but does not grow the dictionary.

What makes a key hashable?

In Python, an object is hashable if it has a __hash__ method that returns an integer, and two equal objects produce equal hashes. By default:

  • Immutable built-ins are hashable: int, float, str, tuple (of hashables), frozenset, bool, None.
  • Mutable built-ins are not: list, dict, set.

The reason mutable objects are excluded: if you mutate the object after using it as a key, its hash would change and you could never find it again.

# Legal keys
{"hello": 1, 42: 2, (1, 2): 3, frozenset([1, 2]): 4}

# Illegal
# {[1, 2]: 5}   # TypeError: unhashable type: 'list'

Sets and frozensets

A set is a hash map with no values — just keys. Same average-case O(1) for insert, delete, and membership test.

s = {"apple", "banana", "cherry"}
print("apple" in s)     # True       — O(1)
s.add("date")
s.remove("banana")

A frozenset is the immutable cousin. Because it cannot change, it is hashable, which means a set of sets is possible if the inner sets are frozen:

groups = {frozenset({1, 2}), frozenset({3, 4})}
print(frozenset({1, 2}) in groups)   # True

This is the right tool for deduplicating unordered groups (for example, the keys you’d use to group anagrams).

Counter — the counting dictionary

collections.Counter is a dictionary subclass purpose-built for counting:

from collections import Counter

c = Counter("abracadabra")
print(c)                       # Counter({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1})
print(c.most_common(2))        # [('a', 5), ('b', 2)]

# Counters support arithmetic
c2 = Counter("alabama")
print(c + c2)
print(c & c2)                  # intersection: min of each

Counter handles the boilerplate of “if key not in dict, set to zero, otherwise increment.” For interview problems on frequencies, reach for it first.

defaultdict — automatic empty values

collections.defaultdict removes the setdefault ceremony from grouping problems:

from collections import defaultdict

groups = defaultdict(list)
for word in ["eat", "tea", "tan", "ate", "nat", "bat"]:
    key = "".join(sorted(word))
    groups[key].append(word)

print(dict(groups))
# {'aet': ['eat','tea','ate'], 'ant': ['tan','nat'], 'abt': ['bat']}

Pass list, int, set, or any zero-argument callable. When a missing key is accessed, the default is created automatically.

Making your own class hashable

If you put your own objects into a set or use them as dict keys, you’ll usually want to define both __eq__ and __hash__. The contract: equal objects must have equal hashes.

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and self.x == other.x and self.y == other.y

    def __hash__(self):
        return hash((self.x, self.y))

points = {Point(0, 0), Point(1, 2)}
print(Point(0, 0) in points)    # True

The easy way: derive the hash from a tuple of the fields that participate in equality. Never include mutable fields — if they change, the hash changes, and you’ll never find the object again.

Dataclasses with frozen=True handle this for you automatically.

Try it yourself. Write a small program that uses a Counter to find the three most common words in this paragraph after stripping punctuation and lowercasing. Use re.findall(r"\w+", text.lower()) to extract words.

When to use a hash map

Reach for a hash map when you need to:

  • Look up by key. “Have I seen this value before?” is one line.
  • Count things. Frequency tables, histograms.
  • Group things. “All items with the same X go into the same bucket.”
  • Memoize. Cache the result of an expensive function call.
  • Index by something other than position. Names, ids, tuples — anything hashable.

Reach for a hash set when you only need membership without an associated value — for example, to deduplicate or to mark “visited” nodes in a graph traversal.

When not to

A hash map is the wrong tool when:

  • You need ordered traversal by key. Use a sorted structure (e.g. SortedDict in sortedcontainers, or a balanced tree in other languages).
  • Keys are not hashable (mutable lists, dicts).
  • You need range queries (“all keys between A and B”). Use a tree.
  • Memory is tight and the data is naturally array-indexed.

Recap

You now know:

  • A hash function maps a key to an integer; a hash map uses it to find a bucket
  • Collisions are resolved with chaining or open addressing
  • Operations are O(1) average, O(n) worst case
  • Python dict and set are hash maps; Counter and defaultdict are conveniences over dict
  • Use frozenset when you need a hashable set
  • Custom classes need __eq__ and __hash__ to be used as keys; equal objects must hash equal

Next steps

The follow-up post applies all this to a focused practice set — Two Sum, Subarray Sum Equals K, Longest Consecutive Sequence, Top K Frequent Elements, and four more — each with a Python solution and Big-O analysis.

→ Next: Hashing: 8 Practice Problems with Solutions

Questions or feedback? Email codeloomdevv@gmail.com.