String Pattern Matching: Naive, KMP, and Rabin-Karp
A practical tour of substring search — the naive O(n·m) scan, the KMP algorithm with its LPS array, and the Rabin-Karp rolling hash. Worked examples, code, and intuition for when to use each.
What you'll learn
- ✓How the naive substring scan works and where it fails
- ✓The intuition behind the KMP failure function (LPS array)
- ✓How to build an LPS array and use it to skip redundant comparisons
- ✓How the Rabin-Karp rolling hash matches in average O(n+m)
- ✓When to reach for each algorithm in interviews and in production
Prerequisites
- •Familiarity with string indexing and slicing — see Strings: Operations and Patterns
- •Basic understanding of arrays and loops
The substring-search problem is simple to state: given a text T of length n and a pattern P of length m, find every index i such that T[i:i+m] == P. The naive solution is two lines. The classical solutions — KMP and Rabin-Karp — are surprisingly subtle, but the intuition is worth the effort. This post walks through all three.
The naive algorithm
The simplest approach: try every starting position in the text and compare character by character.
def naive_search(text, pattern):
n, m = len(text), len(pattern)
matches = []
for i in range(n - m + 1):
j = 0
while j < m and text[i + j] == pattern[j]:
j += 1
if j == m:
matches.append(i)
return matches
print(naive_search("ababcabcabababd", "ababd")) # [10]
Complexity
- Worst case: O(n · m). Consider text
"aaaaaaaaaab"and pattern"aaaab". At each of n positions we compare m characters before failing. - Best case: O(n). When the first character mismatches at most positions.
- Space: O(1).
For small inputs this is fine. For repeated searches on the same text, or pathological inputs, we want better.
Why naive search is wasteful
Look at this comparison:
text: A B A B A B C ...
pattern: A B A B D
^ ^ ^ ^ x — mismatch at index 4
The naive algorithm now restarts at text index 1, retesting characters it already saw. But notice: the first four characters of the pattern, ABAB, contain a prefix AB that is also a suffix. So after the mismatch we already know text[2..3] == "AB", and we should resume comparing from pattern index 2 — not throw away that knowledge.
This observation is the seed of KMP.
KMP: the failure function (LPS array)
The Knuth–Morris–Pratt algorithm precomputes, for each prefix of the pattern, the length of the longest proper prefix that is also a suffix — abbreviated LPS. “Proper” means not equal to the prefix itself.
For pattern ABABD:
| index | prefix | longest proper prefix == suffix | LPS |
|---|---|---|---|
| 0 | A | — | 0 |
| 1 | AB | — | 0 |
| 2 | ABA | A | 1 |
| 3 | ABAB | AB | 2 |
| 4 | ABABD | — | 0 |
The LPS tells us how many characters we can safely keep when a mismatch occurs.
Building the LPS array
def build_lps(pattern):
m = len(pattern)
lps = [0] * m
length = 0 # length of the previous longest prefix-suffix
i = 1
while i < m:
if pattern[i] == pattern[length]:
length += 1
lps[i] = length
i += 1
elif length != 0:
length = lps[length - 1] # fall back
else:
lps[i] = 0
i += 1
return lps
print(build_lps("ABABD")) # [0, 0, 1, 2, 0]
print(build_lps("AAAA")) # [0, 1, 2, 3]
print(build_lps("ABCABCD")) # [0, 0, 0, 1, 2, 3, 0]
The clever line is length = lps[length - 1] — when characters differ but we have a previous prefix-suffix, we shrink to the next-longest one without restarting.
The KMP search
Armed with the LPS array, the search walks the text once. On a mismatch, instead of stepping back in the text, we step the pattern index back to lps[j - 1].
def kmp_search(text, pattern):
n, m = len(text), len(pattern)
if m == 0:
return [0]
lps = build_lps(pattern)
matches = []
i = j = 0 # i over text, j over pattern
while i < n:
if text[i] == pattern[j]:
i += 1
j += 1
if j == m:
matches.append(i - m)
j = lps[j - 1] # continue searching for overlapping matches
else:
if j != 0:
j = lps[j - 1]
else:
i += 1
return matches
print(kmp_search("ababcabcabababd", "ababd")) # [10]
print(kmp_search("aaaaa", "aa")) # [0, 1, 2, 3]
Complexity
- Preprocessing: O(m).
- Search: O(n). The text index
inever decreases. - Total: O(n + m).
- Space: O(m) for the LPS array.
KMP guarantees linear time even on pathological inputs.
Try it yourself. Compute the LPS array for "AABAACAABAA" by hand. Then verify with build_lps. Pay attention to how the LPS jumps when the pattern contains repeated blocks.
Rabin-Karp: the rolling hash
The Rabin-Karp algorithm takes a different angle. Instead of comparing characters, it compares hashes of windows.
The idea: compute a hash h(P) of the pattern. Slide a window of length m across the text and compute h(T[i:i+m]). Whenever the window hash equals the pattern hash, do a full character comparison to confirm (because different strings can collide on the same hash).
The trick is the rolling hash — updating the window hash from position i to i+1 in O(1), not O(m).
Polynomial hash
Treat the m-character window as a base-b number modulo a large prime q:
h(s) = (s[0]·b^(m-1) + s[1]·b^(m-2) + ... + s[m-1]·b^0) mod q
To slide one position right, subtract the leading character’s contribution, multiply by b, then add the new trailing character:
h_new = ((h_old - s[0]·b^(m-1)) · b + s[m]) mod q
Implementation
def rabin_karp(text, pattern, base=256, mod=10**9 + 7):
n, m = len(text), len(pattern)
if m == 0 or m > n:
return []
h = pow(base, m - 1, mod) # base^(m-1) mod q
p_hash = 0
t_hash = 0
matches = []
# Initial hashes for pattern and first window
for i in range(m):
p_hash = (base * p_hash + ord(pattern[i])) % mod
t_hash = (base * t_hash + ord(text[i])) % mod
for i in range(n - m + 1):
if p_hash == t_hash and text[i:i + m] == pattern:
matches.append(i)
if i < n - m:
t_hash = (base * (t_hash - ord(text[i]) * h) + ord(text[i + m])) % mod
t_hash = (t_hash + mod) % mod # keep non-negative
return matches
print(rabin_karp("ababcabcabababd", "ababd")) # [10]
print(rabin_karp("hello world hello", "hello")) # [0, 12]
Complexity
- Average case: O(n + m). Collisions are rare with a good hash.
- Worst case: O(n · m). All hashes collide and every position requires a full character comparison. A second hash or a strong prime mitigates this.
- Space: O(1).
The killer feature of Rabin-Karp is multiple-pattern search: hash a set of patterns once, then a single pass over the text checks them all.
Choosing between them
| Need | Use |
|---|---|
| One pattern, worst-case guarantee | KMP |
| Many patterns at once | Rabin-Karp (or Aho-Corasick) |
| Tiny pattern, tiny text | Naive — simplicity wins |
| Built-in language method | Often a tuned variant of BMH or Two-Way |
In competitive programming KMP is the safer bet because of the worst-case bound. In production, the language’s built-in str.find / String.indexOf is already extremely fast and usually fine.
A note on Z-function and Boyer-Moore
Two algorithms worth knowing exist but are beyond this post:
- Z-function computes, for each position, the length of the longest substring starting there that matches a prefix of the string. It is an alternative to LPS with the same O(n+m) bound and is often easier to reason about.
- Boyer-Moore scans the pattern right-to-left and uses two heuristics (bad character and good suffix) to skip large blocks. It is sublinear on typical inputs and is what
grepuses.
Each deserves a focused post. The intuition you built around LPS transfers cleanly to both.
Try it yourself. Implement count_occurrences(text, pattern) using KMP. Test on ("abababab", "ab") — you should get 4. Then check that your function works on overlapping matches like ("aaaa", "aa") returning 3.
Recap
You now know:
- Naive search is two lines and O(n·m) in the worst case
- KMP precomputes an LPS array so the text pointer never moves backwards — O(n+m)
- The LPS array stores the longest proper prefix that is also a suffix at each position
- Rabin-Karp uses a rolling polynomial hash to compare windows in O(1) per shift
- KMP is the right pick for one pattern with worst-case guarantees; Rabin-Karp shines for multi-pattern search
Next steps
Next we move from algorithms to drills. The practice post tackles eight classic string interview problems — Valid Anagram, Group Anagrams, Longest Substring Without Repeating Characters, and more — each with a Python solution and complexity analysis.
→ Next: Strings: 8 Practice Problems with Solutions
Questions or feedback? Email codeloomdevv@gmail.com.