Skip to content
C Codeloom
DSA

String Pattern Matching: Naive, KMP, and Rabin-Karp

A practical tour of substring search — the naive O(n·m) scan, the KMP algorithm with its LPS array, and the Rabin-Karp rolling hash. Worked examples, code, and intuition for when to use each.

·8 min read · By Yash Kesharwani
Intermediate 12 min read

What you'll learn

  • How the naive substring scan works and where it fails
  • The intuition behind the KMP failure function (LPS array)
  • How to build an LPS array and use it to skip redundant comparisons
  • How the Rabin-Karp rolling hash matches in average O(n+m)
  • When to reach for each algorithm in interviews and in production

Prerequisites

The substring-search problem is simple to state: given a text T of length n and a pattern P of length m, find every index i such that T[i:i+m] == P. The naive solution is two lines. The classical solutions — KMP and Rabin-Karp — are surprisingly subtle, but the intuition is worth the effort. This post walks through all three.

The naive algorithm

The simplest approach: try every starting position in the text and compare character by character.

def naive_search(text, pattern):
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(i)
    return matches

print(naive_search("ababcabcabababd", "ababd"))  # [10]

Complexity

  • Worst case: O(n · m). Consider text "aaaaaaaaaab" and pattern "aaaab". At each of n positions we compare m characters before failing.
  • Best case: O(n). When the first character mismatches at most positions.
  • Space: O(1).

For small inputs this is fine. For repeated searches on the same text, or pathological inputs, we want better.

Why naive search is wasteful

Look at this comparison:

text:    A B A B A B C ...
pattern: A B A B D
         ^ ^ ^ ^ x        — mismatch at index 4

The naive algorithm now restarts at text index 1, retesting characters it already saw. But notice: the first four characters of the pattern, ABAB, contain a prefix AB that is also a suffix. So after the mismatch we already know text[2..3] == "AB", and we should resume comparing from pattern index 2 — not throw away that knowledge.

This observation is the seed of KMP.

KMP: the failure function (LPS array)

The Knuth–Morris–Pratt algorithm precomputes, for each prefix of the pattern, the length of the longest proper prefix that is also a suffix — abbreviated LPS. “Proper” means not equal to the prefix itself.

For pattern ABABD:

indexprefixlongest proper prefix == suffixLPS
0A0
1AB0
2ABAA1
3ABABAB2
4ABABD0

The LPS tells us how many characters we can safely keep when a mismatch occurs.

Building the LPS array

def build_lps(pattern):
    m = len(pattern)
    lps = [0] * m
    length = 0     # length of the previous longest prefix-suffix
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            length = lps[length - 1]   # fall back
        else:
            lps[i] = 0
            i += 1
    return lps

print(build_lps("ABABD"))    # [0, 0, 1, 2, 0]
print(build_lps("AAAA"))     # [0, 1, 2, 3]
print(build_lps("ABCABCD"))  # [0, 0, 0, 1, 2, 3, 0]

The clever line is length = lps[length - 1] — when characters differ but we have a previous prefix-suffix, we shrink to the next-longest one without restarting.

Armed with the LPS array, the search walks the text once. On a mismatch, instead of stepping back in the text, we step the pattern index back to lps[j - 1].

def kmp_search(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0:
        return [0]
    lps = build_lps(pattern)
    matches = []
    i = j = 0   # i over text, j over pattern
    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:
                matches.append(i - m)
                j = lps[j - 1]    # continue searching for overlapping matches
        else:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1
    return matches

print(kmp_search("ababcabcabababd", "ababd"))     # [10]
print(kmp_search("aaaaa", "aa"))                  # [0, 1, 2, 3]

Complexity

  • Preprocessing: O(m).
  • Search: O(n). The text index i never decreases.
  • Total: O(n + m).
  • Space: O(m) for the LPS array.

KMP guarantees linear time even on pathological inputs.

Try it yourself. Compute the LPS array for "AABAACAABAA" by hand. Then verify with build_lps. Pay attention to how the LPS jumps when the pattern contains repeated blocks.

Rabin-Karp: the rolling hash

The Rabin-Karp algorithm takes a different angle. Instead of comparing characters, it compares hashes of windows.

The idea: compute a hash h(P) of the pattern. Slide a window of length m across the text and compute h(T[i:i+m]). Whenever the window hash equals the pattern hash, do a full character comparison to confirm (because different strings can collide on the same hash).

The trick is the rolling hash — updating the window hash from position i to i+1 in O(1), not O(m).

Polynomial hash

Treat the m-character window as a base-b number modulo a large prime q:

h(s) = (s[0]·b^(m-1) + s[1]·b^(m-2) + ... + s[m-1]·b^0) mod q

To slide one position right, subtract the leading character’s contribution, multiply by b, then add the new trailing character:

h_new = ((h_old - s[0]·b^(m-1)) · b + s[m]) mod q

Implementation

def rabin_karp(text, pattern, base=256, mod=10**9 + 7):
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(base, m - 1, mod)   # base^(m-1) mod q
    p_hash = 0
    t_hash = 0
    matches = []

    # Initial hashes for pattern and first window
    for i in range(m):
        p_hash = (base * p_hash + ord(pattern[i])) % mod
        t_hash = (base * t_hash + ord(text[i])) % mod

    for i in range(n - m + 1):
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            t_hash = (base * (t_hash - ord(text[i]) * h) + ord(text[i + m])) % mod
            t_hash = (t_hash + mod) % mod   # keep non-negative
    return matches

print(rabin_karp("ababcabcabababd", "ababd"))   # [10]
print(rabin_karp("hello world hello", "hello")) # [0, 12]

Complexity

  • Average case: O(n + m). Collisions are rare with a good hash.
  • Worst case: O(n · m). All hashes collide and every position requires a full character comparison. A second hash or a strong prime mitigates this.
  • Space: O(1).

The killer feature of Rabin-Karp is multiple-pattern search: hash a set of patterns once, then a single pass over the text checks them all.

Choosing between them

NeedUse
One pattern, worst-case guaranteeKMP
Many patterns at onceRabin-Karp (or Aho-Corasick)
Tiny pattern, tiny textNaive — simplicity wins
Built-in language methodOften a tuned variant of BMH or Two-Way

In competitive programming KMP is the safer bet because of the worst-case bound. In production, the language’s built-in str.find / String.indexOf is already extremely fast and usually fine.

A note on Z-function and Boyer-Moore

Two algorithms worth knowing exist but are beyond this post:

  • Z-function computes, for each position, the length of the longest substring starting there that matches a prefix of the string. It is an alternative to LPS with the same O(n+m) bound and is often easier to reason about.
  • Boyer-Moore scans the pattern right-to-left and uses two heuristics (bad character and good suffix) to skip large blocks. It is sublinear on typical inputs and is what grep uses.

Each deserves a focused post. The intuition you built around LPS transfers cleanly to both.

Try it yourself. Implement count_occurrences(text, pattern) using KMP. Test on ("abababab", "ab") — you should get 4. Then check that your function works on overlapping matches like ("aaaa", "aa") returning 3.

Recap

You now know:

  • Naive search is two lines and O(n·m) in the worst case
  • KMP precomputes an LPS array so the text pointer never moves backwards — O(n+m)
  • The LPS array stores the longest proper prefix that is also a suffix at each position
  • Rabin-Karp uses a rolling polynomial hash to compare windows in O(1) per shift
  • KMP is the right pick for one pattern with worst-case guarantees; Rabin-Karp shines for multi-pattern search

Next steps

Next we move from algorithms to drills. The practice post tackles eight classic string interview problems — Valid Anagram, Group Anagrams, Longest Substring Without Repeating Characters, and more — each with a Python solution and complexity analysis.

→ Next: Strings: 8 Practice Problems with Solutions

Questions or feedback? Email codeloomdevv@gmail.com.