System Design: Design Search Autocomplete (Typeahead)
Design a search autocomplete service like Google or Amazon typeahead. Covers trie structures, ranking, prefix caches, and serving suggestions in under 50 ms.
What you'll learn
- ✓Build a prefix index with tries or sorted strings
- ✓Rank suggestions by frequency and recency
- ✓Serve every keystroke under 50 ms
- ✓Ingest fresh query data without rebuilding everything
- ✓Handle typos and personalization briefly
Prerequisites
- •Familiarity with hash maps and tree structures. See [Big O Notation Explained](/blog/big-o-notation-explained).
- •Comfort with HTTP and CDN basics.
Autocomplete fires a request on every keystroke. A 200 ms search backend is great. A 200 ms autocomplete is unusable. The design problem is to serve ranked prefix matches in tens of milliseconds, refresh the index without downtime, and stay cheap per request.
Functional Requirements
- Return top N suggestions for a prefix.
- Rank by popularity and recency.
- Update suggestions as user queries change (within hours, not seconds).
- Support multiple locales and languages.
- Personalization is optional.
Non-Functional Requirements
- Read QPS: 100k average, 500k peak.
- Latency: p99 under 50 ms end-to-end including network.
- Index size: roughly 10M unique queries per locale.
- Availability: 99.99 percent. Better to return nothing than to be slow.
High-Level Architecture
- Edge layer: CDN caches popular prefixes (1- to 3-letter prefixes get hit constantly).
- Autocomplete service: stateless, holds a per-locale prefix index in memory.
- Index builder: offline job that aggregates query logs, ranks, and produces a new index snapshot.
- Index distribution: snapshots pushed to autocomplete nodes via object storage with a version pointer.
- Query log pipeline: every search query is logged through a queue into a data warehouse for the next index build.
Data Model
Raw query aggregation:
CREATE TABLE query_counts (
query TEXT,
locale VARCHAR(8),
count BIGINT,
last_seen TIMESTAMP,
PRIMARY KEY (locale, query)
);
The serving index is a trie or a compressed prefix structure (a ternary search tree, FST, or sorted array of (prefix, suggestions[])). Each prefix node stores the top N suggestions:
prefix "rea":
top10: [ "real estate", "react", "reading list", ... ]
Pre-computing the top N per prefix collapses query time to a single lookup.
Key APIs
GET /api/v1/suggest?q=rea&locale=en-US&limit=10
returns: suggestions[]
Response should be cacheable at the CDN with a short TTL, keyed on q and locale.
Building the Index
The builder runs every few hours:
- Read recent query logs from the warehouse.
- Aggregate counts with a time decay so recent activity outweighs old.
- For each prefix from 1 to maxlen characters, take the top N suggestions ordered by score.
- Serialize the per-prefix top-N map into a compact binary file.
- Upload to object storage, bump the version pointer.
Autocomplete nodes poll the version pointer, download the new snapshot, swap it in atomically (load new into a fresh structure, flip a pointer, free the old one). No restart needed.
Scaling and Tradeoffs
In-memory vs Redis. In-memory per node is faster and cheaper at scale, but requires the index to fit in RAM and to be replicated to every node. Redis is fine for smaller systems and avoids the snapshot-distribution complexity.
Sharding by locale. A single node holds one or a few locales. Route by locale header. This keeps memory bounded and lets you scale languages independently.
CDN caching. Short-prefix queries (a, ab, abc) are massively repetitive. Cache aggressively at the edge with a 60-second TTL. This drops 80 percent of traffic before it reaches your service.
Top-N precomputation. Computing top-N at query time is too slow for trie traversal. Store the precomputed top-N at each prefix node. Memory cost is roughly N * average_query_length * unique_prefixes — tractable.
Typo tolerance. Either run a separate edit-distance index (BK-tree, Levenshtein automaton) and merge results, or rely on the search backend’s typo handling and only autocomplete clean prefixes.
Personalization. Blend a small personal list (recent searches) into the global suggestions client-side or via a thin server-side merge. Avoid putting per-user data in the global index.
Ranking signals. Frequency is the baseline. Add time decay, CTR (suggestions clicked vs shown), and category boosts. Keep the model simple — autocomplete is a latency game, not a relevance game.
Cold prefixes. A long prefix may have zero suggestions. Return empty fast — do not search further. Empty responses are still cached.
Multi-region. Replicate snapshots to each region’s object storage and warm the nodes there. The query log pipeline aggregates globally.
For the deployment view, autocomplete nodes are a great fit for a Kubernetes Deployment behind a load balancer — see What is Kubernetes and What is AWS for the infra context.
What to Say in an Interview
- Lead with the latency budget. 50 ms forces in-memory indexes and CDN caching.
- Separate the build path from the serve path. The interviewer wants to hear an offline job, not real-time updates.
- Precompute top-N per prefix and explain the memory tradeoff.
- Mention CDN caching for short prefixes — most candidates miss this and it is the single biggest cost saver.
- Treat typos as a separate problem you can either solve with a second index or defer to the search backend.
Wrap up
Autocomplete is a precomputed map from prefix to top-N suggestions, served from memory, fronted by a CDN, and rebuilt offline from query logs. Get the snapshot swap right and the rest of the system is almost embarrassingly simple — which is exactly the point at 50 ms p99.