System Design: Search Autocomplete System
Design a fast search autocomplete that returns suggestions in tens of milliseconds, with personalization, ranking, and freshness updates.
What you'll learn
- ✓Why autocomplete is hard at scale
- ✓Tries vs inverted indexes
- ✓Ranking by popularity and recency
- ✓Personalization layers
- ✓Keeping suggestions fresh
Prerequisites
- •Familiar with HTTP and databases
What and Why
Autocomplete is the first thing a user sees when they tap a search box, and it shapes what they search for. It must respond in under 100 ms so suggestions feel like part of typing.
Beyond speed, suggestions need to be relevant. A grocery app should not suggest the same query a news site would. Ranking is half the design.
Mental Model
Two reads dominate: “what queries start with this prefix” and “how popular is each candidate.” The first is a structural lookup; the second is a numeric ranking. Most designs separate them.
A trie or a precomputed prefix map answers the first. A score per query, refreshed on a schedule, answers the second.
Architecture
A precomputed index maps each prefix (up to N characters) to its top K candidate queries already ranked. Lookups become a single hash get. A streaming pipeline updates scores from search logs.
Client -> Edge API -> Prefix Cache (Redis)
|
hit -> top K queries
|
miss -> Index Service -> rebuild prefix
|
Background job
|
updates from logs The score combines popularity, recency, and personalization. A simple model: score = log(count) * recency_decay + user_affinity. Compute the global piece offline; layer personalization at query time.
Tries are elegant but memory-heavy at scale. A precomputed top-K map per prefix trades memory for simpler reads. For very long prefixes you fall back to a search index like Elasticsearch.
Trade-offs
Latency budgets are tight. You usually cannot afford a database hit, so the entire hot path lives in memory. That means the index must fit in RAM across the fleet.
Personalization at scale is expensive. Most products apply only a thin per-user re-rank on top of global suggestions, instead of building a personalized index per user.
Spelling correction adds value but doubles complexity. Many teams ship pure prefix matching first and add fuzzy matching later.
Practical Tips
Cap the prefix length at the index. Prefixes beyond 6-8 characters are rare and can fall back to a generic search call.
Throttle on the client: only send a request when the user pauses typing for 50 ms. This cuts request volume by 5-10x without hurting feel.
Cache aggressively at the edge. Many users type the same first letters; an edge cache with a one-minute TTL is enough to absorb most traffic.
Watch for offensive or low-quality suggestions. Maintain a blocklist and a manual override path. This is product work, not infrastructure, but it gates launch.
Wrap-up
Autocomplete is a small surface area with deep engineering. Precompute prefixes, keep them in memory, and rank with a simple but tuned score. Add personalization and freshness only where they pay off. The result is a feature users barely notice, which is exactly what good autocomplete should be.
Related articles
- System Design System Design: Design Search Autocomplete (Typeahead)
Design a search autocomplete service like Google or Amazon typeahead. Covers trie structures, ranking, prefix caches, and serving suggestions in under 50 ms.
- System Design CAP Theorem in Practice: What It Actually Means for Your System
A pragmatic look at the CAP theorem: what consistency and availability mean for real workloads, and how PACELC describes the trade-offs better.
- System Design Consistent Hashing Explained for Engineers Who Operate Real Systems
How consistent hashing actually works in production: virtual nodes, rebalancing, hot keys, and why naive modulo hashing fails at scale.
- System Design Designing Rate Limiters: A System Design Deep Dive
A senior-engineer guide to designing rate limiters: algorithms, distributed coordination, trade-offs, and production patterns that actually scale.