System Design: Building a Scalable Chat Application
Design a real-time chat system that supports millions of users with low-latency messaging, presence, and message persistence at scale.
What you'll learn
- Connection model for real-time chat
- How to fan-out messages
- Storage choices for chat history
- Presence and typing indicators
- Scaling WebSockets horizontally
Prerequisites
- Familiarity with HTTP and databases
What and Why
Chat is one of the most requested system design questions because it touches real-time communication, eventual consistency, persistence, and fan-out. A good chat system must deliver messages in under a second, survive server restarts, and let users read history from any device.
We will focus on one-to-one and small-group chat. Massive group chats (10k+ members) add fan-out problems that we will only touch on briefly.
Mental Model
Think of chat as three concerns:
- A long-lived connection so the server can push messages instantly.
- A durable log so messages survive crashes and reconnects.
- A routing layer that knows which server holds which user’s connection.
If you keep these three concerns separate, scaling becomes a matter of growing each piece independently.
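The routing concern from the list above can be sketched in a few lines. This is a minimal illustration, not a production design: `RoutingTable`, `route`, and the gateway names are hypothetical, and a plain dict stands in for whatever shared store the gateways would actually read.

```python
from collections import defaultdict


class RoutingTable:
    """Maps each user to the gateway holding that user's live connection."""

    def __init__(self):
        # Assumption: a plain dict stands in for a store shared by all gateways.
        self._gateway_by_user = {}

    def register(self, user_id, gateway):
        # Called when a user's connection is accepted by a gateway.
        self._gateway_by_user[user_id] = gateway

    def unregister(self, user_id):
        # Called when the connection closes; a missing user is ignored.
        self._gateway_by_user.pop(user_id, None)

    def lookup(self, user_id):
        # Returns None when the user has no live connection.
        return self._gateway_by_user.get(user_id)


def route(table, recipients, message):
    """Group recipients by gateway; users without a connection are returned separately."""
    per_gateway = defaultdict(list)
    offline = []
    for user_id in recipients:
        gateway = table.lookup(user_id)
        if gateway is None:
            offline.append(user_id)
        else:
            per_gateway[gateway].append((user_id, message))
    return dict(per_gateway), offline


table = RoutingTable()
table.register("bob", "gateway-2")
table.register("carol", "gateway-3")
deliveries, offline = route(table, ["bob", "carol", "dave"], "hi")
```

Here `deliveries` groups the message by gateway and `offline` contains `"dave"`, who would pick the message up from the durable log on reconnect.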
Architecture
Clients connect to a gateway over WebSocket. The gateway authenticates the user, registers the connection in a routing table (often Redis), and forwards inbound messages to a message service. The message service writes to a durable store and publishes to a pub/sub topic so other gateways can deliver the message to recipients.
Client A -> Gateway 1 -> Message Service -> DB (append)
                                |
                                v
                             Pub/Sub
                                |
                    +-----------+-----------+
                    v                       v
                Gateway 2               Gateway 3
                    |                       |
                Client B                Client C

For storage, a wide-column store like Cassandra or DynamoDB works well because each conversation is an append-only log: conversation_id is the partition key and the timestamp orders messages within the partition. Reading history is then a range scan over timestamps inside a single partition.
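To make the storage access pattern concrete, here is a small in-memory sketch of a log partitioned by conversation_id and sorted by timestamp. `ConversationLog` and its method names are hypothetical; a real wide-column store would do the partitioning and sorting for you.

```python
import bisect
from collections import defaultdict


class ConversationLog:
    """In-memory stand-in for a store partitioned by conversation_id."""

    def __init__(self):
        # conversation_id -> list of (timestamp_ms, body), kept sorted by timestamp
        self._partitions = defaultdict(list)

    def append(self, conversation_id, timestamp_ms, body):
        # insort keeps the partition ordered even if writes arrive out of order.
        bisect.insort(self._partitions[conversation_id], (timestamp_ms, body))

    def read_range(self, conversation_id, start_ms, end_ms):
        """Return messages with start_ms <= timestamp < end_ms, oldest first."""
        rows = self._partitions.get(conversation_id, [])
        # A 1-tuple sorts before any 2-tuple with the same first element,
        # so these bounds select exactly the half-open timestamp range.
        lo = bisect.bisect_left(rows, (start_ms,))
        hi = bisect.bisect_left(rows, (end_ms,))
        return rows[lo:hi]


log = ConversationLog()
log.append("c1", 100, "hello")
log.append("c1", 300, "bye")
log.append("c1", 200, "how are you?")
log.append("c2", 150, "other chat")
recent = log.read_range("c1", 150, 301)
# recent == [(200, "how are you?"), (300, "bye")]
```

The read touches only the "c1" partition, which is why this key design keeps history queries cheap.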
Presence (“online” indicators) lives in Redis with TTLs that refresh on heartbeat. Typing indicators are ephemeral and skip persistence entirely.
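The TTL-plus-heartbeat idea can be sketched without Redis. In this illustration, `PresenceTracker`, the 30-second TTL, and the injectable `clock` are all assumptions chosen for clarity; in a Redis-backed setup the key expiry would do the same job.

```python
import time


class PresenceTracker:
    """Marks a user online until their TTL lapses without a heartbeat."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._expires_at = {}

    def heartbeat(self, user_id):
        # Each heartbeat pushes the expiry forward by one full TTL.
        self._expires_at[user_id] = self._clock() + self._ttl

    def is_online(self, user_id):
        expires_at = self._expires_at.get(user_id)
        if expires_at is None:
            return False
        if self._clock() >= expires_at:
            # Lazily drop the expired entry, as a TTL-based store would.
            del self._expires_at[user_id]
            return False
        return True


now = [0.0]  # fake clock for the demonstration
presence = PresenceTracker(ttl_seconds=30.0, clock=lambda: now[0])
presence.heartbeat("alice")  # expires at t=30
now[0] = 20.0
presence.heartbeat("alice")  # refreshed, now expires at t=50
now[0] = 45.0
still_online = presence.is_online("alice")  # True
now[0] = 51.0
gone_offline = not presence.is_online("alice")  # True
```

A client that crashes simply stops sending heartbeats, so it falls offline without any explicit disconnect handling.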
Trade-offs
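WebSockets give low latency but make horizontal scaling tricky because each connection is long-lived and pinned to one server, so a load balancer cannot freely move a user between servers. Long polling is simpler to deploy but pays for a fresh HTTP request on every poll cycle, which adds latency and server overhead. Server-sent events handle server-to-client streams well, but the client still needs separate HTTP requests to send, so they are a poor fit for bidirectional chat.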
Choosing Cassandra trades flexible querying for predictable write throughput. If you need complex filters, you would layer search on top using Elasticsearch.
Strict global ordering across servers is expensive because it requires coordination on every message. Most chat systems accept per-conversation ordering only, using a server-assigned timestamp or sequence number.
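Per-conversation ordering with a server-assigned sequence number might look like the sketch below. `ConversationSequencer` is a hypothetical name, and the in-process counters are an assumption: with several message-service instances, the counter would need to live somewhere that offers an atomic increment.

```python
import itertools
from collections import defaultdict


class ConversationSequencer:
    """Hands out increasing sequence numbers, independently per conversation."""

    def __init__(self):
        # Each conversation gets its own counter starting at 1.
        self._counters = defaultdict(lambda: itertools.count(1))

    def next_seq(self, conversation_id):
        return next(self._counters[conversation_id])


def in_conversation_order(messages):
    """Sort messages of one conversation by their server-assigned sequence."""
    return sorted(messages, key=lambda m: m["seq"])


sequencer = ConversationSequencer()
first = {"seq": sequencer.next_seq("c1"), "body": "hi"}
second = {"seq": sequencer.next_seq("c1"), "body": "are you there?"}
other = {"seq": sequencer.next_seq("c2"), "body": "unrelated"}
# Messages may reach a client out of order; the sequence restores it.
ordered = in_conversation_order([second, first])
```

Note that `other` also gets sequence 1: numbers are only comparable within a single conversation, which is exactly the relaxation described above.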
Practical Tips
Use a client_message_id for idempotency so retries do not create duplicates. Acknowledge messages in two stages: “received by server” and “delivered to recipient.” This makes the UI feel responsive while keeping correctness.
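A minimal sketch of both ideas, idempotent receipt and two-stage acknowledgement, is shown below. `MessageService`, the ack dictionaries, and the choice to scope client_message_id per sender are assumptions made for illustration.

```python
class MessageService:
    """Stores each message once, keyed by (sender_id, client_message_id)."""

    def __init__(self):
        self._by_key = {}  # (sender_id, client_message_id) -> record
        self._log = []     # stand-in for the durable store

    def receive(self, sender_id, client_message_id, body):
        """Stage 1: persist if new, then ack. A retry returns the same ack."""
        key = (sender_id, client_message_id)
        record = self._by_key.get(key)
        if record is None:
            record = {
                "server_id": len(self._log) + 1,
                "body": body,
                "status": "received",
            }
            self._by_key[key] = record
            self._log.append(record)
        return {"ack": "received", "server_id": record["server_id"]}

    def mark_delivered(self, sender_id, client_message_id):
        """Stage 2: the recipient's gateway confirmed delivery."""
        record = self._by_key[(sender_id, client_message_id)]
        record["status"] = "delivered"
        return {"ack": "delivered", "server_id": record["server_id"]}

    def message_count(self):
        return len(self._log)


service = MessageService()
ack1 = service.receive("alice", "m-1", "hello")
ack2 = service.receive("alice", "m-1", "hello")  # client retry after a timeout
delivered = service.mark_delivered("alice", "m-1")
# ack1 == ack2 and service.message_count() == 1
```

The sender's UI can show a single tick on the first ack and a double tick on the second, without the retry ever producing a duplicate row.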
Keep group fan-out under control by using a “fan-out on read” strategy for large groups: instead of pushing to every member, store once and let clients pull. Small groups can use “fan-out on write” for snappier delivery.
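The two strategies can be contrasted in one small class. Everything here is illustrative: `GroupChat`, the per-member inboxes, and especially the threshold of 100 members are hypothetical values, not recommendations.

```python
FANOUT_ON_WRITE_MAX_MEMBERS = 100  # hypothetical cut-over point


class GroupChat:
    """Pushes to every member in small groups; stores once in large ones."""

    def __init__(self, members):
        self.members = list(members)
        self.log = []                                  # single shared copy
        self.inboxes = {m: [] for m in self.members}   # per-member push queues
        self.read_cursor = {m: 0 for m in self.members}

    def send(self, sender, body):
        self.log.append((sender, body))  # always stored exactly once
        if len(self.members) <= FANOUT_ON_WRITE_MAX_MEMBERS:
            # Fan-out on write: one push per recipient, paid at send time.
            for member in self.members:
                if member != sender:
                    self.inboxes[member].append((sender, body))
            return "fanout_on_write"
        # Fan-out on read: nothing pushed; members pull when they open the chat.
        return "fanout_on_read"

    def pull(self, member):
        """Return log entries this member has not pulled yet (own messages included)."""
        start = self.read_cursor[member]
        self.read_cursor[member] = len(self.log)
        return self.log[start:]


small = GroupChat(["a", "b", "c"])
small_strategy = small.send("a", "hi")   # "fanout_on_write"

large = GroupChat([f"u{i}" for i in range(101)])
large_strategy = large.send("u0", "hi")  # "fanout_on_read"
unread = large.pull("u5")                # [("u0", "hi")]
```

The write cost of the small group grows with member count, while the large group pays a constant write cost and shifts the work to readers.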
Compress payloads with permessage-deflate, but watch CPU. Cap message size to a few KB and force media to go through a separate upload endpoint that returns a URL.
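As a rough illustration of the size cap and the compression trade-off, the sketch below uses Python's `zlib` to produce a raw DEFLATE stream, the same compression family that permessage-deflate builds on. In practice the WebSocket library negotiates and applies the extension itself; the 4096-byte cap and the function names are assumptions.

```python
import zlib

MAX_MESSAGE_BYTES = 4096  # hypothetical cap of "a few KB"


def validate_size(text):
    """Encode a message and reject it if it exceeds the cap."""
    payload = text.encode("utf-8")
    if len(payload) > MAX_MESSAGE_BYTES:
        raise ValueError(
            f"message is {len(payload)} bytes; limit is {MAX_MESSAGE_BYTES}"
        )
    return payload


def deflate(payload, level=6):
    """Compress to a raw DEFLATE stream (negative wbits means no zlib header)."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(payload) + compressor.flush()


def inflate(data):
    """Decompress a raw DEFLATE stream produced by deflate()."""
    return zlib.decompress(data, -15)


payload = validate_size("hello " * 200)  # 1200 bytes of repetitive text
compressed = deflate(payload)
restored = inflate(compressed)
```

Repetitive text like this compresses well, but every message costs CPU on both ends, so it is worth measuring before enabling compression for short messages. Oversized content, such as media, fails `validate_size` and would go through the upload endpoint instead.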
Use a single connection per device, not per tab. On the web client, a SharedWorker can multiplex many tabs through one socket.
Wrap-up
A chat system is a fun problem because it forces you to think about pushing data, not pulling it. Start with three pillars: gateways that hold persistent connections, durable storage, and pub/sub routing. Scale each pillar separately and the rest of the design tends to fall into place.
In an interview, lead with the data model, then the connection layer, then fan-out. That order proves you understand the constraints before you reach for fancy tech.