System Design: Design a Message Queue (Kafka-like)
Design a durable message queue like Kafka or Pulsar. Covers partitions, replication, consumer groups, ordering guarantees, and exactly-once semantics in practice.
What you'll learn
- ✓Model topics, partitions, and offsets
- ✓Replicate logs for durability without losing throughput
- ✓Run consumer groups with rebalancing
- ✓Pick at-least-once, at-most-once, or exactly-once
- ✓Decide when to use Kafka vs a queue like SQS
Prerequisites
- •Familiarity with disks, append-only files, and TCP.
- •Comfort with replication and leader election concepts.
Kafka turned messaging into a distributed log: an append-only, partitioned, replicated, infinitely-retainable sequence of records. Designing one yourself forces clear thinking about ordering, durability, and what “consume” actually means.
Functional Requirements
- Producers publish records to a named topic.
- Consumers read records in order from a topic.
- Records are durable for a configured retention (days or unbounded).
- Multiple independent consumers can read the same topic.
- Consumer groups split a topic’s load across members.
Non-Functional Requirements
- Throughput: 1M messages per second per cluster, scaling with brokers.
- End-to-end latency: p99 under 50 ms.
- Durability: no acked write is lost on single-node failure.
- Retention: 7 days default, configurable up to forever.
- Availability: 99.99 percent.
High-Level Architecture
- Brokers: stateful servers that hold partition data on disk.
- Topics: logical streams. Each topic is split into N partitions.
- Partitions: append-only logs replicated to a small set of brokers (typically 3).
- Controller: a single broker that manages partition leadership and reassignment. Modern Kafka uses KRaft (Raft-based) instead of ZooKeeper.
- Producers: write to the leader of each partition.
- Consumers: read from the leader, track their own offset.
Data Model
A partition is a sequence of records:
record {
offset: uint64
timestamp: uint64
key: bytes
value: bytes
headers: map
}
On disk, a partition is a directory of segment files:
00000000000000000000.log
00000000000000000000.index
00000000000000123456.log
00000000000000123456.index
Segments rotate by size or time. Old segments are deleted (or compacted) on retention expiry. The index file maps offsets to file positions for O(log n) seeks.
Key APIs
Producer:
produce(topic, key, value) -> { partition, offset }
Consumer:
subscribe(topic, group_id)
poll(timeout) -> records[]
commit(offsets)
Admin:
create_topic(name, partitions, replication_factor)
list_topics()
Partitioning and Ordering
A topic is split into partitions for parallelism. The producer picks a partition by hashing the key, or round-robin if there is no key. Order is guaranteed only within a single partition, never across.
Consumer parallelism is bounded by partition count: N consumers in a group can each own a disjoint set of partitions, but you cannot have more active consumers than partitions.
If you need ordering across millions of users, key by user_id so each user’s messages land on a single partition. If you need global ordering, you need one partition, and you have rebuilt a single-machine queue.
Replication
Each partition has a leader and N-1 followers. The producer writes to the leader. The leader appends to its log and replicates to followers. Followers fetch and ack.
Ack levels:
acks=0: fire and forget. Fastest, can lose data.acks=1: wait for leader. Loses data if leader crashes before replication.acks=all: wait for the in-sync replica set. Slowest, no data loss.
The in-sync replica set (ISR) is the set of followers caught up to the leader. If a follower lags, it falls out of ISR; if it catches up, it rejoins.
On leader failure, the controller elects a new leader from the ISR. If the ISR is empty, you choose between unclean leader election (pick a stale follower, lose data) or unavailability.
Scaling and Tradeoffs
Why a log, not a queue? Multiple consumers reading the same data is free — they each track an offset. Queues delete on consume; logs let you replay.
Disk performance. Sequential append is fast on spinning disks and absurdly fast on SSDs. The hot path is write() plus fsync, plus a sendfile to consumers. No random IO.
Zero copy. Brokers use sendfile to ship bytes from the page cache straight to the socket without copying into user space. This is a big throughput unlock.
Consumer rebalancing. When a member joins or leaves a group, partitions are reassigned. Old protocols stopped the world. New (incremental cooperative) protocols only move the partitions that change owners.
Exactly-once. True exactly-once requires idempotent producers (sequence numbers per partition) plus transactional commits across topic write and offset commit. It costs throughput. Most teams stick with at-least-once plus idempotent consumers.
Retention vs compaction. Time- or size-based retention deletes old segments. Log compaction keeps only the latest value per key — useful for changelog topics that drive caches.
Kafka vs SQS/RabbitMQ. Kafka is a high-throughput log with consumer-tracked offsets. SQS is a per-message queue with broker-tracked state and per-message acks. RabbitMQ sits in between with flexible routing. Pick the log when you need replay and high throughput; pick a queue when you need per-message visibility timeouts and dead-letter routing.
Deployment. Brokers are stateful. On Kubernetes, run them as a StatefulSet with persistent volumes — see What is Kubernetes. On AWS, managed MSK is fine — see What is AWS.
What to Say in an Interview
- Frame the partition as the unit of parallelism and ordering. This single sentence covers half the design.
- Explain ISR and
acks=allto defend durability claims. - Cover consumer group rebalancing — interviewers love asking what happens when a consumer dies mid-poll.
- Be honest about exactly-once: it is possible but expensive, and at-least-once plus idempotency is more common.
- Mention zero-copy and the page cache as the reason throughput is so high. Shows you have read the Kafka design papers.
Wrap up
A message queue at this scale is a replicated, append-only log split into partitions. Producers hash by key, consumers track offsets, and the controller manages leadership. Make peace with at-least-once delivery and build idempotent consumers — that is what every production Kafka deployment ends up doing anyway.