System Design: A Multi-Channel Notification Service
Design a notification service that delivers email, SMS, and push reliably with templating, rate limits, retries, and user preferences.
What you'll learn
- How to model channels and templates
- Queueing and retry strategies
- Respecting user preferences and quiet hours
- Vendor failover patterns
- Tracking delivery for analytics
Prerequisites
- Familiar with HTTP and databases
What and Why
Almost every product needs to ping users somehow. Notifications across email, SMS, and push share more than they differ: templating, audience selection, rate limiting, retries, and analytics. A central service avoids each team rebuilding this wheel poorly.
Done well, it becomes the source of truth for “what did we send to whom, when, and why.”
Mental Model
Three jobs: choose the audience, render the message, and ship it through the right channel. Each is a small service that hands off to the next via a queue. Each step is retryable on failure.
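The three jobs can be sketched as stages joined by queues. This is a minimal in-process sketch: `queue.Queue` stands in for a real broker, and every function and field name here is hypothetical.

```python
from queue import Queue

# In-process queues stand in for a real broker (assumption for illustration).
render_q: Queue = Queue()
send_q: Queue = Queue()

def choose_audience(event: dict) -> None:
    """Stage 1: expand an event into one job per recipient."""
    for user_id in event["user_ids"]:
        render_q.put({"user_id": user_id, "event": event["name"]})

def render(job: dict) -> None:
    """Stage 2: turn a job into a final payload."""
    body = f"Hello {job['user_id']}, event: {job['event']}"
    send_q.put({**job, "body": body})

def ship(job: dict, sent: list) -> None:
    """Stage 3: hand the payload to a channel (here, just record it)."""
    sent.append(job)

def run_pipeline(event: dict) -> list:
    """Drive one event through all three stages and return what was shipped."""
    sent: list = []
    choose_audience(event)
    while not render_q.empty():
        render(render_q.get())
    while not send_q.empty():
        ship(send_q.get(), sent)
    return sent
```

Because each stage only reads from one queue and writes to the next, a failed job can be put back on its input queue without touching the other stages.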
Architecture
Producers (other services) publish events: “user.signed_up”, “order.shipped”. A routing layer maps events to templates and channels based on user preferences. Renderers produce final payloads. Channel workers call vendors.
    Service -> Event Bus
                   |
    Routing (preferences, A/B, quiet hours)
                   |
              +----+----+
              v         v
           Email Q    Push Q
              |         |
           Renderer  Renderer
              |         |
            Vendor    Vendor
            (SES)     (FCM)
              |         |
           Webhooks -> Status DB

Status updates flow back from vendors via webhooks. The service writes them to a status store so the product can show “delivered” or “bounced” on a notification log.
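A webhook handler for the status store might look like the sketch below. It assumes webhooks can arrive duplicated or out of order, which the text above does not state; the status names, their ranking, and the `StatusStore` class are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical status ordering: a later webhook should not move a
# notification backwards (e.g. "sent" arriving after "delivered").
STATUS_RANK = {"queued": 0, "sent": 1, "delivered": 2, "bounced": 2}

@dataclass
class StatusStore:
    """In-memory stand-in for the status database."""
    rows: dict = field(default_factory=dict)

    def apply_webhook(self, notification_id: str, status: str) -> bool:
        """Record a vendor status update; return True if the row changed."""
        if status not in STATUS_RANK:
            raise ValueError(f"unknown status: {status}")
        current = self.rows.get(notification_id, "queued")
        if STATUS_RANK[status] <= STATUS_RANK[current]:
            return False  # duplicate or out-of-order webhook; ignore it
        self.rows[notification_id] = status
        return True
```

Treating "delivered" and "bounced" as equally final means whichever arrives first wins; a real store may want a different rule.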
A scheduler handles delayed sends and recurring jobs. It uses a sorted store (a Redis sorted set, or ZSET, with the send time as each member's score) to find work that is due.
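The "find due work" step can be sketched without a Redis server. The `DelayedQueue` class below is a hypothetical in-memory stand-in backed by a heap; with Redis, scheduling would roughly correspond to `ZADD` and fetching due jobs to `ZRANGEBYSCORE` followed by removal.

```python
import heapq

class DelayedQueue:
    """In-memory stand-in for a sorted set scored by send time."""

    def __init__(self) -> None:
        self._heap: list = []  # entries are (send_at, job_id)

    def schedule(self, job_id: str, send_at: float) -> None:
        """Register a job to be sent at the given timestamp."""
        heapq.heappush(self._heap, (send_at, job_id))

    def pop_due(self, now: float) -> list:
        """Remove and return every job whose send time is <= now, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due
```

Unlike a real sorted set, this stand-in does not deduplicate: scheduling the same `job_id` twice yields two entries rather than updating the score.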
Trade-offs
Synchronous send-and-confirm APIs are simpler but couple producers to vendor latency. Asynchronous APIs (publish the event, return immediately) decouple producers from vendors, but the producer can no longer surface an immediate failure to its caller.
Vendor failover sounds attractive but adds complexity. Email is the friendliest case because most providers speak SMTP; SMS and push are more vendor-specific.
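At its simplest, failover is an ordered list of vendors tried in turn. The sketch below assumes each vendor is wrapped in a callable with the same signature and raises a shared `VendorError` on failure; both are hypothetical, and real vendor clients would need adapters to fit this shape.

```python
from typing import Callable, Sequence

class VendorError(Exception):
    """Raised by a vendor wrapper when a send fails (hypothetical)."""

# (recipient, body) -> vendor message id
Sender = Callable[[str, str], str]

def send_with_failover(vendors: Sequence[tuple[str, Sender]], to: str, body: str) -> tuple[str, str]:
    """Try each vendor in order; return (vendor_name, message_id) from the first success."""
    errors = []
    for name, send in vendors:
        try:
            return name, send(to, body)
        except VendorError as exc:
            errors.append(f"{name}: {exc}")
    raise VendorError("all vendors failed: " + "; ".join(errors))
```

Returning the vendor name alongside the message id matters because later status webhooks will come from whichever vendor accepted the send.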
Per-user rate limiting protects users from spam at the cost of more state. Global limits are easier but allow noisy senders to drown out important messages.
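The extra state for per-user limits can be as small as a short list of recent send times per user. The sketch below uses a sliding window, which is one of several possible algorithms; the class name and parameters are hypothetical, and a production version would keep this state in a shared store rather than process memory.

```python
from collections import defaultdict, deque

class PerUserLimiter:
    """Allow at most `limit` sends per user within a sliding `window_s` seconds."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self._sent = defaultdict(deque)  # user_id -> timestamps of recent sends

    def allow(self, user_id: str, now: float) -> bool:
        """Return True and record the send if the user is under the limit."""
        recent = self._sent[user_id]
        # Drop timestamps that have fallen out of the window.
        while recent and recent[0] <= now - self.window_s:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True
```

Passing `now` in explicitly keeps the limiter deterministic and easy to test.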
Practical Tips
Tag every send with a correlation_id from the source event. Engineers will thank you when debugging “why did the user get this?”
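One way to make the tag hard to forget is to build it into the record written for every send. The `SendRecord` shape and the fallback to a fresh id when the producer supplied none are assumptions for illustration.

```python
import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class SendRecord:
    notification_id: str
    correlation_id: str  # copied from the source event
    user_id: str
    channel: str

def record_for(event: dict, user_id: str, channel: str) -> SendRecord:
    """Build a send record that carries the source event's correlation_id."""
    # Fall back to a fresh id if the producer did not supply one (assumption).
    correlation_id = event.get("correlation_id") or str(uuid.uuid4())
    return SendRecord(str(uuid.uuid4()), correlation_id, user_id, channel)
```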
Honor unsubscribe and quiet-hour preferences at the routing layer, not the renderer. By the time you render, the decision should already be made.
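A routing-layer check might look like the following sketch. The `Preferences` fields and default quiet window are hypothetical, and the sketch applies quiet hours to every channel, which is a simplifying assumption (a real service might exempt email or urgent messages).

```python
from dataclasses import dataclass, field
from datetime import time

@dataclass
class Preferences:
    unsubscribed: set = field(default_factory=set)  # channels the user opted out of
    quiet_start: time = time(22, 0)                 # user's local time (hypothetical default)
    quiet_end: time = time(7, 0)

def in_quiet_hours(prefs: Preferences, local_now: time) -> bool:
    """True if local_now is inside the quiet window, including windows that cross midnight."""
    start, end = prefs.quiet_start, prefs.quiet_end
    if start <= end:
        return start <= local_now < end
    return local_now >= start or local_now < end

def route(channels: list, prefs: Preferences, local_now: time) -> list:
    """Return the channels to send on now; an empty list means defer or drop."""
    if in_quiet_hours(prefs, local_now):
        return []
    return [c for c in channels if c not in prefs.unsubscribed]
```

The midnight-crossing branch is the part that is easy to get wrong: a 22:00 to 07:00 window has `start > end`, so a plain range comparison would never match.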
Throttle blast campaigns. Sending a million emails in one minute looks like spam to providers and earns you a rate-limit penalty. Spread sends over a window.
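Spreading sends over a window amounts to assigning each message a send time. The sketch below spaces them evenly; the function name is hypothetical, and real campaigns may prefer to follow a provider's documented rate limits instead.

```python
def spread_send_times(n: int, start: float, window_s: float) -> list:
    """Return n send timestamps evenly spaced across [start, start + window_s)."""
    if n <= 0:
        return []
    step = window_s / n
    return [start + i * step for i in range(n)]
```

For example, `spread_send_times(4, 100.0, 60.0)` gives `[100.0, 115.0, 130.0, 145.0]`. These timestamps can be handed to the scheduler as delayed sends.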
Budget for vendor outages. A small in-memory queue with exponential backoff covers brief blips. A persistent queue with dead-letter handling covers extended ones.
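The retry-then-dead-letter flow can be sketched as below. The delays, the attempt limit, and the use of a plain list as the dead-letter queue are illustrative assumptions; the random jitter is an addition not mentioned above, included so that many workers do not retry in lockstep.

```python
import random
import time

def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0) -> float:
    """Upper bound on the wait before the retry that follows attempt number `attempt` (0-based)."""
    return min(cap_s, base_s * (2 ** attempt))

def send_with_retries(send, job, max_attempts=5, sleep=time.sleep, dead_letter=None):
    """Call send(job) until it succeeds; after max_attempts failures, dead-letter the job."""
    for attempt in range(max_attempts):
        try:
            return send(job)
        except Exception:
            # A real worker would retry only errors it knows to be transient.
            if attempt == max_attempts - 1:
                break
            # Wait a random amount up to the exponential bound.
            sleep(random.uniform(0, backoff_delay(attempt)))
    if dead_letter is not None:
        dead_letter.append(job)
    return None
```

Injecting `sleep` lets tests run instantly; in production the default `time.sleep` applies.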
Wrap-up
A notification service is a workflow problem dressed as a messaging problem. Model it as a pipeline of small steps with retries and observability between each. Pick boring infrastructure (Kafka or SQS, Redis, Postgres) and put the smarts in routing and preferences. The service grows up to be one of the most-used parts of your platform.