System Design: A Multi-Channel Notification Service

Intermediate 10 min read

What you'll learn

✓ How to model channels and templates
✓ Queueing and retry strategies
✓ Respecting user preferences and quiet hours
✓ Vendor failover patterns
✓ Tracking delivery for analytics

Prerequisites

• Familiarity with HTTP and databases

What and Why

Almost every product needs to ping users somehow. Notifications across email, SMS, and push share more than they differ: templating, audience selection, rate limiting, retries, and analytics. A central service avoids each team rebuilding this wheel poorly.

Done well, it becomes the source of truth for “what did we send to whom, when, and why.”

Mental Model

Three jobs: choose the audience, render the message, and ship it through the right channel. Each is a small service that hands off to the next via a queue. Because every hand-off is queued, a failed step can be retried on its own without rerunning the steps before it.

Architecture

Producers (other services) publish events: “user.signed_up”, “order.shipped”. A routing layer maps events to templates and channels based on user preferences. Renderers produce final payloads. Channel workers call vendors.

Service -> Event Bus
            |
        Routing (preferences, A/B, quiet hours)
            |
       +----+----+
       v         v
    Email Q    Push Q
       |         |
   Renderer  Renderer
       |         |
   Vendor    Vendor
     (SES)    (FCM)
       |         |
   Webhooks -> Status DB

Notification pipeline

Status updates flow back from vendors via webhooks. The service writes them to a status store so the product can show “delivered” or “bounced” on a notification log.
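Webhooks can arrive late, twice, or out of order, so the status store should tolerate that. Here is a minimal sketch, assuming vendor payloads have already been normalized into a message ID, a status string, and a timestamp. The `StatusStore` class, the `STATUS_RANK` table, and the status names are hypothetical and stand in for whatever schema your database uses.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed ranking: a status never gets replaced by one of equal or lower rank,
# so a late "sent" webhook cannot overwrite an earlier "delivered".
STATUS_RANK = {"queued": 0, "sent": 1, "delivered": 2, "bounced": 2}


@dataclass
class StatusRecord:
    message_id: str
    status: str
    updated_at: float


class StatusStore:
    """In-memory stand-in for the status database."""

    def __init__(self) -> None:
        self._records: dict = {}

    def apply_webhook(self, message_id: str, status: str, timestamp: float) -> bool:
        """Record a vendor status update; return True if the stored state changed."""
        if status not in STATUS_RANK:
            raise ValueError(f"unknown status: {status}")
        current = self._records.get(message_id)
        if current is not None and STATUS_RANK[status] <= STATUS_RANK[current.status]:
            return False  # duplicate or out-of-order webhook, ignore it
        self._records[message_id] = StatusRecord(message_id, status, timestamp)
        return True

    def get(self, message_id: str) -> Optional[str]:
        record = self._records.get(message_id)
        return record.status if record else None
```

Treating terminal states as final keeps the notification log stable even when a vendor replays old events.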

A scheduler handles delayed sends and recurring jobs. It uses a sorted store (for example a Redis sorted set, or ZSET, with each job scored by its send time) to find the work that is due.
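The scheduler's core loop is "give me everything due by now." The sketch below shows that shape with an in-memory heap instead of Redis so it runs without a server. The `DelayedSendScheduler` class and its method names are hypothetical. With Redis, `schedule` would roughly correspond to `ZADD` and `pop_due` to `ZRANGEBYSCORE` followed by `ZREM`.

```python
import heapq
import itertools


class DelayedSendScheduler:
    """In-memory stand-in for a sorted store ordered by send time."""

    def __init__(self) -> None:
        self._heap: list = []  # entries are (send_at, sequence, job_id)
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def schedule(self, job_id: str, send_at: float) -> None:
        """Register a job to be sent at the given timestamp."""
        heapq.heappush(self._heap, (send_at, next(self._counter), job_id))

    def pop_due(self, now: float) -> list:
        """Remove and return every job whose send time is at or before `now`."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, _, job_id = heapq.heappop(self._heap)
            due.append(job_id)
        return due
```

If several scheduler workers poll the same store, the read-and-remove step needs to be atomic so that two workers do not claim the same job.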

Trade-offs

Synchronous send-and-confirm APIs are simpler but couple producers to vendor latency. Async (publish an event, return immediately) decouples producers from vendors, but the producer no longer learns about a failure at call time; it has to read it later from the status store.

Vendor failover sounds attractive but adds complexity. Email is the friendliest case because most providers speak SMTP; SMS and push are more vendor-specific.
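At its simplest, failover is an ordered list of vendors tried in turn. The sketch below assumes each vendor is wrapped in a function that takes a payload and either returns a vendor message ID or raises. `VendorError` and `send_with_failover` are hypothetical names, and the wrappers are where the vendor-specific differences would live.

```python
from typing import Callable, Sequence, Tuple


class VendorError(Exception):
    """Raised by a vendor wrapper when a send fails (hypothetical)."""


# A vendor is modeled as (name, send function).
Vendor = Tuple[str, Callable[[dict], str]]


def send_with_failover(payload: dict, vendors: Sequence[Vendor]) -> Tuple[str, str]:
    """Try vendors in priority order; return (vendor_name, vendor_message_id)."""
    errors = []
    for name, send in vendors:
        try:
            return name, send(payload)
        except VendorError as exc:
            errors.append(f"{name}: {exc}")  # remember why this vendor failed
    raise VendorError("all vendors failed: " + "; ".join(errors))
```

Returning the vendor name alongside the message ID matters: incoming webhooks need to be matched against the vendor that actually sent the message.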

Per-user rate limiting protects users from spam at the cost of more state. Global limits are easier but allow noisy senders to drown out important messages.
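The "more state" in a per-user limit is a record of recent sends for each user. Here is a minimal sliding-window sketch kept in process memory. The `PerUserRateLimiter` class is hypothetical, and a real deployment would likely keep this state in a shared store rather than in one worker.

```python
from collections import defaultdict, deque


class PerUserRateLimiter:
    """Sliding-window limiter: at most `limit` sends per user per `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._sent_at = defaultdict(deque)  # user_id -> timestamps of recent sends

    def allow(self, user_id: str, now: float) -> bool:
        """Return True and record the send if the user is under the limit."""
        timestamps = self._sent_at[user_id]
        # Drop sends that have aged out of the window.
        while timestamps and timestamps[0] <= now - self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False
        timestamps.append(now)
        return True
```

Passing `now` in explicitly keeps the limiter easy to test and leaves the choice of clock to the caller.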

Practical Tips

Tag every send with a correlation_id from the source event. Engineers will thank you when debugging “why did the user get this?”

Honor unsubscribe and quiet-hour preferences at the routing layer, not the renderer. By the time you render, the decision should already be made.
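A routing decision of this kind can be sketched as a pure function that runs before any template is touched. The `Preferences` fields and the `drop`, `defer`, and `send` outcomes below are assumptions for illustration, and times are taken to be in the user's local time zone.

```python
from dataclasses import dataclass, field
from datetime import datetime, time
from typing import Optional


@dataclass
class Preferences:
    """Hypothetical per-user preferences; times are in the user's local time."""
    unsubscribed: set = field(default_factory=set)  # channels the user opted out of
    quiet_start: Optional[time] = None
    quiet_end: Optional[time] = None


def in_quiet_hours(prefs: Preferences, local_now: datetime) -> bool:
    if prefs.quiet_start is None or prefs.quiet_end is None:
        return False
    t = local_now.time()
    if prefs.quiet_start <= prefs.quiet_end:
        return prefs.quiet_start <= t < prefs.quiet_end
    # The window wraps past midnight, e.g. 22:00 to 07:00.
    return t >= prefs.quiet_start or t < prefs.quiet_end


def route(channel: str, prefs: Preferences, local_now: datetime) -> str:
    """Return "drop", "defer", or "send" before any rendering happens."""
    if channel in prefs.unsubscribed:
        return "drop"
    if in_quiet_hours(prefs, local_now):
        return "defer"
    return "send"
```

The midnight wrap is the easy case to get wrong: a 22:00 to 07:00 window has a start later than its end, so a plain range check would never match.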

Throttle blast campaigns. Sending a million emails in one minute looks like spam to providers and earns you a rate-limit penalty. Spread sends over a window.
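Spreading a campaign over a window comes down to slicing the audience into batches and giving each batch a delay. The sketch below is one way to plan that. `plan_batches` is a hypothetical helper, and the right batch size and window depend on your provider's limits.

```python
from typing import Iterator, Sequence, Tuple


def plan_batches(
    recipients: Sequence[str], window_seconds: float, batch_size: int
) -> Iterator[Tuple[float, Sequence[str]]]:
    """Yield (delay_seconds, batch) pairs that spread a campaign evenly over the window."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = [
        recipients[i:i + batch_size] for i in range(0, len(recipients), batch_size)
    ]
    if not batches:
        return
    interval = window_seconds / len(batches)  # gap between consecutive batches
    for index, batch in enumerate(batches):
        yield index * interval, batch
```

For example, five recipients in batches of two over a 60-second window become three batches sent at 0, 20, and 40 seconds.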

Budget for vendor outages. A small in-memory queue with exponential backoff covers brief blips. A persistent queue with dead-letter handling covers extended ones.
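The two layers can be sketched together: retry with exponential backoff for brief failures, then hand the payload to a dead-letter list once attempts run out. The function names, defaults, and the plain list used as a dead-letter queue are assumptions. Catching every `Exception` is also a simplification; a real worker would retry only errors known to be transient.

```python
import random
import time
from typing import Callable, List


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter; `attempt` starts at 0."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def deliver_with_retry(
    send: Callable[[dict], None],
    payload: dict,
    dead_letters: List[dict],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to send; on repeated failure, park the payload in the dead-letter list."""
    for attempt in range(max_attempts):
        try:
            send(payload)
            return True
        except Exception:
            # Wait before the next attempt, but not after the final one.
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    dead_letters.append(payload)
    return False
```

The jitter spreads retries out so that many workers recovering from the same outage do not all hit the vendor at the same instant.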

Wrap-up

A notification service is a workflow problem dressed as a messaging problem. Model it as a pipeline of small steps with retries and observability between each. Pick boring infrastructure (Kafka or SQS, Redis, Postgres) and put the smarts in routing and preferences. The service grows up to be one of the most-used parts of your platform.