Building a Million‑User Notification System: Guarantees, Trade‑offs, and Real‑World Patterns
#Infrastructure

Building a Million‑User Notification System: Guarantees, Trade‑offs, and Real‑World Patterns

Backend Reporter
5 min read

A practical blueprint for a notification platform that delivers email, SMS, and push messages to over a million users while avoiding duplicates, missed sends, and provider outages. The design covers API contracts, queue choices, worker architecture, idempotency, graceful degradation, observability, and the scalability compromises you must accept.

Problem: Delivering Reliable, High‑Volume Notifications

Enterprises that need to reach a million‑plus users cannot afford a single missed alert or a flood of duplicate messages. The stakes are high – a missed password reset email can lock a user out, a duplicated SMS can incur unnecessary cost, and a failed push can break a time‑critical workflow. The system must also stay alive when a third‑party provider (Twilio, SendGrid, Firebase) experiences an outage, and it must give operators clear insight into latency, failure rates, and queue health.


Solution Approach

1. API Layer – the contract frontier

  • EndpointsPOST /notifications accepts a payload describing the target, channel, template, and optional idempotency_key.
  • Response – returns notification_id and echoes the idempotency_key. The API writes the request to the notifications table inside a transaction before returning, guaranteeing durability even if the client disconnects.
  • Validation – schema checks, user‑preference lookup, and a quick existence test on the idempotency_key (unique index) enforce idempotency at the edge.

2. Asynchronous Backbone – Message Queue

Choice Strengths Weaknesses
Kafka (partitioned topics) Strong ordering per key, high throughput, built‑in log compaction for replay. Operational overhead, need for Zookeeper‑style coordination.
Amazon SQS (standard) Managed service, auto‑scaling, simple IAM integration. At‑least‑once delivery semantics require explicit deduplication.
RabbitMQ (classic) Flexible routing, easy DLQ configuration. Limited partitioning; scaling beyond a few hundred thousand msgs/sec can be tricky.

All three satisfy the core requirements: persistence until ack, back‑pressure handling, and native dead‑letter queues. The final selection should align with the existing cloud footprint and operational expertise.

3. Worker Tier – channel‑specific processors

  • EmailWorker, SMSWorker, PushWorker each subscribe to a dedicated topic/queue slice (e.g., notifications.email).
  • Workers are stateless; they pull a message, verify that a successful delivery record does not already exist, then invoke the appropriate provider adapter.
  • Retry logic – exponential backoff (1 min → 5 min → 15 min → 1 h) with a configurable max‑retry count. After exceeding the threshold the message lands in a dead‑letter queue for manual investigation.
  • Horizontal scaling – Kubernetes Horizontal Pod Autoscaler (HPA) or ECS Service Autoscaling watches queue depth and spins up additional pods when messages_pending > threshold.

4. Provider Adapters & Graceful Degradation

Each adapter implements a common interface (send(notification) → result). The system maintains a circuit‑breaker per provider:

  • Failure rate > 5 % over a 30‑second window trips the breaker.
  • While tripped, the worker routes the message to a fallback provider (e.g., Twilio → Termii, SendGrid → Amazon SES).
  • Breaker reset follows a cool‑down period and a health‑check ping.

5. Idempotency & Duplicate Prevention

  • Database constraintUNIQUE(idempotency_key) on the notifications table.
  • Workers perform an upsert on the delivery_attempts table with a status column (PENDING, SENT, FAILED).
  • Before each send, the worker checks SELECT status FROM delivery_attempts WHERE notification_id = $id. If SENT, the message is dropped; if PENDING, the retry proceeds.

6. Preventing Missed Sends – Transactional Outbox Pattern

  1. API transaction writes the notification row and an outbox entry in the same DB transaction.
  2. A background outbox scanner reads pending rows, publishes them to the queue, and marks them ENQUEUED.
  3. If a worker crashes after pulling a message but before persisting the delivery result, the scanner will re‑enqueue after a timeout (e.g., PENDING > 10 min).

7. Observability Stack

  • Metrics – Prometheus counters for notifications_created, notifications_sent, retry_attempts, and per‑provider latency histograms.
  • Dashboards – Grafana panels show queue depth, error rate per channel, and circuit‑breaker status.
  • Tracing – OpenTelemetry spans from API request through queue publish to provider response, enabling root‑cause analysis of latency spikes.
  • Alerting – CloudWatch/Sentry alerts trigger on thresholds such as failed_delivery_rate > 0.5 % or queue_depth > 100k.

Featured image

8. Security & Compliance

  • Provider credentials are stored encrypted in Kubernetes Secrets or AWS Parameter Store and accessed via runtime injection.
  • All webhook endpoints validate signatures (HMAC‑SHA256) to prevent spoofed delivery reports.
  • RBAC restricts admin API calls; every mutation is logged to an immutable audit table.

Trade‑offs & Decision Points

Aspect Pro Con
Kafka for the queue Near‑zero latency, replayability, strong ordering Requires a dedicated ops team; higher cost at scale
SQS Fully managed, pay‑as‑you‑go, simple IAM At‑least‑once delivery forces extra deduplication logic
Stateless workers Easy to autoscale, simple deployments Must rely on external storage for any state (e.g., delivery attempts)
Circuit‑breaker fallback Keeps traffic flowing during provider outages May incur higher per‑message cost if fallback provider is premium
Transactional outbox Guarantees no lost messages even if the API crashes after DB write Adds a scanning component that must be monitored for lag
Exponential backoff retries Reduces load on flaky providers, improves eventual success Increases overall delivery time for transient failures

The most common source of friction is the balance between latency and durability. Pushing a notification instantly to the queue yields the fastest user experience, but if the queue or database is momentarily unavailable the request must be retried at the API layer, re‑introducing the risk of duplicate notification_ids. The outbox pattern isolates that risk at the cost of an extra background process.


Closing Thoughts

A notification platform that serves a million users cannot rely on a single monolithic service. By separating concerns—API validation, durable queuing, idempotent workers, provider adapters, and a rich observability layer—you gain the ability to scale each piece independently, survive third‑party outages, and keep operators informed about health in real time. The trade‑offs are explicit: operational complexity grows with Kafka, fallback providers add cost, and retry policies stretch delivery windows. Understanding those compromises lets you tune the system to the exact service‑level expectations of your product.

For a concrete reference implementation, see the open‑source notification‑service on GitHub, which follows the patterns described above and includes Helm charts for Kubernetes deployment.

Comments

Loading comments...