A practical blueprint for a notification platform that delivers email, SMS, and push messages to over a million users while avoiding duplicates, missed sends, and provider outages. The design covers API contracts, queue choices, worker architecture, idempotency, graceful degradation, observability, and the scalability compromises you must accept.
Problem: Delivering Reliable, High‑Volume Notifications
Enterprises that need to reach a million‑plus users cannot afford a single missed alert or a flood of duplicate messages. The stakes are high – a missed password reset email can lock a user out, a duplicated SMS can incur unnecessary cost, and a failed push can break a time‑critical workflow. The system must also stay alive when a third‑party provider (Twilio, SendGrid, Firebase) experiences an outage, and it must give operators clear insight into latency, failure rates, and queue health.
Solution Approach
1. API Layer – the contract frontier
- Endpoints – `POST /notifications` accepts a payload describing the target, channel, template, and optional `idempotency_key`.
- Response – returns `notification_id` and echoes the `idempotency_key`. The API writes the request to the notifications table inside a transaction before returning, guaranteeing durability even if the client disconnects.
- Validation – schema checks, user‑preference lookup, and a quick existence test on the `idempotency_key` (unique index) enforce idempotency at the edge.
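The idempotent write path described above can be sketched roughly as follows. This is a minimal illustration, not the platform's actual handler: it assumes SQLite as a stand-in database, and the column names and the `open_db` and `create_notification` helpers are hypothetical.

```python
import sqlite3
import uuid


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with an assumed notifications schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE notifications (
               notification_id TEXT PRIMARY KEY,
               idempotency_key TEXT UNIQUE,   -- NULL allowed: the key is optional
               channel         TEXT NOT NULL,
               target          TEXT NOT NULL,
               template        TEXT NOT NULL
           )"""
    )
    return conn


def create_notification(conn: sqlite3.Connection, payload: dict) -> dict:
    """Insert the request, or return the original row if the key was seen before."""
    key = payload.get("idempotency_key")
    notification_id = str(uuid.uuid4())
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO notifications VALUES (?, ?, ?, ?, ?)",
                (notification_id, key, payload["channel"],
                 payload["target"], payload["template"]),
            )
    except sqlite3.IntegrityError:
        if key is None:
            raise  # not an idempotency conflict; surface the error
        row = conn.execute(
            "SELECT notification_id FROM notifications WHERE idempotency_key = ?",
            (key,),
        ).fetchone()
        if row is None:
            raise
        notification_id = row[0]  # repeated request: echo the first result
    return {"notification_id": notification_id, "idempotency_key": key}
```

Relying on the unique index instead of a separate "does this key exist?" query keeps the check and the write in one atomic step, so two concurrent requests with the same key cannot both insert.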
2. Asynchronous Backbone – Message Queue
| Choice | Strengths | Weaknesses |
|---|---|---|
| Kafka (partitioned topics) | Strong ordering per key, high throughput, a retained log that consumers can replay. | Operational overhead: you run brokers plus a ZooKeeper ensemble or KRaft controller quorum. |
| Amazon SQS (standard) | Managed service, auto‑scaling, simple IAM integration. | At‑least‑once delivery semantics require explicit deduplication. |
| RabbitMQ (classic) | Flexible routing, easy DLQ configuration. | Limited partitioning; scaling beyond a few hundred thousand msgs/sec can be tricky. |
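All three cover the core requirements: messages persist until they are acknowledged (or, in Kafka's case, until the retention period expires), and consumers control their own intake rate, which gives you back‑pressure. SQS and RabbitMQ provide native dead‑letter queues; with Kafka, a dead‑letter topic is a convention the consumer has to implement. The final selection should align with the existing cloud footprint and operational expertise.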
3. Worker Tier – channel‑specific processors
- EmailWorker, SMSWorker, PushWorker each subscribe to a dedicated topic/queue slice (e.g., `notifications.email`).
- Workers are stateless; they pull a message, verify that a successful delivery record does not already exist, then invoke the appropriate provider adapter.
- Retry logic – roughly exponential backoff (1 min → 5 min → 15 min → 1 h) with a configurable max‑retry count. Once a message exceeds that count, it lands in a dead‑letter queue for manual investigation.
- Horizontal scaling – Kubernetes Horizontal Pod Autoscaler (HPA) or ECS Service Auto Scaling watches queue depth (exposed to the autoscaler as an external or custom metric) and adds pods or tasks when `messages_pending > threshold`.
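The retry schedule above can be sketched as a simple lookup table. The helper names (`next_retry_delay`, `handle_failure`) and the `attempts` field on the message are assumptions made for illustration; `requeue` and `dead_letter` stand in for whatever delayed-redelivery and DLQ mechanisms the chosen queue offers.

```python
from typing import Callable, Optional, Sequence

# Delays in seconds: 1 min, 5 min, 15 min, 1 h (the schedule from the text).
RETRY_DELAYS_S = (60, 300, 900, 3600)


def next_retry_delay(
    attempt: int, delays: Sequence[int] = RETRY_DELAYS_S
) -> Optional[int]:
    """Return the delay before the next try, or None when retries are exhausted.

    `attempt` is the number of failed attempts so far (1-based).
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    if attempt > len(delays):
        return None
    return delays[attempt - 1]


def handle_failure(
    message: dict,
    requeue: Callable[[dict, int], None],
    dead_letter: Callable[[dict], None],
) -> None:
    """Count the failure, then either schedule a retry or dead-letter the message."""
    message["attempts"] = message.get("attempts", 0) + 1
    delay = next_retry_delay(message["attempts"])
    if delay is None:
        dead_letter(message)
    else:
        requeue(message, delay)
```

With this table the max‑retry count is simply the length of the schedule, so changing one configuration value adjusts both.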
4. Provider Adapters & Graceful Degradation
Each adapter implements a common interface (send(notification) → result). The system maintains a circuit‑breaker per provider:
- Failure rate > 5 % over a 30‑second window trips the breaker.
- While tripped, the worker routes the message to a fallback provider (e.g., Twilio → Termii, SendGrid → Amazon SES).
- Breaker reset follows a cool‑down period and a health‑check ping.
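A per-provider breaker along these lines might look like the sketch below. It is single-threaded and deliberately simplified: the `min_calls` guard, the 60-second cool-down default, and the class and function names are assumptions, and the "health-check ping" is modeled as letting one real call through after the cool-down.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, threshold=0.05, window_s=30.0, cooldown_s=60.0,
                 min_calls=20, clock=time.monotonic):
        self.threshold = threshold    # trip when failure rate exceeds this
        self.window_s = window_s      # sliding window length in seconds
        self.cooldown_s = cooldown_s  # wait before probing the provider again
        self.min_calls = min_calls    # avoid tripping on a tiny sample
        self.clock = clock
        self._events = deque()        # (timestamp, succeeded) pairs
        self._opened_at = None        # None means the breaker is closed

    def allow(self) -> bool:
        """True if the primary provider may be called right now."""
        if self._opened_at is None:
            return True
        # After the cool-down, let a probe call through (half-open state).
        return self.clock() - self._opened_at >= self.cooldown_s

    def record(self, ok: bool) -> None:
        now = self.clock()
        if self._opened_at is not None:
            # This is the result of a half-open probe.
            if ok:
                self._opened_at = None
                self._events.clear()
            else:
                self._opened_at = now  # restart the cool-down
            return
        self._events.append((now, ok))
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()
        failures = sum(1 for _, good in self._events if not good)
        total = len(self._events)
        if total >= self.min_calls and failures / total > self.threshold:
            self._opened_at = now


def send_with_fallback(breaker, primary, fallback, notification):
    """Route to the fallback provider while the breaker is tripped."""
    if not breaker.allow():
        return fallback(notification)
    try:
        result = primary(notification)
    except Exception:
        breaker.record(False)
        raise  # the worker's retry logic takes over
    breaker.record(True)
    return result
```

A production breaker would also need locking or shared state if several worker processes should see the same breaker status.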
5. Idempotency & Duplicate Prevention
- Database constraint – `UNIQUE(idempotency_key)` on the `notifications` table.
- Workers perform an upsert on the `delivery_attempts` table with a `status` column (`PENDING`, `SENT`, `FAILED`).
- Before each send, the worker checks `SELECT status FROM delivery_attempts WHERE notification_id = $id`. If `SENT`, the message is dropped; if `PENDING`, the retry proceeds.
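The worker-side check can be sketched as below, again using SQLite as a stand-in. The schema, the `deliver_once` helper, and the choice to reset `FAILED` rows to `PENDING` on redelivery are assumptions; the source does not say how `FAILED` rows are handled.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with an assumed delivery_attempts schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE delivery_attempts (
               notification_id TEXT PRIMARY KEY,
               status TEXT NOT NULL
                   CHECK (status IN ('PENDING', 'SENT', 'FAILED'))
           )"""
    )
    return conn


def deliver_once(conn: sqlite3.Connection, notification_id: str, send) -> str:
    """Send unless a SENT record exists; returns 'sent', 'failed', or 'dropped'."""
    with conn:
        # Upsert-style: create the row if missing, never overwrite SENT.
        conn.execute(
            "INSERT OR IGNORE INTO delivery_attempts VALUES (?, 'PENDING')",
            (notification_id,),
        )
        conn.execute(
            "UPDATE delivery_attempts SET status = 'PENDING' "
            "WHERE notification_id = ? AND status = 'FAILED'",
            (notification_id,),
        )
    status = conn.execute(
        "SELECT status FROM delivery_attempts WHERE notification_id = ?",
        (notification_id,),
    ).fetchone()[0]
    if status == "SENT":
        return "dropped"  # already delivered; drop the duplicate message
    try:
        send(notification_id)
    except Exception:
        outcome = "FAILED"
    else:
        outcome = "SENT"
    with conn:
        conn.execute(
            "UPDATE delivery_attempts SET status = ? WHERE notification_id = ?",
            (outcome, notification_id),
        )
    return outcome.lower()
```

Note that this narrows the duplicate window but does not close it: if a worker crashes after the provider accepts the message and before the row is marked `SENT`, a redelivery will send again. Passing the `notification_id` to the provider as its own idempotency token, where the provider supports one, is a common way to cover that gap.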
6. Preventing Missed Sends – Transactional Outbox Pattern
- API transaction writes the notification row and an outbox entry in the same DB transaction.
- A background outbox scanner reads pending rows, publishes them to the queue, and marks them `ENQUEUED`.
- If a worker crashes after pulling a message but before persisting the delivery result, the scanner will re‑enqueue after a timeout (e.g., `PENDING > 10 min`).
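The write-then-scan flow can be sketched as follows. SQLite stands in for the real database, and the table layout, the `accept` and `scan_outbox` helpers, and the `publish(topic, notification_id)` callback are all hypothetical; the timeout-based re-enqueue is left out to keep the example short.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with assumed notifications and outbox tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE notifications (
            notification_id TEXT PRIMARY KEY,
            channel TEXT NOT NULL
        );
        CREATE TABLE outbox (
            notification_id TEXT PRIMARY KEY
                REFERENCES notifications(notification_id),
            status TEXT NOT NULL DEFAULT 'PENDING'
        );
        """
    )
    return conn


def accept(conn: sqlite3.Connection, notification_id: str, channel: str) -> None:
    """API side: notification row and outbox entry commit together or not at all."""
    with conn:
        conn.execute(
            "INSERT INTO notifications VALUES (?, ?)", (notification_id, channel)
        )
        conn.execute(
            "INSERT INTO outbox (notification_id) VALUES (?)", (notification_id,)
        )


def scan_outbox(conn: sqlite3.Connection, publish, batch_size: int = 100) -> int:
    """Scanner side: publish pending rows, then mark them ENQUEUED."""
    rows = conn.execute(
        "SELECT o.notification_id, n.channel FROM outbox o "
        "JOIN notifications n ON n.notification_id = o.notification_id "
        "WHERE o.status = 'PENDING' LIMIT ?",
        (batch_size,),
    ).fetchall()
    published = 0
    for notification_id, channel in rows:
        # If publish raises, the row stays PENDING and the next scan retries it.
        publish(f"notifications.{channel}", notification_id)
        with conn:
            conn.execute(
                "UPDATE outbox SET status = 'ENQUEUED' WHERE notification_id = ?",
                (notification_id,),
            )
        published += 1
    return published
```

Because the row is marked only after the publish succeeds, a crash between the two steps republishes the message on the next scan. That gives at-least-once delivery into the queue, which is why the worker-side duplicate check still matters.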
7. Observability Stack
- Metrics – Prometheus counters for `notifications_created`, `notifications_sent`, `retry_attempts`, and per‑provider latency histograms.
- Dashboards – Grafana panels show queue depth, error rate per channel, and circuit‑breaker status.
- Tracing – OpenTelemetry spans from API request through queue publish to provider response, enabling root‑cause analysis of latency spikes.
- Alerting – CloudWatch/Sentry alerts trigger on thresholds such as `failed_delivery_rate > 0.5 %` or `queue_depth > 100k`.

8. Security & Compliance
- Provider credentials are stored in Kubernetes Secrets (with encryption at rest enabled, since Secrets are only base64‑encoded by default) or as SecureString parameters in AWS Parameter Store, and are injected at runtime.
- All webhook endpoints validate signatures (HMAC‑SHA256) to prevent spoofed delivery reports.
- RBAC restricts admin API calls; every mutation is logged to an immutable audit table.
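Signature validation for delivery-report webhooks can be sketched with Python's standard `hmac` module. This is a generic illustration: real providers differ in header names, signature encoding (hex or base64), and exactly which bytes are signed, so the `sign` and `verify_webhook` helpers here are assumptions to adapt per provider.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute the hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Check a received signature using a constant-time comparison."""
    expected = sign(secret, body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

The check should run against the raw request bytes, before any JSON parsing, because re-serializing the body can change whitespace or key order and invalidate the signature.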
Trade‑offs & Decision Points
| Aspect | Pro | Con |
|---|---|---|
| Kafka for the queue | Low latency, replayability, strong ordering per partition | Requires a dedicated ops team; higher cost at scale |
| SQS | Fully managed, pay‑as‑you‑go, simple IAM | At‑least‑once delivery forces extra deduplication logic |
| Stateless workers | Easy to autoscale, simple deployments | Must rely on external storage for any state (e.g., delivery attempts) |
| Circuit‑breaker fallback | Keeps traffic flowing during provider outages | May incur higher per‑message cost if fallback provider is premium |
| Transactional outbox | Guarantees no lost messages even if the API crashes after DB write | Adds a scanning component that must be monitored for lag |
| Exponential backoff retries | Reduces load on flaky providers, improves eventual success | Increases overall delivery time for transient failures |
The most common source of friction is the balance between latency and durability. Pushing a notification straight to the queue yields the fastest user experience, but if the queue or database is momentarily unavailable the request must be retried at the API layer, and a retry that does not carry the same idempotency_key creates a second notification with its own notification_id. The outbox pattern isolates that risk at the cost of an extra background process.
Closing Thoughts
A notification platform that serves a million users cannot rely on a single monolithic service. By separating concerns—API validation, durable queuing, idempotent workers, provider adapters, and a rich observability layer—you gain the ability to scale each piece independently, survive third‑party outages, and keep operators informed about health in real time. The trade‑offs are explicit: operational complexity grows with Kafka, fallback providers add cost, and retry policies stretch delivery windows. Understanding those compromises lets you tune the system to the exact service‑level expectations of your product.
For a concrete reference implementation, see the open‑source notification‑service on GitHub, which follows the patterns described above and includes Helm charts for Kubernetes deployment.
