Retry Engines, Redis Pub/Sub, and the Point Where Backend Systems Stop Being Simple

HNGi14 became a practical lesson in what happens when backend systems meet failure, scale, and real external APIs.

Problem

Most backend systems begin with a comforting shape: a client sends a request, the server handles it, the database stores the result, and a response goes back out. That model is easy to explain and easy to build against. It is also where many systems quietly lie to their owners.

The lie is that the request path is the system. In real production behavior, the system is also every failed webhook, every duplicate retry, every half-completed background task, every overloaded dependency, every API rate limit, and every browser tab polling the same data because no one designed a better delivery path.

That was the practical lesson behind the HNGi14 work described in the original DEV Community article. Two systems stood out: a retry engine using exponential backoff and jitter, and a Redis Pub/Sub plus Server-Sent Events pipeline for inverter metrics. Both are small enough to understand in one sitting, but they point at larger distributed systems problems: how to preserve work across failures, how to avoid coordinated retry storms, how to decouple producers from consumers, and how to design APIs that match the actual consistency needs of the user.

A payment webhook is the cleanest example. A provider such as Stripe sends an event to your server after a payment succeeds. If your server is briefly unavailable, the customer has still paid, but your application may not record that fact. From the user’s point of view, the system lost money or lost state. From the provider’s point of view, the request failed and may need to be retried. From the database’s point of view, nothing happened.

That gap between external truth and internal state is where backend systems become distributed systems. There is no single clock, no single memory, and no guarantee that every participant sees the same event at the same time. The job of the architecture is not to pretend those gaps do not exist. The job is to make them bounded, observable, and recoverable.

Solution Approach

The retry engine addresses the first failure mode: transient failure on outbound HTTP work. Instead of making an HTTP call directly and hoping the target service is available, the system persists the request as a job, returns an identifier immediately, and lets a worker process the job in the background.

That changes the API contract. The client no longer receives only success or failure for the remote operation. It receives acknowledgement that the work has been accepted. The eventual result can be fetched later, emitted through another channel, or used internally to drive a state transition.

A simplified flow looks like this:

A client submits an HTTP job to the retry service.
The service validates the request and stores it with status pending.
A worker claims the job before processing it.
The worker sends the outbound request.
On success, the result is stored and the job becomes completed.
On retryable failure, the next attempt is scheduled using exponential backoff and jitter.
On permanent failure, the job becomes dead or failed.
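
The flow above can be sketched as a small job table with explicit status transitions. This is a minimal illustration, not the original implementation: it assumes SQLite as the store, and the table layout, the function names (open_store, submit_job, record_outcome), and the MAX_ATTEMPTS limit are all hypothetical. Claiming is left out to keep the state machine visible.

```python
import sqlite3
import uuid

# Hypothetical schema: one row per outbound HTTP job.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id            TEXT PRIMARY KEY,
    method        TEXT NOT NULL,
    url           TEXT NOT NULL,
    status        TEXT NOT NULL DEFAULT 'pending',
    attempts      INTEGER NOT NULL DEFAULT 0,
    next_retry_at REAL NOT NULL DEFAULT 0,
    last_error    TEXT
)
"""

MAX_ATTEMPTS = 5  # assumed limit; the article does not state one


def open_store(path=":memory:"):
    """Open the job store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def submit_job(conn, method, url):
    """Persist the request first, then hand the caller an identifier."""
    job_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO jobs (id, method, url) VALUES (?, ?, ?)",
        (job_id, method, url),
    )
    conn.commit()
    return job_id


def record_outcome(conn, job_id, ok, retryable, now, delay, error=None):
    """Apply the result of one attempt and return the new status."""
    row = conn.execute(
        "SELECT attempts FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
    attempts = row[0] + 1
    if ok:
        status, next_at = "completed", 0.0
    elif retryable and attempts < MAX_ATTEMPTS:
        # Stay pending, but do not become eligible again until the delay passes.
        status, next_at = "pending", now + delay
    else:
        status, next_at = "dead", 0.0
    conn.execute(
        "UPDATE jobs SET status = ?, attempts = ?, next_retry_at = ?, "
        "last_error = ? WHERE id = ?",
        (status, attempts, next_at, error, job_id),
    )
    conn.commit()
    return status
```

Because every transition is a row update, a crash between steps leaves a record that a later worker can inspect and resume from.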

The critical design choice is persistence. If the process crashes after accepting the request but before sending it, the job still exists. If the worker crashes while processing, another worker can eventually claim it. That is the difference between best-effort code and infrastructure that can recover after the process disappears.

Exponential backoff is the next piece. If a dependency returns a transient error, retrying immediately can make the outage worse. A common policy is to wait longer after each failed attempt, for example 1 second, then 2 seconds, then 4 seconds, then 8 seconds, usually with a maximum delay cap. The AWS Architecture Blog has a useful treatment of exponential backoff and jitter, a problem that shows up in almost every large client-server system.

Backoff alone is not enough. If 10,000 jobs fail at the same time and every job retries after the same delay, the recovering service receives another spike at exactly the wrong moment. Jitter adds randomness to the delay. Instead of every retry firing at 10 seconds, requests spread across nearby times. That reduces the thundering herd effect and gives the dependency a chance to recover under smoother load.
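
One way to compute such a delay is the "full jitter" variant, where the wait is drawn uniformly between zero and the capped exponential ceiling. The sketch below is only one option; the article does not say which jitter variant the original engine used, and the function names and the base and cap defaults are assumptions.

```python
import random


def backoff_ceiling(attempt, base=1.0, cap=60.0):
    """Un-jittered delay for the given 1-based failed attempt: base * 2^(attempt-1), capped."""
    # Clamp the exponent so a very large attempt count cannot overflow a float.
    exponent = min(max(attempt - 1, 0), 30)
    return min(cap, base * (2 ** exponent))


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Full jitter: pick a random delay between 0 and the exponential ceiling."""
    return rng.uniform(0.0, backoff_ceiling(attempt, base, cap))
```

With the defaults, the ceilings for attempts 1 through 4 are 1, 2, 4, and 8 seconds, and each job picks its own point below that ceiling, so jobs that failed together do not retry together.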

The retry engine also forces a consistency decision. The system is no longer strongly consistent from the caller’s perspective. A submitted job may not be completed when the client receives the first response. This is eventual consistency by design. The benefit is survivability and load control. The cost is that API consumers must understand job state, polling, callbacks, or event delivery.

That is not a weakness. It is an honest contract. Many APIs hide eventual behavior behind synchronous endpoints, then surprise clients when timeouts, duplicates, and partial failures appear. A job-based API makes the state machine visible.

The second system, the inverter metrics pipeline, solves a different scaling problem. EnergyIQ needed to fetch solar inverter data from external brand APIs and make that data available to users watching dashboards. A direct polling model is tempting: each browser periodically asks the backend for fresh data, and the backend calls the external inverter API.

That approach works for a small number of users. It fails as soon as the same inverter dashboard is open in many places. Ten users watching the same inverter should not create ten independent polling loops against the same vendor API. The data source has rate limits, the backend does duplicate work, and users still receive updates only on polling intervals.

The revised architecture separates collection from delivery:

A background poller fetches inverter data on a controlled schedule.
The poller respects vendor rate limits and device-specific timing.
Fresh data is written to storage and published to a Redis channel.
Clients connect to a Server-Sent Events endpoint.
The SSE endpoint subscribes to the relevant Redis channel and streams updates to connected browsers.

Redis Pub/Sub fits this use case because it provides low-latency message distribution between backend components. It is not a durable queue. If a subscriber is offline, it misses messages. For live dashboard metrics, that may be acceptable because the next poll will produce another update and the database can hold the latest known value. For billing or irreversible business events, Redis Pub/Sub alone would be the wrong abstraction. A durable log or queue, such as Kafka, Redis Streams, or a managed queue, would be more appropriate.
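
The separation of collection from delivery can be shown without a Redis server. The sketch below swaps in a small in-memory class as a stand-in for Redis Pub/Sub, with the same fire-and-forget behavior: a message reaches only the subscribers connected at publish time. The class, the poll_once function, the channel naming, and the dict used in place of a database are all hypothetical.

```python
import queue
from collections import defaultdict


class InMemoryPubSub:
    """Stand-in for Redis Pub/Sub: no storage, no replay, current subscribers only."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel):
        """Register a new subscriber and return the queue it will read from."""
        q = queue.Queue()
        self._subscribers[channel].append(q)
        return q

    def publish(self, channel, message):
        """Deliver to whoever is subscribed right now; return how many received it."""
        receivers = self._subscribers.get(channel, [])
        for q in receivers:
            q.put(message)
        return len(receivers)


def poll_once(fetch_reading, inverter_id, latest, bus):
    """One controlled poll: a single vendor call, however many dashboards are open."""
    reading = fetch_reading(inverter_id)
    latest[inverter_id] = reading  # stand-in for the database write
    bus.publish(f"inverter:{inverter_id}", reading)
    return reading
```

A subscriber that connects after a publish sees nothing on the channel, which is why the latest stored value still matters: a newly opened dashboard can read it once and then wait for the next live update.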

Server-Sent Events are also a pragmatic fit. SSE gives the browser a long-lived HTTP connection for server-to-client updates. Unlike WebSockets, SSE is one-way, from server to client, which matches a metrics dashboard where the server pushes readings and the client mostly listens. The browser’s EventSource API handles reconnection behavior, which reduces client complexity.
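
On the wire, an SSE stream is plain text served with the content type text/event-stream, where each event is a few "field: value" lines ended by a blank line. The helper below is a small sketch of that framing; the function name and the example field values are assumptions, not the original code.

```python
import json


def format_sse(data, event=None, event_id=None):
    """Encode one event in the text/event-stream format."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # Compact JSON never contains a raw newline, so the payload fits on a
    # single data line; a raw newline would need its own "data:" line.
    payload = json.dumps(data, separators=(",", ":"))
    lines.append(f"data: {payload}")
    # The blank line (double newline) tells the browser the event is complete.
    return "\n".join(lines) + "\n\n"
```

A web framework's streaming response would write one such string per published reading. The id field is optional; when present, EventSource sends it back in the Last-Event-ID header on reconnect.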

The API pattern changes again. Instead of every active tab calling GET /inverter/:id/metrics every few seconds, the application can expose an endpoint such as GET /inverters/:id/events. Clients subscribe once, then receive updates as the backend obtains them. The backend controls the vendor polling cadence, and the frontend stops acting like a distributed scheduler by accident.

Trade-Offs

The retry engine improves reliability, but it introduces state. Once a request becomes a persisted job, the system needs job identifiers, status transitions, retry policies, idempotency handling, and cleanup. A direct HTTP call has fewer moving parts. It also loses work more easily.

Idempotency is the part that tends to hurt later if it is ignored early. A retry engine may send the same logical operation more than once. If the target endpoint creates a charge, sends an email, or mutates inventory without an idempotency key, retries can create duplicate side effects. Systems such as Stripe document idempotent requests for exactly this reason. Retrying safely requires both client and server to agree on how duplicate attempts are recognized.
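
A minimal sketch of that agreement follows: the sender derives a key that stays the same across every attempt of one job, and the receiver remembers the result for each key it has already handled. Stripe accepts such a key in an Idempotency-Key request header; everything else here (the class, the key format, the in-memory dict) is a hypothetical illustration.

```python
def idempotency_key_for(job_id):
    """Same job, same key, no matter how many times the engine retries it."""
    return f"job-{job_id}"


class IdempotentReceiver:
    """Runs the handler at most once per key and replays the stored result after that."""

    def __init__(self, handler):
        self._handler = handler
        # Assumption: a real service would keep this in a durable store with a
        # unique constraint, so concurrent duplicates cannot both pass the check.
        self._results = {}

    def handle(self, idempotency_key, payload):
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # duplicate attempt: no side effect
        result = self._handler(payload)
        self._results[idempotency_key] = result
        return result
```

The key has to come from the logical operation, not from the attempt. A key generated fresh for each retry would defeat the check.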

The worker model also needs a claiming strategy. If two workers pick up the same pending job, they may duplicate work. A common pattern is claim-before-process: atomically mark a job as claimed, set a lock expiration, then process it. The expiration matters because workers crash. A boolean locked flag can strand work forever unless another process knows how to clear it. A timestamp such as locked_until or next_retry_at gives the system a self-repair path.
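
Claim-before-process with an expiring lock can be sketched as a conditional update: the worker only wins the job if the row is still unlocked at the moment of the write. The example assumes SQLite, a 30-second lock, and hypothetical column and function names; the time is passed in explicitly so the behavior is easy to follow.

```python
import sqlite3

LOCK_SECONDS = 30.0  # assumed lock duration

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id            TEXT PRIMARY KEY,
    status        TEXT NOT NULL DEFAULT 'pending',
    next_retry_at REAL NOT NULL DEFAULT 0,
    locked_until  REAL NOT NULL DEFAULT 0,
    locked_by     TEXT
)
"""


def claim_next(conn, worker_id, now):
    """Try to claim one due job. Returns its id, or None if nothing was claimed."""
    row = conn.execute(
        "SELECT id FROM jobs WHERE status = 'pending' "
        "AND next_retry_at <= ? AND locked_until <= ? "
        "ORDER BY next_retry_at LIMIT 1",
        (now, now),
    ).fetchone()
    if row is None:
        return None
    # The WHERE clause repeats the lock check, so if another worker claimed the
    # job between the SELECT and this UPDATE, zero rows change and we back off.
    cur = conn.execute(
        "UPDATE jobs SET locked_until = ?, locked_by = ? "
        "WHERE id = ? AND status = 'pending' AND locked_until <= ?",
        (now + LOCK_SECONDS, worker_id, row[0], now),
    )
    conn.commit()
    return row[0] if cur.rowcount == 1 else None
```

If the claiming worker crashes, nothing has to clear the lock. Once locked_until is in the past, the job simply becomes claimable again.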

The consistency model is deliberately weaker than a synchronous request. The caller knows the job was accepted, not that the remote side completed the action. That is often the correct trade. The alternative is letting a client wait on an unreliable dependency until an HTTP timeout gives it an ambiguous result. Ambiguous synchronous failure is usually worse than explicit asynchronous state.

The Redis Pub/Sub metrics pipeline has its own trade-offs. It reduces duplicate vendor API calls and improves update latency for active dashboards. It also creates a live delivery path that must be managed. Long-lived HTTP connections consume server resources. Proxies and load balancers need sane timeout settings. Deployments restart the processes that hold those connections, so clients need reconnect behavior or every release breaks every open stream.

Redis Pub/Sub also does not provide replay. If the SSE server is down when a metric is published, that message is gone for that subscriber. For live telemetry, the database can remain the source of latest state while Pub/Sub handles fanout. For audit trails, alerts, or workflows where every event must be processed, the design needs durable messaging. Redis Streams, Apache Kafka, or a cloud queue would give stronger delivery properties at the cost of operational complexity.

The inverter example also shows why API adapters deserve respect. The implementation ran into real vendor API mismatches: documented fields did not always match returned fields, device types changed response shapes, and one endpoint required application/x-www-form-urlencoded rather than JSON. These are not edge cases. They are normal integration work. External APIs are distributed systems boundaries, and boundaries carry ambiguity.

A useful pattern is to isolate vendor-specific behavior inside adapters. The rest of the system should consume normalized inverter readings, not raw vendor payloads. That keeps field-name changes, content-type quirks, and device-type branching from leaking into dashboard code, analytics code, or alerting code.
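
The adapter idea can be sketched as one normalized type plus one small function per vendor. The two vendor payload shapes, the field names, and the units below are invented for illustration and do not describe any real inverter API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Mapping


@dataclass(frozen=True)
class InverterReading:
    """The only shape the rest of the system is allowed to see."""
    inverter_id: str
    power_w: float
    energy_today_kwh: float


def from_vendor_a(payload: Mapping[str, Any]) -> InverterReading:
    # Hypothetical vendor A: flat payload, watts and kWh, numbers sent as strings.
    return InverterReading(
        inverter_id=str(payload["sn"]),
        power_w=float(payload["pac"]),
        energy_today_kwh=float(payload["eToday"]),
    )


def from_vendor_b(payload: Mapping[str, Any]) -> InverterReading:
    # Hypothetical vendor B: nested payload, kilowatts and watt-hours.
    data = payload["data"]
    return InverterReading(
        inverter_id=str(data["deviceId"]),
        power_w=float(data["powerKw"]) * 1000.0,
        energy_today_kwh=float(data["dailyYieldWh"]) / 1000.0,
    )


ADAPTERS: Dict[str, Callable[[Mapping[str, Any]], InverterReading]] = {
    "vendor_a": from_vendor_a,
    "vendor_b": from_vendor_b,
}


def normalize(vendor: str, payload: Mapping[str, Any]) -> InverterReading:
    """Route a raw payload through the adapter for its vendor."""
    try:
        adapter = ADAPTERS[vendor]
    except KeyError:
        raise ValueError(f"no adapter registered for vendor {vendor!r}") from None
    return adapter(payload)
```

When a vendor renames a field or changes a unit, the fix stays inside one adapter function, and dashboard, analytics, and alerting code keep consuming InverterReading unchanged.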

Why It Matters

The larger lesson from HNGi14 is that resilience is not one feature. It is a set of contracts across storage, workers, APIs, clients, and third-party systems.

A retry engine teaches that failure must be represented as data. If a job can be persisted, inspected, retried, delayed, and eventually marked dead, the system has a fighting chance of recovering without human intervention. If failure exists only as a thrown exception in a vanished process, recovery becomes guesswork.

The metrics pipeline teaches that scale problems often come from putting responsibility in the wrong place. Browsers should not coordinate vendor polling. Controllers should not directly wake pollers. External API rate limits should not be discovered through production overload. Once the system has a single collection path and a separate fanout path, load becomes easier to reason about.

Both systems also show the value of choosing the smallest abstraction that matches the requirement. SSE is enough for one-way live updates. Redis Pub/Sub is enough for ephemeral dashboard fanout. A job table is enough for a first retry engine. None of these choices has to pretend to solve every future problem. They only need clear failure behavior and an upgrade path when requirements become stricter.

That is the difference between architecture as diagram decoration and architecture as operational memory. The retry engine remembers work that would otherwise be lost. The event pipeline remembers that many users can depend on one upstream data source. Both designs accept that real systems fail, then shape the API around that fact.