Production systems don't fail because code is bad; they fail because reality isn't always consistent. This article explores three failure patterns—cascading failures, partial failures, and silent failures—that commonly affect backend systems, explaining why they're dangerous and what lessons can be learned to build more resilient systems.
Systems in production tend to experience incidents, though some more than others. Most of the time, when something goes wrong in production, the code is doing exactly what it was written to do. The problem is that production introduces conditions that cannot be fully simulated ahead of time.
In this article, I will discuss how these failures actually happen, group them into three patterns, explain why each pattern is dangerous, and share the lessons that can be learned from them.
Prerequisites
Before I proceed, please note that this article is for:
- Backend Engineers
- People running production systems
- Anyone who has dashboards that say "green" while users complain
Failure Patterns
Failure Pattern #1: Cascading Failures
Cascading failures occur when one service in a system becomes slow or fails, which in turn affects how other parts of the system that depend on it behave. Cascading failures can even be triggered by small user actions, such as retries.
Let me paint a picture: I once worked on a project where a cascading failure occurred. Certain database queries created bottlenecks in the system due to their complexity and the growing size of the data. To make matters worse, the allocated connection pool had reached its maximum number of slots, and further database calls could not be processed. This left each affected request in one of two likely scenarios:
- The request would be cancelled abruptly, and the user would attempt to try again.
- The request would linger in the system, waiting for an open connection to execute the query against the database.
As traffic increased, this resulted in a pile-up of requests. The second scenario became a cascading failure: the system ended up trying to process more work than it could handle at any given time, on top of the regular incoming requests. Wait times grew, and simple tasks like logging in took a long time to complete. In some cases, the CPU maxed out and the entire system became unresponsive, leaving it in a persistently slow state.
What happens in the background to cause this slow state? Each request runs on a thread for its lifetime. The problem is that the thread pool (think of it as an allocation of threads for work to run on) is limited; when requests pile up, they occupy all the available threads, leaving incoming requests stuck waiting for a resource to run on.
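To make that concrete, here is a minimal Python sketch. The pool size, sleep times, and function names are illustrative, not from the original system; the point is simply that a handful of slow tasks can occupy every worker and leave even a trivial request waiting:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# A deliberately small pool, standing in for a limited thread/connection pool.
pool = ThreadPoolExecutor(max_workers=4)

def slow_db_query(i):
    # Stand-in for a complex query that holds a connection for a long time.
    time.sleep(5)
    return i

def quick_login(user):
    # A cheap operation that should normally return instantly.
    return f"logged in {user}"

# Four slow requests occupy every worker in the pool...
for i in range(4):
    pool.submit(slow_db_query, i)

# ...so even a trivial login request now waits in the queue behind them.
start = time.time()
result = pool.submit(quick_login, "alice").result()
print(f"{result} after {time.time() - start:.1f}s of waiting")
```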
What actually gets exhausted is rarely "the server"; it's thread pools, DB connections, or queue workers.
Why is this dangerous?
For starters, slowness is contagious in production systems. A slow component or service will, in turn, affect the delivery of the services that depend on it, presenting a broken system to users. A system can look healthy in isolation while failing as a whole due to cascading failures. Depending on the design, individual services may behave correctly on their own, but because they depend on the affected service, the whole system still fails.
Lessons learned
While systems cannot be 100% fool-proof, plans can be put in place to ensure that cascading failures are handled properly. Firstly, timeouts must exist: your system is better off terminating long-running requests or batch jobs than leaving them to run forever and hold on to resources. Timeouts can be applied at the points that prove to be bottlenecks, for example a request to an external provider or a long-running query that pulls large amounts of data from the database.
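As a minimal sketch of the first lesson, here is what a timeout on an outbound call might look like in Python using the popular requests library (the URL and timeout values are placeholders):

```python
import requests

PROVIDER_URL = "https://api.example-provider.com/charge"  # placeholder URL

def call_provider(payload):
    try:
        # Fail fast: wait at most 2s to connect and 5s for a response,
        # instead of letting the call hold a worker thread indefinitely.
        response = requests.post(PROVIDER_URL, json=payload, timeout=(2, 5))
        response.raise_for_status()
        return response.json()
    except requests.Timeout:
        # Surface the timeout so the caller can retry later or fail gracefully.
        raise RuntimeError("Provider call timed out; try again later")
```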
Secondly, circuit breakers can save the system by routing traffic away from failing services or dependencies to healthy ones. A common example is a third-party payment provider that stops processing payments for users; a circuit breaker implementation allows the system to route payments to other working providers until the main provider recovers.
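A very simplified circuit breaker might look like the sketch below (the thresholds are illustrative, and charge_with_main / charge_with_backup are hypothetical functions): after a number of consecutive failures it "opens" and sends traffic to the backup until a cool-off period passes.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_after = reset_after              # seconds before retrying the primary
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback, *args, **kwargs):
        # While the breaker is open, skip the failing provider entirely.
        if self.opened_at and time.time() - self.opened_at < self.reset_after:
            return fallback(*args, **kwargs)
        try:
            result = primary(*args, **kwargs)
            self.failures = 0          # a success closes the breaker again
            self.opened_at = None
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # open the breaker
            return fallback(*args, **kwargs)

# Usage (charge_with_main and charge_with_backup are hypothetical):
# breaker = CircuitBreaker()
# breaker.call(charge_with_main, charge_with_backup, amount=100)
```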
Failure Pattern #2: Partial Failures
Partial failures occur when only a part of a system fails, while other parts continue to function, leading to incomplete or inconsistent results. Partial failures are subtle and can be very expensive when they happen.
I'll paint a scenario from a system I once worked on that handled payments. Users could initiate payments to charge their cards for services they wanted to use. An issue occurred when a particular user attempted to charge their card but got no response in time, so, naturally, they retried the charge. This later became a problem when the user realized they had been double-charged for a single payment.
So, what exactly happened? The service responsible for handling payments experienced downtime and could not fully process payments at that moment, but it could still receive incoming requests into a queue. When the system was finally able to process the payments, it treated each queued request as unique and processed them blindly, regardless of where they came from, resulting in the double charge.
From the user's perspective, retrying is the reasonable thing to do. Now the system sees the same action twice, and if the system doesn't plan for it, duplicates are created. Nothing here is technically a bug, as every step made sense in isolation.
Why is this dangerous?
Partial failures tend to put systems in an "in-between" state. From the user's perspective, "it didn't work"; from the system's perspective, "part of it did work." Partial failures are tricky here because nothing is completely broken. One step succeeds, another fails, and now the system and the user disagree about what happened: essentially, a divergent truth.
Lessons learned
In backend systems, idempotency is essential for handling partial failures. In production, retries are unavoidable: users refresh pages, client apps resend requests, and other systems retry automatically. The backend must assume that any request might be sent more than once and handle that scenario properly.
Systems can use request identifiers (often called idempotency keys) to achieve this, which allows them to treat retries from the same source as the same request and avoid duplicates, either by responding with the result of the first request or by discarding it and processing the latest one; to each system its own.
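Here is a minimal sketch of the idea, assuming an in-memory store and a stand-in charge_card function; in a real system the keys would be persisted (for example in the database) alongside the payment record:

```python
import uuid

# Maps idempotency key -> result of the first successful attempt.
# In a real system this would live in durable storage, not in memory.
processed: dict[str, dict] = {}

def charge_card(amount: int) -> dict:
    # Stand-in for the real payment-provider call.
    return {"charge_id": str(uuid.uuid4()), "amount": amount, "status": "charged"}

def handle_payment(idempotency_key: str, amount: int) -> dict:
    # A retry with the same key returns the original result instead of charging again.
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = charge_card(amount)
    processed[idempotency_key] = result
    return result

# The client sends the same key on retry, so both calls yield exactly one charge.
first = handle_payment("order-123-attempt", 5000)
retry = handle_payment("order-123-attempt", 5000)
assert first == retry
```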
For distributed systems, transactions don't cross service boundaries. A database transaction can only guarantee consistency inside one service; it has no awareness of what happens in other services. Once multiple services are involved (different databases, failure domains, or runtimes), there is no single "undo". Services can fail independently, and a process flow can end up partially completed even though each service's local state still looks valid.
It is very important to design compensating actions, for example an "undo later" step where the system finds transactions in a partial or pending state and attempts to reconcile them.
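A minimal sketch of such a reconciliation job is below; the Payment model and both helper functions are hypothetical stand-ins for a database query and a status check against the payment provider:

```python
from dataclasses import dataclass

@dataclass
class Payment:
    id: str
    status: str  # "pending", "completed", or "refund_needed"

def find_pending_payments() -> list[Payment]:
    # Stand-in for a query like: payments stuck in 'pending' older than N minutes.
    return [Payment(id="pay_42", status="pending")]

def provider_says_charged(payment: Payment) -> bool:
    # Stand-in for asking the payment provider what really happened.
    return True

def reconcile():
    # Periodically sweep payments stuck in a partial state and resolve them.
    for payment in find_pending_payments():
        if provider_says_charged(payment):
            payment.status = "completed"      # finish our side of the flow
        else:
            payment.status = "refund_needed"  # compensating action: undo later
        print(f"{payment.id} -> {payment.status}")

reconcile()
```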
Failure Pattern #3: Silent Failures
I regard silent failures as by far the deadliest, as they are the hardest to notice. A background job could fail quietly, or a report may never be generated. At first glance, everything seems fine, until someone notices a mismatch days later.
Silent failures don't necessarily mean the system crashed; it's usually when any of the following happen:
- The system continues operating
- Requests appear successful
- No alerts are fired
- Dashboards look "fine."
...but the business outcome is wrong.
Essentially, silent failures are failure modes where error signals do not propagate to the layer that observes correctness. Operations as simple as a cache write failing, or an event being published but never consumed, can be indicators of silent failures in a system.
Why is this dangerous?
With silent failures, users of a system may not notice immediately, and teams assume everything is fine. Problems then accumulate until the business is impacted: for example, orders without payments, payments without invoices being sent out, or events that are never consumed. The backend now carries "historical corruption".
In many cases, fixing the bug doesn't fix the damage: new data is correct, but old data remains wrong. Teams must then employ techniques like backfilling data, reprocessing events, or writing one-off migration scripts.
Lessons learned
Observability is essential to every backend system; done incorrectly, it becomes practically useless. Using observability tools correctly will allow you to tell whether the system is actually doing the right thing.
Logging is another very important practice in a backend system. Good logs should capture which entity is involved (orderID, transactionReferenceID, etc.), the reason for the failure, and what should happen next. With these, you can build alerts and trace flows that enable systems to detect silent failures faster.
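As a minimal sketch (the field names and values are illustrative), a structured log line carrying the entity, the reason, and the next step might look like this with Python's standard logging module:

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("payments")

def log_failure(order_id: str, reference: str, reason: str, next_step: str) -> None:
    # One structured line per failure: which entity, why it failed, what happens next.
    logger.error(json.dumps({
        "event": "payment_failed",
        "orderID": order_id,
        "transactionReferenceID": reference,
        "reason": reason,
        "next_step": next_step,
    }))

log_failure("ord_981", "txn_55aa", "provider timeout", "retry via reconciliation job")
```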
Metrics are also very important for detecting silent failures. We've established that silent failures occur even when requests succeed, so domain metrics like orders_count_total, events_published_total, completed_payments_total, and abandoned_payments_total can be helpful. These metrics can be used to assert relationships or raise alerts. For example, an alert can be raised if abandoned_payments_total reaches or exceeds a certain threshold, or if orders_count_total and completed_payments_total drift apart by a large margin.
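Here is a library-free sketch of that kind of cross-metric check; the counter values and thresholds are made up, and in practice the counters would live in your metrics system with the checks expressed as alert rules:

```python
# Illustrative in-memory counters; in practice these would be real
# counters such as orders_count_total in your metrics backend.
metrics = {
    "orders_count_total": 1000,
    "completed_payments_total": 920,
    "abandoned_payments_total": 35,
}

ABANDONED_THRESHOLD = 30        # absolute ceiling before we alert
MAX_ORDER_PAYMENT_GAP = 0.05    # orders and payments should agree within 5%

def check_silent_failures(m: dict) -> list[str]:
    alerts = []
    if m["abandoned_payments_total"] >= ABANDONED_THRESHOLD:
        alerts.append("abandoned payments above threshold: check provider/webhooks")
    gap = (m["orders_count_total"] - m["completed_payments_total"]) / m["orders_count_total"]
    if gap > MAX_ORDER_PAYMENT_GAP:
        alerts.append(f"orders vs completed payments diverge by {gap:.0%}: possible dropped events")
    return alerts

for alert in check_silent_failures(metrics):
    print("ALERT:", alert)
```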
Concerning alerts, they only matter if they're actionable. An alert that just says "Error rate increased" is unactionable because there isn't enough information to act on. An actionable alert should tell the responder what broke, where to look, and why it matters. In short, if an alert doesn't tell you what to do next, it's noise.
It is also imperative to understand that "working" is not "correct", especially when dealing with silent failures. While backend systems optimize for availability, throughput, and resilience, they should also optimize for correctness.
Conclusion
Production failures don't start when alerts fire. They start when assumptions go unchecked. The goal isn't zero failure, it's failure you can see, understand, and recover from. Production systems don't fail loudly by default. They fail quietly, unless we design them not to.
As engineers, it is imperative to take these failure patterns into account when building systems and to understand that production issues are inevitable; it is our response to them that defines how well our systems stand the test of time.

