Real-world failures provide insights that automated fault injection tools cannot replicate, revealing how complex systems actually work and teaching architects about resilience beyond known failure modes.

Failure as a Means to Build Resilient Software Systems: A Conversation with Lorin Hochstein
Understanding Real Failures vs. Synthetic Testing
Lorin Hochstein, Staff Software Engineer for Reliability at Airbnb, has spent years studying how complex systems fail in practice. His journey began at Netflix on the Chaos team, where he worked on Chaos Monkey and the Chaos Automation Platform (ChAP). While these tools enforced basic robustness by terminating instances or injecting RPC failures, Hochstein discovered a critical limitation: they couldn't replicate the understanding that comes from mitigating complicated software failures in the real world.
"Real incidents happen because of a confluence of different things happening at the same time," Hochstein explains. "Typically when you do a Chaos experiment, you're failing one thing at a time. I've never seen people actually try to do multiple ones. And you don't know, there are so many different possible combinations that you just can't cover that space."
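Hochstein's point about coverage can be made concrete with a back-of-the-envelope count. Assuming a hypothetical system with just 30 injectable fault points (the number is illustrative), single-fault chaos experiments explore only a vanishing fraction of the possible failure combinations:

```python
from math import comb

FAULT_POINTS = 30  # hypothetical count of things that can fail

# Experiments that fail exactly one thing at a time
single_fault = comb(FAULT_POINTS, 1)       # 30 experiments

# Experiments that fail exactly two things at once
pairwise = comb(FAULT_POINTS, 2)           # 435 experiments

# Every non-empty combination of simultaneous faults
all_combinations = 2 ** FAULT_POINTS - 1   # over a billion

print(single_fault, pairwise, all_combinations)
```

Even pairwise coverage requires an order of magnitude more experiments than single-fault testing, and the full combination space is intractable, which is why real incidents, arising from a confluence of conditions, teach things synthetic injection cannot.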
The real value of Chaos Monkey wasn't in preventing specific failures but in forcing architects to think about resilience. "It was a forcing function for the architecture so you needed to be able to withstand a particular instance or pod going down at any point in time," he notes. Once teams internalized these patterns and built appropriate fallback mechanisms, the tool's primary work was done.
The Limits of Known Failure Modes
Hochstein identifies a fundamental challenge in software engineering: we understand how to make systems robust against known failure modes, but we struggle with building resilience to unknown failure modes or failures resulting from evolving system designs and external world changes.
This distinction between robustness and resilience is crucial. Robustness involves designing for anticipated failures using established patterns like circuit breakers, retries, and health checks. Resilience, however, is about preparing for the unexpected—the problems we didn't anticipate when designing the system.
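To illustrate the robustness side of this distinction, here is a minimal circuit-breaker sketch in Python. The class name, thresholds, and state handling are illustrative assumptions, not a description of any particular library:

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures`
    consecutive errors, then half-opens after `reset_after`
    seconds to let a single trial call through."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Fail fast instead of hammering a broken dependency
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: permit one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

A retry wrapper would typically sit in front of `call`. Note that this pattern only helps against the anticipated failure mode (a slow or erroring dependency); it does nothing for the unanticipated failures that resilience is concerned with.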
"Engineers are not good at thinking about how do we deal with problems that we cannot anticipate," Hochstein observes. "Prepare to be surprised."
Learning from Incidents: The Blameless Approach
One of the most counterintuitive aspects of reliability engineering is the emphasis on blameless post-incident reviews. Hochstein advocates assuming that individuals behaved rationally based on the information available to them at the time. This approach reveals systemic issues rather than individual failures.
"If you look at it and say somebody did something wrong, they didn't test well enough, for example, what do you do? You tell them to test better next time," he explains. "But maybe there's a problem that makes it harder to test: the issue could only be caught in end-to-end testing, and the end-to-end tests are flaky and were failing, or there isn't good support for that."
The blameless culture isn't about avoiding accountability but about finding the right level of analysis. True incompetence typically shows up in day-to-day work, not just in incidents. When someone repeatedly makes poor judgments or consistently underestimates tasks, that's a management issue to address through regular performance evaluation, not incident reviews.
The Complexity Paradox
Hochstein describes what colleagues call "Lorin's Law": once systems reach a certain level of reliability, large failures generally occur because either someone was taking action to mitigate a smaller incident and something went wrong during that mitigation, or some subsystem designed to improve reliability had an unexpected interaction with the rest of the system.
This creates a paradox: adding reliability often increases complexity, which can lead to new failure modes. "We talk about simplicity being important for reliability but if you look at any real system, the ones that have gotten more reliable, they've added complexity over time to increase that reliability," he notes.
Examples include seat belts, airbags, and anti-lock brakes in cars—all increases in complexity that improve safety. In software, health checks, load balancers, and monitoring systems serve similar purposes but introduce their own failure modes through unexpected interactions.
Organizational Complexity and the Build vs. Buy Decision
The role of organizational complexity in understanding software failures is often underappreciated, particularly when making build versus buy decisions. When incidents involve interactions between in-house software and vendor software, coordinating across organizational boundaries becomes significantly more challenging.
"If there's an incident that involves some interaction between your software and the vendor software, now you have to coordinate across two different organizations," Hochstein explains. "The further you are organizationally from the people you're working with, the harder it's going to be to resolve that."
This constraint is rarely factored into technology decisions but represents a real cost in incident response capability.
The Holistic View of Reliability Engineering
Reliability engineering differs fundamentally from traditional software engineering because it views the system holistically rather than focusing on individual subsystems. While traditional engineering emphasizes separation of concerns and decomposition, reliability engineering requires understanding how the entire system works together, especially when something breaks.
"When everything is working properly, analysis works great: you break things down," Hochstein says. "But when something is broken somewhere and the system is not working, now you have to look at how the entire system works to figure that out."
This holistic perspective extends beyond software to include people, processes, and organizational structures. Staffing on-call rotations, for example, is part of the system architecture even though it's not part of the software architecture.
Why These Ideas Haven't Spread Widely
Despite the clear value of reliability engineering principles, Hochstein acknowledges they haven't spread as widely as concepts like agile or DevOps. Several factors contribute to this slow adoption:
First, the ideas originated in academic and research contexts, making them harder to transfer to industry practice. While distributed systems concepts have successfully crossed this gap, resilience engineering principles have been slower to spread.
Second, organizations can survive without embracing these principles. "Once they reach a certain size and momentum, they will eventually decline and fall but they can take a long time," Hochstein observes. The short tenure of engineers in companies (typically two years or less) also limits the spread of institutional knowledge.
Third, reliability work often lacks tangible artifacts. "I cannot show you how many incidents didn't happen because of software reliability work," Hochstein notes. The value is in preventing problems rather than creating visible deliverables.
The Power of Storytelling
One approach Hochstein advocates for spreading these ideas is storytelling. At Airbnb, he co-runs a quarterly event called "Once Upon An Incident" where engineers share stories about impactful incidents. This approach recognizes that humans absorb information more effectively through narratives than through bullet points or metrics.
"We get three storytellers once a quarter, and they talk about an older, impactful incident," he explains. "It's a way of spreading knowledge, and as human beings, we're just wired for that sort of thing."
The Future of Reliability Engineering
Looking ahead, Hochstein sees promise in AI-assisted incident response but remains skeptical about complete automation. "I don't think it'll take over," he says. "It won't be 100%. I would love it if it was 100% and we didn't have to staff humans on-call anymore."
He believes AI will make handling common incidents easier but won't address the complex, novel failures that most concern reliability engineers. The human element—understanding context, coordinating responses, and learning from experience—remains essential.
Conclusion
Hochstein's insights reveal that building resilient software systems requires more than technical solutions. It demands a cultural shift toward learning from failures, embracing complexity, and viewing systems holistically. The most valuable lessons come not from synthetic testing but from real incidents where multiple factors interact in unexpected ways.
As software systems become increasingly complex and critical to modern life, the principles of reliability engineering become more important. Organizations that embrace these ideas—learning from failures, building resilience rather than just robustness, and viewing systems holistically—will be better positioned to handle the inevitable challenges of operating complex software at scale.
The challenge remains: how to spread these ideas more effectively and help organizations recognize that the path to true resilience lies not in preventing all failures but in building the capacity to respond effectively when failures inevitably occur.
