New essay pushes back against incident response hero myth, urges patience during outages
#DevOps

Dev Reporter

A new essay from engineering writer Sean Goedecke challenges the pervasive "hero engineer" narrative of incident response, arguing most outages resolve on their own and rushing to act often worsens problems.

Sean Goedecke, an engineering leader and writer focused on production systems and team culture, recently published an essay titled "Notes on incidents" on his personal site. The piece draws on years of on-call experience to lay out a series of counterintuitive lessons about incident response, pushing back against the widely shared cultural narrative of the "hero engineer" who improvises a clever fix to save a failing system.

Goedecke starts with a blunt observation that many incident responders will recognize: incidents are mostly boring. Most of the time spent on an incident call is spent waiting: for a deploy to finish, for a team to investigate, for a change to take effect, or for another on-call engineer to join the call. The majority of incidents resolve on their own, thanks to modern system design. Kubernetes restarts crashing pods, circuit breakers back off when services are overloaded, and queues absorb temporary spikes in traffic instead of taking down the entire system. Goedecke estimates that well over half of the incident calls he has joined would have resolved in roughly the same time with no human intervention at all.
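For readers who have not met the circuit breaker pattern mentioned above, here is a minimal sketch of how one might let a system back off and recover without a human stepping in. It is not taken from Goedecke's essay; the class name, thresholds, and injectable clock are all assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Illustrative breaker: opens after repeated failures, retries after a cooldown."""

    def __init__(self, max_failures=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures          # hypothetical threshold
        self.cooldown_seconds = cooldown_seconds  # hypothetical cooldown
        self.clock = clock                        # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None                     # None means closed (calls allowed)

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                # Still cooling down: fail fast instead of adding load downstream.
                raise RuntimeError("circuit open, call skipped")
            # Cooldown elapsed: allow one trial call (the "half-open" state).
            # One more failure will reopen the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success fully closes the circuit
        return result
```

Once the overloaded dependency recovers, the first successful trial call closes the circuit again, which is the kind of self-healing the essay describes.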

The problem, he argues, is that engineers tend to jump into action too quickly, and those actions often make incidents worse. He gives a common example: an engineer sees a huge queue size, jumps into a production console, and clears the queue, only to realize later that the jobs included critical billing work that wasn't automatically re-queued. A latency incident has now become a billing incident. Another classic mistake is forcing a series of redeploys to fix a weird metric, only for the concurrent deploys to stress the system more than the original issue. For this reason, Goedecke says the first thing you should do in an incident is nothing. He shares that when he was paged late at night, he used to pour a glass of scotch before joining the call, not just for the alcohol, but to force himself to slow down and take a few breaths before acting. A cup of tea or a short walk would work just as well.
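The queue example suggests a gentler alternative to clearing everything: move jobs aside rather than deleting them, and leave critical work in place. The sketch below is one way that could look, assuming an in-memory `deque` of job dictionaries with a hypothetical `kind` field. A real queue backend would have its own API, and nothing here comes from the essay itself.

```python
from collections import deque


def quarantine_queue(live, is_critical):
    """Park non-critical jobs instead of deleting anything.

    Returns (parked, kept_count). Critical jobs stay on the live queue,
    in their original order, so they are still processed.
    """
    parked = []
    kept = deque()
    while live:
        job = live.popleft()
        if is_critical(job):
            kept.append(job)
        else:
            parked.append(job)  # retained so it can be re-queued later
    live.extend(kept)
    return parked, len(kept)


# Usage with hypothetical job records:
jobs = deque([
    {"id": 1, "kind": "email"},
    {"id": 2, "kind": "billing"},
    {"id": 3, "kind": "email"},
])
parked_jobs, kept_count = quarantine_queue(jobs, lambda j: j["kind"] == "billing")
```

Because the parked jobs are returned rather than discarded, the action is reversible, which keeps a latency incident from turning into a billing incident.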

When human intervention is actually needed, the fixes are almost always dull. Goedecke writes that effective incident actions usually involve temporarily disabling a problematic feature, reverting a recent change, or adding a cache. These are never complex code changes. Someone spends five minutes putting together a patch, then an hour waiting for reviews, CI, and deployment. The key factor is not speed or heroism, but deep knowledge of the system. Five experienced engineers can troubleshoot on a call for hours and get nowhere, while one engineer familiar with the codebase can identify the right feature flag to disable, or the right change to revert, in minutes. This kind of knowledge lets you be decisive, which is critical during scary incident calls where teams tend to reach for consensus, hedge statements, and defer to each other. If you know the system, you can say "I'm going to do X", wait thirty seconds, then act. Goedecke notes that executives, who are used to exercising control, can be helpful here, as they are comfortable saying "okay, do it now" even if they don't fully understand the technical details.
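To make "temporarily disabling a problematic feature" concrete, here is a small sketch of a feature flag used as a kill switch. The `FeatureFlags` class, the `recommendations` flag, and `render_homepage` are hypothetical names invented for this example. Production systems typically use a dedicated flag service instead of an in-memory dictionary.

```python
class FeatureFlags:
    """Hypothetical in-memory flag store, for illustration only."""

    def __init__(self, defaults):
        self._flags = dict(defaults)

    def is_enabled(self, name):
        return self._flags.get(name, False)  # unknown flags are treated as off

    def disable(self, name):
        self._flags[name] = False


def render_homepage(flags, compute_recommendations):
    """Build a page, skipping the expensive feature when its flag is off."""
    page = {"header": "Welcome"}
    if flags.is_enabled("recommendations"):
        page["recommendations"] = compute_recommendations()
    return page


flags = FeatureFlags({"recommendations": True})
flags.disable("recommendations")  # the "dull" incident fix: one flag flip
page = render_homepage(flags, lambda: ["item-a", "item-b"])
```

The point of the design is that the fix is a one-line state change, not a code change, so it avoids the hour of review, CI, and deployment described above.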

The essay also addresses the political dynamics of incident response, a topic rarely covered in technical writing. Goedecke writes that fixing incidents earns a lot of short-term gratitude from managers and executives, who are confronted with their lack of control over technical systems during an outage. They have to trust their team to fix the problem, which is stressful for people used to being in charge. However, this gratitude doesn't translate to long-term career power. Incident response work is so technical that it's opaque to non-technical leadership. They know the incident was fixed, but they can't tell if you did something heroic or just the obvious thing. They also can't claim the success as their own, which is how most executives build alliances with each other. And since incidents are supposed to be fixed, there's no credit for avoiding them in the first place.

Why developers care

This essay resonates because incident response is a near-universal experience for engineers working on production systems. Whether you're an SRE, a backend developer, or a product engineer on an on-call rotation, you have felt the pressure to "do something" when an alert goes off, even if there's nothing useful to do. The hero engineer narrative is pervasive in tech culture, from onboarding materials to postmortem celebrations, and it can lead to burnout, unnecessary risk-taking, and worse outcomes for the system.

Goedecke's advice aligns with modern SRE best practices around error budgets, gradual rollbacks, and avoiding unnecessary intervention, but it's presented in a personal, relatable way that formal documentation often lacks. He validates the experience of waiting on an incident call, the stress of being paged at 2 a.m., and the frustration of making a quick fix that backfires. For engineers new to on-call work, it offers a realistic preview of what to expect, and pushes back against the idea that you need to be a genius to handle incidents. For experienced responders, it's a reminder to slow down, trust the systems you've built, and value deep system knowledge over flashy heroics.

The section on political dynamics is also valuable for engineers who feel like their on-call work goes unrecognized. It explains why fixing incidents might get you a thank-you note from a VP, but won't help you get promoted, and why preventing incidents is more valuable than fixing them, even if that work is less visible.

Community response

As of publication, the essay is gaining traction across developer communities. Goedecke has encouraged readers to share the post on Hacker News, where similar pieces on engineering culture often spark long, thoughtful discussion threads. Early reactions from SREs and on-call engineers on social media echo Goedecke's points, with many sharing their own stories of incidents that resolved themselves, or times they made things worse by rushing to act.

One commenter noted that the advice about doing nothing first aligns with their team's incident response training, which emphasizes waiting 15 minutes before taking any action unless there's a clear, immediate threat. Another shared that they now keep a list of simple, known fixes for their team's services, so they don't have to troubleshoot under pressure. Several readers appreciated the candid discussion of political dynamics, with one writing that they had never seen anyone address why on-call work doesn't lead to career advancement, even though it's a common experience.

Goedecke also teased a related follow-up post, "Learning incident response with problem sets", which will outline exercises for building system knowledge and practicing incident response skills. The original essay, along with more of Goedecke's writing on engineering culture, is available on his personal site at seangoedecke.com.
