Week-Long Outage: Lifelong Lessons - Building Resilience Through Experience

DevOps Reporter
9 min read

Netflix SRE Molly Struve shares hard-earned lessons from a brutal six-day Elasticsearch outage at her previous company, Kenna Security, offering practical insights on technical preparedness and the human factors in incident response.

In the world of site reliability engineering, few experiences are as formative as a major outage. Molly Struve, Staff Site Reliability Engineer at Netflix, opened her recent presentation at QCon San Francisco by asking, "Who is excited to hear the story about one of the largest outages of my career?" The story she shares isn't just about technical failure: it's about how a six-day Elasticsearch upgrade incident at her previous company, Kenna Security, became a crucible for learning that shaped her career and engineering philosophy.

Setting the Stage: The Context

Kenna Security, a cybersecurity company helping Fortune 500 companies manage risk, built its platform around Elasticsearch. The technology was their cornerstone, enabling customers to search through all their cybersecurity data in seconds—a capability that set them apart from competitors. When the time came to upgrade from Elasticsearch 2 to Elasticsearch 5 in March 2017, the stakes were high.

"We were not crazy enough to jump three versions," Struve clarifies. "Despite the fact that the numbers jump by a value of 3, the actual major version bump was only 1." This upgrade had been months in preparation, with everyone excited about the performance gains they'd seen from the previous major version.

The Unfolding Crisis

The upgrade itself went smoothly on a Thursday evening. "The actual upgrade itself, which involves shutting down the cluster, updating all the nodes, then updating the codebase, went off without a hitch," Struve explains. But as soon as they took the app out of maintenance mode, CPU and load spikes began appearing throughout the cluster.

By Friday morning, some nodes had crashed. The team restarted them, but as traffic picked up at 9 a.m., the entire cluster went down. What followed was six days of desperate troubleshooting, 15+ hour days, and mounting customer frustration.

"We were working 15-plus hour days, trying to figure out what was causing all of these issues," Struve recounts. "While at the same time, doing everything we could just to keep our application afloat. We were adding cache statements where cache statements had no business being. Anything we could do to keep load off of the cluster."

The technical challenges were compounded by business realities. With no rollback plan (the estimated time to restore the old cluster was five days), the team was forced to build a new cluster while simultaneously trying to keep the failing one operational.

The Turning Point

On day six, relief came from an unexpected source. "Wednesday, March 29th rolled around. We are six days into this outage at this point, and our team is digging deep, despite the exhaustion, to just keep going. That's when it happened. We finally got the news we had been waiting for."

Jason Tedor, a senior engineer at Elastic, had identified a bug in the Elasticsearch source code. "When we implemented the workaround, it was like night and day. The cluster immediately stabilized. I think at this point, we were so happy, our entire team cried."

Technical Lessons Learned

Lesson 1: Have a Rollback Plan, All the Time

"First lesson learned, have a rollback plan," Struve emphasizes. "When doing any sort of change, you must know what rolling back in the event of a problem involves. Can you roll back the software with a simple revert PR? If you can't roll back with a simple revert PR, how would you handle that rollback? How long is a rollback going to take?"

At Netflix, this practice takes the form of Failure Mode and Effects Analysis (FMEA). "During this exercise, you sit down with your team, you put your chaos hats on, and you think about all the ways your code might fail, or the change might go wrong that you are making. Once you have your list of failure modes, usually you capture these in a spreadsheet. Then you go through and you prioritize them based on severity."
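
As a rough illustration of the FMEA worksheet she describes, here is a minimal sketch; the fields, example failure modes, and the severity-times-likelihood ranking are illustrative assumptions, not Netflix's actual template, which, per the talk, typically lives in a spreadsheet.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    description: str
    severity: int      # 1 (minor) .. 5 (catastrophic)
    likelihood: int    # 1 (rare)  .. 5 (almost certain)
    mitigation: str    # rollback step or safeguard agreed on before the change

    @property
    def risk(self) -> int:
        # Simple severity-times-likelihood score; classic FMEA also folds in
        # detectability, omitted here for brevity.
        return self.severity * self.likelihood

failure_modes = [
    FailureMode("Query latency regresses after the upgrade", 4, 3,
                "Run shadow traffic against the new cluster before cutover"),
    FailureMode("Cluster nodes crash under peak load", 5, 2,
                "Keep the old cluster warm; rehearse the restore runbook"),
    FailureMode("Reindex produces incomplete documents", 3, 2,
                "Compare document counts and checksums before switching reads"),
]

# Work the list from highest risk down, just as you would in the spreadsheet.
for fm in sorted(failure_modes, key=lambda f: f.risk, reverse=True):
    print(f"risk={fm.risk:2d}  {fm.description}  ->  {fm.mitigation}")
```

The value is less in the scoring than in forcing the team to write down, before the change ships, what the rollback for each failure mode actually is and how long it takes.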

Struve shares a cautionary tale from her Netflix experience: "I asked the service owners, can we roll back the data? They said, yes, no problem, we have a runbook for this. Great, when was the last time we did this? We've never done it before, but we wrote the runbook two years ago, it should be fine. Fast forward over an hour later, we finally got the data rolled back after hitting multiple friction points, unexpected behavior."

Lesson 2: Do Performance Testing, Regularly

"No matter how stable and widely used software you are running is, you should performance test it," Struve advises. "We did not think our usage of Elasticsearch was unique. I can confidently say that our data size was not super remarkable. We glossed right over this step."

After the incident, her team developed a process for running shadow traffic against new clusters and indexes. At Netflix, this is done through strategies like long-running canaries and shadow/replay traffic. "A lot of teams who leverage this strategy will actually run a canary overnight, and it's a great way to catch some of those smaller regression issues before they blow up into bigger problems."
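
One way to picture shadow/replay traffic is the following minimal sketch; the two cluster calls are stand-in stubs and the harness is purely illustrative, not Netflix or Kenna tooling. Users are always served from the current cluster, while the same query is replayed against the candidate only to measure it.

```python
import random
import time

def query_current(query):
    # Stand-in for the production cluster.
    time.sleep(random.uniform(0.01, 0.03))
    return {"hits": 120}

def query_candidate(query):
    # Stand-in for the upgraded candidate cluster.
    time.sleep(random.uniform(0.01, 0.08))
    return {"hits": 120}

def shadow(query, samples):
    """Serve the user from the current cluster, then replay the same query
    against the candidate and record its latency. The candidate result is
    never returned, so a regression there cannot hurt real traffic.
    A production version would replay asynchronously to avoid adding latency."""
    result = query_current(query)
    start = time.monotonic()
    try:
        query_candidate(query)
        samples.append(time.monotonic() - start)
    except Exception:
        samples.append(float("inf"))  # count failures as worst-case latency
    return result

latencies = []
for _ in range(50):
    shadow("severity:critical AND asset:prod", latencies)

print(f"candidate p95 latency: {sorted(latencies)[int(0.95 * len(latencies))]:.3f}s")
```

Left running overnight as a long-lived canary, even a crude comparison like this surfaces the load and latency regressions that only show up under realistic traffic.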

Lesson 3: Be Wary of Previous Experience Bias, Always

"We went into this upgrade with some very wrong assumptions and biases," Struve admits. "One of which was, since the last upgrade improved performance, this one will too. Software only gets better. We all know this is wrong."

To combat this, she suggests adding small reminders to PR templates or deploy flows. "Maybe you add these three emojis to the bottom of a PR template or a deploy confirmation box. Emoji is not your style, maybe you add a GIF instead. The point is, it doesn't need to be big or formal. We don't need folks filling out TPS reports every time they want to push to production, that's too much. We just want a subtle reminder for folks to pause and just think about what could go wrong when they're releasing software."
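
In that spirit, here is a minimal sketch of such a deploy-time nudge; the wording, emoji, and the idea of wiring it in as a confirmation gate in a deploy script are hypothetical, not a specific Netflix or Kenna mechanism.

```python
import sys

REMINDER = (
    "\N{THINKING FACE} What could go wrong with this change?\n"
    "\N{LEFTWARDS ARROW WITH HOOK} How would you roll it back, and how long would that take?\n"
    "\N{POLICE CARS REVOLVING LIGHT} Who gets paged if it misbehaves?\n"
)

def confirm_deploy() -> bool:
    """Print a lightweight pre-deploy nudge and ask for confirmation.
    Deliberately minimal: a pause for thought, not a TPS report."""
    print(REMINDER)
    answer = input("Ship it? [y/N] ").strip().lower()
    return answer == "y"

if __name__ == "__main__":
    sys.exit(0 if confirm_deploy() else 1)
```

The same three questions could just as easily live as static text in a PR template; the point is only that someone reads them right before the change goes out.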

Human and Organizational Lessons

Lesson 4: Widen Your Circle

"Never discount the help you can find by widening the circle of folks involved in an incident," Struve stresses. "In this case, we went really wide, and we tapped into the Elasticsearch community. It can be hard and scary to ask for help, but don't let that stop you."

The team's decision to post on Elastic's discuss forum was pivotal. "It took me maybe an hour to gather all the data and write this message. It was by far the most valuable hour of the incident, because this post is what got us our answer."

Lesson 5: Your Team Matters

"During any incident, your team matters," Struve emphasizes. "We all know being an engineer is not just about working with computers. You're also working with people."

During the 6-day ordeal, her team of three worked through every emotion imaginable. "Rather than these emotions breaking us down, they banded us together. We all supported each other when we needed it. That was a big aha moment for me because it made me realize that character is everything in a time of crisis. You can teach people tech. You can show them how to code. You can instill good architectural principles in their brains, but you can't teach character."

Lesson 6: Leader and Management Support are Crucial

The final lesson focuses on the critical role of leadership. "One of the key reasons we as engineers were able to do our jobs as effectively as we were was because of our VP of engineering," Struve explains. "He not only was there to offer help technically, but more importantly, he was our cheerleader. He was also our defender. He fielded all of those questions and messages from upper management, which allowed us to focus on getting things fixed."

For leaders, she offers this advice: "Be their cheerleader. Be their defender. Be whatever they need you to be for them. Above all else, trust that they can do it. That trust will go a long way towards helping the team believe in themselves."

The Bonus Lesson: Embrace Your Incidents

"The Elasticsearch outage of 2017 is infamous at my former company, but not in the way that you would think," Struve reflects. "As brutal as it was at the time for the team and the company, it helped us build a foundation for the engineering culture. It gave everyone a story to point to that said, this is who we are. What happened there, that's us. It created a deep sense of psychological safety."

She challenges the common practice of treating incidents as taboo. "In our industry, at some companies, outages and downtime can be taboo. When they happen, someone just, hopefully, scribbles out a post-mortem, sometimes they don't, and just sweeps it under the rug and hopes people will forget about it. That's not how it should work."

Instead, she reframes incidents as learning opportunities: "Think of incidents as a withdrawal from your availability piggy bank. You just withdrew some money, now what are you going to do with it? Are you going to squander and throw it away, or are you going to invest it? Embracing incidents gives you that opportunity to leverage those lessons, grow from them, and invest in your team and your software."

Practical Takeaways

Struve's presentation offers several actionable strategies for organizations:

  1. Implement FMEA or pre-mortem exercises before major changes
  2. Establish regular performance testing through canaries and shadow traffic
  3. Create psychological safety around incidents and failures
  4. Train leaders on how to support teams during crises
  5. Develop rollback plans for all changes, not just code
  6. Normalize asking for help early in incident scenarios

Conclusion

As Struve summarizes: "While the technical side is important, I firmly believe that these last three points are probably the most important. The reason I say this is because despite doing these three things, there will still be times when things go wrong. It's inevitable in this line of work, things are going to break. Incidents are going to happen."

What matters most is how organizations respond to these inevitable failures. "When things do go wrong, if you widen your circle by asking questions early and often, and you have the right team and leadership in place, you can survive any outage or incident that comes your way."

For those interested in learning more about incident management and resilience practices, Struve recommends exploring Netflix's incident management approaches and considering resources like the SEV0 conference materials on pre-mortem processes. As she powerfully concludes: "As you continue on your journey of building incredible software, remember these lessons. They will help you catch problems early, and more importantly, empower you and your team to handle any incident that arises."

For the complete presentation and additional resources, visit the InfoQ presentation page.
