Overview

Game Days are a form of 'manual' chaos engineering in which a team deliberately triggers a failure and practices responding to it. They provide a safe, controlled environment for teams to learn how their systems behave under stress and to verify that their monitoring, alerting, and runbooks are effective.

The Process

  1. Planning: Define the failure scenario and the expected outcome.
  2. Execution: Trigger the failure (e.g., shut down a database or block a network port).
  3. Response: Follow the team's incident response process to diagnose and fix the issue.
  4. Debrief: Analyze what went well and what needs improvement.
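
The four steps above can be sketched as a small script for recording an exercise. This is only an illustration, not a prescribed tool: the names GameDayPlan, GameDayReport, run_game_day, trigger_failure, and respond are all hypothetical, and the two callables are stand-ins for work that people do against real systems. The debrief here is reduced to comparing the actual outcome with the expected one, which is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GameDayPlan:
    """1. Planning: the failure scenario and the outcome the team expects."""
    scenario: str
    expected_outcome: str


@dataclass
class GameDayReport:
    """Results of one exercise, filled in as the steps run (hypothetical)."""
    plan: GameDayPlan
    actual_outcome: str = ""
    went_well: List[str] = field(default_factory=list)
    needs_improvement: List[str] = field(default_factory=list)


def run_game_day(
    plan: GameDayPlan,
    trigger_failure: Callable[[], None],
    respond: Callable[[], str],
) -> GameDayReport:
    """Run the steps in order. Both callables are illustrative stand-ins."""
    report = GameDayReport(plan=plan)

    # 2. Execution: trigger the planned failure.
    trigger_failure()

    # 3. Response: the team diagnoses and fixes the issue, then reports
    #    what actually happened as a short string.
    report.actual_outcome = respond()

    # 4. Debrief: compare what happened with what was expected.
    if report.actual_outcome == plan.expected_outcome:
        report.went_well.append(f"Matched expectation: {plan.expected_outcome}")
    else:
        report.needs_improvement.append(
            f"Expected '{plan.expected_outcome}', got '{report.actual_outcome}'"
        )
    return report


if __name__ == "__main__":
    plan = GameDayPlan(
        scenario="Shut down the primary database",
        expected_outcome="alert fired and runbook restored service",
    )
    # Stubs only: a real exercise involves people and real systems.
    report = run_game_day(
        plan,
        trigger_failure=lambda: print("Injecting:", plan.scenario),
        respond=lambda: "no alert fired",
    )
    print(report.needs_improvement)
```

Writing the expected outcome down before the failure is triggered is the point of the sketch: a mismatch between expected and actual outcome becomes an explicit debrief item instead of being lost after the exercise.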

Benefits

  • Builds Confidence: Teams are better prepared for real outages.
  • Identifies Gaps: Reveals missing alerts or outdated documentation.
  • Improves Collaboration: Strengthens working relationships among team members.