Overview
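A Game Day is a scheduled exercise in which a team deliberately triggers a failure in its own system and practices responding to it, making it a form of 'manual' chaos engineering. Game Days provide a safe, controlled environment for teams to learn how their systems behave under stress and to verify that their monitoring, alerting, and runbooks are effective.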
The Process
- Planning: Defining the failure scenario and the expected outcome.
- Execution: Triggering the failure (e.g., shutting down a database, blocking a network port).
- Response: Following the team's incident response process to diagnose and fix the issue.
- Debrief: Analyzing what went well and what needs improvement.
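The four steps above can be sketched as a small script. This is only an illustration, not a real chaos engineering tool: the names GameDayPlan, GameDayReport, and run_game_day are hypothetical, and the three callables stand in for whatever a team actually uses to trigger the failure, respond to it, and restore the system afterwards.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GameDayPlan:
    """Planning: the failure scenario and the outcome the team expects."""
    scenario: str
    expected_outcome: str


@dataclass
class GameDayReport:
    """Debrief: what went well and what needs improvement."""
    plan: GameDayPlan
    went_well: List[str] = field(default_factory=list)
    needs_improvement: List[str] = field(default_factory=list)


def run_game_day(
    plan: GameDayPlan,
    inject_failure: Callable[[], None],
    respond: Callable[[], bool],
    restore: Callable[[], None],
) -> GameDayReport:
    """Run one exercise (hypothetical helper, assumed for illustration).

    respond() should return True if the team diagnosed and fixed the issue.
    """
    report = GameDayReport(plan=plan)

    # Execution: trigger the failure described in the plan.
    inject_failure()
    try:
        # Response: the team works the incident.
        resolved = respond()
    finally:
        # Always undo the injected failure, even if the response step errors.
        restore()

    # Debrief: record the result against the plan.
    if resolved:
        report.went_well.append(f"Resolved: {plan.scenario}")
    else:
        report.needs_improvement.append(f"Not resolved: {plan.scenario}")
    return report


if __name__ == "__main__":
    # A fake in-memory "system" so the sketch runs without real infrastructure.
    system = {"database_up": True}

    plan = GameDayPlan(
        scenario="primary database shut down",
        expected_outcome="an alert fires and the runbook restores service",
    )
    report = run_game_day(
        plan,
        inject_failure=lambda: system.update(database_up=False),
        respond=lambda: system["database_up"] is False,  # team notices the outage
        restore=lambda: system.update(database_up=True),
    )
    print(report)
```

Putting the restore step in a finally block reflects the "safe, controlled" goal of the exercise: the injected failure is rolled back even if the response step itself goes wrong.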
Benefits
- Builds Confidence: Teams that have rehearsed a failure are better prepared for real outages.
- Identifies Gaps: Reveals missing alerts or outdated documentation.
- Improves Collaboration: Strengthens the working relationship between team members.