Overview

Chaos engineering, pioneered by Netflix with its Chaos Monkey tool, is the practice of intentionally introducing failures, such as killing a server or injecting network latency, to observe how a system reacts and to find weaknesses before they cause real outages.
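
As a rough illustration of what "introducing failures" can look like in code, the sketch below wraps an ordinary function so that calls may be delayed or made to fail. The names with_faults, InjectedFault, and fetch_profile are hypothetical and exist only for this example; they are not part of Chaos Monkey or any other real tool.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real failure during an experiment."""


def with_faults(func, failure_rate=0.0, latency_s=0.0, rng=None, sleep=time.sleep):
    """Wrap func so each call may be delayed or fail (hypothetical helper)."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # simulate a slow network hop
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return func(*args, **kwargs)

    return wrapper


def fetch_profile(user_id):
    """Stand-in for a call to a downstream service."""
    return {"id": user_id, "name": "example"}


# Force every call to fail so the caller's error handling can be observed.
flaky_fetch = with_faults(fetch_profile, failure_rate=1.0)
try:
    flaky_fetch(42)
    outcome = "succeeded"
except InjectedFault:
    outcome = "failed as injected"
```

In a real system the fault would usually be injected at the infrastructure level (terminating an instance, degrading a network link) rather than inside application code; the wrapper only shows the idea at small scale.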

Principles

  1. Build a Hypothesis: Predict how the system should behave under stress.
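  2. Vary Real-world Events: Introduce failures that mirror real incidents, such as server crashes or spikes in network latency.
  3. Run Experiments in Production: Test against live traffic, since this gives the most realistic results.
  4. Automate Experiments: Run them continuously rather than as one-off manual tests.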
  5. Minimize Blast Radius: Ensure that experiments don't cause major disruptions for users.
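
The principles above can be sketched as a small simulated experiment. This is a minimal sketch under stated assumptions, not a production harness: the toy service handle_request, the run_experiment function, and parameters such as blast_radius and max_failures are all hypothetical, and the "production traffic" is only a loop over simulated requests.

```python
import random
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    affected_requests: int   # requests that had the fault injected
    failed_requests: int     # requests that got no response at all
    hypothesis_held: bool


def handle_request(primary_up):
    """Toy service: falls back to a cache when the primary is down."""
    return "fresh" if primary_up else "cached"


def run_experiment(total_requests=1000, blast_radius=0.05, max_failures=0, seed=0):
    """Hypothesis: every request still gets a response when the primary is down."""
    rng = random.Random(seed)
    affected = failed = 0
    for _ in range(total_requests):
        # Minimize blast radius: only a small fraction of requests see the fault.
        in_experiment = rng.random() < blast_radius
        if in_experiment:
            affected += 1
        try:
            # Vary real-world events: here, the primary backend is unavailable.
            response = handle_request(primary_up=not in_experiment)
        except Exception:
            response = None
        if response is None:
            failed += 1
    # Compare the observed outcome with the hypothesis.
    return ExperimentResult(affected, failed, failed <= max_failures)


result = run_experiment()
print(result)
```

Because run_experiment is an ordinary function with a seed, it could be called from a scheduler to run continuously, which is one simple way to approach the automation principle. A real experiment would also need an abort condition that stops the fault injection as soon as user-facing metrics degrade.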

Goal

To build resilient systems that can withstand the unpredictable failures of large-scale distributed environments.

Related Terms