Overview

Site Reliability Engineering (SRE) was pioneered by Google. It treats operations as a software problem: SREs use automation to manage large-scale systems and keep them within their reliability targets.

Key Concepts

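  • SLO (Service Level Objective): A target level of reliability for a service, usually stated as a percentage over a time window (for example, 99.9% of requests succeeding over 30 days).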
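  • Error Budget: The amount of unreliability the SLO permits (1 minus the SLO) over a given period. Once the budget is spent, teams typically pause feature releases and focus on reliability until the service is back within its target.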
  • Toil: Manual, repetitive work that SREs aim to automate away.
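
The relationship between an SLO and its error budget can be illustrated with a little arithmetic. The sketch below assumes an availability-style SLO measured in wall-clock time and takes the budget to be (1 - SLO) multiplied by the length of the window. The function names, the 30-day default window, and the downtime figures are hypothetical and chosen only for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO permits over a window.

    Assumption: the budget is (1 - SLO) * total minutes in the window.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1], e.g. 0.999")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Return the unspent budget in minutes; a negative value means it is exhausted."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    budget = error_budget_minutes(0.999, window_days=30)
    print(f"Budget for a 99.9% SLO over 30 days: {budget:.1f} minutes")

    # Hypothetical: 50 minutes of downtime so far in the window.
    left = budget_remaining(0.999, 30, downtime_minutes=50.0)
    print("Budget exhausted" if left < 0 else f"{left:.1f} minutes left")
```

Under these assumptions, a 99.9% SLO over 30 days permits roughly 43.2 minutes of downtime, so the hypothetical 50 minutes in the example would exhaust the budget.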

Related Terms