Overview
Site Reliability Engineering (SRE) was pioneered by Google. It treats operations as a software engineering problem: SREs use automation to run large-scale systems and keep them within agreed reliability targets.
Key Concepts
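- SLO (Service Level Objective): A target level of reliability for a service, measured over a set window (for example, 99.9% of requests succeed over 30 days).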
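- Error Budget: The amount of unreliability the SLO permits, that is, 100% minus the SLO target over the measurement window. While budget remains, teams can keep shipping changes; once it is spent, the usual policy is to slow or pause feature releases and focus on reliability work.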
- Toil: Manual, repetitive work that SREs aim to automate away.
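
To make the relationship between an SLO and its error budget concrete, here is a minimal sketch of the arithmetic in Python. It assumes an availability-style SLO expressed as a fraction (0.999 for 99.9%) and a budget measured in minutes of downtime. The function names and the 30-day default window are illustrative assumptions, not part of any standard tool.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) that an availability SLO permits.

    slo is a fraction such as 0.999; window_days is the measurement window.
    Both the name and the 30-day default are assumptions for illustration.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    # The budget is the share of the window the SLO does not cover.
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the unspent budget in minutes; negative means the SLO is breached."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    left = budget_remaining(0.999, downtime_minutes=30.0)
    print(f"budget: {budget:.1f} min, remaining: {left:.1f} min")
```

Under these assumptions, a 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime, so 30 minutes of outage leaves about 13.2 minutes. A negative remainder would be the signal to apply whatever error budget policy the team has agreed on.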