Microsoft introduces Agent SRE, extending traditional Site Reliability Engineering concepts to AI agents with Safety SLIs, behavioral circuit breakers, and chaos engineering specifically designed for autonomous systems.
If your microservices have SLOs, error budgets, and circuit breakers, why shouldn't your AI agents? Microsoft's Agent Governance Toolkit now includes Agent SRE, which applies proven reliability patterns to autonomous AI systems. The framework addresses gaps in current AI observability in three ways: it introduces Safety SLIs that catch behavioral failures invisible to traditional monitoring, it ties error budgets to agent autonomy so that trust is earned through observed behavior, and it adds chaos experiments purpose-built for LLM provider outages and reasoning loops.
What Changed: From Model Quality to System Reliability
Most AI observability tools today focus on model quality metrics: hallucination rates, latency, token costs, and task completion rates. While valuable, these metrics don't answer the critical operational questions that SRE teams need: Did the agent act within its authorized scope? Is its behavioral error budget burning at a dangerous rate? Would it survive an LLM provider outage?
The fundamental shift is viewing AI agent reliability not as a property of the model itself, but as a property of the governance infrastructure around it. Agent SRE extends traditional SRE concepts with four new adaptations:
- Safety SLIs instead of latency SLIs - measuring policy compliance rather than response speed
- Autonomy budgets instead of error budgets - expanding and contracting based on behavioral evidence
- Behavioral circuit breakers - opening on wrong behavior, not just failure codes
- Capability rollout instead of code deployment - safely expanding agent scope with SLO gates
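To make the autonomy budget idea concrete, here is a minimal sketch of a budget that burns on policy violations and maps the remaining budget to an autonomy level. It is not the Agent SRE API: the class name `AutonomyBudget`, the three tier names, the 50% threshold, and the `min_samples` warm-up are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class AutonomyBudget:
    """Hypothetical autonomy budget; not taken from the Agent SRE codebase."""

    slo_target: float       # e.g. 0.9 means 10% of actions may violate policy
    min_samples: int = 20   # assumed warm-up: trust must be earned first
    total: int = 0
    violations: int = 0

    def record(self, within_policy: bool) -> None:
        """Record one agent action and whether it stayed within policy."""
        self.total += 1
        if not within_policy:
            self.violations += 1

    def remaining(self) -> float:
        """Fraction of the budget still unspent, clamped to [0, 1]."""
        allowed = (1.0 - self.slo_target) * self.total
        if allowed <= 0:
            return 1.0 if self.violations == 0 else 0.0
        return max(0.0, 1.0 - self.violations / allowed)

    def level(self) -> str:
        """Map the remaining budget to an assumed autonomy tier."""
        if self.total < self.min_samples:
            return "supervised"          # not enough behavioral evidence yet
        left = self.remaining()
        if left >= 0.5:
            return "autonomous"
        if left > 0.0:
            return "approval_required"
        return "supervised"              # budget exhausted


def replay(budget: AutonomyBudget, compliant: int, violating: int) -> str:
    """Feed a batch of observed actions into the budget and return the tier."""
    for _ in range(compliant):
        budget.record(True)
    for _ in range(violating):
        budget.record(False)
    return budget.level()
```

In this sketch autonomy contracts as violations accumulate and expands again only as compliant actions grow the allowance, which mirrors the "expanding and contracting based on behavioral evidence" idea above.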
"The reliability of an autonomous agent is not a property of the model. It is a property of the governance infrastructure around it," explains the Microsoft team behind the project. "Agent SRE is that infrastructure."
Comparison: Traditional SRE vs. Agent SRE
Traditional SRE and Agent SRE share the same mental model but apply it to different system characteristics:
| Traditional SRE | Agent SRE Equivalent | Key Differences |
|---|---|---|
| Latency SLI | Safety SLI | Measures correctness of action, not speed of response |
| Error budget | Autonomy budget | Burns on policy violations, not just errors |
| Circuit breaker | Behavioral circuit breaker | Opens on wrong behavior, not just failure codes |
| Canary deployment | Capability rollout | Rolls out scope, not just code |
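The circuit breaker row can be illustrated with a small sketch. A conventional breaker trips on failed calls; the one below trips on policy verdicts, so it can open even when every underlying call returned successfully. The class name, the sliding-window size, and the violation threshold are assumptions for illustration and are not drawn from the toolkit.

```python
from collections import deque


class BehavioralCircuitBreaker:
    """Hypothetical breaker that opens on wrong behavior, not failure codes."""

    def __init__(self, window: int = 20, max_violations: int = 3) -> None:
        # Sliding window of the most recent policy verdicts (True = compliant).
        self._recent = deque(maxlen=window)
        self._max_violations = max_violations
        self.is_open = False

    def record(self, within_policy: bool) -> None:
        """Record a verdict; open the breaker once violations hit the limit."""
        self._recent.append(within_policy)
        if self._recent.count(False) >= self._max_violations:
            self.is_open = True

    def allow(self) -> bool:
        """Return True if the agent may continue acting on its own."""
        return not self.is_open

    def reset(self) -> None:
        """Close the breaker, for example after a human review."""
        self._recent.clear()
        self.is_open = False
```

The breaker stays open until `reset()` is called; whether recovery should be manual or automatic is a design choice this sketch leaves open.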
The Safety SLI represents the most significant conceptual shift. While traditional SLIs measure system behavior from the user's perspective (latency, availability, error rate), Safety SLIs answer a different question: Did the agent act within policy?
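As a rough illustration, a Safety SLI can be computed as the share of agent actions that stayed within policy over some window. The `AgentAction` record and the `safety_sli` function below are hypothetical names, and the sketch assumes a policy engine has already labeled each action as compliant or not.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AgentAction:
    """Hypothetical record of one agent action and its policy verdict."""

    tool: str
    within_policy: bool


def safety_sli(actions: Sequence[AgentAction]) -> float:
    """Return the fraction of actions that stayed within policy.

    An empty window is treated as fully compliant (an assumption).
    """
    if not actions:
        return 1.0
    compliant = sum(1 for action in actions if action.within_policy)
    return compliant / len(actions)


def meets_slo(actions: Sequence[AgentAction], slo_target: float) -> bool:
    """Check the Safety SLI against an SLO target such as 0.99."""
    return safety_sli(actions) >= slo_target


# Example window: 98 compliant actions and 2 out-of-scope actions.
window = [AgentAction("search", True)] * 98 + [AgentAction("delete_file", False)] * 2
```

With this window, `safety_sli(window)` is 98 of 100 actions, so the agent would meet a 0.95 target but miss a 0.99 one, even if every call was fast and returned no errors.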
