Azure SRE Agent transforms incident response with agentic AI that continuously monitors Azure environments, correlates telemetry with changes, and executes remediation steps under human approval.
If you've followed my work, you know I'm passionate about Azure reliability. Back in the early Azure days, I presented extensively on this topic, which aligned perfectly with my role as Azure Architect, consultant, and trainer. In late 2021, Azure Chaos Studio amazed me—a service that injects faults into your Azure workloads (ideally production!) to build resilience. Now, with Generative AI and Agentic InfraOps/DevOps emerging, welcome Azure SRE Agent, which recently graduated from public preview to general availability.
What Is Azure SRE Agent?
Modern cloud systems are increasingly distributed, dynamic, and failure-prone by design. While DevOps has optimized delivery velocity, operational reliability still demands significant human effort—especially during incident response, root cause analysis, and post-incident follow-up.
Azure SRE Agent is an AI-powered reliability assistant designed to automate and augment Site Reliability Engineering practices for Azure workloads. It continuously observes telemetry (metrics, logs, traces), understands Azure resource topology, correlates incidents with recent changes, and assists with or executes remediation steps.
Unlike traditional monitoring or AI-Ops tools, Azure SRE Agent operates as an agentic system:
- It reasons over multiple data sources simultaneously
- It maintains contextual awareness of your Azure environment
- It can take action via Azure CLI and REST APIs, subject to explicit approval
- It integrates natively with incident management and developer workflows
In effect, Azure SRE Agent acts as a virtual SRE teammate, reducing operational toil and lowering mean time to resolution (MTTR) while preserving human oversight.
Architecture and Core Capabilities
Azure SRE Agent combines four capability pillars:
Continuous Observability Ingestion
The agent consumes signals from Azure Monitor, Log Analytics, Application Insights, and supported external observability systems to build a live understanding of system health and dependencies. The real benefit here is that organizations already have everything in place—adoption is smooth because the agent relies on familiar data sources.
Intelligent Diagnosis and Correlation
When an alert or anomaly occurs, the agent correlates telemetry with:
- Recent deployments or configuration changes
- Resource topology and dependencies
- Historical incident patterns
This enables accelerated root cause analysis without manual log spelunking.
Automated and Approval-Gated Remediation
Azure SRE Agent can execute operational actions like scaling, restarting services, or reverting deployments—basically anything that relies on Azure CLI and REST APIs. All write actions are gated by RBAC and explicit approval, ensuring governance and control.
Workflow and Developer Tool Integration
The agent integrates with Azure Monitor alerts, GitHub, Azure DevOps, ServiceNow, and PagerDuty, allowing incidents to flow naturally into existing operational and engineering processes.
Setup and Deployment
To deploy Azure SRE Agent, you'll need:
- An active Azure subscription
- Permissions to assign RBAC roles (Microsoft.Authorization/roleAssignments/write)
- Network access to the *.azuresre.ai domain
- Deployment in a supported region
Note: I couldn't find information on automating deployment using bicep or az cli—something to explore later.
In the Azure Portal, search for "Azure SRE Agent," select "Create Agent," and:
- Create or select a dedicated resource group for the agent
- Choose your region
- Associate one or more resource groups to monitor
- Complete deployment and wait for initialization
Once deployed, the agent exposes a chat-based interface in the Azure Portal, allowing engineers to interact using natural language to investigate and manage incidents.
Real-World Usage Example
To test this, I deployed an Azure App Service connected to CosmosDB using Managed Identity. After testing the app, I removed the App Service Managed Identity to simulate an outage.
I opened SRE Agent and asked: "Can you investigate my app service outage?"
The agent responded with:
- Initial investigation findings
- Metrics analysis
- Summary of findings with chart views
- Detailed root cause analysis
- Recommended actions
It identified the root cause as an identity problem where the Web App couldn't connect to Cosmos DB. The agent then provided a "Diagnosis Complete - Data Unreachable Root Cause" report in table format, including potential fix steps.
When asked to fix the problem, I responded affirmatively, and the agent executed the remediation steps after my approval. It concluded with "Issue resolved" and a summary of actions taken.
Why This Matters
Azure SRE Agent is, apart from GitHub Copilot, my next favorite use case for Generative AI. Having experienced cloud workload outages myself for years—spending hours or days digging through metrics and logs to pinpoint root causes—I believe this is an amazing service to add to your Azure environment.
Even if you don't initially trust it to take actions, having an AI assistant to help with investigation and outage analysis will be a significant time-saver. Your workload will be back up-and-running faster too.
I haven't even touched on source control integration with GitHub or Azure DevOps, notifications through Outlook or Teams, or expansion to third-party monitoring tools like Grafana and DataDog. There will be many more blog posts on Azure SRE Agent in the near future.
For inspiration, check out the Microsoft Learn lab Optimize Azure Reliability using SRE Agent I published recently. If you've deployed and used it in your environment, I'd love to hear your stories!
Cheers! /Peter
Comments
Please log in or register to join the discussion