Inside Microsoft's Agentic Workflows: Building Azure SRE Agent with AI-Powered Operations

Microsoft reveals how they built Azure SRE Agent using agentic workflows, where specialized AI agents collaborate across the entire software development lifecycle to reduce operational toil and accelerate incident response at enterprise scale.

Microsoft's latest breakthrough in AI-powered operations comes from an unexpected source: their own internal development practices. In a revealing look at how they build Microsoft using Microsoft, the company has detailed the creation of Azure SRE Agent, an AI operations agent that represents a fundamental shift in how enterprise-scale systems are maintained and operated.

The challenge was clear and pressing. Microsoft operates always-on, mission-critical production systems at extraordinary scale—thousands of services, millions of deployments, and constant change are the reality of modern cloud engineering. These titan systems power organizations around the globe, including Microsoft's own infrastructure, with extremely low risk tolerance for downtime.

Traditional operations work like incident investigation, response and recovery, and remediation is essential but disruptive to innovation. For engineers, operational toil often means being pulled away from feature work to diagnose alerts, sift through logs, correlate metrics across systems, or respond to incidents at any hour. On-call rotations and manual investigations slow teams down and introduce burnout.

What's more, in the era of AI, demand for operational excellence has spiked to new heights. It became clear that traditional human-only processes couldn't meet the scale and complexity needs for system maintenance, especially in the AI world where code shipping velocity has increased exponentially.

The Agentic Workflow Revolution

The solution Microsoft developed goes beyond simply automating tasks. Azure SRE Agent was created using the agentic workflow approach—building agents with agents. Rather than treating AI as a bolt-on tool, Microsoft embedded specialized agents across the entire software development lifecycle (SDLC) to collaborate with developers, from planning through operations.

The diagram above outlines the agents used at each stage of development. They come together to form a full lifecycle:

Plan & Code: Agents support spec-driven development to unlock faster inner loop cycles for developers and even product managers. With AI, Microsoft can not only draft spec documentation that defines feature requirements for UX and software development agents but also create prototypes and check in code to staging. This enables PMs, UX, and engineering teams to rapidly iterate, generate, and improve code even for early-stage merges.

Verify, Test & Deploy: Agents for code quality review, security, evaluation, and deployment work together to shift left on quality and security issues. They continuously assess reliability, ensure performance, and enforce consistent release best practices.

Operate & Optimize: Azure SRE Agent handles ongoing operational work from investigating alerts to assisting with remediation, and even resolving some issues autonomously. Moreover, it learns continuously over time, and Microsoft provides Azure SRE Agent with its own specialized instance of Azure SRE Agent to maintain itself and catalyze feedback loops.

While agents surface insights, propose actions, mitigate issues, and suggest long-term code or IaC fixes autonomously, humans remain in the loop for oversight, approval, and decision-making when required. This combination of autonomy and governance proved critical for safe operations at scale.

Microsoft also designed Azure SRE Agent to integrate across existing systems. The team uses custom agents, Model Context Protocol (MCP) and Python tools, telemetry connections, incident management platforms, code repositories, knowledge sources, business process and operational tools to add intelligence on top of established workflows rather than replacing them.

Enterprise-Scale Impact

The impact of Azure SRE Agent is felt most clearly in day-to-day operations. By automating investigations and assisting with remediation, the agent reduces burden for on-call engineers and accelerates time to resolution.

Internally at Microsoft in the last nine months, they've seen:

35,000+ incidents handled autonomously by Azure SRE Agent
50,000+ developer hours saved by reducing manual investigation and response work
Teams experienced reduced on-call burden and faster time-to-mitigation during incidents

To share a couple of specific cases, the Azure Container Apps and Azure App Service product group teams have had tremendous success with Azure SRE Agent. Engineers for Azure Container Apps had overwhelmingly positive (89%) responses to the root cause analysis (RCA) results from Azure SRE Agent, covering over 90% of incidents. Meanwhile, Azure App Service has brought their time-to-mitigation for live-site incidents (LSIs) down to 3 minutes, a drastic improvement from the 40.5-hour average with human-only activity.

One Microsoft engineer shared their experience: "[It's] been a massive help in dealing with quota requests which were being done manually at first. I can also say with high confidence that there have been quite a few CRIs that the agent was spot on/gave the right RCA/provided useful clues that helped navigate my initial investigation in the right direction RATHER than me having to spend time exploring all different possibilities before arriving at the correct one. Since the Agent/AI has already explored all different combinations and narrowed it down to the right one, I can pick the investigation up from there and save me countless hours of logs checking."

Key Learnings from the Agentic Journey

Microsoft's experience building Azure SRE Agent revealed several critical insights about agentic workflows:

Specialization matters: Generic agents plateau quickly. Real impact comes from agents equipped with domain-specific skills, context, and access to the right tools and data.

Deep integration over replacement: Embedding agents into established telemetry, workflows, and platforms rather than attempting to replace them proved essential for adoption and effectiveness.

Human-in-the-loop governance: Autonomy had to be balanced with clear approval boundaries, role-based access, and safety checks to build trust at scale.

Continuous feedback and evaluation: Ongoing measurement proved crucial to improve agents over time and understand where automation added value versus where human judgment should remain central.

Agents build agents: Building agents with agents is essential to scaling, as manual development quickly became a bottleneck. Agents dramatically accelerated inner loop iteration through code generation, review, debugging, security fixes, and more.

The Future of AI-Powered Operations

Azure SRE Agent represents more than just a new tool—it's a new operational system. And at Microsoft's scale, transformative systems lead to transformative outcomes. The company is now applying these agentic workflow principles across other areas of development and operations, creating a blueprint for how AI can transform enterprise software engineering.

The patterns Microsoft has developed are broadly applicable. Organizations facing similar challenges with operational complexity, scale, and the need to free engineers from toil can adapt these approaches to their own environments. The key is understanding that agentic workflows aren't just about automation—they're about creating collaborative systems where AI agents and humans work together to achieve outcomes that neither could accomplish alone.

As Microsoft continues to refine and expand its agentic workflow approach, the lessons learned from building Azure SRE Agent will likely influence how other organizations approach AI-powered operations and development. The future of enterprise software engineering may well be built by agents, for agents, with humans guiding the journey.