Azure SRE Agent: Designing for Multiple Interfaces in a Multi-Cloud World

Microsoft's Azure SRE Agent has to serve four distinct caller types through three interfaces. The decision to ship the MCP server first says a lot about how cloud operations workflows are changing and about where human and automated work meet in incident management.

When Microsoft began developing tooling for Azure SRE Agent, the team faced a deceptively simple question: who's actually calling this service? The answer was four distinct caller types with fundamentally different needs, which led to a multi-interface strategy that balances human and automated requirements in cloud operations.

The Caller Conundrum

The Azure SRE Agent team identified four primary caller types:

Humans at terminals during critical incidents, typically at 2 AM
Coding agents mid-session that want SRE capabilities without context switching
Automated PagerDuty SRE agents running triage loops with no human intervention
Other Azure SRE Agent instances needing to delegate sub-tasks

Each caller requires different interaction patterns, yet all share the same backend infrastructure. This fundamental challenge led to the development of three interfaces:

Interactive CLI: For humans at terminals, optimized for incident response
Agent Mode: For coding agents like Copilot CLI that spawn the tool as a subprocess
MCP Server: For humans inside coding environments and remote agents in other ecosystems
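
One way to picture three interfaces over a shared backend is a single capability function with a thin adapter per interface. The sketch below is illustrative only: the function names, the resource_id parameter, and the canned result are assumptions, not the Azure SRE Agent implementation.

```python
import json


def check_resource_health(resource_id: str) -> dict:
    """Hypothetical shared backend call; returns canned data for illustration."""
    return {"resource_id": resource_id, "status": "healthy"}


def cli_adapter(argv: list) -> str:
    """Interactive CLI: format the backend result as one terminal-friendly line."""
    result = check_resource_health(argv[0])
    return f"{result['resource_id']}: {result['status']}"


def agent_mode_adapter(request_line: str) -> str:
    """Agent mode: one JSON request in, one JSON response out (subprocess style)."""
    request = json.loads(request_line)
    return json.dumps(check_resource_health(request["resource_id"]))


def mcp_adapter(tool_name: str, arguments: dict) -> dict:
    """MCP server: dispatch a named tool call to the same backend function."""
    tools = {"check_resource_health": check_resource_health}
    return tools[tool_name](**arguments)
```

Each adapter only changes how the request arrives and how the result is presented; the capability itself lives in one place.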

Strategic Interface Prioritization

The MCP server ships first, a decision that reflects important shifts in how cloud operations are evolving. The distinction between CLI and MCP approaches reveals critical differences in workflow integration:

The CLI and agent mode both require deliberate invocation: a human typing commands or a coding agent spawning a subprocess. Either way, the caller makes an intentional context switch. The MCP server, by contrast, surfaces itself as tools within existing environments, meeting users where they already work.

This approach eliminates context switching for SREs working in Copilot CLI, VS Code, or similar environments. They can ask natural language questions and have the appropriate tools execute without leaving their current workflow. For remote agents in PagerDuty loops or cross-cloud scenarios, MCP provides protocol-based communication without requiring subprocess spawning.

Dual Audiences, Single Protocol

The MCP server serves two distinct audiences through the same protocol:

Humans inside coding agents: SREs working in VS Code Copilot, Claude Desktop, GitHub Copilot CLI, or Cursor need SRE capabilities integrated into their existing workflows. They want tools available without interrupting their deployment scripting, runbook review, or debugging sessions.
Remote agents in other ecosystems: AWS DevOps agents handling cross-cloud incidents need to check Azure resource health without human intervention. PagerDuty SRE agents require incident summaries for automated triage. Other Azure SRE Agent instances may need to delegate work.

Tool Design Considerations

Each MCP tool maps to a specific SRE Agent capability, with careful attention to three critical design elements:

Natural Language Descriptions

Tool descriptions function as system prompts for AI models. A description like "Returns health status for an Azure resource" proves less effective than "Check whether an Azure resource (VM, gateway, database, container) is healthy, degraded, or unreachable. Use this when diagnosing an active outage or validating state after a deployment." The latter provides contextual guidance on when to use the tool, not just what it does.
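
As a rough illustration, the stronger description could sit in a tool definition like the one below. The name, description, and inputSchema fields follow the tool shape defined by the MCP specification; the tool name and the resource_id parameter are hypothetical and chosen only for this sketch.

```python
# Hypothetical tool definition; the tool name and parameter are illustrative.
CHECK_RESOURCE_HEALTH_TOOL = {
    "name": "check_resource_health",
    # The description says what the tool does AND when a model should call it.
    "description": (
        "Check whether an Azure resource (VM, gateway, database, container) "
        "is healthy, degraded, or unreachable. Use this when diagnosing an "
        "active outage or validating state after a deployment."
    ),
    # JSON Schema for the arguments; each parameter carries its own description.
    "inputSchema": {
        "type": "object",
        "properties": {
            "resource_id": {
                "type": "string",
                "description": "Identifier of the Azure resource to check.",
            }
        },
        "required": ["resource_id"],
    },
}
```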

Unified Response Shape

Although humans and automated agents need different things from a response, all tool responses follow the same contract: defined fields, stable semantics, no preamble, plus a summary field with a plain-language explanation. Humans read the summary; automated agents parse the structured fields. A single shape avoids the maintenance overhead of branching response logic per caller.
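
A minimal sketch of such a contract, assuming a health-check response: the field names (status, resource_id, checked_at) are hypothetical, and only the idea of structured fields plus a summary comes from the design described above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToolResponse:
    """One response shape for every caller (field names are illustrative)."""
    status: str        # machine-parseable: "healthy", "degraded", or "unreachable"
    resource_id: str   # which resource the result refers to
    checked_at: str    # ISO 8601 timestamp of the check
    summary: str       # plain-language explanation for human readers


def build_health_response(resource_id: str, status: str, checked_at: str) -> dict:
    """Build the structured fields and the summary from the same inputs."""
    summary = f"{resource_id} is {status} as of {checked_at}."
    return asdict(ToolResponse(status, resource_id, checked_at, summary))


response = build_health_response("vm-1", "degraded", "2024-01-01T02:00:00Z")
```

An automated agent would branch on response["status"], while a person or a chat model would read response["summary"]; neither needs a separate code path on the server.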

Statelessness and Context

Statelessness suits remote agents but creates friction for humans. The solution is to make each response self-sufficient: it carries enough context for the model to construct coherent follow-up calls without the user re-explaining the situation. The tool itself keeps no state; it returns enough information that remembering it becomes the responsibility of whoever holds the response.
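
The idea can be sketched as a handler that stores nothing between calls but echoes back the identifiers a follow-up call would need. The function names, fields, and canned status below are assumptions made for illustration.

```python
def summarize_incident(incident_id: str, resource_id: str) -> dict:
    """Stateless handler: nothing is stored on the server between calls."""
    return {
        # Echo the identifiers so the caller can reuse them in the next call.
        "incident_id": incident_id,
        "resource_id": resource_id,
        "status": "degraded",  # canned value for illustration
        "summary": f"Incident {incident_id}: {resource_id} is degraded.",
    }


def follow_up_arguments(previous: dict) -> dict:
    """Build the next call's arguments from the prior response alone."""
    return {
        "incident_id": previous["incident_id"],
        "resource_id": previous["resource_id"],
    }


first = summarize_incident("INC-42", "vm-1")
next_args = follow_up_arguments(first)
```

Because the response repeats its own context, the caller (a model's conversation or a remote agent) is the only place memory has to live.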

Testing and Validation Challenges

Designing for these diverse use cases complicates testing. The human-in-coding-agent scenario is straightforward to test: connect the server to a client and observe the interactions. Simulating remote agents that call cold, with no prior context, requires a different approach, one that focuses on whether descriptions and schemas work for a model encountering the tools for the first time.
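
One cheap proxy for the cold-call case is a static check on the tool definitions themselves. The sketch below is a hypothetical lint, not part of any MCP SDK, and its two rules are assumptions about what a first-time caller needs: usage guidance in the description and a description on every parameter.

```python
def cold_call_issues(tool: dict) -> list:
    """Return reasons a tool definition may confuse a model seeing it cold."""
    issues = []
    description = tool.get("description", "")
    # Rule 1 (assumed): the description should say when to use the tool.
    if "use this when" not in description.lower():
        issues.append("description lacks 'use this when' guidance")
    # Rule 2 (assumed): every parameter should explain itself.
    properties = tool.get("inputSchema", {}).get("properties", {})
    for name, spec in properties.items():
        if not spec.get("description"):
            issues.append(f"parameter '{name}' has no description")
    return issues


bare_tool = {
    "name": "check_resource_health",
    "description": "Returns health status for an Azure resource",
    "inputSchema": {"type": "object", "properties": {"resource_id": {"type": "string"}}},
}
issues = cold_call_issues(bare_tool)
```

A check like this does not replace running a model against the server with an empty context, but it can catch thin descriptions before that stage.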

Future Interface Development

The interactive CLI and agent mode follow the same three-node architecture, with the CLI optimized for terminal interactions and agent mode providing direct subprocess access for coding agents. Both interfaces remain in development, with the MCP server providing immediate value by integrating into existing MCP clients.

This multi-interface approach acknowledges that effective SRE tooling has to accommodate both human intuition and automated precision across diverse operational contexts. The Azure SRE Agent strategy shows how cloud providers may need to design several interfaces over one backend to serve the mixed human and agent ecosystems of modern distributed systems.

For more information on Azure SRE Agent, visit the official documentation. To understand the Model Context Protocol (MCP) that powers the server implementation, explore the MCP specification.
