Microsoft's Azure SRE Agent architecture enables AI agents to collaborate on incidents using direct real-time communication and shared memory systems, eliminating context loss and manual handoffs.
When building the partner ecosystem for Azure SRE Agent, Microsoft encountered a fundamental challenge: how can multiple AI agents working on the same incident share context and preserve that information once the problem is resolved?
This question led to an architecture that addresses one of the most frustrating aspects of incident response: investigation context is lost when agents work in isolation.
The Problem: Isolated Agents and Lost Context
Most operational AI agents function independently, creating significant gaps when incidents span multiple domains. Your cloud monitoring agent typically cannot access your third-party observability stack. Your Datadog specialist remains unaware of your Azure resource topology. When an incident crosses these boundaries, a human must manually bridge the gap—often at 2 AM with incomplete information.
Even when agents do exchange information directly, the problem persists. These conversations are ephemeral and disappear once the investigation concludes. The next on-call engineer sees only a resolved alert, with no record of what was attempted, what was discovered, or why the remediation succeeded. Each new agent that encounters the same pattern starts from scratch.
The Solution: Two Communication Paths
Microsoft's architecture addresses this through two distinct communication channels:
Direct Agent-to-Agent Communication (Real-Time)
During active investigations, the primary agent calls partner agents directly, either over a protocol such as MCP (Model Context Protocol) or through API endpoints. This fast path enables real-time analysis: the partner agent performs domain-specific work (log searches, span analysis, custom metric queries) and returns findings immediately. The primary agent doesn't need to understand the internals of third-party systems like Datadog or Dynatrace; it simply asks questions and receives answers.
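The ask-and-answer shape of the fast path can be sketched roughly as follows. This is a minimal illustration, not Microsoft's implementation: `PartnerAgent`, `Finding`, `FakeLogPartner`, and `investigate` are hypothetical names, the log lines are invented, and an in-process method call stands in for what would be an MCP or API call in a real deployment.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Finding:
    source: str    # which partner produced the finding
    summary: str   # human-readable result of the analysis


class PartnerAgent(Protocol):
    """Hypothetical interface: anything that can answer a question."""

    def ask(self, question: str) -> Finding: ...


class FakeLogPartner:
    """Stand-in for a partner agent that searches third-party logs."""

    def __init__(self, logs: list[str]) -> None:
        self._logs = logs

    def ask(self, question: str) -> Finding:
        # Toy "analysis": count log lines that mention the queried term.
        hits = [line for line in self._logs if question.lower() in line.lower()]
        return Finding(source="log-partner", summary=f"{len(hits)} matching log lines")


def investigate(partner: PartnerAgent, term: str) -> Finding:
    # The primary agent only asks a question; it never touches partner internals.
    return partner.ask(term)


partner = FakeLogPartner(["Timeout calling db", "ok", "timeout on retry"])
finding = investigate(partner, "timeout")
print(finding.summary)
```

Because `investigate` depends only on the `ask` method, a different partner could be substituted without changing the primary agent's code.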
Shared Memory (Durable)
After the direct exchange, both agents write their actions and findings to external systems that teams already use during incidents. This durable path creates audit trails and enables seamless handoffs. The shared memory backends include:
- Incident platforms (e.g., PagerDuty): Timeline notes and on-call handoff context
- Issue trackers (e.g., GitHub Issues): Code-level findings, root cause analysis, action comments
- ITSM systems (e.g., ServiceNow): Work notes and ITSM-compliant audit trails
The key advantage: this approach doesn't require adopting new systems. Agents write to whatever your team already uses.
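One way to picture the durable path is a small connector interface that both agents write through. The sketch below is an assumption-laden illustration: `Note`, `SharedMemory`, `InMemoryTimeline`, and `record` are hypothetical names, the note texts are made up, and the in-memory list stands in for a real connector that would call the external system's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Protocol


@dataclass(frozen=True)
class Note:
    incident_id: str
    author: str          # e.g. "primary-agent" or "partner-agent"
    text: str
    written_at: datetime


class SharedMemory(Protocol):
    """Hypothetical connector interface for an incident platform, issue tracker, or ITSM system."""

    def write(self, note: Note) -> None: ...


@dataclass
class InMemoryTimeline:
    """Stand-in backend; a real connector would call the external system instead."""

    notes: list[Note] = field(default_factory=list)

    def write(self, note: Note) -> None:
        self.notes.append(note)


def record(memory: SharedMemory, incident_id: str, author: str, text: str) -> None:
    # Stamp each note with a UTC time so the timeline can be read in order later.
    memory.write(Note(incident_id, author, text, datetime.now(timezone.utc)))


timeline = InMemoryTimeline()
record(timeline, "INC-1", "partner-agent", "Error spike traced to checkout service logs")
record(timeline, "INC-1", "primary-agent", "Restarted checkout pods; error rate back to baseline")
for note in timeline.notes:
    print(f"[{note.author}] {note.text}")
```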
How It Works: The Complete Workflow
1. Alert source: Monitoring fires an alert
2. Primary agent: Receives alert, triages, and starts investigating with native tools
3. Primary agent: Calls partner agent for domain-specific analysis (third-party logs, spans)
4. Partner agent: Runs analysis and returns findings in real time
5. Primary agent: Correlates partner findings with native data and runs remediation
6. Both agents: Write findings, actions, and resolution to external systems
7. Agent or human: Verifies resolution and closes incident
Steps 3-5 occur in real time over the direct channel. Nothing gets written to shared memory until the investigation produces actual results, ensuring investigation speed isn't bottlenecked by external writes.
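The ordering described above, with direct calls first and durable writes last, can be sketched as a single function. This is only an illustration under stated assumptions: `handle_incident`, `ask_partner`, and `write_shared` are hypothetical names, and the lambda and list stand in for a real partner agent and a real external system.

```python
from typing import Callable


def handle_incident(
    alert: str,
    ask_partner: Callable[[str], str],
    write_shared: Callable[[str], None],
) -> list[str]:
    """Return the ordered trace of phases, to make the sequencing visible."""
    trace = ["triage"]                      # primary agent triages the alert
    finding = ask_partner(alert)            # direct call; partner answers in real time
    trace.append("partner_analysis")
    action = f"remediated '{alert}' based on: {finding}"
    trace.append("remediation")
    # Durable writes come last, so the investigation never waits on them.
    for entry in (finding, action):
        write_shared(entry)
    trace.append("persist")
    return trace


shared_memory: list[str] = []
trace = handle_incident(
    "high latency on checkout",
    ask_partner=lambda alert: f"slow spans found for {alert}",
    write_shared=shared_memory.append,
)
print(trace)
print(shared_memory)
```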
Who Does What
The primary agent owns the full incident lifecycle: detection, triage, investigation, remediation, and closure. The partner agent is called when the primary agent needs access to parts of the stack it cannot reach natively. It performs specialized deep-dive analysis, returns findings, and the primary agent takes over from there.
Primary agent responsibilities:
- Full incident lifecycle ownership
- Calling partner agents
- Writing to shared memory
- Acting on proposed next steps
Partner agent responsibilities:
- Domain-specific deep-dive analysis
- Responding to calls
- Writing enrichment to shared memory
Why Shared Context Should Live Where Humans Already Work
If your agent writes findings to a system nobody checks, you've essentially built an expensive diary. Writing to places your team already monitors, such as GitHub Issues, ServiceNow tickets, or Jira epics, puts the agent's work in front of the people who can act on it.
When agents post reasoning and pending decisions to tools engineers already check, anyone can review or correct the process using familiar interfaces. Comments, reactions, and status updates become the oversight mechanism without requiring custom approval UIs.
This persistence also creates operational history. Every entry becomes searchable by both people and agents through the same interface, without requiring separate vector databases. Future investigations can answer: How was this incident type handled before? What did the agent try? What did humans override?
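A question like "what did humans override?" then becomes an ordinary search over the written entries. The sketch below assumes a hypothetical `HistoryEntry` record and a plain keyword match over invented sample entries; a real backend would use its own search interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoryEntry:
    incident_id: str
    author: str   # "agent" or "human"
    text: str


def search(history: list[HistoryEntry], keyword: str) -> list[HistoryEntry]:
    # Case-insensitive substring match, preserving the original write order.
    needle = keyword.lower()
    return [entry for entry in history if needle in entry.text.lower()]


# Invented sample entries, for illustration only.
history = [
    HistoryEntry("INC-1", "agent", "Disk pressure on node pool; scaled out"),
    HistoryEntry("INC-1", "human", "Override: reverted scale-out, cleaned old logs instead"),
    HistoryEntry("INC-2", "agent", "Certificate expired; rotated"),
]

matches = search(history, "override")
for entry in matches:
    print(entry.incident_id, entry.author, entry.text)
```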
Design Principles
Microsoft's architecture follows several key principles:
- Investigate first, persist second: The primary agent calls partners directly for real-time analysis, writing to shared memory only after collecting findings
- Humans see everything through shared context: The direct path is agent-to-agent only, but shared context allows humans to see the full picture
- Append-only: Writes are additive with no overwrites or deletions, enabling full history reconstruction
- Backend-agnostic: Swapping between PagerDuty, ServiceNow, or GitHub Issues is a simple connector configuration change
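The append-only and backend-agnostic principles can be illustrated together with a short sketch. Everything here is hypothetical: `Entry`, `AppendOnlyLog`, `log_from_config`, the backend keys, and the sample entries are assumptions made for the example, not the actual connector configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)          # frozen: an entry cannot be edited after it is written
class Entry:
    author: str
    text: str


class AppendOnlyLog:
    """Exposes append and read only; there is deliberately no update or delete."""

    def __init__(self, backend: str) -> None:
        self.backend = backend
        self._entries: list[Entry] = []

    def append(self, entry: Entry) -> None:
        self._entries.append(entry)

    def history(self) -> tuple[Entry, ...]:
        return tuple(self._entries)   # immutable snapshot in write order


SUPPORTED_BACKENDS = {"pagerduty", "servicenow", "github_issues"}  # hypothetical keys


def log_from_config(config: dict[str, str]) -> AppendOnlyLog:
    # Swapping backends changes only this config value, not the calling code.
    backend = config["backend"]
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unknown backend: {backend}")
    return AppendOnlyLog(backend)


log = log_from_config({"backend": "servicenow"})
log.append(Entry("primary-agent", "Suspected cause: bad deploy"))
# A correction is a new entry, so the earlier reasoning stays in the history.
log.append(Entry("primary-agent", "Correction: cause was an expired certificate"))
print(len(log.history()))
```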
The Practical Benefits
The architecture delivers straightforward advantages:
- Investigations don't wait on external system writes
- No context loss when conversations end
- Next on-call engineers pick up where previous ones left off
- Every action from both agents appears in systems humans already monitor
- Adding new partner agents or shared memory backends requires only connector changes
The Fast Path vs. The Durable Path
The architecture distinguishes between two paths: the fast path for real-time investigation and the durable path for audit trails, handoffs, and operational history. This separation keeps real-time analysis responsive while still capturing the context needed for future reference and human oversight.
The result is an incident response system where AI agents collaborate effectively, preserve critical context, and integrate seamlessly with existing human workflows—eliminating the midnight handoffs and lost investigations that plague modern operations teams.
