Shared Agent Context: How Microsoft Tackles Partner Agent Collaboration in Azure SRE

Microsoft's Azure SRE Agent architecture enables AI agents to collaborate on incidents using direct real-time communication and shared memory systems, eliminating context loss and manual handoffs.

When building the partner ecosystem for Azure SRE Agent, Microsoft encountered a fundamental challenge: how can multiple AI agents working on the same incident share context and preserve that information once the problem is resolved?

This question led to an architecture built around one of the most frustrating aspects of modern incident response: the loss of critical investigation context when agents work in isolation.

The Problem: Isolated Agents and Lost Context

Most operational AI agents function independently, creating significant gaps when incidents span multiple domains. Your cloud monitoring agent typically cannot access your third-party observability stack. Your Datadog specialist remains unaware of your Azure resource topology. When an incident crosses these boundaries, a human must manually bridge the gap—often at 2 AM with incomplete information.

Even when agents do exchange information directly, the problem persists. These conversations are ephemeral, disappearing once the investigation concludes. The next on-call engineer sees only a resolved alert, with no record of what was attempted, what was discovered, or why the remediation succeeded. Each new agent encountering the same pattern starts from scratch.

The Solution: Two Communication Paths

Microsoft's architecture addresses this through two distinct communication channels:

Direct Agent-to-Agent Communication (Real-Time)

During active investigations, the primary agent calls partner agents directly, using a protocol such as MCP (Model Context Protocol) or the partner's API endpoints. This fast path enables real-time analysis: the partner agent performs domain-specific work (log searches, span analysis, custom metric queries) and returns findings immediately. The primary agent doesn't need to understand the internals of third-party systems like Datadog or Dynatrace; it simply asks questions and receives answers.
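To make the fast path concrete, here is a minimal sketch of the "ask a question, get findings back" contract. Everything in it is an assumption for illustration: the `PartnerAgent` interface, the `Finding` shape, and the `FakeLogPartner` stand-in are hypothetical names, and the partner is an in-memory object rather than a real MCP or HTTP call.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Finding:
    """One result returned by a partner agent (hypothetical shape)."""
    source: str
    summary: str


class PartnerAgent(Protocol):
    """Anything the primary agent can ask a question of (hypothetical)."""
    def ask(self, question: str) -> list[Finding]: ...


class FakeLogPartner:
    """Stand-in for a third-party observability agent.

    A real partner would be reached over MCP or an API endpoint; here
    the backend is just an in-memory list of log lines.
    """
    def __init__(self, log_lines: list[str]) -> None:
        self._log_lines = log_lines

    def ask(self, question: str) -> list[Finding]:
        # Naive keyword match: keep log lines sharing a word with the question.
        words = {w.lower() for w in question.split()}
        return [
            Finding(source="fake-logs", summary=line)
            for line in self._log_lines
            if words & {w.lower() for w in line.split()}
        ]


def investigate(partner: PartnerAgent, question: str) -> list[Finding]:
    """Primary-agent side: ask and receive, with no knowledge of internals."""
    return partner.ask(question)


partner = FakeLogPartner([
    "checkout timeout after 30s",
    "cache warmup complete",
])
findings = investigate(partner, "timeout errors")
```

The point of the interface is that `investigate` never touches log storage or query syntax; swapping in a different partner only requires another object with an `ask` method.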

Shared Memory (Durable)

After the direct exchange, both agents write their actions and findings to external systems that teams already use during incidents. This durable path creates audit trails and enables seamless handoffs. The shared memory backends include:

Incident platforms (e.g., PagerDuty): Timeline notes and on-call handoff context
Issue trackers (e.g., GitHub Issues): Code-level findings, root cause analysis, action comments
ITSM systems (e.g., ServiceNow): Work notes and ITSM-compliant audit trails

The key advantage: this approach doesn't require adopting new systems. Agents write to whatever your team already uses.
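As a rough illustration of writing one finding to several existing systems, the sketch below fans a single entry out to two in-memory stand-ins. The `Entry` fields, the `InMemoryBackend` class, and the incident ID `INC-1` are all hypothetical; no real PagerDuty, GitHub, or ServiceNow API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    agent: str  # who wrote it, e.g. "primary" or "partner"
    kind: str   # "finding", "action", or "resolution"
    text: str


class InMemoryBackend:
    """Stand-in for an incident platform, issue tracker, or ITSM system."""
    def __init__(self, name: str, note_label: str) -> None:
        self.name = name
        self.note_label = note_label
        self.notes: dict[str, list[str]] = {}

    def write(self, incident_id: str, entry: Entry) -> None:
        # Each backend labels the note in its own vocabulary.
        line = f"[{self.note_label}] {entry.agent} {entry.kind}: {entry.text}"
        self.notes.setdefault(incident_id, []).append(line)


backends = [
    InMemoryBackend("incident-platform", "timeline note"),
    InMemoryBackend("issue-tracker", "comment"),
]

entry = Entry(agent="partner", kind="finding", text="error spike in checkout logs")
for backend in backends:
    backend.write("INC-1", entry)
```

The same entry lands in each system in that system's own form, which is what lets teams keep the tools they already have.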

How It Works: The Complete Workflow

1. Alert source: Monitoring fires an alert
2. Primary agent: Receives alert, triages, and starts investigating with native tools
3. Primary agent: Calls partner agent for domain-specific analysis (third-party logs, spans)
4. Partner agent: Runs analysis and returns findings in real time
5. Primary agent: Correlates partner findings with native data and runs remediation
6. Both agents: Write findings, actions, and resolution to external systems
7. Agent or human: Verifies resolution and closes incident

Steps 3-5 occur in real time over the direct channel. Nothing gets written to shared memory until the investigation produces actual results, ensuring investigation speed isn't bottlenecked by external writes.
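The ordering described above can be sketched in a few lines. This is a toy model under stated assumptions: `handle_alert`, `partner_analyze`, and the two lists are hypothetical, and "remediation" is just a string. What it shows is that no write to shared memory happens until the fast-path work is done.

```python
from typing import Callable

events: list[str] = []         # records the order in which things happen
shared_memory: list[str] = []  # stand-in for the external system


def partner_analyze(question: str) -> str:
    events.append("partner:analyze")
    return f"partner finding for '{question}'"


def handle_alert(alert: str, ask_partner: Callable[[str], str]) -> str:
    events.append("primary:triage")     # step 2
    finding = ask_partner(alert)        # steps 3-4, over the direct channel
    events.append("primary:remediate")  # step 5
    resolution = f"remediated using: {finding}"
    # Step 6: only now touch the durable path.
    for line in (finding, resolution):
        shared_memory.append(line)
        events.append("write:shared-memory")
    return resolution


result = handle_alert("high latency on checkout", partner_analyze)
```

After the call, `events` holds the triage, analysis, and remediation markers first, followed by the two shared-memory writes.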

Who Does What

The primary agent owns the full incident lifecycle: detection, triage, investigation, remediation, and closure. The partner agent is called when the primary agent needs access to parts of the stack it cannot reach natively. It performs specialized deep-dive analysis, returns findings, and the primary agent takes over from there.

Primary agent responsibilities:

Full incident lifecycle ownership
Calling partner agents
Writing to shared memory
Acting on proposed next steps

Partner agent responsibilities:

Domain-specific deep-dive analysis
Responding to calls
Writing enrichment to shared memory

Why Shared Context Should Live Where Humans Already Work

If your agent writes findings to a system nobody checks, you've essentially built an expensive diary. Writing to places your team already monitors, such as GitHub Issues, ServiceNow tickets, or Jira epics, puts those findings in front of the people who can act on them.

When agents post reasoning and pending decisions to tools engineers already check, anyone can review or correct the process using familiar interfaces. Comments, reactions, and status updates become the oversight mechanism without requiring custom approval UIs.

This persistence also creates operational history. Every entry becomes searchable by both people and agents through the same interface, without requiring separate vector databases. Future investigations can answer: How was this incident type handled before? What did the agent try? What did humans override?
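A small sketch of what "searchable through the same interface" might look like in practice, assuming nothing fancier than the substring search most ticket systems offer. The `HistoryEntry` shape, the `search` helper, and the sample incidents are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoryEntry:
    incident_id: str
    author: str  # "agent" or "human"
    text: str


def search(history: list[HistoryEntry], term: str) -> list[HistoryEntry]:
    """Case-insensitive substring search over past entries."""
    needle = term.lower()
    return [e for e in history if needle in e.text.lower()]


history = [
    HistoryEntry("INC-1", "agent", "Restarted checkout pods after OOM kill"),
    HistoryEntry("INC-1", "human", "Override: raised memory limit instead of restarting"),
    HistoryEntry("INC-2", "agent", "Certificate expired on gateway"),
]

# "How was this incident type handled before?"
oom_hits = search(history, "oom")
# "What did humans override?"
overrides = [e for e in search(history, "override") if e.author == "human"]
```

Both a person typing into a search box and an agent calling `search` go through the same entries, so there is one history rather than two.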

Design Principles

Microsoft's architecture follows several key principles:

Investigate first, persist second: The primary agent calls partners directly for real-time analysis, writing to shared memory only after collecting findings
Humans see everything through shared context: The direct path is agent-to-agent only, but shared context allows humans to see the full picture
Append-only: Writes are additive with no overwrites or deletions, enabling full history reconstruction
Backend-agnostic: Swapping between PagerDuty, ServiceNow, or GitHub Issues is a simple connector configuration change
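Two of these principles, append-only writes and backend-agnostic configuration, lend themselves to a short sketch. The `AppendOnlyLog` class, the `CONNECTORS` registry, and the `shared_memory` config key are assumptions made up for this example; a real connector would wrap the target system's API.

```python
class AppendOnlyLog:
    """Entries can be added and read, never changed or removed."""
    def __init__(self) -> None:
        self._entries: list[str] = []

    def append(self, text: str) -> int:
        self._entries.append(text)
        return len(self._entries) - 1  # position of the new entry

    def history(self) -> tuple[str, ...]:
        # Return a tuple copy so callers cannot mutate stored entries.
        return tuple(self._entries)


# Hypothetical connector registry: config names map to backend factories.
# All three point at the same in-memory class here.
CONNECTORS = {
    "incident-platform": AppendOnlyLog,
    "issue-tracker": AppendOnlyLog,
    "itsm": AppendOnlyLog,
}


def open_backend(config: dict[str, str]) -> AppendOnlyLog:
    """Pick the shared memory backend from configuration alone."""
    return CONNECTORS[config["shared_memory"]]()


log = open_backend({"shared_memory": "issue-tracker"})
log.append("finding: connection pool exhausted")
log.append("action: raised pool size")
log.append("resolution: error rate back to baseline")
```

Because the log exposes no update or delete method, replaying `log.history()` reconstructs the incident in order, and switching backends means changing one config value rather than agent code.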

The Practical Benefits

The architecture delivers straightforward advantages:

Investigations don't wait on external system writes
No context loss when conversations end
Next on-call engineers pick up where previous ones left off
Every action from both agents appears in systems humans already monitor
Adding new partner agents or shared memory backends requires only connector changes

The Fast Path vs. The Durable Path

The architecture distinguishes between two critical paths: the fast path for investigation and the durable path for everything else. This separation ensures that real-time analysis remains responsive while still capturing comprehensive context for future reference and human oversight.

The result is an incident response system where AI agents collaborate effectively, preserve critical context, and integrate seamlessly with existing human workflows—eliminating the midnight handoffs and lost investigations that plague modern operations teams.
