Why Continuous Monitoring Is Essential for AI Agents and How Microsoft Foundry Delivers It

Microsoft Foundry’s GA observability suite—Evaluation, Monitoring, and Tracing—extends Azure Monitor to cover AI‑specific failure modes such as hallucinations, policy violations, and retrieval drift. The article explains why classic APM is insufficient for production agents, describes the three pillars of Foundry’s AI observability, shows how they integrate with Azure Monitor, Defender for Cloud, and Microsoft Purview, and outlines cost‑governance and compliance considerations for enterprises.

What changed

Microsoft announced that Evaluations, Monitoring, and Tracing are now generally available in the Foundry Control Plane (March 2026) and are integrated with Azure Monitor. The "New Foundry" portal experience adds a unified UI for AI‑specific observability, security, and cost governance, turning AI quality from a pre‑deployment checklist item into a live operational signal.

Provider comparison – Why AI monitoring needs a different stack

Aspect	Traditional APM (e.g., App Insights, New Relic)	Microsoft Foundry AI Observability
Primary metrics	CPU, memory, latency, error rate	Groundedness, retrieval quality, safety alignment, custom evaluator scores
Failure modes covered	Crashes, timeouts, high latency	Subtle answer drift, hallucinations, policy breaches, data‑leakage, retrieval pipeline drift
Sampling model	Health metrics usually computed over all requests	Configurable evaluation sampling (5‑10% recommended) to balance evaluator cost against detection risk
Evaluation reuse	Separate test suites, not reused in prod	Same evaluator definitions run locally, in CI/CD, and on live traffic – scores are directly comparable
Alerting & RBAC	Generic alerts, custom RBAC per service	Azure Monitor alerts on any AI metric, inherited RBAC, retention, and audit logging
Dashboarding	Generic service health dashboards	AI‑native Agent Monitoring Dashboard plus Grafana integration for a single pane of glass
Cost visibility	Indirect (through compute metrics)	Direct token‑usage, quota, and cost trend reporting per model/agent
Security & compliance	Optional add‑ons	Built‑in guardrails, Defender for Cloud AI workload protection, Microsoft Purview data governance

Key takeaway: Foundry adds AI‑first signals and writes them to the same Azure Monitor workspace that already holds your infrastructure telemetry, so you can correlate AI quality with infrastructure behavior in one place, something classic APM tools do not offer on their own.
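To make the sampling row of the comparison table concrete, here is a back-of-the-envelope sketch. If a fraction p of responses are bad and each response is picked for evaluation independently with probability s, the chance that at least one bad response is evaluated among N requests is 1 - (1 - p·s)^N, while evaluator spend grows linearly with s·N. The failure rate, traffic volume, and per-evaluation cost below are hypothetical numbers chosen only for illustration, not Foundry pricing or measured rates.

```python
"""Back-of-the-envelope sampling trade-off (all numbers are hypothetical)."""


def detection_probability(failure_rate: float, sample_rate: float, requests: int) -> float:
    """Chance that at least one bad response gets evaluated.

    Assumes each response is bad independently with probability `failure_rate`
    and is sampled for evaluation independently with probability `sample_rate`.
    """
    per_request = failure_rate * sample_rate
    return 1.0 - (1.0 - per_request) ** requests


def evaluation_cost(sample_rate: float, requests: int, cost_per_eval: float) -> float:
    """Expected evaluator spend: number of evaluated responses times a unit cost."""
    return sample_rate * requests * cost_per_eval


if __name__ == "__main__":
    FAILURE_RATE = 0.002       # hypothetical: 0.2% of answers are ungrounded
    REQUESTS_PER_DAY = 24_000  # hypothetical daily traffic
    COST_PER_EVAL = 0.01       # hypothetical cost of one judge-model call, in USD

    for rate in (0.01, 0.05, 0.10):
        p = detection_probability(FAILURE_RATE, rate, REQUESTS_PER_DAY)
        c = evaluation_cost(rate, REQUESTS_PER_DAY, COST_PER_EVAL)
        print(f"sample {rate:.0%}: P(catch within a day) = {p:.2f}, eval cost/day = ${c:.2f}")
```

With these assumed figures, moving from 5% to 10% sampling doubles evaluator cost while the daily detection probability rises only from about 0.91 to 0.99, which is the kind of diminishing return that motivates a modest default rate.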

Business impact

1. Continuous quality assurance

Problem: Model updates, prompt changes, and evolving data sources can silently degrade output quality. Traditional gate‑based testing catches only pre‑release regressions.
Foundry solution: Built‑in evaluators (Coherence, Groundedness, Retrieval Quality, Safety) run on a configurable sample of live traffic. Because the same evaluator definitions run in local dev, CI/CD, and production, their scores are directly comparable, so a drop in a CI score signals an issue that is likely to surface in production.
Impact: Early detection of hallucinations or policy violations reduces downstream remediation costs and protects brand reputation.
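The point that one evaluator definition serves local dev, CI/CD, and production can be sketched as follows. `toy_groundedness` is a hypothetical stand-in (a crude word-overlap score), not Foundry's built-in Groundedness evaluator, which relies on a judge model; the hash-based sampler and the 0.7 CI threshold are likewise assumptions. What the sketch shows is that the CI gate and the production sampler call the same function, so their scores share one scale.

```python
import hashlib
import re
from typing import Optional


def _words(text: str) -> set:
    """Lower-cased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_groundedness(answer: str, context: str) -> float:
    """Hypothetical evaluator: share of answer words that also appear in the context (0..1)."""
    answer_words = _words(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & _words(context)) / len(answer_words)


def should_sample(request_id: str, sample_rate: float) -> bool:
    """Deterministic sampling: hash the request ID into [0, 1) and compare it to the rate."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate


def ci_gate(cases: list, threshold: float = 0.7) -> bool:
    """CI/CD use: pass only if the mean score over (answer, context) test cases meets the threshold."""
    if not cases:
        raise ValueError("the CI test set is empty")
    scores = [toy_groundedness(answer, context) for answer, context in cases]
    return sum(scores) / len(scores) >= threshold


def score_live(request_id: str, answer: str, context: str, sample_rate: float = 0.05) -> Optional[float]:
    """Production use: the same evaluator, applied only to the sampled subset of traffic."""
    if not should_sample(request_id, sample_rate):
        return None
    return toy_groundedness(answer, context)
```

Hashing the request ID instead of calling `random.random()` makes the sampling decision reproducible: replaying the same traffic evaluates the same requests.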

2. Faster root‑cause analysis

Problem: When a quality metric dips, teams spend hours correlating logs, tracing requests, and guessing whether the cause is a model version, a retrieval index, or an infrastructure bottleneck.
Foundry solution: Evaluation results are linked directly to OpenTelemetry traces that capture every step—model call, tool invocation, retrieval query, orchestration logic. Clicking a low groundedness score jumps to the exact trace that produced it.
Impact: Mean time to identify (MTTI) can drop from hours to minutes, letting SREs remediate or roll back a model version before users notice.
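A minimal sketch of the score-to-trace link described above, assuming each evaluation record carries the trace ID of the request it scored. The `Span` and `EvalResult` types and the in-memory lists are hypothetical stand-ins for what Foundry stores in Azure Monitor and Application Insights; only the join from a low score to its execution path is the point.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    name: str           # e.g. "retrieval_query", "model_call", "tool_invocation"
    start_ms: float     # offset from the start of the request
    duration_ms: float


@dataclass
class EvalResult:
    trace_id: str       # links the score back to the request that produced it
    evaluator: str
    score: float


def worst_trace(results: list, spans: list, evaluator: str):
    """Return the lowest-scoring result for `evaluator` and its spans in execution order."""
    candidates = [r for r in results if r.evaluator == evaluator]
    if not candidates:
        return None, []
    worst = min(candidates, key=lambda r: r.score)
    path = sorted((s for s in spans if s.trace_id == worst.trace_id), key=lambda s: s.start_ms)
    return worst, path


if __name__ == "__main__":
    results = [EvalResult("t1", "groundedness", 0.92), EvalResult("t2", "groundedness", 0.41)]
    spans = [
        Span("t2", "model_call", 120.0, 850.0),
        Span("t2", "retrieval_query", 0.0, 120.0),
        Span("t1", "model_call", 0.0, 700.0),
    ]
    worst, path = worst_trace(results, spans, "groundedness")
    print(worst.trace_id, [s.name for s in path])
```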

3. Unified alerting and governance

Problem: Security teams manage separate alert pipelines for infrastructure and AI‑specific risks (prompt injection, data leakage).
Foundry solution: Azure Monitor alerts can be set on any AI metric (e.g., safety‑alignment score < 0.8). Alerts feed into existing incident‑response tools (Teams, PagerDuty, Azure Automation runbooks). Guardrail policies are enforced through Azure Policy, Defender for Cloud, and Purview, with audit logs stored in the same workspace.
Impact: Consolidated alerting reduces alert fatigue, while consistent RBAC and automatic compliance reporting simplify audits and reduce the overhead of managing disparate security solutions.
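Azure Monitor evaluates metric alert rules server-side, so no code is needed to get the behavior described above. The sketch below only illustrates the logic such a rule applies, assuming a rule that averages the safety-alignment score over a sliding window of recent evaluations and fires when that average falls below 0.8. The class name, window size, and minimum sample count are hypothetical.

```python
from collections import deque
from typing import Optional


class WindowedThresholdAlert:
    """Hypothetical 'mean score below threshold over a window' alert condition."""

    def __init__(self, threshold: float = 0.8, window: int = 50, min_samples: int = 20):
        self.threshold = threshold
        self.min_samples = min_samples
        self.scores = deque(maxlen=window)  # keeps only the most recent `window` scores
        self.firing = False

    def observe(self, score: float) -> Optional[str]:
        """Record one evaluator score; return "fired" or "resolved" only on state changes."""
        self.scores.append(score)
        if len(self.scores) < self.min_samples:
            return None  # too few sampled evaluations to judge yet
        mean = sum(self.scores) / len(self.scores)
        breached = mean < self.threshold
        if breached and not self.firing:
            self.firing = True
            return "fired"
        if not breached and self.firing:
            self.firing = False
            return "resolved"
        return None


if __name__ == "__main__":
    alert = WindowedThresholdAlert(threshold=0.8, window=10, min_samples=5)
    for i, score in enumerate([0.9] * 5 + [0.5] * 5 + [0.95] * 10):
        event = alert.observe(score)
        if event:
            print(f"evaluation {i}: alert {event}")
```

Emitting only state changes, rather than a verdict per sample, mirrors stateful alerting: one notification when the condition is met and one when it resolves, which keeps a sustained dip from paging the on-call engineer repeatedly.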

4. Cost governance at scale

Problem: Token consumption drives the bulk of generative‑AI spend, yet many organizations lack visibility into per‑agent usage.
Foundry solution: The Agent Monitoring Dashboard surfaces token counts, latency, and success rates per agent. The Operate > Quota pane lists model deployments with real‑time consumption, and Azure Cost Management budget alerts can flag spending spikes driven by token usage.
Impact: Teams can identify verbose prompts, over‑sampling, or runaway tool calls early, adjusting sampling rates or prompting strategies to keep spend within budget.
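Per-agent cost attribution comes down to multiplying token counts by model prices. A minimal sketch, assuming usage records shaped like the hypothetical `UsageRecord` below and placeholder per-1,000-token prices; real prices vary by model, region, and deployment type, so read them from your own pricing data rather than from this table.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class UsageRecord:
    agent: str
    model: str
    prompt_tokens: int
    completion_tokens: int


# Placeholder prices in USD per 1,000 tokens: (prompt, completion). NOT real list prices.
PRICES = {"model-a": (0.0025, 0.01), "model-b": (0.00015, 0.0006)}


def cost_by_agent(records: list) -> dict:
    """Sum estimated spend per agent from token counts and the placeholder price table."""
    totals = defaultdict(float)
    for r in records:
        prompt_price, completion_price = PRICES[r.model]
        totals[r.agent] += (r.prompt_tokens * prompt_price + r.completion_tokens * completion_price) / 1000
    return dict(totals)


def over_budget(totals: dict, budget_per_agent: float) -> list:
    """Agents whose estimated spend exceeds a hypothetical per-agent budget, highest spend first."""
    flagged = [agent for agent, cost in totals.items() if cost > budget_per_agent]
    return sorted(flagged, key=lambda agent: totals[agent], reverse=True)
```

Splitting prompt and completion tokens matters because the two are often priced differently, and a high prompt-token share is one signal of the verbose prompts mentioned above.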

How the three pillars work together

Continuous Evaluation – Runs built‑in or custom evaluators on sampled responses. Results are stored as Azure Monitor metrics.
Integrated Monitoring – Azure Monitor ingests those metrics alongside latency, token usage, and health data. Dashboards, alerts, and RBAC apply automatically.
End‑to‑End Tracing – OpenTelemetry traces are emitted for every request. Evaluation scores are linked to trace IDs, so a low score opens the exact execution path.

Together, the three pillars turn the chain quality signal → alert → trace into an operational workflow that mirrors classic incident management for web services.

Practical start‑up guidance (high‑level)

Step	Action	Reason
1	Enable the New Foundry toggle and connect an Application Insights resource to your project.	Required for telemetry ingestion and dashboard rendering.
2	In the Agent Monitor tab, turn on Continuous Evaluation, select the built‑in evaluators you need, and set a sampling rate (5‑10%).	Begins the flow of quality metrics into Azure Monitor.
3	Assign the project’s managed identity the Azure AI User role.	Grants the service permission to invoke evaluator models.
4	Create Azure Monitor alert rules on evaluator metrics (e.g., groundedness < 0.7) and attach an Action Group that posts to Teams or triggers a runbook.	Routes quality regressions into existing incident‑response channels.
5	Add a custom LLM‑as‑a‑Judge or code‑based evaluator via the portal’s Custom Evaluators UI.	Needed only if you have domain‑specific quality criteria.
6	Enable Defender for Cloud AI workload protection and, where required, Microsoft Purview data governance from the Operate > Compliance pane.	Adds security guardrails and data governance alongside quality monitoring.
7	Review the Agent Monitoring Dashboard daily and drill into any outlier trace.	Shows the exact request path and token usage behind each outlier.
8	Adjust the sampling rate or add evaluators as you learn the variance of your traffic and the cost impact.	Balances detection coverage against evaluator cost.
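The advice in step 8 to tune sampling as you learn your traffic's variance can be made quantitative with the standard sample-size formula for estimating a mean: to estimate an average evaluator score within ±E at roughly 95% confidence, you need about n = (z·σ/E)² evaluated responses, where σ is the observed standard deviation of the score and z ≈ 1.96. Dividing n by the traffic in your evaluation window gives a sampling rate. The pilot scores, margin, and hourly traffic below are assumptions for illustration. This sizes sampling for tracking the average score; catching rare individual failures may call for a higher rate.

```python
import math
import statistics


def required_samples(score_stdev: float, margin: float, z: float = 1.96) -> int:
    """Evaluated responses needed to estimate the mean score within ±margin (normal approximation)."""
    return math.ceil((z * score_stdev / margin) ** 2)


def sampling_rate(score_stdev: float, margin: float, requests_in_window: int) -> float:
    """Fraction of traffic to evaluate so the window yields enough samples, capped at 100%."""
    return min(1.0, required_samples(score_stdev, margin) / requests_in_window)


if __name__ == "__main__":
    # Hypothetical scores from a pilot period, used only to estimate the score's spread.
    pilot_scores = [0.91, 0.84, 0.77, 0.95, 0.62, 0.88, 0.90, 0.73, 0.81, 0.86]
    sigma = statistics.stdev(pilot_scores)
    needed = required_samples(sigma, margin=0.02)
    rate = sampling_rate(sigma, margin=0.02, requests_in_window=1_000)  # hypothetical hourly traffic
    print(f"stdev={sigma:.3f}, samples needed per hour={needed}, sampling rate={rate:.1%}")
```

With these assumed inputs the computed rate lands just under 10%; a noisier score or a tighter margin pushes it up, and higher traffic pushes it down.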

Bottom line

Microsoft Foundry’s GA observability suite gives enterprises the same operational rigor they apply to traditional services (metrics, alerts, tracing, and governance) but tuned for the unique failure modes of AI agents. By embedding evaluation, monitoring, and tracing into Azure Monitor, organizations can detect quality regressions early, pinpoint root causes quickly, enforce security guardrails, and keep token‑driven spend under control. The next article in this series will walk through the exact portal clicks and code snippets needed to wire a real agent to Application Insights, set up continuous evaluation, and configure AI‑specific alerts.

#AzureMonitor #MicrosoftFoundry #AIObservability #ContinuousEvaluation #TokenCostGovernance