Veris AI and Lume Security demonstrate how to turn production AI agent failures into targeted improvements using simulation environments and automated optimization, achieving 40% accuracy gains without regressions.
AI agents are slowly moving from demos into production, where real users, messy systems, and long-tail edge cases lead to new unseen failure modes. We show how a high-fidelity simulation environment built by Veris AI on Microsoft Azure can expand production failures into families of realistic scenarios, generating enough targeted data to optimize agent behavior through automated context engineering and reinforcement learning, all while not regressing on any previous issue. We demonstrate this on a security agent built by Lume Security on Microsoft Foundry.
The Challenge of Production AI Agents
AI agents are slowly moving from demos into production, where real users, messy systems, and long-tail edge cases lead to new unseen failure modes. Production monitoring can surface issues, but converting those into reliable improvements is slow and manual, bottlenecked by the low volume of repeatable failures, engineering time, and risk of regression on previously working cases.
Lume Security's Context-Aware Platform
Lume creates an institutionally grounded security intelligence graph that captures how organizations actually work—this intelligence graph then powers deterministic, policy-aligned agents that reason alongside the user and take trusted action across security, compliance, and IT workflows. Lume helps modern security teams scale expertise without scaling headcount.
Its Security Intelligence Platform builds a continuously learning security intelligence graph that reflects how work actually gets done. It reasons over policies, prior tickets, security findings, tool configurations, documentation, and system activity to retain institutional memory and drive consistent response.
On this foundation, Lume delivers context-aware security assistants that fetch the most relevant context at the right time, produce policy-aligned recommendations, and execute deterministic actions with full explainability and audit trails. These assistants are embedded in tools teams are already using like ServiceNow, Jira, Slack, and Teams so the experience is seamless and provides value from day one.
Microsoft Foundry's Enterprise Control Plane
Microsoft Foundry and Veris AI together provide Lume with a secure, repeatable control plane that makes model iteration, safety, and simulation-driven evaluation practical:
- Vendor flexibility: Swap or test different models from OpenAI, Anthropic, Meta, and many others, with no infra changes
- Fast model rollout: Provide access to the latest models as soon as they are released, making experimentations and updates easy
- Consistent safety: Built-in policy and guardrail tooling enforces the same checks across experiments, cutting bespoke guardrail work
- Enterprise privacy: Models run in private Azure instances and are not trained on client data, which simplifies and shortens AI security reviews
- Made evaluation practical: Centralized models, logging, and policies let Veris run repeatable simulations and feed targeted failure cases back into Lume's improvement loop
The Architecture Behind the Solution
By implementing this system on Azure stack, Lume and Veris benefited from the flexibility and breadth of services enabling this large scale implementation. This architecture allows the agent to evolve independently across reasoning, retrieval, and action layers without destabilizing production behavior.
Microsoft Foundry orchestrates model usage, prompt execution, safety policies, and evaluation hooks. Azure AI Search provides hybrid retrieval across structured documents, policies, and unstructured artifacts. Vector storage enables semantic retrieval of prior tickets, decisions, and organizational knowledge.
Graph databases capture relationships between systems, controls, owners, and historical decisions, allowing the agent to reason over organizational structure rather than isolated facts. Azure Kubernetes Service (AKS) hosts the agent runtime and evaluation workloads, enabling horizontal scaling and isolated experiments.
Azure Key Vault manages secrets, credentials, and API access securely. Azure Web App Services power customer-facing interfaces and internal dashboards for visibility and control.
Why Traditional Evaluation Falls Short
Evaluating and ultimately improving agents is still one of the hardest parts of the stack. Static datasets don't solve this, because agents are not static predictors. They are dynamic, multi-turn systems that change the world as they act: they call tools, accumulate state, recover (or fail to recover) from errors, and face nondeterministic user behavior.
A golden dataset might grade a single response as "correct," while completely missing whether the agent actually achieved the right end state across systems. In other words: static evals grade answers; agent evals must grade outcomes.
Veris AI's Simulation-Driven Approach
At the same time, getting enough real-world data to drive improvement is perpetually difficult. The most valuable failures are rare, long-tail, and hard to reproduce on demand. Iterating directly in production is not possible because of safety, compliance, and customer-experience risk.
That's why environments are essential: a place to test agents safely, generate repeatable experience, and create the signals that allows developers to improve behavior without gambling on live users.
Veris AI is built around that premise. It is a high-fidelity simulation platform that models the world around an agent with mock tools, realistic state transitions, and simulated users with distinct behaviors.
From a single observed production failure, Veris can reconstruct what happened, expand it into a family of realistic scenario variants, and then stress-test the agent across that scenario set. Those runs are evaluated with a mix of LLM-based judges and code-based verifiers that score full trajectories not just the final text.
Crucially, Veris does not stop at measurement. Its optimization module uses those grader signals to improve the agent with automatically refining prompts and supporting reinforcement-learning style updates (e.g., RFT) in a closed loop.
The result is a workflow where one production incident can become a repeatable training and regression suite, producing targeted improvements on the failure mode while protecting performance on everything that already worked.
The Self-Improving Loop in Action
When a failure appears in production, the improvement loop starts from the trace. Veris takes the failed session logs from the observability pipeline, then an LLM-based evaluator pinpoints what actually went wrong and automatically writes a new targeted evaluation rubric for that specific failure mode.
From there, Veris simulator expands the single incident into a scenario family, dozens of realistic variants with branching, so the agent can be trained and re-tested against the whole "class" of the failure, not just one example. This is important as failure modes are often sparse in the real-world.
Those scenarios are executed in the simulation engine with mocked tools and simulated users, producing full interaction traces that are then scored by the evaluation engine. The scores become the training signal for optimization.
Veris optimization engine uses the evaluation results, the original system prompt, and the new rubric to generate an updated prompt designed to fix the failure without breaking previously-good behavior. Then it validates the new prompt on both (a) the specialized scenario set created from the incident and (b) a broader regression held-out set suite to ensure the improvement generalizes and does not cause regressions.
Real-World Results: 40% Accuracy Improvement
In this case study, we focused on a key failure mode of incorrect workflow classification at the triage step for approval-driven access requests. In these cases, tickets were routed into an escalation or in-progress workflow instead of the approval-validation path.
The ticket often contained conflicting approval or rejection signals in human comments - manager approved but they required some additional information, a genuine occurrence in org related jira workflows. The triage agent failed to recognize them and misclassified the workflow state.
Since triage determines what runs next, a single misclassification at the start was enough to bypass the Approval Agent and send the request down an incorrect downstream path.
By running the optimization loop, we were able to improve agent accuracy on this issue by over 40%, while not regressing on any other correct behavior. We ran experiments on a dataset only containing scenarios with the issue and a more general dataset encompassing a variety of scenarios.
Business Impact and Takeaways
This collaboration shows a practical pattern for taking an enterprise agent from "works in a demo" to "improves safely in production," pairing an orchestration layer that standardizes model usage, safety, logging, and evaluation with a simulation environment where failures can be reproduced and fixed without risking real users, then making it repeatable in practice.
In this stack, Veris AI provides the simulations, trajectory grading, and optimization loop, while Microsoft Foundry operationalizes the workflow with vendor-flexible model iteration, consistent policy enforcement, enterprise privacy, and centralized evaluation hooks that turn testing into a first-class system instead of bespoke glue code.
The result is an improved and reliable Lume assistant that can help enterprises spend 35–55% less time on routine requests, and meaningfully deflect repetitive escalations, requiring fewer clarification cycles, enabling faster response times, and a stronger institutional memory where decisions and rationales compound over time instead of getting lost across tools and tickets.
The self-improving loop continuously improves the agent, starting from a production trace, by generating a targeted rubric, expanding the incident into a scenario family, running it end-to-end with mocked tools and simulated users, scoring full trajectories, then using those scores to produce a safer prompt/model update and validating it against both the failure set and a broader regression suite.
This turns rare long-tail failures into repeatable training and regression assets.
If you're building an AI agent, the recommendation is straightforward. Invest early in an orchestration and safety layer, as well as an environment-driven evaluation that can create an improvement loop to ship fixes without regressions. This way, your production failures act as the highest-signal input to continuously harden the system.
To learn more or start building, teams can explore the Veris Console for free or browse the Veris documentation. In this stack, Microsoft Foundry provides the orchestration and safety control plane, and Veris provides the simulation, evaluation, and optimization loop needed to make agents improve safely over time.
Learn more about the Lume Security assistant here.

Comments
Please log in or register to join the discussion