Microsoft's Responsible AI team released ASSERT, an MIT-licensed framework that converts plain-language behavior requirements into runnable evaluation pipelines for models and agents. For teams standardizing on agentic workloads across cloud providers, it offers a portable, inspectable alternative to the vendor-locked evaluation tooling baked into each platform.
Microsoft's Responsible AI team has released ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing), an open-source framework that converts natural-language behavior specifications into executable evaluations for AI models, applications, and agents. It is available today under the MIT license, with a project site and a worked example included in the repository.
The release matters because evaluation has quietly become one of the harder, more vendor-entangled parts of shipping AI systems. Every major cloud provider now ships its own evaluation tooling, and most of it is wired tightly into that provider's model catalog and runtime. ASSERT takes a different posture: it treats the behavior specification itself as the portable artifact, then generates the test scenarios, datasets, metrics, and scorecards around it. That distinction has real consequences for teams running agents across more than one cloud.

What changed
The core problem ASSERT targets is familiar to anyone who has tried to validate an agent against product-specific rules. Generic evaluators such as helpfulness, relevance, groundedness, toxicity, and faithfulness give you signal, but they don't directly test the boundaries your application actually cares about. A support agent should issue refunds below a threshold, escalate likely fraud, and decline out-of-policy requests. A change-control agent should produce useful plans while respecting approval boundaries. A system can score well on generic metrics while quietly failing every application-specific requirement that matters.
The behaviors a team wants are usually written down somewhere already: a product requirement, a policy document, a system prompt, a launch checklist. The expensive part is turning that intent into an eval suite that is specific enough to run, inspect, and keep current as the system changes. Those evaluations rarely exist off the shelf, are slow to build, hard to validate, and quick to go stale as policies evolve and models improve.
ASSERT's answer is a four-stage pipeline that makes the specification a first-class input rather than background context.
First, systematization turns a broad idea like harmful financial advice or tool-use governance into an explicit concept specification, a structured set of patterns, definitions, edge cases, and operational distinctions rather than a single label. Following the approach in Agarwal et al. (2026), it grounds the concept in prior work and reconciles multiple practical definitions.
Second, taxonomization converts that specification into an editable taxonomy of permissible and impermissible behaviors. Developers and policy experts review and revise it before anything runs.

Third, test-set generation instantiates the taxonomy into executable cases, single-turn prompts or multi-turn scenarios, including benign interactions and adversarial probes. Developers declare the dimensions that matter, such as task type, persona, tool availability, request class, or environment configuration, and ASSERT builds a stratified set so behavior is tested across conditions instead of a narrow slice of easy examples.
Fourth, inference and scoring runs the cases against the target and records the full trace, including tool calls, retrieved context, routing decisions, and intermediate actions. Each trace is then scored against the behavior taxonomy and its policy stance, producing not just a pass-or-flag label but a rationale, a policy citation, and the specific turn or action that justified the verdict.

How it compares to provider-native evaluation
For architects weighing this against what the hyperscalers already offer, the comparison comes down to portability, transparency, and lock-in rather than raw feature count.
Azure AI Foundry, Amazon Bedrock, and Google Vertex AI each ship evaluation services. Bedrock offers model evaluation and LLM-as-a-judge scoring tied to its hosted catalog. Vertex provides its Gen AI evaluation service with similar reference-based and pointwise metrics. Azure's own evaluation SDK overlaps heavily with where ASSERT sits. The practical limitation across all three is the same: the evaluation logic, the judge configuration, and the result artifacts tend to live inside the provider's console and pricing meter. Moving an eval suite from one to another usually means rebuilding it.
ASSERT inverts that. The behavior is described in YAML, pointed at a real agent, and the artifacts come back locally. As CrewAI's open-source lead Lorenze Jay put it, the spec, generated cases, model outputs, judge rationales, and metrics are all inspectable on your own machine, which makes the eval feel auditable rather than like a black box. Because the target can be a model, an agent, or an application-level workflow, the same specification can be run against a model hosted in Azure today and an equivalent model on Bedrock or Vertex tomorrow without rewriting the evaluation. The instrumentation layer captures traces from agentic frameworks directly; the repository's reference example uses a multi-agent LangGraph travel planner rather than any single vendor's agent runtime.
That said, the trade-offs are real and worth pricing honestly. Provider-native tools give you managed infrastructure, integrated billing, and no operational overhead. ASSERT is self-hosted, so you carry the cost of running it and supplying judge models. And the judges themselves are where most of the variance lives. Microsoft's own validation found agreement between LLM judges and human annotators typically in the 80 to 90 percent range, against roughly 90 percent human inter-annotator agreement. Different judge models vary in strictness and boundary sensitivity, so judge selection becomes its own decision with its own cost profile, regardless of which cloud hosts the system under test.
What the validation showed
Microsoft ran a coverage study across five behaviors: social scoring, sycophancy, task adherence, tool-use governance, and unsafe health guidance. Compared with an in-house baseline starting from the same written intent, ASSERT covered roughly 1.2x as much of the intended behavior space, surfaced about 1.5x as many cases worth inspecting, produced more than 4x stronger separation between stronger and weaker systems, and had about half as many saturated cases where every model behaved identically. It also surfaced roughly 2x as many distinct failure patterns, though the team flags that result as directional since failure-type labeling is harder to stabilize.
The design lesson underneath those numbers is that coverage is largely determined upstream. If the behavior is underspecified, the generated dataset will be too. Making the behavior explicit before generation begins is what produced broader, better-aligned evaluation sets. Subject-matter experts who reviewed 15 generated datasets found them generally well aligned with the intended policy and risk boundaries, a complementary check that the output was credible to people inspecting it directly.
Business impact
The worked example makes the strategic value concrete. The travel-planning agent runs in a six-turn budget across five tools (search_flights, search_hotels, check_weather, check_travel_advisories, validate_budget) and is evaluated against six failure modes split across quality and safety: wrong or skipped tool use, fabricated prices, budget violations, stereotyping, prompt injection from tool output, and sycophantic agreement with unsafe itineraries.

The revealing result is that the dimensions stay separated rather than collapsing into a single number. The same five scenarios produced 40 percent over-refusal and 60 percent policy violation, and those are not the same failures. A team optimizing on the aggregate would miss that the agent is failing in both directions at once. That decomposition is the kind of detail that gets flattened by a dashboard metric and is exactly what cost and risk owners need when deciding whether an agent is ready for production.
For organizations running a multi-cloud or hybrid AI strategy, the calculation is straightforward enough. Provider-native evaluation reduces operational burden but deepens lock-in and leaves your evaluation criteria living inside someone else's billing system. A portable, spec-driven framework like ASSERT keeps the institutional knowledge about what correct behavior means as a versioned asset you own, applicable across whichever models and clouds you negotiate onto next. That portability has migration value: the eval suite is no longer a sunk cost that has to be rebuilt every time a procurement decision changes.
ASSERT does not remove the need for judgment. Vague specifications still produce vague scenarios, synthetic interactions can miss production-only failures, and model-based judges can be unreliable when a policy distinction is subtle. It should not be treated as a compliance certification or a substitute for human review and telemetry. Understood correctly, it is a way to make evaluation faster, more explicit, and cheaper to iterate on, and to keep that capability from being captured by any one provider.

Comments
Please log in or register to join the discussion