Answer Synthesis in Foundry IQ and Azure AI Search: A 10,000-Query Quality Analysis

Cloud Reporter

Microsoft's agentic retrieval engine now generates grounded, cited answers directly from search results, eliminating the need for separate orchestration. A comprehensive analysis across 10,000 real-world queries reveals how the system performs under different conditions, from full context to missing information, and how it handles multi-language and multi-industry scenarios.

Answer synthesis in Foundry IQ and Azure AI Search represents a significant shift in how retrieval-augmented generation (RAG) systems deliver value. Instead of returning raw document snippets that require additional processing, the agentic retrieval engine now generates complete, natural language answers with inline citations, directly from the retrieved content. This integration streamlines the development of end-to-end RAG solutions for internal copilots, customer support bots, and knowledge management tools.

How Answer Synthesis Works

When a query arrives, the system first retrieves the most relevant content from the knowledge base. If answer synthesis is enabled, an LLM processes this retrieved content to generate a coherent response. The answer includes inline citations (e.g., [ref_id:4]) that link factual statements to their supporting documents, and a references array that provides metadata about the original sources.
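To make that output concrete, here is a minimal Python sketch that pulls the [ref_id:N] markers out of a synthesized answer and joins them to the references array. The payload shape and field names below are illustrative assumptions, not the exact API contract.

```python
import re

# Illustrative response shape; field names are assumptions, not the exact API contract.
response = {
    "answer": "Restarting the gateway clears the stale session cache [ref_id:4]. "
              "If the issue persists, rotate the API key [ref_id:7].",
    "references": [
        {"ref_id": 4, "title": "Gateway troubleshooting guide", "sourceUrl": "https://example.com/kb/4"},
        {"ref_id": 7, "title": "API key rotation", "sourceUrl": "https://example.com/kb/7"},
    ],
}

# Extract the inline citation markers ([ref_id:N]) and join them to their source metadata.
cited_ids = {int(m) for m in re.findall(r"\[ref_id:(\d+)\]", response["answer"])}
cited_sources = [ref for ref in response["references"] if ref["ref_id"] in cited_ids]

for ref in cited_sources:
    print(f'[{ref["ref_id"]}] {ref["title"]} -> {ref["sourceUrl"]}')
```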

A key capability is the system's ability to handle partial information. If the retrieved content addresses only some aspects of a query, the system generates a partial answer rather than returning a generic "No relevant content was found" message. This approach provides users with useful information that can guide follow-up queries, improving the overall search experience.

The system also supports natural language steering instructions, allowing developers and users to customize the answer's format, style, and language. Instructions can be provided at the knowledge base level (by developers) or within individual queries (by users). When conflicts arise, user-provided instructions take priority over developer-provided ones, which in turn take priority over system defaults.
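That precedence rule is easy to picture as a simple merge. The Python sketch below only illustrates the ordering described above (user over developer over system defaults); the service resolves conflicts internally, and the field names here are hypothetical.

```python
# Hypothetical helper showing the precedence described above:
# user query instructions > knowledge-base (developer) instructions > system defaults.
SYSTEM_DEFAULTS = {"language": "match the query language", "format": "concise prose"}

def resolve_instructions(developer: dict, user: dict) -> dict:
    """Merge steering instructions, letting user-provided values win on conflict."""
    resolved = dict(SYSTEM_DEFAULTS)
    resolved.update(developer)   # developer overrides system defaults
    resolved.update(user)        # user overrides developer
    return resolved

# Example: developer asks for bullet points, user asks for a table; the user wins.
print(resolve_instructions(
    developer={"format": "bullet points", "tone": "formal"},
    user={"format": "a two-column table"},
))
# {'language': 'match the query language', 'format': 'a two-column table', 'tone': 'formal'}
```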

Quality Measurement Framework

To evaluate answer quality comprehensively, the team developed metrics across five dimensions:

  1. Percentage of answered queries: Measures how often the system produces a substantive answer versus a rejection message.
  2. Answer relevance: Assesses how well the generated answer addresses the user's query (scored 0-100).
  3. Groundedness: Evaluates the extent to which the answer is supported by retrieved content rather than hallucinated.
  4. Citations quality: Measures whether citation-delimited sections of the answer are actually supported by the cited documents.
  5. Steering instructions compliance: Determines if the answer follows provided formatting and language instructions.

All metrics are calculated using LLMs as judges, with the exception of the answered queries percentage, which uses a simple classifier. For groundedness and citation quality, the team adapted the "nuggets of information" approach from the TREC 2024 RAG Track evaluation, extracting atomic factual claims and verifying their support in the retrieved content.
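As a rough illustration of the nugget-style scoring, the following Python sketch computes a groundedness percentage from atomic claims. The extract_nuggets and judge_support callables stand in for LLM-judge prompts the reader would supply; they are placeholders rather than part of any Azure SDK, and the toy lambdas at the bottom exist only so the sketch runs end to end.

```python
from typing import Callable, List

# Sketch of nugget-style groundedness scoring in the spirit of the TREC 2024 RAG Track
# approach: split the answer into atomic claims, then check each claim against the
# retrieved content.

def groundedness_score(
    answer: str,
    retrieved_docs: List[str],
    extract_nuggets: Callable[[str], List[str]],
    judge_support: Callable[[str, List[str]], bool],
) -> float:
    """Return the percentage of atomic claims in the answer supported by retrieved content."""
    nuggets = extract_nuggets(answer)
    if not nuggets:
        return 0.0
    supported = sum(judge_support(nugget, retrieved_docs) for nugget in nuggets)
    return 100.0 * supported / len(nuggets)   # 0-100, matching the scale used in this post

# Toy stand-ins so the sketch runs without an LLM.
docs = [
    "The gateway caches sessions for 15 minutes.",
    "API keys can be rotated from the Azure portal.",
]
score = groundedness_score(
    answer="Sessions are cached for 15 minutes. Keys are rotated from the portal.",
    retrieved_docs=docs,
    extract_nuggets=lambda a: [s.strip() for s in a.split(".") if s.strip()],
    judge_support=lambda claim, ds: any(claim.lower().split()[-1] in d.lower() for d in ds),
)
print(round(score, 1))  # 100.0 for this toy example
```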

Experimental Setup and Datasets

The evaluation used three production-relevant datasets:

  • Customer: Document sets shared by Azure customers with permission
  • Support: Hundreds of thousands of publicly available support and knowledge base articles in 8 languages
  • Multi-industry, Multi-language (MIML): 60 indexes representing 10 customer segments and 6 languages, containing over 10,000 queries

Queries were paired with steering instructions from a pool of 70 options to test compliance. Additionally, the team created three information-level benchmarks using the MIML dataset:

  • Full information: Queries paired with documents containing complete context
  • Partial information: Queries paired with documents providing only some relevant context
  • No information: Queries paired with unrelated documents

This structure allowed measurement of how the system behaves when context is complete, partial, or missing.
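The construction recipe itself is not published in the post, but the idea can be sketched as pairing each query once with its full set of relevant documents, once with only a subset, and once with unrelated documents. The helper below is an illustrative assumption, not the team's actual pipeline.

```python
import random
from dataclasses import dataclass
from typing import List

@dataclass
class BenchmarkCase:
    query: str
    documents: List[str]
    condition: str  # "full", "partial", or "no_information"

def build_cases(query: str, relevant_docs: List[str], unrelated_docs: List[str]) -> List[BenchmarkCase]:
    """Assemble the three information-level conditions for a single query."""
    half = max(1, len(relevant_docs) // 2)
    return [
        BenchmarkCase(query, relevant_docs, "full"),            # complete context
        BenchmarkCase(query, relevant_docs[:half], "partial"),  # only some relevant context
        BenchmarkCase(query, random.sample(unrelated_docs, k=min(3, len(unrelated_docs))), "no_information"),
    ]

cases = build_cases(
    query="How do I reset my gateway password?",
    relevant_docs=["Gateway admin guide, section 4: password reset", "FAQ: credentials"],
    unrelated_docs=["Holiday calendar", "Office seating chart", "Cafeteria menu", "Parking policy"],
)
for case in cases:
    print(case.condition, len(case.documents))
```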

Key Results Across 10,000+ Queries

Overall Performance

The system demonstrated strong performance across all datasets, with high scores in both answer relevance and groundedness:

Dataset  | % Answered | Answer Relevance | Groundedness | Citations/Answer | Citation Quality
MIML     | 95.9%      | 93.9             | 87.4         | 5.0              | 81.6
Support  | 97.8%      | 94.5             | 92.0         | 5.2              | 88.7
Customer | 93.4%      | 89.5             | 95.6         | 3.7              | 94.9

The Customer dataset showed the highest groundedness (95.6) and citation quality (94.9), likely because customer documents are typically well-structured and authoritative. The Support dataset performed strongly across all metrics, reflecting the quality of publicly available knowledge base articles.

Language and Industry Variations

Performance remained stable across languages and industry segments, with some notable exceptions:

Language performance (MIML dataset):

  • German: 96.2% answered, 93.6 relevance, 88.4 groundedness
  • English: 94.1% answered, 89.7 relevance, 87.9 groundedness
  • Spanish: 95.1% answered, 91.8 relevance, 87.4 groundedness
  • French: 95.2% answered, 92.6 relevance, 89.0 groundedness
  • Japanese: 96.9% answered, 95.9 relevance, 83.0 groundedness
  • Chinese: 96.9% answered, 95.8 relevance, 82.6 groundedness

Japanese and Chinese showed lower groundedness and citation quality scores. The team attributes this provisionally to weaker performance of the GPT-4.1-mini model in these languages and plans to test that hypothesis in future work.

Industry segment performance (MIML dataset):

  • Banking: 97.7% answered, 94.3 relevance, 85.6 groundedness
  • Human Resources: 97.8% answered, 95.2 relevance, 88.4 groundedness
  • Healthcare Administration: 96.1% answered, 94.7 relevance, 82.8 groundedness
  • Legal: 93.9% answered, 90.7 relevance, 87.4 groundedness

Healthcare Administration showed the lowest groundedness (82.8), possibly due to the complexity and specificity of medical content.

Information-Level Analysis

The system's behavior across different information levels reveals important design decisions:

Context Level  | % Answered | Answer Relevance | Groundedness | Citations/Answer | Citation Quality
Full           | 100.0%     | 99.4             | 90.6         | 4.3              | 86.3
Partial        | 95.7%      | 86.1             | 84.2         | 2.9              | 81.0
No information | 1.3%       | -                | -            | -                | -

With full context, the system achieves near-perfect relevance (99.4) and high groundedness (90.6). With partial information, relevance drops to 86.1 but remains substantial, and groundedness stays at 84.2. Critically, when no relevant information is present, the system generates an answer only 1.3% of the time, declining to respond rather than hallucinating.

Steering Instructions Compliance

The system demonstrates strong compliance with natural language instructions, even with multiple constraints:

Instructions   | Compliance Score
1 instruction  | 97.6
2 instructions | 96.2
3 instructions | 95.3
4 instructions | 89.6

When conflicts arise between user-provided and developer-provided instructions, the system correctly prioritizes user instructions, maintaining a compliance score of 91.8 compared to 98.2 for non-conflicting cases.

Importantly, adding steering instructions does not degrade core performance metrics:

Dataset  | Steering | % Answered | Answer Relevance | Groundedness
MIML     | Without  | 95.9%      | 93.9             | 87.4
MIML     | With     | 97.9%      | 95.9             | 86.2
Support  | Without  | 97.8%      | 94.5             | 92.0
Support  | With     | 98.0%      | 93.2             | 90.9
Customer | Without  | 93.4%      | 89.5             | 95.6
Customer | With     | 95.6%      | 91.6             | 94.2

Model Comparison

The choice of LLM for answer generation significantly impacts performance:

Model        | % Answered | Answer Relevance | Groundedness | Citations/Answer | Citation Quality
gpt-4o       | 91.7%      | 89.1             | 86.9         | 4.0              | 83.9
gpt-4o-mini  | 79.9%      | 77.8             | 80.9         | 2.3              | 69.5
gpt-4.1      | 96.7%      | 92.6             | 86.4         | 4.1              | 84.0
gpt-4.1-mini | 96.3%      | 93.9             | 87.1         | 4.9              | 81.3
gpt-4.1-nano | 73.3%      | 73.4             | 78.6         | 2.3              | 59.5
gpt-5        | 96.2%      | 88.9             | 89.3         | 7.2              | 88.3
gpt-5-mini   | 96.4%      | 90.0             | 89.3         | 8.0              | 84.4
gpt-5-nano   | 90.2%      | 83.2             | 88.3         | 4.8              | 75.3

The gpt-4.1-mini model (used in most experiments) strikes a good balance between performance and efficiency. However, gpt-5 and gpt-5-mini show the highest groundedness scores (89.3) and produce more citations per answer (7.2 and 8.0 respectively). The less powerful models (gpt-4o-mini, gpt-4.1-nano) show significant performance degradation.

Implementation and Getting Started

To use answer synthesis, developers set the generateAnswer parameter in their agentic retrieval API call. The response includes:

  1. The synthesized answer
  2. Inline citations within the answer text
  3. A references array with metadata about source documents

This output format is designed for straightforward rendering in chatbots, web applications, or other user interfaces.
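As a starting point, the following Python sketch shows what such a call might look like over REST. The post names the generateAnswer parameter; the endpoint path, API version, parameter value, and response field names used below are assumptions for illustration and may differ from the current preview API, so check the Azure AI Search documentation for the exact contract.

```python
import os
import requests

# Minimal REST sketch of an agentic retrieval call with answer synthesis turned on.
SEARCH_ENDPOINT = os.environ["SEARCH_ENDPOINT"]   # e.g. https://<service>.search.windows.net
API_KEY = os.environ["SEARCH_API_KEY"]
AGENT_NAME = "support-knowledge-agent"            # hypothetical knowledge agent name

payload = {
    "messages": [
        {"role": "user", "content": "How do I rotate the gateway API key?"}
    ],
    "generateAnswer": True,   # assumption: the accepted values for generateAnswer are not shown in the post
}

resp = requests.post(
    f"{SEARCH_ENDPOINT}/agents/{AGENT_NAME}/retrieve",
    params={"api-version": "2025-05-01-preview"},  # assumption: preview API version
    headers={"api-key": API_KEY, "Content-Type": "application/json"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
result = resp.json()

# Per the post, the response carries the synthesized answer, inline citations in the
# answer text, and a references array with source metadata (field names assumed here).
print(result.get("answer"))
for ref in result.get("references", []):
    print(ref)
```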

For teams implementing RAG solutions, the analysis provides several strategic insights:

  1. Model selection matters: While gpt-4.1-mini performs well, gpt-5 models show higher groundedness and more comprehensive citation practices.

  2. Language considerations: For Japanese and Chinese content, consider additional validation or potentially different model choices to maintain groundedness.

  3. Partial answers are valuable: The system answers 95.7% of queries when retrieved context is only partial, which significantly improves the user experience compared to blanket rejection messages.

  4. Steering instructions work: Developers can confidently add formatting and language instructions without degrading answer quality.

  5. Context quality is critical: The dramatic difference in answer relevance between full context (99.4) and partial context (86.1) underscores the importance of effective retrieval.

Broader Implications for RAG Architecture

This analysis demonstrates that answer synthesis can be successfully integrated into the retrieval layer itself, reducing the need for separate orchestration components. The approach offers several architectural advantages:

  • Simplified pipeline: Eliminates the need for a separate generation step after retrieval
  • Built-in grounding: Citations are generated automatically, making it easier to verify facts
  • Progressive disclosure: Partial answers guide users toward better queries
  • Instruction following: Natural language steering allows customization without code changes

However, the results also highlight important trade-offs. The system's performance is heavily dependent on the underlying LLM's capabilities, particularly for non-English languages. Teams should consider their specific requirements when choosing between different model options.

For organizations building knowledge management systems, customer support tools, or internal copilots, answer synthesis in Foundry IQ and Azure AI Search offers a compelling way to reduce development complexity while maintaining high-quality, grounded responses. The comprehensive metrics across 10,000+ queries provide a solid foundation for evaluating whether this approach fits specific use cases.
