The FACTS Benchmark Suite has launched: a new industry benchmark for systematically evaluating the factual accuracy of large language models.
The AI community has long struggled with a fundamental problem: how do we reliably measure whether a language model is actually telling the truth? Benchmarks have existed for reasoning, coding, and general knowledge, but factual accuracy—especially across the different ways models are used in production—has remained surprisingly difficult to quantify.
This week, the FACTS team, in collaboration with Kaggle, released the FACTS Benchmark Suite to address this gap. It's not just another leaderboard-chasing exercise; the suite represents a more nuanced approach to measuring factual reliability by breaking it down into four distinct dimensions that reflect real-world usage patterns.

Building on FACTS Grounding
The new suite expands on the original FACTS Grounding Benchmark, which focused primarily on whether models could base their responses on provided context. That work established a foundation, but practitioners quickly realized it didn't capture the full picture of how models fail factually.
The FACTS Benchmark Suite adds three new benchmarks while updating the original:
- Parametric Benchmark: Tests whether models can answer fact-based questions using only their internal knowledge, without any external tools. Think trivia-style questions that should be answerable from sources like Wikipedia.
- Search Benchmark: Evaluates models' ability to correctly retrieve and synthesize information using a standardized web search tool. This often requires multiple retrieval steps to resolve a single query.
- Multimodal Benchmark: Measures factual accuracy when answering questions about images, requiring correct visual interpretation combined with background knowledge.
- Grounding Benchmark v2: An updated version of the original, assessing whether responses are properly grounded in provided contextual information.
Together, these benchmarks comprise 3,513 curated examples, split between public and private evaluation sets. Kaggle manages the held-out private sets, evaluates participating models, and publishes results through a public leaderboard. Performance is reported as the FACTS Score, calculated as the average accuracy across all benchmarks and both data splits.
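The aggregation described above is straightforward to sketch. Assuming the FACTS Score is a plain mean of per-benchmark accuracy over both splits (the article does not specify any weighting), the computation looks like this; the numbers are placeholders, not real leaderboard results:

```python
# Sketch of FACTS Score aggregation: the mean of per-benchmark accuracy
# across all four benchmarks and both data splits. All numbers below are
# illustrative placeholders.

def facts_score(accuracies: dict[str, dict[str, float]]) -> float:
    """Average accuracy across all benchmarks and both splits."""
    cells = [acc for splits in accuracies.values() for acc in splits.values()]
    return sum(cells) / len(cells)

example = {
    "parametric": {"public": 0.62, "private": 0.58},
    "search":     {"public": 0.71, "private": 0.66},
    "multimodal": {"public": 0.49, "private": 0.45},
    "grounding":  {"public": 0.80, "private": 0.77},
}
print(f"FACTS Score: {facts_score(example):.1%}")  # → FACTS Score: 63.5%
```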
Why This Four-Dimensional Approach Matters
The structure reflects how practitioners actually use these models. As Alexey Marinin, a senior iOS engineer, noted about the release: "This four-dimensional view (knowledge, web, grounding, multimodal) feels much closer to how people actually use these models day to day."
Each dimension tests different failure modes:
Parametric failures happen when a model has been trained on information but can't reliably recall it. This is the classic "hallucination" problem—when a model invents facts that sound plausible but are wrong.
Search failures occur when a model can't effectively use retrieval tools. Even with access to accurate information, models might misinterpret search results, fail to synthesize multiple sources, or prioritize the wrong details.
Grounding failures happen when models ignore provided context. This is particularly problematic in enterprise settings where models are given specific documents to work with.
Multimodal failures represent the intersection of vision and language understanding. A model might correctly interpret text in an image but fail to connect it with relevant background knowledge, or vice versa.
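To make the grounding failure mode concrete, here is a deliberately naive lexical check that flags answer sentences with little word overlap against the provided context. This is not how the FACTS suite grades responses; it only illustrates the kind of deviation the Grounding benchmark is designed to catch:

```python
# Naive grounding probe: flag answer sentences whose words barely
# overlap the supplied context. Purely illustrative; real grading in
# benchmarks like FACTS is far more sophisticated.

def ungrounded_sentences(answer: str, context: str, threshold: float = 0.5):
    context_words = set(context.lower().split())
    flagged = []
    for sentence in answer.split("."):
        words = set(sentence.lower().split())
        if not words:
            continue
        overlap = len(words & context_words) / len(words)
        if overlap < threshold:
            flagged.append(sentence.strip())
    return flagged

context = "The report states revenue grew 12% in Q3 driven by cloud services."
answer = "Revenue grew 12% in Q3. The CEO also announced a new headquarters."
print(ungrounded_sentences(answer, context))
# → ['The CEO also announced a new headquarters']
```

The second sentence is flagged because nothing in the context supports it, which is exactly the enterprise-document scenario described above.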
Early Results Show Progress and Gaps
Early evaluation results reveal both progress and significant gaps. Among evaluated models, Gemini 3 Pro achieved the highest overall FACTS Score at 68.8%, showing notable improvements over its predecessor in parametric and search-based factuality.
However, the headline finding is that no evaluated model exceeded 70% overall accuracy. This isn't necessarily damning—it's a difficult benchmark—but it does highlight that even state-of-the-art models have substantial room for improvement.
Multimodal factuality emerged as a particularly difficult area across the board. This aligns with what many practitioners have observed: while vision-language models have made impressive strides, reliably combining visual understanding with factual knowledge remains challenging.
Practical Implications for Teams
For teams building production systems, this benchmark provides several concrete benefits:
1. Better Model Selection
Instead of relying on general-purpose benchmarks, you can now evaluate models against the specific usage patterns your application requires. If your system primarily uses retrieval-augmented generation (RAG), the Search and Grounding benchmarks are more relevant than parametric knowledge tests.
2. Standardized Evaluation
Having a common reference point means you can compare different models on the same terms. This is particularly valuable when considering model migrations or evaluating new releases.
3. Targeted Improvement
By identifying which dimension a model struggles with, you can apply targeted solutions:
- Poor parametric performance? Consider fine-tuning or using retrieval instead of relying on internal knowledge.
- Weak search performance? Improve your retrieval pipeline or prompt engineering.
- Grounding issues? Implement better context compression or attention mechanisms.
- Multimodal struggles? Focus on better vision encoders or cross-modal alignment.
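One way to act on per-dimension scores is to weight them by how much your application exercises each dimension, then pick the best-fitting model. The model names and scores below are hypothetical; the weights sketch a RAG-heavy workload where grounding and search dominate:

```python
# Dimension-weighted model selection. Scores and model names are
# hypothetical; weights reflect an example RAG-heavy workload.

WORKLOAD_WEIGHTS = {
    "parametric": 0.1, "search": 0.3, "grounding": 0.5, "multimodal": 0.1,
}

MODEL_SCORES = {
    "model-a": {"parametric": 0.70, "search": 0.55, "grounding": 0.60, "multimodal": 0.50},
    "model-b": {"parametric": 0.55, "search": 0.65, "grounding": 0.75, "multimodal": 0.45},
}

def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    return sum(scores[d] * w for d, w in weights.items())

best = max(MODEL_SCORES, key=lambda m: weighted_score(MODEL_SCORES[m], WORKLOAD_WEIGHTS))
print(best)  # → model-b (stronger on the heavily weighted dimensions)
```

Note that model-a wins on raw parametric knowledge, but model-b's grounding and search strength makes it the better fit for this workload mix.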
How to Use the Benchmark
The FACTS Benchmark Suite is publicly available through Kaggle. Here's how to get started:
Access the datasets: The public portions of the datasets are available for download. You can use these to evaluate your own models or understand the benchmark's structure.
Run evaluations: Kaggle provides a standardized evaluation framework. You can submit your model's predictions for private set evaluation and get back a FACTS Score.
Compare results: The public leaderboard shows how different models perform across all four dimensions, giving you insight into relative strengths and weaknesses.
Contribute: The FACTS team has made this an open research project. If you discover interesting patterns or have suggestions for improvement, they encourage community involvement.
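If you download the public portions, a minimal local evaluation loop might look like the following. The file name, JSONL format, and field names here are assumptions for illustration, not the benchmark's documented schema; adapt them to the actual files Kaggle provides:

```python
# Sketch of a local accuracy evaluation over a downloaded public split.
# Assumes (hypothetically) a JSONL file with "question" and "answer"
# fields; check the real dataset's schema before using.
import json

def evaluate(predict, path: str) -> float:
    """Run `predict` over each example and return exact-match accuracy."""
    correct = total = 0
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            total += 1
            if predict(example["question"]) == example["answer"]:
                correct += 1
    return correct / total if total else 0.0
```

Exact-match scoring is a simplification; free-form answers generally need fuzzier matching or a judge model.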
The Broader Context
This release comes at a time when the AI community is increasingly focused on evaluation beyond simple accuracy metrics. We're seeing a shift toward measuring models on dimensions that matter for actual deployment:
- Reliability: Can the model be trusted to not make up facts?
- Verifiability: Can we trace claims back to sources?
- Consistency: Does the model give the same answer to the same question?
- Robustness: How does it handle edge cases and ambiguous inputs?
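Of these dimensions, consistency is the easiest to probe yourself: ask the same question several times and measure how often the modal answer appears. Here `ask` is a stand-in for whatever model call your stack uses:

```python
# Simple consistency probe: repeat a question n times and report the
# fraction of responses matching the most common answer. `ask` is a
# placeholder for your model-call function.
from collections import Counter

def consistency(ask, question: str, n: int = 5) -> float:
    answers = [ask(question) for _ in range(n)]
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / n

# A deterministic stub is perfectly consistent:
print(consistency(lambda q: "Paris", "Capital of France?"))  # → 1.0
```

A score near 1.0 means the model answers stably; lower scores suggest the question sits in a region where sampling noise changes the factual claim itself.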
The FACTS Benchmark Suite fits into this broader movement toward more sophisticated evaluation. It acknowledges that "intelligence" in AI systems isn't just about answering questions correctly—it's about doing so reliably, verifiably, and in ways that match how humans actually want to use these tools.
Limitations and Future Directions
The FACTS team is clear that this benchmark is intended to support ongoing research rather than serve as a final measure of model quality. Several limitations are worth noting:
- Coverage: 3,513 examples, while substantial, is still limited compared to the vast space of factual claims.
- Static nature: The benchmark uses fixed examples, while real-world information constantly evolves.
- Cultural bias: The examples may reflect certain cultural or linguistic perspectives more than others.
- Task specificity: It focuses on question-answering, while models are used for summarization, translation, creative writing, and other tasks where factual accuracy matters differently.
Future iterations will likely address these gaps. The team has indicated interest in:
- Dynamic evaluation sets that update with current events
- More diverse cultural and linguistic coverage
- Task-specific factual accuracy metrics beyond Q&A
- Integration with real-time retrieval systems
What This Means for Practitioners
If you're building systems that rely on language models for factual tasks, the FACTS Benchmark Suite gives you:
- A concrete way to measure progress as you iterate on your models or pipelines
- A vocabulary for discussing factual accuracy with stakeholders and team members
- A set of failure modes to test against when designing safeguards
- A foundation for custom evaluation if you need to measure accuracy in your specific domain
The benchmark won't solve the hallucination problem overnight, but it provides something the field has needed: a shared, systematic way to talk about and measure factual accuracy across different usage patterns.
For teams deploying models today, the immediate value isn't in chasing the highest FACTS Score—it's in using the benchmark's structure to understand where your specific models and use cases might fail, and building appropriate guardrails and evaluation systems accordingly.
The FACTS Benchmark Suite is available at Kaggle's FACTS Benchmark page. The Google DeepMind blog post provides additional technical details and the full research paper is available through their publications page.

