A fresh, real‑world corpus of 1,000 user‑submitted claims shows that in 67 % of cases at least one of the five leading large language models disagrees with the majority verdict, and substantive splits appear in more than a third of the claims. The analysis highlights the limits of treating any single model as a de‑facto fact‑checker and points to the need for multi‑model ensembles and human oversight.

Beyond Benchmarks: Frontier LLM Disagreement on Fact‑Checks

The headline number

When five of the most capable language models (GPT‑5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro + Search, and Sonar Pro) were asked to label the same 1,000 recent user‑submitted claims, 672 claims (67 %, 95 % CI 64–70 %) featured at least one dissenting model. In other words, the panel was not unanimous on these claims: either at least one model voted against the majority verdict, or no strict majority formed at all.
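The quoted interval can be reproduced with a standard binomial calculation. The sketch below assumes a Wald (normal‑approximation) interval; the source does not say which method the authors used, so treat the function name and the choice of method as illustrative.

```python
import math

def wald_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# 672 of 1,000 claims had at least one dissenting model.
low, high = wald_interval(672, 1000)
print(f"{low:.1%} to {high:.1%}")
```

Rounded to whole percentages, this gives the 64–70 % range quoted above.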

How disagreement is measured

Each claim received a forced‑choice label from a four‑bucket rubric (True, Mostly True, Misleading, False). The authors treat the majority label only as a reference point, not as ground truth. Disagreement is quantified in two ways:

Presence of any dissent – any claim where the five models are not unanimous.
Bucket distance – the maximum ordinal distance between any two model verdicts (True → 0, Mostly True → 1, Misleading → 2, False → 3). A distance of 2 or more signals a substantive split rather than a simple calibration shift.
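Both measures are easy to compute once the verdicts are mapped to ordinal ranks. A minimal sketch, assuming each claim's verdicts arrive as a list of rubric labels; the function names and the example verdicts are hypothetical, not taken from the study.

```python
# Ordinal ranks for the four-bucket rubric described above.
BUCKETS = {"True": 0, "Mostly True": 1, "Misleading": 2, "False": 3}

def has_dissent(verdicts: list[str]) -> bool:
    """True when the models are not unanimous."""
    return len(set(verdicts)) > 1

def bucket_distance(verdicts: list[str]) -> int:
    """Maximum ordinal distance between any two verdicts (max rank minus min rank)."""
    ranks = [BUCKETS[v] for v in verdicts]
    return max(ranks) - min(ranks)

# Hypothetical verdicts from a five-model panel on one claim.
example = ["True", "True", "Mostly True", "Misleading", "True"]
print(has_dissent(example), bucket_distance(example))
```

Because the buckets are ordered, the largest pairwise distance is simply the gap between the highest and lowest rank, so no pairwise loop is needed.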

The raw breakdown

Pattern	Claims	Share of corpus
All five agree (unanimity)	328	33 %
One dissent (4‑1)	224	22 %
Two dissent (3‑2)	316	32 %
No strict majority (e.g., 2‑2‑1)	132	13 %
Any dissent	672	67 %
≥ 2 dissenting models	448	45 %
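The rows of this breakdown can be derived from the size of the largest voting bloc. The sketch below is one way to do that; the pattern labels mirror the table, and the function name is an assumption for illustration. Note that a "3‑2" label here means three models agree and two dissent, whether or not the two dissenters agree with each other.

```python
from collections import Counter

def vote_pattern(verdicts: list[str]) -> str:
    """Classify a panel vote by the size of its largest bloc."""
    n = len(verdicts)
    top = Counter(verdicts).most_common(1)[0][1]  # size of the largest bloc
    if top == n:
        return "unanimous"
    if top * 2 > n:  # strict majority: more than half the panel
        return f"{top}-{n - top}"
    return "no strict majority"

print(vote_pattern(["True", "True", "True", "True", "False"]))
print(vote_pattern(["True", "True", "False", "False", "Misleading"]))
```

The four primary rows sum to the full corpus (328 + 224 + 316 + 132 = 1,000), and the last row follows from the 3‑2 and no‑majority rows (316 + 132 = 448).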

Krippendorff’s α (ordinal) for the panel is 0.639, indicating structured but far from perfect agreement.
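For readers who want to compute a comparable statistic on their own data, here is a minimal sketch of ordinal Krippendorff's α using the standard coincidence‑matrix formulation. It is not the authors' code: the function name, the input layout (one list of integer ranks per claim), and the toy data are assumptions.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_ordinal(units: list[list[int]], n_levels: int) -> float:
    """Ordinal Krippendorff's alpha; each unit holds ranks in 0..n_levels-1."""
    coincidences = Counter()  # coincidences[c, k]: weighted count of c-k pairs
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a unit with a single rating carries no agreement information
        for c, k in permutations(ratings, 2):
            coincidences[c, k] += 1 / (m - 1)

    levels = range(n_levels)
    totals = [sum(coincidences[c, k] for k in levels) for c in levels]
    n = sum(totals)

    def delta_sq(c: int, k: int) -> float:
        # Ordinal metric: squared count of ratings lying between the two ranks.
        lo, hi = min(c, k), max(c, k)
        return (sum(totals[lo:hi + 1]) - (totals[lo] + totals[hi]) / 2) ** 2

    observed = sum(coincidences[c, k] * delta_sq(c, k) for c in levels for k in levels)
    expected = sum(totals[c] * totals[k] * delta_sq(c, k) for c in levels for k in levels) / (n - 1)
    return 1 - observed / expected

# Toy data: three claims rated by two raters on a two-level scale.
print(krippendorff_alpha_ordinal([[0, 0], [1, 1], [0, 1]], n_levels=2))
```

The sketch raises a division‑by‑zero error when every rating in the data set is identical, since expected disagreement is then zero and α is undefined.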

Substantive versus nuance splits

A 2‑bucket gap (e.g., True vs Misleading) or a full polar split (True vs False) is considered substantive. The analysis finds:

33 % of claims show only nuance (True ↔ Mostly True or Misleading ↔ False).
13 % show a substantive 2‑bucket gap.
21 % are polar (True ↔ False).
Overall, 34 % of claims have a bucket distance of at least 2.
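The shares above are internally consistent: 13 % + 21 % gives the 34 % substantive figure, and adding the 33 % nuance share gives the 67 % of claims with any dissent, which implies that every distance‑1 split is counted as nuance. A small sketch of the tally follows, assuming bucket distances have already been computed per claim; the category names and the sample distances are hypothetical.

```python
from collections import Counter

SEVERITY = {0: "unanimous", 1: "nuance", 2: "two-bucket gap", 3: "polar"}

def severity_shares(distances: list[int]) -> dict[str, float]:
    """Share of claims in each severity category, given per-claim bucket distances."""
    counts = Counter(SEVERITY[d] for d in distances)
    return {label: count / len(distances) for label, count in counts.items()}

# Hypothetical bucket distances for ten claims (not the study's data).
shares = severity_shares([0, 0, 0, 1, 1, 1, 2, 3, 3, 0])
substantive = shares["two-bucket gap"] + shares["polar"]
print(shares, f"substantive: {substantive:.0%}")
```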

These numbers matter because a “True vs Mostly True” disagreement often reflects confidence calibration, while a “True vs False” split reveals a genuine clash over the factual status of the claim.

Model‑to‑model alignment

Pairwise agreement rates expose which models tend to move together:

Gemini 3 Pro × Gemini 3 Pro + Search agree on 75 % of claims (they share the same base model).
The weakest links are Claude Opus 4.7 × Gemini 3 Pro, Claude Opus 4.7 × Gemini 3 Pro + Search, and Gemini 3 Pro × Sonar Pro, each at 53 %.

These figures suggest that retrieval‑augmented variants do not automatically converge with their parametric counterparts.
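Pairwise agreement is the fraction of claims on which two models return the identical label. A minimal sketch, assuming each model's verdicts are stored as a list in the same claim order; the model names and labels below are placeholders rather than study data.

```python
from itertools import combinations

def pairwise_agreement(labels: dict[str, list[str]]) -> dict[tuple[str, str], float]:
    """Exact-match agreement rate for every pair of models."""
    rates = {}
    for a, b in combinations(labels, 2):
        matches = sum(x == y for x, y in zip(labels[a], labels[b]))
        rates[a, b] = matches / len(labels[a])
    return rates

# Placeholder verdicts for three hypothetical models on four claims.
labels = {
    "model_a": ["True", "False", "True", "Mostly True"],
    "model_b": ["True", "False", "False", "Mostly True"],
    "model_c": ["True", "True", "False", "Misleading"],
}
for pair, rate in pairwise_agreement(labels).items():
    print(pair, f"{rate:.0%}")
```

Exact‑match agreement treats a True vs Mostly True mismatch the same as a True vs False one, so it is stricter than the bucket‑distance view used elsewhere in the analysis.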

Per‑model behavior

Verdict distribution

Model	True	Mostly True	Misleading	False
GPT‑5.4	42 %	16 %	12 %	30 %
Claude Opus 4.7	38 %	26 %	19 %	17 %
Gemini 3 Pro	54 %	3 %	3 %	40 %
Gemini 3 Pro + Search	52 %	4 %	9 %	35 %
Sonar Pro	35 %	23 %	16 %	26 %

Models that concentrate on the poles (both Gemini configurations and, to a lesser extent, GPT‑5.4) differ markedly from those that spread their verdicts more evenly across the four buckets (Claude Opus 4.7, Sonar Pro).
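One simple way to quantify this concentration is to add each model's True and False shares. The sketch below copies the percentages from the verdict distribution table; the "pole share" name is an illustrative label, not a metric defined by the authors.

```python
# Verdict shares (%) from the table: True, Mostly True, Misleading, False.
SHARES = {
    "GPT-5.4": (42, 16, 12, 30),
    "Claude Opus 4.7": (38, 26, 19, 17),
    "Gemini 3 Pro": (54, 3, 3, 40),
    "Gemini 3 Pro + Search": (52, 4, 9, 35),
    "Sonar Pro": (35, 23, 16, 26),
}

def pole_share(shares: tuple[int, int, int, int]) -> int:
    """Percentage of verdicts falling in the two extreme buckets."""
    true_share, _, _, false_share = shares
    return true_share + false_share

for model, shares in sorted(SHARES.items(), key=lambda kv: pole_share(kv[1]), reverse=True):
    print(f"{model}: {pole_share(shares)} %")
```

This yields 94 % for Gemini 3 Pro, 87 % for Gemini 3 Pro + Search, 72 % for GPT‑5.4, 61 % for Sonar Pro, and 55 % for Claude Opus 4.7.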

Alignment with the panel majority

When a strict majority exists among the other four models, each model’s verdict matches that majority between 69 % (Sonar Pro) and 81 % (GPT‑5.4) of the time. This measures peer‑majority agreement; it is not an accuracy claim.
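This leave‑one‑out comparison can be sketched as follows: for each claim, drop the model under test, check whether the remaining models form a strict majority, and score a match only on claims where they do. The data layout (one dict of model name to verdict per claim) and all names are assumptions for illustration.

```python
from collections import Counter
from typing import Optional

def peer_majority(verdicts: dict[str, str], model: str) -> Optional[str]:
    """Strict-majority label among all models except `model`, or None if there is none."""
    others = [v for name, v in verdicts.items() if name != model]
    label, count = Counter(others).most_common(1)[0]
    return label if count * 2 > len(others) else None

def peer_agreement_rate(claims: list[dict[str, str]], model: str) -> float:
    """Share of claims (with a peer majority) on which `model` matches that majority."""
    hits = total = 0
    for verdicts in claims:
        majority = peer_majority(verdicts, model)
        if majority is None:
            continue  # no strict majority among peers: claim is excluded
        total += 1
        hits += verdicts[model] == majority
    return hits / total if total else float("nan")

# Two hypothetical claims judged by five placeholder models.
claims = [
    {"a": "True", "b": "True", "c": "True", "d": "True", "e": "False"},
    {"a": "True", "b": "True", "c": "False", "d": "False", "e": "True"},
]
print(peer_agreement_rate(claims, "c"))
```

Excluding the model under test keeps it from voting on its own reference label, which would otherwise inflate its agreement rate.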

Domain‑level patterns

Disagreement is not uniform across topics. The highest share of any disagreement appears in Legal (77 %) and Finance (67 %), while History shows the lowest at 53 %. Substantive splits (≥ 2‑bucket gaps) are most common in Science (21 %) and Legal (19 %), reflecting the nuanced nature of those domains.

What the numbers do not tell you

No ground truth: The majority label is not assumed correct. Even the 33 % unanimous cases likely contain shared blind spots.
Rubric ambiguity: The four buckets are treated as equally spaced ordinal categories, a simplification that can inflate the perceived severity of a 2‑bucket gap.
Training contamination: The claims are fresh (submitted after 15 Feb 2026) and not part of public benchmark suites, reducing the chance that models have memorized exact answers, but overlapping topical material is inevitable.
Retrieval opacity: Retrieval‑augmented models may have consulted the live web, including the Lenz platform itself, but the specific sources are not audited.

Why this matters for practitioners

Relying on a single model is risky – on two‑thirds of real‑world fact‑check requests, at least one model will disagree with the rest.
Ensembles can reduce uncertainty – a simple majority vote across diverse models still returns a verdict on the 87 % of claims where a strict majority exists, but 45 % of claims have at least two dissenters (or no strict majority at all), so the vote alone does not settle them.
Human oversight remains essential – the frontier panel’s Krippendorff α of 0.639 is comparable to inter‑annotator agreement in established fact‑checking corpora (κ≈0.62), underscoring that the task itself is hard for both humans and machines.
Domain‑aware routing may help – models like Gemini 3 Pro excel in finance‑type claims but falter on nuanced legal statements; a routing layer that selects the best‑performing model per domain could improve overall consistency.
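The first three recommendations can be combined into a simple triage rule: take the majority verdict when one exists, and flag the claim for human review when there is no strict majority or when the verdicts span more than one bucket. The sketch below is one hypothetical policy, not something the authors propose; the function name and the `max_distance` threshold are assumptions.

```python
from collections import Counter
from typing import Optional

BUCKETS = {"True": 0, "Mostly True": 1, "Misleading": 2, "False": 3}

def triage(verdicts: list[str], max_distance: int = 1) -> tuple[Optional[str], bool]:
    """Return (majority label or None, whether a human should review the claim)."""
    label, count = Counter(verdicts).most_common(1)[0]
    has_majority = count * 2 > len(verdicts)
    ranks = [BUCKETS[v] for v in verdicts]
    spread = max(ranks) - min(ranks)
    # Hypothetical policy: escalate on a missing majority or a substantive split.
    needs_review = (not has_majority) or spread > max_distance
    return (label if has_majority else None), needs_review

print(triage(["True", "True", "True", "True", "Mostly True"]))
print(triage(["True", "True", "True", "False", "False"]))
```

Under this policy a 4‑1 nuance split passes through automatically, while a 3‑2 polar split keeps its majority label but is still routed to a reviewer.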

Next steps announced by the authors

A follow‑up study will human‑label every claim in this corpus, allowing a proper accuracy comparison between the frontier panel, the Lenz platform, and the individual models. The goal is not to crown a winner but to map where systematic divergences arise—whether from rubric ambiguity, temporal framing, or genuine factual uncertainty.

Data source: 1,000 recent user submissions to the Lenz fact‑checking platform (21 May 2026 snapshot). Full CSV and PDF are available via the permanent archive at https://lenz.io/research/llm-disagreement/v1.0.

Citation: doi.org/10.5281/zenodo.20344847.

