Artificial intelligence benchmarks are crucial for measuring progress, but what happens when the benchmark itself is flawed? A rigorous investigation into Humanity's Last Exam (HLE), a high-profile AI eval designed to test PhD-level knowledge, has uncovered significant accuracy issues in its foundational data. HLE was developed in response to the saturation of benchmarks like MMLU (where top models score above 90%), and its 2,500 questions target extreme difficulty. However, research by FutureHouse reveals that approximately 29% of HLE's text-only biology and chemistry questions contain answers directly contradicted by established scientific literature.

The Flawed Foundation: Incentives Over Integrity

HLE's core design principle of selecting questions deemed impossible for current frontier language models appears to be the root cause: it inadvertently incentivized the creation of convoluted, 'gotcha'-style questions. When that incentive was combined with a review process in which domain experts were explicitly not required to verify answers if doing so took "more than 5 minutes," errors proliferated.

"The frontier of science isn’t actually objective and univocal. That’s why it’s a frontier," notes the FutureHouse team. "This makes it really time-consuming and difficult to write questions that are very challenging but valid."

Case Studies in Contradiction

  1. The Mythical Noble Gas:

    • HLE Question: "What was the rarest noble gas on Earth as a percentage of all terrestrial matter in 2002?"
    • HLE Answer: Oganesson
    • The Conflict: Oganesson is a synthetic element; only five atoms existed briefly in 2002. Peer-reviewed sources (e.g., Reviews in Mineralogy and Geochemistry, 2002) exclude it from terrestrial noble gas calculations. Predictions suggest it's likely a solid, not a gas, and potentially reactive, challenging its 'noble' status (Angewandte Chemie, 2020). This question exemplifies trivia masquerading as PhD-level research.
  2. The Ampule Anomaly:

    • HLE Question: "What is the BUD for a single dose container ampule from the time of puncture in a sterile environment?"
    • HLE Answer: 1 hour (citing USP <797>)
    • The Conflict: Independent expert review and direct reading of USP <797> indicate that the 1-hour beyond-use date (BUD) limit applies to single-dose vials, not ampules. The correct protocol for ampules is typically "use immediately." The HLE rationale appears to rest on a misinterpretation, potentially sourced from non-authoritative materials.

Quantifying the Problem: Methodology & Findings

FutureHouse employed its open-source PaperQA2 agent, Crow, to audit the 321 text-only biology/chemistry HLE questions. Crow was prompted to find direct supporting or contradicting evidence for each question-answer-rationale trio, focusing on published research:

# Simplified Crow Prompt Structure
Determine if the question-answer pair is contradicted by data/analysis.
Answer 'contradicted' or 'correct' + explanation.
Only label 'contradicted' if absolutely certain.
Question: <question>
Answer: <answer>
Rationale: <rationale>
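
A minimal sketch of how such an audit loop might be run against this prompt, assuming a generic ask_crow callable that forwards the filled-in prompt to the PaperQA2 agent and returns its reply text. The helper name, template handling, and parsing below are illustrative assumptions, not FutureHouse's actual code:

# Illustrative audit loop; ask_crow and the reply parsing are assumptions, not FutureHouse's code
from collections import Counter

PROMPT_TEMPLATE = (
    "Determine if the question-answer pair is contradicted by data/analysis.\n"
    "Answer 'contradicted' or 'correct' + explanation.\n"
    "Only label 'contradicted' if absolutely certain.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rationale: {rationale}"
)

def audit(items, ask_crow):
    """Tally verdicts for question-answer-rationale trios.

    items: iterable of dicts with 'question', 'answer', and 'rationale' keys.
    ask_crow: callable that sends a prompt string to the agent and returns its reply text.
    """
    verdicts = Counter()
    for item in items:
        reply = ask_crow(PROMPT_TEMPLATE.format(**item)).strip().lower()
        # Take the leading label word of the reply; anything else counts as unclear.
        label = reply.split()[0].strip(".,:") if reply else "unclear"
        verdicts[label if label in ("contradicted", "correct") else "unclear"] += 1
    return verdicts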

Crow flagged 53.3% (171 questions) as directly conflicting with published evidence. Broken down by field in Crow's initial analysis:
* Chemistry: 57.0% contradicted.
* Biology/Health: 51.6% contradicted.
Human experts then evaluated a 150-question sample and largely confirmed Crow's findings, directly disagreeing with only 13.7% of its 'contradicted' flags. Extrapolating the expert-validated results suggests:
* 29.3% ± 3.7% (95% CI): Directly contradicted by published research.
* 51.3% ± 4.1%: Supported by research.
* 19.3% ± 3.2%: Nuanced (dependent on assumptions/opinion).
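
For readers checking the arithmetic, those percentages are consistent with 44, 77, and 29 of the 150 expert-reviewed questions, and the quoted ± margins are reproduced by one binomial standard error at that sample size. The snippet below is an editorial reconstruction from the published figures, not FutureHouse's analysis code:

# Reconstructing the sample arithmetic from the published percentages (not FutureHouse's code)
from math import sqrt

SAMPLE_SIZE = 150  # expert-reviewed sample
counts = {"contradicted": 44, "supported": 77, "nuanced": 29}  # sums to 150

for label, k in counts.items():
    p = k / SAMPLE_SIZE
    margin = sqrt(p * (1 - p) / SAMPLE_SIZE)  # one binomial standard error on n=150
    print(f"{label}: {p:.1%} ± {margin:.1%}")
# Output: contradicted 29.3% ± 3.7%, supported 51.3% ± 4.1%, nuanced 19.3% ± 3.2%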

Implications for AI Benchmarking and the Path Forward

This analysis raises critical questions about the future of AI evaluation:
1. Validity vs. Difficulty: The pursuit of questions that stump models shouldn't compromise factual accuracy. Benchmarks must balance challenge with rigorous validation.
2. Resource Realism: Verifying PhD-level claims requires significant expertise and time; a 5-minute review limit is fundamentally inadequate.
3. The Nuance Problem: Scientific frontiers are often ambiguous. Benchmarks demanding "objectively correct, univocal answers" in these areas may be inherently flawed concepts.

In response, FutureHouse has released HLE Bio/Chem Gold on HuggingFace – a curated subset of questions validated by both their AI system and human experts. This offers a more reliable foundation for evaluating AI research capabilities in these domains. The incident underscores a pivotal lesson for the AI community: building evals that genuinely reflect expert human understanding requires prioritizing scientific integrity over artificial notions of model-stumping difficulty. The quest for 'impossible' questions, it turns out, can sometimes make the benchmark itself impossible to trust.
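
For teams that want to evaluate against the curated subset, it can be loaded with the standard HuggingFace datasets library. The repository id and split name below are placeholders, not confirmed identifiers; check FutureHouse's HuggingFace page for the actual ones:

# Loading the curated subset (repository id and split name are placeholders)
from datasets import load_dataset

gold = load_dataset("futurehouse/hle-bio-chem-gold")  # hypothetical repository id
print(gold)              # lists the available splits and columns
print(gold["train"][0])  # inspect one validated question record (split name may differ)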

Source: Analysis and findings based on research by FutureHouse, published at futurehouse.org.