Humanity's Last Exam Under Scrutiny: Study Finds 29% of AI Benchmark's Bio/Chem Questions Contradicted by Research
An analysis of the PhD-level AI benchmark "Humanity's Last Exam" finds that 29% of its biology and chemistry answers conflict with peer-reviewed literature. The researchers attribute the errors to flawed incentives that prioritized questions "impossible for frontier models" over scientific accuracy, and the findings prompted the release of a validated subset.