A deep analysis of the PhD-level AI benchmark 'Humanity's Last Exam' reveals that nearly 30% of its biology and chemistry answers conflict with peer-reviewed literature. Researchers attribute the errors to flawed incentives that prioritized questions 'impossible for frontier models' over scientific accuracy, and have released a validated subset in response.