NeurIPS Papers Contaminated with AI Hallucinations: 100 Fabricated Citations Found in Accepted Research
#Regulation

Privacy Reporter
5 min read

A GPTZero analysis reveals over 100 AI-generated hallucinations in 51 NeurIPS papers, exposing a systemic breakdown in academic vetting as conference submissions have more than doubled since 2020. The findings highlight a growing crisis in research integrity: the average number of mistakes per NeurIPS paper has risen more than 55% since 2021.

The academic publishing world is grappling with a crisis of authenticity as AI-generated fabrications infiltrate the highest levels of computer science research. GPTZero, an AI detection company, has identified over 100 hallucinated citations in 51 papers accepted by the prestigious Conference on Neural Information Processing Systems (NeurIPS), revealing a critical vulnerability in the peer review process.

The findings, detailed in a GPTZero blog post, show that authors have been submitting research containing invented author names, citations to non-existent sources, and passages of apparently AI-fabricated text. This follows the company's earlier discovery of 50 hallucinated citations in papers under review by the International Conference on Learning Representations (ICLR).

The Scale of the Problem

The issue stems from a dramatic surge in paper submissions. Between 2020 and 2025, NeurIPS submissions more than doubled, from 9,467 to 21,575 papers, an increase of roughly 128%. This rapid growth has forced conference organizers to recruit ever-greater numbers of reviewers, resulting in "issues of oversight, expertise alignment, negligence, and even fraud," according to GPTZero's senior machine-learning engineer Nazar Shmatko, head of machine learning Alex Adam, and academic writing editor Paul Esau.
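
That growth figure is easy to verify. Here is a minimal Python sketch using only the submission counts quoted above:

```python
# Growth in NeurIPS submissions, 2020 -> 2025, from the counts quoted above.
subs_2020 = 9_467
subs_2025 = 21_575

pct_increase = (subs_2025 - subs_2020) / subs_2020 * 100
print(f"{pct_increase:.1f}% increase")  # prints: 127.9% increase
```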

The problem extends beyond fabricated citations. A preprint published in December 2025 by researchers from Together AI, NEC Labs America, Rutgers University, and Stanford University examined substantive errors in AI papers from three major venues: ICLR (2018–2025), NeurIPS (2021–2025), and Transactions on Machine Learning Research (TMLR, 2022–2025).

The researchers found that "published papers contain a non-negligible number of objective mistakes and that the average number of mistakes per paper has increased over time." Specifically (the percent changes are reproduced in the sketch after this list):

  • NeurIPS: From 3.8 mistakes per paper in 2021 to 5.9 in 2025 (a 55.3% increase)
  • ICLR: From 4.1 mistakes per paper in 2018 to 5.2 in 2025 (a 26.8% increase)
  • TMLR: From 5.0 mistakes per paper in 2022/23 to 5.5 in 2025 (a 10% increase)
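
The same arithmetic reproduces each venue's percent change; a minimal sketch using the study's per-paper averages:

```python
# Percent change in average mistakes per paper, per venue,
# from the baseline and 2025 figures quoted in the list above.
venues = {
    "NeurIPS (2021 -> 2025)":    (3.8, 5.9),
    "ICLR (2018 -> 2025)":       (4.1, 5.2),
    "TMLR (2022/23 -> 2025)":    (5.0, 5.5),
}

for venue, (baseline, latest) in venues.items():
    pct = (latest - baseline) / baseline * 100
    print(f"{venue}: {pct:+.1f}%")
# NeurIPS: +55.3%, ICLR: +26.8%, TMLR: +10.0%
```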

While correlation doesn't prove causation, the timing is telling: the measured rise in NeurIPS error rates coincides with the widespread adoption of generative AI tools that followed OpenAI's release of ChatGPT in late 2022, and that adoption is hard to dismiss as a factor.

The academic world isn't alone in facing this challenge. The legal community has been dealing with similar issues: more than 800 errant legal citations attributed to AI models have been flagged in court filings, often with serious consequences for the attorneys, judges, and plaintiffs involved.

While academics may not face the same formal misconduct sanctions as legal professionals, the consequences of careless AI application extend beyond reputational damage. Invalidated research can undermine years of work, affect funding, and damage institutional credibility.

The AI Arms Race in Publishing

The problem is exacerbated by anti-forensic tools designed to evade detection. For example, a Claude Code skill called Humanizer claims to "remove signs of AI-generated writing from text, making it sound more natural and human." This creates an ongoing arms race between detection and evasion technologies.

The International Association of Scientific, Technical & Medical Publishers (STM) recently published a report addressing these integrity challenges. With academic communication reaching 5.7 million articles in 2024 (up from 3.9 million five years earlier), the report argues that publishing practices must adapt to AI-assisted and AI-fabricated research.

Adam Marcus, co-founder of Retraction Watch and managing editor of Gastroenterology & Endoscopy News, notes that "academic publishers are definitely aware of the problem and are taking steps to protect themselves. Whether those will succeed remains to be seen."

However, Marcus also points to a deeper structural issue: "We're in an AI arms race and it's not clear the defenders can withstand the siege. However, it's also important to recognize that publishers have made themselves vulnerable to these assaults by adopting a business model that has prioritized volume over quality. They are far from innocent victims."

What Changes Are Needed

GPTZero contends that its Hallucination Check software should be part of publishers' AI-detection toolkits. The software specifically identifies fabricated citations and references, addressing one of the most insidious forms of academic fraud.
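
GPTZero has not published its implementation, but the general idea behind automated citation checking is straightforward: look each cited work up in a public bibliographic index and flag citations with no close match. The sketch below illustrates that idea against the Crossref REST API; the `looks_real` function, the similarity threshold, and the choice of Crossref are illustrative assumptions, not GPTZero's actual method.

```python
# Generic illustration of automated citation checking: query the public
# Crossref index for each cited title and flag titles with no close match.
# This is a sketch of the idea, NOT GPTZero's Hallucination Check.
import requests
from difflib import SequenceMatcher

def looks_real(cited_title: str, threshold: float = 0.85) -> bool:
    """Return True if Crossref indexes a work whose title closely
    matches the cited title."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": cited_title, "rows": 5},
        timeout=10,
    )
    resp.raise_for_status()
    for item in resp.json()["message"]["items"]:
        for title in item.get("title", []):
            ratio = SequenceMatcher(
                None, cited_title.lower(), title.lower()
            ).ratio()
            if ratio >= threshold:
                return True
    return False

# Example: check one cited title (network access required).
print(looks_real("Attention Is All You Need"))
```

A production system would need to do more: handle works that Crossref doesn't index (such as arXiv preprints), and cross-check author lists, venues, and DOIs, since a fabricated citation can borrow a real title.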

However, technical solutions alone won't solve the problem. The academic community needs to address several systemic issues:

  1. Reviewer Overload: The more than twofold increase in submissions has stretched peer review capacity thin, leading to rushed evaluations and missed errors.

  2. Training Deficits: Many researchers lack training in the responsible use of AI for academic writing, letting hallucinated content slip into their papers unintentionally.

  3. Incentive Structures: The "publish or perish" culture, combined with the pressure to produce cutting-edge AI research quickly, creates incentives to take shortcuts.

  4. Verification Gaps: Current verification processes aren't designed to catch AI-generated content, requiring new detection methods and protocols.

The Path Forward

The NeurIPS findings represent a critical inflection point for academic integrity in AI research. As the field continues to grow exponentially, the community must balance the need for rapid innovation with the fundamental requirement of accurate, verifiable research.

This means implementing robust AI detection tools, rethinking peer review processes to handle increased volume without sacrificing quality, and establishing clear guidelines for AI-assisted research. It also requires a cultural shift toward valuing quality over quantity in academic publishing.

The stakes are high. AI research drives technological advancement, informs policy decisions, and shapes our understanding of intelligence itself. If the foundational research is contaminated with hallucinations and errors, the entire edifice of AI development rests on shaky ground.

The academic community stands at a crossroads. It can either address these integrity challenges head-on or continue down a path where the line between human and AI-generated research becomes increasingly blurred—and where the credibility of the entire field hangs in the balance.
