Report: GPTZero Finds Hundreds Of AI‑Hallucinated Citations In NeurIPS 2025 Papers

GPTZero says it reviewed more than 4,000 papers accepted at NeurIPS 2025 and found hundreds of AI‑hallucinated citations across at least 53 of them, ranging from fully fabricated references to subtle alterations of real citations. About half of the affected papers showed signs of heavy AI use. NeurIPS says reviewers were instructed to flag hallucinations in 2025 and that it is monitoring LLM usage, while GPTZero says its verification tool, followed by human review, is more than 99% accurate. The findings raise concerns about peer review at scale, reproducibility, and the reputational risks of fabricated references.

GPTZero, a Canadian company that develops AI‑detection and verification tools, says its review of more than 4,000 papers accepted and presented at NeurIPS 2025 uncovered hundreds of AI‑hallucinated citations spread across at least 53 of those papers. The findings raise fresh concerns about how large language models (LLMs) are being used in academic writing and about the limits of peer review at scale.
Key Findings
GPTZero reports that flagged issues ranged from fully fabricated references—nonexistent authors, invented paper titles, bogus journals or conferences, and dead URLs—to subtler errors where an LLM altered a real citation by expanding initials into guessed names, adding or dropping coauthors, or rewording titles. About half of the affected papers appeared to show heavy AI use or were likely AI‑generated, the company said.
Examples Of Hallucinations
- Completely made‑up entries with fake authors, titles, venues or links.
- Mashups or paraphrases that blend elements of multiple real works into a plausible but incorrect reference.
- Minor but material changes to real citations, such as invented coauthors or expanded initials that produce incorrect author names.
Methodology
According to GPTZero, its hallucination‑checker ingests a paper, parses each reference, and searches public web indexes and academic databases to confirm authors, title, venue and URL. The company says its tool is more than 99% accurate and that every flagged citation in the NeurIPS analysis was subsequently verified manually by a domain expert on GPTZero's machine‑learning team.
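For illustration only, the minimal sketch below shows what an automated reference check along these lines could look like, written in Python against the public Crossref API. The endpoint is real, but the matching thresholds, field names and the sample reference are assumptions made for the example; this is not a description of GPTZero's proprietary pipeline.

```python
# Illustrative sketch of automated citation verification (not GPTZero's pipeline).
# Assumes the bibliography has already been parsed into title/author fields and uses
# the public Crossref REST API (https://api.crossref.org) as one example of an index.
from dataclasses import dataclass
from difflib import SequenceMatcher

import requests


@dataclass
class Reference:
    title: str
    authors: list[str]  # author surnames as they appear in the bibliography


def crossref_lookup(ref: Reference, rows: int = 5) -> list[dict]:
    """Ask Crossref for works whose bibliographic data resembles this reference."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": ref.title, "rows": rows},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["message"]["items"]


def title_similarity(a: str, b: str) -> float:
    """Rough string similarity between two titles, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def check_reference(ref: Reference, min_title_sim: float = 0.9) -> dict:
    """Mark a reference 'suspect' if no indexed work matches its title and authors."""
    for item in crossref_lookup(ref):
        candidate_title = (item.get("title") or [""])[0]
        if title_similarity(ref.title, candidate_title) < min_title_sim:
            continue
        indexed_surnames = {a.get("family", "").lower() for a in item.get("author", [])}
        missing = [s for s in ref.authors if s.lower() not in indexed_surnames]
        return {
            "status": "suspect" if missing else "verified",
            "matched_title": candidate_title,
            "missing_authors": missing,  # e.g. invented or silently altered coauthors
        }
    return {"status": "suspect", "matched_title": None, "missing_authors": ref.authors}


if __name__ == "__main__":
    # Hypothetical parsed entry; a real pipeline would route any 'suspect' result to a human.
    ref = Reference(title="Attention Is All You Need", authors=["Vaswani", "Shazeer"])
    print(check_reference(ref))
```

Even when a lookup succeeds, a check like this only confirms that a reference exists and that its metadata matches; as GPTZero notes, flagged citations in its NeurIPS analysis were still verified manually by a domain expert.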
Context And Precedent
GPTZero's NeurIPS analysis follows the company's recent discovery of roughly 50 hallucinated citations in papers under review for ICLR; those submissions had not yet been accepted, and ICLR has reportedly engaged GPTZero to screen future submissions. GPTZero was founded in January 2023 and raised a $10 million Series A in 2024.
Response From NeurIPS And Implications
"The usage of LLMs in papers at AI conferences is rapidly evolving, and NeurIPS is actively monitoring developments. In previous years, we piloted policies regarding the use of LLMs, and in 2025, reviewers were instructed to flag hallucinations. Regarding the findings of this specific work, we emphasize that significantly more effort is required to determine the implications. Even if 1.1% of the papers have one or more incorrect references due to the use of LLMs, the content of the papers themselves are not necessarily invalidated. For example, authors may have given an LLM a partial description of a citation and asked the LLM to produce bibtex. As always, NeurIPS is committed to evolving the review and authorship process to best ensure scientific rigor and to identify ways that LLMs can be used to enhance author and reviewer capabilities."
Observers say the incident highlights a growing review challenge: NeurIPS' main track received 21,575 valid submissions in 2025 (up from 15,671 in 2024), making exhaustive verification of every reference by volunteer reviewers increasingly difficult. While a single incorrect reference does not automatically invalidate a paper, fabricated or incorrect citations hinder reproducibility, mislead readers, and pose reputational risks for authors, conferences and employers who rely on publication records for hiring and promotion.
Possible Responses
Experts and organizers may consider a range of mitigations, including automated citation‑verification tooling during submission or review, clearer author disclosure requirements for LLM use, reviewer training on spotting hallucinations, and manual audits for suspicious references. GPTZero argues its tool can be one such automated layer to help flag problematic citations for human review.
Bottom line: The GPTZero report spotlights a new vulnerability introduced by LLMs into academic publishing—errors that are easy for humans to miss at scale but that can materially affect reproducibility and trust in the scientific record.