CRBC News

Study Finds ChatGPT and Other AI Chatbots Often Confuse Fact with Belief — Potential Risks for Law, Medicine and Journalism

Stanford researchers tested 24 large language models with roughly 13,000 questions and found many systems still struggle to distinguish fact from belief. Newer models (released in or after May 2024, including GPT-4o) scored about 91% accuracy, while older models scored roughly 71.5%–84.8%. The paper warns that models lean on inconsistent reasoning rather than genuine epistemic understanding, a gap that could produce dangerous errors in law, medicine and journalism, and it urges urgent fixes and stronger verification.

AI Struggles to Tell Truth from Belief, Stanford Study Warns

A new paper in Nature Machine Intelligence, authored by researchers at Stanford University, reports that many leading large language models (LLMs) have difficulty distinguishing facts from beliefs. The authors conclude that most models "lack a robust understanding of the factive nature of knowledge — that knowledge inherently requires truth," a shortcoming that could have real-world consequences as these systems are adopted in sensitive fields.

How the study was done

The team evaluated 24 LLMs (including Claude, ChatGPT, DeepSeek and Gemini) using roughly 13,000 carefully designed questions aimed at probing differences among beliefs, knowledge and facts. Their analysis showed that while newer models have improved, many systems still rely on inconsistent reasoning and appear to use superficial pattern matching rather than a deep epistemic understanding of what constitutes knowledge.

Key findings

  • Newer models released in or after May 2024 (including GPT-4o) scored between about 91.1% and 91.5% accuracy when distinguishing true from false statements.
  • Older models performed worse, with accuracy ranging roughly from 71.5% to 84.8%.
  • The paper attributes remaining errors to inconsistent reasoning strategies rather than robust understanding, increasing the risk of harmful mistakes in high-stakes settings.

Real-world examples and implications

The findings echo recent, widely circulated examples of AI errors. In one LinkedIn post, UK investor David Grunwald said he asked the Grok model to create a poster of the last ten British prime ministers; the image reportedly contained obvious mistakes (for example, misnaming Rishi Sunak and giving implausible dates for Theresa May).

Researchers warn that such failures could mislead medical diagnoses, distort judicial decisions and amplify misinformation if models are used without appropriate safeguards. Pablo Haya Coll, a computational linguistics expert at the Autonomous University of Madrid (not involved in the study), suggested training models to adopt a more cautious or qualified tone when uncertain — a fix that might reduce harmful hallucinations but could also limit usability in some cases.

Broader usage trends increase concern: an Adobe Express survey found that 77% of U.S. ChatGPT users treat it like a search engine, and about 30% say they trust it more than conventional search tools. In a concrete legal example, a California judge fined two law firms $31,000 after finding they had included machine-generated misinformation in a legal brief without adequate verification.

Recommendations

The study calls for "urgent improvements" before widespread deployment of these models in high-stakes domains such as law, medicine and journalism. Recommended measures include improved training objectives that encode epistemic distinctions, clearer uncertainty signaling, and stronger verification and human oversight when AI is used for factual decision-making.

Bottom line: While LLMs are improving, significant gaps remain in their ability to reliably separate fact from belief. Organizations using these systems should proceed cautiously and build robust verification and oversight into any workflow that depends on accurate knowledge.