Major Study Finds ChatGPT and Other LLMs Often Fail to Distinguish Belief from Fact
A Stanford study tested 24 large language models, including ChatGPT, Claude, DeepSeek and Gemini, with about 13,000 questions and found they often cannot reliably distinguish belief from fact. The models frequently failed to identify false beliefs, increasing the risk of hallucinations and misinformation. Researchers and outside experts warn this weakness could produce harmful errors in law, medicine, journalism and science and urge developers to improve model reliability.

Stanford study: language models struggle to separate belief from fact
A new study from Stanford University shows that many leading large language models (LLMs) — including ChatGPT, Claude, DeepSeek and Gemini — frequently fail to tell when a belief or statement is false. The research team evaluated 24 models using a battery of roughly 13,000 questions designed to probe whether models can distinguish beliefs, knowledge and objective facts.
The results were consistent and concerning: none of the tested models reliably identified false beliefs or false statements. Instead, the models often treated subjective conviction and objective truth interchangeably, increasing the risk of hallucinations and the spread of misinformation.
"As language models (LMs) increasingly infiltrate high-stakes domains such as law, medicine, journalism and science, their ability to distinguish belief from knowledge, and fact from fiction, becomes imperative," the Stanford researchers wrote. "Failure to make such distinctions can mislead diagnoses, distort judicial judgments and amplify misinformation."
Pablo Haya Coll, a researcher at the Computational Linguistics Laboratory of the Autonomous University of Madrid who was not involved in the study, said the findings reveal "a structural weakness in language models": their difficulty in robustly separating subjective conviction from objective truth, with performance shifting depending on how an assertion is framed.
The authors and outside experts warn that this limitation has practical consequences in domains where accuracy is critical. One suggested mitigation is training models to adopt a more cautious, uncertainty-aware response style; the trade-off is that such models may come across as more defensive or less informative, reducing perceived usefulness and user satisfaction.
The Stanford team urged technology companies and model developers to address these shortcomings before deploying LLMs in high-stakes settings. The full study, titled "Language models cannot reliably distinguish belief from knowledge and fact," was published in Nature Machine Intelligence.
Key facts:
- Study location: Stanford University (US)
- Models tested: 24 major LLMs, including ChatGPT, Claude, DeepSeek and Gemini
- Test size: ~13,000 evaluation questions
- Main finding: None of the models reliably recognized false beliefs or false statements
