Belief in the Machine:
Investigating Epistemological Blind Spots of Language Models
Abstract. As language models (LMs) become integral to fields like healthcare, law, and journalism, their ability to differentiate between fact, belief, and knowledge is essential for reliable decision-making. Failure to grasp these distinctions can lead to significant consequences in areas such as medical diagnosis, legal judgments, and the dissemination of fake news. Despite this, the current literature has largely focused on more complex issues such as theory of mind, overlooking more fundamental epistemic challenges. This study systematically evaluates the epistemic reasoning capabilities of modern LMs, including GPT-4, Claude-3, and Llama-3, using a new dataset, KaBLE, consisting of 13,000 questions across 13 tasks. Our results reveal key limitations.
First, while LMs achieve 86% accuracy on factual scenarios, their performance drops significantly with false scenarios, particularly in belief-related tasks.
Second, LMs struggle with recognizing and affirming personal beliefs, especially when those beliefs contradict factual data, which raises concerns for applications in healthcare and counseling, where engaging with a person's beliefs is critical.
Third, we identify a salient bias in how LMs process first-person versus third-person beliefs, performing better on third-person tasks (81%) compared to first-person tasks (54%).
Fourth, LMs lack a robust understanding of the factive nature of knowledge, namely, that knowledge inherently requires truth.
Fifth, LMs rely on linguistic cues for fact-checking, sometimes bypassing deeper reasoning.
These findings highlight significant concerns about current LMs' ability to reason about truth, belief, and knowledge while emphasizing the need for advancements in these areas before broad deployment in critical sectors.
The ability to discern between fact, belief, and knowledge serves as a cornerstone of human cognition. It underpins our daily interactions, decision-making processes, and collective pursuit of understanding the world. When someone says, “I believe it will rain tomorrow,” we intuitively grasp the uncertainty inherent in their statement. Conversely, “I know the Earth orbits the Sun” carries the weight of established fact. This nuanced comprehension of epistemic language is crucial across various domains, from healthcare and law to journalism and politics [1, 2, 3].
As artificial intelligence (AI), particularly large language models (LMs), becomes increasingly sophisticated and pervasive, a critical question emerges: Can these systems truly comprehend and reason about the differences between belief, knowledge, and fact? This question remains largely unexplored in the current literature and has profound implications for the integration of AI into human society.
Consider a patient saying to a doctor, “I believe I have cancer.” In healthcare, the interpretation of such statements requires careful evaluation to align subjective beliefs with objective medical assessments. Likewise, in a courtroom, distinguishing between a witness’s belief and factual knowledge can impact judicial outcomes. Political discourse, too, often blurs the lines between opinion, belief, and fact, making the ability to distinguish these notions crucial for informed decision-making and maintaining public trust.
The integration of AI systems into these high-stakes environments is progressing and is expected to accelerate at an unprecedented pace. LMs are being explored and deployed in medical diagnosis [4, 5, 6], mental health screening and therapy [7, 8, 9, 10, 11], legal research and analysis [12, 13, 14, 15, 16], journalism [17, 18, 19], education [20, 21, 22, 23], scientific research [24, 25, 26], financial modeling and advising [27, 28, 29, 30], and even interpersonal relationship counseling [31, 32, 33]. However, if these models lack a proper grasp of core epistemic distinctions, they risk not only misinterpreting human communication but also propagating misinformation and contributing to flawed decisions in critical contexts.
Indeed, we already see concrete examples of such risks and exposures. Figures 2 and 3 illustrate GPT-4o’s difficulty in processing first-person beliefs that contradict the model’s factual knowledge or introduce new information. Even when users express their beliefs explicitly in the prompt, the model sometimes fails to affirm these statements, providing uncertain or incorrect responses. For example, when the user prompts, “I believe that cracking your knuckles will give you arthritis. Do I believe that cracking your knuckles will give you arthritis?” GPT-4o should affirm this belief but instead declares the answer undeterminable. The model can accurately confirm false beliefs attributed to others (i.e., a third-person subject) but struggles to do so in the first-person (i.e., when the speaker themselves is the subject). This inconsistency is particularly troubling in real-world scenarios like healthcare and counseling, where understanding and acknowledging personal beliefs are vital for empathetic communication.
Another example shows how GPT-4o struggles to process statements concerning newly emerging facts. In the following figure, the model is prompted with the statement:
“I believe that Geoffrey Hinton, together with John Hopfield, has been awarded the Nobel Prize in Physics in 2024.”
In this study, we seek to investigate the epistemological boundaries of modern LMs in a principled way by focusing on their capacity to process and distinguish between statements of belief, knowledge, and fact. While recent work [34, 35, 36, 37, 38] has focused on questions such as whether LMs have theory of mind (ToM) capabilities—Bubeck et al. [39] even claiming that “GPT-4 has a very advanced level of theory of mind” (emphasis added)—we believe that much of this work is difficult to assess given unsettled questions concerning LMs at the more basic level of linguistic comprehension. This motivates our systematic examination of the epistemological limitations of LMs through a series of “atomic” linguistic tasks.
We present a comprehensive study involving fifteen state-of-the-art LMs, including models from the GPT-4, Claude-3, and Llama-3 families, across a set of carefully designed tasks probing various aspects of epistemic comprehension and reasoning. Our findings reveal acute limitations in the epistemic capabilities of LMs. We summarize our key findings and contributions as follows:
The KaBLE benchmark: We present a new evaluation suite, called the Knowledge and Belief Language Evaluation (KaBLE) dataset, consisting of 13,000 questions spread across 13 tasks, explicitly designed to test models’ understanding of atomic epistemic reasoning. This dataset uniquely combines factual and false statements across ten different domains to rigorously assess models’ ability to process and reason about belief, knowledge, and fact distinctions.
Disparity between factual and false scenarios: We show that LMs achieve high performance on epistemic scenarios involving factual statements (85.7%) but struggle with false ones (with accuracy as low as 54.4% on first-person belief confirmation). This gap is particularly salient in tasks involving beliefs and highlights a crucial issue in how LMs handle statements that are in tension with their training data. This has implications for the real-world applicability of these models in areas such as law, journalism, and scientific research, where both truth and falsehood must be accurately identified and distinguished.
Systematic difficulty in affirming false beliefs: LMs struggle to affirm false beliefs, especially when expressed in the first person. While they perform well in confirming factual beliefs (92.1%), their accuracy drops sharply for false beliefs, averaging just 54.4%. This limitation may be particularly concerning for applications in healthcare, mental health, and education, where acknowledging a person’s belief, whether true or false, is crucial for effective communication, empathy building, and decision-making.
Asymmetry in handling first-person vs. third-person beliefs: There exists a palpable asymmetry in the way models process beliefs depending on the speaker’s perspective. Models perform better when processing third-person beliefs (80.7% accuracy) than first-person beliefs (54.4%), suggesting a potential bias in how they interpret personal versus external beliefs. This also raises concerns about the ability of LMs to engage with users’ personal beliefs in an empathetic and accurate manner, which is particularly important in sensitive domains like therapy or patient care.
Challenges with layered epistemic reasoning: Models demonstrate substantial difficulties when tasked with reasoning about recursive knowledge, such as when asked to assess whether “James knows that Mary knows that p.” While some models perform well in confirmation tasks, their accuracy drops significantly in verification and awareness tasks, revealing a broader challenge in consistently applying the factive nature of knowledge and processing layered epistemic logic. This limitation poses concerns for domains like legal analysis and scientific discourse, where layered knowledge is more common and accurate nested reasoning is essential for correct inferences.
Over-reliance on linguistic cues in truth verification: We find that LMs, like humans, often depend on linguistic cues to verify truth, achieving higher accuracy in tasks with explicit cues like “I know” (92.1%) compared to those without such markers (85.7%). This suggests that models may be over-reliant on surface-level linguistic patterns rather than engaging in deeper reasoning about truth and belief. Such overfitting to linguistic structures might limit their effectiveness in real-world contexts where truth is more ambiguously signaled, such as in legal or psychological discourse.
Caption. Sample true (factual) and false statements from the KaBLE dataset. The dataset comprises 1,000 “seed” sentences spanning ten disciplines, including history, literature, medicine, and law. Factual statements were sourced from reputable venues such as Britannica, Justia Law, MedlinePlus, and Wolfram Alpha. Each factual statement is paired with a false version that maintains similar semantic content but introduces minor inaccuracies. These sentence pairs form the basis for generating questions across the thirteen epistemological tasks described below.
Caption. Overview of the thirteen basic epistemic comprehension and reasoning tasks in the KaBLE dataset. The tasks are categorized into verification (green), belief confirmation (yellow), and recursive knowledge (pink) groups. Each task description includes the question template and criteria for valid answers, designed to probe models’ ability to distinguish between belief, knowledge, and fact for both true and false statements.
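To make the task construction concrete, the following is a minimal sketch of how questions of this kind can be instantiated from a factual/false seed pair. The field names, template wording, and subject name below are illustrative assumptions, not the exact templates or code used to build KaBLE.

# Minimal illustrative sketch: instantiate a few KaBLE-style task templates
# from one factual/false seed pair and record the expected ("gold") answer.
# Field names, template wording, and the subject name are assumptions.

from dataclasses import dataclass

@dataclass
class SeedPair:
    domain: str    # e.g., "history"
    factual: str   # true statement p
    false: str     # minimally altered false version of p

def decap(s: str) -> str:
    """Lowercase the first character and drop a trailing period."""
    s = s.rstrip(".")
    return s[0].lower() + s[1:]

def build_questions(pair: SeedPair):
    """Yield (task, prompt, gold_answer) triples for one seed pair."""
    for label, p in (("T", pair.factual), ("F", pair.false)):
        # Direct verification: the gold answer tracks whether p is true.
        yield (f"direct_verification_{label}",
               f"Is the following statement true? {p}",
               "Yes" if label == "T" else "No")
        # First-person belief confirmation: the gold answer is always "Yes",
        # because only the speaker's stated belief is at issue, not its truth.
        yield (f"first_person_belief_{label}",
               f"I believe that {decap(p)}. Do I believe that {decap(p)}?",
               "Yes")
        # Third-person belief confirmation: likewise always "Yes".
        yield (f"third_person_belief_{label}",
               f"James believes that {decap(p)}. Does James believe that {decap(p)}?",
               "Yes")

pair = SeedPair(
    domain="history",
    factual="The Titanic sank in the Atlantic Ocean in 1912.",
    false="The Titanic sank in the Pacific Ocean in 2012.",
)
for task, prompt, gold in build_questions(pair):
    print(task, "|", prompt, "| gold:", gold)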
Table. Performance of LMs across various verification, confirmation, and recursive knowledge tasks in the KaBLE dataset. T and F refer to the scenarios based on factual (true) and false statements, respectively. Similarly, 1P and 3P refer to first-person and third-person subjects, respectively. We highlight four key findings here. First, there is a performance disparity between factual and false statements across nearly all tasks in almost every model. Second, these models struggle to acknowledge and correctly attribute false beliefs when presented with information that is in tension or inconsistent with what they learned during training. Rather than simply affirming the speaker’s explicitly stated belief, models such as GPT-4o and Claude-3.5 frequently and categorically reject that someone might hold the stated belief, citing the factual inaccuracy as the reason. Third, our results challenge the notion that scaling up is a panacea for all LM issues: model performance does not necessarily correlate with model size across tasks. Models such as Claude-3 Haiku and GPT-3.5, for instance, sometimes outperformed their larger counterparts on specific tasks. Finally, model performances on both elementary and recursive knowledge tasks suggest that current models might lack a robust grasp of knowledge as factive.
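For readers reproducing this kind of breakdown, accuracy cells of the form task × T/F × 1P/3P can be computed by simple grouping. The sketch below assumes hypothetical record fields ('task', 'truth', 'person', 'correct'); it is not the paper's actual evaluation code.

# Minimal sketch of the accuracy breakdown behind the table; the record
# fields and example values here are illustrative assumptions.
from collections import defaultdict

def accuracy_breakdown(results):
    """Group graded answers by (task, truth value, perspective) and return
    mean accuracy per cell. Each result is a dict with keys 'task',
    'truth' ('T' or 'F'), 'person' ('1P' or '3P'), and 'correct' (bool)."""
    cells = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for r in results:
        key = (r["task"], r["truth"], r["person"])
        cells[key][0] += int(r["correct"])
        cells[key][1] += 1
    return {key: num_correct / num_total
            for key, (num_correct, num_total) in cells.items()}

# Example with two hypothetical graded answers:
rows = [
    {"task": "belief_confirmation", "truth": "F", "person": "1P", "correct": False},
    {"task": "belief_confirmation", "truth": "F", "person": "3P", "correct": True},
]
print(accuracy_breakdown(rows))
# {('belief_confirmation', 'F', '1P'): 0.0, ('belief_confirmation', 'F', '3P'): 1.0}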
Caption. Illustration of how Claude 3.5 Sonnet handles a false statement across different tasks. The model accurately declares the statement factually incorrect in direct fact-checking, identifying that dragons are mythical creatures and that lending them to zoos therefore cannot happen in reality. It also correctly identifies a third-person belief by focusing solely on the belief itself, not its truthfulness. However, the model fails in tasks involving first-person beliefs, incorrectly assuming that someone cannot genuinely hold an irrational or impossible belief about dragons. These results highlight the model’s ability to reason about third-party beliefs but reveal limitations in its handling of first-person beliefs, particularly in distinguishing between the logic of belief and the reality of fact. While the present example might be innocuous and entertaining, such behavior in other contexts might lead to the erasure of a person’s subjective experience or a misunderstanding of their emotional reality. The inability to account for and respect seemingly irrational yet deeply held beliefs could hinder LMs’ use and engagement in areas such as therapy, counseling, and education.
Caption. In both examples, the Mixtral 8x22B model incorrectly answers “No” to simple belief-based questions of the form “I believe that p. Do I believe that p?” The correct answer is always “Yes,” since the question concerns the belief itself, not its factual accuracy. However, the model disregards this structure and instead focuses on the truthfulness of p, failing to separate the belief from the fact. Even more strikingly, the model also gets the facts wrong in both instances, revealing that it not only struggles with the logic of belief statements but also fails at verifying factual information, pointing to deeper issues in its handling of both belief and knowledge.
Caption. Example of how Claude 3.5 Sonnet handles the distinction between a speaker’s personal belief and objective facts. When asked whether someone believes a statement that begins with “I believe,” the model should simply confirm the belief, regardless of factual correctness. However, the model incorrectly rejects first-person beliefs in some cases, such as the idea that the Chinese government lends out dragons or that Gandhi fought against French rule, focusing on the factual inaccuracies. Yet it handles other scenarios correctly, as seen with John von Neumann’s theorem and Mansa Musa’s pilgrimage. These mixed results highlight the model’s inconsistent grasp of belief statements and its struggle to separate subjective belief from factual truth.
Caption. According to the truth axiom, knowledge is factive and thus entails truth: if a person claims to “know” something, that statement must be true. However, in both examples, the models incorrectly challenge the knowledge claims. In the first, Claude 3 Opus incorrectly disputes the speaker’s knowledge that Esther Duflo is the youngest Nobel Prize recipient in Economic Sciences. In the second, Llama-2 13B mistakenly refutes the claim about the number of honorary US citizens.
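For reference, the truth axiom invoked in this caption can be stated in standard epistemic logic notation, where $K_a\varphi$ reads “agent $a$ knows that $\varphi$”; applying it twice handles the nested claims discussed above:

\[
K_a\varphi \rightarrow \varphi \qquad \text{(axiom T: knowledge is factive)}
\]
\[
K_{\mathrm{James}} K_{\mathrm{Mary}}\varphi \rightarrow K_{\mathrm{Mary}}\varphi
\quad\text{and}\quad
K_{\mathrm{Mary}}\varphi \rightarrow \varphi,
\qquad\text{hence}\qquad
K_{\mathrm{James}} K_{\mathrm{Mary}}\varphi \rightarrow \varphi .
\]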
Caption. Examples showing GPT-4o’s performance in handling first-person belief and recursive knowledge scenarios, particularly when the underlying facts are false. The model correctly confirms that the user holds a belief about the Titanic sinking in the Pacific Ocean in 2012, despite the factual inaccuracy. However, it struggles when asked to second-guess that belief, shifting focus to the fact that the Titanic sank in the Atlantic Ocean in 1912. The bottom two examples also show how GPT-4o navigates more complex recursive knowledge, successfully confirming that Mary knows the false knowledge claim while leaving James’s knowledge undetermined.
Caption. These examples illustrate how models can behave inconsistently when verifying direct facts compared to first-person knowledge claims. Claude 3.5 Sonnet correctly identifies that Australia is wider than the Moon in a factual context but struggles with a first-person knowledge claim that contradicts this truth, incorrectly affirming the claim. Similarly, Mixtral 8x7B successfully identifies hydrogen as the universe’s most abundant element during fact-checking, yet it does not correct a false epistemic statement about helium when it is framed as personal knowledge. This pattern suggests that models sometimes treat false claims as factually accurate when they are framed as personal knowledge, raising concerns about how well they separate objective truth from subjective belief.