What do we measure?
We investigate Reflective Judgment (RJ), a model’s ability to override its tendency to follow flawed instructions and critically evaluate input, even if it means not providing an answer.
Why RJ?
Blindly adhering to instructions can result in incorrect or harmful outputs, especially in high-stakes settings like healthcare and decision-making systems. Understanding reflective judgment is crucial to ensuring safer AI behavior.
How do we measure RJ?
To measure reflective judgment, we use two datasets. The first is the Basic Arithmetic Dataset (BAD), which we create with three difficulty levels: easy (single-digit addition problems), medium (two-digit problems), and hard (three-digit problems). In BAD, each question is presented with incorrect answer options only. The second dataset samples questions from MMLU across domains such as STEM and the Humanities, and similarly pairs each question with two incorrect options.
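For illustration, here is a minimal sketch (not the authors' released code) of how BAD-style arithmetic items with only incorrect answer options could be generated; the function name, distractor offsets, and number of options are assumptions made for this example.

```python
import random

# Difficulty levels of the Basic Arithmetic Dataset (BAD) as described above:
# easy = single-digit, medium = two-digit, hard = three-digit addition.
DIGIT_RANGES = {"easy": (0, 9), "medium": (10, 99), "hard": (100, 999)}

def make_bad_item(level, n_options=2, seed=None):
    """Build one addition question whose listed options are all incorrect."""
    rng = random.Random(seed)
    lo, hi = DIGIT_RANGES[level]
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    correct = a + b
    options = set()
    while len(options) < n_options:
        # Offset the true sum by a small nonzero amount so no option is correct.
        options.add(correct + rng.choice([-3, -2, -1, 1, 2, 3]))
    return {"question": f"What is {a} + {b}?",
            "options": sorted(options),
            "correct_answer": correct}

print(make_bad_item("easy", seed=0))
```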
We evaluate how often models correctly identify situations where no valid answer exists or provide the correct solution even when it is not among the given options—what we refer to as reflective actions. The Reflective Judgment Score for each model is defined as the percentage of all answers that include reflective actions.
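As a concrete illustration of the metric, the sketch below computes the Reflective Judgment Score from per-answer labels; whether each answer counts as a reflective action is assumed to be judged in a separate step.

```python
def reflective_judgment_score(is_reflective):
    """Percentage of answers that contain a reflective action.

    is_reflective: list of booleans, one per model answer, marking whether the
    answer flagged that no valid option exists or gave the correct solution
    even though it was absent from the options.
    """
    if not is_reflective:
        return 0.0
    return 100.0 * sum(is_reflective) / len(is_reflective)

# Example: 3 of 5 answers contained a reflective action -> RJ score of 60.0
print(reflective_judgment_score([True, False, True, True, False]))
```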
Our Findings:
Models excel in basic tasks, falter in complex reasoning: Language models handle simple arithmetic well but struggle with Reflective Judgment.
Training impacts critical reasoning: Base models outperform instruction-tuned and aligned variants on reflective tasks, showing that fine-tuning can reduce critical reasoning.
Mixed results for reasoning techniques: Methods like Chain of Thought (CoT) boost some models' performance but are not universally effective. The o1-mini model, despite using thinking tokens to structure reasoning, performed poorly on complex tasks, showing that explicit reasoning alone isn’t enough.
Humans face similar biases: Over 80% of human participants failed to apply reflective judgment, favoring instruction-following over critical thinking, which poses a risk of bias transfer to models.
The relationship between basic arithmetic abilities (y-axis), measured when the model is faced with questions that have one correct answer, and reflective judgment scores (x-axis). The blue-shaded area represents the confidence interval. No model achieved an accuracy below 0.5 on the BAD dataset; therefore, for clarity, the y-axis starts at 0.5.
Most fine-tuned/aligned models obtain good results when the correct option is provided but perform poorly on questions containing two incorrect options.
Performance of Llama 3.1 models (8B, 70B, 405B) and Qwen2.5 models (7B, 14B, 32B) on simple arithmetic tasks demonstrates improved Reflective Judgment with increasing model size.
We conducted an experiment with human participants and observed similar patterns: when given questions without correct options, more than 80% struggled with critical evaluation, demonstrating shared challenges in judgment. This suggests that human biases might influence models during training, highlighting the need for clearer guidelines to reduce misleading instructions and bias.
To Cite: