Recent work has shown that language models (LMs) perform better on complex tasks when prompted to break down their reasoning into explicit steps, a technique known as chain-of-thought (CoT) prompting. However, it remains unclear how these intermediate steps influence the final answer and whether it is possible to identify which steps are essential and which are potentially misleading. In this project, you will apply attribution techniques to analyze how different components of CoT reasoning contribute to model outputs and examine how this relates to answer correctness across various reasoning tasks.
Chain-of-thought (CoT) prompting helps language models solve complex problems by breaking them down into smaller steps. However, not all steps in the reasoning chain may be equally important or correct.
But how do we know which reasoning steps were crucial for arriving at the correct answer? Attribution methods can help us "look inside" the model to identify which parts of the reasoning chain had the strongest impact on the final answer. A simple option is to compare model outputs with and without specific reasoning steps, to causally estimate the impact of removing an intermediate reasoning step on the output. However, this can be very expensive in practice, so step importance is often approximated using gradients, which can also be used, for example, to compare the effect of alternative formulations.
Recent work applied such approaches to estimate the importance of contextual information for applications such as machine translation and retrieval-augmented generation.
Consider the following example:
➕➕➕➕➕
Question: Is it possible to capture giant squids in their natural habitat with no gear?
Instructions: Answer STRICTLY in English and ONLY after thinking step by step. Provide a clear division of thinking steps, keeping the steps concise. Start your response with "Step 1", and use "Answer:" for the last step, using ONLY True/False as an answer.
The model answers:
> Step 1: Consider the physical constraints and capabilities of humans.
Humans possess certain limitations in terms of our sensory organs and physical abilities that make Beakworth's vision possible, which is not available to the average human.
Step 2: Reflect on the feasibility of humans alone capturing a giant squid in its natural habitat.
Despite the exceptional vision capability described in Beakworth's vision, the independent capture of a giant squid in its natural habitat poses significant difficulties due to the sheer size and stealthiness of the animal.
Step 3: Evaluate modern technology and human trials.
Modern science has developed significant technology that shows promise in attempts at capturing large sea creatures, such as roped yachts used for whale tracking. However, capturing a floating squid in its natural environment requires specialized equipment.
Step 4: Assess the lack of specific outdoor gear mentioned in the instruction.
The instruction specifically mentions "no gear," implying the absence of specialized equipment specifically designed for capturing or observing the largest known squid species.
Step 5: Conclude whether capturing in natural habitat is impossible with no gear.
Given the inherent difficulties in capturing such a massive animal, the straightforward conclusion is that using only humans and special vision gear, without any proper equipment, is not possible.
Answer: False
➕➕➕➕➕
False is the correct answer, and the model assigns a certain probability P_CoT(False) to False being the correct choice in this context, given the preceding chain of thought. If we instead remove the requirement to produce a chain of thought, we obtain the same answer in this case:
➕➕➕➕➕
Question: Is it possible to capture giant squids in their natural habitat with no gear?
Instructions: Answer STRICTLY in English using ONLY Answer: True/False in your response.
> Answer: False
➕➕➕➕➕
The new probability P_No_CoT(False) will most likely differ from P_CoT(False). Using attribution, we can, for example, pinpoint that, in the example above, Step 3 of the CoT plays a significantly more important role than the others in increasing P_CoT(False), since it explicitly states that "capturing a floating squid in its natural environment requires specialized equipment". The same setting can then be tested again by manually removing the step found to be important from the chain of thought, and P_Modified_CoT(False) can be evaluated to verify whether the expected drop in probability is observed and whether P_Modified_CoT(True) becomes the preferred choice in this case.
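For concreteness, the sketch below shows one possible way to estimate these probabilities with Hugging Face transformers by scoring the answer token(s) under different contexts. The prompt format, the placeholder step texts, and treating the answer as a plain continuation (no chat template) are simplifying assumptions; adapt them to your actual prompting setup.

```python
# Minimal sketch: score P_CoT, P_No_CoT and P_Modified_CoT as the probability of the
# answer tokens given different contexts. Prompt layout and step texts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def answer_probability(context: str, answer: str) -> float:
    """Probability the model assigns to `answer` as a continuation of `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    ans_ids = tokenizer(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, ans_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Position t predicts token t+1, so the answer tokens are predicted by the
    # logits starting at the last context position.
    log_probs = torch.log_softmax(logits[0, ctx_ids.shape[1] - 1 : -1].float(), dim=-1)
    token_log_probs = log_probs.gather(1, ans_ids[0].unsqueeze(1)).squeeze(1)
    return token_log_probs.sum().exp().item()

question = "Is it possible to capture giant squids in their natural habitat with no gear?"
cot_steps = ["Step 1: ...", "Step 2: ...", "Step 3: ...", "Step 4: ...", "Step 5: ..."]  # parsed CoT steps

p_cot = answer_probability(question + "\n" + "\n".join(cot_steps) + "\nAnswer:", " False")
p_no_cot = answer_probability(question + "\nAnswer:", " False")
# Remove the step identified as most important (here: Step 3) and re-score the answer
modified_steps = [s for i, s in enumerate(cot_steps) if i != 2]
p_modified_cot = answer_probability(question + "\n" + "\n".join(modified_steps) + "\nAnswer:", " False")
print(f"P_CoT={p_cot:.3f}  P_No_CoT={p_no_cot:.3f}  P_Modified_CoT={p_modified_cot:.3f}")
```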
Important: Note how, in the example above, explicit instructions are provided to induce an answer structure that makes it easy to separate the CoT steps from the final answer for the purpose of the analysis. A more complex but even better alternative is to use structured generation with Outlines.
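As a lighter-weight alternative to full structured generation, the snippet below sketches how the induced format can be parsed after the fact; the regular expressions assume the "Step N:" / "Answer: True/False" structure requested in the instructions above.

```python
# Minimal sketch: split a generation following the induced format into CoT steps
# and the final True/False answer.
import re

def split_cot(generation: str) -> tuple[list[str], str | None]:
    """Return the list of reasoning steps and the final True/False answer (if any)."""
    answer_match = re.search(r"Answer:\s*(True|False)", generation)
    answer = answer_match.group(1) if answer_match else None
    reasoning = generation[: answer_match.start()] if answer_match else generation
    # Split on the "Step N:" markers and drop empty chunks
    steps = [s.strip() for s in re.split(r"Step \d+:", reasoning) if s.strip()]
    return steps, answer

example = "Step 1: Consider the physical constraints...\nStep 2: Reflect on feasibility...\nAnswer: False"
steps, answer = split_cot(example)
print(len(steps), answer)  # -> 2 False
```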
Your core task will be to apply the same attribution framework to different types of reasoning problems to understand how models use their intermediate steps and to identify patterns in successful versus unsuccessful reasoning chains. More specifically, you will test the performance of a manageable large language model, Qwen2.5-1.5B-Instruct, on a dataset requiring multi-step reasoning, with and without chain-of-thought prompting. For the examples correctly answered by the model in the CoT setting, you will identify the most salient reasoning step contributing to the model's answer using an attribution method, and verify the average drop in the original answer probability and the number of flipped answers (e.g. False becomes True) produced when omitting that step.
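The two summary quantities can be computed as in the minimal sketch below; the record layout and the numbers are placeholders for the probabilities you would obtain before and after removing the most salient step.

```python
# Placeholder records over examples answered correctly in the CoT setting:
# p_cot / p_modified are the answer probabilities before and after removing the
# most salient step; flipped marks answers that changed (e.g. False -> True).
records = [
    {"p_cot": 0.91, "p_modified": 0.48, "flipped": True},
    {"p_cot": 0.84, "p_modified": 0.79, "flipped": False},
]

avg_drop = sum(r["p_cot"] - r["p_modified"] for r in records) / len(records)
n_flipped = sum(r["flipped"] for r in records)
print(f"Average probability drop: {avg_drop:.3f} | Flipped answers: {n_flipped}/{len(records)}")
```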
For the core task, you will need to test the performance of an LLM as shown above on one reasoning-intensive dataset. We suggest picking one of the following:
REVEAL (Jacovi et al. 2024): Open-domain question answering on various topics. Provides precomputed CoT steps produced by several large LLMs.
GSM8K (Cobbe et al. 2021): Basic mathematical word problems requiring multi-step reasoning, with reference answers written as multi-step solutions. MGSM is a version of GSM8K translated into 11 languages that can be used to test cross-lingual consistency (see below).
StrategyQA (Geva et al. 2021): Open-domain question answering on various topics, including a decomposition of the main question into sub-questions that can be used to guide the reasoning process.
For the core task, you must test the performance of at least one LLM, which should be selected based on your available computational resources. As mentioned above, we recommend Qwen2.5-1.5B-Instruct since it has a manageable size but still performs well. However, you are free to try other models (e.g. the larger Qwen 2.5 models with 3B and 7B parameters, or the LLaMA 3.2 models with 1B and 3B parameters).
Note: You should always use "-Instruct" models for this project, rather than their base versions.
The Inseq library provides a simple interface to attribute the answers of LMs.
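The sketch below illustrates one possible way to attribute a forced answer with Inseq, assuming the prompt and answer strings from the example above; the method name ("saliency") is one of several supported by the library, and the exact handling of chat templates and decoder-only models should be checked against the Inseq documentation.

```python
import inseq

# Load the model with a gradient-based attribution method
# (e.g. "saliency", "integrated_gradients", "attention").
model = inseq.load_model("Qwen/Qwen2.5-1.5B-Instruct", "saliency")

# Placeholder prompt: the question, the instructions, and the generated CoT steps.
prompt = (
    "Question: Is it possible to capture giant squids in their natural habitat with no gear?\n"
    "Step 1: ...\nStep 2: ...\nStep 3: ...\nStep 4: ...\nStep 5: ...\n"
)

# For decoder-only models, the generated text should contain the prompt as a prefix,
# so that only the appended answer tokens are attributed.
out = model.attribute(
    input_texts=prompt,
    generated_texts=prompt + "Answer: False",
)
out.show()
```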
Compare different tasks or languages: In this direction, you might compare the model's answer accuracy and attributions across several datasets (e.g. mathematical reasoning in GSM8K and open-domain QA in REVEAL) to verify whether the same trends are observed. Alternatively, you might compare performance on MGSM across more than one language to see whether the same reasoning steps receive different importance depending on the language. Important: make sure that the LLM you are using performs sufficiently well in the selected language before proceeding!
Compare different attribution techniques: Given the availability of various attribution methods (for example raw gradients, integrated gradients, occlusion, and attention weights), check what proportion of the CoT steps identified as most relevant agrees across two or more attribution techniques, and which technique is the most faithful (i.e. the one for which removing the steps it identifies as most relevant yields the largest difference between P_CoT and P_Modified_CoT).
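The agreement check can be as simple as the sketch below, which assumes you have already computed, for each example, the index of the most relevant step under each attribution method (all values shown are placeholders).

```python
# Index of the most relevant CoT step per example, per method (placeholder values).
top_step = {
    "saliency": [2, 0, 1, 2],
    "integrated_gradients": [2, 1, 1, 0],
}

a, b = top_step["saliency"], top_step["integrated_gradients"]
agreement = sum(x == y for x, y in zip(a, b)) / len(a)
print(f"Top-step agreement: {agreement:.0%}")

# Faithfulness: the method whose selected steps, once removed, produce the largest
# average P_CoT - P_Modified_CoT drop (placeholder values).
avg_drop = {"saliency": 0.31, "integrated_gradients": 0.44}
print("Most faithful method:", max(avg_drop, key=avg_drop.get))
```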
[🏆 CHALLENGE] Compare CoT and R1 Reasoning models: The new DeepSeek-R1-Distill-Qwen models are obtained by distilling the reasoning procedure of larger LLMs into smaller LMs, resulting in much stronger performance on reasoning tasks. Some relevant questions you can pursue in this direction are: 1) How does the task accuracy of DeepSeek-R1-Distill-Qwen-1.5B compare to that of the original Qwen2.5-1.5B-Instruct on the dataset you used? 2) How long are the reasoning chains produced by the R1 model compared to those of the original Qwen? 3) Does the model reach higher accuracy on questions with longer reasoning chains compared to those with shorter thoughts? 4) While it can be hard to separate R1 thoughts into sections, are the attributions for the given answer spread across the whole thinking section, or are they concentrated in specific sentences?
Some references might be less relevant depending on your research direction, so make sure to read those that are!
CoT:
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS 2022.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. NeurIPS 2022.
LM Attribution:
(Inseq) Gabriele Sarti, Nils Feldhus, Ludwig Sickert, Oskar van der Wal, Malvina Nissim, Arianna Bisazza. 2023. Inseq: An Interpretability Toolkit for Sequence Generation Models. ACL Demo 2023.
(Inseq) Gabriele Sarti, Nils Feldhus, Jirui Qi, Malvina Nissim, Arianna Bisazza. 2024. Democratizing Advanced Attribution Analyses of Generative Language Models with the Inseq Toolkit. xAI 2024.
(Context Attribution for MT) Gabriele Sarti, Grzegorz Chrupala, Malvina Nissim, Arianna Bisazza. 2024. Quantifying the Plausibility of Context Reliance in Neural Machine Translation. ICLR 2024.
(Context Attribution for RAG) Jirui Qi, Gabriele Sarti, Raquel Fernandez, Arianna Bisazza. 2024. Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation. EMNLP 2024.
(Generic Context Attribution) Benjamin Cohen-Wang, Harshay Shah, Kristian Georgiev, Aleksander Madry. 2024. ContextCite: Attributing Model Generation to Context. NeurIPS 2024.
Datasets:
(GSM8K) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman. 2021. Training Verifiers to Solve Math Word Problems.
(MGSM) Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, Jason Wei. 2023. Language models are multilingual chain-of-thought reasoners. ICLR 2023.
(StrategyQA) Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, Jonathan Berant. 2021. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. TACL 2021.
(REVEAL) Alon Jacovi, Yonatan Bitton, Bernd Bohnet, Jonathan Herzig, Or Honovich, Michael Tseng, Michael Collins, Roee Aharoni, Mor Geva. 2024. A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains. ACL 2024.
Models:
(Qwen 2.5) Qwen Team. 2024. Qwen 2.5 Technical Report.
(LLaMA 3.2) Meta. 2024. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models.
(DeepSeek R1) DeepSeek-AI. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.