Can LLMs Learn by Teaching
for Better Reasoning?
Xuefei Ning*, Zifu Wang*, Shiyao Li*, Zinan Lin*, Peiran Yao*,
Tianyu Fu, Matthew B. Blaschko, Guohao Dai, Huazhong Yang, Yu Wang
[NeurIPS 2024] [ArXiv] [Code] [Video] [Poster]
Abstract
Teaching to improve student models (e.g., knowledge distillation) is an extensively studied methodology for LLMs. However, in human education, teaching enhances not only the students but also the teachers, by fostering more rigorous and clearer reasoning as well as deeper knowledge building. We ask: Can LLMs also learn by teaching (LbT) for better reasoning? If the answer is yes, we can potentially unlock the possibility of continuously advancing the models without relying solely on human-produced data or stronger models.
In this paper, we provide a preliminary exploration of this question. We show that LbT ideas can be incorporated into existing LLM training/prompting pipelines and bring improvements. Specifically, we design three methods, each mimicking one of the three levels of LbT: observing students' feedback, learning from the feedback, and learning iteratively. The methods target either improving answer accuracy without training, or improving the models' inherent capability with fine-tuning. We reveal several findings:
Teaching materials that make it easier for students to learn (via in-context learning) have clearer and more accurate logic.
Weak-to-strong generalization: LbT might help improve strong models by teaching weak models.
Diversity in students might help: teaching multiple students could be better than teaching a single student or the teacher alone.
We hope that our exploration can inspire future research on LbT and, more broadly, the adoption of advanced education techniques to improve LLMs.
Left: The popular "learning from teacher" paradigm, where the goal is to improve the student model
Right: Our "learning by teaching" paradigm, where the goal is to improve the teacher model
Poster
Video
Method Overview
To explore this question, we draw on learning science literature to summarize three levels of LbT in human education:
L1: Observing students' feedback. The teacher instructs the students, who then provide feedback (e.g., taking exams and reporting the scores, asking questions about unclear logic).
L2: Learning from the feedback. Based on the feedback, the teacher can analyze which logic and concepts the students might have (mis)understood. This information helps the teacher improve the teaching strategy and further deepens the teacher's own understanding of the concepts.
L3: Learning from the feedback iteratively. The teacher can teach the students, observe the feedback (L1), and learn from the feedback (L2) iteratively.
In this paper, we study the viability of instantiating these LbT ideas in LLMs. There is a range of possibilities in terms of the objective, the pipeline, and the implementation. As an initial exploration, we study three methods, each for one of the three LbT levels.
M1 aims at improving LLMs' answer quality by directly utilizing students' feedback (L1). More specifically, given a set of generated answers, we score each rationale based on its ability to teach student models using in-context learning (ICL) to correctly answer similar problems. We show that aggregating multiple rationales with LbT-based scores can improve the answer accuracy. Notably, M1 improves GPT-4o's accuracy on the MATH dataset from 87.84% to 96.69%.
M2 aims at improving LLMs' inherent ability by learning from students' feedback (L2). We use the approach in M1 to score teacher-generated rationales. Then, we apply direct preference optimization (DPO) to fine-tune the teacher model with the rationale-score pairs. We show that M2 is better than using DPO with correctness scores.
M3 aims at improving LLMs' answer quality by iteratively learning from students' feedback (L3). Specifically, we prompt the LLM to reflect on the failure cases of multiple students and devise new positive and negative exemplars. We show that the LLM can improve the exemplars based on feedback from multiple students. These improved exemplars used in prompts not only improve the learning outcomes for multiple students but also enhance the teacher's performance.
To show the universality and the potential of LbT, our methods and experiments by design cover different approaches (with training or pure prompting), different objectives (improving answer quality, improving models' inherent capability, or prompt optimization), different reasoning problems (math and coding), and different LbT levels.
Method 1
Approach. One common teaching strategy in education is that the teacher first teaches students how to solve a class of problems by giving them an example rationale (named Teaching Rationale, or TR for short) and the answer (named Teaching Answer, or TA for short) to a particular question (named Teaching Problem, or TP for short). Then, the teacher asks the students to solve other similar problems (each named an Exam Problem, or EP for short) to test whether they have understood the concepts. The teacher can also learn from this process by observing the feedback (i.e., LbT level 1): if the students answer the EPs well, the TR-TA pair is likely of high quality. Method 1 simulates this process to select the best answer (TA) for a given TP, as depicted in the following figure.
Method 1 (Approach: prompting; Goal: improving answer accuracy). We first instruct the teacher model to solve a given TP multiple times, resulting in multiple TR-TA pairs. Then, each TR-TA pair is used as an in-context learning (ICL) example to guide the student model in solving a series of EPs. Based on the produced exam rationales (ERs) and exam answers (EAs), each TR-TA pair then receives an exam score (i.e., the accuracy of the EAs), denoted as the LbT score. The LbT score serves as a quality assessment of the corresponding TR-TA pair. The best TA is then selected as the final answer to the TP.
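To make the pipeline concrete, below is a minimal Python sketch of the LbT scoring and selection loop. The helpers are hypothetical placeholders rather than part of our released code: `generate(model, prompt)` queries an LLM and returns a completion, `solve_prompt` and `icl_prompt` build the teacher and student prompts, `parse_rationale_and_answer` extracts the rationale and final answer from a completion, and each exam problem is assumed to carry a `gold_answer` field. The aggregation follows the M1 (SUM) idea of summing LbT scores over rationales that share the same answer.

```python
from collections import defaultdict

def lbt_select_answer(teacher, student, tp, exam_problems, n_samples=8):
    """Select a final answer for the teaching problem `tp` via LbT scoring (M1, SUM)."""
    # Step 1: sample multiple TR-TA pairs from the teacher for the same TP.
    tr_ta_pairs = [parse_rationale_and_answer(generate(teacher, solve_prompt(tp)))
                   for _ in range(n_samples)]

    answer_scores = defaultdict(float)
    for tr, ta in tr_ta_pairs:
        # Step 2: use this TR-TA pair as a one-shot ICL example and let the
        # student solve every exam problem (EP).
        correct = 0
        for ep in exam_problems:
            completion = generate(student, icl_prompt(tp, tr, ta, ep.question))
            _, ea = parse_rationale_and_answer(completion)
            correct += int(ea == ep.gold_answer)
        # Step 3: the student's exam accuracy is the LbT score of this TR-TA pair.
        lbt_score = correct / len(exam_problems)
        # Aggregate scores over rationales that share the same final answer (SUM).
        answer_scores[ta] += lbt_score

    # Step 4: return the TA with the highest aggregated LbT score.
    return max(answer_scores, key=answer_scores.get)
```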
Results of Method 1 on 181 MATH test problems. The best result in each row is highlighted in green. "Greedy" denotes greedy decoding; "SC" denotes self-consistency; the two columns with "M1" are our methods. The "Improv." column reports the improvement in average performance achieved by M1 (SUM) over SC.
Results. The results on MATH are shown in the table above. Key takeaways are:
M1 is effective with various model settings and surpasses baselines. M1 exceeds self-consistency (SC) with various model settings: strong-teach-weak (e.g., GPT-4o teaches GPT-4o mini), weak-teach-strong (e.g., Mistral-7B teaches LLaMA3-8B), and self-teaching (e.g., LLaMA3-8B teaches itself).
M1 can further benefit from multiple students. Using GPT-3.5 to teach both LLaMA3-8B and Mistral-7B achieves a significant improvement over teaching LLaMA3-8B or Mistral-7B separately.
These results suggest the promise of M1 as a tool to continuously advance the models.
Please refer to the paper for results on the competition-level code synthesis task. We also provide a small interactive demo below:
Method 2
Approach. In education, after identifying which teaching materials (e.g., TR-TA pairs) can enhance student performance, teachers can use this information to improve their knowledge or teaching strategies. For example, if students perform poorly due to unclear or inaccurate teaching materials, teachers can correct their knowledge and avoid generating similar TR-TA pairs in the future. M2 simulates this approach to finetune and improve the LLM, as depicted in the following figure.
Method 2 (Approach: finetuning; Goal: improving LLMs' capability). We collect the LbT scores of many TR-TA pairs using M1, and use them to finetune the teacher LLM with DPO.
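As a rough illustration of how the rationale-score pairs can be turned into DPO training data, the following sketch contrasts the highest- and lowest-scoring rationales for each TP to form (prompt, chosen, rejected) triples. The `margin` threshold and the exact pairing rule are assumptions of this sketch and may differ from the construction used in the paper.

```python
def build_dpo_pairs(tp_to_scored_rationales, margin=0.2):
    """tp_to_scored_rationales: {teaching_problem: [(rationale, lbt_score), ...]}"""
    pairs = []
    for tp, scored in tp_to_scored_rationales.items():
        scored = sorted(scored, key=lambda x: x[1], reverse=True)
        best_rationale, best_score = scored[0]
        worst_rationale, worst_score = scored[-1]
        # Keep only pairs whose LbT scores are clearly separated, so the
        # preference signal reflects rationale quality rather than noise.
        if best_score - worst_score >= margin:
            pairs.append({"prompt": tp,
                          "chosen": best_rationale,
                          "rejected": worst_rationale})
    return pairs
```

The resulting list can then be loaded as a preference dataset with prompt/chosen/rejected fields and passed to an off-the-shelf DPO trainer to fine-tune the teacher model.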
Results of Method 2 on 500 MATH test problems. "Correctness-DPO" is the same as our M2, except that the scores are based on answer correctness instead of LbT.
Results. The results on MATH are shown in the table above. We can see that M2 achieves better results compared to solely using the correctness scores in DPO. This improvement is because LbT provides more informative scores than those purely based on correctness. This experiment demonstrates that LbT can also be used to improve the inherent capability of LLMs via finetuning.
Method 3
Approach. In the previous two methods, we utilize students' exam scores as the learning signal. In M3, we explore whether reflecting on students' detailed exam responses can help the teacher iteratively refine its teaching materials. Notably, we aim to verify whether these refinements can enhance the teacher's own performance by providing more effective knowledge. If so, we can assert that the iterative process of teaching, reflection, and material refinement facilitates some form of "knowledge building" for the teacher, as in humans. Additionally, we are interested in whether having multiple and diverse LLMs as students offers further benefits. The method is depicted in the following figure.
Method 3 (Approach: prompting; Goal: deriving better ICL examples). Given a classification task, we first sample positive and negative exemplars from the teacher, and then run multiple refinement iterations. Each iteration contains the following steps: (1) The current exemplars are used as the ICL examples to teach students to answer a set of EPs. The EPs are randomly sampled from the training data in each iteration. (2) We select the EPs that students answered incorrectly and prompt the teacher to reflect on why the current exemplars might have misled students in these instances. (3) Based on the reflection, the teacher generates multiple updated exemplar sets. (4) We keep the exemplar set that achieves the best teacher performance on the training data when the set is used as the ICL examples. Finally, the resulting ICL examples are used by the teacher to solve future tasks.
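A minimal sketch of this refinement loop is shown below. The helpers `generate`, `initial_exemplar_prompt`, `reflection_prompt`, `classify_with_icl`, and `score_on` are hypothetical placeholders for the prompting and evaluation code, and the EP batch size, iteration count, and number of candidate exemplar sets are illustrative choices rather than the paper's exact hyperparameters.

```python
import random

def refine_exemplars(teacher, students, train_data, n_iters=5, n_candidates=4):
    # Start from positive and negative exemplars sampled from the teacher.
    exemplars = generate(teacher, initial_exemplar_prompt())
    for _ in range(n_iters):
        # (1) Sample fresh EPs and teach the students with the current exemplars.
        exam_problems = random.sample(train_data, k=32)
        failures = [ep for student in students
                    for ep in exam_problems
                    if classify_with_icl(student, exemplars, ep) != ep.label]
        # (2)-(3) Ask the teacher to reflect on the failures and propose
        # several updated exemplar sets.
        candidates = [generate(teacher, reflection_prompt(exemplars, failures))
                      for _ in range(n_candidates)]
        # (4) Keep the exemplar set (including the current one, so the loop
        # never regresses; an assumption of this sketch) that gives the best
        # teacher performance on the training data when used as ICL examples.
        exemplars = max(candidates + [exemplars],
                        key=lambda ex: score_on(teacher, ex, train_data))
    return exemplars
```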
Results of Method 3 on the Liar dataset. The numbers are the F1 scores at the end of iteration T, where LLaMA3-70B is used as the teacher for all settings. The best results are in bold.
Results. The above table shows that LLMs are able to reflect on the failure cases of students and propose revised exemplars that improve the teacher's performance. More importantly, we observe a performance gain from having dedicated students (as opposed to using a single LLM for prompt optimization as in previous work). Compared to the scenario where the teacher and the student are the same model, having one or multiple LLMs different from the teacher as students improves the quality of the teaching material faster. This demonstrates LbT as a case of weak-to-strong generalization. We speculate that the benefit comes from the more diverse error types made by a different (weaker) student model.
Discussions and Future Work
Insights into In-Context Learning. Currently, we conduct student "learning" with ICL, based on the assumption that students can effectively "learn" from ICL examples and apply similar strategies to solve EPs. Interestingly, prior work found that correct input-output pairing in ICL examples does not matter much. At first glance, this finding seems to challenge our design, as it suggests that the TA accuracy may not affect the EA accuracy, in which case the LbT score could not reflect the quality of TR+TA. However, we find that, in contrast to the label-only ICL examples studied in prior work, providing rationales is important: LLMs can follow the problem-solving logic of a detailed rationale in the ICL example well. This may be because the rationale provides more information, making the ICL example easier to follow. Consequently, we see that students use logic similar to the TR when solving the EP. This means that a better TR+TA can indeed lead to improved ERs and thus higher EA accuracy (i.e., a higher LbT score).
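To make the contrast concrete, here is a toy illustration of a label-only ICL example versus a rationale-augmented one as used in M1. The prompt wording is hypothetical and not our actual templates.

```python
# Label-only ICL example, as studied in prior work: the student only sees the answer.
LABEL_ONLY_EXAMPLE = (
    "Q: What is 17 * 24?\n"
    "A: 408\n"
)

# Rationale-augmented ICL example, as in M1: the student sees the TR and the TA,
# and tends to reuse the demonstrated problem-solving logic on the EP, so a better
# TR translates into higher exam accuracy, i.e., a higher LbT score.
RATIONALE_EXAMPLE = (
    "Q: What is 17 * 24?\n"
    "A: 17 * 24 = 17 * (20 + 4) = 340 + 68 = 408. The answer is 408.\n"
)
```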
Weak-to-Strong Generalization. Improving models with human-generated/annotated data or with synthetic data from stronger models is the dominant paradigm. However, how can we continuously improve the strongest models without relying on human-generated and annotated data? A recent work explores using weak-model supervision to train a larger model. Our work is another attempt toward the "weak-to-strong generalization" prospect, drawing from how humans continuously acquire new knowledge without direct instruction. We demonstrate that stronger models can further improve their own results (M1), parameters (M2), and prompts (M3) by utilizing the feedback of weaker models.
M1 and M2 rely on generating/selecting similar EPs. We verify that LbT-based scoring can help select high-quality TR-TA pairs, but this requires the TP and the EPs to share similar problem-solving strategies. In our experiments, suitable EPs are selected according to human-provided information in the dataset. One extension is to let a model automatically identify EPs similar to a TP from a large pool. Another direction is to synthesize similar problems based on a group of problems and exploit the LbT principle to score many rationales for the new problems. Specifically, as a "self-instruct" extension to M2, we can generate a new problem P based on a group of problems S = {P1, · · · , Pk} that are already known to be similar. The generating-scoring pipeline can then be applied to P to obtain rationale-score pairs, where the LbT score can be easily obtained using S as the EPs.
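A rough sketch of this self-instruct-style extension might look as follows. It is speculative future work rather than an implemented method: `generate` and `synthesis_prompt` are hypothetical helpers, and `lbt_score_rationales` stands for a variant of the M1 routine that returns (rationale, LbT score) pairs instead of a single selected answer.

```python
def self_instruct_lbt(teacher, student, similar_problems, n_new=4):
    """similar_problems: a group S = {P1, ..., Pk} known to share a solving strategy."""
    scored_rationales = []
    for _ in range(n_new):
        # Synthesize a new problem P in the style of the group S.
        new_problem = generate(teacher, synthesis_prompt(similar_problems))
        # Score the teacher's rationales for P by how well they teach the
        # student to solve the original problems, i.e., S serves as the EPs.
        scored_rationales += lbt_score_rationales(teacher, student,
                                                  tp=new_problem,
                                                  exam_problems=similar_problems)
    # The resulting (rationale, score) pairs can then be used for DPO as in M2.
    return scored_rationales
```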
Additional inference cost. LbT-based scoring in M1 and M2 incurs additional inference cost, which aligns with recent studies showing that increasing inference-time compute might be a promising way to improve models' reasoning capabilities (e.g., OpenAI o1). Nevertheless, designing efficient inference algorithms and systems is needed to make these approaches more practical.
Borrowing Education Strategies to Improve LLMs. We believe that this work only scratches the surface of the potential of LbT, and other more general educational principles, in LLMs. As LLMs are becoming increasingly powerful, more advanced approaches in pedagogy can potentially help with the inference and training of LLMs. In the paper, we provide a roadmap towards this broader research agenda.
Acknowledgements
We thank Sergey Yekhanin from Microsoft Research for their support and suggestions for the work. We thank Zixuan Zhou, Chao Yu, and Boxun Li for their discussions. We thank the anonymous reviewers for their insightful suggestions.
BibTeX
@inproceedings{ning2024lbt,
title={Can {LLM}s Learn by Teaching for Better Reasoning? A Preliminary Study},
author={Xuefei Ning and Zifu Wang and Shiyao Li and Zinan Lin and Peiran Yao and Tianyu Fu and Matthew B. Blaschko and Guohao Dai and Huazhong Yang and Yu Wang},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=0ZZMUjZJYF}
}