This is clear, since science and mathematics problems do not deal with text alone: they include special symbols that denote abstract objects. In this subpage, we explore some of the research done on improving LLMs at the task of solving science and math problems.
Measuring Mathematical Problem Solving with the MATH Dataset https://arxiv.org/abs/2103.03874 (Hendrycks, Burns, Kadavath, Arora, Basart, Tang, Song, and Steinhardt, 2021)
Large Language Models are Zero-Shot Reasoners https://arxiv.org/pdf/2205.11916
Solving Quantitative Reasoning Problems with Language Models https://arxiv.org/abs/2206.14858
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models https://arxiv.org/abs/2201.11903
Self-Consistency Improves Chain of Thought Reasoning in Language Models https://arxiv.org/abs/2203.11171
Large Language Models for Mathematical Reasoning: Progresses and Challenges https://arxiv.org/abs/2402.00157
Tree of Thoughts (ToT): Deliberate Problem Solving with Large Language Models https://arxiv.org/abs/2305.10601
Buffer of Thoughts (BoT): Thought-Augmented Reasoning with Large Language Models https://arxiv.org/pdf/2406.04271 (Given a specific problem, they retrieve a relevant thought-template and adaptively instantiate it with specific reasoning structures to conduct efficient reasoning)
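The annotation above only sketches the mechanism, so here is a minimal illustration of the retrieve-and-instantiate idea, assuming a hypothetical llm placeholder, two made-up thought-templates, and a toy keyword-overlap retrieval; the paper's meta-buffer, retrieval, and instantiation are considerably more sophisticated.

```python
def llm(prompt: str) -> str:
    """Placeholder for an LLM call; replace with a real API call."""
    raise NotImplementedError

# Toy "meta-buffer" of reusable thought-templates (illustrative, not from the paper).
META_BUFFER = {
    "equation solve unknown variable": (
        "Thought-template: isolate the unknown step by step, then verify by "
        "substituting the solution back into the original equation."),
    "rate speed time distance": (
        "Thought-template: identify the quantities, write rate = amount / time, "
        "relate the rates with one equation, then solve it."),
}

def retrieve_template(problem: str) -> str:
    """Toy retrieval: pick the template whose keywords overlap the problem most."""
    words = set(problem.lower().split())
    best_key = max(META_BUFFER, key=lambda k: len(words & set(k.split())))
    return META_BUFFER[best_key]

def buffer_of_thoughts(problem: str) -> str:
    """Instantiate the retrieved template into a concrete reasoning prompt."""
    template = retrieve_template(problem)
    prompt = (template + "\n\nInstantiate this reasoning structure for the "
              "problem below and give the final answer.\n\nProblem: " + problem)
    return llm(prompt)
```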
TO COT OR NOT TO COT? CHAIN-OF-THOUGHT HELPS MAINLY ON MATH AND SYMBOLIC REASONING https://arxiv.org/pdf/2409.12183 (Sept. 2024) Main result: CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks
Better Zero-Shot Reasoning with Role-Play Prompting https://arxiv.org/html/2308.07702v2 (role-play prompting can be combined with zero-shot prompting in class)
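A minimal sketch of combining role-play with zero-shot reasoning, assuming a hypothetical llm placeholder; the role text and the single-turn wording are illustrative, and the paper's own prompts differ.

```python
def llm(prompt: str) -> str:
    """Placeholder for an LLM call; replace with a real API call."""
    raise NotImplementedError

def role_play_zero_shot(question: str) -> str:
    """Assign a role, then ask the question with a zero-shot reasoning cue."""
    role = ("You are a patient math teacher who explains solutions carefully "
            "and checks every step.")
    return llm(role + "\n\nQuestion: " + question + "\nLet's think step by step.")
```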
Faithful Logical Reasoning via Symbolic Chain-of-Thought https://arxiv.org/abs/2405.18357
MAMMOTH: BUILDING MATH GENERALIST MODELS THROUGH HYBRID INSTRUCTION TUNING https://arxiv.org/pdf/2309.05653 (uses hybrid Chain-of-Thought and Program-of-Thoughts rationales to achieve higher accuracy on the MATH dataset)
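To make the Chain-of-Thought vs. Program-of-Thoughts contrast concrete, here is a minimal Program-of-Thoughts-style sketch: instead of asking for a natural-language derivation, the model is asked to emit Python code that computes the answer, which is then executed. The llm placeholder and the prompt wording are assumptions for illustration; MAmmoTH's actual instruction-tuning data and prompts differ.

```python
def llm(prompt: str) -> str:
    """Placeholder for an LLM call; replace with a real API call."""
    raise NotImplementedError

# Illustrative prompt asking the model to answer with code rather than prose.
POT_PROMPT = (
    "Write Python code that computes the answer to the problem below and "
    "stores it in a variable named `answer`. Return only the code.\n\n"
    "Problem: {problem}")

def program_of_thoughts(problem: str):
    """Ask the model for code instead of a prose derivation, then execute it."""
    code = llm(POT_PROMPT.format(problem=problem))
    scope = {}
    exec(code, scope)  # note: execute model-written code only in a sandbox
    return scope.get("answer")
```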
MATHVISTA: EVALUATING MATHEMATICAL REASONING OF FOUNDATION MODELS IN VISUAL CONTEXTS https://arxiv.org/pdf/2310.02255 A new dataset, MATHVISTA, for benchmarking mathematical reasoning in visual contexts. This paper also discusses the emergent "self-verification" capability of GPT-4V, where GPT-4V is able to recognize that some answers are not possible due to the constraints of a problem.
CHAIN-OF-VERIFICATION REDUCES HALLUCINATION IN LARGE LANGUAGE MODELS https://arxiv.org/pdf/2309.11495 The Chain-of-Verification (COVE) steps are as follows. For a given question, the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response.
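A minimal sketch of the four COVE steps described above, assuming a hypothetical llm placeholder and illustrative prompt wordings rather than the paper's own prompts. The verification questions are answered in separate calls so that they are not conditioned on the draft answer, which is the point of step (iii).

```python
def llm(prompt: str) -> str:
    """Placeholder for an LLM call; replace with a real API call."""
    raise NotImplementedError

def chain_of_verification(question: str) -> str:
    # (i) Draft an initial response.
    draft = llm("Answer the question.\n\nQuestion: " + question)

    # (ii) Plan verification questions that fact-check the draft.
    plan = llm("List short verification questions, one per line, that would "
               "fact-check this answer.\n\nQuestion: " + question +
               "\nDraft: " + draft)
    checks = [q.strip() for q in plan.splitlines() if q.strip()]

    # (iii) Answer each verification question independently of the draft.
    verifications = [(q, llm("Answer concisely: " + q)) for q in checks]

    # (iv) Produce the final, revised answer given the verification results.
    evidence = "\n".join("Q: " + q + "\nA: " + a for q, a in verifications)
    return llm("Question: " + question + "\nDraft answer: " + draft +
               "\nVerification results:\n" + evidence +
               "\nRevise the draft if needed and give the final answer.")
```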