Code-assisted reasoning has been instrumental in enhancing the reasoning abilities of Large Language Models (LLMs). However, the key factors underlying its effectiveness are not well understood. In this paper, we analyze code-assisted reasoning through the lens of calibration, using Expected Calibration Error (ECE) as our metric. Our extensive evaluation spans 5 datasets and 5 LLMs (open-source: LLaMA, Vicuna; closed-source: GPT-3.5, Text-Davinci, and others), demonstrating that LLMs using code-assisted reasoning not only increase their accuracy by 16%, but, crucially, also improve their calibration, with an average reduction of 20% in ECE. Our qualitative analysis uncovers a recurring structure in the reasoning process across all tasks, where code-based prompts induce an initialization, result-calculation, and result-output procedure. We find that the initialization and output steps are standardized, which improves the consistency of code generation in these stages and reduces the potential for errors in subsequent steps. Our findings suggest that better calibration, aided by the consistent reasoning structure of code prompts, could be a factor in the effectiveness of leveraging code for reasoning.
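For reference, ECE is commonly computed by binning predictions by confidence and taking the sample-weighted average of the gap between accuracy and mean confidence in each bin. The sketch below is a minimal, generic implementation of that binned definition; the bin count and the way confidence is elicited from an LLM are assumptions for illustration, not details taken from this abstract.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: split [0, 1] into equal-width confidence bins and
    average |accuracy - mean confidence| per bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)
    # Assign each prediction to a bin; clip so confidence == 1.0 lands in the last bin.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        bin_acc = correct[mask].mean()       # empirical accuracy in this bin
        bin_conf = confidences[mask].mean()  # average stated confidence in this bin
        ece += (mask.sum() / n) * abs(bin_acc - bin_conf)
    return ece

# Toy usage with made-up numbers: confidences might come from self-reported
# probabilities or answer consistency across samples (an assumption here).
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))
```

A lower ECE means the model's stated confidence tracks its actual accuracy more closely, which is the sense in which code-assisted reasoning is said to improve calibration above.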
Literature Survey: https://docs.google.com/document/d/1jYzNnOH51_SKbXAw6XqCsmj3wJmcHvaG4m5apuShU_Y/edit
This is work in progress, and we have some exciting insights.
Please reach out to anubhak@andrew.cmu.edu if this idea excites you or you would like to learn more.