Enhancing Programming Concepts Understanding in LLMs via Automated Counterfactual Code Augmentation
Abstract
Although large language models (LLMs) have shown promising performance across various code-related tasks, recent research reveals that they cannot precisely understand programming concepts such as data flow and control flow. Such limitations hinder the practical use of LLMs in real-world software development. In this paper, to address these limitations, we propose ProCURE, a counterfactual code augmentation framework that boosts the programming-concept understanding of LLMs. Specifically, our framework consists of two key components: an automated counterfactual program generation pipeline and a concept-aware instruction fine-tuning method. To measure the usefulness of our framework, we conducted a comprehensive evaluation on open-source LLMs, including Llama3.1-8B, CodeLlama-13B, and StarCoder-7B, using three widely recognized benchmark datasets: HumanEval, MBPP, and CodeContests. The results demonstrate that our framework automatically constructs a high-quality counterfactual dataset with a success rate of 97.51%, and significantly enhances the models' understanding of programming concepts, achieving an 18.77% improvement in the concept consistency score.
Motivation
For example, as illustrated in Fig. 5, when the condition of an if-statement is logically inverted, the model fails to adjust the corresponding branches accordingly, indicating a lack of understanding of the control-flow concept. This leads to a natural yet unanswered question: Can we teach LLMs to better understand programming concepts?
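To make this failure mode concrete, the hypothetical Python pair below shows the inversion described above; the function names and the failing completion sketched in the comments are illustrative and not taken from the paper's evaluation.

# Hypothetical illustration of the control-flow failure mode.
# Original program: each branch agrees with the condition.
def sign_original(x):
    if x >= 0:
        return "non-negative"
    else:
        return "negative"

# Counterfactual: the condition is logically inverted, so the two
# branches must be swapped to preserve the program's behavior.
def sign_counterfactual(x):
    if not (x >= 0):           # equivalently: x < 0
        return "negative"      # a model that ignores control flow keeps
    else:                      # "non-negative" here, silently changing
        return "non-negative"  # the program's behavior

# Sanity check: the pair is semantically equivalent.
assert all(sign_original(v) == sign_counterfactual(v) for v in (-3, 0, 7))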
Overview
ProCURE consists of two components: 1) an automated counterfactual program generation pipeline that constructs high-quality datasets, and 2) a concept-aware instruction fine-tuning framework that guides the model toward deeper semantic comprehension.
Automated Counterfactual Program Generation.
We apply concept-guided transformations to original programs to generate semantically equivalent counterfactuals. Each program pair is aligned with its target concept and annotated with the transformation location to build a high-quality, concept-aligned dataset.
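As a rough sketch of what one such concept-guided transformation could look like (not ProCURE's actual implementation), the following uses Python's ast module to negate an if-condition, swap the branches so semantics are preserved, and record the transformation location for annotation; ast.unparse requires Python 3.9+.

import ast

class InvertIf(ast.NodeTransformer):
    """Negate each if/else condition and swap its branches (control-flow
    concept); the rewrite preserves semantics, and every transformation
    site is recorded so the resulting pair can be annotated."""
    def __init__(self):
        self.sites = []  # line numbers of applied transformations

    def visit_If(self, node):
        self.generic_visit(node)   # handle nested ifs first
        if node.orelse:            # only if/else, where swapping is well-defined
            node.test = ast.UnaryOp(op=ast.Not(), operand=node.test)
            node.body, node.orelse = node.orelse, node.body
            self.sites.append(node.lineno)
        return node

def make_counterfactual(source):
    transformer = InvertIf()
    tree = transformer.visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree)), transformer.sites

original = "def sign(x):\n    if x >= 0:\n        return 1\n    else:\n        return -1\n"
counterfactual, sites = make_counterfactual(original)
print(counterfactual)                  # branches swapped under the negated condition
print("transformed at lines:", sites)  # the annotated transformation location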
Concept-Aware Instruction Fine-Tuning.
We fine-tune the LLM using both original and counterfactual programs to provide dual supervision. This encourages the model to learn not only task behavior but also underlying programming concepts for improved semantic understanding.
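A minimal sketch of how the dual supervision might be assembled from one concept-aligned pair follows; the prompt wording, field names, and the build_examples helper are hypothetical illustrations rather than ProCURE's exact instruction format.

# Build two training examples from one concept-aligned pair: one for the
# task itself, one that explicitly exercises the target concept.
def build_examples(pair):
    return [
        {   # standard task supervision on the original program
            "instruction": f"Write a Python function that {pair['task']}.",
            "response": pair["original"],
        },
        {   # concept supervision: apply the annotated transformation
            "instruction": (
                f"The following edit targets the {pair['concept']} concept. "
                f"Invert the if-condition at line {pair['site']} and adjust "
                f"the program so its behavior is unchanged:\n{pair['original']}"
            ),
            "response": pair["counterfactual"],
        },
    ]

pair = {
    "task": "returns the sign of x",
    "original": "def sign(x):\n    if x >= 0:\n        return 1\n    return -1",
    "counterfactual": "def sign(x):\n    if not x >= 0:\n        return -1\n    return 1",
    "concept": "control flow",
    "site": 2,
}
for example in build_examples(pair):
    print(example["instruction"].splitlines()[0])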
Important Results
ProCURE achieves a notable 18.77% average improvement in concept consistency score, exceeding the baseline by 10.47%, while preserving comparable task performance. These results suggest that ProCURE effectively integrates programming-concept signals into model training and genuinely guides LLMs toward a better understanding of programming concepts.
When Unit Tests Are Unavailable
Impact on Benchmark Construction:
In our default pipeline, unit tests are used to validate the correctness of counterfactual program pairs, yielding a benchmark of 33,413 pairs. When unit tests are unavailable, we instead rely solely on fast preprocessing-based structural checks (e.g., syntax validation, static heuristics). While this greatly accelerates data generation, it omits samples that require semantic validation and thus yields a smaller benchmark of 19,628 program pairs. This reduction highlights a trade-off between speed and comprehensiveness: skipping unit-test validation speeds up data construction at the cost of excluding potentially valid but more complex samples.
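A minimal sketch of the two validation modes under stated assumptions follows: the helper names and the (args, expected) test format are our own illustration, not the paper's pipeline code, and exec is applied only to trusted benchmark-derived programs.

def structural_check(source):
    """Fast preprocessing-based check: the counterfactual must at least parse."""
    try:
        compile(source, "<counterfactual>", "exec")
        return True
    except SyntaxError:
        return False

def unit_test_check(source, entry, tests):
    """Semantic check: the counterfactual must pass the original unit tests."""
    env = {}
    exec(source, env)  # trusted, benchmark-derived code only
    fn = env[entry]
    return all(fn(*args) == expected for args, expected in tests)

def validate(source, entry=None, tests=None):
    if tests is None:  # unit tests unavailable: structural mode only
        return structural_check(source)
    return structural_check(source) and unit_test_check(source, entry, tests)

cf = "def sign(x):\n    if not x >= 0:\n        return -1\n    return 1"
print(validate(cf))                                    # structural mode: True
print(validate(cf, "sign", [((5,), 1), ((-2,), -1)]))  # unit-test mode: True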
Impact on Concept-Aware Fine-Tuning:
Despite the reduced benchmark size, we fine-tuned Llama3.1-8B on the data generated without unit-test validation. The results demonstrate that ProCURE remains effective in improving both code generation performance and conceptual consistency.
These results confirm that ProCURE remains effective even without unit-test supervision, providing meaningful improvements in programming concept understanding. While unit tests help construct a more comprehensive benchmark, both modes are valuable depending on resource availability. We will include this analysis in the revised paper to highlight ProCURE’s flexibility and practical applicability.