Automatically fixing compilation errors can greatly improve the productivity of software development by guiding novice or AI programmers as they write and debug code. Recently, learning-based program repair has gained extensive attention and become the state of the art in practice, but it still leaves plenty of room for improvement. In this paper, we propose TransRepair, an end-to-end solution that simultaneously locates the erroneous lines of a C program and generates correct substitutes for them. Unlike its counterparts, our approach takes into account both the context of the erroneous code and the diagnostic compilation feedback. We devise a Transformer-based neural network that learns how to repair from the erroneous code, its context, and the diagnostic feedback. To increase the effectiveness of TransRepair, we summarize 5 types and 74 fine-grained sub-types of compilation errors from two real-world program datasets and the Internet. We then develop a program corruption technique to synthesize a large dataset of 1,821,275 erroneous C programs. Through extensive experiments, we demonstrate that TransRepair outperforms the state of the art in both single repair accuracy and full repair accuracy. Further analysis sheds light on the strengths and weaknesses of contemporary solutions, pointing to directions for future improvement.
Due to the space constraints of the paper, we provide more details and supplementary materials on the website. The following are the hyperlinks to each part:
3) Dataset Diversity Analysis
4) Statistics of Repaired Results for 74 Compilation Errors
5) Case Study
TransRepair consists of three sequential modules:
(1) Data Synthesis:
In this module, we first empirically summarize the common compilation errors from multiple error sources, including DeepFix, TRACER, and a self-curated dataset from StackOverflow. We then design a set of perturbation strategies based on the summarized compilation errors, use them to corrupt the correct programs from DeepFix, and construct a new high-quality dataset that is in line with real-world scenarios.
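The corruption idea can be illustrated with a minimal Python sketch. It is not the TransRepair implementation: the two perturbations (`drop_semicolon`, `drop_closing_paren`) and the `corrupt` helper are hypothetical names chosen for illustration, and they stand in for the much larger set of strategies derived from the 74 error sub-types.

```python
import random

def drop_semicolon(line):
    """Hypothetical perturbation: remove the last ';' on the line."""
    idx = line.rfind(";")
    return line[:idx] + line[idx + 1:] if idx != -1 else line

def drop_closing_paren(line):
    """Hypothetical perturbation: remove the last ')' on the line."""
    idx = line.rfind(")")
    return line[:idx] + line[idx + 1:] if idx != -1 else line

# Assumption: a tiny stand-in for the real perturbation set.
PERTURBATIONS = [drop_semicolon, drop_closing_paren]

def corrupt(lines, rng):
    """Corrupt exactly one line of a correct program.

    Returns (corrupted_lines, line_index, original_line), or None when no
    perturbation changes any line. The pair (line_index, original_line)
    can serve as the repair label for the corrupted program.
    """
    candidates = [
        (i, perturb)
        for i, line in enumerate(lines)
        for perturb in PERTURBATIONS
        if perturb(line) != line
    ]
    if not candidates:
        return None
    i, perturb = rng.choice(candidates)
    corrupted = list(lines)
    corrupted[i] = perturb(lines[i])
    return corrupted, i, lines[i]
```

Passing a seeded `random.Random` instance keeps the synthesized dataset reproducible, and recording the original line alongside the corrupted program yields the supervision signal for both localization and repair.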
(2) Data Parsing:
In this module, we first compile a broken program to obtain the diagnostic feedback provided by the compiler. Then we design a context analyzer to extract the context of each statement as one of the inputs to the model.
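A rough Python sketch of this step is given below. It assumes `gcc` is available on the PATH and that diagnostics follow GCC's usual `file:line:column: error: message` layout. The function names are hypothetical, and the fixed-size neighbour window is only a simple stand-in for the context analyzer, whose actual rules are not described here.

```python
import re
import subprocess

# Matches GCC-style lines such as: prog.c:3:15: error: expected ';' before 'return'
DIAGNOSTIC_RE = re.compile(
    r"^[^:\n]+:(?P<line>\d+):(?P<col>\d+): error: (?P<msg>.*)$",
    re.MULTILINE,
)

def compile_diagnostics(path):
    """Run a syntax-only compile and return the compiler's stderr text.

    Assumption: gcc is installed; -fsyntax-only checks the code without
    producing an object file.
    """
    result = subprocess.run(
        ["gcc", "-fsyntax-only", path], capture_output=True, text=True
    )
    return result.stderr

def parse_diagnostics(stderr_text):
    """Extract (line, column, message) triples for each reported error."""
    return [
        (int(m.group("line")), int(m.group("col")), m.group("msg"))
        for m in DIAGNOSTIC_RE.finditer(stderr_text)
    ]

def context_window(lines, index, k=1):
    """Return up to k statements before and after lines[index].

    Assumption: a plain neighbour window, used here only for illustration.
    """
    return lines[max(0, index - k):index] + lines[index + 1:index + 1 + k]
```

For example, `parse_diagnostics(compile_diagnostics("prog.c"))` would give the error positions and messages for a hypothetical file `prog.c`, and `context_window` would then be applied to each statement of that program.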
(3) Model Architecture:
In this module, we take each statement, its context, and the diagnostic feedback provided by the data parsing module as the input to our neural network. Specifically, the network consists of a Transformer encoder followed by a multi-layer perceptron that locates the position of the erroneous statement and a pointer-decoder that generates a correct statement to fix it.
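The two output heads can be sketched in plain Python, leaving out the Transformer encoder itself. Two assumptions apply: localization is shown as a softmax over per-statement scores (such as those an MLP might emit), and the pointer-decoder is shown in the common pointer-generator form, which mixes a vocabulary distribution with a copy distribution over source tokens. Whether TransRepair uses exactly this formulation is not stated above, and all names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def locate_error_line(line_scores):
    """Turn per-statement scores into probabilities and pick the most likely line."""
    probs = softmax(line_scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

def pointer_mix(vocab_probs, attention, source_tokens, p_gen):
    """Combine generating from the vocabulary with copying from the source.

    vocab_probs: dict mapping token -> probability of generating it
    attention: copy weights over source_tokens (should sum to 1)
    p_gen: probability of generating rather than copying
    """
    mixed = {tok: p_gen * p for tok, p in vocab_probs.items()}
    for tok, weight in zip(source_tokens, attention):
        mixed[tok] = mixed.get(tok, 0.0) + (1.0 - p_gen) * weight
    return mixed
```

The copy path is what allows a decoder of this kind to emit program-specific identifiers (for example, a variable named `count`) that are absent from the fixed output vocabulary.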