An Empirical Investigation into
Learning Bug-Fixing Patches in the Wild via Neural Machine Translation
Millions of open-source projects with numerous bug fixes are available in code repositories. This proliferation of software development histories can be leveraged to learn how to fix common programming bugs. To explore such a potential, we perform an empirical study to assess the feasibility of using Neural Machine Translation techniques for learning bug-fixing patches for real defects. First, we mine millions of verified bug-fixes from the change histories of GitHub repositories, in order to extract meaningful examples of such bug-fixes. Next, we abstract the buggy and corresponding fixed code, and use them to train an Encoder-Decoder model able to translate buggy code into its fixed version. In our empirical investigation we found that such a model is able to fix thousands of unique buggy methods in the wild. Overall, this model is capable of predicting fixed patches generated by developers in 9-50% of the cases, depending on the number of candidate patches we allow it to generate. Also, the model is able to capture a variety of different AST operations and generate candidate patches in a split second.
The figure above shows an overview of the NMT approach that we experiment with. The black boxes represent the main phases, the arrows indicate data flows, and the dashed arrows denote dependencies on external tools or data. We mine bug-fixing commits from thousands of GitHub repos using GitHub Archive. From the bug-fixes, we extract method-level pairs of buggy and corresponding fixed code named bug-fix pairs. BFPs are the examples that we use to learn how to fix code from bug-fixes (buggy → f ixed). We use GumTree to identify the list of edit actions (A) performed between the buggy and fixed code. Then, we use a Java Lexer and Parser to abstract the source code of the BFPs into a representation better suited for learning. During the abstraction, we keep frequent identifiers and literals we call idioms within the representation. The output of this phase are the abstracted BFPs and their corresponding mapping M , which allows reconstructing the original source code. Next, we generate two datasets of BFPs grouping together fixes for small and medium methods, respectively. Finally, for each set, we use an encoder-decoder model to learn how to transform a buggy code into a corresponding fixed version. The trained models can be used to generate patches for unseen buggy code.