Data
We report here all the data used, extracted, and generated in our study:
Bug-Fixes
Datasets
Predictions
Idioms
Bug-Fixes
Bug-Fixing Commits
Metadata of the bug-fixing commits, extracted during mining. The CSV file contains the following fields:
ID : Commit HASH ID
Repo_URL : GitHub URL of the repository
Commit_URL : GitHub URL of the bug-fixing commit
Message : Commit message of the bug-fixing commit
Data
Download CSV file (900 MB)
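As a minimal sketch of how the commit metadata could be consumed, the snippet below streams the CSV row by row instead of loading all 900 MB at once. It assumes the file starts with a header row whose column names match the fields listed above (ID, Repo_URL, Commit_URL, Message); the function names and the file path are hypothetical.

```python
import csv
from typing import Dict, Iterator


def iter_commits(csv_path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per bug-fixing commit, keyed by column name.

    Assumption: the first row of the CSV is a header containing
    ID, Repo_URL, Commit_URL and Message.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            yield row


def count_by_repo(csv_path: str) -> Dict[str, int]:
    """Count how many bug-fixing commits each repository contributes."""
    counts: Dict[str, int] = {}
    for row in iter_commits(csv_path):
        repo = row["Repo_URL"]
        counts[repo] = counts.get(repo, 0) + 1
    return counts
```

Streaming keeps memory use flat regardless of file size. If some commit messages turn out to be very long, raising the limit with `csv.field_size_limit` may be necessary.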
Code from Bug-Fixes
Raw source code extracted from the bug-fixing commits.
Each bug-fixing commit is represented by a folder named after the commit hash ID. Each folder contains two sub-folders:
P_DIR: Java source code files before the bug-fixing commit
F_DIR: Java source code files after the bug-fixing commit
Data
Download data (15 GB)
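The folder layout described above can be traversed with a short script. The sketch below pairs each Java file in P_DIR with its counterpart in F_DIR. It assumes that a file keeps the same relative path in both sub-folders, which the description does not state explicitly; the helper names are hypothetical.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def list_java_files(root: Path) -> Dict[str, Path]:
    """Map each .java file under root to its path relative to root."""
    return {p.relative_to(root).as_posix(): p for p in root.rglob("*.java")}


def changed_file_pairs(commit_dir: Path) -> List[Tuple[Path, Path]]:
    """Return (before, after) file pairs for one commit folder.

    Assumption: a file has the same relative path under P_DIR and F_DIR.
    Files present on only one side (added or deleted) are skipped.
    """
    before = list_java_files(commit_dir / "P_DIR")
    after = list_java_files(commit_dir / "F_DIR")
    shared = sorted(before.keys() & after.keys())
    return [(before[key], after[key]) for key in shared]
```

Calling `changed_file_pairs` on each commit folder in the extracted archive would give the inputs for a file-level diff.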
Extracted Bug-Fix Pairs (BFP)
Method pairs extracted from the bug-fixing commits.
Each bug-fix is represented by a folder named after the corresponding commit hash ID. Each bug-fix folder contains a first level of folders representing the files, then a second level of folders representing the methods. Each method folder contains the following files:
before.java : Method's source code before the fix
after.java : Method's source code after the fix
operations.txt : AST operations performed on the method as extracted by GumTreeDiff
signature.txt : Fully qualified signatures of the method before/after the fix
Data
Download data (7 GB)
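A possible way to iterate over the extracted method pairs is sketched below. It relies only on the three-level layout (commit, file, method) and the file names before.java and after.java described above; the `MethodPair` class and the function name are hypothetical, and the encoding of the files is assumed to be UTF-8, with undecodable bytes replaced.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass
class MethodPair:
    commit: str       # name of the commit folder (commit hash ID)
    file_dir: str     # name of the file-level folder
    method_dir: str   # name of the method-level folder
    before: str       # method source before the fix
    after: str        # method source after the fix


def iter_method_pairs(bfp_root: Path) -> Iterator[MethodPair]:
    """Yield every method pair found under <commit>/<file>/<method>/."""
    for before_path in sorted(bfp_root.glob("*/*/*/before.java")):
        method_dir = before_path.parent
        after_path = method_dir / "after.java"
        if not after_path.is_file():
            continue  # skip incomplete method folders
        yield MethodPair(
            commit=method_dir.parent.parent.name,
            file_dir=method_dir.parent.name,
            method_dir=method_dir.name,
            before=before_path.read_text(encoding="utf-8", errors="replace"),
            after=after_path.read_text(encoding="utf-8", errors="replace"),
        )
```

The same loop could be extended to read operations.txt and signature.txt, whose internal format is not specified here.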
Datasets
Datasets of Bug-Fix Pairs for small and medium methods. Each dataset consists of training (80%), validation (10%), and test (10%) sets, as well as the combined files. Each set contains the buggy code and the corresponding fixed code, aligned line by line (the i-th line in buggy corresponds to the i-th line in fixed). The original source code of the methods is also available. The mapping file maps the abstract code back to the original source code.
Small Methods (50 tokens)
Medium Methods (50-100 tokens)
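Because the buggy and fixed files are aligned line by line, a pair of files can be loaded with a simple zip. The sketch below takes the two file paths as parameters, since the exact file names inside each archive are not listed here; the function name is hypothetical.

```python
from typing import List, Tuple


def read_aligned_pairs(buggy_path: str, fixed_path: str) -> List[Tuple[str, str]]:
    """Return (buggy, fixed) pairs, where pair i comes from line i of each file."""
    with open(buggy_path, encoding="utf-8") as buggy_file:
        buggy = [line.rstrip("\n") for line in buggy_file]
    with open(fixed_path, encoding="utf-8") as fixed_file:
        fixed = [line.rstrip("\n") for line in fixed_file]
    if len(buggy) != len(fixed):
        # A length mismatch means the line-by-line alignment is broken.
        raise ValueError(
            f"line count mismatch: {len(buggy)} buggy vs {len(fixed)} fixed"
        )
    return list(zip(buggy, fixed))
```

Checking the line counts up front guards against silently misaligned pairs, for example after a partial download.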
Predictions
Predictions of the models for small and medium methods. For each model, there is a folder for each beam size containing files of the predictions performed on the test set of the corresponding dataset (above):
Predictions - Small Methods (50 tokens)
Predictions - Medium Methods (50-100 tokens)
Each beam size folder contains the following files:
prediction.beam.mul.txt : the k predictions generated by the model, separated by the <SEP> token. The i-th prediction refers to the i-th buggy code in the test set;
prediction.beam.vis.txt : the same predictions, displayed one per line;
perfect.beam.mul.txt : the perfect fixes generated by the model. The file shows first the buggy code, then the fixed code, then the predicted code (equal to the fixed code);
pred.operations.txt : fine-grained AST operations emulated by the model when performing the perfect fixes.
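The <SEP>-separated format of prediction.beam.mul.txt can be parsed as sketched below. The sketch assumes that each line of the file holds the k candidates for one test instance, and that a prediction counts as perfect when it equals the fixed code after whitespace normalization; both are assumptions about the files rather than documented behavior, and the function names are hypothetical.

```python
from typing import Iterable, List


def parse_beam_line(line: str, sep: str = "<SEP>") -> List[str]:
    """Split one line into its candidate predictions, dropping empty parts."""
    return [" ".join(part.split()) for part in line.split(sep) if part.strip()]


def perfect_prediction_indices(
    prediction_lines: Iterable[str], fixed_lines: Iterable[str]
) -> List[int]:
    """Return indices i where some candidate on line i equals the i-th fixed code.

    Assumption: the comparison ignores differences in whitespace only.
    """
    hits: List[int] = []
    for index, (pred_line, fixed) in enumerate(zip(prediction_lines, fixed_lines)):
        target = " ".join(fixed.split())
        if target in parse_beam_line(pred_line):
            hits.append(index)
    return hits
```

Dividing the number of returned indices by the size of the test set would give the share of perfect predictions for a given beam size under these assumptions.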