We designed this simulation task to evaluate performance in a realistic editing scenario.
A simulated programmer takes a git commit and reconstructs its child version from its parent version.
Given the parent version of a commit, one edit hunk is changed to its child version to serve as the initial edit.
The locator then suggests the next edit locations within the files that contain edits, based on the commit message and prior edits.
All suggested locations are ranked by confidence score.
If one of the suggested locations matches the ground truth, that location is fed into the generator for edit suggestions.
Otherwise, the simulated programmer randomly picks one ground-truth location for the generator.
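The location-selection step can be sketched as follows. This is a minimal illustration rather than the evaluated implementation: the function name `choose_location`, the `(file, line)` tuple used to represent a location, and the `(location, confidence)` pair format are all assumptions made here. The text does not say which location is used when several suggestions match the ground truth, so the sketch assumes the highest-ranked match is taken.

```python
import random


def choose_location(suggestions, ground_truth, rng=None):
    """Pick the location to hand to the generator (hypothetical helper).

    suggestions:  list of (location, confidence) pairs from the locator.
    ground_truth: non-empty collection of locations that are actually edited.
    Returns (location, located), where `located` is True if the locator
    suggested a ground-truth location and False if we fell back to a
    random ground-truth pick.
    """
    rng = rng or random.Random()
    truth = set(ground_truth)

    # Rank all suggestions by confidence, highest first.
    ranked = sorted(suggestions, key=lambda pair: pair[1], reverse=True)

    # Assumption: when several suggestions match, use the highest-ranked one.
    for location, _confidence in ranked:
        if location in truth:
            return location, True

    # No suggestion matched: randomly pick one ground-truth location.
    return rng.choice(sorted(truth)), False
```

Passing a seeded `random.Random` as `rng` makes the fallback choice reproducible across simulation runs.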
Given the predicted or randomly selected location with ground-truth labels, the generator produces 10 candidate edit suggestions.
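We compare each candidate with the ground truth using BLEU-4.
If the maximum BLEU-4 score is below 50, the step counts as a failure.
If it is at least 50 but below 100, the step counts as requiring human intervention.
Otherwise, that is, at a score of 100, the step counts as a success.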
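The three-way outcome rule can be written as a small function. This is a sketch under stated assumptions: the name `classify_outcome` is hypothetical, the BLEU-4 scores are assumed to be precomputed on a 0 to 100 scale (one per candidate), and a score of exactly 50 is assumed to fall in the human-intervention band because only scores below 50 are described as failures.

```python
def classify_outcome(bleu_scores):
    """Map the candidates' BLEU-4 scores (0-100 scale) to an outcome label.

    bleu_scores: non-empty iterable with one BLEU-4 score per candidate,
                 each computed against the ground-truth edit.
    """
    best = max(bleu_scores)  # only the best candidate decides the outcome

    if best < 50:
        return "failure"
    if best < 100:
        # Assumption: the band is 50 <= best < 100.
        return "human intervention"
    return "success"  # best candidate scores 100


# Example with three hypothetical candidate scores.
print(classify_outcome([12.5, 73.0, 41.8]))  # human intervention
```

Because the rule depends only on the maximum, a single candidate scoring 100 is enough for success regardless of the other nine.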