Auto-tuned prompt performance on human-labelled edit order dataset
Evaluated prompt auto-tuning for edit-order recovery using 100 annotated commits (772 hunks, 1,747 edges).
Auto-tuned prompt achieved 87.26% accuracy, 88.01% precision, and 87.54% F1-score.
Outperformed all baselines by large margins, a 63.81% relative accuracy improvement over the best baseline (hand-crafted prompt, 53.27%).
Demonstrated that automated prompt optimization captures underlying edit-order principles better than zero-/few-shot and manual designs.
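A minimal sketch of how such edge-level scores could be computed, assuming the annotated and predicted partial-order edges are flattened into binary labels over candidate hunk pairs; edge_level_metrics and the label encoding are illustrative assumptions, not the actual evaluation harness:

```python
from sklearn.metrics import accuracy_score, precision_score, f1_score

def edge_level_metrics(y_true, y_pred):
    """Score predicted partial-order edges against the annotated ones.

    y_true, y_pred: parallel binary lists over candidate hunk pairs
                    (1 = an ordering edge exists, 0 = no constraint).
    """
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }
```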
Auto-tuned prompt performance on real-world edit order dataset
Input:
Ground-truth edit order sequence;
Predicted partial-order graph, where each node is an edit hunk and each edge is a partial-order relation inferred by the LLM.
Initialize the visited edit-hunk set to the empty set;
Initialize to 0 the count (FN) of forbidden partial orders contradicted by the ground-truth edit sequence;
For each edit in the ground-truth order:
Update the visited hunk set;
Find the subset \hat{H} of visited hunks that lie in the same weakly connected component as the last edit h_i;
Find the forbidden transitions from the last edit h_i to h_k, where h_k lies in the same weakly connected component as h_i but there is no direct edge from any node in \hat{H} to h_k;
If the next edit in the ground-truth order lies in the forbidden transition set, count it as a violation and increment FN (FN += 1)
Output: FN
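A minimal Python sketch of this violation count, assuming the ground-truth order is a list of hunk ids and the predicted partial-order graph is a networkx DiGraph; count_violations and the variable names are hypothetical, not the original implementation:

```python
import networkx as nx

def count_violations(gt_order, pog):
    """Count ground-truth transitions that contradict the predicted partial order.

    gt_order: list of hunk ids in the ground-truth edit sequence.
    pog:      networkx.DiGraph whose nodes are edit hunks and whose edges are
              LLM-inferred partial-order relations (u -> v: u precedes v).
    """
    # Map each node to the id of its weakly connected component.
    component = {}
    for comp_id, comp in enumerate(nx.weakly_connected_components(pog)):
        for node in comp:
            component[node] = comp_id

    fn = 0
    visited = set()
    for i, h_i in enumerate(gt_order[:-1]):
        visited.add(h_i)
        h_next = gt_order[i + 1]

        comp_i = component.get(h_i)
        if comp_i is None:
            continue  # h_i is absent from the predicted graph: no constraint to violate

        # \hat{H}: visited hunks in the same weakly connected component as h_i.
        hat_h = {h for h in visited if component.get(h) == comp_i}

        # Forbidden transitions: unvisited hunks in the same component as h_i
        # with no direct edge from any hunk in \hat{H}.
        forbidden = {
            h_k for h_k, c in component.items()
            if c == comp_i and h_k not in visited
            and not any(pog.has_edge(h, h_k) for h in hat_h)
        }

        if h_next in forbidden:
            fn += 1
    return fn
```

Summing count_violations over every commit in a dataset would give totals comparable to the violation counts reported below.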
Tested on an industry dataset of 500 real commits (3,059 hunks) from a large IT company.
Measured reliability via a violation-based metric, counting inconsistent orderings against ground-truth edit sequences.
Auto-tuned prompt produced only 30 violations, compared to 195 (zero-shot), 203 (few-shot), and 121 (hand-crafted).
Reduced contradictions by over 75% relative to the best baseline, confirming strong alignment with real editing behaviors.
Large-scale edit simulation via digital twin for original subsequent-edit recommendation systems
Integrated the learned prompt into flow-aware recommendation optimization for Cursor, Claude Code, and CoEdPilot.
EditFlow increased “Keep” (flow-aligned) recommendations and decreased “Break” (irrelevant) ones substantially:
Cursor: Keep ↑ from 24.00% → 38.93%; Break ↓ by 17.47%.
Claude Code: Keep ↑ from 30.76% → 47.45%; Break ↓ by 15.52%.
CoEdPilot: Keep ↑ from 13.30% → 34.00%; Break ↓ by 25.08%.
Corresponding F1-scores improved by 19.37%, 7.82%, and 34.81%, respectively.
Slight recall drop reflects intentional filtering of misaligned yet technically correct suggestions.
Qualitative analysis (GPT-Engineer case) showed EditFlow filtered 13/14 irrelevant suggestions, achieving 100% flow-keeping precision.
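One plausible way to place the same forbidden-transition logic in front of a base recommender is sketched below; filter_suggestions, its parameters, and the Keep/Break decision rule are assumptions for illustration, not the documented EditFlow integration:

```python
import networkx as nx

def filter_suggestions(suggestions, pog, visited, last_edit):
    """Drop candidate next edits that would be forbidden transitions.

    suggestions: candidate next-edit hunk ids from the base recommender
                 (e.g. Cursor, Claude Code, CoEdPilot).
    pog:         networkx.DiGraph of LLM-inferred partial-order relations.
    visited:     set of hunk ids already edited in this session.
    last_edit:   hunk id of the most recent edit.
    """
    component = {}
    for comp_id, comp in enumerate(nx.weakly_connected_components(pog)):
        for node in comp:
            component[node] = comp_id

    last_comp = component.get(last_edit)
    # Visited hunks in the same weakly connected component as the last edit.
    hat_h = {h for h in visited if last_comp is not None and component.get(h) == last_comp}

    kept = []
    for h_k in suggestions:
        if last_comp is None or component.get(h_k) != last_comp:
            # Different (or unknown) component: no ordering constraint, keep it.
            kept.append(h_k)
        elif any(pog.has_edge(h, h_k) for h in hat_h):
            # The predicted flow allows moving from a visited hunk to h_k: keep.
            kept.append(h_k)
        # Otherwise it is a forbidden ("Break") transition and is filtered out.
    return kept
```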