Auto-tuned prompt performance on human-labelled edit order dataset
Evaluated prompt auto-tuning for edit-order recovery using 100 annotated commits (772 hunks, 1,747 edges).
Auto-tuned prompt achieved 87.26% accuracy, 88.01% precision, and 87.54% F1-score.
Outperformed all baselines by large margins, a 63.81% relative accuracy improvement over the best baseline (hand-crafted prompt, 53.27%).
Demonstrated that automated prompt optimization captures underlying edit-order principles better than zero-/few-shot and manual designs.
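A minimal sketch of how such edge-level scores could be computed, assuming the annotated and predicted partial-order edges are flattened into binary labels over candidate hunk pairs; edge_level_metrics and the label encoding are illustrative assumptions, not the actual evaluation harness:

```python
from sklearn.metrics import accuracy_score, precision_score, f1_score

def edge_level_metrics(y_true, y_pred):
    """Score predicted partial-order edges against the annotated ones.

    y_true, y_pred: parallel binary lists over candidate hunk pairs
                    (1 = an ordering edge exists, 0 = no constraint).
    """
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }
```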
Auto-tuned prompt performance on real-world edit order dataset
Input:
Ground-truth edit order sequence;
Predicted partial-order graph, where each node is an edit hunk and each edge is a partial-order relation inferred by the LLM.
Initialize the visited edit-hunk set to the empty set;
Initialize to 0 the count (FN) of forbidden partial orders contradicted by the ground-truth edit sequence;
For each edit in the ground-truth order:
Update the visited hunk set;
Find the subset \hat{H} of visited hunks that lie in the same weakly connected component as the last edit h_i;
Find the forbidden transitions from the last edit h_i to h_k, where h_k lies in the same weakly connected component as h_i but there is no direct edge from any node in \hat{H} to h_k;
If the next edit in the ground-truth order lies in the forbidden transition set, count it as a violation and increment FN (FN += 1)
Output: FN
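A minimal Python sketch of this violation count, assuming the ground-truth order is a list of hunk ids and the predicted partial-order graph is a networkx DiGraph; count_violations and the variable names are hypothetical, not the original implementation:

```python
import networkx as nx

def count_violations(gt_order, pog):
    """Count ground-truth transitions that contradict the predicted partial order.

    gt_order: list of hunk ids in the ground-truth edit sequence.
    pog:      networkx.DiGraph whose nodes are edit hunks and whose edges are
              LLM-inferred partial-order relations (u -> v: u precedes v).
    """
    # Map each node to the id of its weakly connected component.
    component = {}
    for comp_id, comp in enumerate(nx.weakly_connected_components(pog)):
        for node in comp:
            component[node] = comp_id

    fn = 0
    visited = set()
    for i, h_i in enumerate(gt_order[:-1]):
        visited.add(h_i)
        h_next = gt_order[i + 1]

        comp_i = component.get(h_i)
        if comp_i is None:
            continue  # h_i is absent from the predicted graph: no constraint to violate

        # \hat{H}: visited hunks in the same weakly connected component as h_i.
        hat_h = {h for h in visited if component.get(h) == comp_i}

        # Forbidden transitions: unvisited hunks in the same component as h_i
        # with no direct edge from any hunk in \hat{H}.
        forbidden = {
            h_k for h_k, c in component.items()
            if c == comp_i and h_k not in visited
            and not any(pog.has_edge(h, h_k) for h in hat_h)
        }

        if h_next in forbidden:
            fn += 1
    return fn
```

Summing count_violations over every commit in a dataset would give totals comparable to the violation counts reported below.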
Tested on an industry dataset of 500 real commits (3,059 hunks) from a large IT company.
Measured reliability via a violation-based metric, counting inconsistent orderings against ground-truth edit sequences.
Auto-tuned prompt produced only 30 violations, compared to 195 (zero-shot), 203 (few-shot), and 121 (hand-crafted).
Reduced contradictions by over 75% relative to the best baseline, confirming strong alignment with real editing behaviors.
Large-scale edit simulation via digital twin for original subsequent-edit recommendation systems
Integrated the learned prompt into flow-aware recommendation optimization for Cursor, Claude Code, and CoEdPilot.
EditFlow increased “Keep” (flow-aligned) recommendations and decreased “Break” (irrelevant) ones substantially:
Cursor: Keep ↑ from 24.00% → 38.93%; Break ↓ by 17.47%.
Claude Code: Keep ↑ from 30.76% → 47.45%; Break ↓ by 15.52%.
CoEdPilot: Keep ↑ from 13.30% → 34.00%; Break ↓ by 25.08%.
Corresponding F1-scores improved by 19.37%, 7.82%, and 34.81%, respectively.
Slight recall drop reflects intentional filtering of misaligned yet technically correct suggestions.
Qualitative analysis (GPT-Engineer case) showed EditFlow filtered 13/14 irrelevant suggestions, achieving 100% flow-keeping precision.
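One plausible way to place the same forbidden-transition logic in front of a base recommender is sketched below; filter_suggestions, its parameters, and the Keep/Break decision rule are assumptions for illustration, not the documented EditFlow integration:

```python
import networkx as nx

def filter_suggestions(suggestions, pog, visited, last_edit):
    """Drop candidate next edits that would be forbidden transitions.

    suggestions: candidate next-edit hunk ids from the base recommender
                 (e.g. Cursor, Claude Code, CoEdPilot).
    pog:         networkx.DiGraph of LLM-inferred partial-order relations.
    visited:     set of hunk ids already edited in this session.
    last_edit:   hunk id of the most recent edit.
    """
    component = {}
    for comp_id, comp in enumerate(nx.weakly_connected_components(pog)):
        for node in comp:
            component[node] = comp_id

    last_comp = component.get(last_edit)
    # Visited hunks in the same weakly connected component as the last edit.
    hat_h = {h for h in visited if last_comp is not None and component.get(h) == last_comp}

    kept = []
    for h_k in suggestions:
        if last_comp is None or component.get(h_k) != last_comp:
            # Different (or unknown) component: no ordering constraint, keep it.
            kept.append(h_k)
        elif any(pog.has_edge(h, h_k) for h in hat_h):
            # The predicted flow allows moving from a visited hunk to h_k: keep.
            kept.append(h_k)
        # Otherwise it is a forbidden ("Break") transition and is filtered out.
    return kept
```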