Performance of language models on the studied program generation datasets
Key Findings: State-of-the-art language models such as CodeT5, CodeReviewer, and CodeT5+ outperform other models across diverse program generation datasets. However, their accuracy is inconsistent and varies substantially across tasks and datasets, raising concerns about their reliability.
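To make the comparison concrete, the following is a minimal sketch of how per-dataset performance can be scored against reference code. It assumes predictions and references are available as parallel lists of code strings; the function name, the exact-match criterion, and the use of difflib as a similarity proxy are illustrative choices, not the study's actual evaluation pipeline.

```python
from difflib import SequenceMatcher

def evaluate_generations(predictions, references):
    """Compute exact-match rate and mean character-level similarity
    between generated code and reference code (parallel lists of strings).
    Illustrative sketch; the study's own metrics may differ (e.g., BLEU/CodeBLEU)."""
    assert len(predictions) == len(references)
    exact = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    sims = [SequenceMatcher(None, p, r).ratio() for p, r in zip(predictions, references)]
    return {
        "exact_match": exact / len(references),
        "mean_similarity": sum(sims) / len(sims),
    }

# Hypothetical usage: compare two models on one dataset.
# scores = {name: evaluate_generations(preds[name], refs) for name in ["CodeT5", "CodeT5+"]}
```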
Data Duplication between Training and Testing Sets
Model Performance Before and After Removing High-Similarity Test Instances
Key Findings: Multiple program generation datasets contain substantial duplication between their training and testing sets. Higher similarity between the two sets typically inflates performance metrics, raising concerns about the models' generalization capabilities and suggesting potential flaws in data handling.
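A minimal sketch of the filtering step is shown below: test instances whose inputs are near-duplicates of training inputs are dropped before re-evaluation. The 0.9 threshold, the character-level difflib similarity, and the function name are assumptions for illustration; token-level Jaccard similarity or a clone detector would serve the same purpose.

```python
from difflib import SequenceMatcher

def filter_high_similarity(test_inputs, train_inputs, threshold=0.9):
    """Keep only test inputs whose maximum similarity to any training input
    falls below the threshold. Quadratic brute-force comparison; a sketch,
    not an optimized deduplication pipeline."""
    kept = []
    for t in test_inputs:
        max_sim = max(
            (SequenceMatcher(None, t, tr).ratio() for tr in train_inputs),
            default=0.0,
        )
        if max_sim < threshold:
            kept.append(t)
    return kept
```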
Data Duplication within Testing Sets
In 10 of the 12 datasets, many test instances share identical input data but have different expected outputs.
Distribution of Data Duplication within Testing Sets and Corresponding Model Performance Metrics
Key Findings: In many datasets, there are duplicated source code sequences within the test instances, despite requiring models to generate different target code. Such inconsistencies in test design lead to evaluations that may not accurately reflect the true capabilities of program generation approaches.
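Such conflicting duplicates can be surfaced by grouping test instances by their (normalized) input and flagging inputs that map to more than one distinct target. The sketch below assumes test instances are available as (source, target) string pairs; the whitespace normalization and function name are illustrative.

```python
from collections import defaultdict

def conflicting_test_duplicates(test_pairs):
    """Group test instances by whitespace-normalized input and return inputs
    that appear with more than one distinct expected output."""
    targets_by_input = defaultdict(set)
    for source, target in test_pairs:
        targets_by_input[" ".join(source.split())].add(target.strip())
    return {src: tgts for src, tgts in targets_by_input.items() if len(tgts) > 1}
```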
Output-Input Similarity Analysis
Key Findings: In code review and code repair tasks, language models frequently generate outputs that are identical to the input sequences, presenting a potential limitation in their ability to generate novel code solutions.
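A simple way to quantify this behavior is the fraction of generated outputs that are identical to their inputs after whitespace normalization. The sketch below assumes parallel lists of input and output strings; the metric name is illustrative.

```python
def input_copy_rate(inputs, outputs):
    """Fraction of generated outputs that are (whitespace-normalized)
    identical to their corresponding input sequence."""
    same = sum(
        " ".join(i.split()) == " ".join(o.split())
        for i, o in zip(inputs, outputs)
    )
    return same / len(inputs)
```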
Key Findings: Explainable AI methods can effectively identify which input features matter most when models generate code sequences. Identifiers and keywords are consistently assigned higher importance scores than operators and separators, indicating that language models prioritize syntactically and semantically meaningful tokens.
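One perturbation-style way to obtain per-token importance scores is sketched below; it is not necessarily the explainability method used in the study. It assumes a scoring callable `model_score(tokens)` that returns a scalar (e.g., the log-likelihood of the reference output given the input tokens); both the callable and the drop-one-token scheme are assumptions for illustration.

```python
def occlusion_importance(tokens, model_score):
    """Perturbation-based importance: drop each input token in turn and
    measure how much the model's score decreases relative to the full input.
    `model_score` is an assumed scoring function, not part of any specific library."""
    base = model_score(tokens)
    importance = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        importance.append(base - model_score(reduced))
    return list(zip(tokens, importance))
```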
Performance upon input token reduction using different strategies
Key Findings: The substantial performance drop observed when crucial tokens are removed from input sequences underscores a vulnerability in language models and highlights the need for enhanced model robustness.
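The reduction strategies can be sketched as follows: given per-token importance scores (for instance, from the perturbation sketch above), remove the k most important, least important, or randomly chosen tokens and re-evaluate the model on the reduced input. The strategy names and function signature are illustrative and may not match the study's exact configurations.

```python
import random

def reduce_tokens(tokens, importance, k, strategy="top"):
    """Remove k input tokens according to a reduction strategy:
    'top'    - drop the k most important tokens,
    'bottom' - drop the k least important tokens,
    'random' - drop k tokens uniformly at random.
    `importance` is a list of scores aligned with `tokens`."""
    indices = range(len(tokens))
    if strategy == "top":
        drop = set(sorted(indices, key=lambda i: importance[i], reverse=True)[:k])
    elif strategy == "bottom":
        drop = set(sorted(indices, key=lambda i: importance[i])[:k])
    else:
        drop = set(random.sample(list(indices), k))
    return [t for i, t in enumerate(tokens) if i not in drop]
```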