Performance of language models on the studied program generation datasets
Key Findings: State-of-the-art language models such as CodeT5, CodeReviewer, and CodeT5+ outperform other models across diverse program generation datasets. However, their accuracy is inconsistent and varies substantially across tasks and datasets, raising concerns about their reliability.
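To make the comparison concrete, the following is a minimal sketch of how per-dataset performance can be scored against reference code. It assumes predictions and references are available as parallel lists of code strings; the function name, the exact-match criterion, and the use of difflib as a similarity proxy are illustrative choices, not the study's actual evaluation pipeline.

```python
from difflib import SequenceMatcher

def evaluate_generations(predictions, references):
    """Compute exact-match rate and mean character-level similarity
    between generated code and reference code (parallel lists of strings).
    Illustrative sketch; the study's own metrics may differ (e.g., BLEU/CodeBLEU)."""
    assert len(predictions) == len(references)
    exact = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    sims = [SequenceMatcher(None, p, r).ratio() for p, r in zip(predictions, references)]
    return {
        "exact_match": exact / len(references),
        "mean_similarity": sum(sims) / len(sims),
    }

# Hypothetical usage: compare two models on one dataset.
# scores = {name: evaluate_generations(preds[name], refs) for name in ["CodeT5", "CodeT5+"]}
```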
Data Duplication between Training and Testing Sets
Model Performance Before and After Removing High-Similarity Test Instances
Key Findings: Multiple program generation datasets contain substantial duplication between their training and testing sets. Higher similarity between the two sets typically inflates performance metrics, raising concerns about the models' generalization capabilities and suggesting potential flaws in data handling.
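A minimal sketch of the filtering step is shown below: test instances whose inputs are near-duplicates of training inputs are dropped before re-evaluation. The 0.9 threshold, the character-level difflib similarity, and the function name are assumptions for illustration; token-level Jaccard similarity or a clone detector would serve the same purpose.

```python
from difflib import SequenceMatcher

def filter_high_similarity(test_inputs, train_inputs, threshold=0.9):
    """Keep only test inputs whose maximum similarity to any training input
    falls below the threshold. Quadratic brute-force comparison; a sketch,
    not an optimized deduplication pipeline."""
    kept = []
    for t in test_inputs:
        max_sim = max(
            (SequenceMatcher(None, t, tr).ratio() for tr in train_inputs),
            default=0.0,
        )
        if max_sim < threshold:
            kept.append(t)
    return kept
```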
Data Duplication within Testing Sets
In 10 of the 12 datasets, many test instances share identical input data but have different expected outputs.
Distribution of Data Duplication within Testing Sets and Corresponding Model Performance Metrics
Key Findings: In many datasets, there are duplicated source code sequences within the test instances, despite requiring models to generate different target code. Such inconsistencies in test design lead to evaluations that may not accurately reflect the true capabilities of program generation approaches.
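Such conflicting duplicates can be surfaced by grouping test instances by their (normalized) input and flagging inputs that map to more than one distinct target. The sketch below assumes test instances are available as (source, target) string pairs; the whitespace normalization and function name are illustrative.

```python
from collections import defaultdict

def conflicting_test_duplicates(test_pairs):
    """Group test instances by whitespace-normalized input and return inputs
    that appear with more than one distinct expected output."""
    targets_by_input = defaultdict(set)
    for source, target in test_pairs:
        targets_by_input[" ".join(source.split())].add(target.strip())
    return {src: tgts for src, tgts in targets_by_input.items() if len(tgts) > 1}
```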
Output-Input Similarity Analysis
Key Findings: In code review and code repair tasks, language models frequently generate outputs that are identical to the input sequences, presenting a potential limitation in their ability to generate novel code solutions.
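A simple way to quantify this behavior is the fraction of generated outputs that are identical to their inputs after whitespace normalization. The sketch below assumes parallel lists of input and output strings; the metric name is illustrative.

```python
def input_copy_rate(inputs, outputs):
    """Fraction of generated outputs that are (whitespace-normalized)
    identical to their corresponding input sequence."""
    same = sum(
        " ".join(i.split()) == " ".join(o.split())
        for i, o in zip(inputs, outputs)
    )
    return same / len(inputs)
```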
Key Findings: Explainable AI methods can effectively identify which input features matter most when models generate code sequences. Identifiers and keywords are consistently assigned higher importance scores than operators and separators, indicating that language models prioritize syntactically and semantically meaningful tokens.
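One perturbation-style way to obtain per-token importance scores is sketched below; it is not necessarily the explainability method used in the study. It assumes a scoring callable `model_score(tokens)` that returns a scalar (e.g., the log-likelihood of the reference output given the input tokens); both the callable and the drop-one-token scheme are assumptions for illustration.

```python
def occlusion_importance(tokens, model_score):
    """Perturbation-based importance: drop each input token in turn and
    measure how much the model's score decreases relative to the full input.
    `model_score` is an assumed scoring function, not part of any specific library."""
    base = model_score(tokens)
    importance = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        importance.append(base - model_score(reduced))
    return list(zip(tokens, importance))
```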
Performance upon input token reduction using different strategies
Key Findings: The substantial performance drop observed when crucial tokens are removed from input sequences underscores a vulnerability in language models and highlights the need for enhanced model robustness.
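The reduction strategies can be sketched as follows: given per-token importance scores (for instance, from the perturbation sketch above), remove the k most important, least important, or randomly chosen tokens and re-evaluate the model on the reduced input. The strategy names and function signature are illustrative and may not match the study's exact configurations.

```python
import random

def reduce_tokens(tokens, importance, k, strategy="top"):
    """Remove k input tokens according to a reduction strategy:
    'top'    - drop the k most important tokens,
    'bottom' - drop the k least important tokens,
    'random' - drop k tokens uniformly at random.
    `importance` is a list of scores aligned with `tokens`."""
    indices = range(len(tokens))
    if strategy == "top":
        drop = set(sorted(indices, key=lambda i: importance[i], reverse=True)[:k])
    elif strategy == "bottom":
        drop = set(sorted(indices, key=lambda i: importance[i])[:k])
    else:
        drop = set(random.sample(list(indices), k))
    return [t for i, t in enumerate(tokens) if i not in drop]
```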