The quality and diversity of the data used to train and evaluate LLMs directly affect how trustworthy the reported results are. If the evaluation datasets are not sufficiently representative, diverse, or realistic, performance metrics (e.g., accuracy, BLEU, or ROUGE) may not reflect the true quality of the code generated or reviewed by the LLMs (She et al.). For instance, Liu et al. show that temporal inconsistencies in prior Android malware detection research led to an over-optimistic accuracy of 99%. Moreover, datasets for software development tasks are often derived from open-source projects (Hou et al.), which may introduce noise and bias. For example, Zhang et al. trained Transformer-based methods on a dataset containing both bot-generated and trivial code commit messages and achieved a BLEU-4 score of 42.4%; after the noisy data were removed, performance dropped sharply to a BLEU-4 score of 26.2%. Therefore, the reliability of existing benchmark datasets warrants further investigation.
In this study, we conducted a comprehensive empirical evaluation of the reliability and explainability of popular pre-trained language models for automated program generation tasks. Our approach involved the following key steps:
Model Selection: We selected eight state-of-the-art pre-trained language models, namely T5, CodeT5, CoTexT, CodeTrans, CodeGPT, CodeBERT, CodeT5+, and CodeReviewer, to assess their performance and characteristics (a loading sketch for a subset of these checkpoints follows this list).
Dataset Curation: We evaluated the models on five representative datasets spanning four program generation tasks: code repair, code review, code translation, and code generation. The five datasets are Tufano et al., Bugs2Fix, CodeReview, CodeTrans-Dataset, and CONCODE.
Performance Evaluation: We fine-tuned the selected models on each dataset and evaluated their performance using accuracy and BLEU-4 scores. This allowed us to compare the models' effectiveness in generating accurate and relevant code sequences (a metric sketch follows this list).
Reliability Analysis: To assess the reliability of the evaluation approaches, we investigated potential experimental biases, such as data duplication between the training and test sets, duplication within the test sets, and the similarity between model outputs and inputs. We analyzed the impact of these factors on model performance and the validity of the evaluation results (illustrative duplication and similarity checks follow this list).
Explainability Analysis: We employed SHAP (SHapley Additive exPlanations), a model-agnostic explainable AI technique, to interpret the models' decision-making processes. By examining the feature importance of different token types and the models' robustness to input perturbations, we gained insights into their learning patterns and limitations (a SHAP sketch concludes the examples below).
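As a concrete illustration of the model-selection step, the following minimal sketch loads a subset of the sequence-to-sequence checkpoints from the Hugging Face Hub. The hub identifiers are assumptions based on the publicly released versions of these models and may differ from the exact fine-tuned checkpoints used in our experiments.

```python
# Minimal sketch: loading a subset of the selected checkpoints.
# The Hugging Face Hub identifiers are assumptions based on the publicly
# released versions of these models, not necessarily the exact checkpoints
# used in our experiments.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

CHECKPOINTS = {
    "T5": "t5-base",
    "CodeT5": "Salesforce/codet5-base",
    "CodeT5+": "Salesforce/codet5p-220m",
    "CodeReviewer": "microsoft/codereviewer",
}

models = {}
for name, hub_id in CHECKPOINTS.items():
    tokenizer = AutoTokenizer.from_pretrained(hub_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(hub_id)
    models[name] = (tokenizer, model)
```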
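For the performance-evaluation step, the sketch below computes the two reported metrics on a toy example: exact-match accuracy (assumed here as the accuracy measure) and corpus-level BLEU-4. The NLTK-based implementation and the smoothing method are assumptions rather than the exact evaluation scripts.

```python
# Minimal sketch of the two evaluation metrics on whitespace-tokenized code.
# Exact match as the accuracy measure and the NLTK smoothing method are assumptions.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that match their reference token-for-token."""
    hits = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return hits / len(references)

def bleu4(predictions, references):
    """Corpus-level BLEU with uniform 1- to 4-gram weights."""
    hypotheses = [p.split() for p in predictions]
    reference_lists = [[r.split()] for r in references]
    smooth = SmoothingFunction().method1
    return corpus_bleu(reference_lists, hypotheses,
                       weights=(0.25, 0.25, 0.25, 0.25),
                       smoothing_function=smooth)

predictions = ["return a + b ;", "return a - b ;"]
references = ["return a + b ;", "return b - a ;"]
print(exact_match_accuracy(predictions, references), bleu4(predictions, references))
```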
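For the reliability analysis, the next sketch illustrates the three checks on raw code strings: verbatim duplication between the training and test sets, verbatim duplication within a test set, and token-level similarity between a model's output and its input. The whitespace normalization and the Jaccard similarity measure are simplifying assumptions.

```python
# Minimal sketch of the reliability checks. Whitespace normalization and
# Jaccard similarity are simplifying assumptions, not the exact procedure.
def normalize(code):
    """Collapse whitespace so formatting differences do not hide duplicates."""
    return " ".join(code.split())

def cross_set_duplication(train_samples, test_samples):
    """Share of test samples that also appear verbatim in the training set."""
    train = {normalize(s) for s in train_samples}
    return sum(normalize(s) in train for s in test_samples) / len(test_samples)

def within_set_duplication(test_samples):
    """Share of test samples that duplicate another sample in the same set."""
    normalized = [normalize(s) for s in test_samples]
    return 1 - len(set(normalized)) / len(normalized)

def output_input_similarity(output_code, input_code):
    """Jaccard similarity between the output and input token sets."""
    out, inp = set(output_code.split()), set(input_code.split())
    return len(out & inp) / max(len(out | inp), 1)
```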
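Finally, for the explainability analysis, the sketch below obtains token-level SHAP attributions by wrapping a scoring function around a checkpoint and pairing it with SHAP's text masker. The checkpoint name, the fixed reference output, and the scoring function (average token log-likelihood of that reference) are illustrative assumptions, not our exact setup.

```python
# Minimal sketch: token-level SHAP attributions for a program-generation model.
# The checkpoint, the fixed reference output, and the scoring function
# (average token log-likelihood of that reference) are illustrative assumptions.
import numpy as np
import shap
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

hub_id = "Salesforce/codet5-base"  # a fine-tuned checkpoint would be used in practice
tokenizer = AutoTokenizer.from_pretrained(hub_id)
model = AutoModelForSeq2SeqLM.from_pretrained(hub_id)
model.eval()

reference_output = "return a + b ;"  # hypothetical target sequence

def score(texts):
    """Average token log-likelihood of the reference output given each input."""
    labels = tokenizer(reference_output, return_tensors="pt").input_ids
    scores = []
    for text in texts:
        inputs = tokenizer(str(text), return_tensors="pt", truncation=True)
        with torch.no_grad():
            loss = model(**inputs, labels=labels).loss  # mean token cross-entropy
        scores.append(-loss.item())
    return np.array(scores)

explainer = shap.Explainer(score, shap.maskers.Text(tokenizer))
shap_values = explainer(["public int add ( int a , int b ) { return a + b ; }"])
print(shap_values.values[0])  # per-token contributions to the score
```

In our analysis, such per-token attributions are aggregated by token type (e.g., identifiers, keywords, operators) to study which parts of the input the models rely on when generating code.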