Additional Resources

List of Relevant Concepts

We conducted semi-structured and structured surveys with developers at software companies to understand the nature of the software development concepts that appear in relevant comments. Based on the survey outcomes, we have released a set of concepts and their classes at [7] (some examples are shown in Table 2).

Participants can use this list to analyse comments and match them against the listed concepts to obtain the set of comprehension-relevant concepts. Here, matching means exact string matching, n-gram matching, or similarity matching using cosine similarity or edit distance, with the threshold chosen by the participants.
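As an illustration only, the matching strategies listed above could be combined as in the following Python sketch. The function names, the character-trigram representation used for cosine similarity, and the default thresholds (`cos_threshold=0.7`, `max_edits=1`) are assumptions made for this example; they are not part of the released resources, and participants are expected to tune their own thresholds.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def windows(tokens, n):
    """All contiguous n-token phrases (word n-grams) in the token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def edit_distance(s, t):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]


def char_trigram_cosine(s, t):
    """Cosine similarity between character-trigram count vectors."""
    a = Counter(s[i:i + 3] for i in range(len(s) - 2))
    b = Counter(t[i:i + 3] for i in range(len(t) - 2))
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def match_concepts(comment, concepts, cos_threshold=0.7, max_edits=1):
    """Map each matched concept to the strategy that matched it.

    Thresholds are illustrative assumptions, not recommended values.
    """
    tokens = tokenize(comment)
    matched = {}
    for concept in concepts:
        concept_tokens = tokenize(concept)
        if not concept_tokens:
            continue
        phrase = " ".join(concept_tokens)
        # Compare the concept with every comment n-gram of the same length.
        candidates = windows(tokens, len(concept_tokens))
        if phrase in candidates:
            matched[concept] = "exact"
        elif any(edit_distance(phrase, c) <= max_edits for c in candidates):
            matched[concept] = "edit_distance"
        elif any(char_trigram_cosine(phrase, c) >= cos_threshold
                 for c in candidates):
            matched[concept] = "cosine"
    return matched


# Hypothetical comment and concept list, for demonstration only.
comment = "Frees the linked lst nodes and resets the mutex lock"
print(match_concepts(comment, ["linked list", "mutex", "binary tree"]))
```

The sketch tries the cheapest test first (exact n-gram match) and falls back to the fuzzy measures only when it fails, so a misspelt phrase such as "linked lst" can still be associated with the concept "linked list".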

Pre-trained Embeddings for Software Development Concepts

Finding similar words helps us to locate concepts in semantically related code and comment pairs that cannot be captured through syntactic matching alone. Existing pre-trained embeddings, such as FastText [1] and GloVe [11], are trained on general English corpora and may therefore fail to fully capture the senses and meanings that words take on in the software development domain and in code comments.

Therefore, we trained our own contextualised embeddings (SWVec) using the context-aware ELMo architecture [10] and a data corpus drawn from multiple sources related to software development: (i) 19GB of posts from Stack Overflow; and (ii) 11GB of computing books and papers from our institutional library, related to programming, algorithms, architecture, memory, metrics, etc.

The 200-dimensional SWVec pre-trained embeddings have been released for this task in [6]. They can be used for semantic analysis of the concepts in a comment and its associated code.
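A minimal sketch of such a semantic comparison is given below. It assumes the embeddings have already been loaded into a plain Python dictionary that maps each word to a vector; the loading step is omitted because it depends on the file format of the release in [6]. The three-dimensional vectors, the words, and the function names are toy values invented for illustration and are not taken from SWVec, whose vectors have 200 dimensions.

```python
import math


def mean_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0


def rank_concepts(comment_tokens, concepts, embeddings):
    """Rank concepts by similarity to the averaged comment vector."""
    comment_vec = mean_vector(comment_tokens, embeddings)
    if comment_vec is None:
        return []
    scored = []
    for concept in concepts:
        concept_vec = mean_vector(concept.split(), embeddings)
        if concept_vec is not None:
            scored.append((concept, cosine(comment_vec, concept_vec)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors (assumption): stand-ins for real embeddings.
toy_embeddings = {
    "heap": [0.9, 0.1, 0.0],
    "malloc": [0.8, 0.2, 0.1],
    "memory": [0.85, 0.15, 0.05],
    "thread": [0.1, 0.9, 0.2],
    "mutex": [0.05, 0.95, 0.1],
}

ranked = rank_concepts(["malloc", "heap"], ["memory", "thread"], toy_embeddings)
for concept, score in ranked:
    print(f"{concept}: {score:.3f}")
```

Averaging static word vectors is only one simple way to compare a comment with a concept. Because SWVec is built on a contextual architecture, participants may prefer to obtain context-dependent vectors for whole comments instead.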

References

[1] Ben Athiwaratkun, Andrew Gordon Wilson, and Anima Anandkumar. Probabilistic FastText for multi-sense word embeddings. arXiv preprint arXiv:1806.02901, 2018.

[2] Sergio Cozzetti B de Souza, Nicolas Anquetil, and Kathia M de Oliveira. A study of the documentation essential to software maintenance. Conference on Design of communication, pages 68–75. ACM, 2005.

[3] Len Erlikh. Leveraging legacy system dollars for e-business. IT professional, 2(3):17–23, 2000.

[4] Jose Luis Freitas, Daniela da Cruz, and Pedro Rangel Henriques. A comment analysis approach for program comprehension. Annual Software Engineering Workshop (SEW), pages 11–20. IEEE, 2012.

[5] Natasa Gisev et al. Interrater agreement and interrater reliability: key concepts, approaches, and applications. Research in Social and Administrative Pharmacy, 9(3):330–338, 2013.

[6] Srijoni Majumdar. Pre-trained word embeddings for software development related concepts. Open Source, 2022. https://github.com/SMARTKT/WordEmbeddings, Last Accessed: April 01, 2022.

[7] Srijoni Majumdar. Software development ontology. Open Source, 2022. https://github.com/SMARTKT/CommentProbe/tree/master/Comment_Examples/SD_ONTOLOGY, Last Accessed: April 01, 2022.

[8] Srijoni Majumdar, Ayush Bansal, Partha Pratim Das, Paul D Clough, Kausik Datta, and Soumyan Kanti Ghosh. Automated evaluation of comments to aid software maintenance. Journal of Software: Evolution and Process, Wiley, 2022.

[9] Johannes Martin and Hausi A Muller. Strategies for migration from C to Java. European Conference on Software Maintenance and Reengineering, pages 200–209. IEEE, 2001.

[10] Yifan Peng, Shankai Yan, and Zhiyong Lu. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. arXiv preprint arXiv:1906.05474, 2019.

[11] Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[12] Lutz Prechelt. An empirical comparison of C, C++, Java, Perl, Python, Rexx and Tcl. IEEE Computer, IEEE, 33(10):23–29, 2000.

[13] Tobias Roehm et al. How do professional developers comprehend software? International Conference on Software Engineering (ICSE), pages 255–265. IEEE, 2012.