Text Analytics for Software Engineering

Presentation slides for the technical briefing at ESEC/FSE 2011

Here is a bibliography on text analytics for software engineering.

Here is the web site (including presentation slides) for a technical briefing on "Management of Unstructured Information during Software Evolution: Applications of Text Retrieval" by Andrian Marcus at ESEC/FSE 2011.

Software engineering data contains a wealth of natural language text: requirements documents, code comments, identifier names, commit logs, release notes, mailing list discussions, and so on. This text is essential to the software engineering process, helping software engineers and researchers better understand and maintain software. Given the overwhelming amount of available text, there is high demand for text analytics, including natural language processing (NLP) and text mining techniques, to analyze it automatically and thereby improve software quality and productivity. The application of NLP and text mining techniques to software engineering data dates back about a decade. In the past five years, text analytics for software engineering has become an emerging topic in the software engineering area. Various recent studies have shown that automated analysis of natural language text can improve software reliability, programming productivity, software maintenance, and software quality in general.

This technical briefing (1) provides a quick overview of major text mining techniques as well as NLP techniques (e.g., part-of-speech tagging, chunking, semantic labeling, semantic pattern matching, and negative-expression identification), machine learning techniques (e.g., clustering and decision-tree-based classification), and data mining techniques (e.g., frequent itemset mining); (2) introduces popular text analysis tools (e.g., WordNet and Weka); (3) summarizes major research work in the area of text analytics for software engineering; and (4) outlines future research directions and highlights research challenges. More information on the technical briefing can be found at https://sites.google.com/site/text4se/.
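To give a flavor of one technique named above, the following is a minimal sketch of frequent itemset mining, restricted to itemsets of size two, over commit log messages. It is an illustration only and is not taken from the briefing materials: the commit messages, the stop-word list, and the names tokenize and frequent_pairs are all hypothetical, and a realistic pipeline would more likely rely on an NLP toolkit and a full Apriori-style miner.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical commit log messages; not taken from any real project.
COMMIT_LOGS = [
    "Fix null pointer check in parser",
    "Add null check before lock release",
    "Fix lock release order in parser",
    "Update documentation for parser options",
]

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"in", "for", "before", "the", "a", "an", "of", "to"}


def tokenize(message):
    """Lowercase a message and return its set of non-stop-word tokens."""
    words = re.findall(r"[a-z]+", message.lower())
    return {w for w in words if w not in STOP_WORDS}


def frequent_pairs(messages, min_support=2):
    """Return word pairs that co-occur in at least min_support messages."""
    counts = Counter()
    for message in messages:
        # Sort tokens so each pair has one canonical ordering.
        tokens = sorted(tokenize(message))
        counts.update(combinations(tokens, 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    for pair, support in sorted(frequent_pairs(COMMIT_LOGS).items()):
        print(pair, support)
```

On this toy input, the pairs that reach the support threshold are (check, null), (fix, parser), and (lock, release), each appearing in two messages. Patterns of this kind are the sort of raw signal that later analysis steps would filter and interpret.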

Lin Tan has been an Assistant Professor in the Department of Electrical and Computer Engineering at the University of Waterloo, Canada, since 2009, after receiving her Ph.D. in Computer Science from the University of Illinois, Urbana-Champaign. Her research interests include software reliability and security, with a focus on applying natural language processing and machine learning techniques to improve software system reliability. Her recent work (ICSE’11, MSR’11, ICSE’09, SOSP’07) has been on analyzing natural language text, such as code comments and commit logs, to improve software reliability and quality. URL: https://ece.uwaterloo.ca/~lintan/


Tao Xie has been an Associate Professor in the Department of Computer Science at North Carolina State University, USA, since 2005, after receiving his Ph.D. in Computer Science from the University of Washington at Seattle. His research interests are in automated software testing and mining software engineering data, including recent work (ASE’10, MSR’10, ASE’09, ICSE’08) on applying NLP and text mining to software engineering data. He has co-presented a number of tutorials on mining software engineering data and software testing at past ICSEs. URL: http://www.csc.ncsu.edu/faculty/xie/
