Recommending and Localizing Change Requests for Mobile Apps Based on User Reviews

Abstract

Researchers have proposed several approaches to extract information useful for maintaining and evolving mobile apps from user reviews. However, most of them simply classify user reviews automatically according to specific keywords (e.g., bugs, features). Moreover, they do not provide any support for linking user feedback to the source code components to be changed, thus requiring a manual, time-consuming, and error-prone task. In this paper, we introduce ChangeAdvisor, a novel approach that analyzes the structure, semantics, and sentiment of the sentences contained in user reviews to extract user feedback useful from a maintenance perspective and to recommend changes to software artifacts to developers. It relies on natural language processing and clustering algorithms to group user reviews around similar user needs and suggestions for change. It then uses textual-based heuristics to determine the code artifacts that need to be maintained according to the recommended software changes. The quantitative and qualitative studies carried out on 44,683 user reviews of 10 open-source mobile apps, involving their original developers, showed a high accuracy of ChangeAdvisor in (i) clustering similar user change requests and (ii) identifying the code components impacted by the suggested changes. Moreover, the obtained results show that ChangeAdvisor is more accurate than a baseline approach for linking user feedback clusters to the source code in terms of both precision (+47%) and recall (+38%).

Threshold Evaluation

As explained in the paper, the output of our approach is a ranked list in which the links having the highest similarity values are reported at the top. Pairs of (cluster, component) having a Dice similarity coefficient higher than a threshold are considered a link by ChangeAdvisor. We experimented with different values for this threshold; the best results were achieved when using the third quartile of the distribution of the Dice similarity coefficients computed for a given application. The results achieved using different thresholds are reported in the following table (a Python sketch of this quartile-based selection is shown after the table):

app                   precision (3rd quartile)   precision (2nd quartile)   precision (1st quartile)
Frostwire             81                         66                         34
K-9 Mail              84                         62                         27
AC Display            82                         60                         41
Wordpress             79                         55                         33
Solitaire             75                         63                         42
Shortyz Crossword     71                         66                         52
SMS Backup +          67                         64                         44
Focal                 83                         81                         64
Cool Reader           79                         77                         52
FB Reader             83                         74                         45
Overall               81                         64                         37
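
For illustration, the following Python sketch shows how such a quartile-based selection over Dice similarities between the terms of a feedback cluster and the terms of a code component could be implemented. The term sets, the identifiers, and the use of the standard (symmetric) Dice formula are illustrative assumptions and do not reflect the actual ChangeAdvisor implementation.

    # Minimal sketch of the quartile-based link selection described above.
    # Assumptions: clusters and components are given as bags of preprocessed
    # terms; the standard (symmetric) Dice formula is used, while the tool
    # may rely on a variant of it.
    import numpy as np

    def dice(terms_a, terms_b):
        """Dice similarity between two term sets: 2*|A & B| / (|A| + |B|)."""
        a, b = set(terms_a), set(terms_b)
        if not a or not b:
            return 0.0
        return 2.0 * len(a & b) / (len(a) + len(b))

    def select_links(clusters, components, quartile=75):
        """Keep (cluster, component) pairs whose similarity exceeds the chosen
        quartile of the app-level score distribution (75 = third quartile)."""
        scored = [(c_id, k_id, dice(c_terms, k_terms))
                  for c_id, c_terms in clusters.items()
                  for k_id, k_terms in components.items()]
        threshold = np.percentile([s for _, _, s in scored], quartile)
        links = [(c, k, s) for c, k, s in scored if s > threshold]
        return sorted(links, key=lambda t: t[2], reverse=True)  # ranked list

    # Hypothetical toy input: one feedback cluster and two code components.
    clusters = {"crash-on-sync": ["sync", "crash", "account", "imap"]}
    components = {"ImapSync.java": ["imap", "sync", "folder", "account"],
                  "MessageView.java": ["message", "view", "render"]}
    print(select_links(clusters, components))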

Cohesiveness of User Feedback Clusters

In the paper we reported the aggregated results of the evaluation of the cohesiveness of the user feedback clusters. This file reports the results achieved on each of the apps: link

Direct Linking User Reviews to Source Code Components

The following table reports the precision achieved when linking user reviews directly to source code components, i.e., without first clustering them:

app                   precision
Frostwire             14
K-9 Mail              19
AC Display            8
Wordpress             11
Solitaire             4
Shortyz Crossword     3
Focal
Cool Reader           12
FB Reader             9
Overall               9

Comparison between LDA, HDP-LDA and LDA-GA

As explained in the paper, to cluster user feedback we experimented with three techniques, namely the LDA technique exploited by Asuncion et al., the HDP-LDA solution proposed by Teh et al., and the LDA-GA algorithm devised by Panichella et al. Specifically, we ran the three techniques on the apps in our dataset and manually evaluated (i) the execution time and (ii) the differences in the resulting clusters. Moreover, we asked the external developers involved in RQ1 to evaluate the cohesiveness of the clusters created by the three approaches. Since the underlying model (i.e., LDA) is the same, the three techniques mainly differ in the parameter alpha, which was manually set in the case of LDA (we used the configurations suggested by Asuncion et al., namely alpha = 10, 20, 30) and automatically derived by the other two approaches. This file contains the results achieved on the apps in our dataset: link
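
As a purely illustrative sketch of the manually configured LDA runs mentioned above, the snippet below uses gensim to fit LDA models with alpha set to 10, 20, and 30 and, for comparison, an HDP model whose topic structure is inferred automatically. The toy corpus, the number of topics, and the choice of gensim are assumptions and do not correspond to the implementation used in the study; LDA-GA, which tunes the configuration through a genetic algorithm, is not shown.

    # Illustrative only: gensim-based LDA runs with manually set alpha values,
    # plus HDP-LDA for comparison. The corpus, num_topics, and library choice
    # are assumptions; the study's actual implementation may differ.
    from gensim.corpora import Dictionary
    from gensim.models import HdpModel, LdaModel

    # Toy user-feedback documents, already preprocessed into tokens.
    docs = [["sync", "crash", "account"],
            ["login", "fail", "account"],
            ["dark", "theme", "request"],
            ["crash", "open", "attachment"]]

    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]

    # Manually configured LDA (alpha = 10, 20, 30, as in the configurations above).
    for alpha in (10, 20, 30):
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                       alpha=alpha, random_state=42)
        print("alpha =", alpha, lda.print_topics())

    # HDP-LDA: the number of topics is inferred rather than fixed in advance.
    hdp = HdpModel(corpus=corpus, id2word=dictionary)
    print(hdp.print_topics(num_topics=2))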

Comparison between ChangeAdvisor and BLUiR
This file reports the detailed results of the comparison between ChangeAdvisor and BLUiR: link

Prototype

We provide a prototype of the approach, together with the dataset used in the study. The prototype is a runnable Docker container that allows one to exactly replicate the experiments we conducted on the provided dataset, as well as to apply the approach to any other app's data (i.e., reviews and source code).

Dataset: download