Recommending and Localizing Change Requests for Mobile Apps Based on User Reviews

Abstract

Researchers have proposed several approaches to extract information useful for maintaining and evolving mobile apps from user reviews. However, most of them simply classify user reviews automatically according to specific keywords (e.g., bugs, features). Moreover, they do not provide any support for linking user feedback to the source code components to be changed, thus requiring a manual, time-consuming, and error-prone task. In this paper, we introduce ChangeAdvisor, a novel approach that analyzes the structure, semantics, and sentiment of the sentences contained in user reviews to extract user feedback useful from a maintenance perspective and to recommend changes to software artifacts to developers. It relies on natural language processing and clustering algorithms to group user reviews around similar user needs and suggestions for change. It then uses textual-based heuristics to determine the code artifacts that need to be maintained according to the recommended software changes. The quantitative and qualitative studies carried out on 44,683 user reviews of 10 open-source mobile apps, involving their original developers, showed a high accuracy of ChangeAdvisor in (i) clustering similar user change requests and (ii) identifying the code components impacted by the suggested changes. Moreover, the obtained results show that ChangeAdvisor is more accurate than a baseline approach for linking user feedback clusters to the source code in terms of both precision (+47%) and recall (+38%).

Threshold Evaluation

As explained in the paper, the output of our approach is a ranked list in which the links having the highest similarity values are reported at the top. Pairs of (cluster, component) having a Dice similarity coefficient higher than a threshold are considered a link by ChangeAdvisor. We experimented with different values for this threshold; the best results were achieved when using the third quartile of the distribution of the Dice similarity coefficients computed for a given application. The results achieved using different thresholds are reported in the following table (a Python sketch of this quartile-based selection is shown after the table):

app                   precision (3rd quartile)   precision (2nd quartile)   precision (1st quartile)
Frostwire             81                         66                         34
K-9 Mail              84                         62                         27
AC Display            82                         60                         41
Wordpress             79                         55                         33
Solitaire             75                         63                         42
Shortyz Crossword     71                         66                         52
SMS Backup +          67                         64                         44
Focal                 83                         81                         64
Cool Reader           79                         77                         52
FB Reader             83                         74                         45
Overall               81                         64                         37
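
For illustration, the following Python sketch shows how such a quartile-based selection over Dice similarities between the terms of a feedback cluster and the terms of a code component could be implemented. The term sets, the identifiers, and the use of the standard (symmetric) Dice formula are illustrative assumptions and do not reflect the actual ChangeAdvisor implementation.

    # Minimal sketch of the quartile-based link selection described above.
    # Assumptions: clusters and components are given as bags of preprocessed
    # terms; the standard (symmetric) Dice formula is used, while the tool
    # may rely on a variant of it.
    import numpy as np

    def dice(terms_a, terms_b):
        """Dice similarity between two term sets: 2*|A & B| / (|A| + |B|)."""
        a, b = set(terms_a), set(terms_b)
        if not a or not b:
            return 0.0
        return 2.0 * len(a & b) / (len(a) + len(b))

    def select_links(clusters, components, quartile=75):
        """Keep (cluster, component) pairs whose similarity exceeds the chosen
        quartile of the app-level score distribution (75 = third quartile)."""
        scored = [(c_id, k_id, dice(c_terms, k_terms))
                  for c_id, c_terms in clusters.items()
                  for k_id, k_terms in components.items()]
        threshold = np.percentile([s for _, _, s in scored], quartile)
        links = [(c, k, s) for c, k, s in scored if s > threshold]
        return sorted(links, key=lambda t: t[2], reverse=True)  # ranked list

    # Hypothetical toy input: one feedback cluster and two code components.
    clusters = {"crash-on-sync": ["sync", "crash", "account", "imap"]}
    components = {"ImapSync.java": ["imap", "sync", "folder", "account"],
                  "MessageView.java": ["message", "view", "render"]}
    print(select_links(clusters, components))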

Cohesiveness of User Feedback Clusters

In the paper we reported the aggregated results of the evaluation of the cohesiveness of the user feedback clusters. This file reports the results achieved on each of the apps: link

Direct Linking User Reviews to Source Code Components

The following table reports the precision achieved when linking user reviews directly to source code components, i.e., without first clustering them:

app                   precision
Frostwire             14
K-9 Mail              19
AC Display            8
Wordpress             11
Solitaire             4
Shortyz Crossword     3
Focal
Cool Reader           12
FB Reader             9
Overall               9

Comparison between LDA, HDP-LDA and LDA-GA

As explained in the paper, to cluster user feedback we experimented with three techniques, namely the LDA technique exploited by Asuncion et al., the HDP-LDA solution proposed by Teh et al., and the LDA-GA algorithm devised by Panichella et al. Specifically, we ran the three techniques on the apps in our dataset and manually evaluated (i) the execution time and (ii) the differences in the resulting clusters. Moreover, we asked the external developers involved in RQ1 to evaluate the cohesiveness of the clusters created by the three approaches. Since the underlying model (i.e., LDA) is the same, the three techniques mainly differ in the parameter alpha, which was manually set in the case of LDA (we used the configurations suggested by Asuncion et al., namely alpha = 10, 20, 30) and automatically derived by the other two approaches. This file contains the results achieved on the apps in our dataset: link
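
As a purely illustrative sketch of the manually configured LDA runs mentioned above, the snippet below uses gensim to fit LDA models with alpha set to 10, 20, and 30 and, for comparison, an HDP model whose topic structure is inferred automatically. The toy corpus, the number of topics, and the choice of gensim are assumptions and do not correspond to the implementation used in the study; LDA-GA, which tunes the configuration through a genetic algorithm, is not shown.

    # Illustrative only: gensim-based LDA runs with manually set alpha values,
    # plus HDP-LDA for comparison. The corpus, num_topics, and library choice
    # are assumptions; the study's actual implementation may differ.
    from gensim.corpora import Dictionary
    from gensim.models import HdpModel, LdaModel

    # Toy user-feedback documents, already preprocessed into tokens.
    docs = [["sync", "crash", "account"],
            ["login", "fail", "account"],
            ["dark", "theme", "request"],
            ["crash", "open", "attachment"]]

    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]

    # Manually configured LDA (alpha = 10, 20, 30, as in the configurations above).
    for alpha in (10, 20, 30):
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                       alpha=alpha, random_state=42)
        print("alpha =", alpha, lda.print_topics())

    # HDP-LDA: the number of topics is inferred rather than fixed in advance.
    hdp = HdpModel(corpus=corpus, id2word=dictionary)
    print(hdp.print_topics(num_topics=2))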

Comparison between ChangeAdvisor and BLUiR
This file reports the detailed results of the comparison between ChangeAdvisor and BLUiR: link

Prototype

We provide a prototype of the approach, together with the dataset used in the study. The prototype is a runnable Docker container that allows one to exactly replicate the experiments we conducted on the provided dataset, as well as to apply the approach to any other app's data (i.e., reviews and source code).

Dataset: download