Crowdsourced testing is increasingly dominant in mobile application (app) testing, but inspecting the enormous number of resulting test reports is a great burden for app developers. Many approaches have been proposed to process test reports based only on text, or on text plus simple image features. However, in mobile app testing, the text in test reports is condensed and often insufficient. Screenshots are commonly included as complements, and they contain much richer information than the text alone. This trend motivates us to prioritize crowdsourced test reports based on a deep understanding of screenshots.
In this paper, we present a novel crowdsourced test report prioritization approach, namely DeepPrior. We first represent each crowdsourced test report with a newly introduced feature, namely DeepFeature, which captures all the widgets along with their texts, coordinates, types, and even intents, based on a deep analysis of the app screenshots and the textual descriptions in the report. DeepFeature comprises the Bug Feature, which directly describes the bug, and the Context Feature, which depicts the thorough context of the bug. The similarity between DeepFeatures, which we formally define as DeepSimilarity, is used to measure the similarity between test reports and to prioritize them. We also conduct an empirical experiment to evaluate the effectiveness of the proposed technique on a large dataset group. The results show that DeepPrior is promising: it outperforms the state-of-the-art approach with less than half the overhead.
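The prioritization idea above can be sketched as a greedy, diversity-first ordering: repeatedly pick the report least similar to those already selected, so that novel bugs surface early. This is a minimal illustrative sketch, not DeepPrior's actual implementation; `similarity` stands in for the DeepSimilarity computation, and the integer "reports" are placeholders for real report features.

```python
def prioritize(reports, similarity):
    """Greedily order reports so that each next report is the one
    least similar to any already-selected report (diversity-first)."""
    remaining = list(reports)
    ordered = [remaining.pop(0)]  # seed with the first report
    while remaining:
        # pick the report whose maximum similarity to the selected
        # set is smallest, i.e. the most novel one
        best = min(
            remaining,
            key=lambda r: max(similarity(r, s) for s in ordered),
        )
        remaining.remove(best)
        ordered.append(best)
    return ordered

if __name__ == "__main__":
    # toy demo: similarity decays with numeric distance
    sim = lambda a, b: 1.0 / (1.0 + abs(a - b))
    print(prioritize([1, 2, 3, 10], sim))  # -> [1, 10, 3, 2]
```

In the toy run, report `10` is selected right after the seed because it is the least similar to anything chosen so far, mirroring how a diversity-first ordering surfaces distinct bugs early.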
We construct an integrated dataset group comprising 4 datasets for crowdsourced test report prioritization:
a large-scale crowdsourced test report dataset with 536 reports from 10 apps
a large-scale widget image dataset containing 36,573 widget images of 14 different types
a large-scale test report keyword database with 8,647 domain keywords
a large-scale text classification dataset with 4,340 textual segments of 2 categories
In this section, we provide part of the datasets for reference.
*Note: The data are in Chinese. With the development of NLP technologies, the approach can also be adapted to other languages.
We provide 200 crowdsourced test reports from the dataset.
We provide 100 widget images for each widget type from the dataset.
We provide 1,000 keywords from the database.
We provide 400 text segments of 2 different categories, including Bug Description and Reproduction Step.
We provide the full report list of one app to make it possible to reproduce the experimental results, and we also attach all the code files for your reference.
RQ1: How effectively can DeepPrior identify the types of widgets extracted from app screenshots?
RQ2: How effectively can DeepPrior classify the textual descriptions in crowdsourced test reports?
RQ3: How effectively can DeepPrior prioritize crowdsourced test reports?