GIFdroid automatically replays video (GIF) based bug reports for Android apps in three main phases: (i) the Keyframe Location phase, which identifies a sequence of keyframes in an input visual recording; (ii) the GUI Mapping phase, which maps each located keyframe to a GUI in the UTG, yielding an index sequence; and (iii) the Execution Trace Generation phase, which uses the index sequence to derive an optimal replayable execution trace.
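As an illustration of the first phase, the sketch below locates keyframes by treating them as the frames where the GUI has just settled, i.e., where consecutive frames become nearly identical. Normalized cross-correlation is used here only as a simple stand-in for the paper's frame-similarity scoring, and the threshold is an assumption.

```python
# A minimal sketch of the Keyframe Location phase (Phase 1).
# The similarity measure and threshold are illustrative, not GIFdroid's exact method.
import cv2

def locate_keyframes(recording_path, stability_threshold=0.999):
    cap = cv2.VideoCapture(recording_path)
    keyframes, prev_gray, in_transition = [], None, True
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            # Image and template have the same size, so the result is a 1x1 score.
            score = cv2.matchTemplate(gray, prev_gray, cv2.TM_CCOEFF_NORMED)[0][0]
            if score >= stability_threshold:
                if in_transition:
                    # The GUI has just settled: keep this frame as a keyframe.
                    keyframes.append(frame)
                    in_transition = False
            else:
                # The screen is still changing (animation or transition).
                in_transition = True
        prev_gray = gray
    cap.release()
    return keyframes
```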
Our method performs much better than the other baselines, with a 32%, 106%, and 14% boost in recall, precision, and F1-score, respectively, compared with the best baselines (ILS-SUMM, PySceneDetect).
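Here, "boost" is read as the relative improvement over the best baseline's score (our assumption); under that reading, a 106% boost in precision means the precision roughly doubles:

\[
\text{boost} = \frac{\text{metric}_{\text{GIFdroid}} - \text{metric}_{\text{baseline}}}{\text{metric}_{\text{baseline}}} \times 100\%
\]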
Our method outperforms the baselines on all metrics, achieving 85.4%, 90.0%, and 91.3% for Precision@1, Precision@2, and Precision@3, respectively. Combining SSIM and ORB leads to a substantial improvement (9.7% higher) over either single feature, indicating that they complement each other. In particular, ORB handles the image distortion that causes false GUI mappings when only SSIM is considered.
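A minimal sketch of how a keyframe could be scored against a UTG screenshot by combining SSIM and ORB is shown below; the equal weighting and the ORB score normalization are illustrative assumptions, not GIFdroid's exact formula.

```python
# Sketch of the combined SSIM + ORB similarity used for GUI mapping (Phase 2).
# Both inputs are grayscale screenshots of the same resolution.
import cv2
from skimage.metrics import structural_similarity

def gui_similarity(keyframe_gray, screenshot_gray, weight=0.5):
    # SSIM captures overall layout resemblance.
    ssim_score = structural_similarity(keyframe_gray, screenshot_gray)

    # ORB keypoint matching is more robust to local distortion.
    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(keyframe_gray, None)
    kp2, des2 = orb.detectAndCompute(screenshot_gray, None)
    if des1 is None or des2 is None:
        orb_score = 0.0
    else:
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = matcher.match(des1, des2)
        # Fraction of keypoints that found a cross-checked match.
        orb_score = len(matches) / max(len(kp1), len(kp2), 1)

    return weight * ssim_score + (1 - weight) * orb_score
```

The keyframe would then be mapped to the UTG node whose screenshot maximizes this score.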
Our method achieves 89.59% sequence similarity, which is much higher than that of the baselines. The hard requirements of V2S limit its generality in real testing environments, especially in open-source software development. In addition, adding LCS mitigates the errors introduced in the first two steps of our approach, boosting performance from 82.63% to 89.59%. Although applying LCS takes slightly more runtime (13.25 seconds on average), this does not affect real-world usage because it can run automatically offline.
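The sketch below shows the LCS alignment step in its textbook dynamic-programming form: aligning the mapped index sequence against a candidate path in the UTG allows indices mis-mapped in the earlier phases to be skipped rather than derailing the whole trace. This is a generic formulation under our own naming, not necessarily the paper's exact algorithm.

```python
# Textbook LCS between the mapped index sequence and a candidate UTG path.
def longest_common_subsequence(index_seq, utg_path):
    m, n = len(index_seq), len(utg_path)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if index_seq[i] == utg_path[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))

    # Backtrack to recover the aligned subsequence of GUI indices.
    lcs, i, j = [], m, n
    while i > 0 and j > 0:
        if index_seq[i - 1] == utg_path[j - 1]:
            lcs.append(index_seq[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return lcs[::-1]
```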
We recruit 8 participants, including 6 graduate students (4 Master's and 2 Ph.D. students in SE) and 2 software developers (from Alibaba), to participate in the experiment. We first give them an introduction to our study and a real example to try. Each participant is then asked to reproduce the same set of 10 randomly selected visual bug recordings from GitHub, which are of diverse difficulty, ranging from 6 to 11 steps until triggering the bug. The study involves two groups of four participants: the experimental group (P1, P2, P3, P4), who are given the execution trace generated by our tool, and the control group (P5, P6, P7, P8), who start from scratch. Each pair of participants ⟨Px, Px+4⟩ has comparable development experience, so that the experimental group has capability similar to that of the control group overall. Note that we do not ask participants to finish half of the tasks with our tool and the other half without it, to avoid potential tool bias. We record the time used to reproduce the visual bug recordings in Android. Participants have up to 10 minutes for each bug replay.
The experimental group reproduces the visual bug recordings faster than the control group (an average of 65.0 seconds versus 171.4 seconds). In fact, the average time of the control group is underestimated, because three bugs could not be reproduced within 10 minutes, meaning those participants would have needed even more time.
The test results suggest that our tool significantly helps the experimental group reproduce bug recordings more efficiently (p-value < 0.01).
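One way such a comparison could be run is sketched below; the Mann-Whitney U test is our assumption, not necessarily the test used in the study.

```python
# Illustrative significance test over per-task reproduction times (in seconds).
from scipy.stats import mannwhitneyu

def compare_groups(experimental_times, control_times):
    # One-sided test: are the experimental group's times stochastically smaller?
    stat, p_value = mannwhitneyu(experimental_times, control_times, alternative="less")
    return stat, p_value
```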
Figure: Example GitHub issues and the execution traces generated for them.
Code Repository: https://github.com/gifdroid/gifdroid