FSE Response

Detailed experiment settings on viewing angle

We select a subset of GUI elements from our dataset and assign each one to one of the following six categories, based on the angle from which it is viewed in the screenshot:

Direct front. The front of the GUI element directly faces the viewer.
Direct back. The back of the GUI element directly faces the viewer.
Direct side. The side of the GUI element directly faces the viewer.
Direct top. The top of the GUI element directly faces the viewer.
Eyelevel. The viewer observes the GUI element at eye level from an oblique angle.
Overlook. The viewer looks down on the GUI element from an oblique angle.

Although an image may contain several GUI elements at different angles, we focus on only one of them per image and treat it as the only GUI element, removing the labels of all the others. In this setting, Precision is not meaningful, because correct detections of the now-unlabeled elements would be counted as false positives, so we compare performance using Recall. The images are divided into splits according to the viewing angle of the focused GUI element they contain.
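The label filtering and splitting described above can be sketched as follows. This is a minimal illustration, not the script used for the experiment: the record layout (`image_id`, `annotations`, `focus_id`, `angle`) and the function name `build_angle_splits` are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# The six viewing-angle categories listed above.
ANGLES = ("Direct front", "Direct back", "Direct side",
          "Direct top", "Eyelevel", "Overlook")


def build_angle_splits(images):
    """Keep only the focused element's label per image and group images by angle.

    `images` is a list of dicts with hypothetical keys:
      image_id    - any identifier
      annotations - list of dicts, each with at least an 'id' key
      focus_id    - id of the single annotation to keep
      angle       - one of ANGLES, describing the focused element
    """
    splits = defaultdict(list)
    for img in images:
        if img["angle"] not in ANGLES:
            raise ValueError(f"unknown angle: {img['angle']!r}")
        # Drop every label except the focused element's.
        focused = [a for a in img["annotations"] if a["id"] == img["focus_id"]]
        if len(focused) != 1:
            raise ValueError(f"image {img['image_id']}: expected exactly one focused label")
        splits[img["angle"]].append(
            {"image_id": img["image_id"], "annotations": focused}
        )
    return dict(splits)
```

Each split then contains images with exactly one ground-truth label, which is why only Recall-style metrics are informative in this setting.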

Since the Recall calculation in our paper involves Precision, we use Average Recall instead. It is computed by averaging Recall over 10 IoU thresholds from 0.5 to 0.95 in steps of 0.05, and then over all categories, as in the official COCO API.
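As a rough illustration of this metric, the sketch below computes Average Recall for the single-label-per-image setting. It is a simplified stand-in, not the COCO API itself: it assumes boxes in `(x1, y1, x2, y2)` form, ignores COCO's cap on the number of detections per image, and uses hypothetical names (`iou`, `average_recall`, and the `samples` layout).

```python
from collections import defaultdict

# 10 IoU thresholds: 0.50, 0.55, ..., 0.95 (built from integers to avoid float drift).
IOU_THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_recall(samples):
    """Average Recall for images that each carry one ground-truth box.

    `samples` is a list of (gt_category, gt_box, detections) tuples, where
    `detections` is a list of (category, box) pairs (hypothetical layout).
    """
    best_iou_by_cat = defaultdict(list)
    for gt_cat, gt_box, dets in samples:
        # With one ground truth per image, it is recalled at threshold t
        # exactly when the best same-category detection reaches IoU >= t.
        best = max((iou(gt_box, box) for cat, box in dets if cat == gt_cat),
                   default=0.0)
        best_iou_by_cat[gt_cat].append(best)

    per_cat_ar = []
    for best_ious in best_iou_by_cat.values():
        # Recall at each threshold, then the mean over the 10 thresholds.
        recalls = [sum(b >= t for b in best_ious) / len(best_ious)
                   for t in IOU_THRESHOLDS]
        per_cat_ar.append(sum(recalls) / len(recalls))
    # Finally, the mean over categories.
    return sum(per_cat_ar) / len(per_cat_ar) if per_cat_ar else 0.0
```

Running this once per angle split gives one Average Recall value per viewing angle. Extra detections never lower the score, which is the property that makes the metric usable after the other labels have been removed.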

The results are shown as follows:

Regarding interactability, each image contains only one labeled GUI element, which is easy for the model to cover, resulting in high performance overall. The Overlook angle yields significantly higher performance than the other angles. This might be because more sides of a GUI element are visible from this perspective than from the others. It suggests that oblique viewing angles may provide more features to the model and thus improve its performance. Among the direct angles, the front performs best, suggesting that the front side of GUI elements, which usually faces the user, may contain more features.

Regarding semantics, the Overlook angle maintains its advantage over the other angles, followed by Direct top and Direct front, which is consistent with the interactability results. Notably, the Average Recall of the Direct back angle drops to 0, indicating how difficult it is to infer semantics from the back side, which usually contains the fewest features.

Overall, the viewing angle does affect detection effectiveness. Viewing GUI elements from an angle that exposes more features can improve the model's performance.

Performance Comparison of SOTA VLM/GUI Agents

The experiment settings are the same as those in the paper.

Comparison on automated testing experiment

The experiment settings are the same as those in the paper.

Visualized experiment results
