Qualitative Analysis on RQ3 Component-wise Study
The InputBox is sometimes confused with TextButton, CombinedButton, and TextView, since these widgets are visually similar.
The InputBox does not have clear boundaries.
The Chart does not have clear boundaries.
We observe a failure rate of only 1 out of 100. The single failure occurs because the object detector did not detect the relevant widget in the first place, so the widget is not highlighted in the input fed to the VLM and is therefore ignored.
Example: The action is "See all channels", so the expected answer is to click the "See all >" button. However, the object detector does not detect this widget, so the VLM has no way to locate it.
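
The failure mode above can be illustrated with a minimal sketch: a target widget is only selectable by the VLM if some detected box overlaps it. The sketch is not the pipeline used in this study. The helper names (`iou`, `target_is_reachable`), the IoU threshold of 0.5, and all box coordinates are assumptions made for illustration.

```python
from typing import List, Tuple

# A box is (x1, y1, x2, y2) in pixels; this layout is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def target_is_reachable(target: Box, detected: List[Box],
                        threshold: float = 0.5) -> bool:
    """True if at least one detected box overlaps the target widget enough
    for it to be highlighted. The threshold is a hypothetical choice."""
    return any(iou(target, d) >= threshold for d in detected)


# Hypothetical coordinates: the "See all >" button and two detected widgets
# that do not cover it.
see_all_button: Box = (300.0, 120.0, 360.0, 140.0)
detections: List[Box] = [(20.0, 100.0, 200.0, 160.0),
                         (20.0, 200.0, 340.0, 400.0)]

missed = not target_is_reachable(see_all_button, detections)
found_after_fix = target_is_reachable(see_all_button,
                                      detections + [see_all_button])
```

Under these assumed boxes, `missed` is true because neither detection overlaps the button, which mirrors the case where the VLM never sees the widget highlighted. A check of this kind separates detector misses from VLM selection errors when attributing failures.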