Kunpeng Li, Chen Fang, Zhaowen Wang, Seokhwan Kim, Hailin Jin and Yun Fu
Northeastern University, Adobe Research
Screencast tutorials are videos created by people to teach how to use software applications or demonstrate procedures for accomplishing tasks. They are popular among both novice and experienced users for learning new skills, compared with other tutorial media such as text, because of their visual guidance and ease of understanding. In this paper, we propose visual understanding of screencast tutorials as a new research problem for the computer vision community. We collect a new dataset of Adobe Photoshop video tutorials and annotate it with both low-level and high-level semantic labels. We introduce a bottom-up pipeline to understand Photoshop video tutorials. We leverage state-of-the-art object detection algorithms with domain-specific visual cues to detect important events in a video tutorial and segment it into clips according to the detected events. We propose a visual cue reasoning algorithm for two high-level tasks: video retrieval and video captioning. We conduct extensive evaluations of the proposed pipeline. Experimental results show that it is effective for understanding video tutorials. We believe our work will serve as a starting point for future research on this important application domain of video understanding.
Overview of the proposed two-stage screencast tutorial understanding pipeline. The first, low-level stage segments a video into short clips, each corresponding to an atomic software operation. The second stage analyzes the high-level semantics of each clip for downstream use cases, including video clip retrieval and video captioning, using models learned from labeled data collected through crowdsourcing.
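As a rough illustration of the first stage, the sketch below cuts a tutorial into clips at frames where an operation event has been detected. The Event fields, the helper name segment_by_events, and the minimum clip length are hypothetical placeholders for illustration, not the released implementation.

# Hypothetical sketch of the low-level stage: cutting a screencast into clips
# at detected operation events. Names and thresholds are illustrative only.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Event:
    frame_idx: int   # frame where the operation event (e.g. a tool change) fires
    label: str       # event type predicted by the detector

def segment_by_events(num_frames: int, events: List[Event],
                      min_clip_len: int = 8) -> List[Tuple[int, int]]:
    """Cut the video at event frames so each clip covers one atomic operation."""
    cut_points = sorted({e.frame_idx for e in events})
    clips, start = [], 0
    for cut in cut_points:
        if cut - start >= min_clip_len:      # ignore cuts that would create tiny clips
            clips.append((start, cut))
            start = cut
    if num_frames - start >= min_clip_len:   # trailing clip after the last kept cut
        clips.append((start, num_frames))
    return clips

# Example: three detected events in a 300-frame screencast; the two events
# at frames 120 and 123 are merged into one clip by the length threshold.
events = [Event(40, "select_brush"), Event(120, "open_dialog"), Event(123, "close_dialog")]
print(segment_by_events(300, events))   # [(0, 40), (40, 120), (120, 300)]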
Below, we also show PsTuts data statistics, including the distributions of selected tools (left, with the top 18 tools shown), video clip lengths (middle), and word frequencies (right). The high-frequency words are software-specific.
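For reference, a small script along the following lines could compute such statistics from the released annotations. The file name pstuts_annotations.json and the per-clip fields (tool, start, end, caption) are assumptions about the annotation format, not the actual schema.

# Hypothetical sketch: computing tool distribution, clip lengths and word
# frequencies from the annotations. File and field names are assumed.
import json
from collections import Counter

with open("pstuts_annotations.json") as f:      # assumed file name
    clips = json.load(f)                        # assumed: a list of per-clip records

tool_counts = Counter(c["tool"] for c in clips)           # tool distribution
clip_lengths = [c["end"] - c["start"] for c in clips]     # clip lengths (seconds)
word_counts = Counter(w.lower() for c in clips
                      for w in c["caption"].split())      # word frequencies

print(tool_counts.most_common(18))                        # top-18 tools, as in the left plot
print(sum(clip_lengths) / len(clip_lengths))              # mean clip length
print(word_counts.most_common(20))                        # most frequent (software-specific) words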
The general structure of our visual cue reasoning (VCR) method for text-to-video retrieval and tutorial video captioning is shown below. The tutorial encoding is generated by considering correlations between different visual cues as well as across video frames.
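To make the idea concrete, here is a simplified PyTorch sketch of a cue-aware clip encoder: frame features and visual-cue embeddings are projected into a common space, their correlations are modeled with self-attention, and the result is pooled into a single tutorial encoding that can be matched against an encoded text query. The dimensions and module choices are illustrative and do not reproduce the exact model in the paper.

# Simplified sketch of the VCR-style encoding idea: attend over visual-cue
# tokens and frame tokens jointly, then pool into one tutorial embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VCREncoder(nn.Module):
    def __init__(self, frame_dim=2048, cue_dim=300, embed_dim=512, num_heads=4):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, embed_dim)   # project frame CNN features
        self.cue_proj = nn.Linear(cue_dim, embed_dim)       # project visual-cue embeddings
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, frame_feats, cue_feats):
        # frame_feats: (B, T, frame_dim), cue_feats: (B, K, cue_dim)
        tokens = torch.cat([self.frame_proj(frame_feats),
                            self.cue_proj(cue_feats)], dim=1)   # (B, T+K, D)
        fused, _ = self.attn(tokens, tokens, tokens)            # cue/frame correlations
        clip_emb = fused.mean(dim=1)                            # pooled tutorial encoding
        return F.normalize(clip_emb, dim=-1)                    # unit norm for retrieval

# Retrieval scores are then cosine similarities against an encoded text query.
encoder = VCREncoder()
clip_emb = encoder(torch.randn(2, 16, 2048), torch.randn(2, 5, 300))
query_emb = F.normalize(torch.randn(2, 512), dim=-1)            # stand-in text encoder output
print((clip_emb * query_emb).sum(dim=-1))                       # similarity per video-query pair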
[PDF]
GitHub link: https://github.com/KunpengLi1994/PsTuts
Source data: https://drive.google.com/drive/folders/1osWW6dnsnvlWNseOtivIdhdpVct1r38x?usp=sharing
@inproceedings{li2020pstuts,
author = {Li, Kunpeng and Fang, Chen and Wang, Zhaowen and Kim, Seokhwan and Jin, Hailin and Fu, Yun},
title = {Screencast Tutorial Video Understanding},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2020}
}
kunpengli@ece.neu.edu