Video Skimming with Audio-Visual Reconstruction from
Bag of Words in a Hierarchical Structure
Digital video is an emerging force in today's computer and telecommunication industries. Many companies, universities, and even ordinary families already hold large repositories of video in both analog and digital formats, such as broadcast news, training and education videos, advertisements and commercials, sports, movies, surveillance footage, and home videos. All of these trends point to a promising future for digital video.
Video consists of a collection of frames, each of which is a picture image. When a video is played, the frames are displayed sequentially at a certain frame rate. Regardless of the video format used, this amounts to a huge volume of data, and it is inefficient to process a video using every frame it contains. To address this problem, the video is divided into segments, and the more important and interesting segments are selected to form a shorter representation: a video abstraction.
There are two types of video abstraction: video summary and video skimming. A video summary, also called a still abstract, is a set of salient images (key frames) selected or reconstructed from the original video sequence. Video skimming, also called a moving abstract, is a collection of image sequences, along with the corresponding audio, taken from the original video sequence.
A video summary can be built much faster, since generally only visual information is utilized and no audio or textual processing is needed. More salient images, such as mosaics, can be generated to better represent the underlying video content than directly sampled video frames. In addition, the temporally ordered representative frames can be laid out spatially, so that users can grasp the video content more quickly.
Video skimming, or replaying, is also called a preview of the original video and can be classified into two sub-types: highlight and summary sequence. A highlight contains the most interesting and attractive parts of a video, while a summary sequence conveys the impression of the content of the entire video. Among all types of video abstraction, the summary sequence carries the highest semantic meaning of the content of the original video.
Video skimming also has its advantages. Compared with a still-image abstract, it makes much more sense to retain the original audio, since the audio track sometimes carries important information, as in education and training videos.
We propose a new video summarization approach: it extracts visual and audio features at the frame level, performs unsupervised learning of shot concept patterns and scene structures, and finally solves a global optimization problem that selects interesting shots so as to preserve two aspects: video highlights and information coverage.
We first extract visual, motion, and audio features from the sampled frames in each shot: (1) SIFT (Scale-Invariant Feature Transform) features over each whole video frame; (2) motion vectors of moving objects; (3) a Matching Pursuit decomposition of overlapping short-term audio segments in each shot.
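The Matching Pursuit step in (3) can be sketched as a greedy decomposition of each audio segment against an atom dictionary. The toy dictionary and signal below are illustrative assumptions, not the actual audio dictionary used in our system:

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy Matching Pursuit: approximate `signal` with `n_atoms` atoms
    drawn from `dictionary` (columns are unit-norm atoms)."""
    residual = signal.astype(float).copy()
    coeffs = np.zeros(dictionary.shape[1])
    for _ in range(n_atoms):
        # Pick the atom most correlated with the current residual.
        corr = dictionary.T @ residual
        k = int(np.argmax(np.abs(corr)))
        coeffs[k] += corr[k]
        residual -= corr[k] * dictionary[:, k]
    return coeffs, residual

# Toy example: a 64-sample audio segment and a random unit-norm dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)
x = 2.0 * D[:, 5] - 1.5 * D[:, 40]   # signal built from two known atoms
coeffs, residual = matching_pursuit(x, D, n_atoms=10)
```

The resulting sparse coefficient vector, rather than the raw waveform, is what feeds the aural Bag-of-Words descriptor in the next step.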
Next, we analyze the high-level concepts and structures of the original video. Video shots with similar content are grouped into shot concept patterns as follows: we extract Bag-of-Words (BoW) descriptors (a SIFT-based visual BoW descriptor, a local-motion BoW descriptor, and a Matching-Pursuit-based aural BoW descriptor) for each shot from the visual, motion, and audio features extracted in the previous step, and then cluster each of the three types of BoW descriptors into several groups by spectral clustering. Each concept pattern (cluster) represents a set of video shots with similar visual, motion, or aural content. Moreover, a number of interrelated shots unified by location or dramatic incident constitute a video scene. We can associate each shot with its semantic label -- its visual concept pattern -- and then identify the label subsequences that are of minimal length and contain recurring labels in a scene transition graph (STG).
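A per-shot BoW descriptor can be sketched as hard vector quantization of the shot's local features against a codebook, followed by histogram normalization. The codebook size, feature dimension, and synthetic features below are illustrative assumptions, not the settings used in our experiments:

```python
import numpy as np

def bow_descriptor(features, codebook):
    """Quantize local features (n_features x dim) against a codebook
    (n_words x dim) and return an L1-normalized word histogram."""
    # Squared distance from every feature to every codeword.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                      # hard assignment
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                       # one histogram per shot

# Toy example: 16-word codebook of 8-D features; the shot's 50 features
# all lie near codeword 3, so the histogram should peak there.
rng = np.random.default_rng(1)
codebook = rng.standard_normal((16, 8))
shot_feats = codebook[3] + 0.01 * rng.standard_normal((50, 8))
h = bow_descriptor(shot_feats, codebook)
```

Stacking one such histogram per shot (for each of the three modalities) yields the matrices that spectral clustering then partitions into concept patterns.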
Finally, we summarize the original video from the viewpoint of shot-sequence reassembly. We generate a condensed video excerpt of the desired skimming length by concatenating a group of shots that not only achieves the maximum attainable saliency accumulation but also spans the entire video and is distributed uniformly over it. The former criterion preserves the video highlights, such as interesting video scenes and shot concept patterns, while the latter provides good information coverage of the whole video. To meet both criteria, we formulate the shot selection problem as a global optimization and solve it with an efficient Dynamic Programming (DP) algorithm.
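The core of the DP-based shot selection can be sketched as a 0/1-knapsack recurrence: pick a subset of shots whose total length fits the skimming budget while maximizing accumulated saliency. This sketch keeps only the saliency term for clarity and omits the uniform-coverage constraint of the full objective; the saliency scores and shot lengths below are illustrative:

```python
def select_shots(saliency, lengths, budget):
    """0/1-knapsack DP: best[l] holds (max saliency, chosen shot indices)
    achievable with total shot length at most l frames."""
    best = [(0.0, [])] * (budget + 1)
    for i in range(len(saliency)):
        new_best = best[:]                         # shots used at most once
        for l in range(lengths[i], budget + 1):
            cand = best[l - lengths[i]][0] + saliency[i]
            if cand > new_best[l][0]:
                new_best[l] = (cand, best[l - lengths[i]][1] + [i])
        best = new_best
    return best[budget]

# Four shots with (saliency, length-in-frames) and a 90-frame budget.
score, shots = select_shots([3.0, 1.0, 4.0, 2.0], [40, 20, 50, 30], budget=90)
# Shots 0 and 2 (total length 90) give the best score, 7.0.
```

The DP runs in O(n * budget) time, which is what makes a globally optimal selection tractable for full-length videos.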
Original video: a 5-minute excerpt from the movie "Cold Mountain". (Note: for fair comparison, the same source video is used as in the latest saliency-based method.)
skimming ratio = 0.1: replayed at the frame level; replayed at the shot-concept level.
skimming ratio = 0.2: replayed at the frame level; replayed at the shot-concept level.
skimming ratio = 0.3: replayed at the frame level; replayed at the shot-concept level.
Original video: a 4-minute excerpt from the cartoon "Big Buck Bunny".
skimming ratio = 0.1: replayed at the frame level; replayed at the shot-concept level.
skimming ratio = 0.2: replayed at the frame level; replayed at the shot-concept level.
skimming ratio = 0.3: replayed at the frame level; replayed at the shot-concept level.
1. Y. Huang and H. Yu, "A Survey of Video Editing: Retargeting, Replaying, Repainting, and Reusing (R^4)," Technical Report, Huawei Technologies, Bridgewater, NJ, May 2009.
2. J. Gao, Y. Huang, and H. Yu, "Video Summarization: A Dynamic Programming-based Global Optimization Approach with Aural and Spatial-temporal Visual Features," Huawei Technologies (USA), US patent pending, 2010.
3. T. Lu, Z. Yuan, Y. Huang, D. Wu, and H. Yu, "Video Skimming by the Perspective of Hierarchical Audio-Visual Reconstruction with Saliency-Masked Bag-of-Words Features," Huawei Technologies (USA), US patent pending, 2010.