Video and Action Classification Problems

last update: 6 Oct, 2016

Video Classification

The video classification problem is very similar to the image classification problem, except that the input is a video. One trivial approach is to treat the video frames as independent images, apply an image classifier to every frame, and then aggregate the per-frame classification scores into a final classification score for the video. However, this drops all the temporal information we have. The dynamics of a scene hold much information that can help improve accuracy.
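To make the trivial approach concrete, here is a minimal sketch of the aggregation step. It assumes some image classifier has already produced one score vector per frame; the function names and the choice of a plain average are illustrative assumptions, not something taken from a specific paper.

```python
from typing import List, Sequence


def aggregate_frame_scores(frame_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame class scores into one video-level score vector."""
    if not frame_scores:
        raise ValueError("need at least one frame")
    num_classes = len(frame_scores[0])
    totals = [0.0] * num_classes
    for scores in frame_scores:
        if len(scores) != num_classes:
            raise ValueError("all frames must have the same number of classes")
        for c, s in enumerate(scores):
            totals[c] += s
    return [t / len(frame_scores) for t in totals]


def predict_video_class(frame_scores: Sequence[Sequence[float]]) -> int:
    """Return the index of the class with the highest averaged score."""
    video_scores = aggregate_frame_scores(frame_scores)
    return max(range(len(video_scores)), key=video_scores.__getitem__)


# Hypothetical per-frame scores for a 2-frame video and 2 classes.
scores = [[0.5, 0.5], [0.25, 0.75]]
label = predict_video_class(scores)
# Reversing the frame order gives the same answer: averaging ignores time.
same_label = predict_video_class(list(reversed(scores)))
```

Note that shuffling the frames does not change the result, which is exactly the weakness mentioned above: the temporal order carries no weight in this scheme.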

As in any video processing task, computational cost is a concern: "the task is much more computationally demanding even for processing short video clips since each video might contain hundreds to thousands of frames, not all of which are useful" [1].

Some early, basic attempts incorporated the temporal dimension of the data [2]; however, the state of the art nowadays comes from Long Short-Term Memory (LSTM) networks, as in [1] and [3].

Many small-scale datasets are available. A recent large-scale one is the Sports-1M dataset. However, due to its extreme size, almost no one works on it. Only titans such as Google can work smoothly with such datasets, not typical research labs. Google has announced a much larger dataset, YouTube-8M. Same conclusion :(. One helpful step they took is computing feature vectors for the dataset in advance, which should encourage smaller groups to use it.

Action Recognition

A more popular problem is the action classification (recognition) problem. Action classification is a video classification problem in which the video contains some actions (e.g. a man jumps, a lady swims). The main stream of work is on human action recognition. However, other variants exist, such as recognizing the sub-actions performed by a lady in the kitchen.

In the era of deep learning, much work has focused on this problem, and the competition is strong. Examples include the two-stream approach [5], visual attention models [4], and contextual action recognition [6].
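As a rough illustration of the two-stream idea [5], one stream looks at appearance (RGB frames) and the other at motion (optical flow), and their class scores are combined at the end. The sketch below shows only that late-fusion step as a weighted average of softmax scores. The function names, the `temporal_weight` parameter, and the example logits are assumptions made for illustration; the networks themselves are not shown.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_streams(
    spatial_logits: Sequence[float],
    temporal_logits: Sequence[float],
    temporal_weight: float = 0.5,  # hypothetical knob; 0.5 is a plain average
) -> List[float]:
    """Late fusion: weighted average of the two streams' softmax scores."""
    if len(spatial_logits) != len(temporal_logits):
        raise ValueError("both streams must score the same classes")
    if not 0.0 <= temporal_weight <= 1.0:
        raise ValueError("temporal_weight must be in [0, 1]")
    spatial = softmax(spatial_logits)
    temporal = softmax(temporal_logits)
    w = temporal_weight
    return [(1.0 - w) * s + w * t for s, t in zip(spatial, temporal)]


# Made-up logits: appearance favors class 0, motion strongly favors class 1.
fused = fuse_two_streams([2.0, 0.0], [0.0, 3.0])
predicted = max(range(len(fused)), key=fused.__getitem__)
```

With equal weights, the more confident motion stream wins in this toy example. In practice the weight (or a learned classifier on top of the stacked scores) would be chosen on validation data.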

In the era of hand-crafted feature vectors, improved dense trajectories [7] is a remarkable work. It still has performance comparable to the deep learning approaches.

Action Localization

Action localization focuses on detecting where the actions occur inside the videos. There is little work in this area, and most of it takes very similar approaches. This is a good research area to contribute to. Many works rely on running image object detection on each frame and then linking the detections across frames.
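The "detect per frame, then link" idea can be sketched as follows. This is a deliberately simplified greedy linker based only on box overlap (IoU); it is not the linking procedure of any of the cited papers, which typically also use detection scores and a global optimization. The box format, the `min_iou` threshold, and the function names are assumptions for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Tube = List[Tuple[int, Box]]             # list of (frame index, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_detections(frames: List[List[Box]], min_iou: float = 0.5) -> List[Tube]:
    """Greedily link per-frame boxes into tubes of overlapping boxes."""
    finished: List[Tube] = []
    active: List[Tube] = []  # tubes whose last box is in the previous frame
    for t, boxes in enumerate(frames):
        unused = list(boxes)
        still_active: List[Tube] = []
        for tube in active:
            last_box = tube[-1][1]
            best = max(unused, key=lambda b: iou(last_box, b), default=None)
            if best is not None and iou(last_box, best) >= min_iou:
                unused.remove(best)
                tube.append((t, best))
                still_active.append(tube)
            else:
                finished.append(tube)  # no good match: the tube ends here
        # Every unmatched detection starts a new tube.
        still_active.extend([(t, b)] for b in unused)
        active = still_active
    return finished + active


# Made-up detections: one box drifting over three frames, plus one stray box.
detections = [
    [(0.0, 0.0, 10.0, 10.0)],
    [(1.0, 1.0, 11.0, 11.0), (50.0, 50.0, 60.0, 60.0)],
    [(2.0, 2.0, 12.0, 12.0)],
]
tubes = link_detections(detections)
longest = max(tubes, key=len)
```

Here the drifting box forms one tube spanning all three frames, while the stray detection becomes a one-frame tube that would normally be filtered out by length or score.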

One nice work to read is Action Tubes [8]. Other works: [9], [10], [11], [12].


[1] Beyond Short Snippets: Deep Networks for Video Classification, CVPR 2015

[2] Large-scale Video Classification with Convolutional Neural Networks, CVPR 2014

[3] Long-term Recurrent Convolutional Networks (LRCN), CVPR 2015

[4] Action Recognition using Visual Attention, 2015

[5] Two-Stream Convolutional Networks for Action Recognition in Videos, NIPS 2014

[6] Contextual Action Recognition with R*CNN, ICCV 2015

[7] Action Recognition with Improved Trajectories, ICCV 2013

[8] Finding Action Tubes, CVPR 2015

[9] Fast Action Proposals for Human Action Detection and Search, CVPR 2015

[10] Multi-region Two-stream R-CNN for Action Detection, ECCV 2016

[11] Learning to Track for Spatio-Temporal Action Localization

[12] End-to-end Learning of Action Detection from Frame Glimpses in Videos