Introduction
Human action recognition is valuable for numerous practical applications, e.g., gaming, video surveillance, and video search. In this article, we hypothesize that action classification can be boosted by designing a smart feature pooling strategy under the prevalently used bag-of-words representation. Founded on automatic video saliency analysis, we propose the Spatial-Temporal Attention-aware Pooling (STAP) scheme for feature pooling. First, video saliencies are predicted using a video saliency model, and the localized spatial-temporal features are pooled at different saliency levels to form video-saliency-guided channels. Saliency-aware matching kernels are then derived as the similarity measurement over these channels. Intuitively, the proposed kernels calculate the similarities of the video foreground (salient areas) or background (non-salient areas) at different levels. Finally, the kernels are fed into support vector machines for action classification.
Figure 1. Illustration of the spatial-temporal attention-aware feature pooling for action recognition. The figure shows that our approach is superior to spatial pyramid matching due to the implicit background/foreground matching. The local features are pooled according to (b) traditional SPM pooling with 2×2×2 channels in the spatial-temporal domain and (c) the proposed saliency-aware feature pooling with video-saliency-guided channels.
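As a minimal Python sketch of the pooling and kernel computation described above, assuming quantized local features and per-feature saliency values; the saliency bin edges and the histogram-intersection base kernel are illustrative assumptions, not the exact formulation in the paper:

import numpy as np

def saliency_pooled_histograms(codes, saliency, num_words, edges=(0.0, 0.5, 1.0)):
    # codes: (N,) visual-word index of each local feature
    # saliency: (N,) saliency value in [0, 1] at each feature location
    # edges: bin edges splitting features into saliency-guided channels
    channels = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (saliency >= lo) & ((saliency < hi) | (hi >= 1.0))
        hist = np.bincount(codes[mask], minlength=num_words).astype(float)
        channels.append(hist / max(hist.sum(), 1.0))  # L1-normalize each channel
    return channels  # e.g. [background channel, foreground channel]

def saliency_aware_kernel(channels_a, channels_b):
    # Sum of per-channel histogram-intersection similarities.
    return sum(np.minimum(a, b).sum() for a, b in zip(channels_a, channels_b))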
Paper [Download]
Dense sampling executable [Download]
We use a dense sampling method that extracts HOG, HOF, and MBH features from the input video.
The executable has been tested on Windows 7.
Usage:
DenseSampling [Input Video Filename] > [Output Descriptor Filename]
Example:
DenseSampling .\samples\Diving_Side_001.vob > .\descriptors\Diving_Side_001.vob.txt
Arguments:
- S10: Sampling time = 10 (default: 10)
- L15: Temporal length = 15 (default: 15)
- P32: Patch size = 32 (default: 32)
The dense sampling features are computed one by one, each written on a single line in the following format (a parsing sketch in Python follows the field list below):
frameNumber x y length scale HOG HOF MBHx MBHy
The first 5 elements describe the sampling:
frameNumber: The frame on which the sampling ends
x: The x coordinate of the sampling point
y: The y coordinate of the sampling point
length: The temporal length
scale: The scale at which the descriptor is computed
The remaining elements are the four descriptors, concatenated in order:
HOG: 96 dimensions
HOF: 108 dimensions
MBHx: 96 dimensions
MBHy: 96 dimensions
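To make the layout concrete, here is a small parsing sketch for one output line, assuming the field order and dimensions listed above:

def parse_descriptor_line(line):
    # 5 header fields followed by 96 + 108 + 96 + 96 descriptor values.
    values = line.split()
    frame_number, x, y, length, scale = (float(v) for v in values[:5])
    desc = [float(v) for v in values[5:]]
    assert len(desc) == 96 + 108 + 96 + 96, "unexpected descriptor length"
    hog, hof = desc[0:96], desc[96:204]
    mbhx, mbhy = desc[204:300], desc[300:396]
    return frame_number, x, y, length, scale, hog, hof, mbhx, mbhy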
Convert raw dense sampling descriptors to BOW using VLFEAT [Download]
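The linked code uses VLFEAT (MATLAB); purely as an illustration, an equivalent quantization step in Python with scikit-learn (a substitute library, and the vocabulary size of 4000 is an assumption, not necessarily the paper's setting) would look roughly like this:

import numpy as np
from sklearn.cluster import KMeans

def build_bow(descriptors_per_video, num_words=4000):
    # Cluster all raw descriptors into a codebook, then histogram each video.
    codebook = KMeans(n_clusters=num_words, n_init=1).fit(np.vstack(descriptors_per_video))
    bows = []
    for desc in descriptors_per_video:
        hist = np.bincount(codebook.predict(desc), minlength=num_words).astype(float)
        bows.append(hist / max(hist.sum(), 1.0))  # L1-normalize each video histogram
    return np.array(bows), codebook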
Saliency extraction
Itti and Koch’s saliency prediction model (3 maps) [Download]
AIM [Download]
ICL [Download]
SIM [Download]
FT [Download]
LSK [Download]
SR [Download]
GBVS [Download]
Signature-LAB [Download]
SUN [Download]
Cerf et al. [Download]
Human central bias map [Download]
Motion map [use the OpenCV function calcOpticalFlowFarneback; see the sketch below]
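A minimal OpenCV (Python) sketch for the motion map; the Farnebäck parameters shown are common defaults, and taking the flow magnitude as the motion map is our reading of this step:

import cv2
import numpy as np

def motion_map(prev_frame, next_frame):
    # Dense Farneback optical flow between two consecutive frames.
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)  # flow has shape (H, W, 2): dx, dy
    return magnitude / max(magnitude.max(), 1e-6)  # normalize to [0, 1]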
Kernel SVM with MATLAB code [Download]
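The released classifier code is MATLAB; for illustration only, the same precomputed-kernel setup in Python with scikit-learn (a substitute, not the released code, with histogram intersection as the example kernel) is:

import numpy as np
from sklearn.svm import SVC

def train_kernel_svm(features, labels, C=1.0):
    # features: list of pooled BOW vectors; Gram matrix via histogram intersection.
    n = len(features)
    gram = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            gram[i, j] = gram[j, i] = np.minimum(features[i], features[j]).sum()
    clf = SVC(kernel='precomputed', C=C)
    clf.fit(gram, labels)
    return clf

At test time, pass the kernel matrix between test and training samples (shape num_test x num_train) to clf.predict.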
If you use our work, please cite the following paper:
Tam V. Nguyen, Zheng Song, Shuicheng Yan. “STAP: Spatial-Temporal Attention-aware Pooling for Action Recognition”, IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT), 2014.
Contact or password request:
Please drop an email to: vantam@gmail.com