Convolutional Tensor-Train LSTM for Spatio-Temporal Learning

Jiahao Su* (UMD), Wonmin Byeon* (NVIDIA), Jean Kossaifi (NVIDIA), Furong Huang (UMD), Jan Kautz (NVIDIA), Animashree Anandkumar (NVIDIA)

(*) Equal contribution, NeurIPS 2020

Links: [paper] [code] [slides]

Correspondence to:
Jiahao Su <jiahaosu@terpmail.umd.edu>, Wonmin Byeon <wbyeon@nvidia.com>


'Mixed Precision Training for Convolutional Tensor-Train LSTM' presented at the ECCV 2020 Tutorial [program page]

Abstract

Learning from spatio-temporal data has numerous applications such as human-behavior analysis, object tracking, video compression, and physics simulation. However, existing methods still perform poorly on challenging video tasks such as long-term forecasting. This is because these challenging tasks require learning long-term spatio-temporal correlations in the video sequence. In this paper, we propose a higher-order convolutional LSTM model that can efficiently learn these correlations, along with a succinct representation of the history. This is accomplished through a novel tensor-train module that performs prediction by combining convolutional features across time. To make this feasible in terms of computation and memory requirements, we propose a novel convolutional tensor-train decomposition of the higher-order model. This decomposition reduces the model complexity by jointly approximating a sequence of convolutional kernels as a low-rank tensor-train factorization. As a result, our model outperforms existing approaches, including the baseline models, while using only a fraction of their parameters. Our results achieve state-of-the-art performance in a wide range of applications and datasets, including multi-step video prediction on the Moving-MNIST-2 and KTH Action datasets as well as early activity recognition on the Something-Something V2 dataset.

Convolutional Tensor-Train LSTM

The preprocessing module first groups the previous hidden states into overlapping sets with a sliding window, and reduces the number of channels in each group using a convolutional layer. The convolutional tensor-train module takes the results, aggregates their spatio-temporal information, and computes the gates for the LSTM update.
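Below is a minimal PyTorch sketch of this design, assuming a sliding-window size of 3 and a tensor-train order of 3. The class and layer names (e.g. ConvTTGates), the rank, and the exact way the factors are chained are illustrative and may differ from the released implementation.

```python
import torch
import torch.nn as nn

class ConvTTGates(nn.Module):
    """Sketch of the sliding-window preprocessing and the convolutional
    tensor-train module that produces the LSTM gate pre-activations."""

    def __init__(self, channels, order=3, steps=3, ranks=8, kernel_size=5):
        super().__init__()
        self.order, self.steps = order, steps
        padding = kernel_size // 2
        # Preprocessing: one conv per window reduces the channels of each
        # group of `steps` stacked hidden states down to the TT rank.
        self.preprocess = nn.ModuleList(
            nn.Conv2d(steps * channels, ranks, kernel_size, padding=padding)
            for _ in range(order))
        # Tensor-train factors: a chain of small convolutions; the last one
        # outputs the 4 * channels pre-activations for the LSTM gates.
        self.factors = nn.ModuleList(
            [nn.Conv2d(ranks, ranks, kernel_size, padding=padding)
             for _ in range(order - 1)]
            + [nn.Conv2d(ranks, 4 * channels, kernel_size, padding=padding)])

    def forward(self, hidden_states):
        # hidden_states: list of the `order + steps - 1` most recent hidden
        # states, each of shape (batch, channels, height, width).
        windows = [torch.cat(hidden_states[i:i + self.steps], dim=1)
                   for i in range(self.order)]             # sliding window
        features = [conv(w) for conv, w in zip(self.preprocess, windows)]
        out = features[0]
        for i, factor in enumerate(self.factors):
            out = factor(out)                               # contract one TT factor
            if i + 1 < self.order:
                out = out + features[i + 1]                 # inject the next window
        return out  # split into input / forget / cell / output gates downstream


# Usage: with order=3 and steps=3, the cell consumes the 5 most recent hidden states.
cell = ConvTTGates(channels=32)
states = [torch.randn(2, 32, 16, 16) for _ in range(5)]
gates = cell(states)                                        # (2, 128, 16, 16)
```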

  • Early Activity Recognition Results

The videos show recognition comparisons among 3D-CNN, ConvLSTM, and our Conv-TT-LSTM on the Something-Something V2 dataset when the input frames are only partially seen.
The time axis of each video corresponds to the number of frames seen by the models. The value in parentheses is the confidence for the correct/wrong prediction.

  • Video Prediction Results

We compare SSIM and LPIPS on the Moving-MNIST-2 and KTH Action datasets. The bubble size indicates the model size.

Image quality degradation can be attributed to blurriness (uncertainty about the shape) and/or distortion (mis-prediction of the shape). The SSIM metric reflects local blurriness, while the LPIPS metric tends to capture both overall distortion and blurriness.
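As a reference for how these two metrics can be computed per frame, here is a minimal sketch assuming the scikit-image and lpips packages; the helper evaluate_frame and its preprocessing are illustrative, not the evaluation script used in the paper.

```python
import numpy as np
import torch
import lpips                                    # pip install lpips
from skimage.metrics import structural_similarity

lpips_model = lpips.LPIPS(net='alex')           # AlexNet-based perceptual metric

def evaluate_frame(pred, target):
    """pred, target: (H, W) grayscale arrays with values in [0, 1]."""
    ssim = structural_similarity(pred, target, data_range=1.0)
    # LPIPS expects RGB tensors in [-1, 1] of shape (N, 3, H, W).
    to_tensor = lambda x: torch.from_numpy(x).float().expand(1, 3, *x.shape) * 2 - 1
    with torch.no_grad():
        dist = lpips_model(to_tensor(pred), to_tensor(target)).item()
    return ssim, dist

# Example with random frames as stand-ins for predicted and ground-truth frames.
pred, target = np.random.rand(64, 64), np.random.rand(64, 64)
print(evaluate_frame(pred, target))
```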

The predictions from Conv-TT-LSTM are much sharper than those of other methods, including PredRNN++ and E3D-LSTM. Conv-TT-LSTM still produces slightly distorted outputs, but with more reasonable shapes compared to the ConvLSTM baseline. The visualization results are presented below.

Moving-MNIST-2


30-frame prediction given 10 input frames.

KTH Action


20-frame prediction given 10 input frames.