Robust Human Motion Forecasting using Transformer-based Model

Esteve Valls Mascaró

Shuo Ma

Hyemin Ahn

Dongheui Lee

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2022

Abstract

Comprehending human motion is a fundamental challenge for developing Human-Robot Collaborative applications. Computer vision researchers have addressed this field by focusing only on reducing prediction error, without taking into account the requirements needed to facilitate deployment on robots.

In this paper, we propose a new Transformer-based model that simultaneously handles real-time 3D human motion forecasting in both the short and long term. Our 2-Channel Transformer (2CH-TR) efficiently exploits the spatio-temporal information of a short observed sequence (400 ms) and achieves accuracy competitive with the current state of the art. 2CH-TR stands out for the efficiency of the Transformer, being lighter and faster than its competitors. In addition, our model is tested under severe occlusion of the human motion, demonstrating its robustness in reconstructing and predicting 3D human motion in a highly noisy environment.
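A small illustration of the kind of occlusion test mentioned above: randomly masking joints of the observed sequence before feeding it to the forecaster. The function name and the mask ratio are our own examples, not the paper's exact protocol.

import numpy as np

def occlude(sequence, joint_mask_ratio=0.5, rng=None):
    """Zero out a random subset of joints over the whole observed sequence.

    sequence: (frames, joints, 3) array of 3D joint positions.
    """
    rng = rng or np.random.default_rng(0)
    n_joints = sequence.shape[1]
    dropped = rng.random(n_joints) < joint_mask_ratio  # joints to occlude
    masked = sequence.copy()
    masked[:, dropped, :] = 0.0                        # occluded joints carry no information
    return masked, dropped

observed = np.random.randn(10, 32, 3)                  # e.g. a 400 ms prefix at 25 fps
noisy_prefix, dropped = occlude(observed)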

Our experimental results show that the proposed 2CH-TR outperforms the ST-Transformer, another state-of-the-art Transformer-based model, in terms of reconstruction and prediction under the same input-prefix conditions. On the Human3.6M dataset with a 400 ms input prefix, our model reduces the mean squared error of the ST-Transformer by 8.89% in short-term prediction and by 2.57% in long-term prediction.
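For reference, the mean squared error behind these comparisons can be computed as in the sketch below (numpy; the paper's exact aggregation protocol may average differently):

import numpy as np

def mse(pred, gt):
    """MSE over all frames, joints, and 3D coordinates.

    pred, gt: (frames, joints, 3) arrays of 3D joint positions.
    """
    return float(np.mean((pred - gt) ** 2))

def relative_reduction(ours, baseline):
    """Percentage reduction over a baseline, as in the 8.89% and 2.57% figures."""
    return 100.0 * (baseline - ours) / baseline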

How does our 2CH-TR work?

Architecture of the 2-Channel Transformer (2CH-TR). The observed skeleton motion sequence X is projected independently for each channel into an embedding space (ES and ET), and positional encoding is then injected. Each embedding is fed into L stacked attention layers that extract dependencies across the sequence using multi-head attention. Finally, each embedding (ÊS and ÊT) is decoded and projected back to skeleton sequences. The future poses (pred) are then obtained by summing the output of each channel (S and T) with the residual connection X from input to output.
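A minimal PyTorch sketch of this two-channel design. The learned positional encoding, layer sizes, joint and frame counts, and how the output window maps to future frames are our illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn

class Channel(nn.Module):
    """One channel: embed tokens, inject positional encoding, run L stacked
    multi-head attention layers, and decode back to the skeleton space."""
    def __init__(self, token_dim, d_model=128, n_heads=8, n_layers=4, max_tokens=512):
        super().__init__()
        self.embed = nn.Linear(token_dim, d_model)                    # projection into the embedding space
        self.pos = nn.Parameter(torch.zeros(1, max_tokens, d_model))  # learned positional encoding (assumption)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, n_layers)          # L stacked attention layers
        self.decode = nn.Linear(d_model, token_dim)                   # projection back to skeleton sequences

    def forward(self, tokens):                                        # tokens: (B, N, token_dim)
        e = self.embed(tokens) + self.pos[:, :tokens.size(1)]
        return self.decode(self.layers(e))

class TwoChannelTR(nn.Module):
    def __init__(self, n_joints=32, n_frames=10):
        super().__init__()
        self.temporal = Channel(token_dim=n_joints * 3)  # one token per frame: attention over time
        self.spatial = Channel(token_dim=n_frames * 3)   # one token per joint: attention over the body

    def forward(self, x):                                # x: (B, T, J*3) observed sequence
        B, T, D = x.shape
        J = D // 3
        t_out = self.temporal(x)                         # temporal channel output (T in the figure)
        joints = x.reshape(B, T, J, 3).permute(0, 2, 1, 3).reshape(B, J, T * 3)
        s_out = self.spatial(joints)                     # spatial channel output (S in the figure)
        s_out = s_out.reshape(B, J, T, 3).permute(0, 2, 1, 3).reshape(B, T, D)
        return x + t_out + s_out                         # residual connection from input to output

model = TwoChannelTR(n_joints=32, n_frames=10)           # 400 ms prefix at 25 fps = 10 frames (example)
pred = model(torch.randn(2, 10, 96))                     # (batch, frames, joints*3)

Each channel sees the same sequence from a different axis: the temporal channel treats each frame as a token, while the spatial channel treats each joint trajectory as a token, so attention captures dependencies over time and over the body, respectively.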

Qualitative results

Qualitative results on the Human3.6M dataset.

We estimate the human pose with FrankMocap and use it as the observed prefix (in blue) for our 2CH-TR model, which predicts the future 3D human motion (gradient of green) in a video of a human walking in the wild.
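A sketch of this in-the-wild pipeline. Here estimate_3d_pose is a hypothetical placeholder standing in for the FrankMocap output (we do not reproduce its real API), and model is a trained 2CH-TR forecaster:

import numpy as np

def estimate_3d_pose(frame):
    """Hypothetical placeholder: return a (joints, 3) 3D pose for one video
    frame, as a monocular estimator such as FrankMocap would."""
    raise NotImplementedError

def forecast_from_video(frames, model, prefix_len=10):
    """Build the observed prefix (blue) from the last frames of a video and
    forecast the future motion (green) with the trained model."""
    prefix = np.stack([estimate_3d_pose(f) for f in frames[-prefix_len:]])
    return model(prefix)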

Publication

Robust Human Motion Forecasting using Transformer-based Model

Esteve Valls Mascaró, Shuo Ma, Hyemin Ahn, Dongheui Lee, in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022

@INPROCEEDINGS{9981877,
  author={Mascaro, Esteve Valls and Ma, Shuo and Ahn, Hyemin and Lee, Dongheui},
  booktitle={2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  title={Robust Human Motion Forecasting using Transformer-based Model},
  year={2022},
  pages={10674-10680},
  doi={10.1109/IROS47612.2022.9981877}
}