Intention-Conditioned Long-Term Human Egocentric Action Forecasting

Esteve Valls Mascaró

Hyemin Ahn

Dongheui Lee

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2023

Winner of the Ego4D Long-Term Anticipation Challenge at CVPR 2022 and ECCV 2022


Abstract

To anticipate how a person will act in the future, it is essential to understand the human intention, since it guides the subject towards a certain action. In this paper, we propose a hierarchical architecture which assumes that a sequence of human actions (low-level) is driven by the human intention (high-level). Based on this, we address the long-term action anticipation task in egocentric videos.

Our framework first extracts this low- and high-level human information from the observed human actions in a video through a Hierarchical Multi-task Multi-Layer Perceptrons Mixer (H3M). Then, we constrain the uncertainty of the future through an Intention-Conditioned Variational Auto-Encoder (I-CVAE) that generates multiple stable predictions of the next actions that the observed human might perform. By leveraging human intention as high-level information, our model anticipates more time-consistent actions in the long term, improving over the baseline on the Ego4D dataset.

This work achieves state-of-the-art results on the Long-Term Anticipation (LTA) task in Ego4D by providing more plausible anticipated sequences and improving the anticipation scores for nouns and actions. Our work ranked first in both the CVPR@2022 and ECCV@2022 Ego4D LTA Challenges.

How does our framework work?

Overall proposed framework. Pre-extracted features for the N observed video clips are fed to our Hierarchical Multi-task MLP Mixer model (H3M) to obtain low-level action labels and the high-level intention. These results are fed into our Intention-Conditioned Variational Auto-Encoder (I-CVAE), which anticipates the subsequent Z actions.
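The H3M stage described above can be sketched as an MLP-Mixer backbone with multi-task heads. The snippet below is a minimal, hypothetical sketch: the layer counts, dimensions (e.g. `d_in=2304` clip features), vocabulary sizes, and the mean-pooled intention head are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    # One MLP-Mixer block: token-mixing across the N observed clips,
    # then channel-mixing across feature dimensions.
    def __init__(self, n_tokens, d, hidden=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.token_mlp = nn.Sequential(
            nn.Linear(n_tokens, hidden), nn.GELU(), nn.Linear(hidden, n_tokens))
        self.norm2 = nn.LayerNorm(d)
        self.channel_mlp = nn.Sequential(
            nn.Linear(d, hidden), nn.GELU(), nn.Linear(hidden, d))

    def forward(self, x):                        # x: (B, N, d)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x

class H3M(nn.Module):
    # Hierarchical multi-task heads: per-clip verb/noun logits (low-level)
    # and one intention logit vector pooled over the sequence (high-level).
    # All sizes here are illustrative placeholders.
    def __init__(self, n_clips=8, d_in=2304, d=512,
                 n_verbs=117, n_nouns=478, n_intentions=10, depth=4):
        super().__init__()
        self.proj = nn.Linear(d_in, d)
        self.blocks = nn.Sequential(*[MixerBlock(n_clips, d) for _ in range(depth)])
        self.verb_head = nn.Linear(d, n_verbs)
        self.noun_head = nn.Linear(d, n_nouns)
        self.intention_head = nn.Linear(d, n_intentions)

    def forward(self, feats):                    # feats: (B, N, d_in)
        h = self.blocks(self.proj(feats))
        return (self.verb_head(h),               # (B, N, n_verbs)
                self.noun_head(h),               # (B, N, n_nouns)
                self.intention_head(h.mean(1)))  # (B, n_intentions)
```

The low-level verb/noun predictions and the high-level intention prediction share the same mixer trunk, which is what makes the multi-task setup hierarchical.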

Detailed structure of the I-CVAE architecture, illustrating the encoder (top) and decoder (bottom) of our Transformer-based CVAE model. Given a sequence of N+Z actions and an intention label, the encoder outputs distribution parameters that encode all sequence information into a latent space. The decoder takes the N observed actions and the intention, samples from the latent space, and outputs the representation of the Z actions to anticipate.
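The encoder/decoder flow in this caption can be sketched as a small Transformer-based CVAE. This is a hedged sketch under stated assumptions: the mean-pooled latent, the zero-initialized decoder queries, and all dimensions and vocabulary sizes are illustrative choices of ours, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ICVAE(nn.Module):
    # Hypothetical sketch of a Transformer-based intention-conditioned CVAE.
    def __init__(self, n_actions=200, n_intentions=10, d=64, nhead=4, z_future=20):
        super().__init__()
        self.z_future = z_future                      # Z actions to anticipate
        self.act_emb = nn.Embedding(n_actions, d)
        self.int_emb = nn.Embedding(n_intentions, d)
        enc = nn.TransformerEncoderLayer(d, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.to_mu = nn.Linear(d, d)
        self.to_logvar = nn.Linear(d, d)
        dec = nn.TransformerDecoderLayer(d, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.head = nn.Linear(d, n_actions)

    def encode(self, actions, intention):
        # Condition every action token on the intention embedding,
        # then pool the N+Z tokens into latent distribution parameters.
        x = self.act_emb(actions) + self.int_emb(intention).unsqueeze(1)
        h = self.encoder(x).mean(dim=1)
        return self.to_mu(h), self.to_logvar(h)

    def decode(self, observed, intention, z):
        # Memory = latent sample + intention-conditioned observed actions;
        # Z query tokens are decoded into future-action logits.
        mem = self.act_emb(observed) + self.int_emb(intention).unsqueeze(1)
        mem = torch.cat([z.unsqueeze(1), mem], dim=1)
        queries = torch.zeros(observed.size(0), self.z_future, mem.size(-1))
        return self.head(self.decoder(queries, mem))  # (B, Z, n_actions)

    def forward(self, observed, future, intention):
        mu, logvar = self.encode(torch.cat([observed, future], dim=1), intention)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.decode(observed, intention, z), mu, logvar
```

At inference time, only the decoder is used: sampling several latent vectors z from the prior yields the multiple stable future-action sequences mentioned in the abstract.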

Publication

Intention-Conditioned Long-Term Human Egocentric Action Forecasting

Esteve Valls Mascaró, Hyemin Ahn, Dongheui Lee
In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023

Winner in Ego4D LTA ECCV@2022
Ranked 1st in the Ego4D Long-Term Anticipation Challenge at the European Conference on Computer Vision (ECCV), 2022

Winner in Ego4D LTA CVPR@2022
Oral presentation in the Ego4D Workshop at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. Ranked 1st in the Ego4D Long-Term Anticipation Challenge at CVPR 2022

@InProceedings{Mascaro_2023_WACV,

    author    = {Mascaro, Esteve Valls and Ahn, Hyemin and Lee, Dongheui},

    title     = {Intention-Conditioned Long-Term Human Egocentric Action Anticipation},

    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},

    month     = {January},

    year      = {2023},

    pages     = {6048-6057}

}