Human-Object Interaction Prediction in Videos through Gaze Following

Zhifan Ni

Esteve Valls Mascaró

Hyemin Ahn

Dongheui Lee

Abstract

Understanding the human-object interactions (HOIs) in a video is essential to fully comprehend a visual scene. This line of research has been addressed by detecting HOIs from images and, more recently, from videos. However, the video-based HOI anticipation task in the third-person view remains understudied. In this paper, we design a framework to detect current HOIs and anticipate future HOIs in videos. We propose to leverage human gaze information, since people often fixate on an object before interacting with it. These gaze features, together with the scene contexts and the visual appearances of human-object pairs, are fused through a spatio-temporal transformer. To evaluate the model on the HOI anticipation task in a multi-person scenario, we propose a set of person-wise multi-label metrics. Our model is trained and validated on the VidHOI dataset, which contains videos capturing daily life and is currently the largest video HOI dataset. Experimental results on the HOI detection task show that our approach improves over the baseline by a large relative margin of 36.3%. Moreover, we conduct an extensive ablation study to demonstrate the effectiveness of our modifications and extensions to the spatio-temporal transformer.

How does our framework work?

We initially detect and track the humans and objects from a sequence of RGB frames. We capture the human interest in the scene through gaze-following networks. Then, we utilize a pre-trained backbone to extract visual and spatial features for each object and human. We also generate semantic features for all objects encountered in the scene using a word embedding model. Finally, we propose a spatio-temporal transformer encoder that first summarizes the context of the scene and then refines the human-object pair representations through cross-attention mechanisms, ultimately predicting the interaction probabilities for each human-object pair.
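The refinement step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the feature dimensions, the single-head attention, and the linear sigmoid classifier head are all placeholder assumptions standing in for the full spatio-temporal transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes (placeholders, not the paper's actual dimensions):
# visual, spatial, semantic (word embedding), and gaze features per pair.
D_VIS, D_SPA, D_SEM, D_GAZE = 8, 4, 6, 4
D_PAIR = 2 * D_VIS + D_SPA + D_SEM + D_GAZE  # human + object visual feats, rest shared

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Single-head scaled dot-product attention: each human-object pair token
    # attends to the summarized scene-context tokens.
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ context

def predict_hoi(pair_feats, context_feats, n_classes=5):
    # Refine pair representations with scene context (residual connection),
    # then score interaction classes with a stand-in linear head.
    refined = pair_feats + cross_attention(pair_feats, context_feats)
    W = rng.standard_normal((refined.shape[-1], n_classes)) * 0.1
    # Sigmoid rather than softmax: HOI prediction is multi-label
    # (a pair can exhibit several interactions at once).
    return 1.0 / (1.0 + np.exp(-(refined @ W)))

# Toy scene: 2 human-object pairs, 3 scene-context tokens of matching width.
pairs = rng.standard_normal((2, D_PAIR))
context = rng.standard_normal((3, D_PAIR))
probs = predict_hoi(pairs, context)  # shape: (n_pairs, n_classes)
```

The sigmoid output reflects the multi-label formulation used in the abstract's evaluation metrics: each human-object pair receives an independent probability per interaction class.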

Qualitative results

Publication

@article{NI2023103741,
  title = {Human–Object Interaction Prediction in Videos through Gaze Following},
  journal = {Computer Vision and Image Understanding},
  volume = {233},
  pages = {103741},
  year = {2023},
  issn = {1077-3142},
  doi = {10.1016/j.cviu.2023.103741},
  url = {https://www.sciencedirect.com/science/article/pii/S1077314223001212},
  author = {Zhifan Ni and Esteve Valls Mascaro and Hyemin Ahn and Dongheui Lee},
}