A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis

Esteve Valls Mascaró

Hyemin Ahn

Dongheui Lee

mib_welcome.mp4

Abstract

The synthesis of human motion has traditionally been addressed through task-dependent models that focus on specific challenges, such as predicting future motions or filling in intermediate poses conditioned on known key-poses. In this paper, we present a novel task-independent model called MASK-M, which can effectively address these challenges using a unified architecture. Our model obtains comparable or better performance than the state-of-the-art in each field. Inspired by Vision Transformers (ViTs), our MASK-M model decomposes a human pose into body parts to leverage the spatio-temporal relationships existing in human motion. Moreover, we reformulate various pose-conditioned motion synthesis tasks as a reconstruction problem with different masking patterns given as input. By explicitly informing our model about the masked joints, our MASK-M becomes more robust to occlusions. Experimental results show that our model successfully forecasts human motion on the Human3.6M dataset. Moreover, it achieves state-of-the-art results in motion inbetweening on the LaFAN1 dataset, particularly in long transition periods.
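The key reformulation in the abstract — casting each pose-conditioned task as reconstruction under a different input mask — can be illustrated with a small sketch. The shapes, split points, and masking ratios below are hypothetical, chosen only to show how one mask tensor can encode forecasting, in-betweening, and completion; they are not taken from the paper.

```python
import numpy as np

def task_mask(task, T=10, J=22):
    """Build a binary mask over a motion of T frames and J joints.

    1 = observed, 0 = masked (to be synthesized). Split points and
    the masking ratio are illustrative assumptions.
    """
    M = np.ones((T, J), dtype=np.int64)
    if task == "forecasting":      # observe the past, mask the future
        M[T // 2:] = 0
    elif task == "inbetweening":   # observe key-poses at both ends
        M[2:-2] = 0
    elif task == "completion":     # mask random joints per frame
        rng = np.random.default_rng(0)
        M[rng.random((T, J)) < 0.3] = 0
    return M

print(task_mask("forecasting", T=6, J=4))
```

The same reconstruction model then consumes the masked motion regardless of which task produced the mask.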

How does our framework work?

Let X be a human motion and M its respective binary mask. We first interpolate the known poses Xg to obtain Xfill, providing consistency to the input. Then, our Pose Decomposition (PD) module deconstructs each human pose pt into a sequence of patches, which we project and flatten into a sequence of tokens E. We add the embedding embmix to E to inform the transformer-based encoder and decoder about the masked tokens and the spatio-temporal structure. Our ViT-based encoder and decoder reconstruct the patch-based sequence of tokens. Our Pose Aggregation (PA) module regroups the decoded tokens into poses using an MLP layer. Finally, each pose is projected back to the joint representation and summed with our reference motion Xref.
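The data flow above can be sketched end to end in numpy. This is a minimal illustration under stated assumptions: joints are assumed to be ordered so they split evenly into body parts, random linear maps stand in for the learned patch projections, the ViT encoder/decoder is replaced by an identity, and the masked-frame fill simply holds the last known pose instead of interpolating. Only the shapes and the order of operations reflect the description.

```python
import numpy as np

rng = np.random.default_rng(0)
T, J, C, D = 8, 20, 3, 16                  # frames, joints, coords, token dim
P, JP = 5, 4                               # toy body-part split: 5 parts x 4 joints

X = rng.normal(size=(T, J, C))             # input motion
M = np.ones(T, dtype=bool)
M[T // 2:] = False                         # mask the future frames (forecasting)

# 1) fill masked frames for input consistency (hold-last stands in for interpolation)
X_fill = X.copy()
X_fill[~M] = X[M.nonzero()[0][-1]]

# 2) Pose Decomposition: split each pose into body-part patches, project to tokens
patches = X_fill.reshape(T, P, JP * C)     # assumes joints grouped by part
W_in = rng.normal(size=(JP * C, D)) * 0.1
E = patches @ W_in                         # tokens, shape (T, P, D)

# 3) add embmix: positional structure plus a mask embedding on masked frames
pos = rng.normal(size=(T, P, D)) * 0.01
mask_emb = rng.normal(size=(D,)) * 0.01
E = E + pos + np.where(M[:, None, None], 0.0, mask_emb)

# 4) ViT-based encoder/decoder reconstructs the token sequence (identity stand-in)
Z = E

# 5) Pose Aggregation: tokens back to joint space, summed with the reference motion
W_out = rng.normal(size=(D, JP * C)) * 0.1
delta = (Z @ W_out).reshape(T, J, C)
X_hat = X_fill + delta

print(X_hat.shape)  # (8, 20, 3)
```

The output has the same shape as the input motion, so the unmasked frames act as anchors while the masked ones are synthesized.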

Qualitative results

Motion Forecasting

Forecasting.mp4

Motion Completion

Completion.mp4

Motion In-Betweening

inbetweening1.mp4
inbetweening2.mp4

Publication

@article{EsteveVallsMaskM,
  title   = {A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis},
  author  = {Esteve Valls Mascaro and Hyemin Ahn and Dongheui Lee},
  journal = {arXiv},
  year    = {2023},
}