Zan Wang1, 2 *, Jingze Zhang2, 3 *, Yixin Chen2, Baoxiong Jia2, Wei Liang1, 4 †, Siyuan Huang2 †
* indicates equal contribution † indicates corresponding authors
1 School of Computer Science & Technology, Beijing Institute of Technology
2 State Key Laboratory of General Artificial Intelligence, BIGAI
3 Department of Automation, Tsinghua University
4 Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing
Despite significant advancements in human motion generation, current motion representations---typically formulated as discrete frame sequences---still face two critical limitations: (i) they fail to capture motion from a multi-scale perspective, limiting their capability to model complex patterns; (ii) they lack compositional flexibility, which is crucial for a model's generalization across diverse generation tasks. To address these challenges, we introduce MSQ, a novel quantization method that compresses the motion sequence into multi-scale discrete tokens across spatial and temporal dimensions. MSQ employs distinct encoders to capture body parts at varying spatial granularities and temporally interpolates the encoded features into multiple scales before quantizing them into discrete tokens. Building on this representation, we establish a generative masked modeling framework to effectively support motion editing, motion control, and conditional motion generation. Through quantitative and qualitative analysis, we show that our quantization method enables the seamless composition of motion tokens without requiring specialized design or re-training. Furthermore, extensive evaluations demonstrate that our approach outperforms existing baseline methods on various benchmarks.
MSQ separately compresses motion features at various spatial and temporal granularities, producing a multi-scale discrete token representation. These dequantized features are merged and passed to a decoder for motion reconstruction.
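To make the pipeline above concrete, here is a minimal sketch of multi-scale temporal quantization in NumPy. This is not the authors' implementation: the function names (`msq_sketch`, `temporal_resample`), the residual scheme across scales, and the additive merging of dequantized features are illustrative assumptions; the paper only states that features are interpolated to multiple temporal scales, quantized against codebooks, dequantized, and merged for decoding.

```python
import numpy as np

def quantize(feats, codebook):
    """Nearest-neighbor vector quantization of (T, D) features against a (K, D) codebook."""
    dists = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(axis=1)           # discrete tokens, shape (T,)
    return idx, codebook[idx]            # tokens and dequantized features

def temporal_resample(feats, length):
    """Linearly interpolate a (T, D) feature sequence to a new temporal length."""
    T, _ = feats.shape
    src = np.linspace(0.0, T - 1, length)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    w = (src - lo)[:, None]
    return (1.0 - w) * feats[lo] + w * feats[hi]

def msq_sketch(feats, codebooks, scales):
    """Quantize encoded features at several temporal scales (coarse to fine),
    merging the dequantized results additively; residual passing between
    scales is an assumption of this sketch, not confirmed by the paper."""
    T = feats.shape[0]
    merged = np.zeros_like(feats)
    tokens = []
    residual = feats.copy()
    for cb, s in zip(codebooks, scales):
        coarse = temporal_resample(residual, s)   # downsample to this scale
        idx, deq = quantize(coarse, cb)
        up = temporal_resample(deq, T)            # upsample back to full length
        merged += up
        residual = residual - up                  # pass the remainder to finer scales
        tokens.append(idx)
    return tokens, merged                         # merged would feed the motion decoder
```

In the full model each body-part branch would run such a scheme with its own encoder and codebooks, and the merged features from all branches are passed to the decoder for motion reconstruction.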
Task-specific adaptations of our model:
Motion Composition
Motion Control
Text-based Motion Editing
Conditional Motion Generation
The light green and yellow colors indicate that the tokens are derived from two different motions.
@article{wang2025spatial,
title={Spatial-Temporal Multi-Scale Quantization for Flexible Motion Generation},
author={Wang, Zan and Zhang, Jingze and Chen, Yixin and Jia, Baoxiong and Liang, Wei and Huang, Siyuan},
journal={arXiv preprint arXiv:2508.08991},
year={2025}
}