1. Overview
We propose the Graph and Skipped Transformer (G-SFormer) to achieve an optimal trade-off between accuracy and computational cost. It emphasizes capturing holistic spatial interactions and long-range temporal dynamics, leading to a comprehensive performance enhancement. G-SFormer comprises two key modules: a Part-based Adaptive GNN, which utilizes coarse-grained body parts to construct structural correlations through a fully adaptive graph topology, and a Frameset-based Skipped Transformer, which captures long-range dynamics from multiple perspectives of movement. The proposed G-SFormer has the following advantages:
Stable performance and low cost: G-SFormer series models compete with and outperform state-of-the-art methods on datasets ranging from large to small, while requiring only a fraction of the parameters and computational cost.
Compact architecture: G-SFormer integrates the Part-based GNN and the Skipped Transformer to exploit spatial and temporal information, respectively. It adopts a global approach that eliminates the redundant, iterative spatio-temporal connections found in mainstream architectures, providing a streamlined solution for high-accuracy 3D Human Pose Estimation (HPE).
Robustness: G-SFormer exhibits outstanding robustness to inaccurately detected 2D poses. Meanwhile, given its lightweight model size and low computational cost, G-SFormer holds significant practical value for 3D HPE tasks in complex real-world scenarios.
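To make the "fully adaptive graph topology" concrete, here is a minimal NumPy sketch of one part-based graph-convolution step. This is an illustration under our own assumptions (5 coarse body parts, a free learnable adjacency normalized row-wise with a softmax, a single linear transform), not the authors' implementation; the variable names and the part count are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 5 coarse body parts (e.g. torso, two arms, two legs),
# each with a C-dimensional feature pooled from its joints.
P, C = 5, 16

# Fully adaptive topology: the adjacency is a free learnable matrix rather
# than a fixed skeleton graph; a row-wise softmax keeps each row a valid
# weighting over the parts.
A_logits = rng.normal(size=(P, P))                      # learnable (sketch)
A = np.exp(A_logits) / np.exp(A_logits).sum(axis=1, keepdims=True)

W = rng.normal(size=(C, C)) * 0.1                       # feature transform
parts = rng.normal(size=(P, C))                         # part features, one frame

# One graph-convolution step: aggregate part features over the learned
# topology, transform, then apply a ReLU nonlinearity.
out = np.maximum(A @ parts @ W, 0.0)
print(out.shape)                                        # (5, 16)
```

Because A is unconstrained by the anatomical skeleton, training can discover spatial correlations between non-adjacent parts (e.g. opposite arm and leg during walking).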
Left: Comparison of spatio-temporal correlation modeling methods: (a) building joint-wise and frame-wise connections for the pose representation; (b) building joint-wise and frame-wise connections for each joint; (c) our G-SFormer: constructing part-based spatial alignments and long-range temporal skipped connections. Right: MPJPE (mm) vs. MFLOPs of the proposed G-SFormer and competitors on the Human3.6M dataset, where marker size indicates model size. Methods lower and to the left achieve better trade-offs between the two conflicting objectives, accuracy and cost, with G-SFormer attaining Pareto-optimal performance among competitive approaches.
Architecture of Graph and Skipped Transformer (G-SFormer)
2. Abstract
Recent works in 2D-to-3D pose uplifting for monocular 3D Human Pose Estimation (HPE) have shown significant progress. GNN-based and Transformer-based methods have become mainstream architectures due to their advanced spatial and temporal feature learning capacities. However, existing approaches typically construct dense joint-frame connections and iterative correlations in the spatial and temporal domains, resulting in complex models with large parameter counts and high computational cost. To achieve efficient 3D HPE with low resource consumption, we propose to learn Pareto-optimal models with better trade-offs between computational cost and accuracy. The models leverage human physical structure and long-range dynamics to learn spatial part-based and temporal frameset-based representations. Specifically, in the Spatial Encoding stage, coarse-grained body parts are used to construct structural correlations with a fully adaptive graph topology. This spatial correlation representation is integrated with multi-granularity pose attributes to generate a comprehensive pose representation for each frame. In the Temporal Encoding and Decoding stages, Skipped Self-Attention is performed within framesets to establish long-term temporal dependencies from multiple perspectives of movement. On this basis, a compact Graph and Skipped Transformer (G-SFormer) is proposed, which can flexibly work in both seq2frame and seq2seq workflows. Extensive experiments on the Human3.6M, MPI-INF-3DHP and HumanEva benchmarks demonstrate that G-SFormer series models compete with and outperform state-of-the-art methods with small model sizes and minimal computational cost. G-SFormer also exhibits outstanding robustness to inaccurately detected 2D poses.
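The frameset idea behind Skipped Self-Attention can be illustrated with a small NumPy sketch: frames are partitioned into framesets by a skip stride, and attention is computed within each frameset, so every attention window spans the full sequence. This is a simplified sketch under our own assumptions (stride 3, no learned projections or multiple heads), not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, s = 12, 8, 3            # frames, channels, skip stride (illustrative)
x = rng.normal(size=(T, C))   # per-frame pose embeddings

def self_attention(q):
    # Plain scaled dot-product self-attention (projections omitted for brevity).
    scores = q @ q.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ q

out = np.empty_like(x)
# Skipped grouping: frameset k holds frames k, k+s, k+2s, ..., so attention
# within a single frameset reaches across the entire sequence, capturing
# long-range dynamics at a fraction of the cost of dense frame attention.
for k in range(s):
    idx = np.arange(k, T, s)
    out[idx] = self_attention(x[idx])
print(out.shape)              # (12, 8)
```

Each frameset offers a different temporally subsampled "perspective of movement", and dense T-by-T attention is replaced by s groups of (T/s)-by-(T/s) attention.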
3. Robustness to Inaccurately Detected 2D Poses
4. Demo
We provide demos on in-the-wild videos with noisy 2D poses, covering typical failure cases: joint-position deviation, left-right switching, confusion caused by self-occlusion, and missed or coincident detections. G-SFormer demonstrates stable performance in these challenging scenarios.
5. Code
The source code is available at github.com/Mona9955/G-SFormer/tree/master. Three model variants, G-SFormer, G-SFormer-S, and G-SFormer-L, have been released, including the pretrained models and code.