BKinD-3D: Self-Supervised 3D Keypoint Discovery from Multi-View Videos
Jennifer J. Sun1*, Lili Karashchuk2*, Amil Dravid3*, Serim Ryou4†, Sonia Fereidooni2, John C. Tuthill2, Aggelos Katsaggelos3, Bingni W. Brunton2, Georgia Gkioxari1, Ann Kennedy3, Yisong Yue1, Pietro Perona1
1Caltech 2University of Washington 3Northwestern 4SAIT
*Equal contribution †Work done outside of SAIT
Summary
We introduce self-supervised 3D keypoint discovery: recovering 3D pose directly from real-world multi-view behavioral videos of different organisms, without any 2D or 3D supervision.
We propose a novel method (BKinD-3D) for end-to-end 3D keypoint discovery from video, using multi-view spatiotemporal difference reconstruction and 3D joint length constraints (a sketch of one possible length constraint follows this summary).
We demonstrate quantitatively that our work significantly closes the gap between supervised 3D methods and 3D keypoint discovery across different organisms (humans and rats).
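As an illustration, the sketch below shows one plausible form of a 3D joint (segment) length constraint: it penalizes variation of each discovered segment's length across time, encouraging a skeleton with consistent bone lengths. The function name, tensor shapes, and exact formulation are assumptions made for illustration and may differ from the loss used in the paper.

```python
import torch

def segment_length_loss(kpts_3d, edges):
    # kpts_3d: (T, J, 3) discovered 3D keypoints over T timestamps (hypothetical shapes)
    # edges:   list of (i, j) index pairs connecting keypoints into segments
    losses = []
    for i, j in edges:
        # Euclidean length of segment (i, j) at every timestamp: shape (T,)
        lengths = torch.norm(kpts_3d[:, i] - kpts_3d[:, j], dim=-1)
        # a rigid segment should keep a roughly constant length over time
        losses.append(lengths.var())
    return torch.stack(losses).mean()

# hypothetical usage with two timestamps (t and t + k) and 10 discovered keypoints
kpts_3d = torch.randn(2, 10, 3)
loss = segment_length_loss(kpts_3d, edges=[(0, 1), (1, 2), (2, 3)])
```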
Method
BKinD-3D: 3D keypoint discovery using 3D volume bottleneck. We start from input multi-view videos with known camera parameters, then unproject feature maps from geometric encoders into 3D volumes for timestamps t and t + k. We next aggregate 3D points from volumes into a single edge map at each timestamp, and use edges as input to the decoder alongside appearance features at time t. The model is trained using multi-view spatiotemporal difference reconstruction. Best viewed in color.
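To make the unprojection step concrete, the sketch below shows one common way to lift per-view 2D feature maps into a shared 3D volume given known camera parameters: project voxel centers into each view, bilinearly sample the feature maps, and aggregate across views. Function names, tensor shapes, and the mean aggregation over views are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def unproject_to_volume(feat_2d, K, Rt, grid_xyz):
    # feat_2d:  (V, C, H, W) feature maps from the geometric encoder, one per view
    # K:        (V, 3, 3) camera intrinsics
    # Rt:       (V, 3, 4) camera extrinsics [R | t], mapping world -> camera coords
    # grid_xyz: (D, D, D, 3) world coordinates of voxel centers
    # returns:  (C, D, D, D) feature volume, averaged over views
    V, C, H, W = feat_2d.shape
    D = grid_xyz.shape[0]
    pts = grid_xyz.reshape(-1, 3)                                  # (N, 3), N = D^3
    pts_h = torch.cat([pts, torch.ones_like(pts[:, :1])], dim=1)   # homogeneous (N, 4)

    volumes = []
    for v in range(V):
        cam = Rt[v] @ pts_h.T                                      # (3, N) camera coords
        uvw = K[v] @ cam                                           # (3, N) unnormalized pixels
        uv = uvw[:2] / uvw[2].clamp(min=1e-6)                      # (2, N) pixel coords
        # normalize pixel coordinates to [-1, 1] for grid_sample
        u = 2 * uv[0] / (W - 1) - 1
        w = 2 * uv[1] / (H - 1) - 1
        grid = torch.stack([u, w], dim=-1).view(1, 1, -1, 2)       # (1, 1, N, 2)
        sampled = F.grid_sample(feat_2d[v:v + 1], grid, align_corners=True)  # (1, C, 1, N)
        volumes.append(sampled.view(C, D, D, D))
    return torch.stack(volumes).mean(dim=0)                        # aggregate over views
```

Running this once per timestamp (t and t + k) yields the volumes from which keypoints and edge maps can be extracted for the decoder.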
Qualitative Results
Our method works across a range of organisms. We evaluate on the Human3.6M and Rat7M datasets. Please refer to our paper for the quantitative results. Below we display uncurated visualizations.
Acknowledgements
This work is generously supported by the Amazon AI4Science Fellowship (to JJS), NIH NINDS (R01NS102333 to JCT), and the Air Force Office of Scientific Research (AFOSR FA9550-19-1-0386 to BWB).
Correspondence to jjsun (at) caltech (dot) edu.