BKinD-3D: Self-Supervised 3D Keypoint Discovery from Multi-View Videos

CVPR 2023
[paper] [code]

Jennifer J. Sun1*, Lili Karashchuk2*, Amil Dravid3*, Serim Ryou4†, Sonia Fereidooni2, John C. Tuthill2, Aggelos Katsaggelos3, Bingni W. Brunton2, Georgia Gkioxari1, Ann Kennedy3, Yisong Yue1, Pietro Perona1

1Caltech    2University of Washington    3Northwestern    4SAIT

*Equal contribution    †Work done outside of SAIT 


Summary

Method

BKinD-3D: 3D keypoint discovery using a 3D volume bottleneck. We start from multi-view videos with known camera parameters and unproject feature maps from geometric encoders into 3D volumes at timestamps t and t + k. We then aggregate the 3D points from these volumes into a single edge map at each timestamp, and feed the edges to the decoder alongside appearance features from time t. The model is trained with a multi-view spatiotemporal difference reconstruction loss. Best viewed in color.
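The core geometric step is lifting each view's 2D feature map into a shared 3D volume using the known camera parameters. Below is a minimal PyTorch sketch of that unprojection, assuming pinhole cameras with intrinsics K and world-to-camera extrinsics (R, t); the function name, voxel grid, bilinear sampling via grid_sample, and the mean aggregation across views are illustrative assumptions, not the released implementation (see the code link above for the authors' version).

```python
import torch
import torch.nn.functional as F

def unproject_features(feat_2d, K, R, t, grid):
    """Lift a 2D feature map into a 3D volume (illustrative sketch).

    feat_2d: (C, H, W) feature map from the geometric encoder
    K:       (3, 3) camera intrinsics
    R, t:    (3, 3) rotation and (3,) translation (world -> camera)
    grid:    (D, D, D, 3) world coordinates of the voxel centers

    Returns a (C, D, D, D) volume where each voxel holds the feature
    bilinearly sampled at its projection in the image plane.
    """
    C, H, W = feat_2d.shape
    D = grid.shape[0]

    # Project every voxel center into the camera's image plane.
    pts_world = grid.reshape(-1, 3)                         # (D^3, 3)
    pts_cam = pts_world @ R.T + t                           # (D^3, 3)
    pts_img = pts_cam @ K.T                                 # (D^3, 3)
    uv = pts_img[:, :2] / pts_img[:, 2:3].clamp(min=1e-6)   # pixel coords

    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    uv_norm = torch.stack(
        [2 * uv[:, 0] / (W - 1) - 1, 2 * uv[:, 1] / (H - 1) - 1], dim=-1
    )

    # Bilinearly sample the feature map at each projected voxel location;
    # out-of-image projections receive zeros (default padding_mode).
    sampled = F.grid_sample(
        feat_2d[None],          # (1, C, H, W)
        uv_norm[None, None],    # (1, 1, D^3, 2)
        align_corners=True,
    )                           # (1, C, 1, D^3)
    volume = sampled.reshape(C, D, D, D)

    # Zero out voxels that project from behind the camera.
    in_front = (pts_cam[:, 2] > 0).reshape(D, D, D)
    return volume * in_front

# Per-timestamp aggregation across views, e.g. by averaging
# (the paper's aggregation may differ):
# volume_t = torch.stack([
#     unproject_features(f, K, R, t, grid)
#     for f, (K, R, t) in zip(view_features, cameras)
# ]).mean(dim=0)
```

The per-view volumes at times t and t + k are then reduced to edge maps that act as the bottleneck the decoder reconstructs from, together with the appearance features from time t.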

Qualitative Results

Our method works across a range of organisms: we evaluate on the Human3.6M (human) and Rat7M (rat) datasets. Please refer to our paper for quantitative results. Below, we display uncurated visualizations.

Acknowledgements

This work is generously supported by the Amazon AI4Science Fellowship (to JJS), NIH NINDS (R01NS102333 to JCT), and the Air Force Office of Scientific Research (AFOSR FA9550-19-1-0386 to BWB). 

Correspondence to jjsun (at) caltech (dot) edu.