CVPR 2021 Workshop on 3D Vision and Robotics

June 19th, 2021

In recent years, there has been tremendous progress in 3D vision for the analysis and understanding of 3D data, such as 3D semantic segmentation and 3D object detection and tracking. These advances, however, have not yet translated to significant progress on several fundamental challenges in robotics. Active perception in static and dynamic environments, inference of spatial relations in 3D scenes, activity recognition, and behavior prediction in real-world settings are a few examples of challenging robotics problems. To tackle these problems successfully, we should leverage the inherent 3D nature of the physical world and apply deep learning approaches that learn 3D representations that are robust to input perturbations (e.g., transformation invariance) and generalize to real-world variations with high sample efficiency. This workshop presents a timely opportunity to bring together researchers from the computer vision, machine learning, and robotics communities to discuss the unique challenges and opportunities of 3D vision for robotics.

Topics of Interest

This workshop explores the use of 3D perception for robotics. 3D perception is critical in robotic applications such as manipulation and navigation. A robot needs to understand the 3D world in order to perform tasks in 3D space. 3D vision techniques can be used to tackle many challenges in robot perception. In addition, connecting 3D vision with robotics inspires exploration of active vision, interactive perception, and reinforcement learning since robots are able to interact with a 3D environment to obtain feedback.

In this workshop, we aim to bring together experts spanning visual computing, machine learning, and robotics to discuss challenges in 3D vision and how it can help with perception, control, and planning in robotics. One of the key challenges of 3D vision is determining what type of representation is appropriate and how to design machine learning algorithms for these different representations. In contrast to 2D images, which have a standard representation as regular pixel grids, 3D data can come as irregular 3D point clouds (e.g., acquired with LiDAR sensors), meshes of varying topology, or volumetric data. Intelligent agents also need to take in temporal sequences of sensory observations and make decisions on how to act. Will accumulating sensory observations into a 3D representation of the world lead to better models? This workshop will provide a venue for people interested in 3D and robotics to come together to discuss the various challenges and problems in this area. We will accept submissions on the following list of topics, broadly interpreted.

Discussion at the workshop will focus on topics such as the following:

  • Is 3D useful for robotics? What kind of 3D representations are useful for robotics?

  • How can a robot learn a 3D representation of its environment and relevant objects from raw sensory input under noisy sensors and actuators? What kind of machine learning algorithms are needed?

  • What is the right interface between 3D perception and planning & control?

  • What are the underexplored areas of 3D perception for robotics (e.g. instance recognition, few-shot learning)?

  • What is the role of 3D simulation for robotics?

  • Both robotics and 3D vision are research areas with high barriers to entry. How can we enable researchers from other fields, such as machine learning, to work in these areas more easily?
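To make the representation question above concrete, here is a minimal sketch (in NumPy; the voxel resolution and the toy spherical "scan" are illustrative assumptions, not tied to any particular sensor or method from the workshop) of converting an irregular point cloud into a regular volumetric occupancy grid:

```python
import numpy as np

def voxelize(points, resolution=32):
    """Convert an irregular (N, 3) point cloud into a regular
    boolean occupancy grid of shape (resolution, resolution, resolution)."""
    # Normalize the cloud into the unit cube [0, 1]^3.
    mins = points.min(axis=0)
    extent = points.max(axis=0) - mins
    normalized = (points - mins) / np.maximum(extent, 1e-8)
    # Map each point to a voxel index, clamping to stay inside the grid.
    idx = np.minimum((normalized * resolution).astype(int), resolution - 1)
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# Toy "scan": 1000 random points on a unit sphere surface.
rng = np.random.default_rng(0)
pts = rng.normal(size=(1000, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
occupancy = voxelize(pts, resolution=16)
```

Point-based networks (e.g., the PointNet family) operate directly on the irregular (N, 3) array, while volumetric methods consume the regular grid; the conversion above trades memory and spatial resolution for the convenience of grid-structured convolutions.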


All times are in CDT (Central Daylight Time)

9:00am - 9:15am Welcome (Yuke Zhu)


9:15am - 9:45am Sanja Fidler, University of Toronto - Image GANs meet 3D Engines

9:45am - 10:15am David Held, CMU - Perceptual Robot Learning

10:15am - 11:00am Spotlight Session 1: Representation and Learning

Synergies Between Affordance and Geometry: 6-DoF Grasp Detection via Implicit Representations

Zhenyu Jiang (University of Texas - Austin)*; Yifeng Zhu (University of Texas at Austin);

Maxwell Svetlik (University of Texas at Austin); Kuan Fang (Stanford University); Yuke Zhu (University of Texas - Austin)


Data-driven 6D Pose Tracking by Calibrating Image Residuals in Synthetic Domains

Bowen Wen (Rutgers University)*; Chaitanya Mitash (Amazon Robotics); Kostas Bekris (Rutgers University)


Visionary: Neural Architecture Search for Robot Learning

Iretiayo Akinola (Columbia University); Anelia Angelova (Google); Yao Lu (Google Research); Yevgen Chebotar (Google);

Dmitry Kalashnikov (Google Inc.); Jacob Varley (Google); Julian Ibarz (Google); Michael S Ryoo (Google; Stony Brook University)*


Spotlight Session 2: Recognition with Point Clouds

Recurrently Estimating Reflective Symmetry Planes from Partial Pointclouds

Mihaela C Stoian (FiveAI Ltd.)*; Tommaso Cavallari (FiveAI Ltd.)


Latent-Polar Transformer for LiDAR Point-cloud 3D Object Detection

Manoj Bhat (Qualcomm)*; Shizhong Han (Qualcomm); Fatih Porikli (Qualcomm)


SeqNetVLAD vs PointNetVLAD: Image Sequence vs 3D Point Clouds for Day-Night Place Recognition

Sourav Garg (Queensland University of Technology)*; Michael Milford (ACRV and QUT, Australia)


11:00am - 11:30am Break

11:30am - 12:00pm Kristen Grauman, UT Austin - Sights, sounds and spaces: Audio-visual learning in 3D environments

12:00pm - 12:30pm Manolis Savva, Simon Fraser University - Simulation for Embodied AI: Emerging Directions

12:30pm - 1:00pm Franziska Meier, Facebook - Model-Based Visual Imitation Learning

1:00pm - 2:00pm Lunch

2:00pm - 2:30pm Hao Su, UC San Diego - SAPIEN++ and Manipulation Skill Challenge

2:30pm - 3:00pm Andy Zeng, Google - From Shapes to Actions

3:00pm - 3:45pm Spotlight Session 3: Datasets for Vision and Robotics

SAIL-VOS 3D: A Synthetic Dataset and Baselines for Object Detection and 3D Mesh Reconstruction from Video Data

Yuan-Ting Hu (UIUC)*; Jiahong Wang (UIUC); Raymond A Yeh (UIUC); Alexander Schwing (UIUC)


DexYCB: A Benchmark for Capturing Hand Grasping of Objects

Yu-Wei Chao (NVIDIA)*; Wei Yang (NVIDIA); Yu Xiang (NVIDIA); Pavlo Molchanov (NVIDIA); Ankur Handa (NVIDIA);

Jonathan Tremblay (NVIDIA); Yashraj S Narang (NVIDIA); Karl Van Wyk (NVIDIA); Umar Iqbal (NVIDIA Research);

Stan Birchfield (NVIDIA); Jan Kautz (NVIDIA); Dieter Fox (NVIDIA)


Data-driven haptic feedback utilizing an object manipulation data-set

Athanasios Ntovas (CERTH-ITI); Lazaros Lazaridis (CERTH); Dr. Alexis Papadimitriou (Visual Computing Lab of CERTH/ITI)*;

Athanasios Psaltis (CERTH); Apostolos Axenopoulos (ITI-CERTH); Petros Daras (ITI-CERTH, Greece)


Spotlight Session 4: Applications in Vision and Robotics

A Sim2Real Approach to Augment Low-Resource Data for Dynamic Emotion Expression Recognition

Saba Akhyani (Simon Fraser University)*; Mehryar Abbasi Boroujeni (Simon Fraser University);

Mo Chen (Simon Fraser University); Angelica Lim (Simon Fraser University)


Photometric Gaussian Mixtures for Direct Virtual Visual Servoing of Omnidirectional Camera

Seif Eddine Guerbas (Université de Picardie Jules Verne)*; Nathan Crombez (Université de technologie de Belfort Montbéliard);

Guillaume Caron (Universite de Picardie Jules Verne); El Mustapha Mouaddib (Universite de Picardie Jules Verne)


Visually Guided Agile Quadruped Locomotion

Gabriel B Margolis (MIT)*; Tao Chen (MIT); Xiang Fu (MIT); Kartik Paigwar (ASU); Donghyun Kim (MIT);

Sangbae Kim (MIT); Pulkit Agrawal (MIT)


3:45pm - 4:00pm Break

4:00pm - 4:30pm Roozbeh Mottaghi, AI2 - Visual Navigation in Indoor Scenes

4:30pm - 5:30pm Panel (Li Erran Li, Charles Qi)

5:30pm - 5:45pm Closing Remarks

Speakers and Talks

Sanja Fidler, University of Toronto

Sanja Fidler is an Associate Professor at the University of Toronto and a Director of AI at NVIDIA, leading a research lab in Toronto. Prior to coming to Toronto, in 2012/2013, she was a Research Assistant Professor at the Toyota Technological Institute at Chicago, an academic institute located on the campus of the University of Chicago. She did her postdoc with Prof. Sven Dickinson at the University of Toronto in 2011/2012. She completed her PhD in computer science at the University of Ljubljana in 2010 and was a visiting student at UC Berkeley in the final year of her PhD. She has served as an Area Chair for multiple computer vision, machine learning, and NLP conferences (CVPR, ICLR, EMNLP, ACCV), and as a Program Chair of 3DV'16. Her main research interests are object recognition, 3D scene understanding, and combining vision and language.

David Held, Carnegie Mellon University

David Held is an assistant professor at Carnegie Mellon University in the Robotics Institute and is the director of the RPAD lab: Robots Perceiving And Doing. His research focuses on perceptual robot learning, i.e., developing new methods at the intersection of robot perception and planning for robots to learn to interact with novel, perceptually challenging, and deformable objects. David has applied these ideas to robot manipulation and autonomous driving. Prior to coming to CMU, David was a post-doctoral researcher at U.C. Berkeley, and he completed his Ph.D. in Computer Science at Stanford University. David also has a B.S. and M.S. in Mechanical Engineering from MIT. David is a recipient of the Google Faculty Research Award in 2017 and the NSF CAREER Award in 2021.

Kristen Grauman, The University of Texas at Austin

Kristen Grauman is a Professor in the Department of Computer Science at the University of Texas at Austin and a Research Scientist at Facebook AI Research. Her research in computer vision and machine learning focuses on visual recognition and search. Before joining UT Austin in 2007, she received her Ph.D. at MIT in computer science. She is a Sloan Fellow, a recipient of NSF CAREER and ONR Young Investigator awards, the 2013 PAMI Young Researcher Award, the 2013 IJCAI Computers and Thought Award, a Presidential Early Career Award for Scientists and Engineers (PECASE), a 2017 Helmholtz Prize computer vision “test of time” award, and the 2018 J.K. Aggarwal Prize from the International Association for Pattern Recognition. She and her collaborators were recognized with best paper awards at CVPR 2008, ICCV 2011, and ACCV 2016.

Manolis Savva, Simon Fraser University

Manolis Savva is an Assistant Professor in the School of Computing Science at Simon Fraser University, and a Canada Research Chair in Computer Graphics. He completed his PhD at the Stanford graphics lab, advised by Pat Hanrahan. His research focuses on human-centric 3D scene analysis, 3D scene generation, and simulation for scene understanding. He has also worked in data visualization, grounding of natural language to 3D content, and in creating large-scale scene datasets for 3D deep learning.

Franziska Meier, Facebook AI Research

Franziska Meier is a research scientist at Facebook AI Research. Previously she was a research scientist at the Max-Planck Institute for Intelligent Systems and a postdoctoral researcher with Dieter Fox at the University of Washington, Seattle. She received her PhD from the University of Southern California, where she defended her thesis on “Probabilistic Machine Learning for Robotics” in 2016, under the supervision of Prof. Stefan Schaal. Prior to her PhD studies, she received her Diploma in Computer Science from the Technical University of Munich. Her research focuses on machine learning for robotics, with a special emphasis on lifelong learning for robotics.

Hao Su, UC San Diego

Hao Su is an Assistant Professor of Computer Science and Engineering at UC San Diego. He is interested in fundamental problems in broad disciplines related to artificial intelligence, including machine learning, computer vision, computer graphics, and robotics. His most recent work focuses on integrating these disciplines to build and train embodied AI that can interact with the physical world. In the past, his work on ShapeNet, the PointNet series, and graph neural networks has significantly impacted the emergence and growth of a new field, 3D deep learning. He also participated in the development of ImageNet, a large-scale 2D image database. He has served as an Area Chair, Associate Editor, and in other comparable positions on the program committees of CVPR, ICCV, ECCV, ICRA, Transactions on Graphics (TOG), and AAAI.

Andy Zeng, Google AI

Andy Zeng is a research scientist at Google AI working on vision for robotics. His research focuses on manipulation and self-supervised deep learning, to enable machines to intelligently interact with the physical world and improve themselves over time. His work has been recognized with the best paper system award at RSS 2019, and as best paper finalists at ICRA 2020 and IROS 2018.

Roozbeh Mottaghi, Allen Institute for AI

Roozbeh Mottaghi is the Research Manager of the PRIOR team at Allen Institute for AI and an Affiliate Associate Professor in Paul G. Allen School of Computer Science & Engineering at the University of Washington. Prior to joining AI2, he was a post-doctoral researcher at the Computer Science Department at Stanford University. He obtained his PhD in Computer Science in 2013 from University of California, Los Angeles. His research is mainly focused on Computer Vision and Machine Learning.

Panel Discussion Video

Call for Abstracts

We solicit 2-4 page extended abstracts conforming to the official CVPR style guidelines. A paper template is available in LaTeX and Word. References will not count towards the page limit. The review process is double-blind. Submissions can include late-breaking results, material under review, or archived or previously accepted work (please make a note of this in the submission).

Important Dates

  • Submission Deadline: April 21, 2021 (11:59 pm PST)

  • Papers Assigned to Reviewers: April 24, 2021 (11:59 pm PST)

  • Reviews Due: May 8, 2021 (11:59 pm PST)

  • Acceptance Decision: May 15, 2021 (11:59 pm PST)

  • Camera-Ready Version: May 29, 2021 (11:59 pm PST)

Please note that accepted contributions will be presented as spotlight talks at the workshop and will be posted on the workshop website upon author approval.


Angel X. Chang

Simon Fraser University

Katerina Fragkiadaki

Carnegie Mellon University

Qixing Huang

The University of Texas at Austin

Li Erran Li

Alexa AI at Amazon

Yu Xiang

NVIDIA Research

Yuke Zhu

The University of Texas at Austin