Schedule

Final Program

All times are CET, 11 July 2021

  • 09:00-09:10: Introduction

  • 09:10-09:40: On Monocular Depth Estimation: (1) MonoDEVS; (2) Multi-modal Co-training, Dr. Antonio M. López

  • 09:40-10:10: Point-based recognition, Prof. Philipp Krähenbühl

  • 10:10-10:40: Lidar Segmentation at Motional, Venice Liong

  • 10:40-11:10: Co-Development of Automatic Annotation for Machine Learning and Sensor Fusion Improvement System, Stefan Haag

  • 11:10-11:40: Radar Perception for Automated Driving – Data and Methods, Ole Schumann

  • 11:40-12:10: Perception Data Pipeline at Innoviz Technologies, Amir Day

  • 12:10-12:30: [paper 1] The Oxford Road Boundaries Dataset, Tarlan Suleymanov*, Matthew Gadd, Daniele De Martini, Paul Newman

  • 12:30-13:30: Lunch Break

  • 13:30-13:50: [paper 2] Unsupervised Joint Multi-Task Learning of Vision Geometry Tasks, Prabhash Kumar Jha*, Doychin Tsanev, Luka Lukic

  • 13:50-14:10: [paper 3] CFTrack: Center-based Radar and Camera Fusion for 3D Multi Object Tracking, Ramin Nabati*, Landon Harris, Hairong Qi

  • 14:10-14:30: [paper 4] Machine learning based 3D object detection for navigation in unstructured environments, Gjorgji Nikolovski*, Michael Reke, Ingo Elsen, Stefan Schiffer

  • 14:30-14:50: [paper 5] Pruning CNNs for LiDAR-based Perception in Resource Constrained Environments, Manoj Vemparala*, Anmol Singh, Ahmed Mzid, Nael Fasfous, Alexander Frickenstein, Florian Mirus, Hans Joerg Voegel, Naveen Shankar Nagaraja, Walter Stechele

  • 14:50-15:00: Break

  • 15:00-15:30: Modern methods of visual localization, Dr. Martin Humenberger

  • 15:30-16:00: All-In-One Drive: A Large-Scale Comprehensive Perception Dataset with High-Density Long-Range Point Clouds, Xinshuo Weng

  • 16:00-16:30: Offboard Perception for Autonomous Driving, Charles R Qi

  • 16:30-17:00: Using Artificial Intelligence layer to transform high-resolution radar point cloud into insights for Autonomous Driving applications, Sani Ronen

  • 17:00-17:30: Self-supervised 3D vision, Dr. Rareș Ambruș

  • 17:30-17:45: Closing

Q&A is included at the end (last 5 minutes) of each talk, time permitting.


Accepted papers

  1. The Oxford Road Boundaries Dataset, Tarlan Suleymanov*, Matthew Gadd, Daniele De Martini, Paul Newman

  2. Unsupervised Joint Multi-Task Learning of Vision Geometry Tasks, Prabhash Kumar Jha*, Doychin Tsanev, Luka Lukic

  3. CFTrack: Center-based Radar and Camera Fusion for 3D Multi Object Tracking, Ramin Nabati, Landon Harris, Hairong Qi

  4. Machine learning based 3D object detection for navigation in unstructured environments, Gjorgji Nikolovski, Michael Reke*, Ingo Elsen, Stefan Schiffer

  5. Pruning CNNs for LiDAR-based Perception in Resource Constrained Environments, Manoj Vemparala*, Anmol Singh, Ahmed Mzid, Nael Fasfous, Alexander Frickenstein, Florian Mirus, Hans Joerg Voegel, Naveen Shankar Nagaraja, Walter Stechele

Invited Speakers

Principal Investigator, Autonomous Driving, Computer Vision Center (CVC); Tenured Associate Professor, Department of Computer Science, Universitat Autònoma de Barcelona (UAB); ICREA Acadèmia Fellow at UAB

Antonio M. López is the principal investigator of the Autonomous Driving lab of the Computer Vision Center (CVC) at the Universitat Autònoma de Barcelona (UAB). He also holds a tenured position as associate professor in the Computer Science department of the UAB. Antonio has a long track record of research at the intersection of computer vision, computer graphics, machine learning, driver assistance and autonomous driving. He has been deeply involved in the creation of the SYNTHIA dataset and the CARLA open-source simulator, both aimed at democratizing autonomous driving research. He works hand in hand with industry partners to bring state-of-the-art techniques to the field of autonomous driving. Currently, Antonio is funded by the Catalan ICREA Acadèmia program.


On Monocular Depth Estimation: (1) MonoDEVS; (2) Multi-modal Co-training

Depth information is essential for on-board perception in autonomous driving. Monocular depth estimation (MDE) is very appealing since it puts appearance and depth in direct pixel-wise correspondence without further calibration. In this talk, we present MonoDEVS, our MDE approach trained on virtual-world supervision and real-world SfM self-supervision (thus, on monocular sequences). In addition, we present our recent results on semi-supervised learning (SSL) for object annotation in on-board images. More specifically, we use co-training as the SSL approach, assessing the usefulness of MDE as one of the data views for co-training.
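As an illustration of the SfM self-supervision signal mentioned above, the sketch below shows the generic view-synthesis photometric loss used in many monocular depth pipelines. It is not the MonoDEVS implementation; `depth_net` and `pose_net` are assumed user-provided networks, and the pose network is assumed to output a 4x4 relative transform.

```python
# Generic SfM-style photometric self-supervision for monocular depth (sketch).
# Assumptions: depth_net and pose_net are user-provided nn.Modules; pose_net
# returns a (B, 4, 4) relative transform; K is a (3, 3) intrinsics tensor;
# images are (B, 3, H, W) tensors on the same device.
import torch
import torch.nn.functional as F

def photometric_self_supervision(depth_net, pose_net, K, target, source):
    """Warp `source` into the `target` view with predicted depth and pose,
    then penalize the photometric difference (L1)."""
    B, _, H, W = target.shape
    depth = depth_net(target)                                  # (B, 1, H, W)
    T = pose_net(target, source)                               # (B, 4, 4)

    # Back-project target pixels to 3D using the predicted depth.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).float()   # (3, H, W)
    rays = torch.linalg.inv(K) @ pix.reshape(3, -1)            # (3, H*W)
    pts = depth.reshape(B, 1, -1) * rays.unsqueeze(0)          # (B, 3, H*W)

    # Move the points into the source frame and project them back to pixels.
    pts_h = torch.cat([pts, torch.ones(B, 1, H * W)], dim=1)   # (B, 4, H*W)
    cam = (T @ pts_h)[:, :3]                                   # (B, 3, H*W)
    proj = K.unsqueeze(0) @ cam
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)

    # Sample the source image at the projected locations (grid in [-1, 1]).
    u = 2 * uv[:, 0] / (W - 1) - 1
    v = 2 * uv[:, 1] / (H - 1) - 1
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    warped = F.grid_sample(source, grid, align_corners=True)

    return F.l1_loss(warped, target)
```

Real pipelines additionally handle occlusions and typically mix L1 with structural similarity terms, which the sketch omits; MonoDEVS further combines this self-supervised signal with virtual-world depth supervision, as described in the talk.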

Department of Computer Science, University of Texas at Austin

Philipp is an Assistant Professor in the Department of Computer Science at the University of Texas at Austin. He received his Ph.D. in 2014 from the CS Department at Stanford University and then spent two wonderful years as a postdoc at UC Berkeley. His research interests lie in computer vision, machine learning, and computer graphics. He is particularly interested in deep learning, image understanding, and vision and action.


Point-based recognition

Computer vision algorithms commonly recognize objects as axis-aligned boxes. Even before deep learning, the best performing object detectors classified rectangular image regions. On one hand, this approach conveniently reduces recognition to image classification. On the other hand, it has to deal with a nearly exhaustive list of image regions that do not contain any objects. In this talk, I'll present an alternative representation of objects: as points. I'll show how to build an object detector from a keypoint detector of object centers. The presented approach is both simpler and more efficient (faster and/or more accurate) than equivalent box-based detection systems. Our point-based detector easily extends to other tasks, such as object tracking, monocular or Lidar 3D detection, and pose estimation.
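To make the center-point idea above concrete, here is a hedged sketch of CenterNet-style decoding: peaks of a class heatmap become detections, and per-pixel regression maps supply box size and sub-pixel offset. Tensor shapes and names are illustrative assumptions, not the speaker's exact implementation.

```python
# Sketch of center-point decoding: detections are peaks of a class heatmap,
# with box size and sub-pixel offset read from regression maps at each peak.
import torch
import torch.nn.functional as F

def decode_centers(heatmap, wh, offset, k=100):
    """heatmap: (B, C, H, W) class scores in [0, 1]; wh, offset: (B, 2, H, W).
    Returns top-k scores, class ids, and (cx, cy, w, h) boxes in feature-map units."""
    B, C, H, W = heatmap.shape

    # Keep only local maxima: a 3x3 max-pool acts as a cheap NMS.
    peaks = F.max_pool2d(heatmap, 3, stride=1, padding=1)
    heatmap = heatmap * (peaks == heatmap).float()

    # Top-k peaks over all classes and locations.
    scores, idx = heatmap.reshape(B, -1).topk(k)
    cls = torch.div(idx, H * W, rounding_mode="floor")
    pix = idx % (H * W)
    ys = torch.div(pix, W, rounding_mode="floor").float()
    xs = (pix % W).float()

    # Gather size and offset at each peak to form boxes centered on the peak.
    gather_idx = pix.unsqueeze(1).expand(B, 2, k)
    sizes = torch.gather(wh.reshape(B, 2, -1), 2, gather_idx)
    offs = torch.gather(offset.reshape(B, 2, -1), 2, gather_idx)
    cx, cy = xs + offs[:, 0], ys + offs[:, 1]
    return scores, cls, torch.stack([cx, cy, sizes[:, 0], sizes[:, 1]], dim=-1)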

Group Lead 3D Vision, Theme Lead AI for Robotics, NAVER LABS Europe

Martin Humenberger leads the 3D Vision group and the AI for Robotics research theme at NAVER LABS Europe. His research interests are 3D vision in general, visual localization, camera pose estimation, visual features and, in particular, combining machine learning with geometry. He applies his research to mobile robotics and navigation. He is the author of numerous scientific publications and papers and a regular organiser of international workshops in his fields of interest and expertise. Martin joined NAVER LABS Europe in 2017 prior to which he was a senior scientist at the Austrian Institute of Technology in Vienna. He spent a year at NASA's Jet Propulsion Laboratory as a Caltech Postdoc after obtaining his Ph.D. in electrical engineering from the Vienna University of Technology.

Modern methods of visual localization

Visual localization is an important component of many location-based systems such as self-driving cars, autonomous robots, or augmented, mixed, and virtual reality. The goal is to estimate the accurate position and orientation of a camera from images. In more detail, correspondences between a representation of the environment (the map) and a query image are used to estimate the camera pose in 6 degrees of freedom (DOF). In this presentation, I will give an overview of popular techniques for visual localization and, by providing concrete examples, go deeper into approaches that use global representations for image retrieval and local features for accurate pose computation. I will also introduce our open-source platform kapture, which is designed to facilitate future research in this and related domains.
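As a concrete illustration of the retrieval-plus-local-features recipe described above, the sketch below shortlists map images by global-descriptor similarity and then solves a RANSAC PnP for the 6-DOF pose. The descriptor arrays and the `match_2d3d` correspondence function are hypothetical placeholders; only the OpenCV solver call is a real API.

```python
# Two-stage localization sketch: global-descriptor retrieval to shortlist map
# images, then RANSAC PnP on 2D-3D matches for a 6-DOF camera pose.
import numpy as np
import cv2

def localize(query_desc, map_descs, match_2d3d, K, dist_coeffs=None):
    # 1) Image retrieval: rank map images by similarity of L2-normalized
    #    global descriptors and keep a shortlist.
    sims = map_descs @ query_desc                 # (num_map_images,)
    shortlist = np.argsort(-sims)[:10]

    # 2) Local feature matching against the shortlist yields 2D-3D
    #    correspondences (placeholder returns Nx2 pixels and Nx3 map points).
    pts2d, pts3d = match_2d3d(shortlist)

    # 3) Robust 6-DOF pose estimation from the correspondences.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64), K, dist_coeffs,
        iterationsCount=1000, reprojectionError=8.0)
    return (rvec, tvec, inliers) if ok else None
```

Toolboxes such as kapture, mentioned in the abstract, standardize the data (images, features, global descriptors, poses) that this kind of pipeline consumes.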

Research Scientist at Waymo LLC

Charles Qi is currently a senior research scientist at Waymo LLC. Before that he was a postdoctoral researcher at Facebook AI Research (FAIR). He received his Ph.D. from Stanford University in 2018 and his B.Eng. from Tsinghua University in 2013. His research focuses on deep learning, computer vision and 3D, with well-known publications in CVPR, ICCV, SIGGRAPH Asia, and NeurIPS. Several of the 3D deep learning and 3D object detection models he developed have been widely adopted in both academia and industry. His PointNet and Frustum PointNets papers have been recognized among the 10 most influential papers of CVPR 2017 and CVPR 2018, respectively, and his paper on Deep Hough Voting received an ICCV 2019 Best Paper nomination. More information can be found on his homepage: www.charlesrqi.com

Offboard Perception for Autonomous Driving

While current 3D object recognition research mostly focuses on the real-time, onboard scenario, there are many offboard use cases of perception that remain largely underexplored, such as using machines to automatically generate high-quality 3D labels. Existing 3D object detectors fail to satisfy the high-quality requirement for offboard uses due to their limited input and speed constraints. In this talk, we introduce a novel offboard 3D object detection pipeline using point cloud sequence data. Observing that different frames capture complementary views of objects, we design the offboard detector to make use of the temporal points through both multi-frame object detection and novel object-centric refinement models. Evaluated on the Waymo Open Dataset, our pipeline, named 3D Auto Labeling, shows significant gains compared to state-of-the-art onboard detectors and our offboard baselines. Its performance is even on par with human labels, as verified through a human label study. Further experiments demonstrate the application of auto labels to semi-supervised learning and unsupervised domain adaptation, as well as to building a large-scale motion forecasting dataset.
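The core intuition of the offboard setting, that labels can be computed from the whole point cloud sequence rather than a single sweep, can be illustrated with a small sketch that aggregates LiDAR frames into a common world frame using ego poses. This is illustrative only and not the 3D Auto Labeling implementation.

```python
# Sketch of multi-frame aggregation: transform each LiDAR frame into the world
# frame with its ego pose and stack the points, giving an offboard model far
# denser, multi-view evidence than any single sweep.
import numpy as np

def aggregate_frames(points_per_frame, ego_poses):
    """points_per_frame: list of (N_i, 3) arrays in sensor coordinates.
    ego_poses: list of (4, 4) sensor-to-world transforms, one per frame."""
    world_points = []
    for pts, T in zip(points_per_frame, ego_poses):
        homo = np.hstack([pts, np.ones((len(pts), 1))])   # homogeneous coords
        world_points.append((homo @ T.T)[:, :3])          # into the world frame
    return np.vstack(world_points)                        # dense multi-frame cloud
```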

Ph.D. Candidate, Robotics Institute, School of Computer Science, Carnegie Mellon University

Xinshuo Weng is a Ph.D. candidate at the Robotics Institute of Carnegie Mellon University (CMU), advised by Kris Kitani. She received a master's degree at CMU, where she worked with Yaser Sheikh and Kris Kitani. Prior to CMU, she worked at Facebook Reality Lab as a research engineer, helping to build Photorealistic Telepresence. She received her bachelor's degree from Wuhan University. Her research interests lie in 3D computer vision and graph neural networks for autonomous systems. She has developed 3D multi-object tracking systems such as AB3DMOT, which has received more than 1,000 stars on GitHub. She is also co-organizing autonomous driving workshops at major conferences such as NeurIPS 2020, IJCAI 2021, ICCV 2021 and IROS 2021. She was awarded a Qualcomm Innovation Fellowship in 2020 and was a Facebook Fellowship finalist in 2021.


Point Cloud Forecasting in Autonomous Driving: Approach, Challenge and Benchmark

The perception and prediction pipeline (3D object detection, multi-object tracking, and trajectory forecasting) is a key component of self-driving cars. Although significant advancements have been achieved in each individual module of this pipeline, limited attention has been paid to improving the pipeline itself. In this talk, I will introduce an alternative to this standard pipeline, which first forecasts LiDAR point clouds. Detection and tracking are then performed on the predicted point clouds to obtain future object trajectories. As forecasting LiDAR point clouds does not require object labels for training, we can scale performance with more unlabeled data.

To deal with the challenges in point cloud forecasting, I will also talk about a few techniques that can produce point cloud sequences with significantly more fine-grained detail. Finally, as this is an emerging task in autonomous driving, I will talk about a new perception dataset we have built to benchmark point cloud forecasting. This new dataset is all-inclusive in terms of sensor modalities, annotations and environmental variations. We hope that this dataset can help benchmark progress in point cloud forecasting and spur innovation in multi-sensor, multi-task perception systems.
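For clarity, a schematic sketch of the forecast-then-detect alternative described in this abstract is shown below; `forecaster`, `detector`, and `tracker` are hypothetical placeholders standing in for the actual models.

```python
# Schematic of the forecast-then-detect pipeline: predict future LiDAR sweeps
# first, then run detection and tracking on the predictions.
def forecast_then_detect(past_point_clouds, forecaster, detector, tracker):
    future_clouds = forecaster(past_point_clouds)           # predicted future sweeps
    future_boxes = [detector(pc) for pc in future_clouds]   # detect on predictions
    return tracker(future_boxes)                            # future object trajectories
```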

Director of Computer Vision at Innoviz Technologies

Amir is an algorithms and software engineering leader with more than 10 years of experience leading complex, multi-disciplinary technological projects. Since joining Innoviz in its early days in 2016, Amir was responsible for the low-level algorithms of the company's product. He then moved on to develop the company's perception software from scratch. Nowadays, Amir oversees Innoviz's perception solution, from data collection through algorithms to deployment. Prior to working at Innoviz, Amir spent 6 years in the Israel Defense Forces in various roles, in the last of which he led two novel, multi-disciplinary projects.

Perception Data Pipeline at Innoviz Technologies

Perception for autonomous vehicles has greatly improved over the last few years. Advancement and novelty in this field depend on state-of-the-art algorithms as well as on data collection and its usage. Now that data in itself has become ubiquitous, the main challenge is using it correctly and efficiently. During this talk, we will share the data flow between the Algorithm, Embedded and Testing teams at Innoviz, and describe how the data pipeline is used to constantly improve our perception product, the software layer that complements our high-performance LiDAR sensor, InnovizOne.


Research Scientist at Motional

Venice is a Senior Research Scientist with the Map Scalability team at Motional. She leads a group developing machine learning solutions for mapping, which includes semantic understanding of the surroundings, automated map annotations and ML-based map update/validation. Venice also has experience working on image and LiDAR networks for various perception tasks such as object detection and semantic segmentation for autonomous vehicles. She holds a PhD from Nanyang Technological University, Singapore, where her work focused on face analysis, person re-identification, and image/multimedia retrieval using feature learning, deep learning and metric learning techniques.

Lidar Segmentation at Motional

Point cloud semantic segmentation is a critical task for autonomous systems. In particular, this task provides useful semantic information that enables the building of crisp, high-definition maps from the LiDAR point clouds used in autonomous vehicles (AVs) for localization and road understanding. In this talk, I will present nuScenes-lidarseg, an extension of our large-scale public dataset that provides 1.4 billion annotated points across 40,000 point clouds and 1,000 scenes. I will then focus on our current research on LiDAR segmentation, AMVNet, a simple late-fusion LiDAR segmentation approach evaluated on the nuScenes-lidarseg dataset along with other benchmark datasets.
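As a toy illustration of late fusion for LiDAR segmentation, the sketch below combines per-point class scores from a range-view network and a point-view network by simple logit averaging; this averaging rule is an assumption for illustration and is not AMVNet's actual fusion scheme.

```python
# Toy late-fusion sketch for LiDAR semantic segmentation: combine per-point
# class scores predicted by two different views of the same point cloud.
import numpy as np

def late_fusion(range_view_logits, point_view_logits):
    """Both inputs: (num_points, num_classes) per-point class scores."""
    fused = 0.5 * (range_view_logits + point_view_logits)   # average the two views
    return fused.argmax(axis=1)                             # per-point labels
```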

Product Manager, Arbe

Sani Ronen is a leader in the product group at Arbe, with over 26 years of experience at various semiconductor and systems companies. His expertise includes phased-array and airborne radars, RF, and communications, accumulated during his time at Microchip, Microsemi, Radwin, Siemens and the IAF. He holds a bachelor's degree in Electrical & Computer Engineering from Ben-Gurion University in Beer-Sheva, Israel, and a master's in Entrepreneurship and Innovation from the Swinburne University of Technology in Australia.

Using Artificial Intelligence layer to transform high-resolution radar point cloud into insights for Autonomous Driving applications

We will present how utilizing an AI layer for post-processing the radar's data enables many advanced real-time features, such as accurate inference of the vehicle's ego-motion; tracking of objects, their bounding boxes and motion vectors across the entire field of view; accurate free-space mapping to distinguish drivable from non-drivable environments; and even simultaneous localization and mapping. In this talk we will present 4D imaging radar technology and show why it is the perfect counterpart to the camera: not offering redundancy by doing "more of the same", but relying on a different technology whose strengths and weaknesses complement those of the camera. Radars excel at depth, relative radial velocity, and long-range sensing around the clock, while cameras are perfect for contrast-based information, accurate tangential velocities, and classification when light permits. Therefore, achieving true safety for any Level 2 application, hands-free driving, and full autonomy, which is, after all, the ultimate goal, must rely on the fusion of both sensors.

Mercedes-Benz AG

Ole Schumann received the master's degree in physics from Göttingen University, Germany, in 2016. From 2017 to 2018, he was a Ph.D. student at Daimler AG in the radar perception team. He received his Ph.D. degree from TU Dortmund University, Dortmund, Germany, in 2021 and works in the research and development department for autonomous driving at Mercedes-Benz AG. His research interests include (semi-)supervised machine learning algorithms as well as clustering methods suitable for scene understanding with radar.

Radar Perception for Automated Driving – Data and Methods

In comparison to camera and lidar, radar sensors are often only marginally considered when it comes to datasets for machine learning applications. In this talk, a new radar dataset is introduced which should help to shift some focus away from the mainstream sensors. The RadarScenes dataset (www.radar-scenes.com) contains real-world measurements from four automotive radar sensors with point-wise annotations. Some of the algorithms and methods already developed on this dataset will be introduced, and examples of machine learning approaches for the classification of moving road users will be presented.

Mercedes-Benz AG

Stefan Haag has been a Ph.D. student at Mercedes-Benz AG in Sindelfingen, Germany, and the University of Bonn, Germany, since 2017. He has worked in the radar perception and sensor data fusion teams at Mercedes-Benz. His research interests lie in low-level sensor data fusion for autonomous driving applications for dynamic and static targets, and in combining model-driven with data-driven approaches to improve the robustness and performance of 360° surround fusion algorithms. He received his master's degree in mathematics from the University of Ulm, Germany, in 2017.

Co-Development of Automatic Annotation for Machine Learning and Sensor Fusion Improvement System

Labeled radar datasets are crucial for establishing artificial intelligence-based methods on radar data for environment perception. In particular, fast and efficient acquisition of new data under a specific sensor setup is essential. We present a framework that utilizes Bayesian approaches and, eventually, fusion methods to produce reliable and precise object trajectories and shape estimates, which serve as annotation labels at the detection level under various degrees of supervision. Simultaneously, the framework continuously evaluates tracking performance and label quality through automated feedback. If manually labeled data is available, each processing module can be analyzed independently or in combination with other modules to enable closed-loop continuous improvement. The framework allows the integration of information from additional sensors for improved results, but it can also be executed as a radar-only application.
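To give a flavor of the Bayesian building blocks such an auto-annotation framework rests on, the sketch below shows a textbook constant-velocity Kalman filter step that smooths noisy per-frame detections into a trajectory; it is a generic filter, not the authors' system.

```python
# Generic constant-velocity Kalman filter step (state = [x, y, vx, vy]) of the
# kind used to smooth per-frame detections into trajectories for auto-labeling.
import numpy as np

def kalman_step(x, P, z, dt=0.1, q=1.0, r=0.5):
    """x: (4,) state, P: (4, 4) covariance, z: (2,) detected position."""
    F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1.0]])
    H = np.array([[1, 0, 0, 0], [0, 1, 0, 0.0]])
    Q, R = q * np.eye(4), r * np.eye(2)

    # Predict with the constant-velocity motion model.
    x, P = F @ x, F @ P @ F.T + Q
    # Update with the measurement (position only).
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P
```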

Senior Research Scientist, Toyota Research Institute

Rareș Ambruș is a Senior Research Scientist in the Machine Learning team at the Toyota Research Institute (TRI) in Los Altos, CA, USA. His research interests lie at the intersection of robotics, computer vision and machine learning, with an emphasis on self-supervised learning for 3D perception. He received his PhD in 2017 from the Royal Institute of Technology (KTH), Sweden, focusing on self-supervised perception and mapping for mobile robots. He has 8+ years of industry experience working on autonomous vehicles, mobile robots and virtual/augmented reality, and has more than 25 publications and patents in top-tier computer vision, machine learning and robotics conferences.


Self-Supervised 3D Vision

Cameras are ubiquitous, and video data is widely available. In this talk I will cover our work at TRI on self-supervised learning and ways to leverage projective geometry to learn from videos without any human supervision. Starting from the now standard paradigm of self-supervising depth in videos, I will also cover extensions to multi-camera systems, non-standard camera models, visual odometry and keypoint learning. In a semi-supervised setting, we have developed networks that can leverage partial point clouds both at training and inference time, for increased accuracy and robustness. Additionally, we have shown that raw unlabeled data can be used to bootstrap and significantly improve monocular 3D object detection. Finally, I will present recent work that uses self-supervised monocular depth estimation as a proxy task to improve sim-to-real unsupervised domain adaptation for semantic segmentation.