Schedule

Program

All talk times are in the CET time zone on June 5th, 2022.

08:15-08:30 Opening

08:30-09:00 Implicit Neural Representations for Novel View Appearance, Content and Semantic Synthesis, Yiyi Liao, Professor, Zhejiang University

09:00-09:30 Next-Gen Sensor Fusion for Next-Gen Sensors and Driving Functions, Eric Richter, Director Technology/Co-founder, BASELABS

09:30-10:00 Leveraging Physics and Geometry in 3D Visual Perception, Dr. Christos Sakaridis, Postdoctoral Researcher Computer Vision Lab ETH Zurich

10:00-10:15 PVFusion: Point-Voxel Fusion for Multimodal 3D Detection, Ke Wang*, Zhichuang Zhang, Tao Chen, Shulian Zhao [paper]

10:15-10:45 Coffee break

10:45-11:15 Supervised & Unsupervised Approaches for LiDAR-Based Perception of AVs in Urban Environments, Prof. Dr. Cyrill Stachniss, Head of Photogrammetry and Robotics Labs, University of Bonn

11:15-11:45 Navya 3D Segmentation Dataset for Large-Scale Semantic Segmentation, Alexandre Almin, Navya

11:45-12:00 Residual MBConv Submanifold Module for 3D LiDAR-based Object Detection, Lie Guo*, Liang Huang, Zhao Yibing [paper]

12:00-13:00 Lunch Break

13:00-13:30 Exploiting Representational Sparsity to Improve 3D Object Detector Runtime on Embedded Systems and Beyond, Kyle Vedder, PhD Student, Computer Science, University of Pennsylvania

13:30-14:00 Strategies and methods for automotive sensor fusion, Robert Laganiere, Professor University of Ottawa, CEO Sensor Cortek

14:00-14:30 3D object detection survey and trends based on LiDAR, Steve Han, Deep Learning Engineer at Qualcomm

14:30-15:00 Collaborative and Adversarial 3D Perception for Autonomous Driving, Yiming Li, PhD Student, New York University

15:00-15:15 Coffee Break

15:15-15:45 Multi-Sensor Safety Calibration for ADAS Applications, Mohammad Musa, Founder & CEO at Deepen AI

15:45-16:15 Closed and Open Problems in 3D Perception for Self-Driving, Jonah Philion, PhD student, University of Toronto

16:15-16:45 DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection, Yingwei Li, Ph.D. Student, Johns Hopkins University

16:45-17:00 Vision-based Large-scale 3D Semantic Mapping for Autonomous Driving Applications, Qing Cheng, Artisense

17:00 Closing

Speakers

Yiyi Liao, Professor, Zhejiang University

Yiyi Liao is an assistant professor at Zhejiang University. Prior to that, she was a postdoc in the Autonomous Vision Group, part of the University of Tübingen and the MPI for Intelligent Systems, working with Prof. Andreas Geiger. Her research interests lie in 3D computer vision, including 3D scene understanding, 3D reconstruction, and 3D controllable image synthesis.

Implicit Neural Representations for Novel View Appearance, Content and Semantic Synthesis

A photorealistic simulator is essential for autonomous driving, yet existing manually designed simulation environments come with a synthetic-to-real domain gap that is hard to mitigate. Recent advances in implicit neural representations, e.g., NeRF, have shown impressive performance in photorealistic novel view synthesis. This brings the hope of building a simulator based on the real world. Nevertheless, scaling NeRF to an ideal simulator still faces several challenges, such as slow rendering speed, the lack of content creation, and the absence of semantic information. In this talk, we will present our recent progress in tackling these challenges, including KiloNeRF for fast rendering, GRAF for creating novel content, and Panoptic NeRF for rendering semantic labels. In addition, we present the KITTI-360 dataset, a densely annotated, large-scale dataset with novel benchmarks at the intersection of vision, graphics, and robotics, intended to foster progress toward full autonomy.
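
For readers unfamiliar with how NeRF-style representations produce images, the sketch below shows the standard volume rendering step that methods like KiloNeRF and Panoptic NeRF build on: per-sample densities and colors along a camera ray are composited into one pixel. This is a generic illustration, not the speaker's code; the same weights can also composite semantic logits, which is one way semantic labels can be rendered.

```python
# Generic NeRF-style volume rendering along one ray (illustrative, not the
# speaker's code): composite per-sample densities and colors into a pixel.
import numpy as np

def render_ray(densities, colors, deltas):
    """densities: (N,) sigma values; colors: (N, 3) RGB; deltas: (N,) sample spacings."""
    alphas = 1.0 - np.exp(-densities * deltas)                        # per-segment opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]    # accumulated transmittance
    weights = alphas * trans                                          # contribution of each sample
    rgb = (weights[:, None] * colors).sum(axis=0)                     # composited pixel color
    return rgb, weights                                               # weights reusable for depth or semantics

# toy usage: 64 random samples along one ray
n = 64
rgb, w = render_ray(np.random.rand(n), np.random.rand(n, 3), np.full(n, 0.05))
```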

Jonah Philion, PhD Student, University of Toronto

Jonah Philion is a PhD student at the University of Toronto, where he is advised by Sanja Fidler. His research focuses on machine learning and computer vision, primarily with applications to self-driving perception and planning. He is also a research scientist at NVIDIA, where he works on simulation for self-driving. Prior to moving to Toronto, Jonah was an early hire at ISEE, a startup working on self-driving for warehouse yard trucks.

Closed and Open Problems in 3D Perception for Self-Driving

The goal of perception for autonomous vehicles is to extract semantic representations from multiple sensors and fuse these representations into a single "bird's-eye-view" coordinate frame for consumption by motion planning. In this talk, I'll present two end-to-end architectures that directly extract a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras. The core mechanism of these architectures is a module that "lifts" each image individually into a frustum of features for each camera, then "splats" all frustums into a rasterized bird's-eye-view grid. By training on the entire camera rig, we provide evidence that our model is able to learn not only how to represent images but how to fuse predictions from all cameras into a single cohesive representation of the scene while being robust to calibration error. On standard bird's-eye-view tasks such as object detection, object segmentation and map segmentation, our model outperforms all baselines and prior work. In pursuit of the goal of learning dense representations for motion planning, we show that the representations inferred by our model enable interpretable end-to-end motion planning by "shooting" template trajectories into a bird's-eye-view cost map output by our network.
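
As a rough illustration of the "lift then splat" mechanism described above (a simplified sketch, not the authors' implementation), the snippet below lifts image features into a depth-distributed frustum and scatter-adds them into a flattened BEV grid; the mapping from frustum points to BEV cells, which in practice comes from camera calibration, is left as an input.

```python
# Simplified sketch of the lift-splat idea (not the authors' implementation).
import torch

def lift(feats, depth_logits):
    """feats: (C, H, W) image features; depth_logits: (D, H, W) per-pixel depth scores."""
    depth_prob = depth_logits.softmax(dim=0)               # categorical depth distribution per pixel
    return depth_prob.unsqueeze(1) * feats.unsqueeze(0)    # (D, C, H, W) frustum of features

def splat(frustum, bev_index, num_cells):
    """bev_index: (D*H*W,) flat BEV cell id for each frustum point (from camera geometry)."""
    D, C, H, W = frustum.shape
    flat = frustum.permute(0, 2, 3, 1).reshape(-1, C)      # one feature row per frustum point
    bev = torch.zeros(num_cells, C)
    bev.index_add_(0, bev_index, flat)                     # sum-pool all features landing in each cell
    return bev                                             # (num_cells, C); reshape to a BEV grid later

# toy usage: one camera, 8 depth bins, a 200x200 BEV grid
C, D, H, W, G = 64, 8, 16, 40, 200
frustum = lift(torch.randn(C, H, W), torch.randn(D, H, W))
bev = splat(frustum, torch.randint(0, G * G, (D * H * W,)), G * G)
```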

Eric Richter, Director Technology/Co-founder, BASELABS

Eric Richter is Director Technology Innovations and co-founder at BASELABS. With his strong technology and market perspective, he brings in new requirements for developing BASELABS' products, such as sensor fusion for embedded and series devices. He holds a PhD in the field of data fusion for automated driving.

Next-Gen Sensor Fusion for Next-Gen Sensors and Driving Functions

Next-generation driving functions, like automated driving in urban environments or automated parking, are targeting an increasing number of highly complex scenarios with many different traffic participants and object types.

Next-gen sensors, like high-resolution radars or cameras with semantic segmentation information, provide significantly more detail-rich information about the environment and act as one enabler of next-gen driving functions. Current sensor fusion approaches, like the combination of object fusion and static grid fusion, feature high modularity. While this has many benefits, it results in an early and irreversible reduction of information.

This talk will outline how this inherent property of current-gen sensor fusion approaches creates a high risk of failure in challenging scenarios.

We will present a next-gen sensor fusion technology that provides information on static and dynamic objects as well as free space, with high quality and robustness, through an integrated sensor fusion approach.

This so-called Dynamic Grid approach acts as the second enabler for next-gen driving functions.

Key Take-Aways:

  • Current sensor fusion approaches have inherent properties that limit their applicability for next-generation driving functions and sensors.

  • Integrated sensor fusion approaches resolve these limitations and thus enable next-gen driving functions.

  • The Dynamic Grid is the integrated next-gen sensor fusion technology, ready for series production.

Robert Laganiere, Professor, University of Ottawa; CEO, Sensor Cortek

Robert is a professor at the School of Electrical Engineering and Computer Science of the University of Ottawa and the CEO of Sensor Cortek, a startup company developing AI solutions for perception systems. Robert is the co-author of several scientific publications and patents in content-based video analysis, visual surveillance, embedded vision, driver-assistance, and autonomous driving applications. Robert authored the OpenCV2 Computer Vision Application Programming Cookbook (2011) and co-authored Object Oriented Software Development (2001). He co-founded Visual Cortek in 2006, an Ottawa-based video analytics startup that was later acquired by iWatchLife in 2009. He also co-founded Tempo Analytics in 2016, a company proposing retail analytics solutions, and founded Sensor Cortek Inc. in 2018.

Strategies and methods for automotive sensor fusion

The development of road vehicles with a high level of autonomy requires advanced perception capabilities. These vehicles are generally equipped with three main sensor types: cameras, lidar, and radar. However, the intrinsic limitations of each sensor affect the performance of the perception task. One way to overcome this issue and increase overall performance is to combine the information coming from different sensor modalities. This is the objective of sensor fusion: to combine the information coming from different sensors and thus improve the perceptual ability of the vehicle. This way the vehicle can better operate under challenging environmental conditions by relying on the sensor data that is least impacted by the current situation (e.g., poor lighting, adverse weather). In this presentation, we will review the main sensor fusion strategies that can be used for combining heterogeneous sensor data. In particular, we will discuss the three main fusion methods that can be applied in a perception system, namely early fusion, late fusion, and mid-level fusion.
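
To make the distinction concrete, here is a schematic sketch (hypothetical module names, one camera and one lidar branch) of the three fusion levels: early fusion concatenates aligned raw inputs before a shared network, mid-level fusion merges per-sensor feature maps, and late fusion merges per-sensor detection lists.

```python
# Schematic sketch of early, mid-level, and late fusion (hypothetical modules).
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Fuse spatially aligned raw inputs first, then run one shared network."""
    def __init__(self, backbone, head):
        super().__init__()
        self.backbone, self.head = backbone, head
    def forward(self, cam_tensor, lidar_tensor):
        x = torch.cat([cam_tensor, lidar_tensor], dim=1)   # channel-wise concat of projected inputs
        return self.head(self.backbone(x))

class MidLevelFusion(nn.Module):
    """Extract features per sensor, fuse the feature maps, then detect."""
    def __init__(self, cam_net, lidar_net, head):
        super().__init__()
        self.cam_net, self.lidar_net, self.head = cam_net, lidar_net, head
    def forward(self, cam_tensor, lidar_tensor):
        f = torch.cat([self.cam_net(cam_tensor), self.lidar_net(lidar_tensor)], dim=1)
        return self.head(f)

def late_fusion(cam_detections, lidar_detections, associate, merge):
    """Run a full detector per sensor, then associate and merge the object lists."""
    return [merge(c, l) for c, l in associate(cam_detections, lidar_detections)]
```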

Yiming Li, PhD Student, New York University

Yiming Li is a Ph.D. candidate in the AI4CE Lab at New York University (NYU), supported by the Dean's PhD Fellowship. His research interest primarily lies in robot vision and learning, with applications in cyber-physical systems, autonomous driving, and human-robot interaction. More specifically, he is interested in collaborative and adversarial perception, egocentric vision, multi-modal perception, and embodied AI. His works have been published in top-tier conferences including NeurIPS, CVPR, ICCV, ICRA, and IROS. During his first Ph.D. year, he visited the MARS Lab in the Institute for Interdisciplinary Information Sciences (IIIS) at Tsinghua University, the MediaBrain Group in the School of Electronic Information and Electrical Engineering at Shanghai Jiao Tong University (SJTU), and the Institute for AI Industry Research (AIR) at Tsinghua University. He obtained a bachelor's degree in mechatronics, manufacture, and automation from Tongji University in Shanghai with honors.

Collaborative and Adversarial 3D Perception for Autonomous Driving

Robust and reliable perception systems serve as the "eyes" of autonomous vehicles. LiDAR is a widely applied perception sensor in autonomous vehicles for capturing 3D geometry information of the environment. However, LiDAR-based perception faces many challenges such as data sparsity, occlusions, and motion distortion. In this talk, I will show how we design novel 3D deep learning algorithms from two aspects, collaborative and adversarial, in order to improve the robustness of LiDAR-based 3D perception. For effective and efficient collaborative perception, we propose DiscoNet. It uses a dynamic directed graph with matrix-valued edge weights for an ego-vehicle to adaptively retrieve the most important complementary information from its neighboring vehicles, which could improve its own perception performance and robustness. Besides collaborative perception, we also study the adversarial robustness of LiDAR-based perception, and reveal an often-overlooked vulnerability that lies in the LiDAR motion correction process. We show that spoofing a vehicle's trajectory estimate with small adversarial perturbations can jeopardize LiDAR perception. We hope our collaborative and adversarial 3D perception research can help improve the robustness and safety of autonomous driving systems.
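
The sketch below gives a simplified reading of the collaborative aggregation idea (not the authors' DiscoNet code): the ego vehicle fuses neighbors' BEV feature maps, already warped into its own frame, with learned spatially varying edge weights. For brevity the weight here is a single scalar per BEV cell, whereas the paper describes matrix-valued edge weights.

```python
# Simplified collaborative BEV fusion sketch (scalar per-cell edge weights;
# the actual DiscoNet formulation uses matrix-valued weights).
import torch
import torch.nn as nn

class CollaborativeFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # predicts one weight map per (ego, sender) pair from their concatenated features
        self.edge_weight = nn.Conv2d(2 * channels, 1, kernel_size=1)

    def forward(self, ego_feat, neighbor_feats):
        """ego_feat: (C, H, W); neighbor_feats: list of (C, H, W), warped into the ego frame."""
        feats = [ego_feat] + neighbor_feats
        scores = [self.edge_weight(torch.cat([ego_feat, f], dim=0).unsqueeze(0)) for f in feats]
        weights = torch.softmax(torch.cat(scores, dim=0), dim=0)         # (N+1, 1, H, W), sums to 1 per cell
        fused = sum(w * f.unsqueeze(0) for w, f in zip(weights, feats))  # weighted sum of feature maps
        return fused.squeeze(0)                                          # fused (C, H, W) BEV features

# toy usage: ego plus two neighboring vehicles
fuse = CollaborativeFusion(channels=64)
out = fuse(torch.randn(64, 128, 128), [torch.randn(64, 128, 128) for _ in range(2)])
```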

Cyrill Stachniss, Head of Photogrammetry and Robotics Labs, University of Bonn

Cyrill Stachniss is a Full Professor at the University of Bonn and heads the lab for Photogrammetry and Robotics. He is additionally a Visiting Professor in Engineering at the University of Oxford. Before working in Bonn, he was a lecturer at the University of Freiburg in Germany, a guest lecturer at the University of Zaragoza in Spain, and a senior researcher at the Swiss Federal Institute of Technology in the group of Roland Siegwart. Cyrill Stachniss finished his habilitation in 2009 and received his PhD from the University of Freiburg in 2006 with a thesis entitled "Exploration and Mapping with Mobile Robots," supervised by Wolfram Burgard. From 2008 to 2013, he was an associate editor of the IEEE Transactions on Robotics; since 2010 he has been a Microsoft Research Faculty Fellow; and he received the IEEE RAS Early Career Award in 2013. Since 2015, he has been a senior editor for the IEEE Robotics and Automation Letters. He is the spokesperson of the DFG Cluster of Excellence EXC 2070 "PhenoRob – Robotics and Phenotyping for Sustainable Crop Production" and of the DFG Research Unit FOR 1505 "Mapping on Demand". He was furthermore involved in the coordination of several EC-funded FP7 and H2020 projects. In his research, he focuses on probabilistic techniques in the context of mobile robotics, navigation, and perception. Central areas of his research are solutions to the simultaneous localization and mapping problem, visual perception, robot learning, self-driving cars, agricultural robotics, and unmanned aerial vehicles. He has coauthored over 230 peer-reviewed publications.

Supervised and Unsupervised Approaches for LiDAR-Based Perception of Autonomous Vehicles in Urban Environments

Self-driving cars, robots, and other autonomous vehicles need models of their surroundings to operate effectively and efficiently. Often, these are geometric and semantic models of the world in which the vehicles operate. In this talk, I will present recent developments in the context of supervised and unsupervised learning for the perception system of autonomous cars. This includes approaches for semantic estimation, compact mapping, and predicting future states of the environment.

Dr. Christos Sakaridis, Postdoctoral Researcher, Computer Vision Lab, ETH Zurich, Switzerland

Dr. Christos Sakaridis is a postdoctoral researcher at the Computer Vision Lab, ETH Zurich. His broad research fields are computer vision and machine learning. The focus of his research is on high-level visual perception, involving adverse visual conditions, domain adaptation, semantic segmentation, depth estimation, object detection, synthetic data generation, and the fusion of multiple sensors including lidar, radar, and event cameras, with emphasis on their application to autonomous cars and robots. Since 2021, he has been the Principal Engineer in TRACE-Zurich, a project on computer vision for autonomous cars running at the Computer Vision Lab and funded by Toyota Motor Europe. Moreover, he is the Team Leader in the EFCL project Sensor Fusion, in which adaptive sensor fusion architectures for high-level visual perception are developed. He obtained his PhD in Electrical Engineering and Information Technology from ETH Zurich in June 2021, working at the Computer Vision Lab under the supervision of Prof. Luc Van Gool. Prior to joining the Computer Vision Lab, he received his MSc in Computer Science from ETH Zurich in 2016 and his Diploma in Electrical and Computer Engineering from the National Technical University of Athens in 2014, conducting his Diploma thesis in the CVSP Group under the supervision of Prof. Petros Maragos.

Leveraging Physics and Geometry in 3D Visual Perception

3D visual perception is a key enabler for automated driving, with exemplar tasks being 3D object detection and depth estimation, among others. In this talk, we will review how the prior knowledge we have about the physics of the acquisition of 3D sensor measurements as well as the geometric structure of the input scenes we observe can be leveraged to create representative training data and design well-tailored models which boost the performance of 3D perception. On the physics side, we will show how we have applied the linear system model for lidar optics to adverse weather conditions, namely fog and snow, to establish a physically-based non-learned transformation of clear-weather lidar point clouds to foggy and snowy counterparts. The resulting adverse weather simulation generates partially synthetic data which are shown to benefit several state-of-the-art methods for 3D object detection, without the need for access to real annotated adverse-weather data. On the geometry side, we will present P3Depth, a novel method for monocular depth estimation, which is based on a piecewise planarity prior, motivated by the high degree of regularity both in indoor and outdoor man-made scenes. The method uses this prior implicitly, by including an intermediate plane coefficient representation in the network, which is used to learn interactions between pixels in order to exploit potential co-planarities in predicting depth. P3Depth matches or even exceeds the state of the art in indoor benchmarks despite using a lighter network, and is competitive in outdoor benchmarks, ranking first in the closer depth range.
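
To make the plane-coefficient idea tangible, the sketch below shows the basic geometry of reading a depth value off a predicted plane: back-project the pixel to a viewing ray and intersect it with the plane. The parameterization here (unit normal plus offset) is illustrative and not necessarily the exact one used in P3Depth.

```python
# Illustrative plane-to-depth conversion (parameterization chosen for
# clarity; not necessarily P3Depth's exact formulation).
import numpy as np

def depth_from_plane(u, v, normal, offset, K_inv):
    """Depth of pixel (u, v) assuming it lies on the plane n . X = offset."""
    ray = K_inv @ np.array([u, v, 1.0])   # back-projected viewing ray direction
    return offset / (normal @ ray)        # scale along the ray where it meets the plane

# toy usage: pinhole camera, fronto-parallel plane 10 m ahead (normal along +z)
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
z = depth_from_plane(320, 240, normal=np.array([0.0, 0.0, 1.0]), offset=10.0,
                     K_inv=np.linalg.inv(K))
print(z)  # 10.0 at the principal point
```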

Steve Han, Deep Learning Engineer at Qualcomm

Steve Han is a deep learning engineer at Qualcomm AI Research. His research interests include computer vision, deep learning, 3D object detection, 3D segmentation, and autonomous driving. Before joining Qualcomm, Steve worked at a startup company on medical image (CT, MRI) segmentation and object detection for diagnostic assistance. He has publications in top-tier conferences such as CVPR, NeurIPS, and ECCV.

3D object detection survey and trends based on LiDAR

3D object detection based on point clouds is widely used in many applications, especially autonomous driving. In this presentation, we will go through the SOTA methods for 3D object detection based on LiDAR point clouds and discuss recent trends from the perspective of accuracy vs. latency. The presentation will mainly cover LiDAR-only methods and some recent papers on LiDAR-camera fusion. Our recently published "Fast Polar Attentive 3D Object Detection on LiDAR Point Clouds" will also be presented. The proposed method focuses on reducing computational load and latency while maintaining high accuracy for 3D object detection. Specifically, a novel streaming detector utilizes a polar-space feature representation to provide faster inference.
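
As a rough illustration of the polar-space representation mentioned above (not the paper's code; grid sizes are made up), the snippet below bins Cartesian LiDAR points into a polar BEV grid, which is the layout a polar streaming detector would consume wedge by wedge.

```python
# Illustrative Cartesian-to-polar BEV binning for LiDAR points.
import numpy as np

def polar_bev_indices(points, r_max=50.0, n_r=256, n_theta=512):
    """points: (N, 3) xyz. Returns (N, 2) integer (range_bin, azimuth_bin) indices."""
    x, y = points[:, 0], points[:, 1]
    r = np.hypot(x, y)                                         # range from the sensor
    theta = np.arctan2(y, x)                                   # azimuth in [-pi, pi)
    r_bin = np.clip((r / r_max * n_r).astype(int), 0, n_r - 1)
    t_bin = ((theta + np.pi) / (2 * np.pi) * n_theta).astype(int) % n_theta
    return np.stack([r_bin, t_bin], axis=1)

# toy usage: 1000 random points
pts = np.random.uniform(-50, 50, size=(1000, 3))
idx = polar_bev_indices(pts)
```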

Mohammad Musa, Founder & CEO, Deepen AI

Mohammad founded Deepen AI in 2017 to solve critical bottlenecks preventing faster adoption of autonomy and robotics products. He was Head of Product Strategy & Launch for Google Apps, now part of Google Cloud Platform. Before Google, he worked as a software engineer at Havok (acquired by Intel), Emergent Game Technology (acquired by Gamebase), and Sonics (acquired by Facebook).

Multi-Sensor Safety Calibration for ADAS Applications

Sensors are the eyes of the vehicle, enabling everything from ADAS (Advanced Driver-Assistance Systems) features such as automated braking and lane-keeping to eliminating the driver entirely. Sensors may go out of calibration as a result of normal daily use, changes in operating conditions such as temperature or vibration, or more serious events such as accidents. Sensor calibration is therefore an essential part of ensuring safety in ADAS systems. This session will present ways to cut the time spent calibrating multi-sensor data from hours to minutes, massively accelerating sensor fusion applications.
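
As background for why calibration quality matters, the sketch below shows the projection that a lidar-to-camera calibration enables and that miscalibration visibly breaks: transform lidar points by the extrinsics, then project with the camera intrinsics. This is a generic illustration, not Deepen AI's tooling.

```python
# Generic lidar-to-image projection using extrinsic and intrinsic calibration.
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_lidar, K):
    """points_lidar: (N, 3); T_cam_lidar: (4, 4) extrinsics; K: (3, 3) intrinsics."""
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])  # homogeneous coordinates
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]                          # into the camera frame
    in_front = pts_cam[:, 2] > 0.1                                      # keep points ahead of the camera
    uvw = (K @ pts_cam[in_front].T).T
    uv = uvw[:, :2] / uvw[:, 2:3]                                       # pixel coordinates
    return uv, in_front

# toy usage: identity extrinsics, simple pinhole camera
K = np.array([[600.0, 0, 640], [0, 600.0, 360], [0, 0, 1]])
pts = np.random.uniform(-5, 5, (100, 3)) + np.array([0.0, 0.0, 10.0])
uv, mask = project_lidar_to_image(pts, np.eye(4), K)
```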

Alexandre Almin, Deep Learning Research Engineer, Navya

Alexandre Almin is a Deep Learning Research Engineer at Navya, specialized in perception for autonomous vehicles. His research focuses on deep learning, computer vision, and LiDAR-based tasks such as semantic segmentation and object detection applied to self-driving cars. Prior to working on deep learning applications, he developed strong experience in terrestrial mapping applications for self-driving cars.

Large-Scale Semantic Segmentation and Dataset Distillation with Bayesian Active Learning

Semantic segmentation for mapping applications in autonomous driving has gained a lot of attention lately, especially with the arrival of publicly available datasets such as SemanticKITTI, nuScenes, and the Waymo Open Dataset. These autonomous driving datasets have progressively grown in size in the past few years to enable better deep representation learning. Active learning (AL) has recently regained attention as a way to reduce annotation costs and dataset size, yet it remains relatively unexplored for autonomous driving datasets, especially on point cloud data from LiDARs. We conducted an applied study of Bayesian active learning for dataset distillation on the semantic segmentation task and of the effect of data augmentation, "LiDAR dataset distillation within bayesian active learning framework - Understanding the effect of data augmentation", recently published at VISAPP. In this context, the presentation will introduce a novel and fully annotated dataset called N3DS, together with the complete production pipeline associated with it. Our research results on dataset distillation applied to the new N3DS dataset will also be presented.
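
For orientation, the snippet below sketches one common Bayesian active learning acquisition step in generic form (this is not Navya's pipeline): score each unlabeled scan by the predictive entropy of MC-dropout predictions and send the most uncertain scans for annotation.

```python
# Generic entropy-based acquisition step for Bayesian active learning
# (illustrative only; not the pipeline presented in the talk).
import numpy as np

def predictive_entropy(mc_probs):
    """mc_probs: (T, N_points, C) softmax outputs from T stochastic forward passes."""
    mean_p = mc_probs.mean(axis=0)                           # MC-averaged per-point prediction
    ent = -(mean_p * np.log(mean_p + 1e-12)).sum(axis=-1)    # per-point predictive entropy
    return ent.mean()                                        # scan-level uncertainty score

def select_scans(scores, budget):
    """Pick the `budget` most uncertain scans to send for labeling."""
    return np.argsort(scores)[::-1][:budget]

# toy usage: 20 unlabeled scans, 8 MC passes, 1000 points, 10 classes
rng = np.random.default_rng(0)
scores = np.array([predictive_entropy(rng.dirichlet(np.ones(10), size=(8, 1000)))
                   for _ in range(20)])
to_label = select_scans(scores, budget=5)
```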

Kyle Vedder, PhD Student, Computer Science, University of Pennsylvania

Kyle Vedder is a PhD candidate at the University of Pennsylvania where he is advised by Eric Eaton. Motivated by his goal of developing elder care robots, his research focuses on 3D object detection and self-supervised methods for object understanding on mobile platforms.

Exploiting Representational Sparsity to Improve 3D Object Detector Runtime on Embedded Systems and Beyond

Bird's Eye View (BEV) is a popular representation for processing 3D point clouds, and by its nature is fundamentally sparse. Motivated by the computational limitations of mobile robot platforms, we create a fast, high-performance BEV 3D object detector that maintains and exploits this input sparsity to decrease runtimes over non-sparse baselines and avoids the tradeoff between pseudoimage area and runtime. We present results on KITTI, a canonical 3D detection dataset, and Matterport-Chair, a novel Matterport3D-derived chair detection dataset from scenes in real furnished homes. We evaluate runtime characteristics using a desktop GPU, an embedded ML accelerator, and a robot CPU, demonstrating that our method results in significant detection speedups (2X or more) for embedded systems with only a modest decrease in detection quality. Our work represents a new approach for practitioners to optimize models for embedded systems by maintaining and exploiting input sparsity throughout their entire pipeline to reduce runtime and resource usage while preserving detection performance.
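
A minimal way to picture "maintaining and exploiting input sparsity" (a generic sketch, not the authors' implementation): gather only the occupied cells of the BEV pseudoimage, process those, and scatter the results back.

```python
# Generic gather/scatter sketch for a sparse BEV pseudoimage.
import torch

def dense_to_sparse(bev):
    """bev: (C, H, W). Keep only the non-empty cells."""
    occupied = bev.abs().sum(dim=0) > 0              # (H, W) occupancy mask
    coords = occupied.nonzero(as_tuple=False)        # (M, 2) occupied cell coordinates
    feats = bev[:, coords[:, 0], coords[:, 1]].T     # (M, C) features of occupied cells only
    return feats, coords

def sparse_to_dense(feats, coords, shape):
    """Scatter processed per-cell features back into a dense (C, H, W) grid."""
    out = torch.zeros(feats.shape[1], *shape)
    out[:, coords[:, 0], coords[:, 1]] = feats.T
    return out

# toy usage: a 512x512 BEV grid where roughly 2% of cells are occupied
bev = torch.zeros(64, 512, 512)
mask = torch.rand(512, 512) < 0.02
bev[:, mask] = torch.randn(64, int(mask.sum()))
feats, coords = dense_to_sparse(bev)     # a per-cell MLP or sparse convolution would run on `feats`
restored = sparse_to_dense(feats, coords, (512, 512))
```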

Yingwei Li, Ph.D. Candidate in Computer Science, Johns Hopkins University

Yingwei Li is a fourth-year Ph.D. candidate in Computer Science at Johns Hopkins University, advised by Bloomberg Distinguished Professor Dr. Alan Yuille. He is a member of the Computational Cognition, Vision, and Learning group. He obtained a B.S. in Computer Science at Fudan University in 2018. He has also spent time at Google Research, Waymo, ByteDance, NTU, and TuSimple. His research interests mainly lie in computer vision, especially autonomous driving, robust representation learning, multi-modality fusion, automated machine learning, and medical machine intelligence.

DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection

Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving. While prevalent multi-modal methods simply decorate raw lidar point clouds with camera features and feed them directly to existing 3D detection models, our study shows that fusing camera features with deep lidar features instead of raw points can lead to better performance. However, as those features are often augmented and aggregated, a key challenge in fusion is how to effectively align the transformed features from the two modalities. In this work, we propose two novel techniques: InverseAug, which inverts geometry-related augmentations, e.g., rotation, to enable accurate geometric alignment between lidar points and image pixels, and LearnableAlign, which leverages cross-attention to dynamically capture the correlations between image and lidar features during fusion. Based on InverseAug and LearnableAlign, we develop a family of generic multi-modal 3D detection models named DeepFusion, which is more accurate than previous methods. For example, DeepFusion improves the PointPillars, CenterPoint, and 3D-MAN baselines on pedestrian detection by 6.7, 8.9, and 6.2 LEVEL_2 APH, respectively. Notably, our models achieve state-of-the-art performance on the Waymo Open Dataset and show strong robustness against input corruptions and out-of-distribution data.
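
The sketch below is a simplified reading of the LearnableAlign step as described above (not the released DeepFusion code): each deep lidar feature attends over a small set of candidate camera features via cross-attention and folds the attended result back into the lidar branch.

```python
# Simplified cross-attention alignment between lidar and camera features
# (my reading of the described idea, not the released implementation).
import torch
import torch.nn as nn

class CrossModalAlign(nn.Module):
    def __init__(self, lidar_dim, cam_dim, attn_dim):
        super().__init__()
        self.q = nn.Linear(lidar_dim, attn_dim)
        self.k = nn.Linear(cam_dim, attn_dim)
        self.v = nn.Linear(cam_dim, attn_dim)
        self.out = nn.Linear(attn_dim, lidar_dim)

    def forward(self, lidar_feat, cam_feats):
        """lidar_feat: (N, lidar_dim); cam_feats: (N, K, cam_dim) candidate pixels per lidar feature."""
        q = self.q(lidar_feat).unsqueeze(1)                                        # (N, 1, attn_dim)
        k, v = self.k(cam_feats), self.v(cam_feats)                                # (N, K, attn_dim)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)   # (N, 1, K)
        fused_cam = (attn @ v).squeeze(1)                                          # attended camera feature
        return lidar_feat + self.out(fused_cam)                                    # fuse into the lidar branch

# toy usage: 1000 lidar features, 9 candidate camera features each
align = CrossModalAlign(lidar_dim=128, cam_dim=256, attn_dim=64)
fused = align(torch.randn(1000, 128), torch.randn(1000, 9, 256))
```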

Qing Cheng, Deep Learning Engineer at Artisense Corporation

Qing Cheng is a Computer Vision and Deep Learning Engineer at Artisense, co-founded by Prof. Daniel Cremers. He works on full-stack product development in the field of visual SLAM and relocalization. He has also carried out research projects on visual localization and 3D semantic mapping, as well as a project on 3D vehicle detection from monocular images at Bosch. He has publications at ICRA and GCPR.

Vision-based Large-scale 3D Semantic Mapping for Autonomous Driving Applications

3D perception is one of the most significant challenges in autonomous driving. High-quality 3D maps are a complementary source of information to online perception. We present a complete pipeline for 3D semantic mapping based solely on a stereo camera system. The pipeline comprises a direct sparse visual odometry front-end as well as a back-end for global optimization, including GNSS integration and semantic 3D point cloud labelling. We propose a simple but effective temporally consistent labelling scheme which improves the quality and consistency of the 3D point labels. The whole pipeline runs in real time. Qualitative and quantitative evaluations of our pipeline are performed on the KITTI-360 dataset. The results show the effectiveness of our proposed temporally consistent labelling scheme and the capability of our pipeline for efficient large-scale 3D semantic mapping. The large-scale mapping capability of our pipeline is further demonstrated by presenting a very large-scale semantic map covering 8,000 km of roads generated from data collected by a fleet of vehicles.
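
As a generic picture of what temporally consistent labelling has to accomplish (this is a plain majority-vote baseline, not necessarily the scheme presented in the talk): accumulate the per-frame semantic predictions that land on each 3D map point and keep the most frequent class.

```python
# Plain majority-vote label fusion across frames (illustrative baseline only).
import numpy as np

def fuse_labels(vote_table, point_ids, frame_labels):
    """vote_table: (P, C) running vote counts; point_ids: (M,) map-point index of each
    labelled observation in this frame; frame_labels: (M,) predicted class ids."""
    np.add.at(vote_table, (point_ids, frame_labels), 1)   # one vote per observation, unbuffered add
    return vote_table

# toy usage: 10,000 map points, 19 classes, three frames of predictions
P, C = 10_000, 19
votes = np.zeros((P, C), dtype=np.int64)
rng = np.random.default_rng(0)
for _ in range(3):
    ids = rng.integers(0, P, size=5_000)
    labels = rng.integers(0, C, size=5_000)
    votes = fuse_labels(votes, ids, labels)
consistent_labels = votes.argmax(axis=1)                  # majority label per 3D point
```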

Accepted papers

  1. PVFusion: Point-Voxel Fusion for Multimodal 3D Detection, Ke Wang*, Zhichuang Zhang, Tao Chen, Shulian Zhao

  2. Residual MBConv Submanifold Module for 3D LiDAR-based Object Detection, Lie Guo*, Liang Huang, Zhao Yibing