Schedule

Program

All times are in Pacific Time (PT).

  • 09:00-09:10: Introduction

  • 09:10-09:40: Zhijian Liu (MIT EECS), Point-Voxel CNN for Efficient 3D Deep Learning (PVCNN)

  • 09:40-10:10: Felix Heide (Princeton University and Algolux), Designing Cameras to Detect the “Invisible”: Computational Imaging for Adverse Conditions

  • 10:10-10:40: Dr. Jens Behley (Postdoc at Photogrammetry & Robotics Lab, University of Bonn), SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences

  • 10:40-11:10: Qingyong Hu (DPhil candidate at the University of Oxford), RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds

  • 11:10-11:40: Thomas Chaton (Senior Research Engineer, Sky, London), Torch-Points-3D — A unifying framework for deep learning on point clouds

  • 11:40-12:00: Accepted paper: Urs Niesen, Jayakrishnan Unnikrishnan, Camera-Radar Fusion for 3-D Depth Reconstruction

  • 12:00-13:00: Lunch Break

  • 13:00-13:30: Biao Gao (Peking University), Are we hungry for 3D LiDAR data for semantic segmentation? A new dataset SemanticPOSS and the research at PKU-POSS

  • 13:30-14:00: Prof. Wolfram Burgard (VP Automated Driving, Toyota Research Institute USA & Univ. of Freiburg), Self-Supervised Learning for Perception Tasks in Automated Driving

  • 14:00-14:30: Prof. D. Gavrila (TU-Delft, Netherlands), 3D Semantic Scene Analysis in Urban Traffic

  • 14:30-15:00: Prof. Matthias Niessner (Technical University of Munich), 3D Deep Learning & Self-Supervision

There will be no interactive Q&A sessions, in the interest of saving time for the speakers and organizers.

Invited Speakers

Prof. Wolfram Burgard

VP Automated Driving, Toyota Research Institute (USA) & University of Freiburg (Germany)

Wolfram Burgard is VP for Automated Driving Technology at the Toyota Research Institute. He is on leave from his professorship at the University of Freiburg, where he heads the research group for Autonomous Intelligent Systems. He is known for his contributions to mobile robot navigation, localization, and SLAM (simultaneous localization and mapping), and has published more than 350 papers at the intersection of robotics and artificial intelligence.

Self-Supervised Learning for Perception Tasks in Automated Driving

At the Toyota Research Institute we are following the one-system-two-modes approach to building truly automated cars. More precisely, we simultaneously aim for the L4/L5 chauffeur application and the guardian system, which can be considered a highly advanced driver assistance system of the future that prevents the driver from making mistakes. TRI aims to equip more and more consumer vehicles with guardian technology and in this way turn the entire Toyota fleet into a giant data collection system. To leverage the resulting data advantage, TRI performs substantial research in machine learning and, in addition to supervised methods, particularly focuses on unsupervised and self-supervised approaches. In this presentation, I will cover three recent results on self-supervised methods for perception problems in the context of automated driving, including novel approaches to inferring depth from monocular images and a new approach to panoptic segmentation.

Prof. Dariu M. Gavrila

Professor, Intelligent Vehicles section, TU Delft

Dariu M. Gavrila received the MSc degree in computer science from the Vrije Universiteit in Amsterdam, The Netherlands, and the PhD degree in computer science from the University of Maryland at College Park, USA, in 1996. He was a Visiting Researcher at the MIT Media Laboratory in 1996. From 1997 until 2016 he was with Daimler R&D in Ulm, Germany, where he eventually became a Distinguished Scientist. In 2010, he was appointed professor at the University of Amsterdam, chairing the area of Intelligent Perception Systems (part-time). Since 2016 he has headed the Intelligent Vehicles section at TU Delft as a Full Professor.

Over the past 20 years, Prof. Gavrila has focused on visual systems for detecting humans and their activity, with applications to intelligent vehicles, smart surveillance, and social robotics. He led the multi-year pedestrian detection research effort at Daimler, which was incorporated in the Mercedes-Benz S-, E-, and C-Class models (2013-2014). Currently, he performs research on self-driving cars in complex urban environments and is particularly interested in the anticipation of pedestrian and cyclist behavior.

Prof. D. M. Gavrila has graduated eight Ph.D. students and over 20 MS students. He has published 100+ papers in first-tier conferences and journals and is frequently cited in the areas of computer vision and intelligent vehicles (Google Scholar: 13,000+ citations). He has served as Area Chair and Associate Editor on many occasions and was Program Co-Chair of the IEEE Intelligent Vehicles 2016 conference. He received the I/O 2007 Award from the Netherlands Organisation for Scientific Research (NWO) and the IEEE Intelligent Transportation Systems Application Award 2014. He has made regular appearances in the international broadcast and print media. His personal Web site is www.gavrila.net (until 2016); his group's Web site is www.intelligent-vehicles.org (since 2016).


3D Semantic Scene Analysis in Urban Traffic

This talk presents recent work at TU Delft on 3D semantic scene analysis in urban traffic using video and/or LiDAR. First, I discuss fast and compact stereo image segmentation using Instance Stixels [1]. These augment single-frame stixels with instance information, which can be extracted by a CNN from the RGB image input. As a result, the novel Instance Stixels method efficiently computes stixels that account for boundaries of individual objects, and represents instances as grouped stixels that express connectivity. Second, I discuss the outcome of an experimental study on video- and LiDAR-based 3D person detection (i.e. pedestrians and cyclists) [2]. I report how the detection performance depends on distance, number of LiDAR points, amount of occlusion, and the optional use of LiDAR intensity cues. I include results on the new EuroCity Persons 2.5D (ECP2.5D) dataset, which is about one order of magnitude larger than KITTI regarding persons. Finally, I cover domain transfer experiments between the KITTI and ECP2.5D datasets, and discuss future challenges.

[1] T. Hehn, J.F.P. Kooij and D.M. Gavrila. “Fast and Compact Image Segmentation using Instance Stixels”. Under review at IEEE Trans. on Intelligent Vehicles, 2020.

[2] J. van der Sluis, E.A.I. Pool and D.M. Gavrila. “An Experimental Study on 3D Person Localization in Traffic Scenes”. Under review at IEEE Trans. on Intelligent Vehicles, 2020.



Zhijian Liu

Ph.D. student at the MIT EECS Department

Zhijian Liu is pursuing his Ph.D. degree at the MIT EECS Department, under the supervision of Prof. Song Han. Zhijian received his B.Eng. degree in computer science from Shanghai Jiao Tong University. His research mainly focuses on efficient and hardware-friendly machine learning and its applications in vision and language.

Point-Voxel CNN for Efficient 3D Deep Learning

3D neural networks are widely used in real-world applications (e.g., AR/VR headsets, self-driving cars). They are required to be fast and accurate; however, limited hardware resources on edge devices make these requirements rather challenging. Previous work processes 3D data using either voxel-based or point-based neural networks, but both types of 3D models are hardware-inefficient because of their large memory footprint and random memory access. In this work, we study 3D deep learning from the efficiency perspective. We first systematically analyze the bottlenecks of previous 3D methods. We then combine the best of point-based and voxel-based models and propose a novel hardware-efficient 3D primitive, Point-Voxel Convolution (PVConv). We evaluate our proposed method on various tasks including 3D part segmentation (for objects), 3D semantic segmentation (for indoor and outdoor scenes), and 3D object detection (for outdoor scenes). Across all four benchmarks, our proposed method achieves state-of-the-art performance with a 2.8x measured speedup on average. Furthermore, our model has been deployed on the autonomous racing vehicle of MIT Driverless, achieving a larger detection range, higher accuracy, and lower latency for efficient LiDAR perception.
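To make the fusion idea concrete, below is a minimal, self-contained PyTorch sketch of the general point-voxel pattern that PVConv builds on. It is not the authors' implementation: the class name, the fixed grid resolution, the scatter-mean voxelization, and the assumption of coordinates normalized to [0, 1] are all illustrative. A coarse voxel branch aggregates neighborhood context with dense 3D convolutions, a point-wise shared MLP preserves per-point detail, and the two outputs are fused by addition after trilinear devoxelization.

# Simplified sketch of point-voxel feature fusion (illustrative, not the
# authors' PVCNN code): a coarse voxel branch provides neighborhood context,
# a point-wise MLP preserves fine detail, and the outputs are added.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaivePointVoxelConv(nn.Module):
    def __init__(self, in_channels, out_channels, resolution=32):
        super().__init__()
        self.r = resolution
        # point branch: shared MLP, implemented as a 1x1 convolution over points
        self.point_mlp = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, 1),
            nn.BatchNorm1d(out_channels), nn.ReLU())
        # voxel branch: ordinary dense 3D convolution on a coarse grid
        self.voxel_conv = nn.Sequential(
            nn.Conv3d(in_channels, out_channels, 3, padding=1),
            nn.BatchNorm3d(out_channels), nn.ReLU())

    def forward(self, feats, coords):
        # feats: (B, C, N) per-point features; coords: (B, N, 3), normalized to [0, 1]
        B, C, N = feats.shape
        r = self.r
        idx = (coords.clamp(0, 1 - 1e-6) * r).long()               # voxel index per point
        flat = (idx[..., 0] * r + idx[..., 1]) * r + idx[..., 2]   # (B, N) flattened index

        # voxelize: scatter-mean the point features into a dense grid
        grid = feats.new_zeros(B, C, r * r * r)
        count = feats.new_zeros(B, 1, r * r * r)
        grid.scatter_add_(2, flat.unsqueeze(1).expand(-1, C, -1), feats)
        count.scatter_add_(2, flat.unsqueeze(1), torch.ones_like(feats[:, :1]))
        grid = (grid / count.clamp(min=1)).view(B, C, r, r, r)

        voxel_feats = self.voxel_conv(grid)                        # (B, C_out, r, r, r)

        # devoxelize: trilinear interpolation back to the point coordinates
        samp = (coords * 2 - 1)[..., [2, 1, 0]].view(B, N, 1, 1, 3)  # grid_sample expects [-1, 1], (x, y, z) order
        voxel_at_points = F.grid_sample(voxel_feats, samp,
                                        mode='bilinear', align_corners=False)
        voxel_at_points = voxel_at_points.view(B, -1, N)

        return voxel_at_points + self.point_mlp(feats)             # fuse both branches

The design point the sketch tries to convey is that the voxel branch touches a small, regular grid (cheap, cache-friendly convolutions), while the point branch never leaves the original point ordering, so no expensive neighbor search is needed.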

Felix Heide

CTO at Algolux | Incoming Professor at Princeton University

Felix Heide has co-authored over 50 publications and filed 6 patents. He received his Ph.D. from the University of British Columbia under the advisement of Professor Wolfgang Heidrich. He obtained his MSc from the University of Siegen, and was a postdoc at Stanford University. His doctoral dissertation won the Alain Fournier Ph.D. Dissertation Award and the SIGGRAPH outstanding doctoral dissertation award.

Designing Cameras to Detect the “Invisible”: Computational Imaging for Adverse Conditions

Imaging has become an essential part of how we communicate with each other, how autonomous agents sense the world and act independently, and how we research chemical reactions and biological processes. Today's imaging and computer vision systems, however, often fail for the “edge cases”, for example in low light, fog, snow, or highly dynamic scenes. These edge cases are a result of ambiguity present in the scene or signal itself, and ambiguity introduced by imperfect capture systems. In this talk, I will present several examples of computational imaging methods that resolve this ambiguity by jointly designing sensing and computation for domain-specific applications. Instead of relying on intermediate image representations, which are often optimized for human viewing, these cameras are designed end-to-end for a domain-specific task. In particular, I will show how to co-design optics, sensors, and ISP for automotive HDR imaging, detection, and tracking (beating Tesla's latest OTA Model S Autopilot), how to optimize thin freeform lenses for wide field-of-view applications, and how to extract accurate dense depth from three gated images (beating scanning lidar, such as Velodyne's HDL64). Finally, I will present computational imaging systems that extract domain-specific information from faint measurement noise using domain-specific priors, allowing us to use conventional intensity cameras or conventional Doppler radar to image “hidden” objects outside the direct line of sight at ranges of more than 20 m.

Biao Gao

PhD student at the Key Lab of Machine Perception (MOE), Peking University

Are we hungry for 3D LiDAR data for semantic segmentation? A new dataset SemanticPOSS and the research at PKU-POSS

This talk will introduce the recently published SemanticPOSS dataset and give an overview of the research of the PKU Intelligent Vehicle Group (POSS, http://www.poss.pku.edu.cn) on 3D LiDAR semantic scene understanding.

Nowadays, research on 3D LiDAR semantic scene understanding mostly faces the challenge of being “data hungry”, especially for deep-network-based models. We will present the results of our investigation of this “data hungry” situation in the domain from different viewpoints.

The challenge of “data hunger” pushed us to make the new dataset out of the ordinary: SemanticPOSS is a point-level annotated point cloud dataset collected at Peking University. Its main feature is the large number of dynamic instances it contains: for example, there are on average 8.29 pedestrian instances per frame, more than 10 times denser than in KITTI and SemanticKITTI. The rich dynamic instances provide a more challenging and diverse environment for autonomous driving systems and fill a gap among public datasets regarding crowded dynamic scenes.

We will also introduce our work on 3D LiDAR semantic segmentation focused on solving the “data hungry” problem; concretely, weakly and semi-supervised learning algorithms applied to different scenarios will be presented.


Qingyong Hu

DPhil candidate at the University of Oxford

Qingyong Hu is a second-year DPhil student (Oct 2018 - ) in the Department of Computer Science at the University of Oxford, supervised by Niki Trigoni and Andrew Markham. His research goal is to build intelligent systems that achieve effective and efficient perception and understanding of 3D scenes. In particular, his research focuses on large-scale point cloud segmentation, dynamic point cloud processing, and point cloud tracking.

RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds

We study the problem of efficient semantic segmentation for large-scale 3D point clouds. Because they rely on expensive sampling techniques or computationally heavy pre/post-processing steps, most existing approaches can only be trained on and operate over small-scale point clouds. In this work, we introduce RandLA-Net, an efficient and lightweight neural architecture to directly infer per-point semantics for large-scale point clouds. The key to our approach is to use random point sampling instead of more complex point selection approaches. Although remarkably computation- and memory-efficient, random sampling can discard key features by chance. To overcome this, we introduce a novel local feature aggregation module to progressively increase the receptive field for each 3D point, thereby effectively preserving geometric details. Extensive experiments show that RandLA-Net can process 1 million points in a single pass while being up to 200× faster than existing approaches. Moreover, RandLA-Net clearly surpasses state-of-the-art approaches for semantic segmentation on two large-scale benchmarks, Semantic3D and SemanticKITTI.
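As a rough illustration of why random sampling is the key to RandLA-Net's scalability, the NumPy sketch below (with illustrative names, not the released code) repeatedly keeps a uniform random subset of points and their features. The cost is constant per sampled point, whereas farthest-point sampling scales roughly quadratically with the number of points.

# Minimal NumPy sketch (illustrative names, not the released RandLA-Net code)
# of random downsampling: picking a random subset is O(1) per sampled point,
# while farthest-point sampling is roughly O(N^2) in the number of points.
import numpy as np

def random_downsample(points, features, ratio=4, rng=None):
    """Keep a random 1/ratio subset of the points and their features."""
    rng = np.random.default_rng() if rng is None else rng
    n = points.shape[0]
    keep = rng.choice(n, size=n // ratio, replace=False)
    return points[keep], features[keep]

# toy usage: one million points with 8-dimensional features
pts = np.random.rand(1_000_000, 3).astype(np.float32)
feat = np.random.rand(1_000_000, 8).astype(np.float32)
for _ in range(4):                      # four encoder stages, each 4x downsampling
    pts, feat = random_downsample(pts, feat, ratio=4)
print(pts.shape, feat.shape)            # (3906, 3) (3906, 8)

In the full architecture, each such downsampling step is paired with the local feature aggregation module described above, which compensates for the points that random sampling inevitably throws away.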

Thomas Chaton

Senior Research Engineer, Sky, London

Thomas Chaton holds a double master's degree from Telecom ParisTech in Security / Data Science, with a specialization in Machine Learning. After graduation, he worked for two years at Fujitsu AI Labs in London on machine learning and computer vision applied to anomaly detection for manufacturing processes, and for one year at HELIX RE on state-of-the-art AI point cloud models for segmentation at scale, which became the starting point for the Torch-Points3D framework. He is currently working as a Senior Research Engineer at Sky, London.

Torch-Points-3D — A unifying framework for deep learning on point clouds

Research in deep learning for point cloud data has boomed in the past few years, resulting in rapid improvements of state-of-the-art (SOTA) results on common benchmarks. Yet developing new architectures and evaluating the real impact of new contributions remains a challenge. We propose an open-source framework that enables easy reproduction of the existing SOTA models and exploration of new models.

Matthias Nießner

Professor, Visual Computing & Artificial Intelligence, Technical University of Munich, Department of Informatics


3D Deep Learning & Self-Supervision

In this talk, I will discuss how we can use 3D data to self-supervise existing problems, and how in particular panoramic images, such as those from the Matterport3D dataset, can fuel many computer vision tasks. I will further talk about how to leverage these ideas in the context of 3D shape reconstruction and completion, obtaining high-quality 3D models even when no ground-truth data is available in real-world scans.

Dr. Jens Behley

Postdoc at Photogrammetry & Robotics Lab, University of Bonn

Dr. Jens Behley has been a post-doctoral researcher at the Photogrammetry & Robotics Lab since 2015. He obtained his Ph.D. in computer science from the University of Bonn in 2014; his Ph.D. thesis addressed LiDAR-based perception in urban environments and was supervised by Prof. Dr. Armin B. Cremers. His research interests include semantic scene understanding in urban environments and semantic interpretation in the agricultural domain, with a strong focus on deep learning.

SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences

Panoptic segmentation is a recently introduced task that tackles semantic segmentation and instance segmentation jointly.

In this talk, we present our extension of SemanticKITTI, a large-scale dataset providing dense point-wise semantic labels for all sequences of the KITTI Odometry Benchmark, to the training and evaluation of laser-based panoptic segmentation. We present our approach to obtaining temporally consistent instance annotations, as well as strong two-stage baselines that combine state-of-the-art LiDAR-based semantic segmentation approaches with a state-of-the-art detector enriching the segmentation with instance information.

Furthermore, we present our novel, single-stage, real-time-capable panoptic segmentation approach, which uses a shared encoder with a semantic and an instance decoder. We leverage the geometric information of the LiDAR scan to perform a novel, distance-aware trilinear upsampling, which allows our approach to use larger output strides than transpose convolutions would, leading to substantial savings in computation time.
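The distance-aware trilinear upsampling itself is detailed in the talk and the accompanying paper. Purely as a generic illustration of range-aware upsampling, and not the presented method, the PyTorch sketch below damps nearest-neighbor-upsampled range-image features wherever the coarse range disagrees with the full-resolution range, so that features are less likely to bleed across depth discontinuities. The function name, the exponential weighting, and the image sizes are assumptions made for this example.

# Generic illustration (not the presented method): make an upsampling step
# "range-aware" by down-weighting interpolated features at pixels where the
# coarse range estimate disagrees with the true full-resolution range.
import torch
import torch.nn.functional as F

def range_aware_upsample(coarse_feats, coarse_range, full_range, scale=4, k=1.0):
    # coarse_feats: (B, C, H/scale, W/scale) features from the encoder
    # coarse_range: (B, 1, H/scale, W/scale) range at the coarse resolution
    # full_range:   (B, 1, H, W) per-pixel range of the full-resolution scan
    up_feats = F.interpolate(coarse_feats, scale_factor=scale, mode='nearest')
    up_range = F.interpolate(coarse_range, scale_factor=scale, mode='nearest')
    weight = torch.exp(-k * (up_range - full_range).abs())   # (B, 1, H, W), in (0, 1]
    return up_feats * weight                                  # suppress cross-depth bleeding

# toy usage with a 64 x 2048 range image downsampled by 4
B, C = 1, 32
coarse = torch.randn(B, C, 16, 512)
coarse_r = torch.rand(B, 1, 16, 512) * 50
full_r = torch.rand(B, 1, 64, 2048) * 50
out = range_aware_upsample(coarse, coarse_r, full_r)
print(out.shape)   # torch.Size([1, 32, 64, 2048])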