Schedule


All talk times are in 

Reference time zones for speakers

09:15-09:30 Opening 

Morning Session (Chair Ravi/Agapius)


09:30-10:00 Harnessing Natural Language Supervision for Autonomous Driving: A Look at LidarCLIP, Adam Tonderski

10:00-10:20 DeepSTEP - Deep Learning-Based Spatio-Temporal E2E Perception for AVs, S. Huch, F. Sauerbeck and J. Betz (paper presentation)

10:20-10:50 Learning Interacting Dynamic Systems with Prediction and Control using Neural ODEs, Dimitris N. Metaxas


10:50-11:15 Coffee break 

11:15-11:45 Unlocking the Full Potential of BEV Perception: Bridging Sensors, Domains, Spaces, and Time, Prof. Yu-Xiong Wang

11:45-12:15 How do NeRF and CLIP Advance 3D Scene Reconstruction and Understanding?, Songyou Peng

12:15-12:30 Vision-RADAR Fusion for Robotics BEV Detections: A Survey, Apoorv Singh (paper presentation)



12:40-13:40 Lunch Break 

Afternoon Session (Chair Xinshuo Weng/Agapius)

13:45-14:00 Generating Vehicle Camera Perspectives from a Bird’s-Eye View, Alexander Swerdlow (short presentation)

14:00-14:30 What Really Matters for Multi-Sensor BEV Perception? Adam Harley

14:30-15:00 Towards Scalable Autonomous Driving, Yue Wang, NVIDIA


15:00-15:15 Coffee Break

15:15-15:45 Cross-View Transformers for Map-View Perception, Brady Zhou

15:45-16:15 Towards Robust Visual Perception in Autonomous Driving, Lingdong Kong

16:15-16:45 Training Large-scale Transformers for Instance-level Detection, Segmentation, Prediction, and Planning, Prof. Xinggang Wang

16:45-17:00 Closing

Speakers

Yue Wang is a research scientist at Nvidia and an incoming assistant professor at USC CS. He is currently leading efforts to build neural field representations for robotics and autonomous driving. His research interests lie at the intersection of computer vision, computer graphics, and machine learning. During his PhD, he worked on learning from point clouds; his paper "Dynamic Graph CNN" has been widely adopted in 3D visual computing and other fields. He is a recipient of the Nvidia Fellowship and was named the first-place recipient of the William A. Martin Master’s Thesis Award for 2021. Yue received his BEng from Zhejiang University and MS from the University of California, San Diego. He has spent time at Nvidia Research, Google Research, and Salesforce Research.


Towards Scalable Autonomous Driving


Professor, Artificial Intelligence Institute, School of Electronic Information and Communications, Huazhong University of Science and Technology

Xinggang Wang is a Professor at Huazhong University of Science and Technology and a Co-Editor-in-Chief of Elsevier's Image and Vision Computing. He was a visiting scholar at Temple University and UCLA. His research topics include large-scale and efficient visual representation learning for video understanding and autonomous driving. He has published more than 60 papers in top-tier conferences and journals, among which the criss-cross attention (CCNet) method has been applied as the backbone network in AlphaFold. According to Google Scholar, he has more than 17,000 citations. He also serves as an associate editor for Pattern Recognition and as an area chair for CVPR 2022 and ICCV 2023.

Training Large-scale Transformers for Instance-level Detection, Segmentation, Prediction, and Planning


PhD student, ETH Zurich and Max Planck Institute for Intelligent Systems

Songyou Peng is currently a final-year PhD candidate at ETH Zurich and the Max Planck Institute for Intelligent Systems. Under the supervision of Marc Pollefeys and Andreas Geiger, his research focuses on the intersection of 3D vision and deep learning. Songyou has gained valuable experience as a research intern at Google Research, Meta Reality Labs Research, TUM, and INRIA. His primary research interests include exploring both neural implicit and explicit representations for various applications, such as 3D reconstruction, novel view synthesis, SLAM, and 3D scene understanding with open vocabularies.

How do NeRF and CLIP Advance 3D Scene Reconstruction and Understanding?


Ph.D. Student, National University of Singapore

Lingdong Kong is a Ph.D. student at the National University of Singapore. His research interests include 3D perception, domain adaptation, and visual representation learning. He aims to build robust and scalable perception systems that generalize across different domains and scenarios with minimal or no human annotations. He was an autonomous vehicle intern at Motional, a research assistant at MMLab@NTU, and a research intern at ByteDance AI Lab. He holds an M.Eng. degree from Nanyang Technological University, Singapore, and a B.Eng. degree from the South China University of Technology.


Towards Robust Visual Perception in Autonomous Driving

The resilience of a visual perception system is pivotal for safety-critical applications such as autonomous driving. While promising results have been achieved on standard benchmarks, the robustness of existing 3D perception models is still unknown. In this presentation, we introduce recent efforts in designing robustness benchmarks for monocular depth estimation, LiDAR semantic segmentation, and 3D object detection. We find that most existing 3D perception models are vulnerable to data corruption, domain shift, and sensor failure, in part because suitable robustness evaluation suites have been lacking. To fill this gap, we benchmark a wide range of 3D perception models on their robustness under adverse weather conditions, sensor failure and movement, and data processing issues. Based on our benchmarking results, we draw important observations on designing robust and reliable models that mitigate performance degradation under these out-of-distribution scenarios.
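
As a rough illustration of the kind of protocol such a robustness benchmark involves, here is a minimal Python sketch; the names (evaluate_robustness, corrupt_loaders, metric_fn) are hypothetical, and the benchmark's actual corruption sets, severities, and metrics may differ.

# Hypothetical sketch of a corruption-robustness evaluation loop; not the benchmark's code.
def evaluate_robustness(model, clean_loader, corrupt_loaders, metric_fn):
    # corrupt_loaders: dict mapping a corruption name (e.g., "fog", "sensor_dropout")
    # to a dataloader with that corruption applied to the validation data
    clean_score = metric_fn(model, clean_loader)
    report = {}
    for name, loader in corrupt_loaders.items():
        score = metric_fn(model, loader)
        report[name] = {"score": score, "degradation": clean_score - score}
    return clean_score, report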


Ph.D. Student, Lund University & Zenseact

Adam Tonderski is currently an industrial PhD student at Lund University, collaborating with Zenseact, an autonomous driving company based in Gothenburg. His academic journey began at Chalmers University, where he earned his Master's degree, and he later specialized in monocular 3D object detection as a deep learning engineer at Zenuity. Today, Adam's research focuses on reducing human supervision in the development of perception systems for autonomous driving. He is exploring this complex field through three synergistic strategies: enhancing the utility of existing annotations, developing robust offline foundation models using self-, weak-, and semi-supervised training techniques, and refining the quality of automatic annotation pipelines.



Harnessing Natural Language Supervision for Autonomous Driving: A Look at LidarCLIP

With recent breakthroughs in linking text and images, research in this domain has taken a significant leap forward. However, the connection between text and other visual modalities, notably lidar data, remains largely underexplored due to the scarcity of text-lidar datasets. We introduce LidarCLIP, a novel approach that embeds automotive point clouds in a pre-existing CLIP embedding space. By utilizing image-lidar pairs, LidarCLIP supervises a point cloud encoder with image CLIP embeddings, enabling a potent correlation between text and lidar data. We explore several applications, including retrieval of highly challenging detection scenarios, zero-shot point cloud classification, and even off-the-shelf lidar-to-image and lidar-to-text generation. Finally, we take a step back to examine the future prospects of language supervision in autonomous driving.
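
To make the supervision scheme concrete, here is a minimal PyTorch-style sketch of CLIP-style distillation from images to lidar; the names (clip_image_encoder, lidar_encoder) and the cosine objective are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def train_step(clip_image_encoder, lidar_encoder, optimizer, images, point_clouds):
    # The CLIP image encoder stays frozen and provides the targets.
    with torch.no_grad():
        target = clip_image_encoder(images)        # (B, D) image embeddings
    pred = lidar_encoder(point_clouds)             # (B, D) lidar embeddings in the same space
    # Pull each lidar embedding toward its paired image embedding
    # (a simple similarity loss; the paper's objective may differ).
    loss = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()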

Brady Zhou

PhD student at UT Austin


Brady is a PhD student at the University of Texas at Austin, working under the supervision of Dr. Philipp Krähenbühl. His research focuses on the practical applications of deep learning in the realm of autonomous driving, specifically the development of effective representations for driving scenes. In previous collaborations with NVIDIA, Motional, and Wayve, Brady has focused on finding solutions to real-world problems.

Cross-View Transformers for Map-View Perception

In order to drive safely, autonomous systems need both semantic understanding of the environment and spatial reasoning, due to the inherent nature of navigation. To tackle this challenge, we introduce a transformer architecture that utilizes multi-view camera sensors to create a unified map-view representation of the scene. We achieve this simply by encoding the geometric relationship between camera views and the map view into a positional embedding of the transformer. Positions in the map view learn to attend to different image patches and implicitly build up the map-view representation. The architecture shows competitive results in camera-only map-view semantic segmentation and runs comfortably in real time.
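
To illustrate the positional-embedding idea, here is a minimal PyTorch sketch in which map-view queries attend to multi-camera image features whose keys carry a camera-geometry-derived embedding; the shapes and names (CrossViewAttention, cam_pos_embed) are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, map_queries, image_feats, cam_pos_embed):
        # map_queries:   (B, H_bev*W_bev, D)  learned embeddings for map-view cells
        # image_feats:   (B, N_cams*H*W, D)   flattened multi-camera image features
        # cam_pos_embed: (B, N_cams*H*W, D)   embedding of each pixel's viewing ray,
        #                                     derived from camera intrinsics/extrinsics
        keys = image_feats + cam_pos_embed     # camera geometry enters through the keys
        out, _ = self.attn(query=map_queries, key=keys, value=image_feats)
        return out                             # (B, H_bev*W_bev, D) map-view features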

Alexander Swerdlow is an incoming MSR student at CMU, currently advised by Prof. Bolei Zhou at UCLA. His research interests include controllable generative modeling, 3D perception, and visual representation learning. He was previously a research intern on the Perception team at Waabi and with the RoMeLa group at UCLA.

Generating Vehicle Camera Perspectives from a Bird’s-Eye View

Bird's-Eye View (BEV) Perception has received increasing attention in recent years with work on discriminative tasks such as BEV segmentation and detection. However, the dual generative task of creating street-view images from a BEV layout has been rarely explored. The ability to generate realistic street-view images that align with a given HD map and traffic layout is critical for visualizing complex traffic scenarios and developing robust perception models for autonomous driving. We propose BEVGen, a conditional generative model that synthesizes a set of realistic and spatially consistent surrounding images that match the BEV layout of a traffic scenario. BEVGen incorporates a novel cross-view transformation with spatial attention design which learns the relationship between cameras and map views to ensure their consistency. We evaluate the proposed model on the challenging NuScenes and Argoverse 2 datasets and show that BEVGen can accurately render road and lane lines, as well as generate traffic scenes under different weather conditions and times 

Adam Harley is a postdoctoral scholar at Stanford University, working with Leonidas Guibas. He recently completed his Ph.D. at The Robotics Institute at Carnegie Mellon University, where he worked with Katerina Fragkiadaki. His research interests lie in computer vision and machine learning, particularly for 3D understanding and fine-grained tracking.

What Really Matters for Multi-Sensor BEV Perception?

Building 3D perception systems for autonomous vehicles that do not rely on high-density LiDAR is a critical research problem because of the expense of LiDAR systems compared to cameras and other sensors. Recent research has developed a variety of camera-only methods, where features are differentiably "lifted" from the multi-camera images onto the 2D ground plane, yielding a "bird's eye view" (BEV) feature representation of the 3D space around the vehicle. This line of work has produced a variety of novel "lifting" methods, but we observe that other details in the training setups have shifted at the same time, making it unclear what really matters in top-performing methods. We will step through and analyze which design decisions impact performance most. We also pay special attention to radar, a sensor which is often neglected, and invite the community to consider metric sensors in general as a key component of the sensor platform.
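
As a rough illustration of what "lifting" can look like, here is a minimal PyTorch sketch that places each BEV cell on the ground plane (z = 0), projects it into one camera, and samples the image feature at that pixel; the function name lift_to_bev and the single-camera, known-calibration setup are assumptions, and top-performing methods differ in exactly these details.

import torch
import torch.nn.functional as F

def lift_to_bev(image_feats, intrinsics, cam_from_world, bev_xy, image_size):
    # image_feats:    (B, C, H, W) feature map from one camera
    # intrinsics:     (B, 3, 3) camera intrinsics
    # cam_from_world: (B, 4, 4) world-to-camera extrinsics
    # bev_xy:         (N, 2) ground-plane coordinates of the BEV cells, in meters
    # image_size:     (H_img, W_img) of the original image, for normalization
    N = bev_xy.shape[0]
    world = torch.cat([bev_xy, bev_xy.new_zeros(N, 1), bev_xy.new_ones(N, 1)], dim=1)  # z = 0
    cam = torch.einsum('bij,nj->bni', cam_from_world, world)[..., :3]   # camera-frame points
    pix = torch.einsum('bij,bnj->bni', intrinsics, cam)
    pix = pix[..., :2] / pix[..., 2:3].clamp(min=1e-5)                  # perspective divide
    H_img, W_img = image_size
    grid = torch.stack([pix[..., 0] / W_img * 2 - 1,                    # normalize to [-1, 1]
                        pix[..., 1] / H_img * 2 - 1], dim=-1)
    sampled = F.grid_sample(image_feats, grid.unsqueeze(2), align_corners=False)
    # (B, C, N): one feature per BEV cell; cells behind the camera or outside
    # the image should be masked in practice.
    return sampled.squeeze(-1)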

Distinguished Professor of Computer Science

Rutgers University


Dimitris Metaxas is a Distinguished Professor in the Computer and Information Sciences Department at Rutgers University. He directs the Center for Computational Biomedicine, Imaging and Modeling (CBIM) and the NSF University-Industry Collaboration Center CARTA, with emphasis on real-time and scalable data analytics, AI, and machine learning methods with applications to computer vision, dynamical systems, the environment, and medical image analysis. Dr. Metaxas has been conducting research towards the development of novel methods and technology upon which AI, machine learning, physics-based modeling, computer vision, medical image analysis, and computer graphics can advance synergistically. He has published over 700 research articles in these areas and has graduated 65 PhD students, who occupy prestigious academic and industry positions. His research has been funded by NIH, NSF, AFOSR, ARO, DARPA, HSARPA, and ONR. Dr. Metaxas' work has received many best paper awards, and he holds 8 patents. He was awarded a Fulbright Fellowship in 1986 and is a recipient of NSF Research Initiation and CAREER awards and an ONR YIP. He is a Fellow of the American Institute for Medical and Biological Engineering, a Fellow of IEEE, and a Fellow of the MICCAI Society. He has been General Chair of IEEE CVPR 2014, Program Chair of ICCV 2007, General Chair of ICCV 2011, FIMH 2011, and MICCAI 2008, and Senior Program Chair of SCA 2007. He will also be a General Chair of CVPR 2026.


Learning Interacting Dynamic Systems with Prediction and Control using Neural Ordinary Differential Equations


Modeling Interacting Dynamic Systems composed of different types of agents (e.g., vehicles, pedestrians) is important due to its many applications, including autonomous driving and mixed reality simulations. Many approaches model Interacting Dynamic Systems in the temporal and relational dimensions using purely data-driven methods. However, these approaches usually fail to explicitly learn the underlying continuous temporal dynamics, agent interactions, and their dynamic adaptation. In this talk, we present a novel Dynamic Data Driven approach in the form of an interacting system of ordinary differential equations (ISODE) that is scalable and models multiple heterogeneous agents. Our approach uses the latent space of Neural ODEs to model continuous temporal dynamics, incorporating distance and interaction intensity into the modeling of dynamic agent interactions. In addition, we show how to dynamically control and update an agent's trajectory, without retraining, when obstacles and targets are introduced on the fly. Extensive experiments reveal that our ISODE approach outperforms the state of the art. We also show how an agent with sensing can dynamically avoid suddenly appearing obstacles, and how to effectively control agent motion by introducing attractors and repellers.
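
As a minimal sketch of the general idea, assuming a hypothetical module name (InteractionODE) and a simple Euler integrator in place of an adaptive ODE solver, the snippet below models each agent with a latent state whose time derivative combines self-dynamics with pairwise interaction terms; it illustrates the latent-ODE formulation, not the speaker's ISODE implementation.

import torch
import torch.nn as nn

class InteractionODE(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.self_dyn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))
        self.pair_dyn = nn.Sequential(nn.Linear(2 * dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, z):
        # z: (N_agents, D) latent states; returns dz/dt for every agent
        n = z.shape[0]
        zi = z.unsqueeze(1).expand(n, n, -1)
        zj = z.unsqueeze(0).expand(n, n, -1)
        pair = self.pair_dyn(torch.cat([zi, zj], dim=-1)).sum(dim=1)   # aggregate over neighbors
        return self.self_dyn(z) + pair

def rollout(ode, z0, steps=50, dt=0.1):
    # Fixed-step Euler integration of the latent states (adaptive solvers are typical in practice).
    z, traj = z0, [z0]
    for _ in range(steps):
        z = z + dt * ode(z)
        traj.append(z)
    return torch.stack(traj)   # (steps + 1, N_agents, D) latent trajectory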

Yu-Xiong Wang

Assistant Professor

Department of Computer Science, University of Illinois at Urbana-Champaign

Yu-Xiong Wang is an Assistant Professor in the Department of Computer Science at the University of Illinois Urbana-Champaign. He is also affiliated with the National Center for Supercomputing Applications (NCSA). He received a Ph.D. in robotics from Carnegie Mellon University. His research interests lie in computer vision, machine learning, and robotics, with a particular focus on few-shot learning, meta-learning, open-world learning, 3D vision, and streaming perception. His awards include the Amazon Faculty Research Award, the ECCV Best Paper Honorable Mention Award, and a CVPR Best Paper Award Finalist selection. For details: https://yxw.cs.illinois.edu/.


Unlocking the Full Potential of BEV Perception: Bridging Sensors, Domains, Spaces, and Time

The Bird's-eye-view (BEV) formulation has recently emerged as a powerful tool in autonomous driving. It unifies the coordinate systems of different sensors (e.g., cameras, LiDARs, RADARs) and shares a direct interface with various perception and downstream planning tasks. In this talk, I will discuss our recent efforts that broaden the horizon of BEV perception across diverse sensors, domains, spaces, and temporal contexts. I will first focus on how to fuse diverse modalities within a unified BEV space and transfer such knowledge across different domains. I will then demonstrate how to further expand BEV perception along the spatial-temporal dimension, where end-to-end object tracking is achieved in streaming videos using BEV queries, and data from multiple traversals are aggregated to generate accurate offline high-definition maps. Overall, the talk aims to showcase both the diversity and the unification of the BEV perception spectrum.
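
As a minimal illustration of fusing modalities in a shared BEV space, the sketch below assumes a camera branch and a LiDAR branch that already produce feature maps on the same BEV grid and fuses them with a small convolutional head; the module name SimpleBEVFusion and the concatenation-plus-convolution design are illustrative assumptions, not the speaker's method.

import torch
import torch.nn as nn

class SimpleBEVFusion(nn.Module):
    def __init__(self, cam_channels, lidar_channels, out_channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_channels + lidar_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev, lidar_bev):
        # cam_bev:   (B, C_cam, H, W)   camera features lifted onto the BEV grid
        # lidar_bev: (B, C_lidar, H, W) voxelized LiDAR features on the same grid
        fused = self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))
        return fused               # (B, C_out, H, W) shared BEV features for downstream tasks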

Accepted papers