Speakers


Kostas Daniilidis

Kostas Daniilidis is the Ruth Yalom Stone Professor of Computer and Information Science at the University of Pennsylvania, where he has been on the faculty since 1998. He is an IEEE Fellow. He was the director of the GRASP laboratory from 2008 to 2013, Associate Dean for Graduate Education from 2012 to 2016, and has been Faculty Director of Online Learning since 2016. He obtained his undergraduate degree in Electrical Engineering from the National Technical University of Athens in 1986 and his PhD (Dr. rer. nat.) in Computer Science from the University of Karlsruhe in 1992, under the supervision of Hans-Hellmut Nagel. He was an Associate Editor of IEEE Transactions on Pattern Analysis and Machine Intelligence from 2003 to 2007. He co-chaired IEEE 3DPVT 2006 with Pollefeys and was Program Co-chair of ECCV 2010. He received the Best Conference Paper Award at ICRA 2017. His most cited works have been on visual odometry, omnidirectional vision, 3D pose estimation, 3D registration, hand-eye calibration, structure from motion, and image matching. Kostas’ main interests today are in geometric deep learning, data association, event-based cameras, and vision-based manipulation and navigation.

Title: From Semantic SLAM to Semantic Navigation

Abstract: Progress in visual localization and mapping has led to deployed systems with accurate metric localization. When presented with a navigation problem of going from A to B, robots have to predict the occupancy of the environment to reach the target position efficiently. In this sense, they can take advantage of the prior distribution of spaces in terms of occupancy. When a navigation command entails a semantic target like “go to the oven,” current semantic SLAM systems have to explore the whole environment to find the oven. We argue that a robot has to learn how to map by predicting the semantic and occupancy information outside its field of view. We follow an active learning approach driven by estimates of the epistemic uncertainty obtained from an ensemble. When dropped into an unseen environment, the robot can predict where a semantic target might be located and follow a policy according to the upper confidence bound. We present results in semantic mapping and in the PointNav and ObjectNav benchmarks.
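The selection strategy in the abstract (ensemble disagreement as epistemic uncertainty, goals chosen by an upper confidence bound) can be sketched in a few lines of Python. This is an illustrative toy, not the speaker's system; the predictor, ensemble size, and exploration weight below are hypothetical placeholders.

```python
# Illustrative sketch: choosing an exploration goal with an upper confidence bound
# over an ensemble of semantic-map predictors. All names are hypothetical.
import numpy as np

N_MODELS, H, W = 5, 64, 64      # ensemble size and map resolution (assumed)
BETA = 1.0                      # exploration weight in the UCB score

rng = np.random.default_rng(0)

def predict_target_probability(model_seed, partial_map):
    """Stand-in for one ensemble member predicting P(target) per map cell."""
    local_rng = np.random.default_rng(model_seed)
    return np.clip(partial_map + 0.1 * local_rng.standard_normal((H, W)), 0.0, 1.0)

partial_map = rng.uniform(0.0, 0.3, size=(H, W))              # observed semantics/occupancy so far
preds = np.stack([predict_target_probability(s, partial_map)  # one prediction per ensemble member
                  for s in range(N_MODELS)])

mean = preds.mean(axis=0)        # exploitation term: expected target likelihood
std = preds.std(axis=0)          # epistemic uncertainty proxy from ensemble disagreement
ucb = mean + BETA * std          # upper confidence bound per cell

goal = np.unravel_index(np.argmax(ucb), ucb.shape)
print("next navigation goal (row, col):", goal)
```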

Yiannis Aloimonos

Yiannis Aloimonos is Professor of Computational Vision and Intelligence at the Department of Computer Science, University of Maryland, College Park, and the Director of the Computer Vision Laboratory at the Institute for Advanced Computer Studies (UMIACS). He is also affiliated with the Institute for Systems Research, the Neural and Cognitive Science Program, and the Maryland Robotics Center. He was born in Sparta, Greece, and studied Mathematics in Athens and Computer Science at the University of Rochester, NY (PhD 1990). He is interested in Active Perception and the modeling of vision as an active, dynamic process for real-time robotic systems. For the past five years, he has been working on bridging signals and symbols, specifically on the relationship of vision to control and the relationship of action and language, using Hyper-dimensional Computing.

Title: Hyper-dimensional Active Perception: towards minimal cognition

Abstract: Action and perception are often kept in separate spaces, a consequence of traditional vision being frame-based and only existing in the moment, while motion is a continuous entity. This gap is bridged by the dynamic vision sensor (DVS), a neuromorphic camera that can see motion. We propose a method of encoding actions and perceptions together into a single space that is meaningful, semantically informed, and consistent by using hyper-dimensional binary vectors (HBVs). We show that the visual component can be bound with the system velocity to enable dynamic world perception, which creates an opportunity for real-time navigation and obstacle avoidance with active perception. Actions performed by an agent are directly bound to the perceptions experienced to form its own “memory.” Furthermore, because HBVs can encode entire histories of actions and perceptions, from atomic elements to arbitrary sequences, as constant-sized vectors, auto-associative memory can be combined with deep learning paradigms for control. Using this methodology we can implement, for nanoquadrotors with all processing on board, a hierarchy of sensorimotor loops providing a set of competences (egomotion, moving object detection, obstacle avoidance, homing, and landing) which can be interfaced with episodic, procedural, and semantic memory, giving rise to a minimal cognitive system.
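As a rough illustration of the HBV machinery mentioned above, the following sketch uses the common XOR-binding and majority-bundling conventions from the hyper-dimensional computing literature; it is not the speakers' implementation, and the dimension and memory contents are made up.

```python
# Minimal sketch of hyper-dimensional binary vectors (HBVs): bind a visual percept
# with a velocity, bundle it into a "memory", then query the memory for the velocity.
import numpy as np

D = 8192                                    # hyper-dimension (assumed)
rng = np.random.default_rng(42)

def random_hbv():
    return rng.integers(0, 2, size=D, dtype=np.uint8)

def bind(a, b):                             # binding: XOR, invertible (bind(x, x) = 0)
    return np.bitwise_xor(a, b)

def bundle(vectors):                        # bundling: bit-wise majority vote
    return (np.sum(vectors, axis=0) > len(vectors) / 2).astype(np.uint8)

def similarity(a, b):                       # 1 - normalized Hamming distance
    return 1.0 - np.count_nonzero(a != b) / D

visual, velocity = random_hbv(), random_hbv()
memory = bundle([bind(visual, velocity), random_hbv(), random_hbv()])  # plus unrelated items

# Unbinding with the visual percept approximately recovers the associated velocity.
recovered = bind(memory, visual)
print("similarity to true velocity:", similarity(recovered, velocity))       # well above chance
print("similarity to a random HBV :", similarity(recovered, random_hbv()))   # about 0.5
```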

Dezhen Song

Dezhen Song is a Professor and Associate Department Head for Academics in the Department of Computer Science and Engineering, Texas A&M University, College Station, Texas, USA. Song received his Ph.D. from the University of California, Berkeley in 2004, and his B.S. and M.S. from Zhejiang University in 1995 and 1998, respectively. Song's primary research areas are robot perception, networked robots, visual navigation, automation, and stochastic modeling. He received the NSF Faculty Early Career Development (CAREER) Award in 2007. From 2008 to 2012, Song was an Associate Editor of IEEE Transactions on Robotics (T-RO). From 2010 to 2014, he was an Associate Editor of IEEE Transactions on Automation Science and Engineering (T-ASE). From 2017 to 2020, he was a Senior Editor for IEEE Robotics and Automation Letters (RA-L). He is a multimedia editor for the Springer Handbook of Robotics. Dezhen Song has been PI or Co-PI on more than $15.0 million in grants, including more than $4.9 million from NSF. His research has resulted in two books and more than 120 refereed publications. Dr. Song received the Kayamori Best Paper Award at the 2005 IEEE International Conference on Robotics and Automation.


Title: A Few Attempts to Improve Robustness of Visual SLAM

Abstract: When a camera is employed as the primary sensor to perform the simultaneous localization and mapping (SLAM) task for a robot or a mobile device, the approach is often referred to as visual SLAM. Visual SLAM has seen many applications, including augmented reality, autonomous driving, and service robotics, due to its low-cost sensory hardware and small footprint. It is a vital part of navigation and scene reconstruction. However, visual SLAM still suffers from robustness issues due to its reliance on a continuously successful image matching process. Because of lighting, camera perspective, and feature distribution, vSLAM algorithms still have a non-negligible failure rate. Here we present a few attempts our lab has made to attack the robustness issue from multiple angles: exploiting complex features, spatial knowledge sharing, better robust estimation, and improvement of the sparse optimization solver. We present these approaches and hope to encourage discussion of and attention to the robustness issue, which is the main hurdle in many real-world applications.
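Of the four directions listed, robust estimation is the easiest to illustrate in isolation. The sketch below applies Huber-weighted iteratively reweighted least squares to a toy line fit contaminated by outliers, standing in for the mismatched features a vSLAM back end must down-weight; it is a generic illustration, not the lab's solver.

```python
# Robust estimation via iteratively reweighted least squares (IRLS) with Huber weights,
# shown on a toy line fit with gross outliers (analogous to bad feature matches).
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(100)
y[::10] += 15.0                                   # inject gross outliers

A = np.column_stack([x, np.ones_like(x)])
theta = np.linalg.lstsq(A, y, rcond=None)[0]      # ordinary least squares (outlier-sensitive)

DELTA = 1.0                                       # Huber threshold (assumed)
for _ in range(10):                               # IRLS iterations
    r = y - A @ theta
    w = np.where(np.abs(r) <= DELTA, 1.0,
                 DELTA / np.maximum(np.abs(r), 1e-9))   # Huber weights down-weight outliers
    sw = np.sqrt(w)
    theta = np.linalg.lstsq(sw[:, None] * A, sw * y, rcond=None)[0]

print("robust estimate of (slope, intercept):", theta)  # close to (2, 1) despite outliers
```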

Davide Scaramuzza

Davide Scaramuzza is a Professor of Robotics and Perception at both the Department of Informatics (University of Zurich) and the Department of Neuroinformatics (joint between the University of Zurich and ETH Zurich), where he directs the Robotics and Perception Group. His research lies at the intersection of robotics, computer vision, and machine learning, using standard cameras and event cameras, and aims to enable the autonomous, agile navigation of micro drones in search-and-rescue applications. In 2018, his team won the IROS 2018 Autonomous Drone Race, and in 2019 it ranked second in the AlphaPilot Drone Racing world championship. For his research contributions to autonomous, vision-based drone navigation and event cameras, he has won prestigious awards, such as a European Research Council (ERC) Consolidator Grant, the IEEE Robotics and Automation Society Early Career Award, an SNSF-ERC Starting Grant, a Google Research Award, the KUKA Innovation Award, two Qualcomm Innovation Fellowships, the European Young Research Award, the Misha Mahowald Neuromorphic Engineering Award, and several paper awards.


Luca Carlone

Dr. Luca Carlone is the Leonardo Career Development Associate Professor in the Department of Aeronautics and Astronautics at the Massachusetts Institute of Technology, and a Principal Investigator in the MIT Laboratory for Information & Decision Systems (LIDS). He is also the director of the MIT SPARK Lab, where he works at the cutting edge of robotics and autonomous systems research. He received the PhD degree from the Polytechnic University of Turin, Turin, Italy, in 2012. From 2013 to 2015, he was a Post-Doctoral Fellow with the Georgia Institute of Technology, Atlanta, GA, USA. In 2015, he was a Post-Doctoral Associate with the Laboratory for Information and Decision Systems (LIDS), Massachusetts Institute of Technology, Cambridge, MA, USA, where he became a Research Scientist in 2016. His research interests include nonlinear estimation, numerical and distributed optimization, and probabilistic inference, applied to sensing, perception, and decision-making in single- and multi-robot systems. He received the 2017 Transactions on Robotics King-Sun Fu Memorial Best Paper Award and the Best Paper Award at WAFR 2016.


Title: From Visual Odometry to Real-time Scene Understanding with 3D Scene Graphs

Abstract: 3D scene understanding is a grand challenge for robotics and computer vision research. In particular, scene understanding is a prerequisite for safe and long-term autonomous robot operation, and for effective human-robot interaction. 3D scene graphs have recently emerged as a powerful high-level representation for scene understanding. A 3D scene graph describes the environment as a layered graph where nodes represent spatial concepts at multiple levels of abstraction and edges represent relations between concepts. While 3D scene graphs can serve as an advanced "mental model" for robots, how to build such a rich representation in real-time is still uncharted territory. This talk describes Hydra, a perception system that builds a 3D scene graph from sensor data in real-time. Hydra includes real-time algorithms to incrementally construct the layers of a scene graph as the robot explores the environment. Moreover, it includes the first 3D scene graph optimization technique that converts the scene graph into a factor graph and simultaneously corrects all layers in response to loop closures. We show that 3D scene graphs create novel opportunities to enhance loop closure detection, demonstrate Hydra across multiple real and simulated datasets, and discuss open problems.
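As a rough picture of the layered representation described above, the sketch below encodes a few nodes and inter-layer "contains" edges in plain Python. The layer names and API are illustrative and do not correspond to Hydra's actual interface or its factor-graph optimization.

```python
# Minimal sketch of a layered 3D scene graph: nodes live in layers (object, place,
# room, building) and edges encode relations between concepts, within or across layers.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    layer: str                      # e.g. "object", "place", "room", "building"
    attributes: dict = field(default_factory=dict)

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)    # (src_id, dst_id, relation)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, src, dst, relation):
        self.edges.append((src, dst, relation))

    def children(self, node_id, relation="contains"):
        return [d for s, d, r in self.edges if s == node_id and r == relation]

g = SceneGraph()
g.add_node(Node("building_0", "building"))
g.add_node(Node("room_kitchen", "room"))
g.add_node(Node("object_oven", "object", {"class": "oven"}))
g.add_edge("building_0", "room_kitchen", "contains")     # inter-layer edge
g.add_edge("room_kitchen", "object_oven", "contains")
print(g.children("room_kitchen"))                        # ['object_oven']
```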

Sudipta Sinha

Sudipta N. Sinha is a principal researcher working on Mixed Reality and AI in the Cloud and AI group at Microsoft. His research interests lie in computer vision, robotics, and machine learning. He works on 3D computer vision problems related to 3D scene reconstruction from images and video (structure from motion, visual odometry, stereo matching, optical flow, multi-view stereo, image-based localization, object detection, and pose estimation), which enable applications such as 3D scanning, augmented reality (AR), and UAV-based aerial photogrammetry. He received his M.S. and Ph.D. from the University of North Carolina at Chapel Hill in 2005 and 2009, respectively. As a member of the UNC Chapel Hill team, he received the best demo award at CVPR 2007 for one of the first scalable, real-time, vision-based urban 3D reconstruction systems. He has served or will serve as an area chair for 3DV 2016, ICCV 2017, 3DV 2018, 3DV 2019, and 3DV 2020, was a program co-chair for 3DV 2017, and serves as an area editor for the Computer Vision and Image Understanding (CVIU) journal.


Title: Towards Storage Efficient and Privacy-Preserving Camera Localization in Pre-Mapped Environments


Abstract: Camera localization is a fundamental task in spatial AI systems for robotics, mixed reality, and wearable computing. Today, the most accurate localization approaches are based on image retrieval, visual feature matching, and 3D structure-based pose estimation. They require persistent storage of numerous visual features or images from the scene. Thus, they have high storage requirements, which makes these approaches unsuitable for resource-constrained platforms. Although cloud processing and storage can alleviate the storage issue, uploading visual features to persistent storage can raise privacy concerns, because the features can potentially be inverted to recover sensitive information about the scene or subjects, e.g., by reconstructing the appearance of query images. I will describe solutions to address both the privacy issue and the high storage needs. I will first present geometric insights that enable new feature representations and pose estimation methods for which it is more difficult to invert the features. Such methods could enhance privacy in cloud-based camera localization systems. I will then discuss learned localization approaches, which by design address both the storage and privacy issues, since neither features nor 3D map data are stored. Existing learned approaches are unfortunately not as accurate as the methods that use high storage. To that end, I will present a new learned camera localization technique. Our key idea is to designate a few salient 3D scene points as scene landmarks and then train convolutional neural network (CNN) architectures to detect these landmarks in query images, or predict their bearing when a landmark is not visible in the image. The new approach shows promise and outperforms existing learned approaches on a challenging new dataset.
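The landmark-based idea at the end of the abstract reduces, at inference time, to a standard 2D-3D pose solve once the landmarks have been detected. The sketch below simulates such detections and recovers the pose with OpenCV's PnP solver; the landmark coordinates, intrinsics, and noise are synthetic, and the CNN detector itself is not shown.

```python
# Toy scene-landmark localization: given 2D detections of known 3D scene landmarks,
# recover the camera pose with a PnP solve. Detections are simulated here.
import numpy as np
import cv2

# Known 3D scene landmarks in world coordinates (assumed to come from the pre-built map).
landmarks_3d = np.array([[0.0, 0.0, 5.0],
                         [1.0, 0.0, 5.0],
                         [0.0, 1.0, 6.0],
                         [1.0, 1.0, 4.0],
                         [0.5, 0.5, 5.5],
                         [1.5, 0.2, 4.5]], dtype=np.float64)

K = np.array([[500.0, 0.0, 320.0],          # camera intrinsics (assumed)
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Simulate the detector's 2D outputs by projecting with a ground-truth pose plus noise.
rvec_gt = np.array([[0.05], [0.1], [0.0]])
tvec_gt = np.array([[0.2], [-0.1], [0.3]])
proj, _ = cv2.projectPoints(landmarks_3d, rvec_gt, tvec_gt, K, None)
detections_2d = proj.reshape(-1, 2) + 0.5 * np.random.default_rng(0).standard_normal((6, 2))

# Pose from 2D-3D correspondences (RANSAC would be used with real, noisier detections).
ok, rvec, tvec = cv2.solvePnP(landmarks_3d, detections_2d, K, None)
print("estimated rotation (Rodrigues):", rvec.ravel())
print("estimated translation:", tvec.ravel())
```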

Fan Deng

Fan Deng is currently the principal manager/engineer at the OPPO US Research Center, leading the perception lab's SLAM and 3D vision R&D. His work supports the development of OPPO prototypes and products, including smartphones and XR HMDs. Prior to OPPO, he worked at Qualcomm on computer vision and camera R&D. He graduated from the University of Pennsylvania with a degree in Electrical Engineering in 2011.

Title: Practical vSLAM Design for the Real World


Abstract: SLAM algorithms have evolved dramatically over the past decades, especially with recent progress in sensor technologies and the widespread adoption of smartphones. What are the challenges of deploying SLAM in real products? What do we need to do to run real-time vSLAM on smartphones, AR/VR HMDs, or robots? Is there a one-for-all SLAM solution for every embedded environment? For the past four years, the perception lab of the OPPO US Research Center has been working on bringing vSLAM to computation-limited devices like smartphones and AR glasses, while at the same time building up knowledge of SLAM-related sensor design, mechanical design, and system architecture to make sure our vSLAM solutions succeed in the real world.

Katerina Fragkiadaki

Katerina Fragkiadaki is an Assistant Professor in the Machine Learning Department at Carnegie Mellon University. She received her Ph.D. from the University of Pennsylvania and was subsequently a postdoctoral fellow at UC Berkeley and Google Research. Her work is on learning visual representations with little supervision and on incorporating spatial reasoning into deep visual learning. Her group develops algorithms for mobile computer vision and for learning physics and common sense for agents that move around and interact with the world. Her work has been recognized with a best Ph.D. thesis award, an NSF CAREER award, an AFOSR Young Investigator award, a DARPA Young Investigator award, and Google, TRI, Amazon, UPMC, and Sony faculty research awards.

Title: LiDAR-Free Bird's Eye View Perception for Autonomous Vehicles

Abstract: LiDAR-free 3D perception systems for autonomous vehicles are at the center of research attention due to the high expense of LiDAR systems compared to cameras and other sensors.

Current methods use multi-view RGB data collected from cameras around the vehicle and neurally “lift” features from the perspective images to the 2D ground plane, yielding a “bird's eye view” (BEV) feature representation of the 3D space around the vehicle. Recent research focuses on the way the features are lifted from images to the BEV plane. We instead propose a simple baseline model, where the “lifting” step simply averages features from all projected image locations, and outperform the current state of the art in vehicle BEV segmentation. Our ablations show that batch size, data augmentation, and input resolution play a large part in performance. Additionally, we reconsider the utility of radar input, which has previously been either ignored or found unhelpful. With a simple RGB-radar fusion module, we obtain a sizable boost in performance, approaching the accuracy of a LiDAR-enabled system.
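The “lift by averaging” baseline can be written down very compactly: each BEV cell accumulates the image features it projects to in every camera that sees it, then divides by the count. The sketch below does this with synthetic feature maps, shared toy intrinsics, and identity extrinsics for all cameras, so it illustrates only the averaging step, not the full model.

```python
# Toy "lift by averaging": fill each BEV ground-plane cell with the mean of the image
# features it projects to, over all cameras that see it. All shapes/values are synthetic.
import numpy as np

rng = np.random.default_rng(0)
N_CAMS, C, H, W = 2, 8, 32, 64                 # cameras, feature channels, feature-map size
BEV_X, BEV_Z = 20, 20                          # BEV grid (lateral x, forward z), in cells

feats = rng.standard_normal((N_CAMS, C, H, W)) # per-camera image features (e.g., CNN output)
K = np.array([[40.0, 0.0, W / 2],              # toy intrinsics shared by all cameras
              [0.0, 40.0, H / 2],
              [0.0, 0.0, 1.0]])

bev = np.zeros((C, BEV_Z, BEV_X))
counts = np.zeros((BEV_Z, BEV_X))

for cam in range(N_CAMS):                      # identity extrinsics assumed for simplicity
    for zi in range(BEV_Z):
        for xi in range(BEV_X):
            # A 3D point on the ground plane for this BEV cell (camera frame, y = 0).
            p = np.array([xi - BEV_X / 2, 0.0, zi + 1.0])
            u, v, d = K @ p
            if d <= 0:
                continue
            u, v = int(u / d), int(v / d)
            if 0 <= u < W and 0 <= v < H:              # cell visible in this camera
                bev[:, zi, xi] += feats[cam, :, v, u]
                counts[zi, xi] += 1

bev /= np.maximum(counts, 1)                   # average over all cameras that saw the cell
print("BEV feature tensor shape:", bev.shape)
```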

Javier Civera

Javier Civera is an Associate Professor at the University of Zaragoza in Spain, where he teaches courses on Machine Learning, Computer Vision, and SLAM. His research interests lie in computer vision, in particular the use of machine learning and multi-view geometry for localization and for scene reconstruction and understanding, with applications to robotics, wearable technologies, and augmented and virtual reality. He has co-authored more than 60 research papers on these topics and has served as an Associate Editor for IEEE ICRA (3x), IEEE/RSJ IROS (5x), IEEE T-ASE (2015-2017), IEEE T-RO (2020-), and IEEE RA-L (Senior Editor, 2020-).

Title: Uncertainty quantification on place recognition, multi-view correspondences and single-view depth

Abstract: Uncertainty quantification is a key aspect of many safety-critical applications, in particular those involving robots that act in the real world and can damage it, or themselves, or harm the people in it. However, this topic remained under-addressed in visual localization and reconstruction/mapping for many years. In this talk, I will address uncertainty in localization and mapping tasks, and in particular three of our recent works that cover different aspects of localization and mapping pipelines. Firstly, I will show how image embeddings used in place recognition can be modeled as distributions via a Bayesian triplet loss, allowing us to quantify the degree of certainty in a retrieved place. Secondly, I will show how an accurate model of the covariance of visual residuals (beyond isotropic and constant noise) improves the accuracy of bundle adjustment and makes information metrics usable. Finally, I will show our results and novel developments on uncertainty quantification in supervised and self-supervised single-view depth learning, with special focus on the medical domain.
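The second point, replacing the usual isotropic noise assumption with a full residual covariance, amounts to whitening the residual before squaring it. The sketch below contrasts the two costs for a single 2D reprojection residual; the covariance values are invented, whereas in the talk the noise model is estimated from data.

```python
# Covariance-weighted vs. isotropic residual cost for one 2D reprojection residual,
# as would appear in a bundle adjustment objective. Numbers are illustrative only.
import numpy as np

r = np.array([1.2, -0.4])                       # 2D reprojection residual (pixels)

sigma_iso = 1.0                                 # isotropic assumption: Sigma = sigma^2 * I
cost_iso = (r @ r) / sigma_iso**2               # standard squared-error term

Sigma = np.array([[2.5, 0.8],                   # anisotropic, correlated pixel noise (assumed)
                  [0.8, 0.6]])
L = np.linalg.cholesky(np.linalg.inv(Sigma))    # whitening factor: L @ L.T = Sigma^{-1}
r_whitened = L.T @ r
cost_full = r_whitened @ r_whitened             # Mahalanobis norm r^T Sigma^{-1} r

print("isotropic cost:", cost_iso)
print("covariance-weighted cost:", cost_full)
```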