Program

Workshop Schedule

(All times US Pacific)

8:45am - 8:50am

[LIVE]

Welcome Message

8:50am - 9:10am

[Recorded]

Keynote: Davide Scaramuzza

Event cameras are bio-inspired sensors with microsecond latency, a much larger dynamic range, and one thousand times lower power consumption than standard cameras. I will give a short tutorial on event cameras and show their applications in vision, drones, and cars.
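
As a rough illustration of what an event camera produces (my sketch, not material from the talk), each pixel asynchronously emits an event when it detects a brightness change; assuming events arrive as (x, y, timestamp, polarity) tuples, a minimal accumulation into a frame looks like this:

    import numpy as np

    def accumulate_events(events, height, width):
        """Sum (x, y, timestamp, polarity) events into a 2D frame.

        Each event marks a brightness increase (+1) or decrease (-1) at one
        pixel; summing polarities gives a simple visualization of the stream.
        """
        frame = np.zeros((height, width), dtype=np.int32)
        for x, y, t, polarity in events:
            frame[y, x] += 1 if polarity > 0 else -1
        return frame

    # Hypothetical stream: three events at two pixels within ~30 microseconds.
    events = [(10, 5, 10e-6, +1), (10, 5, 25e-6, +1), (11, 5, 30e-6, -1)]
    print(accumulate_events(events, height=16, width=16)[5, 10:12])  # [ 2 -1]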

9:10am - 9:30am

[Recorded]

Keynote: Jean-François Lalonde

Combining virtual and real visual elements into a single, realistic image requires accurate estimation of the lighting conditions of the real scene. Unfortunately, doing so typically requires specific capture devices or physical access to the scene. In this talk, I will present approaches for automatically estimating indoor lighting from a single image. In particular, I will present two recent works that frame lighting estimation as a learning problem: 1) an approach that estimates a non-parametric lighting representation from a single image; and 2) another method that instead estimates a more intuitive parametric lighting representation, which allows for spatially-varying relighting. In both cases, large datasets of omnidirectional images (360° panoramas) are leveraged for training the models. I will show that using our illumination estimates for applications like 3D object insertion can achieve photo-realistic results in a wide variety of challenging scenarios.

9:30am - 9:50am

[Recorded]

Keynote: Kristen Grauman

Audio is an omnidirectional counterpart to any visual input: we may see only what is in front of us, but we can hear sounds from all around. This talk presents two recent developments on linking audio and visual data for enriched spatial understanding of 3D scenes. First, I present our work on 2.5D visual sound, where we infer the binaural audio stream for a video given only monaural sound. The result is immersive first-person sound for novel videos, including a dataset of 360 YouTube clips. Second, I present our work on "Visual Echoes," where we learn visual representations from echolocation. Without requiring audio at test time, the resulting image features are comparable to, or even outperform, heavily supervised pre-training methods on multiple fundamental spatial tasks (depth prediction, surface normals, and visual navigation).

9:50am - 10:10am

[Recorded]

Keynote: Daniel Cremers

The reconstruction of our 3D world from moving cameras is among the central challenges in computer vision. I will present recent developments in camera-based reconstruction of the world. In particular, I will discuss direct methods for visual SLAM (simultaneous localization and mapping). These recover camera motion and 3D structure directly from brightness consistency, thereby providing better precision and robustness than classical keypoint-based techniques. Moreover, I will demonstrate how we can leverage the predictive power of deep networks in order to significantly boost the performance of direct SLAM methods. The resulting methods allow us to track a single camera with a precision that is on par with state-of-the-art stereo-inertial odometry methods. Moreover, we can relocalize a moving vehicle with respect to a previously generated map despite significant changes in illumination and weather.
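
For context only (this background formula is my addition, not taken from the talk), direct methods typically estimate camera motion by minimizing a photometric error over image pixels rather than matching keypoints; one common form, with relative pose \xi, per-pixel depth d_p, and projection \pi, is

    E(\xi) = \sum_{p \in \Omega} \big( I_{\mathrm{ref}}(p) - I\big( \pi( T(\xi)\, \pi^{-1}(p, d_p) ) \big) \big)^2 ,

i.e., a point is back-projected from the reference image, transformed by the candidate pose, re-projected into the second image, and the brightness difference is penalized. Exact formulations (robust norms, affine brightness terms) vary between systems.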

10:10am - 10:40am

[LIVE]

LIVE Q&A Session with Davide Scaramuzza, Jean-François Lalonde, Kristen Grauman, and Daniel Cremers

10:40am - 11:00am

[Recorded]

Keynote: Matthias Nießner

In this talk, I will discuss how we can use 3D data to self-supervise existing problems, and how panoramic images in particular, such as those from the Matterport3D dataset, can fuel many computer vision tasks. I will further talk about how to leverage these ideas in the context of 3D shape reconstruction / completion, obtaining high-quality 3D models even when no ground-truth data is available for real-world scans.

11:00am - 11:20am

[Recorded]

Keynote: Taco Cohen

In this talk, I will discuss omnidirectional CNNs from the perspective of equivariant convolutional networks. I will explain the notions of equivariance and symmetry in general, and discuss why they are relevant in machine learning and (omnidirectional) vision. After presenting some examples of planar rotation-equivariant CNNs, we will discuss Spherical CNNs as well as several recent advances in building highly efficient and flexible CNNs on spheres and other curved surfaces, including the Icosahedral CNN, Mesh CNNs, and Spherical Gauge CNNs.
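
As a brief reminder of the general notion (added for context; the talk's specific constructions are not reproduced here), a map \Phi is equivariant to a transformation group G if transforming the input and then applying \Phi gives the same result as applying \Phi and then transforming the output:

    \Phi(T_g\, x) = T'_g\, \Phi(x) \quad \text{for all } g \in G,

where T_g and T'_g denote the actions of g on the input and output spaces. For ordinary planar CNNs, G is the group of image translations; spherical and gauge CNNs extend the idea to rotations of the sphere and to local frames on curved surfaces.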

11:20am - 11:40am

[Recorded]

Keynote: Tomas Pajdla

We will present an overview of camera pose computation from images acquired by cameras with a rolling shutter (RS). The majority of contemporary cameras have a rolling shutter. When an RS camera moves during image acquisition, a very complex projection geometry arises. We will present recent results and applications related to generalizing the classical computer vision (3-point) problem of finding the absolute camera pose to RS cameras.
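
As background (a standard model added for illustration, not a description of the talk's solvers), a rolling-shutter camera exposes image rows sequentially, so the camera pose depends on the row v being read out; with start time t_0 and per-row readout time \tau, the projection of a world point \mathbf{X} landing on row v can be written as

    \lambda\, \mathbf{x}(v) = K \left[\, R(t_0 + v\,\tau) \;|\; \mathbf{t}(t_0 + v\,\tau) \,\right] \mathbf{X},

so every row is captured under a slightly different pose, which is what makes absolute pose estimation with RS cameras harder than the classical 3-point problem.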

11:40am - 12:10pm

[LIVE]

LIVE Q&A Session with Matthias Nießner, Taco Cohen, and Tomas Pajdla

LUNCH BREAK

1:30pm - 2:00pm

[Recorded]

1:30pm - 1:40pm

Toward real-world panoramic image enhancement

Yupeng Zhang, Hengzhi Zhang, Daojing Li, Liyan Liu, Hong Yi, Wei Wang, Hiroshi Suitoh, Makoto Odamaki

In this work, I describe a real-world panoramic image enhancement method and discuss whether and how the image quality of a fisheye camera can be improved to the level of a high-end camera.

1:40pm - 1:50pm

A Deep Physical Model for Solar Irradiance Forecasting with Fisheye Images

Vincent Le Guen, Nicolas Thome

We incorporate prior physical knowledge into a deep neural network to accurately forecast solar irradiance from fisheye images.

1:50pm - 2:00pm

Upright and Stabilized Omnidirectional Depth Estimation for Wide-baseline Multi-camera Inertial Systems

Changhee Won, Hochang Seok, Jongwoo Lim

Upright and stabilized omnidirectional depth estimation via alignment of the rig pose to the gravity direction.

2:00pm - 2:20pm

[Recorded]

Keynote: Hirochika Fujiki

360 cameras are now affordable and easy for consumers to use. As a result, applications and research in 360 imaging have grown in number. I am going to talk about our product line, THETA, and its applications for businesses.

360 cameras are now used by businesses and researchers as well as general consumers. In the business area, particularly the real estate market, 360 images are used to advertise properties by creating 360 virtual tours. In academia, computer vision researchers are exploring novel aspects of 360 imaging.

We have developed applications for the real estate market and provided smart tools that use deep learning techniques to create enhanced ads from 360 images. In this talk, I will show the 360 computer vision algorithms used in real products, explain the algorithms themselves, and present our research.

2:20pm - 2:40pm

[Recorded]

Keynote: Rajat Aggarwal

In the last two decades, we have seen the growth of many different omnidirectional data capturing technologies, including catadioptric cameras and multi-sensor camera approaches. At DreamVu, we took a step forward and made omnidirectional stereo cameras a reality using just a single imaging sensor and mirrored binocular optics. In this talk, we will see DreamVu's innovative catadioptric solution, which has revolutionized visual intelligence in futuristic robotics, smart cities, and healthcare. Catadioptric systems pose certain challenges that make them difficult to scale to real applications, such as low resolution, complex calibration, low lighting, and high dynamic range. I will present DreamVu's solutions to these problems at an industrial level, which enable the growth of such cameras in the future. Omnidirectional stereo data is also uncommon for traditional computer vision methods, which makes it challenging to scale and commercialize. Finally, we will see how omnidirectional stereo data can be transformed and adapted for state-of-the-art computer vision and deep learning methods.

2:40pm - 3:10pm

[LIVE]

LIVE Q&A Session with Hirochika Fujiki and Rajat Aggarwal

3:10pm - 3:50pm

[Recorded]

3:10pm - 3:20pm

ArUcOmni: Detection of highly reliable fiducial markers in panoramic images

Jaouad Hajjami, Jordan Caracotte, Guillaume Caron, Thibault Napoléon

Real-time augmented reality in hypercatadioptric and fisheye images, using ArUcOmni for marker detection.

3:20pm - 3:30pm

RAPiD: Rotation-Aware People Detection in Overhead Fisheye Images

Zhihao Duan, M. Ozan Tezcan, Hayato Nakamura, Prakash Ishwar, Janusz Konrad

RAPiD detects people in arbitrary orientations, outperforming SOTA methods. A new fisheye video dataset for people detection and tracking is introduced.

3:30pm - 3:40pm

Unsupervised Learning of Metric Representations with Slow Features from Omnidirectional Views

Mathias Franzius, Benjamin Metka, Muhammad Haris, Ute Bauer-Wersing

Mapping and localization with unsupervised learning from omnidirectional images and noisy odometry. Autonomous and efficient mapping to a metric position space.

3:40pm - 3:50pm

Deep Lighting Environment Map Estimation from Spherical Panoramas

Vasileios Gkitsas, Nikolaos Zioulis, Federico Alvarez, Dimitrios Zarpalas, Petros Daras

Omnidirectional lighting estimation from a single monocular spherical panorama with uncoupled datasets, exploiting image-based relighting.

3:50pm - 4:10pm

[Recorded]

Keynote: Ganesh Sistu

Fisheye cameras provide a large horizontal field of view of 190°. A suite of four cameras, one placed on each side of the vehicle, provides complete 360° near-field perception, commonly referred to as a ‘surround view system’. It is necessary for enabling automated parking, low- and high-speed maneuvering, and emergency braking. In this talk, I will present the design and implementation of an industrial automated driving system from the perspective of surround view camera perception. In spite of their prevalence in commercial vehicles, there is no public dataset available for surround view cameras to enable systematic research. We discuss our fisheye multi-task learning dataset, WoodScape, which will be made public. We then discuss the various tasks necessary for a complete visual perception system and share our results. Finally, we share our experience building a multi-task learning system for all the tasks, specifically our investigation of architectures, loss functions, and training strategies.

4:10pm - 4:30pm

[Recorded]

Keynote: Sing Bing Kang

Zillow is the leading real estate and rental marketplace dedicated to empowering consumers with data, inspiration and knowledge around the place they call home, and connecting them with the best local professionals who can help. Zillow serves the full lifecycle of owning and living in a home: buying, selling, renting, financing, remodeling and more.

Within Zillow, the Rich Media Experiences (RMX) group is tasked with providing shoppers with the tools they need to find, experience, and visualize homes. Images, in particular omnidirectional images, are exceedingly important for developing such tools. In this talk, I will describe some of RMX's applied research efforts on omnidirectional vision.

One key product from RMX is the Zillow 3D Home®, which allows the shopper to remotely tour homes through inter-connected omnidirectional/panoramic images. Omnidirectional images are captured using an omnidirectional RGB camera; alternatively, 360-degree panoramas are generated by stitching images captured by panning a smartphone. Omnidirectional images captured for a tour are subsequently used to generate annotated 2D floor plans.

To promote research on using omnidirectional images for real estate, we are releasing the Zillow Indoor Database that consists of annotated omnidirectional images and floor plans.

4:30pm - 4:50pm

[Recorded]

Keynote: Gang Hua

Although e-commerce has accounted for a growing share of total retail sales over the past decades, right before the COVID-19 pandemic traditional brick-and-mortar stores still accounted for over 80% of total sales. Certain categories of retail, like convenience stores, still offer a unique value that online retailers cannot compete with: the ability to quickly and easily walk in and out in just a few minutes. I will present a perspective on convenience stores that leverage the power of artificial intelligence to make operations more profitable. The central idea is to shift operational decisions from humans to fully digital, data-driven AI.

To achieve this, the first step is to transfer real-world physical information from the brick-and-mortar stores into the digital space, where computer vision, empowered by 360 cameras, will play a central role. Then, we can build and apply mathematical models to make optimized operational decisions. Next, we must relay these decisions to be executed in the physical store, where computer vision plays a second important role in ensuring that the changes executed in the physical stores match what we configured them to be in the digital space.

These three steps comprise a co-robotic, or cyber-physical, system. Hence, we can view AI-empowered large-scale convenience stores as gigantic distributed co-robotic or cyber-physical systems. I will elaborate on this by sharing the results and insights obtained from several R&D projects at Wormpex AI Research for Bianlifeng, a new and fast-growing convenience store chain based in China.

4:50pm - 5:20pm

[LIVE]

LIVE Q&A Session with Ganesh Sistu, Sing Bing Kang, and Gang Hua