Program

The workshop will take place on October 1st on the IROS platform: https://iros2021.gcon.me/


The workshop features invited talks with pre-recorded videos (already available on the platform). If you have any questions about the talks, please join us in the Q&A session. There are also short live spotlight talks from contributed papers.


All times are in CEST


Invited talks:

Goldilocks and the Robot Brain - Steven LaValle

This talk considers an egocentric view of robot development, taking into account the robot's space of possible environments and its specific tasks. How much does a robot need to sense and remember to interact with its environment? The emphasis is on determining the minimal amount of information necessary to solve tasks, thereby giving the robot the smallest possible "brain". At one extreme, strong geometric information is sensed and encoded, leading to problems such as classical motion planning. On the path to minimalism, weak geometric information is considered in the form of combinatorial or relational sensing and filtering. Eventually, topological and set-based representations are considered at the minimalist extreme.
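
As a loose illustration of combinatorial filtering (this toy example is not taken from the talk), the Python sketch below tracks which of three rooms a robot occupies using nothing but two binary beam sensors; the room layout, the beam placement and all names are assumptions made for the sketch.

    # Toy combinatorial filter: the belief is just a set of possible rooms.
    # Assumed layout: rooms A-B-C in a row, beam1 between A and B, beam2 between B and C.
    BEAMS = {
        "beam1": {"A": "B", "B": "A"},
        "beam2": {"B": "C", "C": "B"},
    }

    def update(belief, beam):
        """One filter step: keep only hypotheses consistent with crossing this beam."""
        return {BEAMS[beam][room] for room in belief if room in BEAMS[beam]}

    belief = {"A", "B", "C"}          # total uncertainty, no geometry stored
    for beam in ["beam1", "beam2"]:   # the robot crosses beam1, then beam2
        belief = update(belief, beam)
        print(belief)                 # possibilities shrink to {A, B}, then {C}

Despite storing no coordinates at all, the filter pins the robot down to a single room after two beam crossings, which is the flavour of minimal information the talk is concerned with.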

Egocentric Affordance and Skill determination from video - Walterio Mayol-Cuevas

Video: https://www.youtube.com/watch?v=mqA_2tbAEOc

In this talk I will discuss recent work on the detection of geometric affordances from a single example, as well as methods related to skill determination.

Prof. Walterio Mayol-Cuevas received the B.Sc. degree from the National University of Mexico and the Ph.D. degree from the University of Oxford. He is a member of the Department of Computer Science at the University of Bristol, UK, and a Principal Research Scientist at Amazon, USA. His research with students and collaborators proposed some of the earliest versions and applications of visual simultaneous localization and mapping (SLAM) for robotics and augmented reality. More recently, he has been working on visual understanding of skill in video, new human-robot interaction metaphors, and computer vision for Pixel Processor Arrays. He was General Co-Chair of BMVC 2013 and General Chair of IEEE ISMAR 2016, and is a topic editor of an upcoming Frontiers in Robotics and AI title on environmental mapping.

Visual learning from interaction tasks - Danica Kragic

An integral ability of any robot is to act in the environment and to interact and collaborate with people and other robots. Interaction between two agents builds on the ability to engage in mutual prediction and signaling. Thus, human-robot interaction requires a system that can interpret and make use of human signaling strategies in a social context. In such scenarios, there is a need for an interplay between processes such as attention, segmentation, object detection, recognition and categorization in order to interact with the environment. In addition, the parameterization of these processes is inevitably guided by the task or the goal the robot is supposed to achieve. In this talk, I will present the current state of the art in robot perception and interaction and discuss open problems in the area. I will also show how visual input can be integrated with proprioceptive, tactile and force-torque feedback in order to plan, guide and assess the robot's actions and interaction with the environment. For interaction, we employ a deep generative model that makes inferences over future human motion trajectories given the human's intention, the motion history and the task setting of the interaction. With the help of predictions drawn from the model, we can determine the most likely future motion trajectory and make inferences over intentions and objects of interest.
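
As a rough illustration of this last point (and emphatically not the deep generative model used in the work), the Python sketch below replaces the learned model with intention-conditioned trajectory prototypes plus Gaussian noise; the prototype shapes, the noise level and all names are assumptions made for the sketch.

    import numpy as np

    # Hypothetical stand-in for a learned motion model: each intention is a
    # prototype 1-D trajectory, and observations are prototype plus Gaussian noise.
    prototypes = {
        "reach_left":  np.linspace(0.0, -1.0, 10),
        "reach_right": np.linspace(0.0,  1.0, 10),
    }
    sigma = 0.1  # assumed observation noise

    def infer(history):
        """Posterior over intentions given the first len(history) steps (uniform prior)."""
        t = len(history)
        logp = {k: -np.sum((history - p[:t]) ** 2) / (2 * sigma ** 2)
                for k, p in prototypes.items()}
        m = max(logp.values())
        z = sum(np.exp(v - m) for v in logp.values())
        return {k: float(np.exp(v - m) / z) for k, v in logp.items()}

    observed = np.array([0.0, 0.12, 0.21])       # partially observed human motion
    posterior = infer(observed)                  # inference over intentions
    best = max(posterior, key=posterior.get)     # most likely intention
    future = prototypes[best][len(observed):]    # most likely remaining trajectory
    print(posterior, future)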

3D Scene Understanding: From point-based to object-based representations of the world - Alessio del Bue

Autonomous systems have to understand the 3D spatial layout of the world they navigate and interact with. In order to fully operate in the wild, a fundamental step is to build representations of the 3D world that are reliable and can be generalised to every scenario. In this lecture we will provide a walkthrough of recent advancements in generating 3D models of the world that are semantically meaningful and can be used to solve high-level tasks. We will first provide the fundamentals of 3D geometry and show how objects can be localised from multiple views using structure-from-motion principles. Then, this information can be used to build 3D scene graphs linked to the physical world using Graph Neural Networks that encode both the geometric structure and the visual appearance of the objects present in the scene. Finally, we will demonstrate how these models can be effective for several tasks such as camera re-localisation, active visual search and Visual Question Answering.
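
For readers unfamiliar with how a Graph Neural Network can mix geometry and appearance over a scene graph, the Python sketch below shows a single mean-aggregation message-passing step over a tiny, hand-made graph; the features, adjacency and weights are placeholders, and the layer is far simpler than the models discussed in the lecture.

    import numpy as np

    # Illustrative scene graph: 3 objects, each with a feature vector standing in
    # for concatenated geometric (e.g. 3D position/size) and appearance descriptors.
    X = np.array([[0.0, 1.0, 0.2],
                  [1.0, 0.0, 0.8],
                  [0.5, 0.5, 0.5]])
    A = np.array([[0, 1, 1],          # adjacency: which objects are spatially related
                  [1, 0, 0],
                  [1, 0, 0]], dtype=float)

    def message_passing(X, A, W):
        """One mean-aggregation GNN layer: mix each node with its neighbours, then project."""
        deg = A.sum(axis=1, keepdims=True) + 1e-9
        neighbour_mean = (A @ X) / deg
        return np.maximum(0.0, (X + neighbour_mean) @ W)   # ReLU non-linearity

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.5, size=(3, 3))   # randomly initialised weights for the sketch
    H = message_passing(X, A, W)             # updated, context-aware object embeddings
    print(H)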

Studying child visual learning with egocentric computer vision - David Crandall

Lightweight, inexpensive wearable cameras and gaze trackers allow us to capture a good approximation of a person's first-person (egocentric) field of view. This view is unique compared to those traditionally studied in computer vision, offering both challenges and opportunities. Because the camera is constantly moving, illumination conditions are often suboptimal and motion blur and object occlusion are common. However, the egocentric view provides a unique perspective for understanding how people interact with objects, one another, and the world around them. In this talk, I'll discuss recent work in which we use head-mounted cameras and eye gaze trackers to study how children learn words for new objects. We show that children's egocentric views have unique visual properties that may help them learn efficiently. We conduct simulation studies using ideal learner models to characterize and quantify these properties. Finally, we show how these properties may yield insights into how to improve modern computer vision algorithms and training datasets.
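
As a back-of-the-envelope illustration of the statistical idea behind such learner models (not the ideal learner models used in the studies), the Python sketch below implements a toy cross-situational learner that maps each heard word to the object it co-occurs with most often; the scenes and vocabulary are invented for the example.

    from collections import Counter

    # Each "scene" pairs the words a child hears with the objects in view.
    scenes = [
        ({"ball", "look"}, {"ball", "cup"}),
        ({"ball"},         {"ball", "dog"}),
        ({"cup", "look"},  {"cup", "dog"}),
        ({"cup"},          {"cup", "ball"}),
    ]

    # Count how often each word co-occurs with each object across all scenes.
    counts = {}
    for words, objects in scenes:
        for w in words:
            counts.setdefault(w, Counter()).update(objects)

    # Map each word to its most frequent co-occurring object.
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    print(lexicon)   # "ball" and "cup" are mapped to the right objects from statistics alone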