Out of Sight, Still in Mind: Reasoning and Planning about Unobserved Objects with Video Tracking Enabled Memory Models


International Conference on Robotics and Automation (ICRA) 2024

Abstract

Robots need a memory of previously observed but currently occluded objects to work reliably in realistic environments. We investigate the problem of encoding object-oriented memory into a multi-object manipulation reasoning and planning framework. We propose DOOM and LOOM, which leverage transformer relational dynamics to encode the history of trajectories given partial-view point clouds and an object discovery and tracking engine. Our approaches can perform multiple challenging tasks, including reasoning about occluded objects, the appearance of novel objects, and object reappearance. Throughout extensive simulation and real-world experiments, we find that our approaches perform well across varying numbers of objects and varying numbers of distractor actions.
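As a rough, hypothetical sketch of this idea (our own illustration, not the authors' released code), an object-centric memory could keep one token per tracked object per frame, encode the history with a transformer, and score pairwise relations from the latest per-object latents. All names and tensor shapes below are our assumptions:

```python
import torch
import torch.nn as nn

class ObjectMemoryModel(nn.Module):
    """Toy transformer over per-object history tokens with a pairwise
    relation head. Occluded objects keep their last tracked embedding,
    so they stay 'in mind' while out of sight."""
    def __init__(self, feat_dim=128, num_heads=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.relation_head = nn.Linear(2 * feat_dim, 1)

    def forward(self, tokens, num_objects):
        # tokens: (batch, frames * num_objects, feat_dim), one token per
        # tracked object per frame, ordered frame by frame.
        latents = self.encoder(tokens)
        current = latents[:, -num_objects:]  # latest latent per object
        i, j = torch.triu_indices(num_objects, num_objects, offset=1)
        pairs = torch.cat([current[:, i], current[:, j]], dim=-1)
        return torch.sigmoid(self.relation_head(pairs)).squeeze(-1)

model = ObjectMemoryModel()
tokens = torch.randn(1, 4 * 3, 128)        # 4 frames, 3 tracked objects
print(model(tokens, num_objects=3).shape)  # (1, 3): one score per object pair
```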

Approach Overview

Real-Robot Experiments

We show an example of how our model reasons about objects occluded inside a container, even after the container has been moved slightly.

History

Goal: Contact(apple, orange) = 1
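Goals throughout these examples are logical predicates over object pairs with a desired truth value. A minimal sketch of how such a goal could be represented and checked follows; the class and the dict-based relation interface are our assumptions, not the paper's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Goal:
    predicate: str   # e.g. "Contact", "Left", "Above", "Boundary"
    subject: str     # e.g. "apple"
    reference: str   # e.g. "orange"
    target: int      # desired truth value: 1 (holds) or 0 (does not hold)

def goal_satisfied(goal, relations):
    # relations: dict mapping (predicate, subject, reference) -> 0/1,
    # e.g. thresholded outputs of a relation classifier over the memory.
    return relations.get((goal.predicate, goal.subject, goal.reference)) == goal.target

g = Goal("Contact", "apple", "orange", 1)
print(goal_satisfied(g, {("Contact", "apple", "orange"): 1}))  # True
```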

We show how our model can reason about both unseen objects and the appearance of novel objects.

History

Goal: Boundary(all objects, table) = 1

History

Goal: Above(all objects, table) = 0

We show how our framework can reason about occluded objects and distinguish multiple objects with the same appearance.

History

Goal: Left(blue cup on the shelf, mustard) = 1

In this example, we show how our approaches can reason about the reappearance of an object.

History

Goal: Contact(green box, table) = 1

Simulation Training and Test

Two examples from our training dataset. We train with a maximum of 5 segments, including objects and the environment. (All videos play at 4x speed.)

Training 1

Training 2

The test example has 8 segments with different shapes and a novel viewpoint.

Test (History)

Test (Planning)

Goal: Right(orange, mug) = 1

Relations are defined from the robot's viewpoint.
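To reach such a goal, one simple strategy is greedy action selection against the learned model: score each candidate action by the predicted probability that the goal relation holds afterwards and execute the best one. This is only an illustrative sketch, and `score_fn` stands in for a rollout of the dynamics model plus the relation head (the paper's planner may differ):

```python
def plan_one_step(candidate_actions, score_fn):
    # Greedy one-step planner: keep the action whose predicted future
    # state maximizes the probability that the goal relation holds,
    # e.g. Right(orange, mug) = 1. `score_fn(action) -> float` is a
    # hypothetical interface around the learned dynamics + relation head.
    return max(candidate_actions, key=score_fn)

# Toy usage with hard-coded scores standing in for model rollouts:
actions = ["push_orange_right", "push_mug_left", "noop"]
toy_scores = {"push_orange_right": 0.9, "push_mug_left": 0.6, "noop": 0.1}
print(plan_one_step(actions, toy_scores.get))  # -> "push_orange_right"
```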