SceneScore:
Learning a Cost Function for Object Arrangement

Contents

Abstract | Optimisation Video | Real-World Scenes | Supplementary Material | Code

Abstract

Arranging objects correctly is a key capability for robots, one which unlocks a wide range of useful tasks. A prerequisite for creating successful arrangements is the ability to evaluate the quality of a given arrangement. Our method, SceneScore, learns a cost function over arrangements such that high-quality, human-like arrangements have a low cost. We learn the distribution of training arrangements offline using an energy-based model, solely from example images and without requiring environment interaction or human supervision. Our model is a graph neural network which learns object-object relations, operating on graphs constructed from images. Experiments demonstrate that the learned cost function can be used to predict poses for missing objects, can generalise to novel objects using semantic features, and can be composed with other cost functions to satisfy constraints at inference time.
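
As a rough illustration of this idea, the sketch below shows one way a graph-based energy over an arrangement could be structured: each object is a node, every pair of objects exchanges a learned message, and the aggregated messages are read out as a single scalar cost (low cost = tidy arrangement). The class name, feature dimensions, and fully connected graph are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class ArrangementEnergy(nn.Module):
    """Minimal sketch of a graph-based energy over an object arrangement.

    Assumes node features (e.g. object pose plus a semantic embedding) are
    precomputed and that every pair of objects is connected by an edge.
    All names and sizes are hypothetical, chosen only for illustration.
    """
    def __init__(self, node_dim, hidden_dim=128):
        super().__init__()
        self.edge_mlp = nn.Sequential(                 # message over each object pair
            nn.Linear(2 * node_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))
        self.readout = nn.Sequential(                  # scalar cost from graph feature
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1))

    def forward(self, nodes):                          # nodes: (N, node_dim)
        n = nodes.shape[0]
        src = nodes.unsqueeze(1).expand(n, n, -1)      # sender features
        dst = nodes.unsqueeze(0).expand(n, n, -1)      # receiver features
        messages = self.edge_mlp(torch.cat([src, dst], dim=-1))
        graph_feat = messages.mean(dim=(0, 1))         # aggregate object-object relations
        return self.readout(graph_feat).squeeze()      # scalar cost: lower = tidier
```

Because such a model outputs a single scalar, additional hand-specified costs (e.g. collision or workspace constraints) can simply be added to it at inference time, which is one way composition of cost functions can be realised.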

Optimisation Video

This video shows the dining table scene, in which two objects initially have untidy poses. The method optimises the poses of the movable objects by following the gradient of the learned cost function using Langevin Dynamics. Further details are in the paper.
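
For intuition, the sketch below shows a generic Langevin-style update that nudges the movable objects' poses down the gradient of a learned cost while injecting a small amount of noise. The function names, step size, noise scale, and number of steps are illustrative assumptions, not the settings used in the video.

```python
import torch

def langevin_optimise(poses, energy_fn, steps=200, step_size=1e-2, noise_scale=1e-2):
    """Refine movable-object poses by noisy gradient descent on a learned cost.

    poses:     (N, D) tensor of object poses to optimise (hypothetical layout).
    energy_fn: callable mapping poses to a scalar cost (e.g. a learned model).
    """
    poses = poses.clone().requires_grad_(True)
    for _ in range(steps):
        energy = energy_fn(poses)
        grad, = torch.autograd.grad(energy, poses)
        with torch.no_grad():
            poses -= 0.5 * step_size * grad                  # gradient step
            poses += noise_scale * torch.randn_like(poses)   # Langevin noise
    return poses.detach()
```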

Real-World Scenes

In the training examples, the further the pen is from the book, the further the mug is from the book as well, as shown in the images below. The model must learn these object-object relations. The learned cost function is visualised in the video below. Further details are in the Supplementary Material document.

pen_video_demo.mp4

The model can generalise the learned object-object relations to pencils (a new class) at inference time using semantic features, as shown in the video below:

pencil_video_demo.mp4

Supplementary Material

Code

The code, along with the full image datasets, can be found here.