SceneScore:
Learning a Cost Function for Object Arrangement
Contents
Abstract | Optimisation Video | Real-World Scenes | Supplementary Material | Code
Abstract
Arranging objects correctly is a key capability for robots, one that unlocks a wide range of useful tasks. A prerequisite for creating successful arrangements is the ability to evaluate the quality of a given arrangement. Our method SceneScore learns a cost function for arrangements, such that high-quality, human-like arrangements have a low cost. We learn the distribution of training arrangements offline using an energy-based model, solely from example images, without requiring environment interaction or human supervision. Our model is a graph neural network that learns object-object relations, operating on graphs constructed from images. Experiments demonstrate that the learned cost function can predict poses for missing objects, generalise to novel objects using semantic features, and be composed with other cost functions to satisfy constraints at inference time.
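The core idea of scoring an arrangement by summing learned pairwise terms over an object graph can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the quadratic edge term and identity weights stand in for the graph network's learned edge function.

```python
import numpy as np

def pairwise_energy(xi, xj, w):
    """Toy learned pairwise term: a quadratic in the relative position.
    (Illustrative stand-in for a learned graph-network edge function.)"""
    rel = xi - xj
    return float(rel @ w @ rel)

def scene_energy(positions, w):
    """Sum edge energies over the fully connected object graph.
    Lower energy corresponds to a more 'tidy' arrangement."""
    n = len(positions)
    return sum(pairwise_energy(positions[i], positions[j], w)
               for i in range(n) for j in range(i + 1, n))

# Three objects in 2D; identity weights simply penalise spread-out scenes.
pos = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
w = np.eye(2)
print(scene_energy(pos, w))  # 3.5
```

In the paper this scalar would come from a trained energy-based model; the key property used downstream is only that the energy is differentiable with respect to object poses.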
Optimisation Video
This video shows the dining table scene, in which two objects initially have untidy poses. The method optimises the poses of the movable objects by following the gradient of the learned cost function using Langevin dynamics. Further details are in the paper.
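A Langevin-dynamics update follows the negative gradient of the cost while injecting Gaussian noise. The sketch below uses a hand-written quadratic cost (squared distance of a movable object to a tidy pose at the origin) purely as a stand-in for the learned cost; the step size and noise scale are illustrative choices, not the paper's settings.

```python
import numpy as np

def langevin_step(x, grad_fn, rng, step=0.01, noise_scale=0.005):
    """One Langevin update: a gradient step on the cost plus Gaussian noise."""
    return x - step * grad_fn(x) + noise_scale * rng.normal(size=x.shape)

# Toy cost standing in for the learned model: distance^2 to a tidy pose at 0.
cost = lambda x: float(x @ x)
grad = lambda x: 2.0 * x

rng = np.random.default_rng(0)
x = np.array([2.0, -1.5])          # untidy initial pose
for _ in range(500):
    x = langevin_step(x, grad, rng)
print(cost(x))  # small: the pose has been pulled near the cost minimum
```

With the learned model, `grad_fn` would be the gradient of the energy network with respect to the movable objects' poses, obtained by automatic differentiation.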
Real-World Scenes
In the training examples, the further the pen is from the book, the further the mug is from the book as well. This is shown in the images below. The model must learn these object-object relations. The learned cost function is visualised in the video below. Further details are in the Supplementary Material document.
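The correlation described above (pen-book distance tracking mug-book distance) can be expressed as a tiny hand-written cost for intuition. This is only an illustration of the kind of relation the model must discover from data; the function and object names here are not part of the learned model.

```python
import numpy as np

def relation_cost(pen, book, mug):
    """Penalise arrangements where the mug-book distance does not track
    the pen-book distance. (Hand-written illustration of the relation;
    SceneScore learns such relations from images rather than coding them.)"""
    d_pen = np.linalg.norm(pen - book)
    d_mug = np.linalg.norm(mug - book)
    return float((d_pen - d_mug) ** 2)

book = np.array([0.0, 0.0])
pen  = np.array([0.3, 0.0])
tidy   = relation_cost(pen, book, np.array([0.0, 0.3]))  # distances match
untidy = relation_cost(pen, book, np.array([1.0, 0.0]))  # distances mismatch
print(tidy, untidy)  # the matching arrangement scores lower
```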
The model can generalise the learned object-object relations to pencils (a new class) at inference time using semantic features, as shown in the video below:
Supplementary Material
Code
The code, along with the full image datasets, can be found here.