Jiyong Ahn, Sanghyeon Son, Dongryung Lee, Jisu Han, Dongwon Son, Beomjoon Kim
Intelligent Mobile-Manipulation Lab, KAIST
Abstract
A robot operating in a partially observable environment must perform sensing actions to achieve a goal, such as clearing the objects in front of a shelf to better localize a target object at the back and estimate its shape for grasping. A POMDP is a principled framework for enabling robots to perform such information-gathering actions. Unfortunately, while robot manipulation domains involve high-dimensional and continuous observation and action spaces, most POMDP solvers are limited to discrete spaces. Recently, POMCPOW was proposed for continuous POMDPs; it handles continuity using sampling and progressive widening. However, for robot manipulation problems involving camera observations and multiple objects, POMCPOW is too slow to be practical. We take inspiration from recent work on learning to guide task and motion planning and propose a framework that learns to guide POMCPOW from past planning experience. Our method uses preference learning that utilizes both success and failure trajectories, where the preference labels are given by the results of the tree search. We demonstrate the efficacy of our framework in several continuous partially observable robotics domains, including real-world manipulation, where our framework explicitly reasons about the uncertainty in off-the-shelf segmentation and pose estimation algorithms.
Video Overview
Unguided Search
Imitation Learning (IGP)
Ours: Preference Learning (PGP)
Approach
We propose an alternative data-efficient technique for learning a value function based on the following two observations. First, a search tree for a POMDP typically consists of a few success histories that led to a goal and a large number of other histories that did not. Second, to efficiently guide a tree search, all we need is a ranking among histories that specifies which one is more likely to lead to a goal, not their actual values, since the sole purpose of a value function in tree search is to set this exploration priority. Based on these two observations, we propose a value function learning algorithm that learns the ranking among histories.
We use a preference function instead of a regressor because ranking suffices to determine the exploration priority among nodes. In preference-based reward learning, the learner is presented with two trajectories and an oracle indicates which one is preferred. A reward function is then trained to assign a greater reward to the preferred trajectory than to the other. We adapt this concept to value function learning, where preference labels are derived from the outcomes of tree searches. Within a search tree, we select a history that reached a goal, termed a success history, and another that did not, termed a failure history. We learn a value function that favors the success history over the failure history. One limitation of this straightforward success-and-failure labeling is that it does not account for optimality. To address this, we generate additional data by pairing two success histories and labeling the one closer to the goal as preferred. We find that in the limited-data regime, preference learning is more robust than regression: because it depends only on the ordering of values rather than their exact differences, it is less sensitive to noise, whereas regression exhibits higher variance.
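To make the pairwise objective concrete, below is a minimal sketch of this kind of preference-based value learning, written in PyTorch purely for illustration. The network architecture, the history encoding, and names such as HistoryValueNet and preference_loss are hypothetical placeholders under our assumptions, not the paper's exact implementation; the loss shown is a standard Bradley-Terry-style pairwise objective that pushes the value of the preferred history above the other.

```python
# Sketch (assumed PyTorch): pairwise preference loss for a history value network.
import torch
import torch.nn as nn

class HistoryValueNet(nn.Module):
    """Scores an encoded history; a higher score means higher exploration priority."""
    def __init__(self, history_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(history_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h).squeeze(-1)

def preference_loss(value_net: nn.Module,
                    preferred: torch.Tensor,
                    other: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss over preference pairs mined from search trees.

    A pair is, e.g., (success history, failure history) from the same tree, or
    (success history closer to the goal, success history farther from the goal).
    """
    v_pref = value_net(preferred)
    v_other = value_net(other)
    # -log sigmoid(v_pref - v_other) is minimized when the preferred history
    # is assigned the higher value.
    return -torch.nn.functional.logsigmoid(v_pref - v_other).mean()

# Usage sketch with dummy encoded histories (batch of 32, feature dim 64).
value_net = HistoryValueNet(history_dim=64)
opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)
preferred = torch.randn(32, 64)   # e.g. encodings of success histories
other = torch.randn(32, 64)       # e.g. encodings of failure histories
loss = preference_loss(value_net, preferred, other)
opt.zero_grad(); loss.backward(); opt.step()
```

Because the loss depends only on the difference of the two scores, the learned values need only order histories correctly, which is exactly what the tree search requires.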
Result
The figure on the right shows the results on the three domains. The preference-based approaches, PGP and SF-PGP, achieve significantly higher success rates than IGP when trained with the same number of tree searches and the same compute budget, supporting our hypotheses.
Additional Links