Bayesian Constraint Inference from User Demonstrations Based on Margin-Respecting Preference Models


Dimitris Papadimitriou and Daniel S. Brown

Overview

Inferring the constraints under which autonomous agents operate is a topic of growing interest in the robotics community. Since specifying constraints by hand can be difficult, algorithms that infer them by observing experts operating in the environment are crucial. Such constraints arise, for instance, in autonomous driving applications or when wearable exoskeletons for motion assistance are fitted to individuals.

Image taken from the Victoria Transport Policy Institute

Image taken from the BBC

In our work we propose a constraint inference algorithm, called Preference-Based Bayesian Inverse Constraint Reinforcement Learning (PBICRL), that infers constraints from pairwise comparisons (preferences) of demonstrations. This preference-based setting yields a computationally efficient approach to constraint inference. More specifically, our contributions can be outlined as follows:

- We propose PBICRL, a Bayesian algorithm that infers constraints from pairwise preferences over demonstrations.

- We augment the Bradley-Terry preference model with margin hyperparameters m_ij that encode how strongly one demonstration is preferred over another (see the sketch after this list).

- We compare our approach to the classic Bayesian Inverse Reinforcement Learning (BIRL) algorithm as well as a state-of-the-art constraint inference algorithm, and we show that PBICRL outperforms both in a number of simulated environments.
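To make the second contribution concrete, here is a minimal sketch of a margin-augmented Bradley-Terry likelihood, assuming trajectory returns are linear in feature counts. The function name, the sigmoid form of the margin, and the inverse-temperature beta are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def preference_loglik(w, phi_i, phi_j, m_ij=0.0, beta=1.0):
    """Log-likelihood that trajectory j is preferred to trajectory i under a
    Bradley-Terry model with margin m_ij (m_ij = 0 recovers the standard
    model). Returns are assumed linear in feature counts: R(tau) = w . phi(tau)."""
    r_i, r_j = w @ phi_i, w @ phi_j
    # P(j preferred to i) = sigmoid(beta * (r_j - r_i - m_ij)): the preferred
    # trajectory must beat the other by at least the margin. logaddexp keeps
    # the log-sigmoid numerically stable.
    return -np.logaddexp(0.0, -beta * (r_j - r_i - m_ij))

# Example: a safe rollout (phi_j) preferred to one that hits a constraint (phi_i)
w = np.array([1.0, -2.0])
phi_i, phi_j = np.array([1.0, 1.0]), np.array([1.0, 0.0])
print(preference_loglik(w, phi_i, phi_j, m_ij=1.0))
```

With m_ij = 0 this reduces to the standard Bradley-Terry model; a larger margin demands a larger return gap before the preference is considered likely.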

Experiments

We carry out experiments in four simulation environments: 2D point-mass navigation, the Fetch-Reach robot, HalfCheetah, and Ant. We study both the case in which the environment features are known and the case in which the feature parameters are unknown and must be estimated.

In the following figures we show examples of the inferred weights associated with the environment features for the Fetch-Reach robot. The demonstrations given to us are shown in the left figure. They fall into three groups: the safe ones (green), the ones violating the orange constraint (orange), and the ones violating the red constraint (red). On the right we plot the values of the inferred weights, with BPL corresponding to the Bayesian Preference Learning baseline. Results for PBICRL are shown for both untuned (m_ij = 0) and tuned values of the modified Bradley-Terry margin hyperparameters. To tune the latter, we use feedback from the demonstrator that the margins between groups 1 and 2 and between groups 2 and 3 are approximately equal. The ground-truth values are shown in blue.



Demonstrations

Inferred Weights
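Since PBICRL is Bayesian, the inferred weights above come from a posterior distribution conditioned on the preference data. As a rough sketch of one way such a posterior could be sampled, the following Metropolis-Hastings loop reuses the margin-augmented likelihood from the earlier sketch and assumes a flat prior over weights and an illustrative data layout; none of the names or defaults reflect the paper's exact implementation:

```python
import numpy as np

def sample_posterior(prefs, phis, margins, n_steps=5000, step=0.05, beta=1.0, seed=0):
    """Metropolis-Hastings over feature weights w given pairwise preferences.
    `prefs` is a list of (i, j) pairs meaning trajectory j is preferred to
    trajectory i, `phis[k]` is the feature-count vector of trajectory k, and
    `margins[(i, j)]` is the margin hyperparameter m_ij (0 when untuned)."""
    rng = np.random.default_rng(seed)

    def loglik(w):
        # Sum of margin-augmented Bradley-Terry log-likelihoods.
        return sum(-np.logaddexp(0.0, -beta * (phis[j] @ w - phis[i] @ w
                                               - margins.get((i, j), 0.0)))
                   for i, j in prefs)

    d = len(next(iter(phis.values())))
    w, samples = np.zeros(d), []
    ll = loglik(w)
    for _ in range(n_steps):
        w_prop = w + step * rng.standard_normal(d)   # random-walk proposal
        ll_prop = loglik(w_prop)
        # Accept with probability min(1, exp(ll_prop - ll)); with a flat
        # prior the ratio is purely the likelihood ratio.
        if np.log(rng.uniform()) < ll_prop - ll:
            w, ll = w_prop, ll_prop
        samples.append(w.copy())
    return np.array(samples)

# Toy usage: trajectory 0 is preferred to trajectory 1
phis = {0: np.array([1.0, 0.0]), 1: np.array([1.0, 1.0])}
samples = sample_posterior(prefs=[(1, 0)], phis=phis, margins={})
print(samples.mean(axis=0))   # posterior mean of the weights
```

A random-walk proposal with a flat prior keeps the sketch short; in practice one would place a prior over the weights and fold its log-density into the acceptance ratio.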

Using the inferred weights from the baseline and from PBICRL (with tuned m_ij), we obtain rollouts for the same task. In the following figures we show the constraint violations for both algorithms. Clearly, under the PBICRL-inferred weights we are able to obtain safer rollouts. A sketch of how this comparison can be quantified follows the figures.

Baseline Rollouts

PBICRL Rollouts
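As a rough illustration of how such a safety comparison can be quantified, the sketch below computes the fraction of rollouts that enter any inferred constraint region. The spherical-region geometry and all names are hypothetical, chosen only to mirror the Fetch-Reach setting:

```python
import numpy as np

def violation_rate(rollouts, regions):
    """Fraction of rollouts that enter any inferred constraint region.
    `rollouts` is a list of (T, 3) arrays of end-effector positions and each
    region is a (center, radius) sphere -- an illustrative geometry choice."""
    def violates(traj):
        return any(np.min(np.linalg.norm(traj - c, axis=1)) < r for c, r in regions)
    return float(np.mean([violates(t) for t in rollouts]))

# Toy usage: one rollout stays at the origin, the other enters the region
rollouts = [np.zeros((50, 3)), np.ones((50, 3))]
regions = [(np.array([1.0, 1.0, 1.0]), 0.5)]   # one spherical "red" region
print(violation_rate(rollouts, regions))        # -> 0.5
```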