Task-Oriented Active Learning of Model Preconditions for Inaccurate Dynamics Models
Alex Lagrassa, Moonyoung Lee, Oliver Kroemer
Carnegie Mellon University, Robotics Institute
{alagrass,moonyoul,okroemer}@andrew.cmu.edu
A dynamics model that works in one environment may lead to noticeable deviations when applied to a different environment. Knowing where the model is accurate helps the planner compute plans that successfully reach the goal.
As an illustrative example, consider a robot pouring water into a plant container.
Because of the plant, water may or may not enter the container. The robot can therefore actively learn where the model is accurate by using a planner and an acquisition function to iteratively select informative trajectories.
The learned model precondition is then used at test time to restrict planning to actions within it.
Abstract
When planning with an inaccurate dynamics model is necessary, a promising strategy is to confine planning effort to regions of state-action space where the model is accurate, sometimes referred to as a model precondition. Many model forms, such as simulators and analytical models, lack inherent criteria for where the model will be accurate in the test environment, which motivates defining model preconditions using small amounts of data collected in the test environment. This paper presents an algorithm that actively selects trajectories to learn a model precondition for planning with an inaccurate pre-specified dynamics model. The main contributions of this work are the proposed techniques for actively learning model deviation estimators and the experimental analysis of algorithmic properties in three planning domains: an icy gridworld, simulated plant watering, and real-world plant watering.
Method
Each iteration j starts with sampling a planning problem and generating candidate trajectories.
We outline the acquisition function computation for each trajectory in the pink box, including the step-wise acquisition function values α_step(s_t, a_t) for each state-action pair in the trajectory.
These values are then aggregated by a function h to yield the trajectory's utility, α(τ).
The final step is selecting and executing τ* in the test environment to collect the ground-truth trajectory [s_{1:T_τ}, a_{1:T_τ−1}].
The MDE is updated every M trajectories.
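The select-execute-update structure of this loop can be summarized in code. Below is a minimal, self-contained sketch, assuming hypothetical names (SimpleMDE, alpha_step, run_online_learning) and a toy Gaussian-process deviation estimator with random candidate trajectories in place of the planner; it illustrates the loop described above, not the paper's actual implementation.

```python
# Minimal sketch of the online learning loop (all names are hypothetical).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor


class SimpleMDE:
    """Toy model deviation estimator: regresses observed deviation on (s, a)."""

    def __init__(self):
        self.gp = GaussianProcessRegressor()
        self.X, self.y = [], []

    def add(self, s, a, deviation):
        self.X.append(np.concatenate([s, a]))
        self.y.append(deviation)

    def fit(self):
        self.gp.fit(np.array(self.X), np.array(self.y))

    def predict(self, s, a):
        if not hasattr(self.gp, "X_train_"):  # not fitted yet
            return 0.0, 1.0
        mu, std = self.gp.predict(np.concatenate([s, a])[None], return_std=True)
        return float(mu[0]), float(std[0])


def alpha_step(mde, s, a):
    # Step-wise acquisition value alpha_step(s_t, a_t); lower = more useful.
    # Illustrative trade-off only: prefer low predicted deviation, high uncertainty.
    mu, std = mde.predict(s, a)
    return mu - std


def alpha_traj(mde, states, actions, h=np.mean):
    # Aggregate the step-wise values with a function h to get alpha(tau).
    return h([alpha_step(mde, s, a) for s, a in zip(states[:-1], actions)])


def run_online_learning(num_iters=20, num_candidates=8, M=5, horizon=4):
    rng = np.random.default_rng(0)
    mde = SimpleMDE()
    for j in range(num_iters):
        # "Planner" stand-in: sample random candidate trajectories for a problem.
        candidates = [(rng.uniform(-1, 1, (horizon, 2)),
                       rng.uniform(-1, 1, (horizon - 1, 2)))
                      for _ in range(num_candidates)]
        # Select tau* with the best (here: lowest) trajectory utility.
        states, actions = min(candidates, key=lambda c: alpha_traj(mde, *c))
        # "Execute" tau* in the test environment and record observed deviations.
        for s, a in zip(states[:-1], actions):
            observed_deviation = abs(float(rng.normal(0.1, 0.05)))  # placeholder
            mde.add(s, a, observed_deviation)
        if (j + 1) % M == 0:
            mde.fit()  # update the MDE every M trajectories
    return mde


if __name__ == "__main__":
    run_online_learning()
```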
The three domains used, showing the experimental setup and the corresponding dynamics model for each.
(a) Slippery gridworld, where movement may result in slipping backwards over ice (blue) or not moving (grey). The analytical dynamics model assumes unimpeded movement with grid bounds in four directions (a minimal sketch of this nominal model follows the list).
(b) Simulated plant watering, using a learned dynamics model trained on a simple water-transport domain without a plant.
(c) Real-world robot pouring water into a plant pot, where the analytical dynamics model is based on container geometry.
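For concreteness, here is a small sketch of the kind of nominal analytical model used in (a): the agent is assumed to always move one cell in the commanded direction, clipped to the grid bounds, with ice and slipping left unmodeled. The grid size and function names are illustrative assumptions.

```python
# Sketch of a nominal gridworld dynamics model: unimpeded movement with grid
# bounds in four directions (slipping on ice is intentionally not modeled).
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def nominal_step(state, action, grid_size=(8, 8)):
    """Predict the next (x, y) cell under the inaccurate analytical model."""
    dx, dy = ACTIONS[action]
    x = max(0, min(grid_size[0] - 1, state[0] + dx))
    y = max(0, min(grid_size[1] - 1, state[1] + dy))
    return (x, y)

# Example: the model predicts (3, 4), even if (3, 4) is ice and the real
# environment would make the agent slip backwards.
print(nominal_step((3, 3), "up"))
```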
Additional Experiments
Qualitative Analysis of MDE over Online Learning Runs
We can observe a successful balance of exploration and exploitation through additional qualitative analysis. The plots below show the model precondition and acquisition function values in the simulated watering environment for a single successful online learning run.
Acquisition function values (top)
Graph explanation
This graph shows the values of the step-wise acquisition function, illustrating the utility of points in space for actions that change the container angle, where pouring can occur (a sketch of how such a map can be computed follows the results list below).
The color bar indicates the acquisition function value; lower values are more useful.
The dot shading indicates the true deviation for each observed sample.
Results show:
A darker region (lower acquisition function values, i.e., higher utility) under the leaves, with utility highest around both low-error and high-noise regions.
High values discourage further exploration above the leaves once sufficient data has been collected.
Coverage of a wide region of points.
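A map like the one plotted above can be computed by evaluating the step-wise acquisition function over a grid of candidate positions for a fixed pouring action. The snippet below is a sketch under the same assumptions as the method sketch earlier (a deviation predictor returning a mean and standard deviation, and an illustrative mean-minus-std trade-off); it is not the paper's plotting code.

```python
import numpy as np

def acquisition_grid(predict_deviation, pour_action, xs, ys):
    """Evaluate a step-wise acquisition function over a grid of positions.

    predict_deviation(s, a) -> (mean, std), e.g. SimpleMDE.predict above.
    Lower values indicate more useful points to visit.
    """
    values = np.zeros((len(ys), len(xs)))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            mu, std = predict_deviation(np.array([x, y]), pour_action)
            values[i, j] = mu - std  # illustrative trade-off only
    return values
```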
Model preconditions (bottom)
Graph explanation
This graph shows how the model precondition evolves as more data is collected.
Blue is within the model precondition and red is outside it. The white region indicates the boundary, just outside the model precondition (a sketch of one possible membership check follows the results list below).
Results show:
A wider region as more data is collected, which enables a higher success rate when computing plans.
A conservative precondition when little data has been collected, which prevents the robot from committing to plans that are unlikely to succeed.
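One simple way such a precondition can be realized, assuming it is defined by thresholding the MDE's uncertainty-adjusted predicted deviation (the names mde.predict, d_max, and beta are hypothetical, and the exact definition in the paper may differ):

```python
def in_model_precondition(mde, s, a, d_max=0.05, beta=0.0):
    """Return True if (s, a) lies inside the model precondition."""
    mu, std = mde.predict(s, a)      # predicted deviation mean and std
    return mu + beta * std <= d_max  # larger beta -> more conservative check
```

Under this reading, the blue, white, and red regions in the plot correspond to the adjusted predicted deviation being well below, near, or above the threshold, respectively.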
Qualitative Analysis for Active Learning
Although our acquisition function can bias data collection toward low-deviation trajectories given enough data, the random generation process of RRT may not provide a sufficient set of candidate trajectories.
Constraining the candidate trajectories using the MDE during training can improve the selection.
Experiment: We empirically evaluate the effect of four different β schedules, two of which are fixed and two of which start permissive and become more conservative.
Summary of results: When evaluated in the simulated plant-watering domain, we observed no significant performance difference among these schedules.
Effect of risk tolerance schedules on test performance. Schedule A varies β from −2 to 2 using a sigmoid function. Schedule B is the same as Schedule A, but with a maximum at 1. Schedule C fixes β = −2, and Schedule D fixes β = 1.
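As a sketch, the four schedules can be written as functions of normalized training progress t in [0, 1]. The sigmoid center and steepness below are assumptions, as is the reading of Schedule B as Schedule A capped at 1; only the endpoints and fixed values come from the description above.

```python
import numpy as np

def schedule_A(t):
    # Varies beta from -2 to 2 with a sigmoid in training progress t in [0, 1].
    return -2.0 + 4.0 / (1.0 + np.exp(-10.0 * (t - 0.5)))

def schedule_B(t):
    # Same as Schedule A, but capped at a maximum of 1 (one reading of "maximum at 1").
    return min(schedule_A(t), 1.0)

def schedule_C(t):
    return -2.0  # fixed, permissive

def schedule_D(t):
    return 1.0   # fixed, more conservative

for t in (0.0, 0.5, 1.0):
    print(f"t={t}: A={schedule_A(t):.2f} B={schedule_B(t):.2f} "
          f"C={schedule_C(t):.2f} D={schedule_D(t):.2f}")
```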