Latent Space Safe Sets (LS3)

Albert Wilcox*, Ashwin Balakrishna*, Brijen Thananjeyan,

Joseph E. Gonzalez, Ken Goldberg

Abstract

Reinforcement learning (RL) algorithms have shown impressive success in exploring high-dimensional environments to learn complex, long-horizon tasks, but when exploration is unconstrained they often exhibit unsafe behaviors and require a prohibitive number of environment interactions, posing significant safety concerns for the robot and its surroundings. A promising strategy for safe learning in dynamically uncertain environments is to require that the agent can robustly return to states where task success (and therefore safety) can be guaranteed. While this approach has been successful in low dimensions, enforcing this constraint in environments with high-dimensional state spaces, such as images, is challenging. We present Latent Space Safe Sets (LS3), which extends this strategy to iterative, long-horizon tasks with image observations by using suboptimal demonstrations and a learned dynamics model to restrict exploration to the neighborhood of a learned Safe Set where the agent is confident in task completion. We evaluate LS3 on 4 domains, including a challenging sequential pushing task in simulation and a physical cable routing task. We find that LS3 can use its learned Safe Set in conjunction with model-based planning to restrict exploration and learn more efficiently than prior algorithms while satisfying constraints.


Latent Space Safe Sets

Latent Space Safe Sets (LS3) is a model-based RL algorithm for visuomotor policy learning that provides safety by learning a continuous relaxation of a safe set in a learned latent space. This latent space safe set is then used to ensure that the agent can plan back to regions in which it is confident in task completion, even when learning in high-dimensional spaces. This constraint makes it possible to:

  1. Improve safely by ensuring that the agent can consistently complete the task (and therefore avoid unsafe behavior).

  2. Learn efficiently since the agent only explores promising regions of the state space in the immediate neighborhood of states in which it was previously successful.

LS3 additionally enforces user-specified state-space constraints by estimating the probability of constraint violations under a learned probabilistic latent dynamics model.
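
To make this concrete, below is a minimal sketch of how the two quantities LS3 plans with (probability of landing in the safe set and probability of violating a constraint) could be estimated by sampling a learned stochastic latent dynamics model. The function and argument names (estimate_plan_safety, dynamics, safe_set_clf, constraint_clf) are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def estimate_plan_safety(z0, actions, dynamics, safe_set_clf, constraint_clf,
                         n_particles=20):
    """Monte Carlo estimate of (a) the probability that the final predicted
    latent state lies in the learned safe set and (b) the worst-case per-step
    probability of a constraint violation, under a stochastic latent dynamics
    model. All learned components are passed in as callables:

      z0             -- (d,) latent encoding of the current observation
      actions        -- (H, a_dim) candidate action sequence
      dynamics       -- (n, d), (n, a_dim) -> (n, d) sample of next latent states
      safe_set_clf   -- (n, d) -> (n,) probability each state is in the safe set
      constraint_clf -- (n, d) -> (n,) probability each state violates a constraint
    """
    z = z0.unsqueeze(0).expand(n_particles, -1)      # replicate state into particles
    step_violation = []
    for a in actions:                                # roll the plan forward H steps
        a_batch = a.unsqueeze(0).expand(n_particles, -1)
        z = dynamics(z, a_batch)                     # one stochastic sample per particle
        step_violation.append(constraint_clf(z).mean())
    safe_set_prob = safe_set_clf(z).mean()           # terminal states vs. the safe set
    violation_prob = torch.stack(step_violation).max()
    return safe_set_prob, violation_prob
```

A planner can then discard any candidate action sequence whose estimated safe set probability falls below a threshold δS or whose estimated violation probability exceeds a threshold δC.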

Simulation Experiments

We study whether LS3 can

  1. Learn more efficiently than algorithms which do not structure exploration based on prior task successes

  2. Leverage the learned safe set to reliably achieve task completion during learning

  3. Use model-based planning to satisfy user-specified state-space constraints

To answer these questions, we evaluate LS3 on three vision-based continuous-control domains, illustrated to the right: a simulated navigation task, a modified version of the DeepMind Control Suite Reacher task, and a simulated robotic pushing task.

Experiments suggest that LS3 learns more efficiently than baselines across all tasks, although SACfD and SACfD+RRL eventually match its performance.

LS3 achieves a significantly higher task completion rate than the comparisons; for the sequential pushing task in particular, the safe set substantially increases the task success rate. However, LS3 violates constraints more often than SACfD and SACfD+RRL on 2 of the 3 tasks, although SACfD and SACfD+RRL also achieve much lower task success rates on those tasks.






Sensitivity Experiments




Key hyperparameters in LS3 are the constraint threshold δC and safe set threshold δS, which control whether the agent decides predicted states are constraint-violating or in the safe set, respectively. We ablate these parameters for the Sequential Pushing environment in the plots shown to the right.

We find that, as expected, lower values of δC make the agent less likely to violate constraints. Additionally, we find that higher values of δS constrain exploration more effectively, but too high a threshold leads to poor performance as the agent exploits local maxima in the safe set estimate.

Finally, we ablate the planning horizon H for LS3 and find that when H is too high, LS3 can explore too aggressively away from the safe set, leading to poor performance. When H is lower, LS3 explores much more stably, but if it is too low (i.e., H = 1), LS3 is eventually unable to explore significantly new plans, while slightly increasing H (i.e., H = 3) allows for continued improvement in performance.
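
To illustrate how δS, δC, and the planning horizon H interact, here is a hedged random-shooting planner sketch that filters candidate plans with a safety estimator like the one above; the names (plan_action, safety_fn, value_fn) and the uniform action sampling are assumptions for illustration, not the exact sampling-based optimizer used by LS3.

```python
import torch

def plan_action(z0, safety_fn, value_fn, horizon=3, a_dim=2,
                delta_S=0.8, delta_C=0.2, n_samples=500):
    """Return the first action of the best H-step plan that is predicted to
    end in the safe set (prob >= delta_S) while keeping the estimated
    constraint-violation probability <= delta_C along the way.

      safety_fn -- callable (z0, actions) -> (safe_set_prob, violation_prob),
                   e.g. the estimator sketched above with its models bound
      value_fn  -- callable actions -> scalar score of the plan (e.g. predicted return)
    """
    best_value, best_plan = -float("inf"), None
    for _ in range(n_samples):
        actions = torch.rand(horizon, a_dim) * 2 - 1      # candidate plan in [-1, 1]^a_dim
        safe_prob, viol_prob = safety_fn(z0, actions)
        if safe_prob < delta_S or viol_prob > delta_C:    # reject unsafe candidates
            continue
        value = value_fn(actions)
        if value > best_value:
            best_value, best_plan = value, actions
    # Fall back to a conservative zero action if no candidate passed both checks
    return best_plan[0] if best_plan is not None else torch.zeros(a_dim)
```

In this sketch, raising δS or lowering δC shrinks the set of admissible plans (safer but more conservative), while a larger horizon H lets candidate plans stray further from the safe set before the terminal check applies, mirroring the trends observed in the ablations above.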

Physical Experiments

Finally, we evaluate LS3 on a physical cable routing task shown to the right. The objective of the task is to move the endpoint of the cable from the starting state into the goal set shown in green without the robot arm or any part of the cable hitting the blue obstacle. This task is very challenging for vision-based control due to the difficulty of reasoning about cable deformation and collisions. We find that LS3 significantly outperforms comparisons, which are largely unable to make progress on the task, while also violating constraints less often.

[Figure panels: Early Training, Mid Training, Late Training]