1Stanford University, 2NVIDIA Research
Conference on Robot Learning (CoRL) 2024
Robot behavior policies trained via imitation learning are prone to failure under conditions that deviate from their training data. Thus, algorithms that monitor learned policies at test time and provide early warnings of failure are necessary to facilitate scalable deployment. We propose Sentinel, a runtime monitoring framework that splits the detection of failures into two complementary categories: 1) erratic failures, which we detect using statistical measures of temporal action consistency, and 2) task progression failures, where we use Vision Language Models (VLMs) to detect when the policy confidently and consistently takes actions that do not solve the task. Our approach has two key strengths. First, because learned policies exhibit diverse failure modes, combining complementary detectors leads to significantly higher accuracy at failure detection. Second, using a statistical temporal action consistency measure ensures that we quickly detect when multimodal, generative policies exhibit erratic behavior at negligible computational cost. In contrast, we only use VLMs to detect failure modes that are less time-sensitive. We demonstrate our approach in the context of diffusion policies trained on robotic mobile manipulation domains in both simulation and the real world. By unifying temporal consistency detection and VLM runtime monitoring, Sentinel detects 18% more failures than either detector alone and significantly outperforms baselines, highlighting the importance of assigning specialized detectors to complementary categories of failure.
Challenges on the horizon: Generative modeling has empowered robot policies to effectively learn from multimodal, human demonstration data. However, when robots are deployed in the real world, even the most powerful generative policies will eventually encounter out-of-distribution (OOD) scenarios, causing their behavior to deviate from expectations.
Towards reliable deployment: In response, we will require methods that monitor the behavior of learned, generative policies at deployment time to detect failures as they occur and prevent the potential negative downstream consequences.
Approach overview: We present Sentinel, a runtime monitor for detecting unknown failures of generative robot policies at deployment time. Constructing Sentinel requires only a set of successful policy rollouts and a description of the task, from which it detects diverse failures by monitoring (a) the temporal consistency of action-chunk distributions generated by the policy and (b) the task progress of the robot(s) through video QA with VLMs.
We propose to split the failure detection task into two complementary failure categories: 1) erratic failures and 2) task progression failures.
Why? Because it is extremely difficult to design a single failure detector that can capture virtually all failure modes of generative policies.
Notes: When a policy fails due to out-of-distribution conditions, it may exhibit highly diverse behaviors. Moreover, actions sampled from multimodal, generative policies can vary greatly from one timestep to the next, leading to complex runtime behaviors and, by extension, more diverse failures than those of previous model-free policies.
So? Defining complementary failure categories admits a divide-and-conquer approach: assign a specialized detector to each failure category.
Definition (Erratic failure): The policy fails by exhibiting erratic behavior as measured by the temporal inconsistency of its action distributions across time.
Failure detector (STAC): Tracks the temporal consistency of overlapping action sequences sampled from the generative policy during a rollout by:
Using statistical distances (e.g., KL divergence) to quantify temporal consistency;
Raising an alarm if the accumulated statistical distances exceed a calibrated threshold (a minimal sketch follows this list).
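Below is a minimal sketch of the per-timestep consistency score in Python. It assumes the policy returns a batch of sampled action chunks at each timestep, and it moment-matches each batch with a diagonal Gaussian so the statistical distance has a closed form via the KL divergence; the paper's exact distance measure and implementation may differ, and all function and variable names here are illustrative.

import numpy as np

def gaussian_kl(mu_p, var_p, mu_q, var_q, eps=1e-8):
    # KL divergence KL(P || Q) between diagonal Gaussians, summed over dims.
    var_p, var_q = var_p + eps, var_q + eps
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def stac_score(prev_chunks, curr_chunks, offset):
    # Statistical distance between the overlapping segments of two batches
    # of sampled action chunks.
    #   prev_chunks: (B, H, A) action chunks sampled at the previous timestep
    #   curr_chunks: (B, H, A) action chunks sampled at the current timestep
    #   offset:      number of actions executed between the two samples
    prev_overlap = prev_chunks[:, offset:, :]  # tail of the previous chunk
    curr_overlap = curr_chunks[:, : prev_overlap.shape[1], :]  # head of the new one

    # Moment-match each batch with a diagonal Gaussian (an approximation
    # chosen here so the distance can be computed in closed form).
    mu_p, var_p = prev_overlap.mean(axis=0), prev_overlap.var(axis=0)
    mu_q, var_q = curr_overlap.mean(axis=0), curr_overlap.var(axis=0)
    return gaussian_kl(mu_p.ravel(), var_p.ravel(), mu_q.ravel(), var_q.ravel())

During a rollout, these per-step scores are accumulated and compared against the calibrated threshold discussed next.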
Key insight (Erratic failures)
The policy is more temporally consistent when it succeeds than when it fails.
Therefore, by calibrating a detection threshold on successful policy rollouts, we can detect unknown erratic failures using only information about successful behavior (see the calibration sketch below).
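One simple way to instantiate this calibration, assuming cumulative STAC scores have been recorded on a held-out set of successful rollouts, is to threshold at a quantile of the rollout-wise maxima. This is an illustrative sketch, not necessarily the paper's exact calibration procedure.

import numpy as np

def calibrate_threshold(success_rollout_scores, quantile=0.95):
    # success_rollout_scores: one array of cumulative STAC scores per
    # successful calibration rollout. Thresholding at a quantile of the
    # rollout-wise maxima caps the false-alarm rate on successful rollouts
    # at roughly (1 - quantile).
    maxima = np.array([np.max(scores) for scores in success_rollout_scores])
    return np.quantile(maxima, quantile)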
Definition (Task progression failure): The policy fails by taking actions that do not solve the task in a temporally consistent way.
Failure detector (VLM): Monitors the task progress of the generative policy by performing online video QA with VLMs (an illustrative sketch follows).
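As an illustration of what such online video QA could look like, the sketch below queries GPT-4o through the OpenAI Python client with ordered, subsampled rollout frames and the task description. The prompt wording, frame encoding, and yes/no parsing are assumptions made for the sketch, not the paper's exact prompt or pipeline.

import base64
from openai import OpenAI  # assumes the openai package and an API key are set up

client = OpenAI()

def vlm_detects_failure(frames_jpeg, task_description, model="gpt-4o"):
    # Ask the VLM whether the policy is failing to make task progress,
    # given subsampled rollout frames (as JPEG bytes) up to the current
    # timestep. The prompt below is illustrative.
    content = [{
        "type": "text",
        "text": (
            f"The robot's task is: {task_description}\n"
            "The images are ordered frames from the rollout so far. "
            "Is the robot failing to make progress on its task? "
            "Answer with a single word: yes or no."
        ),
    }]
    for jpeg in frames_jpeg:
        b64 = base64.b64encode(jpeg).decode("utf-8")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": content}]
    )
    return "yes" in response.choices[0].message.content.lower()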
The Sentinel runtime monitor synergistically combines STAC and the VLM to provide coverage over the highly heterogeneous failure modes that a generative policy might exhibit in the field. At deployment time, execution of the policy stops if either STAC or the VLM raises an alarm, as in the glue sketch below.
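Putting the pieces together, a deployment loop might look like the following glue pseudocode, which reuses the sketches above; policy.sample_chunks, execute_actions, get_frames, and stop_policy_execution are hypothetical hooks into the robot stack.

cum_score, eta = 0.0, calibrate_threshold(success_rollout_scores)
prev_chunks = policy.sample_chunks(state)  # hypothetical policy API
for t in range(1, T + 1):
    state = execute_actions(prev_chunks, n=offset)  # step the robot
    curr_chunks = policy.sample_chunks(state)
    cum_score += stac_score(prev_chunks, curr_chunks, offset)

    # STAC runs at every timestep (negligible cost); the VLM is queried at
    # a coarser interval since task progression failures are less urgent.
    stac_alarm = cum_score > eta
    vlm_alarm = t % vlm_period == 0 and vlm_detects_failure(
        get_frames(), task_description
    )
    if stac_alarm or vlm_alarm:
        stop_policy_execution()  # hypothetical hook to halt the robot
        break
    prev_chunks = curr_chunks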
The images depict a policy rollout for timesteps t = [1, ..., T]. Temporal Consistency Detector: At each timestep, the state is passed to the generative policy to obtain action distributions, between which statistical distances are computed to measure temporal consistency. The statistical distances are summed up to the current timestep and thresholded to detect policy failure. Vision Language Model (VLM) Detector: The VLM classifies whether the policy is failing to make progress on its task, given a video up to the current timestep and a description of the task. Execution stops if either detector raises a warning.
STAC detects 80% of policy failures in this real-world task, whilst raising just one false alarm.
Sentinel, which combines STAC and the VLM (GPT-4o), attains a 100% failure detection rate.
The baselines raise alarms on all OOD test cases, whether or not the diffusion policy fails.
STAC (Ours) judiciously raises alarms: that is, only when the policy fails on OOD test cases.
As observed above, STAC (Ours) judiciously raises alarms to catch erratic policy failures.
CLIP Embedding Similarity conflates OOD states with failure, leading to false alarms.
Diffusion Output Variance cannot distinguish low-variance, OOD failures from successes.
When the policy exhibits temporally consistent failing behavior, STAC detects <50% of failures.
By incorporating predictions from the VLM, Sentinel detects 93% of failures, whilst only incurring a 7% increase in false alarms.
Conclusion: Defining complementary failure categories for specialized detectors is key!
Citation
If you found this work interesting, please consider citing:
@inproceedings{AgiaSinhaEtAl2024,
  title     = {Unpacking Failure Modes of Generative Policies: Runtime Monitoring of Consistency and Progress},
  author    = {Agia, Christopher and Sinha, Rohan and Yang, Jingyun and Cao, Ziang and Antonova, Rika and Pavone, Marco and Bohg, Jeannette},
  year      = {2025},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  publisher = {PMLR},
  series    = {Proceedings of Machine Learning Research},
  volume    = {270},
  pages     = {689--723}
}
Acknowledgements