When a Robot is More Capable than a Human:
Learning from Constrained Demonstrators
Can a robot learn a better policy than the one demonstrated by a constrained expert?
LfCD with Goal-proximity Reward InterPolation
Problem:
(1) Since expert actions are restricted by the interface, the IRL reward should be decoupled from the expert's actions and defined over state-state transitions rather than state-action pairs (see the sketch after this list).
(2) Since demonstrations cover only part of the state space, a learning agent must identify which explored states have reliable reward estimates.
(3) For the novel states encountered during exploration, the agent requires a generalizable reward signal.
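To make requirement (1) concrete, below is a minimal sketch of a reward model that scores state transitions and never sees the demonstrator's action. It assumes a PyTorch-style MLP; the name TransitionReward and the architecture are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn as nn

class TransitionReward(nn.Module):
    """Illustrative reward model defined on (s, s') transitions only.

    Because the demonstrator's actions are limited by the interface,
    the reward deliberately ignores actions: it scores how much a
    transition advances the task, not how the expert achieved it.
    """

    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
        # Concatenate the state pair; no action enters the reward.
        return self.net(torch.cat([s, s_next], dim=-1)).squeeze(-1)
```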
Solution: We propose the LfCD with Goal-proximity Reward InterPolation (LfCD-GRIP) framework, whose components are sketched in code after this list:
(1) Goal-proximity reward: a state-only reward that measures progress toward the goal.
(2) Confidence estimator: identifies expert-like observations where the goal-proximity reward is valid.
(3) Proximity interpolation mechanism: propagates task progress to novel states encountered during exploration.
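The sketch below ties the three components together under stated assumptions: goal proximity is predicted by an ensemble of per-state regressors, confidence is taken from ensemble agreement, and low-confidence (novel) states fall back to nearest-neighbor interpolation over high-confidence states. These particular mechanisms are plausible stand-ins for illustration, not necessarily the ones used in LfCD-GRIP.

```python
import numpy as np

class GoalProximityGRIP:
    """Illustrative sketch of the three LfCD-GRIP components.

    Assumptions (not taken from the paper): proximity is predicted by an
    ensemble of regressors trained on demonstration states, confidence is
    measured by ensemble agreement, and novel states fall back to a
    nearest-neighbor interpolation over high-confidence states.
    """

    def __init__(self, ensemble, confident_states, confident_proximity,
                 disagreement_threshold=0.1, k=5):
        self.ensemble = ensemble                  # list of f_i(s) -> proximity in [0, 1]
        self.confident_states = confident_states  # states with trusted proximity estimates
        self.confident_proximity = confident_proximity
        self.threshold = disagreement_threshold
        self.k = k

    def proximity(self, s):
        preds = np.array([f(s) for f in self.ensemble])
        if preds.std() < self.threshold:
            # (1)+(2): the state looks expert-like, so the learned
            # goal-proximity estimate is trusted directly.
            return float(preds.mean())
        # (3): novel state -- interpolate proximity from the k nearest
        # high-confidence states instead of trusting the ensemble.
        dists = np.linalg.norm(self.confident_states - s, axis=1)
        nearest = np.argsort(dists)[: self.k]
        weights = 1.0 / (dists[nearest] + 1e-6)
        return float(np.average(self.confident_proximity[nearest], weights=weights))

    def reward(self, s, s_next):
        # Dense reward = increase in estimated progress toward the goal,
        # defined on the state transition only (no expert action needed).
        return self.proximity(s_next) - self.proximity(s)
```

Here the dense reward is simply the increase in estimated goal proximity along a transition, so the expert's (interface-limited) action never appears in the learning signal.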
Results
Through extensive experiments, including real-world validation on a WidowX robotic arm, we show that our approach outperforms baseline methods in final task performance.
Real-robot rollouts: Behavioral Cloning completes the task in ~100 s; Ours (LfCD-GRIP) completes it in ~12 s; the other baselines fail.
Figure 1: Task completion time under different settings. The top row shows results with unconstrained experts, while the bottom row shows results with constrained experts. LfCD-GRIP performs competitively when action spaces are restricted, and significantly outperforms other baselines once the agent has access to the full action space.
Figure 2: MiniGrid-LfCD Results. (left) The expert follows the blue path to the green goal, while only LfCD-GRIP takes the red shortcut; (right) average episode length across methods.
Figure 3: Varying constraint severity shows the increasing benefit of LfCD-GRIP over the baselines. Severity 2 corresponds to the constraint $[-0.05, 0.05]$, the more severe restriction.