When a Robot is More Capable than a Human:
Learning from Constrained Demonstrators
Can a robot learn a better policy than the one demonstrated by a constrained expert?
LfCD with Goal-proximity Reward InterPolation
Problem:
(1) Since expert actions are restricted by the interface, the IRL reward should be decoupled from the expert's actions and defined over state-state transitions rather than state-action pairs (see the sketch after this list).
(2) Since demonstrations cover only part of the state space, a learning agent must identify which explored states have reliable reward estimates.
(3) For the novel states encountered during exploration, the agent requires a generalizable reward signal.
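To make requirement (1) concrete, below is a minimal sketch of a reward model that scores state transitions and never sees the demonstrator's action. It assumes a PyTorch-style MLP; the name TransitionReward and the architecture are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn as nn

class TransitionReward(nn.Module):
    """Illustrative reward model defined on (s, s') transitions only.

    Because the demonstrator's actions are limited by the interface,
    the reward deliberately ignores actions: it scores how much a
    transition advances the task, not how the expert achieved it.
    """

    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
        # Concatenate the state pair; no action enters the reward.
        return self.net(torch.cat([s, s_next], dim=-1)).squeeze(-1)
```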
Solution: We propose the LfCD with Goal-proximity Reward InterPolation (LfCD-GRIP) framework, whose components are sketched in code after this list:
(1) Goal-proximity reward: a state-only reward that measures progress toward the goal.
(2) Confidence estimator: identifies expert-like observations where the goal-proximity reward is valid.
(3) Proximity interpolation mechanism: propagates task progress to novel states encountered during exploration.
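The sketch below ties the three components together under stated assumptions: goal proximity is predicted by an ensemble of per-state regressors, confidence is taken from ensemble agreement, and low-confidence (novel) states fall back to nearest-neighbor interpolation over high-confidence states. These particular mechanisms are plausible stand-ins for illustration, not necessarily the ones used in LfCD-GRIP.

```python
import numpy as np

class GoalProximityGRIP:
    """Illustrative sketch of the three LfCD-GRIP components.

    Assumptions (not taken from the paper): proximity is predicted by an
    ensemble of regressors trained on demonstration states, confidence is
    measured by ensemble agreement, and novel states fall back to a
    nearest-neighbor interpolation over high-confidence states.
    """

    def __init__(self, ensemble, confident_states, confident_proximity,
                 disagreement_threshold=0.1, k=5):
        self.ensemble = ensemble                  # list of f_i(s) -> proximity in [0, 1]
        self.confident_states = confident_states  # states with trusted proximity estimates
        self.confident_proximity = confident_proximity
        self.threshold = disagreement_threshold
        self.k = k

    def proximity(self, s):
        preds = np.array([f(s) for f in self.ensemble])
        if preds.std() < self.threshold:
            # (1)+(2): the state looks expert-like, so the learned
            # goal-proximity estimate is trusted directly.
            return float(preds.mean())
        # (3): novel state -- interpolate proximity from the k nearest
        # high-confidence states instead of trusting the ensemble.
        dists = np.linalg.norm(self.confident_states - s, axis=1)
        nearest = np.argsort(dists)[: self.k]
        weights = 1.0 / (dists[nearest] + 1e-6)
        return float(np.average(self.confident_proximity[nearest], weights=weights))

    def reward(self, s, s_next):
        # Dense reward = increase in estimated progress toward the goal,
        # defined on the state transition only (no expert action needed).
        return self.proximity(s_next) - self.proximity(s)
```

Here the dense reward is simply the increase in estimated goal proximity along a transition, so the expert's (interface-limited) action never appears in the learning signal.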
Results
Through extensive experiments, including real-world validation on a WidowX robotic arm, we show that our approach outperforms baseline methods in final task performance.
Real-robot rollouts: Behavioral Cloning completes the task in ~100 s; Ours (LfCD-GRIP) completes it in ~12 s; the other baselines fail.
Figure 1: Task completion time under different settings. The top row shows results with unconstrained experts, while the bottom row shows results with constrained experts. LfCD-GRIP performs competitively when action spaces are restricted, and significantly outperforms other baselines once the agent has access to the full action space.
Figure 2: MiniGrid-LfCD Results. (left) The expert follows the blue path to the green goal, while only LfCD-GRIP takes the red shortcut; (right) average episode length across methods.
Figure 3: Varying constraint severity shows the increasing benefit of LfCD-GRIP over the baselines. Severity 2 corresponds to the constraint $[-0.05, 0.05]$, the more severe restriction.