MoReFree: A Model-based Framework for Reset-Free RL

Two MBRL methods (MoReFree and reset-free PEG) outperform SOTA baselines (IBC, MEDAL, R3L) on 7/8 tasks, and MoReFree beats reset-free PEG on the 3 more difficult tasks.

Evaluation of the MoReFree agent (does not require a ground-truth reward function or demonstrations)

In the simplified Sawyer Door task, we did see some learning happen, but the agent is not able to solve the original task (first column). Top row: goals; bottom row: executions.

Behaviors of MoReFree

MoReFree performs a 'reset', returning to the initial state and restarting.

During non-episodic training, MoReFree goes to the evaluation goal and explores near the goal area.

MoReFree performs a 'reset', returning to the initial state and restarting.

MoReFree learns different behaviors to reach the goal: first the ant flips over, then it uses its head to move toward the goal. Without resets, turning upside down is more difficult than simply using the head to move.

'Normal' reaching

During data collection, MoReFree brings the object back to the initial state (green point) from the wall and then executes the exploration policy.

Normally, the gripper is blocked for the push task. We found it important to release (unblock) the gripper in the reset-free setting: here MoReFree uses the gripper to poke the object out of the corner.

Blocked gripper.

Jittering behavior in IBC's Fetch env.

Evaluation of the R3L agent (does not require a ground-truth reward function or demonstrations)


Since there is no public implementation of the R3L agent, we implemented it ourselves. The forward agent is trained to solve the task using a learned reward function (VICE); the backward agent performs random perturbation using RND.
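For reference, here is a minimal PyTorch sketch of an RND-style intrinsic reward like the one driving the backward (perturbation) agent; the `RNDReward` class name, network sizes, and learning rate are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.nn as nn


class RNDReward(nn.Module):
    """Random Network Distillation: the intrinsic reward is the prediction error
    between a frozen, randomly initialized target network and a trained predictor."""

    def __init__(self, obs_dim, feat_dim=64, lr=1e-4):
        super().__init__()
        self.target = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        for p in self.target.parameters():  # the target is never trained
            p.requires_grad_(False)
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=lr)

    def intrinsic_reward(self, obs):
        # Rarely visited states are poorly predicted -> high intrinsic reward.
        with torch.no_grad():
            return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)

    def update(self, obs):
        # Train the predictor to match the fixed target on visited states.
        loss = (self.predictor(obs) - self.target(obs)).pow(2).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()
```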

Behaviors of R3L

R3L is performing forward task-solving

R3L is performing random perturbation. 

Visualization of the learned reward function on the PointUMaze task. It is trained as a classifier to distinguish visited states from goal states, so states closer to the goal state receive higher reward. However, the classifier's output also bleeds into the area near the initial state (blue circle), which leads the agent to consistently go in the wrong direction. This behavior can be seen in the eval video above.
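For reference, a minimal PyTorch sketch of a VICE-style classifier reward of the kind visualized above; the `ClassifierReward` class name, architecture, and use of the sigmoid output as the reward are illustrative assumptions rather than our exact reimplementation.

```python
import torch
import torch.nn as nn


class ClassifierReward(nn.Module):
    """VICE-style learned reward: a binary classifier trained to separate
    goal (example) states from states visited by the policy."""

    def __init__(self, obs_dim, lr=1e-4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, 1))
        self.opt = torch.optim.Adam(self.net.parameters(), lr=lr)
        self.bce = nn.BCEWithLogitsLoss()

    def reward(self, obs):
        # Higher classifier output = "looks like a goal state" = higher reward.
        with torch.no_grad():
            return torch.sigmoid(self.net(obs)).squeeze(-1)

    def update(self, goal_obs, policy_obs):
        # Goal states are labeled 1, policy-visited states are labeled 0.
        logits = torch.cat([self.net(goal_obs), self.net(policy_obs)], dim=0)
        labels = torch.cat([torch.ones(len(goal_obs), 1),
                            torch.zeros(len(policy_obs), 1)], dim=0)
        loss = self.bce(logits, labels)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()
```

Because the classifier only sees a limited set of positive and negative examples, its output can extrapolate poorly far from the training data, which is one plausible source of the spurious high-reward region near the initial state in the visualization above.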

Evaluation of the IBC agent (requires a ground-truth reward function)

coming soon ...

coming soon ...

Evaluation of the MEDAL agent (requires a ground-truth reward function and demonstrations)

coming soon ... 

Evaluation of the Oracle agent (200k steps) (requires resets and a ground-truth reward function)

Oracle is not able to solve the task.

Two sets of Fetch envs

Jittering behavior of the Fetch envs from the IBC paper

If the robotic arm pushes the block outside the allowed region, the block teleports back in. This unrealistic jittering behavior is hard to predict, and model-based exploration methods tend to exploit it.
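To make the "teleporting" constraint concrete, here is a hypothetical 2D illustration of the kind of discontinuous position correction we mean; `WORKSPACE_LOW`, `WORKSPACE_HIGH`, and `teleport_constraint` are made-up names and values, not the actual IBC environment code.

```python
import numpy as np

# Bounds of the allowed block region (made-up values for illustration only).
WORKSPACE_LOW = np.array([0.0, 0.0])
WORKSPACE_HIGH = np.array([1.0, 1.0])


def teleport_constraint(block_pos):
    """Hypothetical illustration: if the block leaves the allowed region, it is
    snapped back inside, producing a sudden, non-physical position change."""
    block_pos = np.asarray(block_pos, dtype=float)
    if np.any(block_pos < WORKSPACE_LOW) or np.any(block_pos > WORKSPACE_HIGH):
        return np.clip(block_pos, WORKSPACE_LOW, WORKSPACE_HIGH)
    return block_pos
```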

Fetch envs in our paper

We replaced the artificial position constraint with a physical wall, which is more realistic; with the wall, the jittering behavior disappears.