MoReFree: A Model-based Framework for Reset-Free RL

MoReFree outperforms the baselines (reset-free PEG, IBC, MEDAL, and R3L) on 5/6 tasks.

Evaluation of the MoReFree agent (works under the Demo-free, Reward-free, Reset-free RL setting, i.e., DR3L)

In the simplified Sawyer Door task, we did see some learning happen, but the agent is not able to solve the original task (first column). The top row shows goals; the bottom row shows executions.

Behaviors of MoReFree

MoReFree performs a 'reset', getting back to the initial state and restarting.

During non-episodic training, MoReFree goes to the evaluation goal and explores near the goal area.

MoReFree performs a 'reset', getting back to the initial state and restarting.

MoReFree learns different behaviors to reach the goal. First, the ant flips over, then it uses its head to move to the goal. Without resets, turning itself back over is more difficult than just using the head to move.

'Normal' reaching.

During data collection, MoReFree brings the object back to the initial state (green point) from the wall and executes the exploration policy.

Typically, the gripper is blocked for the push task. We found it is important to leave the gripper free in the reset-free setting. We see that MoReFree uses the gripper to poke the object out of the corner.

Blocked gripper.

Jittering behavior in IBC's Fetch env.

Evaluation of the R3L agent (works under the DR3L setting, so it is the only suitable baseline for MoReFree to compare against)


R3L works under the demo-free, reward-free, reset-free (DR3L) setting, so it is the only suitable baseline. Since there is no public implementation of the R3L agent, we implemented it ourselves: the forward agent is trained to solve the task using a learned reward function (VICE), and the backward agent performs random perturbation using RND.
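To make this structure concrete, below is a minimal, hypothetical sketch of the forward/backward alternation (an assumed `act`/`observe` agent interface and placeholder `vice_reward` / `rnd_bonus` helpers; an illustration, not our exact implementation).

```python
# Minimal sketch of an R3L-style non-episodic loop (hypothetical interface).
# The forward agent is rewarded by a learned VICE classifier; the backward
# agent is rewarded by an RND novelty bonus and perturbs the state.
def train_r3l(env, forward_agent, backward_agent, vice_reward, rnd_bonus,
              total_steps=1_000_000, phase_len=200):
    obs = env.reset()  # reset only once, at the very beginning
    step = 0
    while step < total_steps:
        for _ in range(phase_len):  # forward phase: task solving
            action = forward_agent.act(obs)
            next_obs, _, _, _ = env.step(action)
            forward_agent.observe(obs, action, vice_reward(next_obs), next_obs)
            obs, step = next_obs, step + 1
        for _ in range(phase_len):  # backward phase: random perturbation
            action = backward_agent.act(obs)
            next_obs, _, _, _ = env.step(action)
            backward_agent.observe(obs, action, rnd_bonus(next_obs), next_obs)
            obs, step = next_obs, step + 1
```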

Behaviors of R3L

R3L is performing forward task-solving.

R3L is performing random perturbation. 

Visualization of the learned reward function on the PointUMaze task. It is trained as a classifier to distinguish states from the goal state; namely, states closer to the goal state receive higher reward. However, the output of the classifier is smoothed over the area near the initial state (blue circle), which leads the agent to always go in the wrong direction. Such behavior can be seen in the eval video above.
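For reference, this kind of classifier-based reward can be sketched as follows (a minimal PyTorch sketch with assumed names such as `GoalClassifier`; not the exact code of our R3L reimplementation). Goal states are the positive class, states visited by the policy are the negative class, and the sigmoid output is used as the reward.

```python
import torch
import torch.nn as nn

# Hypothetical VICE-style classifier: trained to distinguish goal states
# (label 1) from states visited by the policy (label 0).
class GoalClassifier(nn.Module):
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs)  # logits

def classifier_reward(clf, obs):
    # Higher probability of "looks like the goal" -> higher reward.
    with torch.no_grad():
        return torch.sigmoid(clf(obs)).squeeze(-1)

def update_classifier(clf, optimizer, goal_states, policy_states):
    logits = clf(torch.cat([goal_states, policy_states]))
    labels = torch.cat([torch.ones(len(goal_states), 1),
                        torch.zeros(len(policy_states), 1)])
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With only a handful of goal states as positives, such a classifier can assign spuriously high reward away from the goal, which is the failure mode visualized here.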

Evaluation of IBC agent (requires task-related reward fn)

coming soon ...

coming soon ...

Evaluation of MEDAL agent (requires task-related reward fn and demonstrations)

coming soon ... 

Evaluation of Oracle agent (200k steps) (requires reset and task-related reward fn)

Oracle is not able to solve the task.

More results on IBC Fetch envs

Jittering behavior in the Fetch envs from the IBC paper

If the robotic arm pushes the block outside the allowed region, the block teleports back in. This unrealistic jittering behavior is hard to predict, and model-based exploration methods tend to exploit it.
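For illustration, such an artificial position constraint might look like the following sketch (placeholder bounds and names; not the actual IBC environment code).

```python
import numpy as np

# Hypothetical sketch of an artificial position constraint: if the block
# leaves the allowed region, its position is overwritten ("teleported") back
# inside, producing a discontinuous jump that a world model struggles to
# predict. The bounds here are placeholders, not the real environment values.
def constrain_block(block_xy, low=np.array([-0.25, -0.25]),
                    high=np.array([0.25, 0.25])):
    if np.any(block_xy < low) or np.any(block_xy > high):
        return np.clip(block_xy, low, high)  # teleport back inside
    return block_xy
```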

Fetch envs in our paper

We replaced the artificial position constraint with a physical wall, which is more realistic, and the jittering behavior disappears.

MoReFree vs IBC on the original IBC Fetch envs

Fetch Push

PickAndPlace

In the PickAndPlace task, MoReFree first focuses on the jittering behavior instead of picking up the object; after 1M steps it learns to pick it up.