Two reset-free MBRL methods (MoReFree and reset-free PEG) outperform SOTA baselines (IBC, MEDAL, R3L) on 7/8 tasks, and MoReFree beats reset-free PEG on the 3 more difficult tasks. Directly applying PEG works poorly.
In the simplified Sawyer Door task, we did see some learning happen, but the agent is not able to solve the original task (first column). Top row: goals; bottom row: executions.
MoReFree performs a 'reset', returning to the initial state and restarting.
During non-episodic training, MoReFree goes to the evaluation goal and explores near the goal area.
MoReFree performs a 'reset', returning to the initial state and restarting.
MoReFree learns different behaviors to reach the goal. First, the ant flips over, then uses its head to move to the goal. Without resets, turning itself over is more difficult than simply using its head to move.
'Normal' reaching
During data collection, MoReFree brings the object back to the initial state (green point) from the wall and executes the exploration policy.
Generally, the gripper is blocked for the push task, but we found it important to release the gripper in the reset-free setting. Here, MoReFree uses the gripper to poke the object out of the corner.
Blocked gripper.
Jittering behavior in IBC's Fetch env.
Since there is no public implementation of the R3L agent, we implemented it ourselves. The forward agent is trained to solve the task using a learned reward function (VICE); the backward agent performs random perturbation using RND.
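The sketch below is our own illustration of the training structure we assume for this reimplementation, not code released with R3L: the forward and backward agents alternate within a single non-episodic stream, the forward agent is rewarded by the VICE classifier, and the backward agent by an RND novelty bonus. All names (`forward_agent`, `backward_agent`, `vice_reward`, `rnd_bonus`, `phase_len`) are hypothetical placeholders.

```python
# Hypothetical sketch of the reset-free R3L-style training loop we assume;
# the agents and intrinsic-reward functions are placeholders.
def reset_free_training(env, forward_agent, backward_agent,
                        vice_reward, rnd_bonus,
                        total_steps=1_000_000, phase_len=200):
    obs = env.reset()  # single reset at the very start of training
    for step in range(total_steps):
        # Alternate between the forward (task-solving) and backward
        # (perturbation) agents every `phase_len` environment steps.
        use_forward = (step // phase_len) % 2 == 0
        agent = forward_agent if use_forward else backward_agent
        action = agent.act(obs)
        next_obs = env.step(action)[0]
        # The environment reward is ignored; each agent labels transitions
        # with its own intrinsic reward.
        reward = vice_reward(next_obs) if use_forward else rnd_bonus(next_obs)
        agent.observe(obs, action, reward, next_obs)
        agent.update()
        obs = next_obs  # no reset between phases
```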
Behaviors of R3L
R3L is performing forward task-solving.
R3L is performing random perturbation.
Visualization of the learned reward function on the PointUMaze task. It is trained as a classifier to distinguish states from the goal state, so states closer to the goal state receive higher reward. The classifier's output smooths over to the area near the initial state (blue circle), which leads the agent to consistently move in the wrong direction. This behavior can be seen in the eval video above.
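For reference, here is a minimal sketch of the kind of VICE-style classifier reward we assume here (PyTorch; `GoalClassifier`, `classifier_reward`, and `train_step` are our illustrative names, not the original R3L code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GoalClassifier(nn.Module):
    """Binary classifier: does this state look like a goal state?"""
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs)  # logit; higher means "more goal-like"

def classifier_reward(clf, obs):
    # Reward for the forward agent: probability the state is a goal state.
    with torch.no_grad():
        return torch.sigmoid(clf(obs)).squeeze(-1)

def train_step(clf, optimizer, goal_states, visited_states):
    # Positives: goal states; negatives: states visited by the agent.
    pos_logits = clf(goal_states)
    neg_logits = clf(visited_states)
    loss = (F.binary_cross_entropy_with_logits(pos_logits, torch.ones_like(pos_logits))
            + F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the classifier only sees goal states as positives, its output can generalize to geometrically nearby regions such as the area around the initial state, which is exactly the failure mode visualized above.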
Oracle is not able to solve the task.
If the robotic arm pushes the block outside the constrained region, the block teleports back in. This unrealistic jittering behavior is hard to predict, and model-based exploration methods tend to exploit it.
When we replace the artificial position constraint with a physical wall, which is more realistic, the jittering behavior disappears.