Active Exploration for Robotic Manipulation

Abstract: Robotic manipulation remains a largely unsolved problem despite significant advances in robotics and machine learning in recent years. One of the key challenges in manipulation is efficiently exploring the dynamics of the environment when there is extensive contact between the manipulated objects. In this paper, we propose a model-based active exploration approach that enables efficient learning in sparse-reward robotic manipulation tasks. The proposed method estimates an information gain objective using an ensemble of probabilistic models and uses model predictive control (MPC) to plan actions online, maximizing the expected reward while performing directed exploration. We evaluate the algorithm in simulation and on a real robot on a challenging ball-pushing task in which the target ball position is not known to the agent a priori. Furthermore, we establish a close link between our method and Active Inference (AI), a general framework for decision making under uncertainty, by showing that our agent can be seen as performing active inference on a more general dynamics model that does not require the mean-field assumption. Our experiments thus demonstrate, for the first time, an AI agent solving a challenging manipulation task on a real robot.
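To make the high-level idea concrete, below is a minimal sketch of how an intrinsic, exploration-driving bonus can be derived from the disagreement of an ensemble of probabilistic dynamics models and combined with the extrinsic reward inside a sampling-based MPC loop. This is an illustration under simplifying assumptions, not the paper's implementation: the class and function names (GaussianEnsemble, plan_random_shooting), the disagreement measure, the random-shooting planner, and the target-zone reward are all placeholders for the actual information gain estimator and planner.

```python
# Minimal sketch (illustrative, not the paper's code): intrinsic reward from
# ensemble disagreement, combined with an extrinsic reward in a random-shooting
# MPC planner.
import numpy as np

class GaussianEnsemble:
    """Stand-in for an ensemble of probabilistic dynamics models.
    Each member predicts a Gaussian over the next state."""
    def __init__(self, n_members, state_dim, action_dim, rng):
        # Randomly initialized linear models as placeholders for learned networks.
        self.W = rng.normal(scale=0.1, size=(n_members, state_dim, state_dim + action_dim))
        self.log_std = rng.normal(scale=0.1, size=(n_members, state_dim))

    def predict(self, state, action):
        x = np.concatenate([state, action])
        means = self.W @ x                      # (n_members, state_dim)
        stds = np.exp(self.log_std)             # (n_members, state_dim)
        return means, stds

def intrinsic_reward(means):
    """Disagreement between ensemble members as a simple stand-in for the
    information gain: variance of the predicted means across the ensemble."""
    return means.var(axis=0).mean()

def plan_random_shooting(ensemble, extrinsic_fn, state, horizon=10,
                         n_candidates=256, action_dim=2, beta=1.0, rng=None):
    """Return the first action of the candidate sequence that maximizes
    extrinsic reward + beta * intrinsic (exploration) reward."""
    rng = rng or np.random.default_rng()
    best_return, best_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = state.copy(), 0.0
        for a in actions:
            means, _ = ensemble.predict(s, a)
            total += extrinsic_fn(s) + beta * intrinsic_reward(means)
            s = means.mean(axis=0)              # propagate the mean prediction
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action

# Hypothetical usage with a sparse "target zone" reward.
rng = np.random.default_rng(0)
ensemble = GaussianEnsemble(n_members=5, state_dim=4, action_dim=2, rng=rng)
sparse_reward = lambda s: float(np.linalg.norm(s[:2]) < 0.05)
action = plan_random_shooting(ensemble, sparse_reward, np.zeros(4), rng=rng)
```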

Talk for IROS 2022

Supporting Video

Additional Experiments on the Real System

In this section, we present additional experiments on the real system with different table inclinations, with rotation of the robot's finger either enabled or disabled. We find that with finger rotation disabled, our method reliably solves the task for all three table inclinations we tested (15°, 30°, 45°). With finger rotation enabled, on the other hand, the task becomes much more challenging, as the action space grows by one dimension. Our method still solves the task for inclinations of 15° and 30°, but converges to a suboptimal solution for the 45° inclination.

15° table inclination, finger rotation disabled

In this experiment, we set the table inclination to 15° and do not allow the robot to rotate its finger. The experiment shows how our agent learns to balance the ball on the finger, driven purely by intrinsic reward. Since the agent has no knowledge of the system dynamics or the (sparse) reward function, the only reason it has to balance the ball is to discover novel, unexplored states. After some time of systematically moving the ball around on the table, our agent eventually discovers that it receives a reward of 1 per step if the ball is moved into a target zone located at the top center of the table. As soon as this reward is discovered, the extrinsic term dominates the agent's behavior and it starts exploiting this reward.

15° table inclination, finger rotation enabled

In this experiment, we set the table inclination to 15° and allow the robot to rotate its finger. Allowing finger rotation gives the agent more control over the ball, but also makes the task harder, as the action space grows by one dimension.

30° table inclination, finger rotation disabled

In this experiment, we increase the table inclination to 30° and do not allow the robot to rotate its finger. Notable in this experiment is the sharp increase in performance after 60,000 steps. The reason for this increase is that, from this point on, we learn the variances of our dynamics and reward models. Learning variances allows the agent to factor uncertainty into planning. We believe that having a model of uncertainty lets the agent avoid actions for which it cannot tell whether the ball will remain on the finger.
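As a rough illustration of what "learning variances" can look like, the sketch below trains a dynamics model with a mean head and a log-variance head by minimizing the Gaussian negative log-likelihood of observed transitions. This is an assumption-laden example: the architecture, dimensions, and training loop are illustrative and not taken from the paper.

```python
# Minimal sketch (illustrative, not the paper's code): a probabilistic dynamics
# model that predicts a mean and a log-variance for the next state, trained by
# minimizing the Gaussian negative log-likelihood.
import torch
import torch.nn as nn

class ProbabilisticDynamics(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, state_dim)
        self.logvar_head = nn.Linear(hidden, state_dim)

    def forward(self, state, action):
        h = self.body(torch.cat([state, action], dim=-1))
        return self.mean_head(h), self.logvar_head(h)

def gaussian_nll(mean, logvar, target):
    # 0.5 * [ log(var) + (target - mean)^2 / var ], summed over state dimensions.
    return 0.5 * (logvar + (target - mean) ** 2 / logvar.exp()).sum(dim=-1).mean()

# One training step on a (dummy) batch of observed transitions (s, a, s').
model = ProbabilisticDynamics(state_dim=4, action_dim=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
s, a, s_next = torch.randn(32, 4), torch.randn(32, 2), torch.randn(32, 4)
mean, logvar = model(s, a)
loss = gaussian_nll(mean, logvar, s_next)
opt.zero_grad(); loss.backward(); opt.step()
```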

One example where uncertainty needs to be considered during planning is the avoidance of ball spin, which often causes the ball to drop. Since the agent can only observe the position of the ball, it has no way of knowing whether the ball is currently spinning. However, the ball behaves very differently in the two cases; in particular, it is far more stable on the finger when it is not spinning. If we allow the agent to learn different variances for different state transitions, it can assign a high variance to state-action pairs for which it is uncertain whether the ball might start to spin. By factoring these uncertainties into planning, the agent avoids jerky movements that may or may not lead to ball spin.
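One simple way such learned uncertainty can discourage risky actions during planning is sketched below: the predicted standard deviation along a candidate trajectory is subtracted from its return, so trajectories whose outcome the model is unsure about (e.g., jerky motions that may induce ball spin) score lower. The function name, the kappa trade-off parameter, and the penalty form are assumptions for illustration; the paper's planner does not necessarily use this exact penalty.

```python
# Minimal sketch (illustrative names): penalize predicted uncertainty when
# scoring candidate trajectories during planning.
import numpy as np

def uncertainty_penalized_return(rewards, pred_stds, kappa=1.0):
    """rewards: (horizon,) predicted rewards along the trajectory.
    pred_stds: (horizon, state_dim) predicted std devs of the dynamics model.
    Larger kappa makes the planner more conservative."""
    rewards = np.asarray(rewards, dtype=float)
    penalty = np.asarray(pred_stds, dtype=float).mean(axis=-1)
    return float((rewards - kappa * penalty).sum())

# A trajectory with identical rewards but higher model uncertainty scores lower:
safe = uncertainty_penalized_return([1, 1, 1], [[0.1, 0.1]] * 3)
risky = uncertainty_penalized_return([1, 1, 1], [[0.8, 0.9]] * 3)
assert risky < safe
```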

30° table inclination, finger rotation enabled

In this experiment, we set the table inclination to 30° and allow the robot to rotate its finger. Allowing finger rotation while increasing table inclination makes the exploration of this task significantly more challenging, as even small rotations of the finger usually lead to the ball being dropped if not counteracted.

45° table inclination, finger rotation disabled

In this experiment, we further increase the table inclination to 45° and do not allow the robot to rotate its finger. Here, we fork the agent into two versions after 60,000 steps. One version (solid line) is our regular agent, which starts learning the variances of the dynamics and reward models at this point. In the other version (dotted line), the variances are never learned. The plot below shows that not learning the variances causes performance to deteriorate over time in this experiment.

45° table inclination, finger rotation enabled

In this experiment, we set the table inclination to 45° and additionally allow the robot to rotate its finger. The combination of these two settings makes the task extremely challenging, as small rotations of the finger can cause the ball to drop before the agent has time to react. Although the agent initially balances the ball and explores the table, it is unable to find the reward in this experiment. Eventually, all states around the initial state are sufficiently explored and the intrinsic reward in this area becomes close to zero. Although there are still unexplored states in the state space, the planner fails to find a path to them and eventually converges to a local optimum in which the robot stops moving to save action cost.