Scaling data-driven robotics with reward sketching and batch reinforcement learning
Large-scale machine learning is one of the greatest triumphs of artificial intelligence in the last decade. The recipe for success typically involves the following ingredients:
- a large deep network,
- a massive curated dataset,
- and a lot of computational power.
With this recipe, we witnessed a big leap in performance in many areas, including language modelling, image understanding, speech recognition, and playing games, such as Go, Dota and StarCraft. How can we bring this simple, yet revolutionary, recipe to robotics? While many challenges remain, this work is an initial step in this direction.
We start with scripted policies and human teleoperation to construct a dataset of robot experience, which grows continuously as new experience is added. We refer to it as NeverEnding storage. To solve a particular task, we learn a corresponding reward function with a new technique for collecting human preferences: reward sketching. The learned reward function can then be applied across the entire NeverEnding storage to label all of the data automatically. The resulting labeled dataset is enough to learn a control policy, end-to-end from pixels, without collecting additional data on the robot. That is, the trial-and-error learning happens in the "mind" of the robot, freeing us from having to run the robot in the real world. The more tasks the robot solves, the more data it gathers, and all of it becomes useful when learning new skills.
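The pipeline above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the episode dictionary layout, and the idea of representing the reward model as a plain callable are all assumptions made for clarity. The key point is that the learned reward model relabels every stored episode, and the batch RL learner only ever samples transitions from that relabeled storage.

```python
import numpy as np

def label_episodes(reward_model, episodes):
    """Apply a learned reward model to every frame of every stored
    episode, turning unlabeled experience into a labeled dataset.
    `reward_model` maps one observation to a scalar reward
    (a placeholder for the trained per-frame reward network)."""
    labeled = []
    for episode in episodes:
        rewards = np.array([reward_model(obs)
                            for obs in episode["observations"]])
        labeled.append({**episode, "rewards": rewards})
    return labeled

def sample_transitions(labeled, batch_size, rng):
    """Sample (s, a, r, s') transitions uniformly from the labeled
    storage -- the only data the batch RL learner ever sees; no new
    robot interaction is required."""
    batch = []
    for _ in range(batch_size):
        ep = labeled[rng.integers(len(labeled))]
        t = rng.integers(len(ep["actions"]))
        batch.append((ep["observations"][t], ep["actions"][t],
                      ep["rewards"][t], ep["observations"][t + 1]))
    return batch
```

Because relabeling is just a forward pass of the reward model, adding a new task means sketching rewards for a handful of episodes and re-running `label_episodes` over the whole storage, rather than collecting a new dataset.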
Our framework for data-driven robotics results in policies which are:
- Faster than a human at tasks involving complex interactions among several objects.
- Robust to adversarial interference with the robot while it is solving a task.
- Generalizable to diverse objects, including deformable, hard-to-simulate, and hard-to-track objects such as cloth.
Our reward sketching interface is simple and efficient. In the interactive demo, the annotator clicks on the light blue box and "sketches" the reward to indicate how well the robot is doing in the video above the box. The green band indicates that the task (in this case, stacking the green object on the red one) is accomplished. The lowest box shows the reward prediction of the reward model being trained.
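A sketched curve gives a per-frame regression target, so the reward model can be fit by simple supervised learning. The sketch below shows the idea with a linear model over precomputed frame features and plain gradient descent; the real reward model is a deep network over pixels, and the function name and hyperparameters here are illustrative assumptions only.

```python
import numpy as np

def fit_reward_model(features, sketches, lr=0.1, steps=500):
    """Fit a linear per-frame reward predictor to human reward
    sketches by minimizing squared error with gradient descent.
    features: (n_frames, d) frame embeddings (stand-ins for the
    deep network's representation); sketches: (n_frames,) human-drawn
    reward values in [0, 1], one per video frame."""
    w = np.zeros(features.shape[1])
    n = len(sketches)
    for _ in range(steps):
        pred = features @ w
        grad = features.T @ (pred - sketches) / n  # d(MSE)/dw
        w -= lr * grad
    return w
```

Because each sketch labels every frame of an episode at once, a few minutes of annotation yields thousands of training targets, which is what makes the approach data-efficient on the human side.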
Policy for the task: stack green object on red object
In this video, a human operator uses a gripper to try to prevent the robot from accomplishing the task. The robot has never seen such a gripper interfering with its mission. Still, with perseverance it successfully solves the task for arbitrary object configurations.
In this video, the robot stacks a new object in each episode. The objects have diverse shapes and appearances rarely seen in the training dataset. Nevertheless, the robot succeeds in each new trial.
Some failure modes:
Because we don't need explicit object tracking, either for the agent or for the reward model, we can learn tasks involving deformable objects equally efficiently within the same framework:
USB insertion within a day
As described in the appendix of the paper, we concurrently trained an agent and a reward function, from human demonstrations, human feedback, and the agent's own data, to insert a USB connector into a standard socket. Training started from scratch, used only pixel inputs, and completed within a day.
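The concurrent setup can be pictured as an interleaved loop: each round, fresh human feedback refines the reward model, then the agent takes a batch RL update on the relabeled data. The loop below is a minimal sketch of that schedule, with all three components left as placeholder callables; the structure, not the internals, is what the paper's appendix describes.

```python
def train_concurrently(update_reward, update_agent, collect_feedback, rounds):
    """Sketch of interleaved reward-model and agent training.
    collect_feedback: returns new demonstrations and reward sketches;
    update_reward:    refits the reward model on that feedback;
    update_agent:     relabels the storage and updates the policy.
    All three are hypothetical placeholders for the real components."""
    log = []
    for r in range(rounds):
        feedback = collect_feedback()   # demonstrations + sketches
        update_reward(feedback)         # refine the reward model
        update_agent()                  # batch RL on relabeled data
        log.append(r)
    return log
```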
Because the policy is vision-only and actions are defined in the wrist frame, it is robust to positional changes not seen during training. In this video, the agent was trained only on the unperturbed setup, yet it can still perform insertions after the computer is moved significantly.