How do we make robots which can learn "on the job"?
One of the great promises of learning robots is that they will be able to learn from their mistakes and continuously adapt to ever-changing environments. Despite this potential, most robot learning systems today are trained once, then deployed, and their behavior is never adapted during deployment. We would like to build robots which can adapt their previously learned behaviors to new environments, objects, and conditions in the real world.
Why does this matter?
Consider the robot on the right. It was originally trained to grasp a wide variety of objects (top), but it can't possibly see every object during training. In the real world, it will eventually encounter an object it has never seen before and can't grasp well, such as transparent bottles (bottom), which are notoriously hard for robots to grasp. For the few learning robots out in the world today (e.g. self-driving cars), this would be the end of the story.
Conceivably, the robot's designers could add these bottles to the robot's training routine, train a new grasping program, and update all their robots out in the world so that they could now grasp the bottles. However, if we imagine a world with many learning robots trying to help humans across many different environments, tasks, and circumstances, this doesn't seem very practical.
Why not just have the robot learn how to pick up the bottles on-the-fly? Making that possible is the goal of this project.
Let's start by adapting just once.
If we want to adapt over-and-over again on the fly, the first step is to do it just once.
First, our robot uses a lot of data to learn to do something useful and fairly general (1. Pre-Train). In the experiments here, we pre-train the robot to grasp a wide variety of objects using a dataset of 608,000 grasp attempts. That's a lot!¹
Then something changes. In this example, we move the robot's fingers 10cm to the right of where they started. This halves its grasping success rate from 86% to just 43%. Our robot can kind of grasp objects with its new gripper, but not very well. We let it practice grasping with the new fingers for a while (2. Explore), and have it save logs of those attempts into a new dataset (Target Data). Now our robot has a lot of experience on the original task (plain grasping, 608k attempts), and just a little experience trying to use its new fingers (800 attempts).
We show that you can do something really simple to adapt: start the original learning algorithm where you left off from pre-training (3. Initialize), but this time instead of using only the original attempts to train, mix in your new attempts as well, making sure you visit these new examples very often (50% of the time, even though they are only 800 of the 608,800 total attempts we have at our disposal).
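This 50/50 mixing rule is easy to picture in code. Below is a minimal sketch (the function and dataset names are ours, not from the actual training system) of how each training batch might be composed so that the 800 new attempts are seen half the time, despite being a tiny fraction of the total data:

```python
import random

def mixed_batch(pretrain_data, target_data, batch_size, target_fraction=0.5):
    """Compose a training batch that heavily oversamples the small target dataset.

    Half of each batch (target_fraction) is drawn from the ~800 new attempts,
    and the rest from the ~608k pre-training attempts, so the new experience
    is visited far more often than its share of the combined data.
    """
    n_target = int(batch_size * target_fraction)
    batch = [random.choice(target_data) for _ in range(n_target)]
    batch += [random.choice(pretrain_data) for _ in range(batch_size - n_target)]
    random.shuffle(batch)  # avoid a fixed old/new ordering within the batch
    return batch
```

Sampling with replacement like this means each small target example is revisited thousands of times over the course of fine-tuning, which is exactly the point: the new data must dominate the learning signal without discarding the old data.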
We run our learning algorithm for a while in this configuration (4. Adapt), then ask the robot to try again (5. Evaluate). In our example on the right, the robot achieved a 98% success rate with its new fingers--that's even better than the 86% it started with!
How did we do?
You can see a couple of examples on the right.
First, let's try to grasp transparent bottles, which the robot didn't see during pre-training. Notice that the non-fine-tuned policy (top-left) gets confused, jams its gripper between bottles, and has a 49% overall success rate. The fine-tuned policy (top-right) successfully grasps the bottle without jamming. At 66%, it's not perfect, but it's noticeably better than the non-tuned policy.
Next, let's look at the offset-fingers example from above, this time in more detail. You can see the non-fine-tuned policy (bottom-left) tries to grasp objects but misses, because it doesn't know where its fingers are, and gets a 43% success rate. The fine-tuned policy (bottom-right) succeeds nearly every time, earning itself a 98% success rate.
Take a look at the diagram below for a full summary of the results.
What if we did something simpler instead?
The fine-tuning procedure we outlined above seems to be doing pretty well, but now is a good time to take a step back and ask "what if we did something simpler?"
The experiment on the right shows the result of trying a somewhat simpler method. Rather than asking our robot to first learn to grasp (recall, it used 608,000 pre-stored attempts to do that), we skip this lengthy step. Instead, we use a much more general dataset called ImageNet, which is a database of about 14 million images of objects throughout the real world. The robot can't actually try to grasp those objects, so instead we train it to simply identify them properly (classification). Then we use what it learned from this pre-training to start the fine-tuning process with the new data, as described above.
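In code, swapping in ImageNet pre-training amounts to keeping the classifier's learned visual features and replacing its output layer before fine-tuning on grasp attempts. Here's a rough sketch with hypothetical layer names (the real networks are much larger, and the copying happens inside the training framework):

```python
import numpy as np

def init_from_classifier(classifier_weights, grasp_head_shape=(64, 1)):
    """Sketch of reusing ImageNet-classifier features for grasping.

    Assumes both networks share a convolutional trunk (the "trunk.*" names
    here are placeholders): copy the trunk weights, drop the classification
    head, and give the new grasping head fresh random weights. Fine-tuning
    then starts from this partially pre-trained network.
    """
    # Keep everything except the classifier's output head.
    grasp_weights = {name: w.copy() for name, w in classifier_weights.items()
                     if not name.startswith("head.")}
    # The grasping head has no classifier counterpart, so it starts random.
    rng = np.random.default_rng(0)
    grasp_weights["grasp_head"] = rng.normal(0.0, 0.01, grasp_head_shape)
    return grasp_weights
```

The trunk gives the robot a head start on recognizing objects in images; everything specific to grasping still has to be learned from the robot's own attempts.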
Looking at the result of this process (right), you can see that it was actually fairly successful: the robot learned to identify and even grasp towards real objects. Unfortunately, the images it trained on are 2D, and it never got the chance to interact with the objects in them, so it never learned depth perception.
If we can do it once, can we do it over and over again?
We don't want robots which can only adapt once; we want real-world robots which can adapt over-and-over again, because the world is constantly changing. Now that we've verified that we can use a simple technique to adapt once, we can ask "Can we do it again? How many times?"
To find out, we set up the experiment depicted in the diagram above. You'll notice it looks a lot like the first diagram we showed you, but instead of pre-training once and then adapting once, our robot adapts over-and-over again to 5 new situations. Rather than pre-training every time, we use the result of each step as the starting point for the next step.
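This chained procedure can be sketched in a few lines. In the sketch below, `explore` and `fine_tune` are stand-ins for the real robot pipeline (the Explore and Initialize/Adapt steps from earlier); the key point is that each round starts from the previous round's policy, while the original pre-training data is reused every time:

```python
def continual_adapt(policy, pretrain_data, situations, explore, fine_tune):
    """Adapt one policy to a sequence of new situations, never restarting.

    explore(policy, situation)                 -> small target dataset of attempts
    fine_tune(policy, pretrain_data, target)   -> updated policy
    Both are placeholders for the real robot training pipeline.
    """
    for situation in situations:
        target_data = explore(policy, situation)                # 2. Explore
        policy = fine_tune(policy, pretrain_data, target_data)  # 3-4. Initialize + Adapt
    return policy
```

Because the policy is threaded through the loop rather than reset, the robot never repeats the expensive pre-training step, and each adaptation costs only a small amount of new practice.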
You can see a summary of the results in the above diagram. If you compare it to the summary of single-step results just above, you see that the final success rates aren't very different. This is good news! It tells us that repeatedly fine-tuning doesn't perform much worse than single-step fine-tuning.
On the right are a couple examples of the robot's performance from this continual learning experiment. Notice that the experiment in which we move the robot's fingers has a 91% success rate, which is only 7% less than its single-step counterpart. Similarly, when we put a checkerboard behind the robot's workspace (to try to fool it into trying to pick up the background), we can achieve an 86% success rate with continual fine-tuning. This is 36% better than not fine-tuning, and only 4% less than single-step fine-tuning.
Where do we go from here?
These results are exciting for three reasons:
It's fast. Our robot can adapt to new situations very quickly (about 1-4 hours of practice, compared to the ~6000 hours it took to learn to grasp in the first place).
It's simple. The method our robot uses is very simple. It's barely different than the basic procedure we already use for training robots, and doesn't require any new algorithms. It reuses data and technology we already have.
It's repeatable. Our robot can use the method over-and-over again to adapt to new situations, and its performance doesn't degrade the more it adapts. This is a key property for building a robot which learns new things every day.
Using this research, we hope to investigate a few new questions in the future:
How extreme are the changes our robot can adapt to? For instance, can we use a technique like this to teach our grasping robot how to sort what it picks up?
Can we make it automatic? In these experiments, the humans decided when to stop collecting data on the new challenge, and they decided when to stop running the learning algorithm with that new data. What if the robot could make those decisions for itself?
Can we use this to make a robot which learns every day? What if, instead of adapting to 5 new things in a row, we tried to adapt to 365, one for every day of the year? Will the system break down? Even better, can we make it so that learning one new thing helps you do better on things you already know how to do?
¹Keep in mind, we only have to do that initial trial-and-error process once. In practice, we just store a big database of these attempts and re-use them all the time, rather than starting from scratch. This is similar to how other machine learning technologies work, such as image recognition. Image recognition systems learn from large datasets of pre-labeled images which are used over-and-over again. Our system uses a large dataset of pre-labeled grasps.