GoalsEye: Learning High Speed Precision Table Tennis on a Physical Robot

Tianli Ding, Laura Graesser, Saminda Abeyruwan, David B. D’Ambrosio, Anish Shankar, Pierre Sermanet, Pannag R. Sanketi*, Corey Lynch*

*indicates equal advising

Robotics at Google

arXiv

Precise Goal-Reaching with Iterative Goal Conditioned Behavior Cloning and Self-Supervised Practice

Policy aiming at five different targets while the position and direction of incoming balls are varied (varied incoming ball distribution). Dots indicate the landing point of a ball, one color per target. Real time video.

50 attempts to reach a single target while the position and direction of incoming balls are varied (varied incoming ball distribution). Dots indicate the landing point of a ball. Real time video.


Abstract

Learning goal conditioned control in the real world is a challenging open problem in robotics. Reinforcement learning systems have the potential to learn autonomously via trial-and-error, but in practice the costs of manual reward design, ensuring safe exploration, and hyperparameter tuning are often enough to preclude real world deployment. Imitation learning approaches, on the other hand, offer a simple way to learn control in the real world, but typically require costly curated demonstration data and lack a mechanism for continuous improvement. Recently, iterative imitation techniques have been shown to learn goal directed control from undirected demonstration data, and to improve continuously via self-supervised goal reaching, but results thus far have been limited to simulated environments. In this work, we present evidence that iterative imitation learning can scale to goal-directed behavior on a real robot in a dynamic setting: high speed, precision table tennis (e.g. "land the ball on this particular target"). We find that this approach offers a straightforward way to do continuous on-robot learning, without complexities such as reward design or sim-to-real transfer. It is also scalable: sample efficient enough to train on a physical robot in just a few hours. In real world evaluations, we find that the resulting policy can perform on par with or better than amateur humans (with players sampled randomly from a robotics lab) at the task of returning the ball to specific targets on the table. Finally, we analyze the effect of the size of the initial undirected bootstrap dataset on performance, finding that a modest amount of unstructured demonstration data provided up-front drastically speeds up the convergence of a general purpose goal-reaching policy.

Motivation

Imitation Learning (IL) provides a simple and stable approach to learning robot behavior, but requires access to demonstrations. Collecting expert demonstrations of precise goal targeting in such a high speed setting, say via teleoperation or kinesthetic teaching, is a complex engineering problem. Attempting to learn precise table tennis by trial and error using reinforcement learning (RL) is a similarly difficult proposition. It is sample inefficient, and the random exploration typical of the early stages of RL may damage the robot. High-frequency control also results in long horizon episodes. These are among the biggest challenges facing current deep RL techniques. While many recent RL approaches successfully learn in simulation and then transfer to the real world, doing so in this setting remains difficult, especially given the requirement of precise, dynamic control. Here we restrict our focus to learning a hard dynamic problem directly on a physical robot, without involving the complexities of sim-to-real transfer.

In this work, we ask: what is the simplest way to obtain goal conditioned control in a dynamic real world setting such as precision table tennis? Can one design effective alternatives to more intricate RL algorithms that perform well in this difficult setup? In pursuit of this question, we consider the necessity of different components in existing goal conditioned learning pipelines, both RL and IL. Surprisingly, we find that the synthesis of two existing techniques in iterative self-supervised imitation learning, Learning from Play (LFP) and Goal-Conditioned Behavior Cloning (GCBC), indeed scales to this setting.

Method

The essential ingredients of success are: (1) A minimal, but non-goal-directed "bootstrap" dataset of the robot hitting the ball, to overcome a difficult initial exploration problem. (2) Hindsight relabeled goal conditioned behavioral cloning (GCBC) to train a goal-directed policy to reach any goal in the dataset. (3) Iterative self-supervised goal reaching. The agent improves continuously by setting random goals and attempting to reach them using the current policy. All attempts are relabeled and added to a continuously expanding training set. This self-practice, in which the robot expands the training data by setting and attempting to reach goals, is repeated iteratively. We refer to this as GoalsEye, a system for high-precision goal reaching table tennis, trained with goal conditioned behavior cloning plus self-supervised practice (GCBC+SSP).
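To make the recipe concrete, the sketch below outlines the training loop as described above. It is an illustrative outline only, under assumed placeholders (a policy object with an update method, plus bootstrap_episodes, sample_goal, and run_episode standing in for the real robot stack and episode format); it is not the authors' implementation.

```python
# Illustrative sketch of GCBC + self-supervised practice (GoalsEye-style), not the authors' code.
import random


def hindsight_relabel(episode):
    """Relabel every step of an episode with the goal it actually achieved,
    i.e. the observed landing point of the returned ball (hypothetical episode format)."""
    achieved_goal = episode["landing_point"]
    return [(obs, action, achieved_goal) for obs, action in episode["steps"]]


def train_gcbc(policy, dataset, gradient_steps):
    """Goal-conditioned behavior cloning: supervised learning of actions
    conditioned on (observation, relabeled goal)."""
    for _ in range(gradient_steps):
        obs, action, goal = random.choice(dataset)
        policy.update(obs, goal, target_action=action)  # e.g. maximize action likelihood


def goals_eye(policy, bootstrap_episodes, sample_goal, run_episode,
              practice_rounds=10, episodes_per_round=1000, gradient_steps=10_000):
    # (1) Bootstrap: a small, non-goal-directed dataset of the robot hitting the ball.
    dataset = []
    for episode in bootstrap_episodes:
        dataset.extend(hindsight_relabel(episode))
    train_gcbc(policy, dataset, gradient_steps)

    # (2) + (3) Self-supervised practice: set a random goal, attempt it with the
    # current policy, relabel the attempt with whatever it actually achieved, and
    # fold it back into a continuously expanding training set.
    for _ in range(practice_rounds):
        for _ in range(episodes_per_round):
            goal = sample_goal()                 # e.g. uniform over the opponent's side of the table
            episode = run_episode(policy, goal)  # executes the goal-conditioned policy on the robot
            dataset.extend(hindsight_relabel(episode))
        train_gcbc(policy, dataset, gradient_steps)
    return policy
```

Because every attempt is relabeled with the goal it actually achieved, every episode, successful or not, becomes useful supervision for some goal, which is what allows the dataset and the policy to improve together without any reward function.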

Demonstrations and self-improvement through practice are key

Self-practice yields almost 40ppts improvement over learning from play data in simulation. After training on ~2,600 demonstration trajectories (LFP), goal-reaching accuracy is just 12%. After also self-practicing for 22k episodes (GoalsEye), goal-reaching accuracy is 40%, and after almost 100k episodes it is 50%. Results are averaged over 10 seeds, with each seed trained on a shared dataset.

Self-practice yields 30+ppts improvement over learning from play data on a physical robot. After training on 2,480 demonstration trajectories (LFP) goal-reaching accuracy is just 9%. After also self-practicing for 13.5k episodes (GoalsEye) goal-reaching accuracy is >40%. Results are averaged over 5 seeds, with each seed trained on a shared dataset.

Goal reaching accuracy improves substantially during training. The policies are returning balls from the varied incoming ball distribution. Video playing at 4x real time speed.

The synthesis of techniques is essential. We set the policy the task of returning a variety of incoming balls to any location on the opponent’s side of the table.

The charts above show how goal-reaching accuracy, defined as the % of balls that land <=30cm from the goal, changes during training. In both simulation (left) and on a physical robot (right), self-practice substantially improves performance when the task is difficult.
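For clarity, here is a minimal sketch of that accuracy metric, assuming landing points and goals are 2D table coordinates in meters; the function name and example data are illustrative, not taken from the paper.

```python
# Minimal sketch of the goal-reaching accuracy metric: the fraction of returned
# balls that land within 30 cm of the commanded goal (illustrative code).
import numpy as np


def goal_reaching_accuracy(landing_points, goals, threshold_m=0.30):
    landing_points = np.asarray(landing_points, dtype=float)  # shape (N, 2), meters
    goals = np.asarray(goals, dtype=float)                    # shape (N, 2), meters
    distances = np.linalg.norm(landing_points - goals, axis=1)
    return float(np.mean(distances <= threshold_m))


# Example with made-up landing points: two of three balls land within 30 cm
# of their goals, so accuracy is ~0.67.
acc = goal_reaching_accuracy([(0.10, 0.20), (0.60, 1.00), (0.00, 0.00)],
                             [(0.10, 0.25), (0.20, 1.00), (0.05, -0.05)])
```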

A policy trained on the initial 2,480 demonstrations (bootstrap data) on the physical robot has a goal-reaching accuracy of just 9%. This is equivalent to training using LFP. However, when a policy has also self-practiced (autoplay) for ~13.5k attempts (~15 hours of wall clock time), goal-reaching accuracy rises to 43%. Yet if a policy only self-practices, training fails completely in this setting.

The video on the left visualizes goal-reaching accuracy for a single goal after training on just the demonstrations (orange dots) and after self-practicing (green dots) for 13.5k attempts.

Comparing robots and amateur humans aiming at five fixed goals

Policy aiming at five different goals while the position and direction of the incoming balls is fixed (narrow incoming ball distribution). Circle indicates area <=20cm from the goal and is added post hoc to the video. That is, it is not visible to the policy. Real time video.

Best human amateur aiming at five different goals while the position and direction of the incoming balls is fixed (narrow incoming ball distribution). Circle indicates area <=20cm from the goal and is placed on the table during evaluation for humans to aim at. Real time video.

GoalsEye exceeds average human performance and is on par with the best human amateur on the narrow incoming ball distribution. GoalsEye also exceeds LFP performance; however, the difference is moderate.

Table shows % of balls landing <=30cm | <=20cm from the goal, for 5 different goals. Skill level was self-reported. AA = amateur advanced, AI = amateur intermediate, AB = amateur beginner. A Avg. = amateur average.

GoalsEye exceeds average human performance and is on par with intermediate amateur humans on the varied incoming ball distribution. On this more challenging task, GoalsEye substantially exceeds LFP performance. However, the best human amateur exceeds GoalsEye's performance, indicating further room for improvement.

Table shows % of balls landing <=30cm | <=20cm from the goal, for 5 different goals. Skill level was self-reported. AA = amateur advanced, AI = amateur intermediate, AB = amateur beginner. A Avg. = amateur average.

Varied incoming ball distribution: 50 attempts per goal (real time videos)

Goal A

Goal B

Goal C

Goal D

Goal E