Learning Robot Soccer from Egocentric Vision with Deep Reinforcement Learning

(Full Paper)


Dhruva Tirumala, Markus Wulfmeier, Ben Moran, Sandy Huang, Jan Humplik, Guy Lever, Tuomas Haarnoja, Leonard Hasenclever, Arunkumar Byravan, Nathan Batchelor, Neil Sreendra, Kushal Patel, Marlon Gwira, Francesco Nori, Martin Riedmiller, Nicolas Heess

Abstract


We apply multi-agent deep reinforcement learning (RL) to train end-to-end robot soccer policies with fully onboard computation and sensing via egocentric RGB vision. This setting reflects many challenges of real-world robotics, including active perception, agile full-body control, and long-horizon planning in a dynamic, partially observable, multi-agent domain. We rely on large-scale, simulation-based data generation to obtain complex behaviors from egocentric vision that can be successfully transferred to physical robots using low-cost sensors. To achieve adequate visual realism, our simulation combines rigid-body physics with learned, realistic rendering via multiple Neural Radiance Fields (NeRFs). We combine teacher-based multi-agent RL and cross-experiment data reuse to enable the discovery of sophisticated soccer strategies. We analyze active-perception behaviors, including object tracking and ball seeking, that emerge when simply optimizing perception-agnostic soccer play. The agents display levels of performance and agility comparable to those of policies with access to privileged, ground-truth state. To our knowledge, this paper constitutes a first demonstration of end-to-end training for multi-agent robot soccer, mapping raw pixel observations to joint-level actions, that can be deployed in the real world.
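As a rough, hypothetical sketch of the data-generation setup summarized above, the snippet below illustrates how a single simulated step could combine a rigid-body physics state with learned NeRF rendering to produce the low-resolution egocentric RGB observation from which the policy outputs joint-level actions. All class and function names are illustrative stubs, not the system's actual APIs.

```python
import numpy as np


class RigidBodySim:
    """Illustrative stand-in for the rigid-body physics simulator."""

    def step(self, joint_targets: np.ndarray) -> None:
        """Advance the dynamics given joint-level targets (stub)."""

    def head_camera_pose(self) -> np.ndarray:
        """Return the robot's head-camera pose as a 4x4 camera-to-world matrix (stub)."""
        return np.eye(4)


class NeRFRenderer:
    """Illustrative stand-in for a learned NeRF rendering of the soccer pitch."""

    def render(self, camera_pose: np.ndarray, height: int = 30, width: int = 40) -> np.ndarray:
        """Render an egocentric RGB frame from the given camera pose (stub)."""
        return np.zeros((height, width, 3), dtype=np.float32)


def simulated_rollout_step(sim: RigidBodySim, renderer: NeRFRenderer, policy) -> np.ndarray:
    """One step of data generation: render, act from pixels, step the physics."""
    frame = renderer.render(sim.head_camera_pose())  # 40x30 egocentric RGB observation
    action = policy(frame)                           # joint-level action from raw pixels
    sim.step(action)
    return action
```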

1) 1v1 Matches

A) Matches on real robots

Below we demonstrate three matches of multi-agent soccer with onboard sensing and computation on real OP3 robots.

In the third match, we initialize the agents on the opponent's side of the pitch, requiring them to turn around and shoot into the goal behind them. In each match, the egocentric camera view fed to the agent is shown in the bottom right corner.

Match 1

Match 2

Match 3

B) Behavior Analysis

On the right, we replay parts of the matches but edit the video to highlight particular gameplay elements that emerge during training, including ball-searching behavior and multi-agent behaviors such as blocking and positioning. These behaviors emerge naturally, without any explicit incentive or changes to the reward structure.

C) Matches in simulation

For comparison, we also include below the behaviors that emerge in 1v1 matches in simulation.

 Match 1

Match 2

Match 3

2) Analysis: Emergent Tracking

In this analysis we highlight the emergent tracking behaviors learned by the policy. Using our final vision policy, we train an additional set of MLP heads on the penultimate layer to predict various quantities, such as the egocentric location of the ball and the opponent, and the global position of the walker. Each head parameterizes a mixture-of-Gaussians distribution, with no gradients flowing back into the policy. The actual location of the quantity of interest is shown with a red cross, and the prediction is shown as a heat-map, with red indicating higher likelihood. In each video, the egocentric view fed to the agent is shown in the top right and the predictions are plotted in the bottom right (egocentric predictions are converted to the world frame for ease of interpretability).
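As a rough sketch of how such a probe head could be implemented (assuming a PyTorch-style setup; the class name, layer sizes, and component count below are illustrative rather than the exact architecture we used): the head reads the policy's penultimate-layer features, parameterizes a mixture of Gaussians over the target quantity, and detaches the features so that no gradients flow back into the policy.

```python
import torch
import torch.nn as nn
import torch.distributions as D


class MixtureOfGaussiansHead(nn.Module):
    """Probe head mapping the policy's penultimate-layer features to a
    mixture-of-Gaussians distribution over a 2-D target (e.g. the egocentric
    ball position). Sizes and the component count are illustrative."""

    def __init__(self, feature_dim: int, num_components: int = 5, target_dim: int = 2):
        super().__init__()
        self.num_components = num_components
        self.target_dim = target_dim
        # Per component: 1 mixture logit, target_dim means, target_dim log-scales.
        self.net = nn.Linear(feature_dim, num_components * (1 + 2 * target_dim))

    def forward(self, features: torch.Tensor) -> D.MixtureSameFamily:
        # Detach so that training the probe never updates the policy itself.
        features = features.detach()
        params = self.net(features)
        k, d = self.num_components, self.target_dim
        logits, mu, log_sigma = torch.split(params, [k, k * d, k * d], dim=-1)
        mu = mu.view(*mu.shape[:-1], k, d)
        sigma = log_sigma.view_as(mu).exp()
        mixture = D.Categorical(logits=logits)
        components = D.Independent(D.Normal(mu, sigma), 1)
        return D.MixtureSameFamily(mixture, components)


# Training objective for one head: maximize the likelihood of the true quantity,
# e.g. loss = -head(features).log_prob(true_ball_xy).mean()
```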

We first highlight the visual tracking that emerges in simulation and subsequently show that this tracking transfers to the real robot.

A) Active visual tracking in simulation

Walker position


Egocentric opponent position


Egocentric ball position


B) Active visual tracking in the real world

Egocentric tracking in the real world

In this analysis, we show the egocentric view of the agent (rendered at 40x30 resolution, as seen by the agent) and its predictions of its own position and of the ball's position. As above, the predictions are trained using MLP heads on top of the penultimate layer of the policy that parameterize a mixture of Gaussians. The true object location is shown as a red cross and the agent's prediction as a heat-map, with red indicating higher likelihood.
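For reference, plotting egocentric predictions in the world frame (as mentioned in the previous section) amounts to a simple 2-D rigid transform. A minimal sketch, assuming the walker's global position and heading are available, is shown below; the variable names are illustrative.

```python
import numpy as np


def egocentric_to_world(pred_xy: np.ndarray, walker_xy: np.ndarray, walker_yaw: float) -> np.ndarray:
    """Map an egocentric (x, y) prediction into the world frame by rotating it by
    the walker's heading and translating by the walker's global position."""
    c, s = np.cos(walker_yaw), np.sin(walker_yaw)
    rotation = np.array([[c, -s], [s, c]])
    return walker_xy + rotation @ pred_xy
```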

Object tracking

In this analysis, we run the policy with all of the robot's joints frozen except the head joints. We then analyze the robot's tracking behavior using objects of various colors and shapes, including a control condition with no object in the field of view.
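A minimal sketch of this joint-freezing setup is shown below; the joint count, head-joint indices, and hold pose are illustrative placeholders rather than the OP3's actual configuration.

```python
import numpy as np

NUM_JOINTS = 20                          # illustrative joint count
HEAD_JOINT_IDS = np.array([18, 19])      # e.g. neck pan and tilt (illustrative indices)
HOLD_POSE = np.zeros(NUM_JOINTS)         # pose at which the frozen joints are held


def freeze_body_except_head(policy_action: np.ndarray) -> np.ndarray:
    """Keep only the head-joint commands from the policy; hold all other joints fixed."""
    command = HOLD_POSE.copy()
    command[HEAD_JOINT_IDS] = policy_action[HEAD_JOINT_IDS]
    return command
```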

3) Penalties

We carried out 20 penalty shots on the real robot, 10 each with a yellow and an orange ball. The agent starts on the ground, and its initial locations are randomly distributed around the pitch. In the analysis video (left) we show four examples, covering successful goals, shots that hit the post, and misses, with both the yellow and the orange ball.

4) Testing Visual Robustness