BeTAIL: Behavior Transformer Adversarial Imitation Learning from Human Racing Gameplay

Catherine Weaver, Chen Tang, Ce Hao, Kenta Kawamoto, Masayoshi Tomizuka, Wei Zhan

UC Berkeley, Sony AI

arXiv, Code 

Improving the performance of offline sequence modeling with residual policies and adversarial imitation learning.

Abstract

Recent successes in autonomous racing leverage reinforcement learning; imitation learning is a promising alternative that learns from human demonstrations without requiring hand-designed rewards. However, learning a racing strategy from human demonstrations is difficult due to the unknown decision-making process and the complex environment. Sequence modeling is a powerful non-Markovian approach, but offline learning struggles to overcome distribution shift and adapt to new environments. Adversarial Imitation Learning (AIL) can mitigate this effect; however, AIL can be sample inefficient and may fail to model human decision-making with Markovian policies. To capture the benefits of both approaches, we propose BeTAIL: Behavior Transformer Adversarial Imitation Learning. BeTAIL employs a Behavior Transformer (BeT) to learn a non-Markovian policy from human demonstrations, and an added residual policy corrects errors in the BeT policy. The residual policy is trained with AIL to match the state occupancy of online rollouts to the state occupancy of the demonstrations. We test BeTAIL on three challenges with expert-level demonstrations from real human gameplay in the high-fidelity racing simulator Gran Turismo Sport. First, the BeT and residual policy are trained on the same demonstrations and track, and BeTAIL outperforms standalone BeT and AIL. Then, the BeT policy is pretrained on one or more tracks, and BeTAIL fine-tunes the policy on unseen tracks with limited demonstrations. In all three challenges, BeTAIL reduces the necessary environment interactions and improves racing performance or stability, even when the BeT is pretrained on different tracks.

BeTAIL adds a residual AIL policy to correct BeT actions



BeTAIL employs both the offline BeT and the residual AIL policy. The pretrained BeT predicts an action from the last H state-action pairs. The residual policy then specifies a residual action from the current state and the BeT prediction. The agent executes the sum of the BeT prediction and the residual action in the environment.

Learning an autonomous racing policy from human demonstrations in Gran Turismo Sport

Can BeT pretraining accelerate AIL learning on the same track?

Lago Maggiore Challenge



The Lago Maggiore challenge pretrains the BeT on the same demonstrations and downstream environment; the Lago Maggiore track has 49 demonstration trajectories. BeTAIL leverages the demonstrations to accelerate learning a racing policy.



Results

Figure: Autonomous racing results in Gran Turismo Sport. Left: mean (std) success rate of finishing laps during evaluation. Right: mean (std) lap time.




Table: Best policy's mean ± std lap time and change in steering from the previous time step

Video: Racing results in Gran Turismo Sport.


Behavior Cloning immediately drifts out of the distribution of the demonstrations and does not learn a good racing policy.



The Behavior Transformer accurately models the behavior in the demonstrations and races better than BC. However, the BeT still struggles to race well under distribution shift.



Adversarial Imitation Learning races better than BC or BeT since it can adjust its policy based on online rollouts. However, the policy is unstable and shaky.



Our Behavior Transformer Adversarial Imitation Learning (BeTAIL) combines the accurate non-Markovian modeling of the BeT with online residual AIL policy learning to achieve the fastest and most stable racing policy.







Gran Turismo Sport: TM & © 2021 Sony Interactive Entertainment Inc. Developed by Polyphony Digital Inc. 

What do BeTAIL trajectories look like?

Can BeT pretraining on a library of tracks accelerate AIL on a new track?

Mount Panorama Challenge



The Mount Panorama challenge pretrains the BeT on a library of four tracks, and BeTAIL fine-tunes on an unseen track. There are 107 demonstration trajectories in the library and 1 demonstration trajectory on the downstream track.



Results

Figure: Autonomous racing results in Gran Turismo Sport. Left: mean (std) success rate of finishing laps during evaluation. Right: mean (std) lap time.





Table: Best policy's mean ± std lap time and change in steering from the previous time step

Video: Racing results in Gran Turismo Sport.


Behavior Cloning immediately drifts out of the distribution of the demonstrations and does not learn a good racing policy.



The Behavior Transformer accurately models the behavior in the demonstrations and races better than BC. However, the BeT still struggles to race well under distribution shift.



Adversarial Imitation Learning races better than BC or BeT since it can adjust its policy based on online rollouts. However, the policy is unstable and shaky.



Our Behavior Transformer Adversarial Imitation Learning (BeTAIL) combines the accurate non-Markovian modeling of the BeT with online residual AIL policy learning to achieve the fastest and most stable racing policy.







Gran Turismo Sport: TM & © 2021 Sony Interactive Entertainment Inc. Developed by Polyphony Digital Inc.