Abstract
Deriving robust control policies for realistic urban navigation scenarios is not a trivial task. In an end-to-end approach, these policies must map high-dimensional images from the vehicle's cameras to low-level actions such as steering and throttle. While pure Reinforcement Learning (RL) approaches rely exclusively on engineered rewards, Generative Adversarial Imitation Learning (GAIL) agents learn from expert demonstrations while interacting with the environment, which favors GAIL on tasks for which a reward signal is difficult to derive, such as autonomous driving. However, training deep networks directly from raw images on RL tasks is known to be unstable and troublesome. To deal with that, this work proposes a hierarchical GAIL-based architecture (hGAIL) which decouples representation learning from the driving task to solve the autonomous navigation of a vehicle. The proposed architecture consists of two modules: a GAN (Generative Adversarial Net), which generates an abstract mid-level input representation, namely the Bird's-Eye View (BEV) of the vehicle's surroundings; and the GAIL, which learns to control the vehicle based on the BEV predictions from the GAN as input. hGAIL learns both the policy and the mid-level representation simultaneously as the agent interacts with the environment. Our experiments in the CARLA simulation environment have shown that GAIL exclusively from cameras (without BEV) fails to even learn the task, while hGAIL, after training exclusively on one city, was able to autonomously navigate successfully in 98% of the intersections of a new city not used in the training phase.
Architecture
Hierarchical Generative Adversarial Imitation Learning (hGAIL) for policy learning with a mid-level input representation. It consists of chained GAN and GAIL networks: the first (GAN) generates the BEV representation from the vehicle's three frontal cameras, the sparse trajectory and the high-level command, while the second (GAIL) outputs the acceleration and steering based on the predicted BEV input (generated by the GAN), the current speed and the last applied actions. Both GAN and GAIL learn simultaneously while the agent interacts with the CARLA environment. The discriminator parts of both networks are not shown for simplicity.
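To make the data flow concrete, here is a minimal PyTorch-style sketch of the two chained modules. All module names, layer sizes, image resolutions and the one-hot command encoding are illustrative assumptions (the sparse-trajectory input and both discriminators are omitted for brevity); it shows only the forward pass, not the adversarial training.

```python
# Minimal sketch of the hGAIL data flow: GAN predicts a BEV, GAIL policy consumes it.
# Shapes and layer sizes are illustrative, not the authors' exact implementation.
import torch
import torch.nn as nn

class BEVGenerator(nn.Module):
    """GAN generator: three frontal cameras + high-level command -> predicted BEV."""
    def __init__(self, n_commands=4, bev_channels=3):
        super().__init__()
        self.encoder = nn.Sequential(                      # encode the stacked camera images
            nn.Conv2d(9, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(                      # decode features (+ command) to a BEV image
            nn.ConvTranspose2d(64 + n_commands, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, bev_channels, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, cameras, command):
        h = self.encoder(cameras)                          # (B, 64, H/4, W/4)
        cmd = command[:, :, None, None].expand(-1, -1, h.shape[2], h.shape[3])
        return self.decoder(torch.cat([h, cmd], dim=1))    # predicted BEV

class GAILPolicy(nn.Module):
    """GAIL policy: predicted BEV + current speed + last actions -> [steering, acceleration]."""
    def __init__(self, bev_channels=3):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(bev_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(nn.Linear(32 + 1 + 2, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, bev, speed, last_actions):
        feat = self.cnn(bev)
        return self.head(torch.cat([feat, speed, last_actions], dim=1))

# Chained forward pass: the policy acts on the GAN's BEV prediction, not on the cameras.
gan, policy = BEVGenerator(), GAILPolicy()
cameras = torch.randn(1, 9, 64, 64)                 # three stacked RGB frontal cameras
command = torch.zeros(1, 4); command[0, 1] = 1.0    # one-hot high-level command, e.g. "turn left"
bev = gan(cameras, command)
action = policy(bev, speed=torch.tensor([[5.0]]), last_actions=torch.zeros(1, 2))
```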
Simulated Cities
Training City
Town01 environment of the agent, with one of the routes used by the expert to collect demonstration data.
Test City
Town02 environment of the agent, with one of the routes used to evaluate the agent trained in town01.
Learning Agent
Videos logging the hierarchical GAIL agent interacting with the environment at different points in time: up to 12,288 environment steps (1 cycle); from 135,168 to 147,456 environment steps (11 cycles); from 258,048 to 270,336 environment steps (21 cycles).
The videos depict the agent's interaction with six parallel simulated environments, with the bird's eye view generated by the environment on the left and the bird's eye view generated by the GAN on the right.
Early Train
Middle Train
Late Train
The vehicle's trajectory in town1, in blue, at different moments of the training process. In the early training iterations, errors, marked in red, are common. As training proceeds, fewer and fewer mistakes happen: up to 86,000 environment interactions; from 61,000 to 147,000 interactions; from 245,000 to 331,000 interactions.
Early Train
Middle Train
Late Train
Controllable Agent
top-right
top-left
right-left
right-top
left-right
left-top
Agent's trajectories in town2, in blue, generated by the deterministic policy after training in town1 (at epoch 100), superimposed on the expert trajectory in orange. At the same T intersection, six movements are possible: from top to right, top to left, right to left, right to top, left to right and left to top.
The evaluation videos show the trained agent at a T intersection, approaching from different directions; each video compares the agent's behavior when taking two different actions from the same starting direction. Video 1: top-right and top-left. Video 2: right-left and right-top. Video 3: left-right and left-top.
These videos demonstrate the agent's ability to take different actions at intersections based on the command it receives.
Top-Right and Top-Left
Right-Left and Right-Top
Left-Right and Left-Top
For each route, six images are displayed: a top-down view from a fixed camera showing the intersection, with the desired trajectory plotted in blue and the executed trajectory in orange; the bird's eye view generated by the environment; the bird's eye view generated by the agent's GAN; and the frontal left, central, and right cameras of the vehicle.
Dynamic Environment
Here, we extend the hGAIL agent's architecture to allow for environments with traffic lights, pedestrians and other vehicles.
Three extra binary inputs were added to the GAIL policy in hGAIL, denoting the detection of traffic lights, pedestrians and vehicles in front of the cameras.
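A minimal sketch of how such binary flags could be appended to the policy's low-dimensional input vector; the function name, argument ordering and the detection source are assumptions for illustration, not the exact implementation.

```python
# Sketch: concatenate speed, last applied actions, and the three binary detection flags
# into the low-dimensional part of the GAIL policy input (illustrative only).
import numpy as np

def build_policy_input(speed, last_actions, traffic_light_ahead, pedestrian_ahead, vehicle_ahead):
    """Return the low-dimensional observation vector fed to the policy alongside the BEV."""
    flags = np.array([traffic_light_ahead, pedestrian_ahead, vehicle_ahead], dtype=np.float32)
    return np.concatenate([np.atleast_1d(speed).astype(np.float32),
                           np.asarray(last_actions, dtype=np.float32),
                           flags])

# Example: 4.2 m/s, last [steer, accel] = [0.1, 0.5], a traffic light detected ahead.
obs_vec = build_policy_input(4.2, [0.1, 0.5], traffic_light_ahead=1, pedestrian_ahead=0, vehicle_ahead=0)
print(obs_vec)   # [4.2, 0.1, 0.5, 1.0, 0.0, 0.0]
```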
Is hGAIL better than BC or plain GAIL?
Evaluation performance for 8 T intersections and 6 types of turns in Town2
The results are summarized in the table, whose rows present the results for each of the 6 possible turns at a given T intersection. Each turn, covering around 100 meters, was evaluated at 8 different T intersections, totalling 48 experiments for each agent.
The success percentage for each turn type is given in this table. hGAIL turns without failing at all intersections and for all turn types except for one top-right turn, while BC fails 50% of the time, and GAIL from cameras fails to learn most of the required driving behavior, succeeding in only 4 turns out of 48.
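For reference, a small sketch of the evaluation bookkeeping implied by the table (6 turn types x 8 intersections = 48 runs per agent); the outcomes below are placeholders, not the reported numbers.

```python
# Sketch of per-turn-type success rates over the 6 x 8 evaluation grid (placeholder data).
TURN_TYPES = ["top-right", "top-left", "right-left", "right-top", "left-right", "left-top"]
N_INTERSECTIONS = 8

def success_rates(outcomes):
    """outcomes: dict mapping turn type -> list of 8 booleans (one per T intersection)."""
    return {turn: 100.0 * sum(results) / N_INTERSECTIONS for turn, results in outcomes.items()}

# Placeholder outcomes for one agent (True = turn completed without failure).
example = {turn: [True] * N_INTERSECTIONS for turn in TURN_TYPES}
example["top-right"][3] = False   # e.g. a single failed top-right turn, as reported for hGAIL
print(success_rates(example))     # {'top-right': 87.5, 'top-left': 100.0, ...}
```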
This ablation of the GAN from hGAIL (i.e., GAIL directly from cameras) shows the need for learning the mid-level input representation to succeed in this complex task.
Evaluation of agents in town2, trained exclusively in town1. The plot shows the percentage of completed routes from a total of six Leaderboard routes in town2 vs. environment interactions, averaged over three different runs, where each run entails a different agent trained only in town1.
Not shown in the plot, the Behavior Cloning (BC) and GAIL from cameras agents fail to learn the task and do not complete any route (they would stay at 0% if shown in the plot).
Both hGAIL and the GAIL with real BEV agent are able to generalize the learning from town1 to town2. The latter agent does not have to learn the BEV, as it always has access to the true BEV. The hGAIL ablated agent receives no visual input of the sparse trajectory, only its numeric vector.