Automatic Goal Generation for Reinforcement Learning Agents
Reinforcement learning (RL) is a powerful technique to train an agent to perform a task; however, an agent that is trained using RL is only capable of achieving the single task that is specified via its reward function. Such an approach does not scale well to settings in which an agent needs to perform a diverse set of tasks, such as navigating to varying positions in a room or moving objects to varying locations. Instead, we propose a method that allows an agent to automatically discover the range of tasks that it is capable of performing in its environment. We use a generator network to propose tasks for the agent to try to accomplish, each task being specified as reaching a certain parametrized subset of the state-space. The generator network is optimized using adversarial training to produce tasks that are always at the appropriate level of difficulty for the agent, thus automatically producing a curriculum. We show that, by using this framework, an agent can efficiently and automatically learn to perform a wide set of tasks without requiring any prior knowledge of its environment, even when only sparse rewards are available.
We compare our Goal GAN method against three baselines. Uniform Sampling (baseline 3) does not use a curriculum at all, training at every iteration on goals sampled uniformly from the state-space. Many of these goals are infeasible or too difficult for the agent's current performance, so no reward is received, making this method very sample-inefficient. To demonstrate that a straightforward distance reward can be prone to local minima, Uniform Sampling with L2 loss (baseline 2) samples goals in the same fashion, but instead of the indicator reward that our method uses, it receives the negative L2 distance to the goal as a reward at every step. On-policy GAN (baseline 1) samples goals from a GAN that is constantly retrained on the state-visitation distribution. Finally, we also show the performance of Rejection Sampling (oracle), where we sample goals uniformly from the feasible state-space and keep only the "Good" goals. This method is orders of magnitude more expensive in terms of labeling, but serves to estimate an upper bound on our method's performance.
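The two reward functions can be sketched as follows; the success tolerance `eps` is a placeholder value, not the one used in our experiments:

```python
import numpy as np

def indicator_reward(state, goal, eps=0.1):
    """Sparse indicator reward used by our method: 1 only when the
    agent is within eps of the goal, 0 everywhere else."""
    return 1.0 if np.linalg.norm(state - goal) < eps else 0.0

def l2_reward(state, goal):
    """Dense reward of the 'Uniform Sampling with L2 loss' baseline:
    negative Euclidean distance to the goal at every step."""
    return -float(np.linalg.norm(state - goal))
```

The L2 reward always pulls the agent straight toward the goal, which in a maze can drive it into the wall separating it from the goal; the indicator reward has no such bias, but provides no signal at all until a goal is actually reached.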
Ant in U-Maze
In this environment the challenging "Ant" robot (13 rigid links, 8 actuators, 125-dimensional observations) has to learn how to move its Center of Mass to all possible points in the maze. Therefore, our method generates 2D goals in the x-y plane. We do not assume knowledge of the shape of the maze.
The following figures give a visualization of the policy performance for different parts of the state space. For illustration purposes, the feasible state-space (i.e. the space within the maze) is divided into a grid, and a goal location is selected from the center of each grid cell. Each grid cell is colored according to the expected return achieved on this goal: Red indicates 100% success; dark blue indicates 0% success.
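The evaluation behind this visualization can be sketched as follows; `eval_policy` is a hypothetical helper that rolls out the trained policy toward a goal and reports success:

```python
import numpy as np

def success_grid(eval_policy, xs, ys, n_rollouts=10):
    """Evaluate the policy on a goal at the center of each grid cell and
    record the fraction of successful rollouts: the value used to color
    the cell (1.0 -> red, 0.0 -> dark blue)."""
    grid = np.zeros((len(ys), len(xs)))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            goal = np.array([x, y])
            grid[i, j] = np.mean([eval_policy(goal) for _ in range(n_rollouts)])
    return grid

# Toy stand-in policy that only succeeds on goals near the origin:
toy_policy = lambda g: bool(np.linalg.norm(g) < 1.0)
grid = success_grid(toy_policy, xs=np.linspace(-2, 2, 5), ys=np.linspace(-2, 2, 5))
```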
The next figures show the goals sampled by the Goal GAN for the same policy training as in the above figures. Low-reward goals are those with return below R1; high-reward goals are those with return above R2; Good goals are those with return between the two bounds.
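A minimal sketch of this labeling; the bound values used here are placeholders, not the R1 and R2 of our experiments:

```python
def label_goal(avg_return, r_min=0.1, r_max=0.9):
    """Label a goal by the policy's average return on it.
    r_min and r_max stand in for the bounds R1 and R2."""
    if avg_return < r_min:
        return "low"    # too hard for the current policy
    if avg_return > r_max:
        return "high"   # already mastered
    return "good"       # intermediate difficulty: used to train the GAN
```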
Point-mass in Spiral Maze
This environment has a more complex maze, but a simpler robot: a point-mass with 2 actuators and 4-dimensional observations. We see that the oracle performs better, and that all the other baselines do noticeably worse than our method.
Specifically, we test two previous methods in this environment: SAGG-RIAC [1] and Asymmetric Self-Play [2]. We find that our method outperforms both. Below, we describe our implementation of these methods in further detail and analyze why they do not perform as well on this task.
Comparison with SAGG-RIAC
To have a fair comparison with the other methods in our work, the "Low-Level Goal-Directed Exploration with Evolving Context" of SAGG-RIAC uses the same state-of-the-art RL algorithm (Trust Region Policy Optimization) that we use in our method and the rest of the baselines. We therefore implement the method in batch mode: at every iteration we sample N new goals, collect rollouts of at most t_max steps trying to reach them, and perform an optimization step on the policy parameters using all of the collected data.
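This batch-mode loop can be sketched as follows; `sample_goals`, `collect_rollout`, and the commented-out `trpo_update` are hypothetical stand-ins for the actual interest-based sampler, the simulator rollout, and the TRPO step:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_goals(n, low=-5.0, high=5.0):
    """Stand-in for SAGG-RIAC's interest-based goal sampler."""
    return rng.uniform(low, high, size=(n, 2))

def collect_rollout(goal, t_max):
    """Stand-in for executing the stochastic policy toward `goal`
    for at most t_max steps."""
    return {"goal": goal, "states": np.zeros((t_max, 2))}

def batch_iteration(n_goals, t_max):
    """One batch-mode iteration: sample N goals, collect a rollout
    toward each, then take one TRPO step on all collected data."""
    goals = sample_goals(n_goals)
    paths = [collect_rollout(g, t_max) for g in goals]
    # trpo_update(policy, paths)  # policy-optimization step, not shown
    return paths

paths = batch_iteration(n_goals=4, t_max=50)
```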
As hyperparameters, we have used the ones recommended in the SAGG-RIAC paper [1], where available: p1 = 0.7, p2 = 0.2, p3 = 0.1. For the rest of the hyperparameters, hyperparameter sweeps yield the best performance with ζ = 100 and g_max = 250. The noise for mode(3) is chosen to be Gaussian with variance 0.1 (the same value as the tolerance threshold for success and the competence threshold). All of the TRPO hyperparameters, including batch_size = 20000, are the same as those described in our paper for the other methods.
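The mode-selection probabilities and the mode(3) perturbation can be sketched as follows (the function names are illustrative, not from the SAGG-RIAC implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mode(p1=0.7, p2=0.2, p3=0.1):
    """Draw one of SAGG-RIAC's three goal-sampling modes with the
    probabilities p1, p2, p3 used in our experiments."""
    return int(rng.choice([1, 2, 3], p=[p1, p2, p3]))

def mode3_goal(previous_goal, noise_var=0.1):
    """mode(3): perturb a previously attempted goal with Gaussian
    noise of variance 0.1."""
    g = np.asarray(previous_goal, dtype=float)
    return g + rng.normal(0.0, np.sqrt(noise_var), size=g.shape)
```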
Finally, in our tasks there are no constraints to penalize, so ρ = ∅. There are also no sub-goals. The reset value r is 1, as we reset to s_start after every reaching attempt. The number of explorative movements q ∈ N has a less clear equivalence, as we use a policy-gradient update with a stochastic policy π_θi rather than an SSA-type algorithm.
Generally, the approach of SAGG-RIAC [1] performs well. It reaches higher rewards than our other baselines, but has slightly worse performance than our method, as seen in the above figure. Although SAGG-RIAC has no inner iterations, for plotting purposes we have grouped the TRPO updates in sequences of five so that the x-axis of the plot reflects the same number of samples for all methods.
It is remarkable how well SAGG-RIAC carves out the wall region (where no goal can ever be reached), sampling fewer goals from these areas. This can be seen in the below figure, where many goals are sampled near the start position of the robot, on the inside of the spiral.
The samples still obtained in the walls (or, more generally, in distant areas that the current policy cannot successfully reach) are due to two factors. First, sampling mode(2) selects goals uniformly from the whole goal space, which includes the walls. Second, the proposed competence measure takes into account how much closer to the goal the agent ends up after t_max steps, so even infeasible goals can receive a high competence under this measure. Indeed, if the policy learns to get closer to a goal inside a wall (although it will never reach it), the competence on that goal increases. This is why in the below figure some further regions end up with high interest. See below for further analysis of how we can modify the competence measure to alleviate this issue.
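The issue can be illustrated with a simplified progress-based competence; the exact normalization in SAGG-RIAC differs, so treat this form as an illustrative assumption:

```python
def progress_competence(start_dist, end_dist):
    """Simplified progress-based competence: the fraction of the
    initial distance to the goal that the rollout closed."""
    return (start_dist - end_dist) / start_dist

# A goal inside a wall is never reached, yet its competence keeps
# rising as the policy learns to walk right up to the wall:
early = progress_competence(4.0, 3.5)  # barely approaches the wall
late = progress_competence(4.0, 0.5)   # ends up next to the wall
```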
One limitation is that regions are sampled proportional to their interest, regardless of their size. Thus, if one part of the goal-space has many small regions, this area will be sampled more frequently than a single large region of similar interest. The small regions will then get more samples and be split further, exacerbating the problem. This can be seen in the above figure, where the area on the inside of the spiral has a very large number of regions, and the number of regions in that area continues to grow over time.
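This size-independent sampling can be sketched as follows (the setup with four small regions and one large region is an illustrative toy example):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_region(interests):
    """Pick a region with probability proportional to its interest,
    independent of the region's size, as in SAGG-RIAC."""
    p = np.asarray(interests, dtype=float)
    return int(rng.choice(len(p), p=p / p.sum()))

# Four small regions (0-3), each as interesting as one large region (4):
# together the small area receives ~4x the samples, so it keeps getting
# split into ever-smaller regions.
interests = [1.0, 1.0, 1.0, 1.0, 1.0]
counts = np.bincount([sample_region(interests) for _ in range(5000)],
                     minlength=5)
```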
We have also tested SAGG-RIAC on our N-dimensional point-mass task. Note that in this case the goal space Y must be N-dimensional: no simplification can be made to reduce the dimensionality of Y below that of the full state-space S, because each goal is defined to be a position in N-dimensional space. Our preliminary experiments show that SAGG-RIAC performs very well in the 2-dimensional case but cannot learn anything from dimension 5 onwards (whereas our method still reaches performance of 80% or above, as reported in our paper). This is probably because, in higher dimensions, the region splitting is less efficient at concentrating on the right areas and then expanding until it reaches uniform coverage.
Here we study modifying the competence measure to be -1 if the goal is not reached and 0 if it is. In this case, all infeasible goals (which are never reached) always have a competence of -1, and hence regions containing only wall area (or areas too far for the current policy) have zero interest. This can be seen in the below figures. Here we observed slightly better performance when decreasing g_max to 100.
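A minimal sketch of this modified measure:

```python
def binary_competence(reached):
    """Modified competence measure: 0 if the goal was reached within
    the tolerance, -1 otherwise."""
    return 0.0 if reached else -1.0

# Every attempt on an infeasible (wall) goal scores -1, so a region
# containing only wall area shows no competence change over time and
# therefore gets zero interest.
wall_scores = [binary_competence(False) for _ in range(10)]
```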
Although the wall areas are no longer sampled (except in mode(2)), the overall performance on the full maze is still below 0.35 after 300 iterations. One possible explanation is that, as noted above, regions are sampled proportionally to their interest regardless of their size, so a part of the goal-space with many small regions will be sampled more frequently than a single large region of similar interest. This appears to create a local optimum in which many samples are drawn in the area inside the spiral, close to the robot's starting position.
Comparison with Asymmetric Self-Play
We have also reimplemented Asymmetric Self-Play [2] and tested it on "Point-mass in Spiral Maze", as shown above. We find that our approach outperforms this method as well. Asymmetric Self-Play trains a separate policy to find new "slightly harder" goals; however, as reported in their paper, such a goal-generating policy may only generate goals in a small region of the goal-space, and it has difficulty quickly covering the full set of goals, as our tasks require. Further, we have observed that inevitable differences in the improvement rates of the goal-generating and goal-learning agents lead to instabilities and local optima.
[1] Baranes, Adrien, and Pierre-Yves Oudeyer. "Active learning of inverse models with intrinsically motivated goal exploration in robots." Robotics and Autonomous Systems 61.1 (2013): 49-73.
[2] Sukhbaatar, Sainbayar, et al. "Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play." arXiv preprint arXiv:1703.05407 (2017).