Population-Based Training aims to jointly optimize reward sets and hyperparameters to maximize performance. A set of learning processes is run concurrently, with differing reward values. Over time, underperforming agents are eliminated and replaced with modified (mutated) copies of top performers.
In this approach, a setting similar to this DeepMind paper is used: Population-Based Training optimizes a custom-designed reward set.
This experiment was developed in parallel with the Single-Agent Team's work and shares some of its design assumptions and ideas. It was technically independent of their contributions and is thus presented separately.
The League aims to improve agent performance, eliminate catastrophic forgetting, and keep a stable pace of learning. It is a general, distributed system for coordinating concurrent trainings based on mutual self-play. It tracks players, stores past results, determines the odds of one player winning against another, and selects opponents for a given player according to a set distribution.
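The win-odds estimate can be illustrated with the standard Elo formulas. The sketch below is a minimal Python example and only an assumption about how the League's internal rating might work, not the project's actual code; the K-factor is a placeholder.

```python
def expected_win_probability(rating_a: float, rating_b: float) -> float:
    """Standard Elo estimate of the probability that player A beats player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_rating(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> float:
    """Update player A's rating after a reported game; score_a is 1.0 for a win,
    0.5 for a draw and 0.0 for a loss. The K-factor of 32 is an assumption."""
    return rating_a + k * (score_a - expected_win_probability(rating_a, rating_b))
```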
In this work, the main state representation is Floats.
It is a numerical, non-pictorial representation that encodes the positions of players, the ball, and other important information about the game as floating-point numbers.
It was initially considered inefficient and inferior to picture-based representations such as Minimap.
Rewards are given for performing certain actions, and penalties are given when the opponent performs them. The magnitude of each reward is drawn randomly and remains constant until the agent is eliminated. Examples (a minimal sketch follows the list):
Agent makes a successful pass
Agent owns the ball
Agent gets a yellow card
Opponent gets a red card
Agent scores
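As a minimal sketch of such a reward set (the event names and sampling range below are illustrative assumptions, not the project's actual configuration):

```python
import random

# Illustrative shaped-reward events, based on the examples above.
REWARD_EVENTS = [
    "own_successful_pass",
    "own_ball_possession",
    "own_yellow_card",
    "opponent_red_card",
    "own_goal_scored",
]


def sample_reward_set(max_magnitude: float = 1.0) -> dict:
    """Draw a random magnitude for every shaped reward; the drawn values stay
    fixed until the agent is eliminated and replaced."""
    return {event: random.uniform(0.0, max_magnitude) for event in REWARD_EVENTS}
```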
Selection is based on both the internal Elo rating and the backlog of past games
The selection function prioritizes similarly performing agents as opponents
An agent is eliminated if its estimated probability of losing to any other agent exceeds 0.9
A weaker agent, together with its reward set, is replaced by a better-performing one; approximately 20% of reward magnitudes are mutated by up to 20% in either direction
Copies of the agent itself and the built-in bot are selected as opponents with fixed probabilities (these rules are sketched below)
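The sketch below illustrates these rules under simplifying assumptions: players are plain dictionaries with an `elo` field, and the fixed self-copy and bot probabilities are placeholders. It is not the project's actual implementation.

```python
import random

LOSS_THRESHOLD = 0.9     # elimination threshold from the rules above
MUTATION_FRACTION = 0.2  # roughly 20% of reward magnitudes are mutated...
MUTATION_SCALE = 0.2     # ...by up to 20% in either direction


def expected_win(elo_a: float, elo_b: float) -> float:
    """Same Elo win-probability formula as in the earlier sketch."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


def select_opponent(agent: dict, league: list, self_prob: float = 0.1,
                    bot_prob: float = 0.1) -> dict:
    """With fixed probabilities pick a copy of the agent itself or the built-in
    bot; otherwise pick the most similarly rated League player."""
    r = random.random()
    if r < self_prob:
        return agent
    if r < self_prob + bot_prob:
        return {"name": "built_in_bot"}
    others = [p for p in league if p is not agent]
    return min(others, key=lambda p: abs(p["elo"] - agent["elo"]))


def should_eliminate(agent: dict, league: list) -> bool:
    """Eliminate an agent whose estimated chance of losing to any other
    League player exceeds the 0.9 threshold."""
    return any(1.0 - expected_win(agent["elo"], p["elo"]) > LOSS_THRESHOLD
               for p in league if p is not agent)


def mutate_reward_set(reward_set: dict) -> dict:
    """Copy a stronger agent's reward set, perturbing about 20% of the
    magnitudes by up to 20% in either direction."""
    return {event: value * (1.0 + random.uniform(-MUTATION_SCALE, MUTATION_SCALE))
            if random.random() < MUTATION_FRACTION else value
            for event, value in reward_set.items()}
```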
Population-Based Training is designed to introduce diversity, leading to improved performance over baselines. The goal was to demonstrate this improvement.
There were 15 concurrent learners. The network was a simple 5x256 MLP.
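A minimal sketch of such a network in TensorFlow/Keras, assuming five hidden layers of 256 units over the flat Floats observation, a head of action logits, and the default 19-action set of Google Research Football; the exact activations, heads, and framework details are assumptions:

```python
import tensorflow as tf


def build_policy_network(num_actions: int = 19, width: int = 256, depth: int = 5):
    """Sketch of the '5x256 MLP': five hidden Dense layers of 256 units taking
    the Floats vector as input and producing logits over the discrete actions."""
    layers = [tf.keras.layers.Dense(width, activation="relu") for _ in range(depth)]
    layers.append(tf.keras.layers.Dense(num_actions))  # policy logits
    return tf.keras.Sequential(layers)
```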
Agents were initialized with a baseline player, which was trained in simple settings involving the built-in bot and simplified self-play.
Agents showed a notable performance gain over the baseline methods against benchmark opponents, but the improvement still fell short of a "wow" effect. An interesting correlation was found: agents were more likely to win if their penalty for conceding goals was slightly larger than their reward for scoring.
Below is a plot of the ratio of scoring reward to conceding penalty against the internal Elo rating.
A typical strategy was based on long-distance runs towards the opponents' goal, as seen in this Leaderboard video. The small number of passes per game provides further evidence of this observation.
Evidence of exploration was also found. For instance, one agent, colored green in the figure on the left, attempted many sliding tackles and fouls: this was very effective at preventing the opponent from scoring, but the agent failed to score itself. In another example, an agent stopped scoring and passing, and was eliminated despite conceding fewer goals than its siblings. Ultimately, defense-focused agents were eliminated from the League.
Reward-shaping-based resets are clearly visible in the figure, as are the exploration and eventual convergence of the PBT run. Adaptive reward shaping initially introduced diversity, which was later eliminated. This led to insufficient diversity, with all agents similar to one another at the end of training, as measured by action frequency.
Motivation: (1) create an agent that performs differently from the agent trained in Experiment 1, and (2) check whether agents learning from scratch with PBT can be effective as well.
Agents were trained from scratch. There were 20 concurrently trained agents.
A custom network architecture was used, combining the numerical conciseness of Floats with the spatial structure of Minimap. Far fewer training steps were performed here than in Experiment 1 due to the notably longer inference time.
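One plausible way to combine the two representations is sketched below: a small convolutional trunk over Minimap-style spatial planes concatenated with a dense trunk over the Floats vector. The input shapes (a 72x96x4 spatial stack and a 115-dimensional float vector) and all layer sizes are assumptions for illustration, not the architecture actually used.

```python
import tensorflow as tf


def build_hybrid_network(minimap_shape=(72, 96, 4), floats_size=115,
                         num_actions=19):
    """Hypothetical mix of Minimap (spatial) and Floats (numerical) inputs."""
    minimap = tf.keras.Input(shape=minimap_shape, name="minimap")
    floats = tf.keras.Input(shape=(floats_size,), name="floats")

    # Small convolutional trunk over the spatial planes.
    x = tf.keras.layers.Conv2D(16, 3, strides=2, activation="relu")(minimap)
    x = tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu")(x)
    x = tf.keras.layers.Flatten()(x)

    # Dense trunk over the numerical features.
    y = tf.keras.layers.Dense(256, activation="relu")(floats)

    # Merge both branches and produce action logits.
    z = tf.keras.layers.Concatenate()([x, y])
    z = tf.keras.layers.Dense(256, activation="relu")(z)
    logits = tf.keras.layers.Dense(num_actions)(z)
    return tf.keras.Model(inputs=[minimap, floats], outputs=logits)
```

A convolutional branch of this kind is considerably more expensive to evaluate than a pure MLP, which is consistent with the longer inference time mentioned above.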
The goal of introducing a diverse agent was reached, although performance against benchmark bots was notably lower. The agent was weaker than the Experiment 1 agents due to the limited step count.
Sliding tackles were ordered frequently, leading to many yellow cards. These tackles were not always reasonable, as seen in this video. The number of goals scored per match increased over time and plateaued below 7. Agents performed very few passes, but once the number of goals plateaued, they began to order short and high passes.
The plot on the left hints at potential later developments. The steady increase in goals scored and conceded suggests that agents learned offensive strategies with decreasing regard for defense. The slowly decreasing number of sliding-tackle directives could lead to players tackling more sensibly, whereas a sudden surge in short-pass directives indicates when agents learned to pass. Players' statistics for fouls and pass directives remained relatively dissimilar at the end of training, suggesting a notable degree of diversity. Given many more millions of steps, the agents would likely have kept improving without converging to very similar models.
Motivation: combine agents trained in the previously described experiments to improve performance.
The pink plot represents one of the two Additional Learners, and the remaining plots represent selected Main Agents. The curve for the Additional Learner is significantly shorter because its inference was much slower.
This training can be considered an enhanced continuation of Experiment 1; thus, all 15 agents were copied from there, together with their reward sets.
There were also non-mutable Snapshots: the built-in bot and two pretrained networks. Snapshots are similar to players, but they do not learn, cannot be evicted, and cannot have their reward sets replaced with another player's.
Additional Learners comprised the two best agents from Experiment 2. Thanks to their different behavioural repertoire, they posed a challenge to the Main Agents. Like Snapshots, Additional Learners cannot be evicted or have their rewards altered, but they do learn during training. Additional Learners had much slower inference and learned for fewer steps.
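The differences between the three kinds of League participants in this experiment can be summarised as follows; the type, field, and constant names below are illustrative, not the project's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlayerRole:
    learns: bool           # keeps training during the run
    evictable: bool        # can be eliminated and replaced by the League
    rewards_mutable: bool  # its reward set can be overwritten on replacement


MAIN_AGENT         = PlayerRole(learns=True,  evictable=True,  rewards_mutable=True)
SNAPSHOT           = PlayerRole(learns=False, evictable=False, rewards_mutable=False)
ADDITIONAL_LEARNER = PlayerRole(learns=True,  evictable=False, rewards_mutable=False)
```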
A performance improvement over both the Experiment 1 and Experiment 2 agents was noted. Main Agents (initially copied from Experiment 1) remained significantly better than the Additional Learners, which failed to match their performance but nevertheless improved.
The behaviours of the Main Agents were similar to those of their predecessors from Experiment 1. Most plots suggest moderate-speed, non-volatile improvement. The Additional Learners were notably more volatile, which can be seen, for instance, in their passing, where short passes were replaced by long passes, or in a sudden slump in sliding tackles. This suggests that these agents still have a lot of learning potential left.
Ensuring that agents are diverse is essential to benefit from a population-based approach. The performance improvement between Experiments 1 and 3 might be due to the increased diversity of opponents.
In general, agents trained with fewer steps perform worse than agents trained with more. The performance of the agents from Experiment 2, and of their follow-ups in Experiment 3, was thus worse.
Reward shaping can, and often should, be performed adaptively. Population-Based Training is a way of optimizing over a schedule of reward sets rather than a single set, which increases the diversity of behaviours explored during training.
A robust system was necessary to run Population-Based Training. Key components of the distributed system included:
Custom wrappers around Google Research Football to support the League and to generate baseline agents.
A customized version of Google's SEED RL framework - separate instances were run on Google Cloud for each player in the League.
League Database - stores all relevant information about agents in each League. Uses Google Cloud Firestore (as it is scalable and supports transactions).
League Server - the entry point to all League services, including opponent selection, score reporting, and eliminations. Written in Flask, it uses App Engine.
The League is understood here as an orchestration of concurrently learning agents interacting within one experimental run.
The League Database and League Server were both designed from scratch for this project. The League Database stores the players' locations, reward records, and Elo ratings. It also tracks head-to-head statistics and provides the data the Server uses to select similarly levelled opponents and to eliminate failing players. The League Server was designed to handle significant traffic and, thanks to App Engine's scalability, has a considerably larger capacity than this training required.
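To make the setup concrete, the sketch below shows what one League Server endpoint might look like, using Flask and the Google Cloud Firestore client; the route, collection, and field names are hypothetical and do not reflect the project's actual interface.

```python
from flask import Flask, jsonify, request
from google.cloud import firestore

app = Flask(__name__)
db = firestore.Client()  # backed by Google Cloud Firestore, as described above


@app.route("/select_opponent")
def select_opponent():
    """Return the League player whose Elo rating is closest to the caller's."""
    player_id = request.args["player"]
    me = db.collection("players").document(player_id).get().to_dict()
    candidates = [doc.to_dict() for doc in db.collection("players").stream()
                  if doc.id != player_id]
    opponent = min(candidates, key=lambda p: abs(p["elo"] - me["elo"]))
    return jsonify({"opponent": opponent["name"], "elo": opponent["elo"]})
```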
Franciszek Budrowski - a 2020 graduate of the University of Warsaw and a current Master's student at Cambridge. His scientific interests focus on reinforcement learning and its applications in robotics and healthcare. In his free time, he takes part in long-distance runs and plays the piano. This page describes his Bachelor's thesis research, undertaken at the University of Warsaw.