466 Project
Payton Webber and Dhowa Husrevoglu
Deep reinforcement learning has shown remarkable potential across a variety of domains [1]. However, training these models requires large-scale computational resources; as a result, the field is dominated by large organizations with well-funded infrastructures, which stifles open innovation and collaboration. We propose addressing this bottleneck by creating a distributed, community-driven platform that pools the computational capabilities of individual volunteers.
Numerous projects and initiatives highlight the feasibility and advantages of distributed, collaborative model training. A compelling example of a successful open, distributed reinforcement learning effort is Leela Zero, an open-source implementation based on DeepMind’s AlphaGo Zero [2]. Instead of using a single compute cluster to train the model, Leela Zero utilized data generated by many contributors who lent their computational resources in a distributed manner, resulting in a model comparable to AlphaGo Zero [3]. For additional context on reinforcement learning and the AlphaZero framework, please refer to the Appendix: Background on Reinforcement Learning and AlphaZero.
Our focus will be on reinforcement learning algorithms that use a replay buffer, since this separates learning from acting and allows many actors to generate data in parallel. Specifically, we will limit the scope of this project to the AlphaZero algorithm [4], in which self-play data is stored in a replay buffer for future training. We will experiment with distributed reinforcement learning by training an AlphaZero model on the game Othello, which involves setting up a communication network between multiple actors and a centralized learner. Our primary goal is to improve learning efficiency by distributing the computationally intensive task of generating self-play games; as a secondary goal, we will analyze bottlenecks in the network and develop fixes.
One anticipated challenge is distributing updated model weights to the actors once the learner has finished a training iteration and signals them to start generating new self-play data. To address this, we plan to use a hierarchical structure to propagate the updated weights. To ensure reliability, we will introduce basic fault-tolerance strategies that handle node variability and guard against partial data corruption, and we aim to implement secure communication protocols to prevent malicious or invalid data from polluting the replay buffer. As a measure of success, we will track the speed-up factor relative to a single-machine baseline and record the number of self-play games generated over time.
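To make the actor–learner data path concrete, the following is a minimal sketch in Python of how finished self-play games could be shipped to a central replay buffer. The length-prefixed JSON-over-TCP framing, and names such as ReplayBuffer, push_games, and actor_send, are illustrative assumptions for this proposal rather than part of our existing implementation.

    # Illustrative sketch only: a central replay buffer plus the actor-side send path.
    # The JSON-over-TCP framing and all names here are hypothetical, not our actual code.
    import json
    import random
    import socket
    from collections import deque

    class ReplayBuffer:
        """Fixed-capacity store of self-play positions that the learner samples from."""
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)

        def push_games(self, positions):
            # Each position is a (board_state, mcts_policy, game_outcome) record.
            self.buffer.extend(positions)

        def sample(self, batch_size=256):
            return random.sample(self.buffer, batch_size)

    def actor_send(positions, host="learner.example.org", port=5000):
        """Actor side: serialize finished games and ship them to the learner."""
        payload = json.dumps(positions).encode()
        with socket.create_connection((host, port)) as sock:
            sock.sendall(len(payload).to_bytes(8, "big"))  # length-prefixed frame
            sock.sendall(payload)

The same framing could in principle carry serialized network weights in the opposite direction, from the learner down through intermediate relay nodes to the actors, which is the hierarchical distribution scheme mentioned above.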
Feb 8–Feb 14 - Decouple the actor from the learner in the current implementation.
Feb 15–Feb 21 - Implement actor clients and design the network structure.
Feb 22–Feb 28 - Conduct a trial run to identify shortcomings and bottlenecks, then produce a brief report.
Feb 29–Mar 7 - Present a model progress report; implement fixes and refine solutions based on the trial run.
Mar 8–Apr 4 - Finalize results, apply any remaining fixes, and compile overall reflections.
AlphaZero implementation - https://github.com/PaytonWebber/alphazero
Personal VPN testbed - https://www.wireguard.com/
PlanetLab testbed - https://planetlab.cs.princeton.edu/
[1] K. Arulkumaran, M. P. Deisenroth, M. Brundage and A. A. Bharath, "Deep Reinforcement Learning: A Brief Survey," in IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 26-38, Nov. 2017, doi: 10.1109/MSP.2017.2743240.
[2] Leela Zero, GitHub repository. [Online]. Available: https://github.com/leela-zero/leela-zero
[3] Silver, D., Schrittwieser, J., Simonyan, K. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017). https://doi.org/10.1038/nature24270
[4] D. Silver et al., “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm,” arXiv preprint arXiv:1712.01815, 2017. [Online]. Available: https://arxiv.org/abs/1712.01815
[5] C. B. Browne et al., “A Survey of Monte Carlo Tree Search Methods,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1–43, Mar. 2012, doi: 10.1109/TCIAIG.2012.2186810.
[6] R. S. Sutton, “The bitter lesson,” Incomplete Ideas, Mar. 13, 2019. [Online]. Available: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
Appendix: Background on Reinforcement Learning and AlphaZero
Reinforcement learning (RL) is a branch of machine learning in which an agent learns to make optimal decisions through trial and error in an interactive environment. Instead of relying on labelled datasets (as in supervised learning) or discovering hidden patterns in unlabelled data (as in unsupervised learning), an RL agent explores possible actions in an environment and receives feedback (rewards or penalties) based on those actions.
RL algorithms are typically categorized according to the following approaches:
Value-Based: The agent focuses on estimating a value function, such as the expected future reward for being in a certain state and taking a specific action (a short example follows this list).
Policy-Based: The agent directly learns a policy function, mapping states to actions without explicitly estimating value functions.
Model-Based: The agent learns a model of the environment, enabling it to predict future states and plan its actions accordingly.
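As a concrete illustration of the value-based approach, the sketch below runs one episode of tabular Q-learning; the env object (with reset, step, and actions methods) is an assumed toy interface and is unrelated to our project code.

    # Minimal value-based example: tabular Q-learning over one episode.
    # The env interface (reset/step/actions) is assumed for illustration only.
    import random
    from collections import defaultdict

    def q_learning_episode(env, Q=None, alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = Q if Q is not None else defaultdict(float)
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy: explore occasionally, otherwise act greedily on Q.
            if random.random() < epsilon:
                action = random.choice(env.actions(state))
            else:
                action = max(env.actions(state), key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Target = immediate reward + discounted value of the best next action.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions(next_state))
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
        return Q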
AlphaZero, developed by DeepMind, is a novel RL algorithm capable of achieving superhuman performance in perfect-information settings. Its key innovation is the combination of a deep neural network with a Monte Carlo tree search [5] for both training and inference. Instead of relying on a static, human-annotated dataset, AlphaZero generates its own training data through self-play, where the AI competes against itself. This approach eliminates the need for human expertise, which can limit the agent’s potential to learn [6], but it also demands substantial computational resources.
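The self-play loop at the heart of AlphaZero can be summarized in a few lines. The sketch below is a simplified outline under assumed, hypothetical mcts_search, network, and game-state interfaces; it omits details such as temperature scheduling, Dirichlet noise, and board-symmetry augmentation.

    # Simplified outline of one AlphaZero-style self-play game (interfaces assumed).
    import random

    def self_play_game(network, mcts_search, initial_state):
        history, state = [], initial_state
        while not state.is_terminal():
            # MCTS uses the network's policy/value estimates to build visit counts.
            visit_counts = mcts_search(state, network)
            total = sum(visit_counts.values())
            policy = {move: count / total for move, count in visit_counts.items()}
            history.append((state, state.to_play(), policy))
            moves, weights = zip(*policy.items())
            state = state.apply(random.choices(moves, weights=weights)[0])
        outcome = state.result()  # assumed: +1/0/-1 from the first player's perspective
        # Label every stored position with the final result from the player-to-move's view.
        return [(s, p, outcome if player == 0 else -outcome) for s, player, p in history]

The resulting (state, policy, outcome) tuples are precisely the records that the actors would push into the shared replay buffer sketched earlier.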