Atari Reinforcement Learning Exploration
By: Grant Huie, Sungmin Kim, and Caleb Huck
Introduction
What is the problem with gaming? The gaming industry is rapidly evolving, and there is a growing call for smarter AI to improve the gaming experience. The problem is that most games today do not use modern machine learning algorithms. Why is this important? Machine learning has many applications in games that can drastically improve immersion. Perhaps the most notable first impression of how machine learning can impact games is through NPCs (non-player characters). Machine learning can drive the behavior of these NPCs so that they appear as human as possible, by having models learn from human play data and then test what they have learned. Better algorithms also make NPC interactions with the player feel more lifelike.
Using machine learning, and reinforcement learning in particular, can also greatly increase the player's immersion, among other benefits. It can create more challenging and engaging gameplay, with NPCs that move more precisely and make better decisions. It can adapt the experience to a player's decisions, behaviors, and skill, providing a personalized experience that makes the game more engaging and enjoyable. It can help optimize and balance game mechanics so the game stays fair and fun. It can train AI agents that play the game at a highly skilled level, which is useful for producing needed data, for game testing, and for other areas. Finally, it can reduce the production time and cost typically needed to make games, since machine learning can automate certain tasks.
We believe these enhancements can make the video game experience much more enjoyable, so we decided to take our own shot at it. We used modified DQN and PPO models to analyze how well machine learning can perform in Atari-based games and, using what those models learned, examined whether we could also build conventional AI/computer-player algorithms off of that performance, giving improved but still cost-efficient algorithms for those aspects of a game.
Results
First is some analysis of the DQN's performance in the Assault environment. The blue line represents the DQN algorithm, and the orange line represents the random algorithm. The X-axis shows the episode number, and the Y-axis shows the total score the algorithm earned in that episode. As shown, the DQN algorithm's average score increases dramatically over time, whereas the random algorithm averages a score of 250.
This graph shows the number of times each individual action is pressed within each episode. Over time, the algorithm develops a strong preference for Action 4, which in the Assault environment corresponds to moving right, followed by moving up (which is effectively a NOOP in this game), and so on.
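For reference, below is a minimal sketch of how a rollout like the random baseline in these graphs could be produced, recording both the per-episode score and how many times each action was pressed. The environment id ("ALE/Assault-v5") and the gymnasium/ale-py setup are assumptions and may differ from the exact packages used in this project.

```python
from collections import Counter

import gymnasium as gym
import ale_py  # noqa: F401  (importing ale_py registers the ALE/... environments)

env = gym.make("ALE/Assault-v5")

episode_scores = []
for episode in range(5):
    obs, info = env.reset(seed=episode)
    score, action_counts, done = 0.0, Counter(), False
    while not done:
        action = env.action_space.sample()  # uniformly random policy
        obs, reward, terminated, truncated, info = env.step(action)
        score += reward
        action_counts[int(action)] += 1
        done = terminated or truncated
    episode_scores.append(score)
    print(f"episode {episode}: score={score}, action counts={dict(action_counts)}")
env.close()
```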
Best Episode Actions
The recorded actions correspond to the actions in the array below, in the order shown; these are the valid actions in the Atari Assault environment's action space.
[NOOP, FIRE, UP, RIGHT, LEFT, RIGHTFIRE, LEFTFIRE]
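This list can also be read directly from the environment itself; a quick check, assuming gymnasium with ale-py installed:

```python
import gymnasium as gym
import ale_py  # noqa: F401  (importing ale_py registers the ALE/... environments)

env = gym.make("ALE/Assault-v5")
# get_action_meanings() comes from the underlying ALE environment.
print(env.unwrapped.get_action_meanings())
# Expected: ['NOOP', 'FIRE', 'UP', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']
env.close()
```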
Using the action ratios taken from the single best-performing episode in the DQN runs, we could potentially create a new random algorithm that performs better than a uniformly random one.
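Below is a minimal sketch of that idea, assuming gymnasium/ale-py; the per-action counts are placeholder values standing in for the tallies actually recorded from the best DQN episode.

```python
import numpy as np
import gymnasium as gym
import ale_py  # noqa: F401  (importing ale_py registers the ALE/... environments)

env = gym.make("ALE/Assault-v5")

# Hypothetical per-action counts from the best DQN episode, in the order
# [NOOP, FIRE, UP, RIGHT, LEFT, RIGHTFIRE, LEFTFIRE]; substitute the counts
# actually recorded from that episode.
best_episode_counts = np.array([50, 120, 200, 400, 80, 90, 60], dtype=float)
action_probs = best_episode_counts / best_episode_counts.sum()

rng = np.random.default_rng(0)
obs, info = env.reset(seed=0)
total_score, done = 0.0, False
while not done:
    # Sample actions with the DQN-derived probabilities instead of uniformly.
    action = int(rng.choice(env.action_space.n, p=action_probs))
    obs, reward, terminated, truncated, info = env.step(action)
    total_score += reward
    done = terminated or truncated
print(f"Weighted-random episode score: {total_score}")
env.close()
```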
In this graph, blue represents the new algorithm derived from the DQN action ratios (but still random), and orange represents a truly random agent. The plot shows that the modified random algorithm consistently performs better, meaning that if it were implemented as an opponent for the player to fight against, it would on average pose a more difficult challenge than the purely random agent.
This graph shows the average performance of each algorithm over its entire run. "DQN Whole" measures the full run-through of DQN training, while "DQN Testing" shows the average once the DQN has gone through most of its training and is consistently getting high scores. The improved random algorithm scores about 100 points more than pure random on average, though this is still small compared to what the trained DQN can do.
PPO Performance on Assault and Journey Escape
Assault
Journey Escape
The returns of PPO, shown in the two graphs above, indicate that the algorithm is clearly learning, though perhaps not as effectively as DQN. Two games are presented here. On the left is Assault-v5, where the goal is to score as many points as possible; over time (the X-axis is a set of frames, the Y-axis is the return), higher returns become more common. On the right is Journey Escape, where the goal is to lose as few points as possible, with infrequent but possible cases where you gain 4,000 points by hitting a specific falling object. One thing to note is that, compared to the DQN results, this graph shows more frequent hits on those rare targets, which implies PPO is generally more accepting of risk. From these two graphs we can see that PPO provides a successful training model that is more accepting of risk, but with that comes an increased chance of loss as well, which is likely why both graphs fall slightly under the DQN results.
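For context, here is a minimal PPO training sketch for these two games, assuming Stable-Baselines3 on top of Gymnasium/ale-py; the environment ids, hyperparameters, and timestep counts are illustrative and may not match the project's actual training setup.

```python
import ale_py  # noqa: F401  (importing ale_py registers the Atari environments)
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

for env_id in ["AssaultNoFrameskip-v4", "JourneyEscapeNoFrameskip-v4"]:
    # make_atari_env applies the standard Atari preprocessing (frame skip,
    # resizing, grayscale); VecFrameStack stacks 4 frames for the CNN policy.
    env = make_atari_env(env_id, n_envs=8, seed=0)
    env = VecFrameStack(env, n_stack=4)

    model = PPO("CnnPolicy", env, verbose=1, tensorboard_log="runs/ppo")
    model.learn(total_timesteps=1_000_000)
    model.save(f"ppo_{env_id}")
    env.close()
```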
Analysis of Results Gathered from TensorBoard
Assault (DQN: Orange, PPO: Blue)
Journey Escape (DQN: Blue, PPO: Pink)
As we can see in these two diagrams, DQN and PPO behave quite differently while training on the Atari games. In both games, DQN's steps per second (the SPS category) decreases sharply over time, while PPO's stays roughly constant. This is likely because DQN's per-step cost grows once it begins sampling past experience from its replay buffer and running gradient updates on it, whereas PPO learns only from the data it is currently collecting, so its per-step cost stays essentially constant. For episodic length, the two algorithms are identical on Journey Escape because of the nature of the game, in which every episode is set to a fixed time limit. For Assault, however, the episode length for DQN increases sharply while staying relatively constant for PPO, which reflects the DQN agent learning to survive longer within each episode. The same gap appears in the episodic return: while both models improve, DQN shows a sharp advantage over PPO. Hence, between these two models, PPO can be said to be faster but with more modest improvement, while DQN, albeit costing more time, achieves a higher rate of return.
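For completeness, here is a minimal sketch of how the scalars discussed above (SPS, episodic return, episodic length) could be pulled back out of the TensorBoard event files for a numerical comparison. The log directories and tag names are assumptions; the actual tags depend on how each training script logs its metrics.

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

def scalar_values(logdir: str, tag: str):
    """Return (steps, values) for one scalar tag in a TensorBoard run directory."""
    acc = EventAccumulator(logdir)
    acc.Reload()
    events = acc.Scalars(tag)
    return [e.step for e in events], [e.value for e in events]

# Hypothetical run directories and tag names for the Assault comparison.
for run, tag in [("runs/dqn_assault", "charts/episodic_return"),
                 ("runs/ppo_assault", "charts/episodic_return")]:
    try:
        steps, values = scalar_values(run, tag)
    except KeyError:
        print(f"{run}: tag {tag!r} not found")
        continue
    if values:
        print(f"{run}: final return {values[-1]:.1f} at step {steps[-1]}")
```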
Below are links to videos of the different algorithms performing in their respective environments. These are recommended viewing, as they give a more visual sense of how each algorithm performs.
Random Assault Video
DQN Assault Video
PPO Assault Video
Random Journey Escape Video
DQN Journey Escape Video
PPO Journey Escape Video
Below you can view our Final Presentation as well, which gives a better explanation of what the DQN and PPO algorithms are and of the specific resources and packages used for this project.