Team Members: Gautam Tangirala, Jaivi Chandola, Alex Mullins, Mingyan Zhou, and Winston Cai
Faculty/Graduate Students: Prof. Kaiqing Zhang, Haoyi You, and Xiangyu Liu
I4C Teaching Assistant: Dakshita Pal
How does risk sensitivity affect the ability of AI to learn and develop its driving capabilities in different environments?
Project Overview
We aim to determine how adding risk sensitivity to models trained with reinforcement learning in different road environments affects the accuracy of those models.
Project Approach
To accomplish this, we ran two versions of each model: one using risk sensitivity and one without, which we call the standard model. We trained the models in four different environments - Highway, Intersection, Roundabout, and Parking. We used the reward curve to measure accuracy, since higher reward corresponds to higher accuracy.
Project Resources
We used Python in Google Colab to run our models, along with the Gymnasium library (created by the Farama Foundation and available on GitHub) for the different environments.
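As a rough sketch, the four scenarios can be created through Gymnasium; here we assume the highway-env package and its environment IDs, which may differ from the exact setup we used:

    import gymnasium as gym
    import highway_env  # assumed package that registers the driving scenarios with Gymnasium

    # Assumed environment IDs for the four scenarios
    scenario_ids = ["highway-v0", "intersection-v0", "roundabout-v0", "parking-v0"]
    envs = {name: gym.make(name) for name in scenario_ids}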
What are Reinforcement Learning and Risk Sensitivity?
Reinforcement learning is a type of machine learning where an agent learns to make decisions by trial and error, aiming to maximize rewards in a particular environment. There are five parts to reinforcement learning - the agent, environment, actions, rewards, and policy. The agent is what interacts with the environment; in this case, the agent is the car. Actions are moves that the agent can make that affect the state of the environment. Rewards are feedback from the environment after taking an action, and a policy is the strategy the agent uses to determine its next action based on the state of the environment.
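A minimal sketch of this agent-environment loop, assuming the Gymnasium API and a random placeholder policy in the Highway scenario:

    import gymnasium as gym
    import highway_env  # assumed package providing the driving environments

    env = gym.make("highway-v0")
    obs, info = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()  # a real policy would pick an action based on obs
        obs, reward, terminated, truncated, info = env.step(action)  # reward is the environment's feedback
        done = terminated or truncated
    env.close()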
The goal of risk sensitivity is to train models to make decisions that carry more risk but can lead to a better reward. This is achieved by using a modified reward function, e^(alpha*x), where alpha is a hyperparameter (a configuration variable) and x is the original reward. Changing the alpha hyperparameter changes how strongly rewards are scaled, which affects the accuracy of the model. With this function, the difference between a small negative reward (penalty) and a larger penalty is much smaller than the difference between a small positive reward and a larger positive reward, leading the model to pursue better rewards more aggressively.
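A minimal sketch of how this transform could be applied, written as a Gymnasium reward wrapper (the class name and alpha value are illustrative, not our exact code):

    import numpy as np
    import gymnasium as gym

    class RiskSensitiveReward(gym.RewardWrapper):
        """Replace the original reward x with e^(alpha * x)."""

        def __init__(self, env, alpha=0.5):
            super().__init__(env)
            self.alpha = alpha

        def reward(self, reward):
            # Penalties are compressed toward zero while positive rewards are amplified,
            # pushing the agent toward riskier, higher-reward behavior.
            return float(np.exp(self.alpha * reward))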
Algorithms Used
Deep Q Networks (DQN) are a form of neural network whose goal is to approximate the Q function of the environment. The Q function is the formula that, given a state of the environment and an action, predicts the expected reward of taking that action in that state. After training, the agent can take the action with the highest expected reward.
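As a sketch of how a DQN agent can be trained on the Highway scenario, assuming the stable-baselines3 library (our exact implementation and hyperparameters may differ):

    import gymnasium as gym
    import highway_env  # assumed package providing the driving environments
    from stable_baselines3 import DQN

    env = gym.make("highway-v0")  # for the risk-sensitive run, wrap env with the exponential-reward wrapper above
    model = DQN("MlpPolicy", env, verbose=1)  # neural network that approximates the Q function
    model.learn(total_timesteps=20_000)       # illustrative training budget

    obs, info = env.reset()
    action, _ = model.predict(obs, deterministic=True)  # pick the action with the highest predicted value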
PPO (Proximal Policy Optimization) belongs to the policy gradient family of reinforcement learning algorithms. It directly learns a policy (a mapping from states to actions) that maximizes cumulative rewards. PPO uses a "proximal" objective function to constrain how much the policy can change between updates, which helps stabilize training and avoid large policy updates.
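To illustrate the "proximal" idea, here is a standalone sketch of PPO's clipped surrogate loss in PyTorch (an illustration of the objective, not the training code we ran):

    import torch

    def ppo_clip_loss(ratio, advantage, clip_eps=0.2):
        # ratio: probability of the taken action under the new policy divided by the old policy
        # advantage: how much better the action was than expected
        unclipped = ratio * advantage
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
        # Taking the minimum (and negating for gradient descent) caps how much a
        # single update can change the policy, keeping each step small and stable.
        return -torch.min(unclipped, clipped).mean()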
Environment Examples
(Screenshots of the environments; the labels indicate which algorithms were run: PPO only, DQN & PPO, DQN & PPO.)
Results
Highway:
In the Highway environment, the agent has to avoid crashing into other cars while driving as fast as possible.
With the standard DQN (without risk sensitivity), the agent drives relatively slowly, which is not ideal.
With the risk-sensitive DQN, the agent clearly drives much faster; however, it often crashes because of its aggressive driving.
When the alpha hyperparameter is lower, the agent tends to be less bold and more risk-averse. This model appears to be a good balance of speed and crash avoidance.
This graph shows lower reward values than the risk-sensitive graph, indicating that the standard model does not perform as well as the risk-sensitive model.
This graph shows higher reward values than the standard (non-risk-sensitive) graph, indicating that the risk-sensitive model performed better.
Intersection:
In the Intersection environment, the agent has to go as fast as possible through an intersection where four roads converge, without hitting another car.
In this video, the car is able to navigate the intersection without crashing into another car.
In this video, the car is also able to navigate the intersection without crashing into another car.
There is not much of an increase in value throughout the graph, and it shows relatively lower reward values. This shows that the standard model does not perform as well as the risk-sensitive one.
The reward curve of the risk-sensitive model shows much higher values than that of the other model, meaning it performs better than the standard DQN.
Roundabout:
In the Roundabout environment, the agent has to go through a roundabout as fast as possible without hitting another car.
In this example, the car has successfully reached the end of the scenario.
In this example, the car failed to reach the end of the scenario. Learning from this failure, the AI trains itself to be more cautious of potentially dangerous zones.
Reward Curve:
The reward curve of the standard model is very irregular, leading to inconsistent results and lower accuracy overall.
Reward Curve:
The reward curve of the risk-sensitive model increases at a more constant rate than that of the standard model, demonstrating that the risk-sensitive model achieved higher rewards and greater accuracy.
Parking:
In the Parking environment, the agent has to park in a specific parking space as fast as possible.
This video demonstrates one of the best outputs of this model, where the car reaches the target parking space; however, it is not able to park fully.
This video demonstrates one of the best outputs of this model, where the car both reaches the target parking space and parks fully within it.
Reward Curve: The reward curve is not as steep as that of the risk-sensitive model, and the values are lower. This means the standard model did not perform as well and improved its performance at a slower rate than the risk-sensitive model.
Reward Curve: The reward curve is steeper than that of the standard (non-risk-sensitive) model, and the values are higher. This means the risk-sensitive model performed better and improved its performance at a faster rate.
By comparing the results of the risk-sensitive and non-risk-sensitive models, we can see that using risk sensitivity provides tangible benefits to the accuracy of a model trained with reinforcement learning, as long as the alpha hyperparameter is chosen with care. This is shown in the reward graphs, where the models trained with risk sensitivity performed better initially and improved at a faster rate than the non-risk-sensitive models.