LEARN
Cooperative Reinforcement Learning
Team: Lucus Wu, Olamide Falayi, Erol Ozel, Zachary Akycampong, Zoey Zhang
Faculty/Graduate Students: Kaiqing Zhang, Souradip Chakraborty, Xiangyu Liu
I4C Teaching Assistant: Christopher Song
How did we try to solve this problem?
Create Environment
Train Agent
Attempt to Increase Risk-Seeking by Changing Reward Values
Determine Conclusion
Creating an environment (in our case) consists of creating the road with the pre-programmed blue cars, which move according to a simple algorithm. In reinforcement learning, the environment is what the agent (the car) interacts with; it gives the car its reward, punishment, and new state (the options for the next step).
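To make this concrete, here is a minimal sketch of creating and stepping through such an environment in Python, assuming the open-source highway-env package (which provides intersection and roundabout scenarios like ours); the random action is only a placeholder for the agent's learned policy.

```python
# Minimal sketch, assuming the open-source highway-env package
# (gymnasium interface); the random action stands in for a learned policy.
import gymnasium as gym
import highway_env  # registers "intersection-v0", "roundabout-v0", etc.

env = gym.make("intersection-v0")  # road plus pre-programmed traffic cars
obs, info = env.reset()

done = False
while not done:
    action = env.action_space.sample()  # placeholder for the agent's policy
    # The environment returns the reward (or punishment) and the new state,
    # plus whether the episode ended (e.g., a crash or reaching the goal).
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```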
So, How Did Our Cars Perform?
Looking at the intersection environment, our car agent performed poorly. We wanted it to move quickly through the intersection without any collisions, but in both videos that was not always the case. Because the car agent did not meet this goal, we can conclude that our original reward incentive was not good at promoting risk-seeking behavior (like speeding up through the intersection).
The car did not perform very well. How do we know? The green car (our agent) hardly moved across the intersection.
The car also did not perform very well. How do we know again? The green car (our agent) moved slowly across the intersection and ended up crashing into another car.
Here, we are looking at the roundabout environment, and again our car agent performed quite poorly. In one video it hardly moves, in another it crashes into a car, and in another it moves very slowly: three outcomes opposite to our goal. So we can conclude that our original reward incentive was bad at promoting good car performance and risk-seeking.
So, getting to our question...
If we change our car's reward incentive, can we make it reach our goal (driving quickly through the roundabout/intersection)?
We tested this out by exponentiating our rewards!
Here's our process and outcome:
Exponentiating involves replacing each current reward r with e^(β·r), where β (beta) is a factor that you choose; a larger β amplifies the difference between high and low rewards, which encourages risk-seeking behavior.
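As a minimal sketch, this transformation can be written as a reward wrapper around the environment; the gymnasium RewardWrapper class is standard, but the beta value shown here is just an example, not the one we tuned.

```python
import math

import gymnasium as gym


class ExponentiatedReward(gym.RewardWrapper):
    """Replaces each reward r with e^(beta * r) to encourage risk-seeking."""

    def __init__(self, env, beta=1.0):
        super().__init__(env)
        self.beta = beta

    def reward(self, reward):
        # Raise e to the power of (beta * reward); a larger beta amplifies
        # the gap between high and low rewards.
        return math.exp(self.beta * reward)


# Usage (environment id and beta value are illustrative):
# env = ExponentiatedReward(gym.make("roundabout-v0"), beta=2.0)
```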
For our roundabout environment, we can clearly see how the agent improved after exponentiating: not only did it move much faster, but it also avoided collisions better and changed lanes very smoothly.
After exponentiating
In this multi-agent version of the intersection, two cars controlled by different agents must go straight through an intersection. They get a penalty for crashing and a reward for arriving at the other side.
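A hedged sketch of what such a per-agent reward might look like (the specific numbers are illustrative, not the values used in our experiments):

```python
def agent_reward(crashed: bool, arrived: bool) -> float:
    """Illustrative per-agent reward for the multi-agent intersection:
    a penalty for crashing, a reward for reaching the other side."""
    if crashed:
        return -5.0  # collision penalty (example value)
    if arrived:
        return 1.0   # bonus for arriving at the other side (example value)
    return 0.0       # otherwise, no reward this step
```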
The regular agent is too risk-averse: it brakes and does not go through the intersection.
However, because the exponentiated agent is more risk-seeking, it successfully navigates the intersection.