Results

Our formulation is able to optimize the average reward obtained by all the agents: the episode reward shows an upward trend and begins to saturate after roughly 3000 steps (Fig 1). A simple multi-layer-perceptron model trained with Deep Q-Learning is employed to verify the problem formulation.

Fig 1. Episode reward for 2-agent training
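For concreteness, the training setup can be expressed in a few lines. The sketch below is illustrative only, assuming the highway-env package and the DQN implementation from stable-baselines3; the environment id, hyperparameters, and step budget are assumptions rather than the exact values behind Fig 1, and it shows the single-agent case (the 2-agent setting extends it through highway-env's multi-agent configuration).

```python
# Illustrative sketch: DQN with an MLP Q-network on a highway environment.
# Assumes the highway-env and stable-baselines3 packages; hyperparameters
# are placeholders, not the exact values used for Fig 1.
import gymnasium as gym
import highway_env  # importing registers the highway environments

from stable_baselines3 import DQN

env = gym.make("highway-v0")

# "MlpPolicy" gives the simple multi-layer-perceptron Q-network
# referred to in the text.
model = DQN(
    "MlpPolicy",
    env,
    learning_rate=5e-4,   # illustrative hyperparameters
    buffer_size=15_000,
    learning_starts=200,
    batch_size=32,
    gamma=0.99,
    verbose=1,
)
model.learn(total_timesteps=3_000)  # reward saturates around this scale (Fig 1)
model.save("dqn_highway")
```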

The above video shows the evaluation of our policy in a 2-agent highway environment, where the agents cooperatively navigate through highway traffic. In the video overlay, p denotes the immediate reward and v the expected reward. Since an episode terminates on collision, we plan to modify the reward formulation to increase the time the agents spend navigating the environment without collision.
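One plausible way to realize this planned modification is a simple shaping term: a per-step survival bonus plus a larger collision penalty, so that longer collision-free episodes accumulate more reward. The function below is a minimal sketch; the coefficients, the base env_reward term, and the crashed flag are assumptions for illustration, not our final formulation.

```python
# Hedged sketch of a survival-oriented reward shaping term.
# All names and coefficients here are hypothetical placeholders.
def shaped_reward(env_reward: float, crashed: bool,
                  survival_bonus: float = 0.1,
                  collision_penalty: float = 10.0) -> float:
    """Combine the environment's native reward with survival shaping."""
    reward = env_reward + survival_bonus  # reward each collision-free step
    if crashed:
        reward -= collision_penalty       # strongly discourage crashes
    return reward
```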