Training steps: 500k (speed: x2.5). The RL agent yields to other agents.
Training steps: 1000k (speed: x2.5). As training progresses, the RL agent becomes more aggressive.
Training steps: 1000k (speed: x2.5). State representation: [x, y, v * sinθ, v * cosθ].
Training steps: 1000k (speed: x2.5). State representation: [x, y, θ, sinθ, cosθ, v].
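For concreteness, here is a minimal sketch of the two state encodings compared above, assuming the agent exposes its planar pose (x, y, θ) and speed v; the function names are ours for illustration:

```python
import numpy as np

def state_v1(x, y, theta, v):
    # 4-D encoding: position plus the velocity decomposed into components.
    # Heading is only recoverable implicitly, from the velocity direction.
    return np.array([x, y, v * np.sin(theta), v * np.cos(theta)])

def state_v2(x, y, theta, v):
    # 6-D encoding: heading kept explicitly, both as the raw angle and as
    # its sine/cosine, which avoids the wrap-around discontinuity at ±π.
    return np.array([x, y, theta, np.sin(theta), np.cos(theta), v])

# Example: an agent at (1.0, 2.0), heading 90°, moving at 0.5 m/s.
s1 = state_v1(1.0, 2.0, np.pi / 2, 0.5)
s2 = state_v2(1.0, 2.0, np.pi / 2, 0.5)
```

Note that the 4-D encoding carries no heading information when the agent is stationary (v = 0), whereas the 6-D encoding preserves it.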
In our experiments in HallWay, we encountered significant challenges in training an RL agent to interact effectively with a blind agent using conventional RL algorithms, reward structures, and observation designs. We hypothesize that the primary difficulty stems from inadequate exploration strategies, and we propose refining RL algorithm designs to address this as a direction for future work.
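As one illustration of the kind of refinement meant here (a sketch only, not the method used in our experiments), a common way to strengthen exploration is to add an entropy bonus to a policy-gradient loss, which penalizes premature policy collapse; `entropy_coef` is an illustrative hyperparameter:

```python
import torch

def pg_loss_with_entropy(logits, actions, advantages, entropy_coef=0.01):
    # Standard policy-gradient loss with an entropy bonus: the bonus term
    # rewards keeping the action distribution spread out, which encourages
    # continued exploration early in training.
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)          # log π(a|s) for taken actions
    pg_loss = -(log_probs * advantages).mean()  # REINFORCE-style objective
    entropy_bonus = dist.entropy().mean()       # mean policy entropy
    return pg_loss - entropy_coef * entropy_bonus
```

Tuning `entropy_coef` trades off exploration against exploitation; this is one of several standard levers (alongside reward shaping and observation design) that such a refinement could adjust.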