Ultimately, we decided on an alternative approach that does not incentivise collective behaviour directly: we wanted the flocking behaviour to emerge from physical constraints alone. We therefore framed the problem through the lens of Multi-Agent Reinforcement Learning (MARL), with multiple levels of hierarchy among the agents. In this setting, each boid is an agent that consumes a stream of observations, performs actions, and receives rewards from the environment. Although each individual is selfish and only cares about its own rewards, the presence of hierarchy encourages cooperation within a species in order to increase its chance of survival.
We decided to use the Unity Engine for the project, as it conveniently provides interfaces to both a realistic 3D physics engine and state-of-the-art machine learning algorithms.
We started by separating all boids into two categories, which we call prey and predators. Each species is endowed with a shared Brain (an agent's policy represented by a neural network), which is responsible for receiving observations from the environment and converting them into actions. Every boid has a Vision radius within which it can determine the position (x, y, z) and velocity (x, y, z) of other boids. A boid is also able to distinguish between the species, which gives it an observation of (px, py, pz, vx, vy, vz, type) for each agent within its sensor radius. The action space is an (x, y, z) vector giving the direction in which a force is applied, proportional to the agent's speed. A reward of +1 is given to the predator upon catching a prey, while a penalty of -1 is received by the prey in the same event; in all other cases the reward stays at 0.
The Vision (outer sphere) and predation region (inner sphere) of both the smaller prey boid (blue) and the bigger predator boid (red)
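To make the observation and action spaces concrete, here is a minimal sketch of a prey agent written against the current Unity ML-Agents C# API. Our project used an older ml-agents release with explicit Brain assets, so the exact class and method names differed; `PreyAgent`, `visionRadius`, `forceScale`, `maxNeighbours`, and the `Predator` tag are illustrative assumptions rather than our actual code.

```csharp
using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using Unity.MLAgents.Sensors;
using UnityEngine;

// Hypothetical prey agent: observes nearby boids and steers by applying a force.
public class PreyAgent : Agent
{
    public float visionRadius = 10f;   // the outer "Vision" sphere in the figure above
    public float forceScale = 1f;      // scales the applied steering force
    public int maxNeighbours = 8;      // vector observations must have a fixed size

    Rigidbody body;

    public override void Initialize()
    {
        body = GetComponent<Rigidbody>();
    }

    public override void CollectObservations(VectorSensor sensor)
    {
        // (px, py, pz, vx, vy, vz, type) for each boid inside the Vision sphere,
        // padded with zeros up to maxNeighbours entries.
        int added = 0;
        foreach (Collider c in Physics.OverlapSphere(transform.position, visionRadius))
        {
            Rigidbody other = c.attachedRigidbody;
            if (other == null || other == body || added >= maxNeighbours) continue;

            sensor.AddObservation(other.position - body.position);       // relative position
            sensor.AddObservation(other.velocity);                       // velocity
            sensor.AddObservation(c.CompareTag("Predator") ? 1f : 0f);   // species flag
            added++;
        }
        for (int i = added; i < maxNeighbours; i++)
            for (int j = 0; j < 7; j++)
                sensor.AddObservation(0f);
    }

    public override void OnActionReceived(ActionBuffers actions)
    {
        // The action is an (x, y, z) direction; the force grows with the boid's speed.
        var a = actions.ContinuousActions;
        Vector3 direction = new Vector3(a[0], a[1], a[2]);
        body.AddForce(direction.normalized * forceScale * body.velocity.magnitude);
    }

    // Called by the environment when a predator's predation sphere reaches this prey.
    public void OnCaught()
    {
        AddReward(-1f);   // the predator's agent would receive AddReward(+1f)
        EndEpisode();
    }
}
```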
The agents learn by maximizing the discounted sum of future rewards with respect to their own policy. We used an untuned version of Proximal Policy Optimization (PPO) available in the ML-Agents Unity library, which uses policy gradients to guide the search for the best actions.
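For reference, each agent independently maximizes its expected discounted return,

$$ J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{T} \gamma^{t} r_t \right], $$

where $r_t$ is the reward received at step $t$ and $\gamma \in [0, 1)$ is the discount factor (left at the library default in our untuned setup).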
In our first experiment, we paired a single prey with a single predator to test the validity of the RL approach. We set the predator's reward to be inversely proportional to its distance from the prey: the closer the predator got to the prey, the bigger the reward it received. At first the predator dominated, but as training evolved the prey was able to `outrun` the predator. We suspect that this is due to the prey's lower cross-sectional area and therefore lower drag (which we discuss in the Physics section).
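A shaping term of this kind only takes a few lines. The sketch below shows one possible form; the `ChaseRewardShaper` name, the `chaseScale` value, and the exact `1/(d + 1)` shape are illustrative assumptions, not the values we used.

```csharp
using Unity.MLAgents;
using UnityEngine;

// Hypothetical per-step reward shaping for the one-on-one chase experiment:
// the predator's reward grows as its distance to the prey shrinks.
public class ChaseRewardShaper : MonoBehaviour
{
    public Agent predatorAgent;       // the predator's ML-Agents component
    public Transform predator;
    public Transform prey;
    public float chaseScale = 0.01f;  // illustrative magnitude for the per-step reward

    void FixedUpdate()
    {
        float distance = Vector3.Distance(predator.position, prey.position);
        // Inversely proportional to distance; the +1 keeps the reward bounded at contact.
        predatorAgent.AddReward(chaseScale / (distance + 1f));
    }
}
```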
Our second experiment was a larger-scale simulation with 15 prey boids and 3 predators. Although we were expecting fluid flocking patterns to emerge, we were caught by surprise when the prey chose to hide in a corner instead. We think this might actually be optimal in this particular environment, as the prey learned to set up a living shield!