Our project aims to develop optimal navigation policies for UAVs using Deep Q-Learning. These policies are ultimately to be deployed on UAVs for tasks such as agricultural monitoring and structural inspection. Such UAVs could also serve as first responders during crises or disasters in inaccessible areas.
This project was inspired by the thesis of Nahush Gondhalekar [1], which describes the implementation of a naive Reinforcement Learning algorithm with Gaussian Processes for infrastructure and environmental monitoring.
We defined a goal position in AirSim's Neighborhood environment, which we expected the drone to learn to reach on its own through trial and error.
We carried out a number of simulations, both successful and unsuccessful, in AirSim, a simulator by Microsoft. We defined our state space as the depth images from the drone's front camera, and the available actions were to move the drone forward, move it backward, rotate left, and rotate right.
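As a concrete illustration, the sketch below maps such a discrete action set onto AirSim's Python API. The speed, yaw rate, and duration values are placeholder assumptions rather than our exact tuning, and "forward" is taken along the world x-axis for simplicity.

```python
import airsim

FORWARD_SPEED = 2.0   # m/s (assumed)
YAW_RATE = 30.0       # degrees per second (assumed)
DURATION = 1.0        # seconds each action is applied for (assumed)

def execute_action(client, action):
    """Execute one of the four discrete actions described above."""
    if action == 0:      # move forward
        client.moveByVelocityAsync(FORWARD_SPEED, 0, 0, DURATION).join()
    elif action == 1:    # move backward
        client.moveByVelocityAsync(-FORWARD_SPEED, 0, 0, DURATION).join()
    elif action == 2:    # rotate left
        client.rotateByYawRateAsync(-YAW_RATE, DURATION).join()
    else:                # rotate right
        client.rotateByYawRateAsync(YAW_RATE, DURATION).join()
```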
Our implementation of the Deep Q-Network agent was adapted from "Human-level control through deep reinforcement learning" (Mnih et al., 2015, Nature 518) [2]. The structure of this network was predefined in the CNTK library.
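For reference, a Mnih-style convolutional Q-network can be expressed with CNTK's layers API roughly as follows. The filter sizes and layer widths below follow the Nature paper and are illustrative; they are not necessarily the exact predefined structure we used.

```python
import cntk as C

def build_q_network(state_shape, num_actions):
    """Convolutional Q-network roughly following Mnih et al. (2015)."""
    model = C.layers.Sequential([
        C.layers.Convolution2D((8, 8), 32, strides=4, activation=C.relu),
        C.layers.Convolution2D((4, 4), 64, strides=2, activation=C.relu),
        C.layers.Convolution2D((3, 3), 64, strides=1, activation=C.relu),
        C.layers.Dense(512, activation=C.relu),
        C.layers.Dense(num_actions, activation=None)  # one Q-value per action
    ])
    state = C.input_variable(state_shape)             # e.g. a stack of depth images
    return model(state), state
```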
This neural network estimated the Q-values, which were in turn used to improve the policy toward the optimal one.
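Concretely, the estimated Q-values feed into the standard one-step Bellman target used in DQN. A minimal sketch, assuming a discount factor gamma of 0.99:

```python
import numpy as np

def q_learning_targets(rewards, next_q_values, terminal, gamma=0.99):
    """Standard DQN targets: bootstrap from the best next-state Q-value,
    except when the episode ended (e.g. on a collision)."""
    return rewards + gamma * np.max(next_q_values, axis=1) * (1.0 - terminal)
```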
The reward function was defined as a combination of the distance from the drone's current position to the goal position and the drone's velocity. As the drone moved closer to the goal, the reward increased. Conversely, whenever the drone collided with an obstacle, it was penalized and its reward dropped drastically. The velocity term was included to ensure the drone did not learn that staying stationary is a way to avoid obstacles.
The reward function also imposed a strong penalty if the drone wandered too far from the goal position.
This reward function was formulated to teach the drone's policy to take actions that guide it toward the goal position while avoiding obstacles, thereby learning how to navigate through the environment.
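A minimal sketch of this reward is shown below; the penalty magnitude, distance threshold, and velocity weight are illustrative assumptions rather than the exact values we used.

```python
import numpy as np

COLLISION_PENALTY = -100.0   # assumed penalty magnitude
MAX_DISTANCE = 50.0          # assumed "wandered too far" threshold, in metres
VELOCITY_WEIGHT = 0.5        # assumed weight on the velocity term

def compute_reward(position, velocity, goal, collided):
    dist_to_goal = np.linalg.norm(goal - position)
    if collided or dist_to_goal > MAX_DISTANCE:
        # Colliding, or straying too far from the goal, is heavily penalized.
        return COLLISION_PENALTY
    # Reward grows as the drone closes in on the goal; the velocity term
    # discourages hovering in place merely to avoid obstacles.
    return -dist_to_goal + VELOCITY_WEIGHT * np.linalg.norm(velocity)
```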
After running the simulation for more than 350 episodes, which took almost 8 hours, we plotted the total reward collected per episode. We observed that, as the number of episodes increased, the reward per episode showed a slight upward trend.
Considering that other naive implementations of similar ideas took almost six thousand episodes to converge, we estimate that ours would require considerably more than that. Unfortunately, we did not have a GPU powerful enough to handle that load.
Here we show a couple of episodes carried out in the simulator and how the drone returns to its start position after every collision. In the command line on the right, the penalty imposed for each collision can be seen. The image in the bottom-left corner shows the input to the neural network: the depth images captured by the drone's on-board front camera.
The drone was controlled through AirSim's Python API, built on msgpack-rpc, which communicated with the simulator over a TCP connection.
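For reference, connecting to the simulator and grabbing a depth frame looks roughly like the following with the AirSim Python package; the camera name and image type may differ slightly between AirSim versions.

```python
import airsim

# Connect to the simulator; the client talks msgpack-rpc over TCP
# (127.0.0.1:41451 by default).
client = airsim.MultirotorClient()
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)
client.takeoffAsync().join()

# Request a depth image from the front camera -- the network's input.
responses = client.simGetImages([
    airsim.ImageRequest("front_center", airsim.ImageType.DepthPerspective,
                        pixels_as_float=True)
])
depth = airsim.list_to_2d_float_array(
    responses[0].image_data_float, responses[0].width, responses[0].height)
```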