Safe & Efficient Ground Exploration Using Deep Reinforcement Learning
In Fall 2021 I took ROB 537: Learning Based Control. This course focused on using learning methods (neural networks, RL, evolutionary algorithms) to control robots. For our course project, my team (Kyle Mathenia and Aiden Shaevitz) and I set out to create a simulated agent capable of high-level navigation of rough terrain, trained on real-world elevation data. The goal was to train an agent that could leverage local elevation readings to safely and efficiently aid in search and rescue missions. The project culminated in a conference-style paper, which earned the distinction of runner-up for best paper in the class.
Need for Rough Terrain Navigation - Search and Rescue
According to the US National Park Service, over 45,000 people were lost and required search and rescue between 2004 and 2014. Drones are already being used to great effect in these operations, but they can be limited by weather and dense foliage. Ground-based robots don't face these same challenges and can also deliver larger payloads, such as critical first aid supplies. The main challenge for ground-based robots is instead navigating harsh natural terrain.
Existing work has shown strong results by training and testing agents in simulated environments, but to our knowledge no one had yet taken on the added realism of an agent that must navigate extreme terrain built from real-world data.
Our team used Deep Q-learning to train an agent to make terrain-informed path planning decisions using real digital elevation data from Badlands National Park.
South Dakota Badlands National Park
Problem Formulation
Our environment was formulated as a classic grid world (just with very large grids) in which the agent spawns in a random location and must reach a random target destination. The agent is controlled by a deep neural network that takes as input a small local viewing window of the surrounding elevation and a guiding compass unit vector that points toward the target. The output of the network is one of four actions representing the cardinal directions.
Simplified structure of agent
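For readers who prefer code, here is a minimal sketch (in PyTorch) of the kind of Q-network described above. The window size, layer widths, and flattened fully-connected design are illustrative assumptions, not the exact architecture we used.

```python
# Minimal sketch of a Q-network that maps (local elevation window, compass
# unit vector) to Q-values for the four cardinal actions. Sizes are assumptions.
import torch
import torch.nn as nn

class TerrainQNetwork(nn.Module):
    def __init__(self, window_size: int = 5, hidden: int = 64):
        super().__init__()
        # Input: flattened local elevation window plus a 2D compass unit vector.
        in_dim = window_size * window_size + 2
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 4),  # Q-values for the four cardinal directions
        )

    def forward(self, elevation_window: torch.Tensor, compass: torch.Tensor) -> torch.Tensor:
        # elevation_window: (batch, window_size, window_size)
        # compass: (batch, 2) unit vector pointing toward the target
        x = torch.cat([elevation_window.flatten(start_dim=1), compass], dim=1)
        return self.net(x)

# Acting greedily is just an argmax over the four Q-values:
# action = TerrainQNetwork()(window, compass).argmax(dim=1)
```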
The elevation data was collected from 0.5 m LIDAR scans. The corresponding terrain slopes were converted into cost maps, with harsher penalties associated with crossing more treacherous terrain.
Elevation
Slope
Cost Penalty
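The sketch below shows one way an elevation grid can be turned into a slope map and then a per-cell cost map, using numpy. The 0.5 m grid spacing matches the LIDAR data; the linear cost scaling is a placeholder rather than the exact penalty scheme we used.

```python
# Elevation -> slope -> cost map, sketched with numpy. The cost scale factor
# is an illustrative placeholder, not the penalty values from our experiments.
import numpy as np

def slope_map(elevation: np.ndarray, cell_size: float = 0.5) -> np.ndarray:
    """Return the slope magnitude (rise over run) at each grid cell."""
    dz_dy, dz_dx = np.gradient(elevation, cell_size)
    return np.hypot(dz_dx, dz_dy)

def cost_map(elevation: np.ndarray, cell_size: float = 0.5, scale: float = 10.0) -> np.ndarray:
    """Map slope to a per-cell traversal penalty: steeper terrain costs more."""
    return scale * slope_map(elevation, cell_size)
```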
Rewards and Training
To increase the reward density of the problem, we used a banded reward structure in which positive rewards are received at intervals of distance radiating outward from the target (a rough sketch of this reward scheme is shown below the figure). As mentioned above, a penalty is incurred for crossing each cell of terrain, with harsher punishments for steeper sections.
Perhaps the central issue of this work was figuring out how to get the agent to properly balance its two competing interests: heading toward the goal and accounting for the terrain around it.
We used a curriculum training approach, first teaching agents to reach the goal with densely spaced reward bands, then training successive agents with sparser bands in the hope that they would learn to better leverage the terrain while still reaching the goal and maximizing reward.
Banded rewards example
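Here is a minimal sketch of the banded reward idea combined with the per-step terrain penalty. The reward magnitudes, band spacing, and curriculum schedule are illustrative placeholders, not the values from our experiments.

```python
# Banded reward sketch: a bonus each time the agent reaches a closer distance
# band around the target, minus the terrain cost of the cell it just entered.
# Magnitudes and spacing are placeholders.
import numpy as np

def banded_reward(agent_pos, target_pos, prev_best_dist, band_spacing, cost_map):
    """One-step reward. agent_pos/target_pos are (row, col) grid indices;
    prev_best_dist is the closest distance to the target reached so far."""
    dist = np.linalg.norm(np.asarray(target_pos) - np.asarray(agent_pos))
    reward = -cost_map[agent_pos]  # terrain penalty for the cell just entered

    # Bonus only when the agent crosses into a band closer than any reached before,
    # so it cannot farm rewards by oscillating across a band boundary.
    if dist // band_spacing < prev_best_dist // band_spacing:
        reward += 1.0  # band-crossing bonus (placeholder magnitude)

    return reward, min(dist, prev_best_dist)

# Curriculum idea: train a first agent with densely spaced bands (e.g. band_spacing=10),
# then train successive agents with sparser bands (e.g. 50, 100), so the agent must
# rely more on the terrain itself to keep making progress toward the goal.
```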
Results
This project proved to be a difficult problem, but a very rewarding and informative one. Our aim was to show that we could train an agent capable of making terrain-aware planning decisions over a large space.
As a baseline we trained a more myopic agent with very dense rewards; it was very good at getting to the goal but not very terrain aware. We compared this to our curriculum-trained agent, which saw sparser rewards and (we hoped) was more aware of its surroundings.
In the end we had mixed success overall, but some very promising results. Our "terrain aware" agent took paths that were on average ~20 lower in cost than those taken by the myopic agent. However, this terrain-aware behavior came at the cost of reliably reaching the target: where the myopic agent succeeded 100% of the time, the terrain-aware agent only reached the goal ~60% of the time, often getting stuck in large loops.
When the "terrain aware" agent was successful, we saw some great glimpses of the kinds of "intelligent" behavior we were expecting. In this example the myopic agent simply travels a more direct route, incurring large terrain penalties, while our "terrain aware" agent smartly navigates down the opening channel and hugs closely along the hillside, taking a short route that avoids needless elevation changes.