Authors: Max Pflueger, Ali-akbar Agha-mohammadi, and Gaurav S. Sukhatme
Max and Gaurav are in the Computer Science Department at the University of Southern California. Ali is at the NASA Jet Propulsion Laboratory.
When sending rovers to other planets, two important questions that arise are "Where do we land?" and "Where should we go after we land?". Although there are many considerations at play in these decisions, one important technical capability is being able to plan good paths across the planet surface at long ranges. Automated path planning tools allow mission planners to know which potential landing sites are likely to have good access to scientifically interesting areas, and once the rover lands, a long range plan can direct its short range path choices.
In this work we propose a flexible new way to develop long range path planners for planetary rovers. Our approach is based on the principle of inverse reinforcement learning, where we look at plans generated by an expert and learn what they were trying to optimize. Then we can take that objective and use it to produce behavior similar to the expert's in new environments.
The dangers faced by planetary rovers are not always visible or obvious from overhead imagery. For example, the Curiosity Mars rover sustained damage to its wheels from small rocks embedded in hard ground that are too small to see from overhead. Although the rocks themselves are not visible from orbit, they may be associated with certain terrain or mineral formations that can be detected and avoided.
When rover drivers steer around dangerous rocks that they can see on the surface, they create a training signal that we can use to detect and avoid similar looking pieces of terrain in the future.
Inverse reinforcement learning (IRL) is the reverse of the more common reinforcement learning paradigm. Whereas reinforcement learning tries to learn an optimal action policy based on experimental evidence of the reward function, IRL tries to learn the reward function based on experimental evidence of the optimal action policy. In our context, this evidence of the optimal action policy comes in the form of path demonstrations, and we attempt to learn why the rover (or planner) chose the path it did.
The 'why' in this context is conceived as a parameterized reward function (in our case a convolutional neural network) that looks at a part of the map and assigns a real-valued reward (positive or negative) for how much we want to visit a particular piece of terrain. We would expect most terrain to have a slight negative value (there is always some risk to traveling through a piece of terrain), with more dangerous terrain having a larger negative value. The goal location, on the other hand, is expected to have a large positive value.
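To make this concrete, here is a minimal sketch of such a reward network in PyTorch. The framework choice, layer sizes, and channel counts are placeholders for illustration only; the paper does not prescribe this exact architecture.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps stacked map layers (imagery, slope, goal channel, ...) to one
    real-valued reward per grid cell. Layer sizes are illustrative only."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=1),  # one reward value per cell
        )

    def forward(self, map_stack):
        # map_stack: (batch, in_channels, H, W) -> reward: (batch, 1, H, W)
        return self.net(map_stack)
```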
An IRL process will attempt to optimize the parameters of that reward function so that, when it is used with some planning procedure, the behavior will be similar to what was shown in our demonstrations. In general this can be tricky to do because it requires getting the gradient of how well our behavior matches the demonstrations with respect to the parameters of the reward function (we discuss why this is difficult in more detail in our paper).
Our approach to solving this gradient calculation problem relies on a differentiable approximation of the value iteration algorithm called a value iteration network. However, this is not quite enough: the gradients calculated in an environment with deterministic actions and outcomes are sometimes not very informative, so we have modified the value iteration network to improve the quality of the gradients, in an algorithm we call the soft value iteration network. (Please reference our paper for a deeper explanation of why this is necessary and how it works.)
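As a rough illustration of the difference, the sketch below contrasts the standard hard value backup with one plausible softened variant in which the max over actions is replaced by a softmax-weighted average of Q values, so gradients flow through every action rather than only the argmax. This softened form is an assumption made for illustration; see the paper for the exact SVIN formulation.

```python
import torch
import torch.nn.functional as F

def hard_backup(q):
    # Standard VIN backup: V(s) = max_a Q(s, a).
    # q: (batch, num_actions, H, W)
    return q.max(dim=1, keepdim=True).values

def soft_backup(q):
    # Illustrative softened backup: V(s) = sum_a pi(a|s) Q(s, a), where pi is a
    # softmax over actions. Keeps gradients flowing through all actions.
    # (Not necessarily the exact SVIN update from the paper.)
    pi = F.softmax(q, dim=1)
    return (pi * q).sum(dim=1, keepdim=True)
```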
The diagram below shows the architecture of our system. The map data and other layers used could be visible or hyper-spectral imagery, terrain slope, elevation, or any other data products we have available. The goal is typically represented as a one-hot image denoting the goal position. These layers are stacked together and run through the reward function f_R. The state transition function is supplied by f_P, which in our case we treat as fixed and deterministic.
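For example, assembling the input to f_R might look like the short sketch below, where `map_layers` and `goal_rc` are hypothetical inputs: the one-hot goal channel is simply concatenated with the other map layers.

```python
import torch

def build_input(map_layers, goal_rc):
    # map_layers: (C, H, W) float tensor of co-registered data products
    # goal_rc: (row, col) index of the goal cell
    _, H, W = map_layers.shape
    goal_channel = torch.zeros(1, H, W)
    goal_channel[0, goal_rc[0], goal_rc[1]] = 1.0  # one-hot goal image
    return torch.cat([map_layers, goal_channel], dim=0)  # (C+1, H, W)
```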
The value iteration module, expanded below, approximates the value iteration algorithm using the operations of a convolutional neural network. It produces two data products: the value function and the Q function. The value function is interpreted as the expected total future reward that could be achieved if the agent started in a given state and followed the optimal policy. The Q function is the action-conditioned variant of that, telling us the expected future reward for being in a certain state and taking a certain action.
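A minimal sketch of such a module, written in the style of a value iteration network and reusing the backup functions from the earlier sketch, might look like the following. The kernel size, action count, and iteration count are illustrative, not the settings from the paper.

```python
import torch
import torch.nn as nn

class ValueIterationModule(nn.Module):
    """Approximate value iteration with convolutions (VIN-style sketch).
    Hyperparameters here are placeholders, not the paper's settings."""
    def __init__(self, num_actions=8, iterations=40):
        super().__init__()
        self.iterations = iterations
        # A convolution over the [reward, value] channels plays the role of
        # the (fixed, deterministic) transition structure over neighbor cells.
        self.q_conv = nn.Conv2d(2, num_actions, kernel_size=3,
                                padding=1, bias=False)

    def forward(self, reward, backup):
        # reward: (batch, 1, H, W); backup: hard_backup or soft_backup
        value = torch.zeros_like(reward)
        for _ in range(self.iterations):
            q = self.q_conv(torch.cat([reward, value], dim=1))
            value = backup(q)
        return value, q  # value: (batch, 1, H, W); q: (batch, A, H, W)
```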
These data products can then be used in multiple ways; for example, the value function could be integrated with a local planning algorithm to choose a destination at the edge of the rover's range of vision. The Q function is used during training to compare the calculated optimal actions with those given in our training data and produce a loss function that can then be optimized with standard gradient back-propagation.
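One plausible form of that training loss, assuming demonstrations are stored as (state, action) pairs along each demonstrated path, is a cross-entropy between the Q values at the demonstrated states and the demonstrated actions. The exact loss in the paper may differ; this is a sketch of the idea.

```python
import torch
import torch.nn.functional as F

def imitation_loss(q, demo_states, demo_actions):
    # q: (batch, num_actions, H, W) Q function from the VI module
    # demo_states: (N, 3) long tensor of (batch_index, row, col) along demos
    # demo_actions: (N,) long tensor of the expert action at each state
    b, r, c = demo_states[:, 0], demo_states[:, 1], demo_states[:, 2]
    q_at_demo = q[b, :, r, c]                  # (N, num_actions)
    return F.cross_entropy(q_at_demo, demo_actions)
```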
We started by verifying how well our algorithm worked with a simple dataset. Our gridworld dataset contains about 50,000 example paths on 32x32 grids where cells are either free or occupied. Here the effect of our soft value iteration network (SVIN) was shown to be critical to getting good performance. The training curves below show the imitation accuracy and loss of our model as it was trained. We can see that in accuracy the standard VIN tops out around 73%, whereas our SVIN reaches about 89%.
To see how the algorithm is thinking, the following three figures show, for a fully trained model, the map, the reward map as it heads into the VI module, and the value map that comes out of the VI module.
To test our algorithm on Mars terrain data, we created a dataset based on JPL's Mars Terrain Traversability Tool (MTTT). The tool was built by experts to assess the navigability of potential landing sites, and so it only works in some small regions, such as Jezero Crater. Our dataset used short paths that fit in 64m square tiles, with a resolution of 256x256 pixels (each pixel is 25cm on a side).
The examples below show what this data looks like and how the plans of our fully trained model compare with the plans supplied by the MTTT tool. From the left, they show: the map with the goal location marked as a red star, the reward map sent to the VI module, the value map produced by the VI module, and the map with the MTTT path overlaid in yellow and several paths from our calculated policy overlaid in red.
In some cases (2nd and 3rd rows below) we see indications that our algorithm has learned better behavior than the MTTT tool, possibly because the MTTT tool relies on coarse terrain labels, whereas we always use the highest resolution data available. This is an interesting possibility that will require more detailed investigation in the future to see if the effect is real.
After training a reward network, the next question is: can we use this on a much larger piece of terrain? We may be able to train our network on short, computationally manageable, and abundant path demonstrations, and then take the network weights and use them to plan paths on much larger chunks of terrain. This will create new challenges such as choosing hyperparameters that work for both small and large maps, and we look forward to seeing what this technique can do at larger scales.
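Because both the reward network and the VI module in the sketches above are fully convolutional, the same trained weights can, in principle, be applied to a larger map directly; the main change is running more value iteration sweeps so value can propagate across the larger grid. The snippet below is purely illustrative and reuses the hypothetical names from the earlier sketches; `large_map_layers`, `trained_reward_weights`, and `trained_vi_weights` are placeholders.

```python
# Hypothetical usage: reuse weights trained on small tiles on a much larger map.
reward_net = RewardNet(in_channels=large_map_layers.shape[0] + 1)
reward_net.load_state_dict(trained_reward_weights)        # learned on small tiles

vi = ValueIterationModule(num_actions=8, iterations=600)  # more sweeps for a bigger map
vi.load_state_dict(trained_vi_weights)

x = build_input(large_map_layers, goal_rc=(1800, 950)).unsqueeze(0)
value, q = vi(reward_net(x), soft_backup)
```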
Read our paper: "Rover-IRL: Inverse Reinforcement Learning with Soft Value Iteration Networks for Planetary Rover Path Planning," published in IEEE Robotics and Automation Letters, 2019.
We will present this work at ICRA 2019 in Montreal, Canada; if you are attending, come talk to us in person!