A lot of research has been conducted in the field of autonomous navigation of mobile robots, with a focus on Robot Vision and Robot Motion Planning. However, most classical navigation solutions require several steps of data pre-processing and hand-tuning of parameters, with separate modules for vision, localization, planning and control. These modules work independently and make their own parameter assumptions to optimize their own performance, without taking into account the effect these assumptions have on the performance of the rest of the pipeline. Hence, even though each module tries to achieve optimal performance for its assigned task, the lack of interdependence in their decision making means that the overall performance of the system is sub-optimal in most cases. An alternative approach for addressing these issues is to train certain parts of the vision module to incorporate partial tasks from the planning module.
Deep Learning architectures have achieved great success in the fields of pattern recognition and object detection, and as a result are often deployed to design such a module that jointly learns to carry out perception and path planning. In my master’s thesis, making use of Deep Learning, I propose an End-to-End Learning architecture that learns to directly map raw sensor readings to control commands for a ground-based mobile robot. The research makes use of a simulation of the Jackal UGV from Clearpath Robotics, and the proposed network is able to produce collision-free trajectories for the robot to navigate in its environment.
The proposed End-to-End planner is depicted in the figure above and is trained using Supervised Learning. In most Supervised Learning methods, the input-output data pairs are assumed to be I.I.D. But for tasks like robot navigation, where the future input received by the robot's sensors is heavily influenced by its current control commands, the motion planner must be able to establish a temporal dependency between successive control commands and input sensor readings. An I.I.D. assumption about the navigation behavior puts the planner at risk of making unstable decisions. The required temporal behavior is achieved with the use of Long Short-Term Memory (LSTM) cells in the network.
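As a rough illustration, a recurrent planner of this kind could be set up in PyTorch as sketched below. The encoder depth and layer sizes are assumptions for illustration only; the input (720 laser ranges plus the goal in relative polar coordinates) and the output (linear and angular velocity) follow the description in this work.

```python
import torch
import torch.nn as nn

class EndToEndPlanner(nn.Module):
    """Recurrent planner mapping laser scans + goal info to velocity commands.

    Layer sizes are illustrative assumptions; input/output dimensions follow
    the text (720 laser ranges, goal in relative polar coordinates,
    linear and angular velocity as outputs).
    """

    def __init__(self, n_laser=720, n_goal=2, hidden_size=256):
        super().__init__()
        self.encoder = nn.Sequential(            # compress the raw sensor input
            nn.Linear(n_laser + n_goal, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
        )
        self.lstm = nn.LSTM(256, hidden_size, batch_first=True)  # temporal dependency across steps
        self.head = nn.Linear(hidden_size, 2)    # [linear velocity, angular velocity]

    def forward(self, scans, goal, hidden=None):
        # scans: (batch, time, 720), goal: (batch, time, 2)
        x = self.encoder(torch.cat([scans, goal], dim=-1))
        x, hidden = self.lstm(x, hidden)         # hidden state carries memory between time steps
        return self.head(x), hidden
```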
In order to train the End-to-End planner, the training data is collected using the Gazebo simulation of the Jackal UGV equipped with a LiDAR. ROS Move Base is used as the expert, with Dijkstra's algorithm as the global planner and the Dynamic Window Approach as the local planner. Around 4,000 trajectories with different start and goal positions are collected, constituting around 2 million input-output data tuples. Each tuple consists of a vector of 720 laser scan readings over a FOV of 360 degrees, the current position of the robot in the world frame, the relative distance and orientation between the robot and the goal in polar coordinates, and the control commands output by Move Base in the form of linear and angular velocities.
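Conceptually, each recorded sample can be thought of as a small container like the one below. This is a hypothetical structure with illustrative field names; the contents follow the description above.

```python
from typing import NamedTuple
import numpy as np

class Sample(NamedTuple):
    laser: np.ndarray       # 720 range readings over a 360-degree FOV
    pose: np.ndarray        # robot position in the world frame
    goal_polar: np.ndarray  # relative distance and orientation to the goal
    cmd: np.ndarray         # expert output: [linear velocity, angular velocity]
```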
In order to test the trained planner, it is made to execute 100 random trajectories on three different test maps, each of size 20x20 meters. The goal point of each trajectory acts as the start point for the next. In the images on the right, these trajectories are shown in blue, the goal points of successful trajectories are depicted by red dots, collisions by green dots and time-outs by yellow dots. The statistics for all 300 trajectories on the three maps are presented in the table below.
The trained End-to-End planner is able to mimic the expert, completing a trajectory around 80% of the time on all three maps. For the remaining 20%, however, it suffers from time-outs and collisions. There are two major reasons for this: the inertia induced by the LSTMs and the deterministic nature of the Supervised Learning approach.
The Inertia Effect
The LSTMs provide the necessary stability to the planner by introducing inertia into the network, which prevents the planner from making drastic or unstable changes to its control commands. However, this inertia can backfire when the robot is required to alter its velocities drastically. For example, if the robot has been travelling through empty space with a significant linear velocity for a considerable amount of time and an obstacle suddenly appears in front of it, it must drastically alter its linear and angular velocity in order to avoid a collision. Due to the said inertia, the robot is sometimes not able to make these changes in time and ends up colliding with the obstacle. Another behavior observed due to this inertia effect is that the robot is not able to slow down in time and make the required orientation changes as it approaches the goal. In such situations, the robot takes longer to slow down and align itself with the goal and ends up drifting around the goal, which eventually leads to time-outs. This behavior is depicted by the circular regions visible in the trajectory plots above and also in the final trajectory presented in the video on the right.
The Deterministic Nature
The Supervised Learning approach used to train the network is deterministic in nature. As a result, the robot always has access to a single velocity value at each time step rather than a distribution of potential velocity values. This can turn out to be detrimental for the robot, especially when the goal is nearby but located directly on the opposite side of an obstacle in front of the robot. Since the network learns to mimic the behavior of the expert planner, it always tries to output the velocity that will take the robot to the goal in the shortest possible time. Hence, the network can discard the velocities that would make the robot navigate around the obstacle when the goal is on the opposite side. In such cases, three different types of behavior are usually exhibited by the planner:
The robot simply stops in order to avoid a collision and does not move, as the view of the goal gets blocked by the obstacle and the deterministic nature of the network prevents the robot from finding empty space in front of it to overcome this blockade, which in turn leads to a trajectory time-out.
Due to the inertia from the LSTMs and the network’s tendency to make the robot reach its destination in the shortest possible time, the robot rams directly into the obstacle in an effort to reach the goal on the opposite side.
If the robot is near one of the corners of the obstacle, it tries to rotate and change its orientation in order to find some empty space in front of it and navigate around the obstacle. But despite this rotation, the magnitude of the linear velocity is not high enough due to the presence of the obstacle in the vicinity, and eventually a time-out occurs. A collision can also occur if the robot tries to rotate when it is too close to the obstacle.
Comparison with ROS Move Base
The performance of the trained planner is also compared with that of the expert used for training it. For this comparison, the robot is made to execute a closed-loop trajectory consisting of 6-7 waypoints on each of the three test maps. The recorded trajectories are compared on several fronts: mean distance covered in a trajectory, time taken to complete a trajectory and mean absolute velocities over the course of the trajectories. The corresponding plots are depicted in the images above and the results are summarized in the table below. Based on the observed statistics, it can be concluded that the End-to-End planner is able to capture the behavior of the expert planner to a significant extent, despite the fact that the expert has access to both global and local information while the End-to-End planner only has access to information about its local environment.
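For reference, the comparison metrics can be computed from a recorded trajectory along the following lines. This is only a sketch; the arrays are assumed to come from the logged simulation data.

```python
import numpy as np

def trajectory_metrics(poses, timestamps, velocities):
    """poses: (N, 2) x/y positions, timestamps: (N,) seconds, velocities: (N, 2) [v, w]."""
    step_lengths = np.linalg.norm(np.diff(poses, axis=0), axis=1)
    return {
        "distance_covered": step_lengths.sum(),                # meters
        "completion_time": timestamps[-1] - timestamps[0],     # seconds
        "mean_abs_linear_vel": np.abs(velocities[:, 0]).mean(),
        "mean_abs_angular_vel": np.abs(velocities[:, 1]).mean(),
    }
```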
However, despite this success, the planner still suffers from issues like inertia, and the Supervised Learning approach is not sufficient to be presented as an alternative to the expert planner, as it is not collision-proof. This vulnerability of the Supervised Learning approach also stems from the fact that during training, the network doesn’t have access to collision data and there is no way to penalize the network for attempting to take the robot into unsafe collision states. A possible solution is to combine this approach with another learning technique like Reinforcement Learning and learn a new policy that is better equipped at avoiding obstacles.
One way to improve the End-to-End planner's performance is to further train it with a different Machine Learning approach like Reinforcement Learning, where the network can have access to the collision data and can be suitably penalized for entering unsafe states in an environment. The structure of the proposed Reinforcement Learning approach, along with the reward function and the optimization algorithm used to learn the required policy, is based on the work on Reinforced Imitation presented here.
For a Markov Decision Process M = <S, A, P, R, γ>, where S is the state space, A is the action space, P(·|s_t, a_t) : S × S × A → R_+ is the transition probability distribution, R(·, ·) : S × A → R is the reward function and γ ∈ [0, 1] is the discount factor, Reinforcement Learning (RL) aims to find a policy π_θ, mapping states to actions and parameterized by θ, that maximizes the expected sum of discounted rewards.
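Written out in the standard form (the exact equation appears as a figure in the original; this reconstruction follows the definitions above and the time horizon T introduced next):

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \gamma^{t}\, R(s_t, a_t)\right]
```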
Here, T is the time horizon of a navigation trajectory. In this work, the states are represented by the laser scan readings and the goal information, and the actions are represented by the control commands. The RL approach is formulated using the Actor-Critic methodology, where the Actor learns the desired policy π(a|s) and the Critic learns to approximate the value function. The Actor then updates its parameters based on the update direction suggested by the Critic. In the proposed algorithm, the Actor network is the same as the one used for Supervised Learning. In this case, however, the Actor learns a stochastic policy in the form of a 2D Gaussian distribution, with the network outputs being the mean values of the linear and rotational velocities and a 2D standard deviation that is a separate learnable parameter. During training, the Actor model is initialized with the weights learnt by the Supervised Learning model, thus making the algorithm a Supervised-Reinforcement Learning method. The Critic model, on the other hand, consists of three Fully Connected hidden layers of size 512, 256 and 128 that take as input the states observed by the robot, of size 722, which include the laser scan readings and the goal information in relative polar coordinates. The output of the Critic network is a single value: the value function estimate.
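A minimal PyTorch sketch of these two networks is given below, reusing the recurrent planner from before as the Actor body. The Gaussian head and the 512/256/128 Critic layers follow the description above; everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """Stochastic policy: mean velocities from the SL planner, learnable std."""

    def __init__(self, planner):
        super().__init__()
        self.planner = planner                       # recurrent SL network, reused as-is
        self.log_std = nn.Parameter(torch.zeros(2))  # 2D std, a separate learnable parameter

    def forward(self, scans, goal, hidden=None):
        mean, hidden = self.planner(scans, goal, hidden)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        return dist, hidden                          # sample actions / evaluate log-probs from dist

class Critic(nn.Module):
    """Value function approximator: 722-dim state -> scalar value."""

    def __init__(self, state_dim=722):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),                       # single scalar value estimate
        )

    def forward(self, state):
        return self.net(state)
```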
Reward Function
The reward function for the RL problem is designed such that the agent learns to navigate to the goal in the shortest possible time while avoiding collisions with obstacles in its environment. The reward function is formulated as:
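The exact formula appears as a figure in the original. One form consistent with the description here and with the distance measure d(s) discussed next is a terminal reward for reaching the goal plus a shaping term that rewards progress along d; r_goal and the scaling constant c are placeholders, not values from the thesis:

```latex
R(s_t) =
\begin{cases}
  r_{\text{goal}}, & \text{if the goal is reached at } s_t \\[4pt]
  c \,\bigl( d(s_{t-1}) - d(s_t) \bigr), & \text{otherwise}
\end{cases}
```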
The next step is to assign a value to d(s). If d(s) is set to 0 for all s, only a minimal amount of information gets encoded to carry out the required task. This 'sparse reward' makes the learning process difficult, as all the actions taken in an episode get credit for its outcome regardless of whether they contributed to it or not. An alternative is to use the Euclidean distance between s and the goal as the value of d(s). This provides continuous feedback for each action by rewarding/penalizing the agent for getting closer to/further from the goal. However, this approach does not consider the placement of obstacles in the environment. In order to overcome this issue, the value of d(s) is set to the distance between s and the goal along the 'shortest feasible path'. This path is computed using Dijkstra's algorithm.
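A sketch of how d(s) could be precomputed with Dijkstra's algorithm on an occupancy grid, assuming an 8-connected grid; the map representation and resolution handling in the actual implementation may differ:

```python
import heapq
import numpy as np

def shortest_path_distances(occupancy, goal):
    """Dijkstra from the goal cell over an 8-connected grid.

    occupancy: 2D bool array, True = obstacle.  goal: (row, col).
    Returns shortest feasible path lengths (in cells) to the goal;
    d(s) is then a lookup of the robot's cell, scaled by the map resolution.
    """
    dist = np.full(occupancy.shape, np.inf)
    dist[goal] = 0.0
    heap = [(0.0, goal)]
    moves = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[r, c]:
            continue  # stale heap entry
        for dr, dc in moves:
            nr, nc = r + dr, c + dc
            if 0 <= nr < occupancy.shape[0] and 0 <= nc < occupancy.shape[1] and not occupancy[nr, nc]:
                nd = d + (2 ** 0.5 if dr and dc else 1.0)  # diagonal vs. axis-aligned step cost
                if nd < dist[nr, nc]:
                    dist[nr, nc] = nd
                    heapq.heappush(heap, (nd, (nr, nc)))
    return dist
```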
Constrained Policy Optimization
In a standard RL approach, the agent explores by trying many different policies using trial and error. During this exploration period, the agent can exhibit unsafe behavior in pursuit of the optimum. For mobile robots that are required to learn a navigation behavior, this can turn out to be a serious problem. Hence the need to address a very important concern among RL algorithms: safety. An autonomous system is considered safe if its failure modes are rare and happen within some specified rate. This paves the way for the constrained RL formulation as a means to incorporate safety into RL and ensure that every exploration policy also satisfies the specified safety constraint. Constrained Policy Optimization (CPO) helps satisfy these needs.
Local policy search is a standard iterative way to learn policies. Policy gradient methods are local search methods that use modifications of stochastic gradient descent to optimize J(θ) with respect to the policy parameters θ. However, they suffer from high variance in the gradients, which results in undesirably large updates to the policy. Trust Region Policy Optimization (TRPO) methods are a type of local policy search algorithm that reduce this variance and provide stability by ensuring that each new policy is close to the old one in terms of average KL-divergence. CPO is a trust region method for constrained RL which approximately enforces the constraints in every policy update. Using approximations of the constraints, it predicts the amount by which the constraint costs may change after an update and then chooses the update that most improves the performance while keeping the constraint costs below the specified limits. Given a cost function C : S × A → R, let J_C(θ) denote the expected discounted return of π_θ with respect to this cost.
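Following the standard definition (reconstructed, since the original shows the equation as an image), this cost return mirrors the reward objective:

```latex
J_C(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \gamma^{t}\, C(s_t, a_t)\right]
```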
CPO returns a solution to the following problem:
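In the standard CPO form (a reconstruction; α is the constraint limit introduced below and δ denotes the trust-region step size):

```latex
\theta_{k+1} = \arg\max_{\theta}\; J(\theta)
\quad \text{s.t.} \quad J_C(\theta) \le \alpha,
\quad \bar{D}_{KL}\!\left(\pi_\theta \,\Vert\, \pi_{\theta_k}\right) \le \delta
```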
In the proposed RL approach, collision avoidance is encoded through a constraint on the expected number of crashes allowed per episode. Considering S_c ⊂ S as the set of collision states, the required state-dependent cost function is defined as:
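Reconstructed from this description, with the associated per-episode constraint:

```latex
C(s_t) = \mathbb{I}\left(s_t \in S_c\right),
\qquad
\mathbb{E}\!\left[\sum_{t=0}^{T} C(s_t)\right] \le \alpha
```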
Here, I denotes the indicator function. For the proposed model, the value of α is set to 0.4.
As with the Supervised Learning approach, the trained model is tested on three different test maps with the Jackal simulation in Gazebo. The stipulations for testing the modified network are kept the same as in the previous case. It is observed that on all three maps, the new network is able to overcome its previous issues, such as the LSTM-induced inertia, and execute collision-free trajectories, achieving a success rate of 100% without any collisions or time-outs (see figures on the right). In order to make sure that the trained model is robust to changing environments and has genuinely improved its performance, a new test map is created with a much higher difficulty level. A total of 200 random trajectories are executed on this map under the same conditions as on the previous maps, and it is observed that every single trajectory is collision-free and the network achieves a success rate of 97% on this particular map (see the bottom figure), thus verifying its robustness and superior collision avoidance capabilities compared to the Supervised Learning approach.
The Supervised-Reinforcement Learning (S-RL) approach is able to address the pitfalls suffered by its Supervised Learning (SL) based counterpart. It has learnt to avoid overshooting the goal position and, unlike Supervised Learning, it does not struggle with the robot’s orientation when navigating around an obstacle’s corner. It is also able to keep the robot at a safe distance from the obstacles, while the SL based planner would occasionally navigate quite close to them, a behavior it inherited from its expert, ROS Move Base.
Performance of the Supervised Reinforcement Learning based approach on a new test map with more obstacles. Successfully reached goals are depicted by yellow dots, time-outs by blue and crashes by cyan.
Trajectory comparison of the Supervised Learning approach and the Supervised-Reinforcement Learning approach. While the S-RL approach is able to complete the given trajectory, the SL approach fails to do so, struggling with time-outs and collisions around points 3, 7 and 9.
The S-RL method has also learnt to counter the negative effects of the LSTM-induced inertia. It is observed that the S-RL based planner steadily starts decelerating linearly when it is close enough to the goal. As a result of this deceleration, the planner is able to comfortably adjust its angular motion in order to align itself with the goal, and hence avoids situations where it has to drift around the goal in circles or overshoot its goal position and execute a recovery behavior.
The S-RL planner has also been able to overcome the earlier issue where, if a goal was located close to the robot but on the opposite side of an obstacle lying between the two, the robot would usually struggle to reach its destination, as it had not learnt the behavior of navigating around the obstacle in such cases during its training. With S-RL, the planner learns the required navigation behavior and is able to comfortably complete trajectories that could earlier have ended in a collision or time-out. The planner’s ability in this regard can be seen from its performance around the long corridor region on the left side of the new test map in the figure above, and also in the trajectory comparison plot. In the trajectory comparison, while navigating from point 9 to point 10, the Supervised Learning planner struggles and eventually ends up in a collision, whereas the S-RL planner comfortably navigates around the obstacle and completes the required motion between the two points. Similar results are observed around points 3 and 5: the SL planner struggles in the transitions 3 to 4 and 5 to 6 because it is surrounded by obstacles in close proximity, but the S-RL planner faces no such issues during any of the transitions. A video of the robot's navigation behavior in simulation using the S-RL based planner is also presented below.
After observing the test results of the S-RL based planner, it can be concluded that the new planner is able to address the issues faced by the SL based planner. The S-RL planner achieves a success rate of 100% on the test maps used for the SL based planner and delivers similar results with a 97% success rate on a much harder map. The new planner does not encounter a single collision across all 500 trajectories executed in the four test environments, making it collision-free in these tests. It is also able to address the drifts and overshoots exhibited by the SL planner.