Evolutionary Curriculum Training for DRL-Based Navigation Systems
Max Asselmeier Zhaoyi Li Kelin Yu
Abstract
The task of multi-agent collision avoidance – in which an ego-robot navigates from a starting pose to a goal pose while avoiding both static obstacles and other agents in the environment – is a classical motion planning problem. In recent years, multi-agent collision avoidance has also become a robot task of interest in the reinforcement learning (RL) community. However, current RL-based approaches solve a simpler version of this problem that involves either simple static obstacles or no static obstacles at all, placing more emphasis on the avoidance of dynamic obstacles such as other agents. As a result, such collision avoidance policies are unusable in structured environments such as mazes with several significant static obstacles. For this project, we sought to incorporate a multi-agent collision avoidance policy, namely the GPU-accelerated Asynchronous Advantage Actor-Critic for Collision Avoidance deep RL model (Evolution GA3C-CADRL) [1], into a hierarchical navigation framework to allow the ego-robot to negotiate complicated structured environments. To do so, we integrated a local waypoint planner into the Evolution GA3C-CADRL pipeline that operates on a path from a global navigation planner together with local perception data, providing the collision avoidance model with a waypoint for the ego-robot to move towards while avoiding dynamic obstacles. In doing so, we bypass the collision avoidance model's neglect of static obstacles by ensuring that no static obstacle lies between the agent and the local waypoint.
We experimentally validated the hypothesis that this local waypoint integration would yield fewer collisions by benchmarking the collision avoidance model alone against the local waypoint planner combined with the collision avoidance model across five structured environments. We found that our proposed planner + avoidance method achieved a higher success rate (measured by the robot reaching a global navigation goal within an allotted planning time) and a lower average number of collisions across the trials.
Methodology
Simulation Environment
We designed and developed a simulation environment for both evaluation and training, dividing the design variables of the training environment generator into two primary parts: map structure and dynamic obstacles.
Map Structure
To instantiate the map structure, we developed a room and maze generator capable of producing interconnected rooms linked by corridors. The map generator incorporates the following four design variables to enable the generation of diverse maps.
Room Number: Room number indicates the number of individual rooms generated in the environment.
Room Size: The room size variable is a scalar value ranging from 0 to 1 that modifies both the average and maximum room sizes.
Corridor Width: Similar to room size, corridor width is a scalar value ranging from 0 to 1 that adjusts the widths of all generated corridors.
Convexity: Convexity is a positive scalar value that indicates how rooms are connected. A lower convexity value indicates a reduced likelihood of sharp turns in the map. If convexity is 1, the connections appear as zigzags between rooms, and as convexity approaches ∞, the connection between two rooms must be either straight or composed of perpendicular segments.
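The four design variables above can be collected into a small configuration object with the stated value ranges enforced. This is a minimal sketch; the class and field names are our own illustration, not the generator's actual interface.

```python
from dataclasses import dataclass


@dataclass
class MapConfig:
    """Hypothetical configuration mirroring the four map design variables."""

    room_number: int = 6          # number of individual rooms to generate
    room_size: float = 0.5        # scalar in [0, 1] scaling average/max room size
    corridor_width: float = 0.5   # scalar in [0, 1] scaling all corridor widths
    convexity: float = 1.0        # positive; higher values favor straight/perpendicular links

    def validate(self) -> None:
        """Reject settings outside the ranges described in the text."""
        if self.room_number < 1:
            raise ValueError("room_number must be positive")
        if not 0.0 <= self.room_size <= 1.0:
            raise ValueError("room_size must lie in [0, 1]")
        if not 0.0 <= self.corridor_width <= 1.0:
            raise ValueError("corridor_width must lie in [0, 1]")
        if self.convexity <= 0.0:
            raise ValueError("convexity must be positive")
```

Validating at construction time keeps the downstream map generator free of range checks.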
Maps with different setups
Dynamic Obstacles
To generate the dynamic obstacles, we developed a pedestrian generator that spawns a diverse set of agents in the environment. The pedestrian generator incorporates the following three design variables:
Pedestrian Number: Pedestrian number indicates the number of pedestrians that will be generated in the environment.
Pedestrian Speed: In simpler pedestrian policies, the pedestrian speed remains constant, so all pedestrians with such policies move at the same speed. In more complex policies, such as circle walking and random walking, the pedestrian speed sets the mean of the randomly generated speeds.
Pedestrian Policy: The pedestrian policy variable, ranging from 0% to 80%, indicates the percentage of challenging policies, including circle walking and random walking, present in the environment. Higher values correspond to a greater proportion of challenging policies.
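The three variables above suggest a simple spawning routine: draw each pedestrian's policy according to the challenging-policy ratio, then fix or randomize its speed accordingly. The sketch below is an assumption about how such a generator could work; the policy names and speed distribution are illustrative, not taken from the original codebase.

```python
import random


def assign_policies(n_pedestrians, policy_ratio, base_speed, rng=None):
    """Sketch of a pedestrian generator using the three design variables.

    policy_ratio in [0.0, 0.8] is the fraction of challenging policies
    (circle walking / random walking); remaining pedestrians follow a
    simple constant-speed policy.
    """
    rng = rng or random.Random()
    pedestrians = []
    for _ in range(n_pedestrians):
        if rng.random() < policy_ratio:
            # challenging policies draw their speed around the base value
            policy = rng.choice(["circle_walk", "random_walk"])
            speed = rng.gauss(base_speed, 0.2 * base_speed)
        else:
            # simple policies keep the pedestrian speed fixed
            policy = "constant"
            speed = base_speed
        pedestrians.append({"policy": policy, "speed": max(speed, 0.0)})
    return pedestrians
```

Passing an explicit `rng` makes spawned scenarios reproducible across training runs.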
Pedestrians with different policies
Performance Evaluation
To gauge training environment difficulty, we define a unified PerfScore as both the evaluation metric and the reward function for the learned collision avoidance model. PerfScore measures a model's ability to navigate, avoid static obstacles, and avoid dynamic obstacles.
In order to analyze which types of environments are more challenging, we analyze the impact of each environmental variable on PerfScore. We select the minimum and maximum values for each variable and generate 500 maps for each combination. Over 5000 iterations, we calculate the average PerfScore. Based on the difference between the minimum and maximum PerfScore, we rank the variables and types of environment from hard to easy.
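The ranking step above can be sketched as follows: for each variable, compare the average PerfScore at its minimum and maximum setting, and sort variables by the size of that gap. The function is our illustration of the described analysis; the example scores are made-up placeholders, not results from the paper.

```python
def rank_variables_by_impact(perf_scores):
    """Rank environment variables hard-to-easy by PerfScore gap.

    perf_scores maps each variable name to a (score_at_min, score_at_max)
    pair of average PerfScores, each averaged over the generated maps.
    """
    impact = {v: abs(hi - lo) for v, (lo, hi) in perf_scores.items()}
    return sorted(impact, key=impact.get, reverse=True)


# Illustrative (made-up) averages over 500 maps per setting:
scores = {
    "pedestrian_number": (0.42, 0.81),
    "corridor_width":    (0.55, 0.78),
    "room_size":         (0.63, 0.74),
}
print(rank_variables_by_impact(scores))
# → ['pedestrian_number', 'corridor_width', 'room_size']
```

A larger gap means the variable moves PerfScore more, so it marks a harder class of environments to prioritize during curriculum training.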
Evaluation of the GA3C-CADRL model
Evolutionary Curriculum Training
We use a curriculum training approach to improve the DRL model's performance in difficult environments. We deliberately expose the model to demanding scenarios, inducing failures and reducing the PerfScore, which ensures that failures occur within the targeted environment. During each training iteration, the training environment generates random maps and pedestrian setups based on the environment variables. Once the model's PerfScore rises beyond a threshold, the training environment gradually increases the difficulty by adjusting variables until reaching the maximum level. After every curriculum iteration, we increase the overall difficulty by adjusting all other variables, and after every two curriculum iterations, we move on to training the model in the next most challenging environment.
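The difficulty schedule described above can be summarized in a small update rule: raise the environment difficulty by one step whenever the model's PerfScore clears the threshold, capped at the maximum level. This is a simplified sketch of the schedule; the threshold, step size, and cap are assumed values, not the ones used in training.

```python
def curriculum_step(difficulty, perf_score, threshold=0.7, step=0.1, max_level=1.0):
    """One iteration of the difficulty schedule.

    If the model's PerfScore exceeds the threshold, raise the environment
    difficulty by one step, capped at max_level; otherwise keep it fixed.
    """
    if perf_score > threshold:
        difficulty = min(difficulty + step, max_level)
    return difficulty


# Example schedule over a few training iterations:
d = 0.2
for score in [0.5, 0.75, 0.9, 0.6, 0.95]:
    d = curriculum_step(d, score)
print(round(d, 1))  # difficulty raised on the three iterations that passed the threshold
```

In the full curriculum, this per-variable update would be interleaved with the broader adjustments described above (all other variables after each iteration, the next hardest environment after every two).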
Experiment Setup
We conducted an experiment in which we evaluated five distinct models on five different maps. The models we tested included one RL-based model, CADRL, without a waypoint planner or map information, as well as four models that used the waypoint planner: MPC with waypoint planner, CADRL with waypoint planner, GA3C with waypoint planner, and our final model, GA3C with waypoint planner and the evolutionary training environment.
The five maps we used were: "quarry," containing randomly generated squares and points; "forest," containing only random points; "room1" and "room2," generated by the Evolution Map Generator; and an irregularly shaped room containing obstacles.
In each map, we generated 20 agents in various locations. These agents began moving in a cycle when our trained agent approached them. Additionally, some agents emerged unexpectedly from behind walls, where the sensor was unable to detect them, to assess the reaction speed and generalization capabilities of our model.
Five different maps named Campus, Forest, Room 1, Room 2, and Irregular Room.
Results
Results of baseline methods against the waypoint planner (WP) + Evolutionary-GA3C across five different maps and 15 different goals. For each method, the first column shows the success rate over ten navigation runs, and the second column shows the total number of collisions over those runs. Methods to the left of the vertical line do not use the Waypoint Planner; methods to the right do.
Evolution GA3C in Empty Room with Random Dynamic Agents
Example environment with 30 randomly generated agents (red) with different velocities and radii. The blue lines show the trajectory of the robot (green), which navigates through the dynamic obstacles.
WP + Evolution GA3C Simulation Demos
Demos of successful runs along 15 different paths across 5 different maps.