Quality Diversity Reinforcement Learning for Motion Control Tasks
Master Thesis, Fudan University
Abstract
Reinforcement learning (RL) has demonstrated immense potential in robot locomotion control tasks in recent years, as it can acquire intricate control strategies from high-dimensional state and sensory data. Nonetheless, owing to limited prior knowledge, these approaches may struggle to extract effective information from environmental interactions swiftly and fully. To address this issue, this paper introduces the concept of Quality-Diversity as a form of prior knowledge for motion control tasks, with the aim of enhancing the performance of RL methods on these tasks.
Building on this idea, this paper proposes two reinforcement learning methods: one based on action quality and the other on action diversity. The former encourages the robot to make high-quality decisions during locomotion, making learning more stable and preventing potential errors from damaging the robot. The latter enables the robot to explore a broader range of actions, promoting exploration of uncertain factors in the task environment and giving the robot a more comprehensive understanding of the task, thus improving overall decision-making and performance.
This paper conducts extensive experiments on 12 motion control tasks across 3 environment settings, using 4 robots with different morphologies. The experiments are analyzed and compared from multiple perspectives, including reward curves, final performance, sample efficiency, statistical indicators, and cross-task performance. The analysis shows that the proposed methods improve both the learning efficiency and the final performance of RL across a variety of tasks, providing insights and empirical evidence for further research in this field.
Method
Part 1: Action-based Quality-driven RL
To address the challenge that RL methods cannot rapidly extract useful information from locomotion control tasks, this paper introduces the Action Quality Evolution (Acque) method, which draws inspiration from the Eureka effect in neural and cognitive psychology.
The method enhances decision quality by allowing the controller, after making a decision, to "have an epiphany" of a more valuable action, thereby guiding the policy toward higher-quality decisions.
The core idea is to apply quality evolution to the control policy's action 𝑎, transforming it into a higher-quality action called the Eureka action 𝑎+. By stimulating the controller's decision-making process with Eureka actions, the method improves decision quality and discovers superior decision schemes that were previously unknown.
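As a concrete illustration, here is a minimal sketch of one way to realize quality evolution, assuming a learned critic q_fn(s, a) is available. The perturbation-and-selection operator below is an assumption for illustration, not necessarily the thesis's exact procedure:

```python
import numpy as np

def eureka_action(q_fn, state, action, n_candidates=16, sigma=0.1,
                  low=-1.0, high=1.0):
    """Evolve the policy's action a into a higher-quality Eureka action a+.

    Sketch: perturb a (a 1-D action vector) with Gaussian noise, score every
    candidate with a learned critic q_fn(s, a), and return the best one.
    Keeping the original action in the pool guarantees Q(s, a+) >= Q(s, a).
    """
    noise = sigma * np.random.randn(n_candidates, action.shape[-1])
    candidates = np.clip(action + noise, low, high)
    pool = np.vstack([action[None, :], candidates])  # original action stays in the pool
    scores = np.array([q_fn(state, a) for a in pool])
    return pool[np.argmax(scores)]                   # Eureka action a+
```

The resulting Eureka action 𝑎+ can then be fed back into learning, for example as a supervision target in the policy update or as a relabeled transition in the replay buffer; the precise coupling used in the thesis is not specified here.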
Part 2: Action-based Diversity-driven RL
To address the issue that RL methods cannot fully extract effective information from locomotion control tasks, this paper proposes the Action Diversity Advantage (Acdia) method.
It incentivizes the control policy to execute diverse actions in stationary, familiar states, encouraging the robot to explore the various uncertainties of the environment and thereby gain better control over the task.
The core idea is to evaluate the diversity of the state-action pairs 𝛷1(𝑠, 𝑎) experienced by the control policy and the diversity of the states 𝛷2(𝑠). The difference 𝛷1(𝑠, 𝑎) − 𝛷2(𝑠) is introduced into the control policy's learning process as a diversity intrinsic reward. This encourages the policy to execute diverse actions in familiar states and helps the robot transition from stationary scenarios to various non-stationary scenarios, enhancing the policy's mastery of the task.
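A minimal sketch of this intrinsic reward follows, assuming 𝛷 is instantiated with a particle-based (k-nearest-neighbour) novelty estimate over memories of visited points; the thesis's exact estimator may differ:

```python
import numpy as np

def knn_novelty(x, memory, k=5):
    """Particle-based novelty: mean distance from x to its k nearest
    neighbours among previously visited points (a common estimator;
    the thesis may instantiate Phi differently)."""
    dists = np.linalg.norm(memory - x, axis=1)
    return np.sort(dists)[:k].mean()

def diversity_reward(state, action, state_mem, sa_mem, k=5):
    """Diversity intrinsic reward Phi1(s, a) - Phi2(s)."""
    sa = np.concatenate([state, action])
    phi1 = knn_novelty(sa, sa_mem, k)        # novelty of the pair (s, a)
    phi2 = knn_novelty(state, state_mem, k)  # novelty of the state alone
    return phi1 - phi2
```

Subtracting 𝛷2(𝑠) cancels the part of the novelty explained by the state alone, so the bonus is large only when the action itself is the novel component.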
Part 3: Quality-Diversity Driven RL
Finally, the two methods are combined so that the RL control policy's learning process accounts for both action quality and action diversity. By balancing the diversity coefficient λ_diversity against the quality coefficient λ_quality, this forms the quality-diversity reinforcement learning method for locomotion control tasks.
Diverse actions help the robot transition from a stationary state to various non-stationary states, while high-quality actions help it quickly recover a stationary state from non-stationary ones. A control policy with both abilities can quickly and fully extract effective information from the motion control task.
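Read this way, one natural form of the shaped per-step learning signal is (an assumption about how the terms combine, since the composition is described only qualitatively above):

r_total = r_env + λ_quality · r_quality + λ_diversity · (𝛷1(𝑠, 𝑎) − 𝛷2(𝑠))

where r_quality is the bonus derived from the Eureka action, for instance the critic's advantage Q(𝑠, 𝑎+) − Q(𝑠, 𝑎) from the sketch above (a hypothetical instantiation).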
Motivation Example
Part 1: Quality Part
This figure shows the performance of the Walker's controller on uneven terrain. Specifically, we selected a snapshot of the controller at the moment the task's maximum episode reward first exceeded 3500.
The visualization at the top of the figure shows the original trajectory of the robot, in which every time step executes the action output by the RL controller, represented as (s1, 𝑎1, s2, 𝑎2, ..., sT). The quality trajectory instead executes the Eureka actions obtained from action quality evolution, represented as (s1, 𝑎1+, s2+, 𝑎2+, ..., sT+).
The reward and velocity curves of the two trajectories are shown at the bottom of the figure. As can be seen, the quality trajectory executing Eureka actions is more stable and attains a higher long-term episode reward, demonstrating the superiority of Eureka actions. Under the stimulation of Eureka actions, the RL controller can quickly learn high-quality decisions, thereby improving performance in motion control tasks.
Part 2: Diversity Part
This figure shows the diversity intrinsic rewards along a trajectory executed on uneven terrain, represented by the difference between the state-action diversity 𝛷1(𝑠, 𝑎) and the state diversity 𝛷2(𝑠). Specifically, we selected a snapshot of the controller at the moment the maximum trajectory reward in this task first exceeded 2000. All marked points in the figure are moments when 𝛷1(𝑠, 𝑎) is relatively large, indicating that the current state-action pair (𝑠, 𝑎) is novel.
Yellow points 1, 3, 4, 5 are moments when 𝛷2(𝑠) is relatively small, indicating that the current state 𝑠 is familiar to the robot. In this case, the diversity of (𝑠, 𝑎) stems from the action 𝑎 itself, i.e., the robot executes a novel action in a familiar state; this is rewarded to encourage the robot to keep exploring the uncertainties of the task environment.
Red points 2, 6, 7, 8 are moments when 𝛷2(𝑠) is relatively large, indicating that the current state 𝑠 is novel to the robot. In this case, the diversity of (𝑠, 𝑎) stems from the state 𝑠. We avoid encouraging the robot to execute diverse actions in unfamiliar states, hence the lower diversity reward at the red points.
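For intuition, with purely illustrative values (not read from the figure): a yellow point with 𝛷1(𝑠, 𝑎) = 0.8 and 𝛷2(𝑠) = 0.2 receives an intrinsic reward of 0.6, whereas a red point with 𝛷1(𝑠, 𝑎) = 0.8 and 𝛷2(𝑠) = 0.7 receives only 0.1; novelty that stems from the action is thus rewarded far more than novelty that merely stems from the state.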
Experiment and Evaluation
Experiment:
Dense Reward (a-d)
Sparse Reward (e-h)
Uneven Terrain (i-l) (see the video at the beginning)
Evaluation:
Performance evaluation across all tasks with rliable.
Zero-shot Adaptation:
Broken Joint
Morphology Shift
Sensor Failure
Future Work
Future research can extend this work in the following directions:
extending the method to robots with higher degrees of freedom, such as robots driven by musculoskeletal systems, which is crucial for robots that interact with the environment in a more human-like way.
extending the method to a wider range of motion control tasks, such as flips, strides, jumps, and other tasks more delicate than running, which would make robot behavior more versatile and is crucial for real-world applications.
exploring different control modes, such as using a proportional-derivative (PD) controller as the lower-level controller while a quality-diversity driven RL policy serves as the upper-level controller. This hybrid method could provide better performance, stability, and interpretability for tasks of higher complexity.