Quality Diversity Reinforcement Learning for Motion Control Tasks
Master Thesis, Fudan University
Abstract
Reinforcement learning (RL) has demonstrated immense potential in robot locomotion control tasks in recent years, as it can acquire intricate control strategies from high-dimensional state and sensory data. Nonetheless, owing to limited prior knowledge, these approaches may struggle to extract effective information from environmental interactions swiftly and fully. To address this issue, this paper introduces the concept of Quality-Diversity as a form of prior knowledge for motion control tasks, with the aim of enhancing the performance of RL methods on these tasks.
Building on this idea, this paper proposes two reinforcement learning methods: one based on action quality and the other on action diversity. The former encourages the robot to make high-quality decisions during locomotion, making learning more stable and preventing potential errors from damaging the robot. The latter enables the robot to explore a broader range of actions, promoting exploration of uncertain factors in the task environment and giving the robot a more comprehensive understanding of the task, thus improving overall decision-making and performance.
This paper conducts extensive experiments on 12 motion control tasks across 3 environment settings, using 4 robots with different morphologies. The experiments are analyzed and compared from multiple perspectives, including reward curves, final performance, sample efficiency, statistical indicators, and cross-task performance. The analysis shows that the proposed methods improve both the learning efficiency and the final performance of RL across a variety of tasks, providing insights and empirical evidence for further research in this field.
Method
Part 1: Action-based Quality-driven RL
To address the challenge that RL methods cannot rapidly extract useful information from locomotion control tasks, this paper introduces the Action Quality Evolution (Acque) method, which draws inspiration from the Eureka effect in neural and cognitive psychology.
The method enhances decision quality by allowing the controller, after making a decision, to "have an epiphany" of a more valuable action, thereby guiding the policy toward higher-quality decisions.
The core idea is to apply quality evolution to the control policy's action 𝑎, transforming it into a higher-quality action called the Eureka action 𝑎+. By stimulating the controller's decision-making process with Eureka actions, the method improves decision quality and discovers superior decision schemes that were previously unknown.
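As a concrete illustration, here is a minimal sketch of one way to realize quality evolution, assuming a learned critic q_fn(s, a) is available. The perturbation-and-selection operator below is an assumption for illustration, not necessarily the thesis's exact procedure:

```python
import numpy as np

def eureka_action(q_fn, state, action, n_candidates=16, sigma=0.1,
                  low=-1.0, high=1.0):
    """Evolve the policy's action a into a higher-quality Eureka action a+.

    Sketch: perturb a (a 1-D action vector) with Gaussian noise, score every
    candidate with a learned critic q_fn(s, a), and return the best one.
    Keeping the original action in the pool guarantees Q(s, a+) >= Q(s, a).
    """
    noise = sigma * np.random.randn(n_candidates, action.shape[-1])
    candidates = np.clip(action + noise, low, high)
    pool = np.vstack([action[None, :], candidates])  # original action stays in the pool
    scores = np.array([q_fn(state, a) for a in pool])
    return pool[np.argmax(scores)]                   # Eureka action a+
```

The resulting Eureka action 𝑎+ can then be fed back into learning, for example as a supervision target in the policy update or as a relabeled transition in the replay buffer; the precise coupling used in the thesis is not specified here.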
Part 2: Action-based Diversity-driven RL
To address the issue that RL methods cannot fully extract effective information from locomotion control tasks, this paper proposes the Action Diversity Advantage (Acdia) method.
It incentivizes the control policy to execute diverse actions in stationary, familiar states, encouraging the robot to explore the various uncertainties of the environment and thereby gain better control over the task.
The core idea is to evaluate the diversity of the state-action pairs 𝛷1(𝑠, 𝑎) experienced by the control policy and the diversity of the states 𝛷2(𝑠). The difference 𝛷1(𝑠, 𝑎) − 𝛷2(𝑠) is introduced into the control policy's learning process as a diversity intrinsic reward. This encourages the policy to execute diverse actions in familiar states and helps the robot transition from stationary scenarios to various non-stationary scenarios, enhancing the policy's mastery of the task.
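A minimal sketch of this intrinsic reward follows, assuming 𝛷 is instantiated with a particle-based (k-nearest-neighbour) novelty estimate over memories of visited points; the thesis's exact estimator may differ:

```python
import numpy as np

def knn_novelty(x, memory, k=5):
    """Particle-based novelty: mean distance from x to its k nearest
    neighbours among previously visited points (a common estimator;
    the thesis may instantiate Phi differently)."""
    dists = np.linalg.norm(memory - x, axis=1)
    return np.sort(dists)[:k].mean()

def diversity_reward(state, action, state_mem, sa_mem, k=5):
    """Diversity intrinsic reward Phi1(s, a) - Phi2(s)."""
    sa = np.concatenate([state, action])
    phi1 = knn_novelty(sa, sa_mem, k)        # novelty of the pair (s, a)
    phi2 = knn_novelty(state, state_mem, k)  # novelty of the state alone
    return phi1 - phi2
```

Subtracting 𝛷2(𝑠) cancels the part of the novelty explained by the state alone, so the bonus is large only when the action itself is the novel component.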
Part 3: Quality-Diversity Driven RL
Finally, the two methods are combined so that the RL control policy's learning process accounts for both action quality and action diversity. By balancing the diversity coefficient λ_diversity against the quality coefficient λ_quality, this forms the quality-diversity reinforcement learning method for locomotion control tasks.
Diverse actions help the robot transition from a stationary state to various non-stationary states, while high-quality actions help it quickly recover a stationary state from non-stationary ones. A control policy with both abilities can quickly and fully extract effective information from the motion control task.
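Read this way, one natural form of the shaped per-step learning signal is (an assumption about how the terms combine, since the composition is described only qualitatively above):

r_total = r_env + λ_quality · r_quality + λ_diversity · (𝛷1(𝑠, 𝑎) − 𝛷2(𝑠))

where r_quality is the bonus derived from the Eureka action, for instance the critic's advantage Q(𝑠, 𝑎+) − Q(𝑠, 𝑎) from the sketch above (a hypothetical instantiation).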
Motivation Example
Part 1: Quality Part
This figure shows the performance of the Walker's controller on uneven terrain. Specifically, we selected a snapshot of the controller at the moment the task's maximum episode reward first exceeded 3500.
The visualization at the top of the figure shows the original trajectory of the robot, in which every time step executes the action output by the RL controller, represented as (s1, 𝑎1, s2, 𝑎2, ..., sT). The quality trajectory instead executes the Eureka actions obtained from action quality evolution, represented as (s1, 𝑎1+, s2+, 𝑎2+, ..., sT+).
The reward and velocity curves of the two trajectories are shown at the bottom of the figure. As can be seen, the quality trajectory executing Eureka actions is more stable and attains a higher long-term episode reward, demonstrating the superiority of Eureka actions. Under the stimulation of Eureka actions, the RL controller can quickly learn high-quality decisions, thereby improving performance in motion control tasks.
Part 2: Diversity Part
This figure shows the diversity intrinsic rewards along a trajectory executed on uneven terrain, represented by the difference between the state-action diversity 𝛷1(𝑠, 𝑎) and the state diversity 𝛷2(𝑠). Specifically, we selected a snapshot of the controller at the moment the maximum trajectory reward in this task first exceeded 2000. All marked points in the figure are moments when 𝛷1(𝑠, 𝑎) is relatively large, indicating that the current state-action pair (𝑠, 𝑎) is novel.
Yellow points 1, 3, 4, 5 are moments when 𝛷2(𝑠) is relatively small, indicating that the current state 𝑠 is familiar to the robot. In this case, the diversity of (𝑠, 𝑎) stems from the action 𝑎 itself, i.e., the robot executes a novel action in a familiar state; this is rewarded to encourage the robot to keep exploring the uncertainties of the task environment.
Red points 2, 6, 7, 8 are moments when 𝛷2(𝑠) is relatively large, indicating that the current state 𝑠 is novel to the robot. In this case, the diversity of (𝑠, 𝑎) stems from the state 𝑠. We avoid encouraging the robot to execute diverse actions in unfamiliar states, hence the lower diversity reward at the red points.
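For intuition, with purely illustrative values (not read from the figure): a yellow point with 𝛷1(𝑠, 𝑎) = 0.8 and 𝛷2(𝑠) = 0.2 receives an intrinsic reward of 0.6, whereas a red point with 𝛷1(𝑠, 𝑎) = 0.8 and 𝛷2(𝑠) = 0.7 receives only 0.1; novelty that stems from the action is thus rewarded far more than novelty that merely stems from the state.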
Experiment and Evaluation
Experiment:
Dense Reward (a-d)
Sparse Reward (e-h)
Uneven Terrain (i-l) (see the video at the beginning)
Evaluation:
Performance evaluation across all tasks with rliable.
Zero-shot Adaptation:
Broken Joint
Morphology Shift
Sensor Failure
Future Work
Future research can extend this work in the following directions:
extending the method to robots with higher degrees of freedom, such as robots driven by musculoskeletal systems, which is crucial for robots that interact with the environment in a more human-like way.
extending the method to a wider range of motion control tasks, such as flips, strides, jumps, and other tasks more delicate than running, which would make robot behavior more versatile and is crucial for real-world applications.
exploring different control modes, such as using a proportional-derivative (PD) controller as the lower-level controller while a quality-diversity driven RL policy serves as the upper-level controller. This hybrid method could provide better performance, stability, and interpretability for tasks of higher complexity.