Abstract: This paper develops KL-Ergodic Exploration from Equilibrium (KL-E^3), a method for robotic systems to integrate stability into actively generating informative measurements through ergodic exploration. Ergodic exploration enables robotic systems to indirectly sample from informative spatial distributions globally, avoiding local optima, and without the need to evaluate the derivatives of the distribution against the robot dynamics. Using hybrid systems theory, we derive a controller that allows a robot to exploit equilibrium policies (i.e., policies that solve a task) while exploring and generating informative data using an ergodic measure that extends to high-dimensional states. We show that our method maintains Lyapunov attractiveness with respect to the equilibrium task while actively generating data for learning tasks such as Bayesian optimization, model learning, and off-policy reinforcement learning. In each example we show that our proposed method is capable of generating an informative distribution of data while synthesizing smooth control signals. We illustrate these examples using simulated systems and provide a simplification of our method for real-time online learning in robotic systems.
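As a rough illustration of the switching structure described in the abstract, the sketch below shows a nominal equilibrium policy running by default while an exploratory action is applied only over a short window. This is not the paper's implementation; the function names, the window parameters `tau` and `lam`, and the Euler integration are all illustrative assumptions.

```python
import numpy as np

def mixed_control(t, x, equilibrium_policy, exploratory_control, tau, lam):
    """Hypothetical switching rule: apply the exploratory action only on the
    short window [tau, tau + lam); otherwise fall back to the equilibrium
    (task-solving) policy."""
    if tau <= t < tau + lam:
        return exploratory_control(t, x)
    return equilibrium_policy(x)

def rollout(f, x0, equilibrium_policy, exploratory_control,
            tau=0.5, lam=0.1, horizon=2.0, dt=0.01):
    """Forward-simulate dx/dt = f(x, u) under the mixed control (Euler step).
    All arguments are placeholders for the robot model and policies."""
    xs, x = [np.array(x0, dtype=float)], np.array(x0, dtype=float)
    for k in range(int(horizon / dt)):
        t = k * dt
        u = mixed_control(t, x, equilibrium_policy, exploratory_control, tau, lam)
        x = x + dt * f(x, u)
        xs.append(x.copy())
    return np.array(xs)
```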
Our method generates ergodic exploration using a sample-based KL-divergence measure, which minimizes the distance between the measure of the time-averaged distribution (blue) of the robot's trajectory (pink line) and the measure of a target spatial distribution (green).
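The sketch below gives a minimal sample-based estimate of this KL objective: the trajectory's time-averaged distribution is approximated with Gaussian smoothing around the visited states and compared against the target distribution at a set of sampled points. The smoothing width `sigma` and the normalization over the sample set are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def traj_density(s, states, sigma=0.1):
    """Gaussian-smoothed, time-averaged density of the trajectory, evaluated
    at sample points s (shape [num_samples, dim]); states has shape [T, dim].
    sigma is an illustrative smoothing width."""
    d2 = np.sum((s[:, None, :] - states[None, :, :]) ** 2, axis=-1)
    dim = s.shape[1]
    norm = (2.0 * np.pi * sigma ** 2) ** (dim / 2.0)
    return np.mean(np.exp(-0.5 * d2 / sigma ** 2), axis=1) / norm

def sampled_kl(target_pdf, states, samples, sigma=0.1, eps=1e-12):
    """Sample-based estimate of KL(p || q) between the target spatial
    distribution p and the trajectory's time-averaged distribution q,
    evaluated at the given sample points."""
    p = target_pdf(samples)
    q = traj_density(samples, states, sigma)
    p = p / (p.sum() + eps)   # normalize over the sample set
    q = q / (q.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```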
Below are more examples and multimedia results referenced in the paper. For example code, please see the GitHub repository linked below:
In many examples of Bayesian optimization, the assumption is that the learning algorithm can freely sample anywhere in the sample space (here, the cart position space); however, this is not always true. Consider an example where a robot must collect a sample from a Bayesian optimization step whose search space intersects the state-space of the robot itself. The robot is constrained by its dynamics in terms of how it can sample the objective. Thus, the Bayesian optimization step becomes a constrained optimization step where the goal is to reach the optimal value of the acquisition function subject to the dynamic constraints of the robot. Furthermore, assume that the motion of the robot is restricted to maintain an equilibrium (such as the inverted equilibrium of the cart double pendulum). The problem is then to enable a robot to execute a sample step of Bayesian optimization while taking into account these constraints. We use this example to emphasize the effectiveness of our method at exploiting local dynamic information on a cart double pendulum, where the equilibrium is the unstable upright inverted state maintained by a stabilizing policy.
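One way to connect the two pieces, shown in the sketch below, is to convert the acquisition function of a Gaussian-process surrogate into a normalized target distribution over the search space, which the ergodic controller can then cover subject to the robot's dynamics. This is an assumed construction for illustration (using scikit-learn's `GaussianProcessRegressor` and a UCB acquisition), not the paper's code.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def acquisition_target(X_obs, y_obs, grid, kappa=2.0):
    """Illustrative sketch: turn a UCB acquisition over a GP surrogate into a
    normalized target distribution on `grid` (points in the robot's search
    space).  The ergodic controller then drives the robot's time-averaged
    statistics toward this distribution rather than jumping directly to the
    acquisition maximum."""
    gp = GaussianProcessRegressor().fit(X_obs, y_obs)
    mu, std = gp.predict(grid, return_std=True)
    ucb = mu + kappa * std
    weights = np.exp(ucb - ucb.max())   # softmax-style normalization
    return weights / weights.sum()
```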
In this next example, KL-E^3 is used to collect data for learning a stochastic transition model of a quadcopter dynamical system by exploring the state-space of the quadcopter while remaining at a stable hover. Our goal is to show that our method can efficiently and effectively explore the state-space of the quadcopter (including body linear and angular velocities) in order to generate data for learning a transition model for model-based control. In addition, we show that the exploratory motions improve the quality of data generated for learning while exploiting and respecting the stable hover equilibrium in a single execution of the robotic system (Abraham et al. 2019).
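To make the data pipeline concrete, the sketch below fits a simple linear-Gaussian stand-in for the stochastic transition model from the collected (state, action, next-state) tuples. The paper learns a richer stochastic model; this minimal version only illustrates how exploration data is consumed, and the regularization constant is an assumption.

```python
import numpy as np

def fit_linear_gaussian_model(X, U, X_next):
    """Minimal stand-in for a stochastic transition model: fit
    x' ~ A x + B u + c by least squares and estimate a residual covariance."""
    Z = np.hstack([X, U, np.ones((X.shape[0], 1))])        # regression features
    W, *_ = np.linalg.lstsq(Z, X_next, rcond=None)          # stacked [A | B | c]
    residuals = X_next - Z @ W
    cov = np.cov(residuals.T) + 1e-6 * np.eye(X.shape[1])   # process noise estimate
    return W, cov

def sample_next_state(W, cov, x, u, rng=np.random.default_rng()):
    """Draw a next state from the learned stochastic model."""
    z = np.concatenate([x, u, [1.0]])
    return rng.multivariate_normal(z @ W, cov)
```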
Shown is our method (blue quadcopter) exploring the state-space of the stochastic model next to an information-maximizing method (green). Generating dynamic coverage rather than strictly optimizing an information objective avoids local optima and enables the quadcopter to be Lyapunov attractive (where instability is allowed so long as the robot eventually returns to an equilibrium).
In our last example, we explore KL-E^3 for improving robot skill learning (here we consider off-policy reinforcement learning). Our goal is to show that a robot skill can be viewed as an equilibrium (through a feedback policy), so that our method can explore intentionally within the vicinity of that skill and improve the learning process.
A common mode of failure in many examples of robot skill learning is that the resulting learned skill is highly dependent on the quality of the distribution of the data used for learning. Our approach uses dynamic coverage to plan exploratory actions that assist the learning process, as in the sketch below.
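A minimal data-collection loop of this flavor is sketched below: the learned actor serves as the equilibrium policy and an exploratory action is superimposed so the replay buffer covers a neighborhood of the skill rather than relying on unstructured noise. The `env`, `actor`, `exploratory_action`, and `replay_buffer` objects (and the Gym-style step interface) are placeholders, not APIs from the paper's code.

```python
def collect_episode(env, actor, exploratory_action, replay_buffer, horizon=200):
    """Illustrative exploration-augmented rollout for off-policy RL:
    transitions gathered around the current skill are stored for the
    off-policy learner (e.g., DDPG) to train on."""
    obs = env.reset()
    for t in range(horizon):
        action = actor(obs) + exploratory_action(t, obs)
        next_obs, reward, done, _ = env.step(action)
        replay_buffer.append((obs, action, reward, next_obs, done))
        obs = next_obs
        if done:
            break
```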
Our method combined with DDPG reduces the number of iterations needed to learn a running policy (top). Shown is the policy learned after 750 episodes (about 150,000 sampling instances), compared to DDPG at the same point, which learns a policy with suboptimal lunging behavior. KL-E^3-enhanced DDPG explores around the learned skill, assisting and improving the quality of the learned skill.
Cart-pole swing-up skill learned within 10,000 steps (50 episodes) using KL-E^3-enhanced DDPG. DDPG with the same parameters is unable to learn a swing-up policy within the same number of iterations (right).