Robust Value Iteration for Continuous Control

Abstract:

When transferring a control policy from simulation to a physical system, the policy needs to be robust to variations in the dynamics to perform well. Commonly, the optimal policy overfits to the approximate model and the corresponding state distribution, and therefore fails when transferred to the physical system. In this paper, we present robust value iteration, which uses dynamic programming to compute the optimal value function on the compact state domain and incorporates adversarial perturbations of the system dynamics. The adversarial perturbations encourage an optimal policy that is robust to changes in the dynamics. Utilizing the continuous-time perspective of reinforcement learning, we derive the optimal perturbations for the states, actions, observations, and model parameters in closed form. The resulting algorithm does not require discretization of states or actions; hence, the optimal adversarial perturbations can be efficiently incorporated into the min-max value function update. We apply the resulting algorithm to the physical Furuta pendulum and cartpole. By changing the masses of the systems, we evaluate the quantitative and qualitative performance across different model parameters. We show that robust value iteration is more robust compared to deep reinforcement learning algorithms and the non-robust version of the algorithm.
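To make the min-max structure of the value-function update concrete, the sketch below shows a deliberately simplified, discretized version of a single robust backup on sampled states. The dynamics, reward, candidate actions, and mass-perturbation set are illustrative assumptions, not the paper's implementation; the paper derives the optimal adversarial perturbations in closed form in continuous time, so it does not need the brute-force enumeration used here.

```python
import numpy as np

# Hypothetical pendulum dynamics with a perturbable mass parameter `m`
# (illustrative only; not the system model used in the paper).
def dynamics(state, action, m, dt=0.05, g=9.81, l=1.0):
    theta, omega = state
    omega_dot = (g / l) * np.sin(theta) + action / (m * l ** 2)
    omega_new = omega + dt * omega_dot
    theta_new = theta + dt * omega_new
    return np.array([theta_new, omega_new])

def reward(state, action):
    theta, omega = state
    return -(theta ** 2 + 0.1 * omega ** 2 + 0.01 * action ** 2)

# One robust (min-max) value-iteration backup on sampled states:
# the action maximizes the Bellman target while the mass perturbation,
# restricted to a bounded set, minimizes it (worst case over the model).
def robust_backup(states, value_fn, actions, mass_nominal, mass_delta, gamma=0.99):
    targets = np.empty(len(states))
    for i, s in enumerate(states):
        best = -np.inf
        for a in actions:
            # Worst-case value over the admissible mass perturbations.
            worst = np.inf
            for m in (mass_nominal - mass_delta, mass_nominal, mass_nominal + mass_delta):
                s_next = dynamics(s, a, m)
                worst = min(worst, reward(s, a) + gamma * value_fn(s_next))
            best = max(best, worst)
        targets[i] = best
    return targets

# Example usage with a trivial value estimate (all zeros):
states = [np.array([np.pi, 0.0]), np.array([0.1, -0.2])]
targets = robust_backup(states, lambda s: 0.0,
                        actions=np.linspace(-2.0, 2.0, 11),
                        mass_nominal=1.0, mass_delta=0.1)
```

The grid over actions and perturbations is only for illustration; the contribution of the paper is precisely that these inner optimizations are solved in closed form, avoiding any such discretization.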

DP rFVI - Furuta Pendulum

Videos of three rollouts per configuration, as well as the videos of the baselines, can be found on the separate Furuta pendulum page.

Evaluated mass perturbations: Δm = -5g, -2g, 0g, +1g, +3g, +5g.

DP rFVI - Cartpole

Videos of three rollouts per configuration, as well as the videos of the baselines, can be found on the separate cartpole page.

Evaluated mass perturbations: Δm = -20g, -10g, 0g, +10g, +20g, +50g.