Mengdi Xu, Peide Huang, Yaru Niu, Visak Kumar, Jielin Qiu, Chao Fang, Kuan-Hui Lee, Xuewei Qi, Henry Lam, Bo Li, Ding Zhao
AISTATS 2023
Abstract
One key challenge for multi-task reinforcement learning (RL) in practice is the absence of task indicators. Robust RL has been applied to deal with task ambiguity but may result in overly conservative policies. To balance the worst-case (robustness) and average performance, we propose the Group Distributionally Robust Markov Decision Process (GDR-MDP), a flexible hierarchical MDP formulation that encodes task groups via a latent mixture model. GDR-MDP identifies the optimal policy that maximizes the expected return under the worst-possible qualified belief over task groups within an ambiguity set. We rigorously show that GDR-MDP's hierarchical structure improves distributional robustness by adding regularization to the worst possible outcomes. We then develop robust deep RL algorithms for GDR-MDP, covering both value-based and policy-based methods. Extensive experiments on Box2D control tasks, MuJoCo benchmarks, and the Google Research Football platform show that our algorithms outperform classic robust training algorithms across diverse environments in terms of robustness under belief uncertainties.
Group Distributionally Robust MDP
To handle belief uncertainty, we formulate the Group Distributionally Robust Markov Decision Process (GDR-MDP). GDR-MDP naturally balances the worst-case (robustness) and average performance by leveraging an adaptive belief together with a distributionally robust formulation. To build GDR-MDP, we first formulate the Hierarchical-Latent MDP (HLMDP), which uses a mixture model over MDPs to encode task subpopulations. HLMDP has a high-level latent variable z representing the mixture (task group) and a low-level variable m representing tasks. GDR-MDP inherits this encoding of task subpopulations and additionally formulates robustness with respect to the ambiguity of the adaptive belief b(z) over mixtures.
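As an illustrative sketch (our notation here, not necessarily the paper's exact symbols), the GDR-MDP objective takes a max-min form over policies and qualified beliefs:

$$\max_{\pi}\ \min_{b \,\in\, \mathcal{B}_{\epsilon}(\hat{b})}\ \mathbb{E}_{z \sim b}\,\mathbb{E}_{m \sim p(\cdot \mid z)}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{m}(s_t, a_t)\right],$$

where $\hat{b}$ is the current adaptive belief over mixtures $z$, $\mathcal{B}_{\epsilon}(\hat{b})$ is the ambiguity set of qualified beliefs around $\hat{b}$, $p(\cdot \mid z)$ is the task distribution within group $z$, and $r_m$ is the reward function of task $m$.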
Figure 1. Task groups and the graphical model.
Figure 2. Hierarchical latent bandit examples and ambiguity sets.
Experiments
To solve GDR-MDP, we develop novel robust deep RL algorithms, including GDR-DQN based on deep Q-learning, GDR-SAC based on soft actor-critic, and GDR-PPO based on proximal policy optimization (PPO). We conduct extensive experiments, evaluating GDR-DQN in LunarLander, GDR-SAC in HalfCheetah, and GDR-PPO in Google Research Football.
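As a minimal sketch of the inner minimization these algorithms rely on, assuming a total-variation ambiguity set around the current belief (the paper's exact ambiguity set and solver may differ), the worst-case belief over groups can be found greedily by shifting probability mass toward the lowest-value group:

```python
import numpy as np

def worst_case_belief(belief, group_values, radius):
    """Greedy solution of min_b <b, group_values> over a total-variation ball
    of the given radius around `belief`, intersected with the simplex.
    The TV ambiguity set is an illustrative assumption, not the paper's exact choice.

    belief:       current belief over groups, shape (K,)
    group_values: estimated value of acting under each group, shape (K,)
    radius:       total-variation radius, i.e. the amount of mass that may be moved
    """
    b = np.asarray(belief, dtype=float).copy()
    v = np.asarray(group_values, dtype=float)
    worst = int(np.argmin(v))              # group with the lowest estimated value
    budget = radius
    for z in np.argsort(v)[::-1]:          # take mass from high-value groups first
        if z == worst or budget <= 0:
            break
        moved = min(b[z], budget)          # cannot move more mass than the group holds
        b[z] -= moved
        b[worst] += moved
        budget -= moved
    return b
```

A GDR-style robust target can then weight per-group value estimates by this worst-case belief rather than the nominal belief when forming the Bellman backup or the policy-gradient objective.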
We compare our proposed GDR methods with robust and non-robust baselines in terms of training stability and robustness under belief noise.
Figure 3. Training performance of GDR and baselines. GDR has stronger training stability than DR, showing that the hierarchical structure helps regularize the adversary's strength.
Figure 4. Robustness to belief noise. GDR is more robust to belief noise than all other baselines.
Task Visualization
We visualize the four tasks from the two groups in the Google Research Football environment. The videos in this section are generated with policies that use ground-truth group indices.
Group 0 (CM vs. CB)
Task 0: CM (0.9) vs. CB (0.6)
Task 1: CM (1.0) vs. CB (0.7)
Group 1 (CB vs. CM)
Task 2: CB (0.9) vs. CM (0.6)
Task 3: CB (1.0) vs. CM (0.7)
Robustness Evaluation
We show rollouts of different methods under various levels of belief noise. The reported success rates are averaged over the 20 visualized episodes. We refer to our proposed method as GDR and mark the best success rates in red.
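For the belief-noise evaluation, one plausible way to generate a perturbed belief at a given noise level is to mix the ground-truth group belief with a random point on the simplex; the Dirichlet noise model below is an illustrative assumption, not necessarily the exact protocol used in the paper:

```python
import numpy as np

def perturb_belief(true_belief, noise_level, rng=None):
    """Mix the ground-truth group belief with Dirichlet noise (assumed noise model).

    noise_level in [0, 1]: 0 returns the true belief, 1 returns pure noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.dirichlet(np.ones_like(true_belief, dtype=float))
    return (1.0 - noise_level) * np.asarray(true_belief, dtype=float) + noise_level * noise

# Example: a two-group belief perturbed at noise level 0.2
noisy = perturb_belief(np.array([1.0, 0.0]), noise_level=0.2)
```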
Noise Level = 0.2
GDR - Overall Success Rate: 94%
Task 0 - GDR (85%)
Task 1 - GDR (95%)
Task 2 - GDR (95%)
Task 3 - GDR (100%)
G-Exact with Noisy Belief - Overall Success Rate: 70%
Task 0 - G-Exact with Noisy Belief Input (60%)
Task 1 - G-Exact with Noisy Belief Input (65%)
Task 2 - G-Exact with Noisy Belief Input (75%)
Task 3 - G-Exact with Noisy Belief Input (80%)
DR - Overall Success Rate: 79%
Task 0 - DR (55%)
Task 1 - DR (70%)
Task 2 - DR (100%)
Task 3 - DR (90%)
Noise Level = 0.6
GDR - Overall Success Rate: 88%
Task 0 - GDR (65%)
Task 1 - GDR (95%)
Task 2 - GDR (90%)
Task 3 - GDR (100%)
G-Exact with Noisy Belief - Overall Success Rate: 40%
Task 0 - G-Exact with Noisy Belief Input (50%)
Task 1 - G-Exact with Noisy Belief Input (40%)
Task 2 - G-Exact with Noisy Belief Input (20%)
Task 3 - G-Exact with Noisy Belief Input (50%)
DR - Overall Success Rate: 66%
Task 0 - DR (35%)
Task 1 - DR (70%)
Task 2 - DR (85%)
Task 3 - DR (75%)