Parameterizing Non-Parametric Meta-Reinforcement Learning Tasks via Subtask Decomposition
NeurIPS 2023
Suyoung Lee Myungsik Cho Youngchul Sung
KAIST
Meta-World ML-10 test tasks
The last rollout episode with the best return.
■ Fail ■ Success
Methods shown for each task: SDVT-LW, RL2, VariBAD, LDM
Test tasks: Drawer-open, Door-close, Shelf-place, Sweep-into, Lever-pull
SDVT-LW's generated virtual tasks
Render of the last rollout episode and the visitation map for all rollout episodes.
All states are from the task "Reach".
● Object Π Gripper
Learned Context of all ML-10 Tasks
We report the parameters of the learned contexts of three seeds for all tasks during a meta-episode:
Categorical weight for subtask decomposition y
Mean and log std of the Gaussian distribution to sample continuous context z.
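As a rough illustration of how the reported parameters relate to a sampled context, the sketch below draws a continuous context z from a diagonal Gaussian given its mean and log std, and reads off the dominant subtask from the categorical weight y. The function name `sample_context`, the argument layout, and the assumption that y is already a normalized weight vector are hypothetical choices for this example, not the authors' implementation.

```python
import math
import random


def sample_context(y_weights, z_mean, z_log_std, rng):
    """Illustrative only: sample z ~ N(mean, exp(log_std)^2) per dimension.

    y_weights is assumed (hypothetically) to be a normalized categorical
    weight vector over K subtasks; it is returned unchanged together with
    the index of its largest entry.
    """
    # Reparameterized draw: z = mean + std * eps, with eps ~ N(0, 1).
    z = [
        mu + math.exp(log_std) * rng.gauss(0.0, 1.0)
        for mu, log_std in zip(z_mean, z_log_std)
    ]
    # Index of the subtask with the largest weight.
    dominant = max(range(len(y_weights)), key=lambda k: y_weights[k])
    return z, dominant


rng = random.Random(0)
z, dominant = sample_context([0.1, 0.7, 0.2], [1.0, -1.0], [-50.0, -50.0], rng)
```

With a very small log std, as in the call above, the sampled z collapses onto the mean, which is one way to read a context that "stays constant" in the plots.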
1 meta-episode = 10 rollout episodes; 1 rollout episode = 500 steps.
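To make the horizontal axis of the context plots easier to read, the arithmetic above can be written out as follows. The constant and function names are illustrative and not taken from the authors' code.

```python
ROLLOUTS_PER_META_EPISODE = 10  # from the page: 1 meta-episode = 10 rollout episodes
STEPS_PER_ROLLOUT = 500         # from the page: 1 rollout episode = 500 steps

# Total environment steps in one meta-episode.
STEPS_PER_META_EPISODE = ROLLOUTS_PER_META_EPISODE * STEPS_PER_ROLLOUT


def locate_step(t):
    """Map a 0-based meta-episode step to (rollout index, step within rollout)."""
    if not 0 <= t < STEPS_PER_META_EPISODE:
        raise ValueError("step outside the meta-episode")
    return divmod(t, STEPS_PER_ROLLOUT)
```

For example, step 1234 of a meta-episode falls in the third rollout episode (index 2), at its step 234.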
We demonstrate that the learned contexts mostly stay constant after some initial steps on training tasks.
How the tasks are decomposed varies over random seeds.
ML-45 Subtask Decomposition
We show the decompositions for SDVT (K=45) and SDVT-LW (K=5).
(First random seed for both figures)
Analogous to the result on ML-10, similar tasks share subtasks.
Discrepancy between Success Rate and Return
SDVT-LW achieves high returns but low success rates on simple tasks such as Reach, Push, and Pick-place.
This reflects a discrepancy between return and success rate: SDVT-LW rollouts with returns close to its mean return perform the tasks well, as shown below, even though they are counted as failures.
RL2, whose actions have high variance, succeeds by chance but obtains low returns.
■ Fail ■ Success
SDVT-LW: return of the rollout episode 3662 (fail); mean return 3681
SDVT-LW: return of the rollout episode 3416 (fail); mean return 3493
SDVT-LW: return of the rollout episode 2241 (fail); mean return 2241
RL2: return of the rollout episode 2098 (success); mean return 2291
RL2: return of the rollout episode 706 (success); mean return 815
RL2: return of the rollout episode 540 (success); mean return 501
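The gap between the two metrics can be seen by summarizing the six rollouts listed above. This is a minimal sketch that assumes success is recorded as a separate binary flag per episode, independent of the summed reward. It covers only the showcased rollouts, not a full evaluation, and the helper `summarize` is a hypothetical name.

```python
from statistics import mean

# (method, episode return, success flag) for the six rollouts listed above.
ROLLOUTS = [
    ("SDVT-LW", 3662, False),
    ("SDVT-LW", 3416, False),
    ("SDVT-LW", 2241, False),
    ("RL2", 2098, True),
    ("RL2", 706, True),
    ("RL2", 540, True),
]


def summarize(rollouts):
    """Group rollouts by method and report mean return and success rate."""
    by_method = {}
    for method, ret, success in rollouts:
        by_method.setdefault(method, []).append((ret, success))
    return {
        method: {
            "mean_return": mean(r for r, _ in episodes),
            "success_rate": sum(s for _, s in episodes) / len(episodes),
        }
        for method, episodes in by_method.items()
    }


summary = summarize(ROLLOUTS)
```

On these six episodes the ranking flips depending on the metric: SDVT-LW has the higher mean return while RL2 has the higher success rate.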