SDVT

Parameterizing Non-Parametric Meta-Reinforcement Learning Tasks via Subtask Decomposition
NeurIPS 2023

Suyoung Lee Myungsik Cho Youngchul Sung

KAIST

Meta-World ML-10 test tasks

The last rollout episode with the best return.

■ Fail ■ Success

Videos, one per method and task:
Methods: SDVT-LW, RL2, VariBAD, LDM
Tasks: Drawer-open, Door-close, Shelf-place, Sweep-into, Lever-pull

SDVT-LW's generated virtual tasks

Render of the last rollout episode and the visitation map for all rollout episodes.

All states are from the task "Reach".

● Object Π Gripper

ỹ = [1, 0, 0, 0, 0]

ỹ = [0, 1, 0, 0, 0]

ỹ = [0, 0, 1, 0, 0]

ỹ = [0, 0, 0, 1, 0]

ỹ = [0, 0, 0, 0, 1]

ỹ = [0.5, 0.5, 0, 0, 0]

ỹ = [0, 0, 0, 0.5, 0.5]

ỹ = [0, 0, 0.333, 0.333, 0.333]
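The weight vectors above can be inspected with a minimal sketch. It assumes each ỹ is a vector of categorical weights over K = 5 subtasks, which is how the page describes y for SDVT-LW. The helper names `active_subtasks` and `is_on_simplex` are hypothetical and are not taken from the authors' code; the tolerance only absorbs the display rounding in 0.333.

```python
# Virtual-task weights copied from the captions above.
VIRTUAL_WEIGHTS = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [0.5, 0.5, 0, 0, 0],
    [0, 0, 0, 0.5, 0.5],
    [0, 0, 0.333, 0.333, 0.333],
]


def active_subtasks(weights, eps=1e-9):
    """Return the 1-based indices of dimensions with non-zero weight."""
    return [i + 1 for i, w in enumerate(weights) if w > eps]


def is_on_simplex(weights, tol=1e-2):
    """Check that weights are non-negative and sum to 1 up to rounding."""
    return all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) <= tol


for y in VIRTUAL_WEIGHTS:
    print(y, "active:", active_subtasks(y), "on simplex:", is_on_simplex(y))
```

The first five vectors each select a single subtask, while the last three blend two or three subtasks in equal proportion.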

Learned Context of all ML-10 Tasks

We report the parameters of the learned contexts across three seeds for all tasks during a meta-episode:

Categorical weights for the subtask decomposition y.
Mean and log std of the Gaussian distribution from which the continuous context z is sampled.
1 meta-episode = 10 rollout episodes; 1 rollout episode = 500 steps.
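The episode arithmetic and the Gaussian parameterization can be illustrated with a short sketch. It assumes steps are indexed from 0 within a meta-episode and that z is drawn independently per dimension from a Gaussian with the reported mean and standard deviation exp(log std). The function names `locate_step` and `sample_context` are hypothetical and do not come from the authors' implementation.

```python
import math
import random

ROLLOUTS_PER_META_EPISODE = 10  # stated on the page
STEPS_PER_ROLLOUT = 500         # stated on the page
STEPS_PER_META_EPISODE = ROLLOUTS_PER_META_EPISODE * STEPS_PER_ROLLOUT  # 5000


def locate_step(t):
    """Map a 0-based meta-episode step to (rollout index, step in rollout)."""
    if not 0 <= t < STEPS_PER_META_EPISODE:
        raise ValueError(f"step {t} is outside the meta-episode")
    return divmod(t, STEPS_PER_ROLLOUT)


def sample_context(mean, log_std, rng=random):
    """Draw z with z[i] ~ Normal(mean[i], exp(log_std[i]) ** 2)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0)
            for m, s in zip(mean, log_std)]


print(STEPS_PER_META_EPISODE)   # total steps on the x-axis of one meta-episode
print(locate_step(1250))        # rollout index and step within that rollout
print(sample_context([0.0, 1.0], [-1.0, -2.0]))
```

Storing the log of the standard deviation keeps the standard deviation positive for any real-valued network output, which is a common reason for this parameterization.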

We demonstrate that the learned contexts mostly stay constant after some initial steps on training tasks.

How the tasks are decomposed varies over random seeds.

Legend: Dim. 1, Dim. 2, Dim. 3, Dim. 4, Dim. 5

ML-45 Subtask Decomposition

We show the decompositions for SDVT (K=45) and SDVT-LW (K=5), using the first random seed for both figures.

Analogous to the result on ML-10, similar tasks share subtasks.

SDVT-LW

SDVT

Discrepancy between Success Rate and Return

SDVT-LW achieves high returns but low success rates on simple tasks such as Reach, Push, and Pick-place.

This stems from a discrepancy between return and success rate: as shown below, SDVT-LW rollouts with returns close to the mean solve the tasks well even though they are labeled as failures.

RL2, whose actions have high variance, succeeds by chance but with low returns.

■ Fail ■ Success

SDVT-LW Reach

Return of the rollout episode: 3662 (fail)

SDVT-LW mean return: 3681

SDVT-LW Push

Return of the rollout episode: 3416 (fail)

SDVT-LW mean return: 3493

SDVT-LW Pick-place

Return of the rollout episode: 2241 (fail)

SDVT-LW mean return: 2241

RL2 Reach

Return of the rollout episode: 2098 (success)

RL2 mean return: 2291

RL2 Push

Return of the rollout episode: 706 (success)

RL2 mean return: 815

RL2 Pick-place

Return of the rollout episode: 540 (success)

RL2 mean return: 501
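The numbers above can be put side by side with a small sketch. The returns and outcomes are copied from the captions; the record layout and the `compare` helper are hypothetical and serve only to make the gap explicit.

```python
# (method, task, rollout return, rollout succeeded, mean return), as reported above.
ROLLOUTS = [
    ("SDVT-LW", "Reach", 3662, False, 3681),
    ("SDVT-LW", "Push", 3416, False, 3493),
    ("SDVT-LW", "Pick-place", 2241, False, 2241),
    ("RL2", "Reach", 2098, True, 2291),
    ("RL2", "Push", 706, True, 815),
    ("RL2", "Pick-place", 540, True, 501),
]


def compare(task):
    """Compare the SDVT-LW and RL2 rollouts shown for one task."""
    by_method = {m: (ret, ok) for m, t, ret, ok, _ in ROLLOUTS if t == task}
    sdvt_return, sdvt_ok = by_method["SDVT-LW"]
    rl2_return, rl2_ok = by_method["RL2"]
    return {
        "task": task,
        "return_gap": sdvt_return - rl2_return,  # SDVT-LW minus RL2
        "sdvt_lw_success": sdvt_ok,
        "rl2_success": rl2_ok,
    }


for task in ("Reach", "Push", "Pick-place"):
    print(compare(task))
```

On all three tasks the SDVT-LW rollout has the higher return while only the RL2 rollout is marked as a success, which is the mismatch this section describes.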
