A Distributional View on Multi-Objective Policy Optimization
The policy trained with scalarized MPO initially learns a forceful insertion strategy: it bumps the peg against the side of the hole to detect where the hole is. In contrast, MO-MPO learns to carefully position the peg above the hole before inserting it. This is visible in the videos, taken after 18 hours of training (indicated by the dotted line in the learning curves).
Scalarized MPO: hits the edge of the hole
MO-MPO: hovers above the hole without bumping into it