A Distributional View on Multi-Objective Policy Optimization
The policy trained with scalarized MPO initially learns a forceful insertion strategy: it bumps the peg against the side of the hole to detect where the hole is. In contrast, MO-MPO learns to carefully position the peg above the hole before inserting it. This is visible in the videos, taken after 18 hours of training (indicated by the dotted line in the learning curves).
Scalarized MPO: hits the edge of the hole
MO-MPO: hovers above the hole without bumping into it