Efficiently Learning Small Policies for Locomotion and Manipulation

Shashank Hegde Gaurav S. Sukhatme

University of Southern California

Abstract

Neural control of memory-constrained, agile robots requires small yet highly performant models. We leverage graph hyper networks to learn graph hyper policies trained with off-policy reinforcement learning, resulting in networks that are two orders of magnitude smaller than commonly used networks yet encode policies comparable to those of much larger networks trained on the same task. We show that our method can be appended to any off-policy reinforcement learning algorithm, without any change in hyperparameters, and demonstrate results across locomotion and manipulation tasks. Further, we obtain an array of working policies with differing numbers of parameters, allowing us to pick an optimal network for the memory constraints of a system. Training multiple policies with our method is as sample efficient as training a single policy. Finally, we provide a method to select the best architecture, given a constraint on the number of parameters.

Video Summary

ICRA23_2789_VI_i.mp4

Rollouts

For each task, we show the smallest network architecture we found that achieves at least 90% of peak performance (a sketch reproducing the listed parameter counts follows the list).

hopper.mkv

Hopper [8,4,4,4] - 187 parameters

hc.mkv

HalfCheetah [64,4,4,32] - 1790 parameters

walker.mkv

Walker2D [16,16,4] - 658 parameters

humaoid.mkv

Humanoid [8,32,8] - 3721 parameters

ant.mkv

Ant [16,32,32,4] - 3564 parameters

reach.mkv

FetchReach [4,8] - 132 parameters 

push.mkv

FetchPush [32,32,8,16] - 2460 parameters

slide.mkv

FetchSlide [64,64,32,16] - 8692 parameters

pap.mkv

FetchPickAndPlace [16,16,8] - 824 parameters
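The parameter counts above are consistent with fully connected policies that include bias terms. A minimal sketch of that count (the 11-dimensional observation and 3-dimensional action used for Hopper are an assumption about the standard Gym environment version):

# Count the parameters of a fully connected policy with bias terms.
def mlp_param_count(obs_dim, hidden_sizes, act_dim):
    sizes = [obs_dim] + list(hidden_sizes) + [act_dim]
    # Each layer contributes fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))

print(mlp_param_count(11, [8, 4, 4, 4], 3))  # 187, matching the Hopper entry above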

Method Overview

GHP control policy weight estimation: The Graph Hyper Policy (GHP) estimates the weights of a control policy network with a given architecture. In this way, a single GHP can estimate control policy weights for multiple architectures.
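The sketch below is a deliberately simplified stand-in for the GHP (it embeds each layer's fan-in/fan-out rather than running message passing over the architecture graph, and all class and variable names are illustrative). It shows the key property: one set of generator parameters produces policy weights for any requested architecture, and actions computed with those weights remain differentiable with respect to the generator.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyHyperPolicy(nn.Module):
    """Illustrative hyper policy: generates MLP policy weights from an
    embedding of each layer's (fan_in, fan_out)."""
    def __init__(self, max_fan=64, embed_dim=32):
        super().__init__()
        self.max_fan = max_fan
        self.layer_embed = nn.Sequential(nn.Linear(2, embed_dim), nn.ReLU(),
                                         nn.Linear(embed_dim, embed_dim))
        # Emit an over-sized weight block and bias per layer, then crop to shape.
        self.weight_head = nn.Linear(embed_dim, max_fan * max_fan)
        self.bias_head = nn.Linear(embed_dim, max_fan)

    def generate(self, obs_dim, hidden_sizes, act_dim):
        sizes = [obs_dim] + list(hidden_sizes) + [act_dim]
        params = []
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
            e = self.layer_embed(torch.tensor([[fan_in / self.max_fan,
                                                fan_out / self.max_fan]]))
            W = self.weight_head(e).view(self.max_fan, self.max_fan)[:fan_out, :fan_in]
            b = self.bias_head(e).view(self.max_fan)[:fan_out]
            params.append((W, b))
        return params

    def act(self, obs, params):
        # Run the generated policy; gradients flow back into the generator.
        h = obs
        for i, (W, b) in enumerate(params):
            h = F.linear(h, W, b)
            h = torch.relu(h) if i < len(params) - 1 else torch.tanh(h)
        return h

ghp = TinyHyperPolicy()
hopper_params = ghp.generate(obs_dim=11, hidden_sizes=[8, 4, 4, 4], act_dim=3)
action = ghp.act(torch.randn(1, 11), hopper_params)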

Off-policy RL using the GHP: In an existing actor-critic, off-policy reinforcement learning algorithm, we replace the vanilla actor policy with the GHP.
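As a hedged sketch of that replacement (reusing the TinyHyperPolicy above as a stand-in for the GHP, a placeholder critic, and a DDPG-style deterministic policy gradient; the actual algorithm, critic, and hyperparameters are whatever the base off-policy method already uses):

# Illustrative actor update: instead of a vanilla actor, we sample a candidate
# architecture, generate its weights with the hyper policy, and backpropagate
# the critic's value through the generated policy into the generator.
critic = nn.Sequential(nn.Linear(11 + 3, 64), nn.ReLU(), nn.Linear(64, 1))  # placeholder critic
architectures = [[8, 4, 4, 4], [16, 16, 4], [64, 4, 4, 32]]                 # example search space
actor_optim = torch.optim.Adam(ghp.parameters(), lr=3e-4)

def actor_update(obs_batch):
    arch = architectures[torch.randint(len(architectures), (1,)).item()]
    params = ghp.generate(obs_dim=obs_batch.shape[1], hidden_sizes=arch, act_dim=3)
    actions = ghp.act(obs_batch, params)
    actor_loss = -critic(torch.cat([obs_batch, actions], dim=1)).mean()
    actor_optim.zero_grad()
    actor_loss.backward()
    actor_optim.step()

actor_update(torch.randn(32, 11))  # one gradient step on a dummy batch of observations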


Using the DLM for the best architecture: Among architectures trained with the GHP, we predict the best-performing one by maximizing the DLM.
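This page does not spell out the DLM, so the sketch below only shows the selection logic under a parameter budget; dlm_score stands in for whatever quantity is maximized, and mlp_param_count is the helper from the rollouts section above.

# Among candidate architectures, keep those that fit the parameter budget and
# return the one with the highest score (the score function is a placeholder).
def select_architecture(candidates, obs_dim, act_dim, param_budget, dlm_score):
    feasible = [a for a in candidates
                if mlp_param_count(obs_dim, a, act_dim) <= param_budget]
    return max(feasible, key=dlm_score) if feasible else None

best = select_architecture(architectures, obs_dim=11, act_dim=3,
                           param_budget=500, dlm_score=lambda a: -sum(a))  # dummy score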

Citation

@inproceedings{hegde2023efficiently,

  title={Efficiently Learning Small Policies for Locomotion and Manipulation},

  author={Hegde, Shashank and Sukhatme, Gaurav S},

  booktitle={2023 IEEE International Conference on Robotics and Automation (ICRA)},

  pages={5909--5915},

  year={2023},

  organization={IEEE}

}