Generalizing Skills with Semi-supervised Reinforcement Learning


Abstract
Deep reinforcement learning (RL) can acquire complex behaviors from low-level inputs, such as images. However, real-world applications of such methods require generalizing to the vast variability of the real world. Deep networks are known to achieve remarkable generalization when provided with massive amounts of labeled data, but can we provide this breadth of experience to an RL agent, such as a robot? The robot might continuously learn as it explores the world around it, even while it is deployed and performing useful tasks. However, this learning requires access to a reward function, to tell the agent whether it is succeeding or failing at its task. Such reward functions are often hard to measure in the real world, especially in domains such as robotics and dialog systems, where the reward could depend on the unknown positions of objects or the emotional state of the user. On the other hand, it is often quite practical to provide the agent with reward functions in a limited set of situations, such as when a human supervisor is present, or in a controlled laboratory setting. Can we make use of this limited supervision, and still benefit from the breadth of experience an agent might collect in the unstructured real world? In this paper, we formalize this problem setting as semi-supervised reinforcement learning (SSRL), where the reward function can only be evaluated in a set of “labeled” MDPs, and the agent must generalize its behavior to the wide range of states it might encounter in a set of “unlabeled” MDPs, by using experience from both settings. Our proposed method infers the task objective in the unlabeled MDPs through an algorithm that resembles inverse RL, using the agent’s own prior experience in the labeled MDPs as a kind of demonstration of optimal behavior. 
We evaluate our method on challenging tasks that require control directly from images, and show that our approach can improve the generalization of a learned deep neural network policy by using experience for which no reward function is available. We also show that our method outperforms direct supervised learning of the reward.
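The abstract describes inferring the task objective with an algorithm that resembles inverse RL, treating the agent's own labeled-MDP experience as demonstrations. The sketch below illustrates only that general idea and is not the paper's implementation. It assumes a linear reward over hypothetical trajectory features and a standard maximum-entropy IRL gradient, with the partition function estimated from trajectories sampled under a uniform proposal. The function name, the features, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def maxent_reward_step(theta, demo_feats, sample_feats, lr=0.1):
    """One gradient-ascent step on a MaxEnt IRL-style log-likelihood.

    Assumes a linear reward r(tau) = theta . phi(tau).
    demo_feats:   features of trajectories treated as demonstrations
                  (here standing in for labeled-MDP experience).
    sample_feats: features of sampled trajectories (here standing in for
                  unlabeled-MDP experience), used to estimate the partition
                  function under an assumed uniform proposal.
    """
    scores = sample_feats @ theta
    # Softmax weights over samples; subtract the max for numerical stability.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Gradient: demo feature mean minus reward-weighted sample feature mean.
    grad = demo_feats.mean(axis=0) - weights @ sample_feats
    return theta + lr * grad

# Toy data (hypothetical): demonstrations score high on feature 0.
rng = np.random.default_rng(0)
demo_feats = rng.normal(loc=[1.0, 0.0], scale=0.1, size=(20, 2))
sample_feats = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))

theta = np.zeros(2)
for _ in range(50):
    theta = maxent_reward_step(theta, demo_feats, sample_feats)
```

In this toy setup the learned weight on feature 0 becomes positive, so the inferred reward prefers trajectories that resemble the demonstrations. A full method would also need to update the policy against the inferred reward, which this sketch leaves out.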

Videos of Results:

obstacle navigation
[Video grid. Columns: iteration 0, 5, 10, 15; test on wall height = 0.4. Rows: RL policy (wall height = 0.3); S3G (wall height = 0.4); Oracle (wall height = 0.4).]


2-link reacher / mass
[Video grid. Columns (mass): 103.5, 103.75, 104, 104.1, 104.2. Rows: RL policy; Reward Regression; S3G; Oracle.]

2-link reacher with vision / target position
[Video grid. Columns: labeled MDP, labeled MDP, labeled MDP, unlabeled MDP, unlabeled MDP. Rows: RL; S3G.]

half-cheetah jumping
[Video grid. Columns: iteration 0, 5, 15, 30; test on wall height = 1.0. Rows: RL; S3G.]