Thinking While Moving: Deep Reinforcement Learning with Concurrent Control

Ted Xiao, Eric Jang, Dmitry Kalashnikov, Sergey Levine, Julian Ibarz, Karol Hausman*, Alexander Herzog*

(* indicates equal contribution)


We study reinforcement learning in settings where sampling an action from the policy must be done concurrently with the time evolution of the controlled system, such as when a robot must decide on the next action while still performing the previous action. Much like a person or an animal, the robot must think and move at the same time, deciding on its next action before the previous one has completed. In order to develop an algorithmic framework for such concurrent control problems, we start with a continuous-time formulation of the Bellman equations, and then discretize them in a way that is aware of system delays. We instantiate this new class of approximate dynamic programming methods via a simple architectural extension to existing value-based deep reinforcement learning algorithms. We evaluate our methods on simulated benchmark tasks and a large-scale robotic grasping task where the robot must "think while moving".
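To make the difference between blocking and concurrent control concrete, here is a minimal sketch under assumptions of our own. The system is a hypothetical one-dimensional point mass whose action is a velocity command. H is the interval between action starts, t_as is the time spent selecting an action, and t_as ≤ H. PointMass1D, blocking_step, and concurrent_step are illustrative names, not the authors' code. In the blocking case the system waits while the policy thinks. In the concurrent case the previous command keeps running, so the observation the policy acted on is already stale by t_as when the new command takes effect.

```python
from dataclasses import dataclass
from typing import Callable

# A policy maps an observation (here, a scalar position) to a velocity command.
Policy = Callable[[float], float]


@dataclass
class PointMass1D:
    """Hypothetical toy system: a point on a line driven by velocity commands."""
    x: float = 0.0      # position
    clock: float = 0.0  # wall-clock time elapsed, in seconds

    def run(self, velocity: float, duration: float) -> None:
        # The command is held constant over the interval, so this update is exact.
        self.x += velocity * duration
        self.clock += duration

    def pause(self, duration: float) -> None:
        # Time passes but the system stands still.
        self.clock += duration


def blocking_step(env: PointMass1D, policy: Policy, H: float, t_as: float) -> float:
    """Blocking control: the system halts while the policy selects an action."""
    obs = env.x
    env.pause(t_as)         # "stop and think" for t_as seconds
    action = policy(obs)    # obs still matches the state the action starts from
    env.run(action, H)      # the new action executes for the full interval H
    return action


def concurrent_step(env: PointMass1D, policy: Policy, prev_action: float,
                    H: float, t_as: float) -> float:
    """Concurrent control: the previous action keeps executing during selection."""
    if t_as > H:
        raise ValueError("this sketch assumes t_as <= H")
    obs = env.x                  # observation captured at the start of the interval
    action = policy(obs)         # selection takes t_as seconds of wall-clock time...
    env.run(prev_action, t_as)   # ...while the old command is still being applied
    env.run(action, H - t_as)    # the new command covers the rest of the interval
    return action
```

Take a policy that always returns 1.0, with H=0.5, t_as=0.2, and a previous action of 0.0. One blocking step ends at x = 0.5 after about 0.7 s of wall-clock time. One concurrent step ends at x ≈ 0.3 after 0.5 s: the agent never stops, but its first 0.2 s of motion still follow the old command.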

Simulated Grasping

Real Robot Grasping

Environment Comparison

Concurrent Knowledge Models

We introduce an algorithmic framework for analyzing concurrent environments in both continuous and discrete time. When actions are executed concurrently in a standard Markov Decision Process, the state the policy observes can change before its selected action begins to execute. We show that this added partial observability makes learning harder for the policy. However, when value-based reinforcement learning methods are given additional information about the delays in the system, concurrent knowledge models retain the theoretical convergence guarantees of Q-learning. We compare concurrent knowledge feature representations that include the previous action, the inference heartbeat (H), the action selection time (t_AS), the vector-to-go (VTG), and, in the continuous-time case, the timestep (t).
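The "simple architectural extension" mentioned in the abstract amounts to feeding these features to the Q-function alongside the state and the candidate action. The sketch below illustrates that idea under assumptions of our own:

- Actions are fixed-length displacement vectors, so the vector-to-go is the previous action minus the part already executed.
- The Q-function is any callable on a flat feature list. In practice this would be a deep network.
- A finite candidate list stands in for whatever action optimizer the underlying algorithm uses.

ConcurrentKnowledge, vector_to_go, q_input, and greedy_action are hypothetical names, not the authors' code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class ConcurrentKnowledge:
    """Hypothetical container for the concurrent knowledge features."""
    prev_action: List[float]  # a_{t-1}, possibly still executing when s_t is observed
    t_as: float               # action selection time t_AS
    heartbeat: float          # inference heartbeat H
    vtg: List[float]          # vector-to-go: the not-yet-executed part of a_{t-1}


def vector_to_go(prev_action: Sequence[float],
                 executed: Sequence[float]) -> List[float]:
    """Unexecuted remainder of the previous action, treating actions as displacements."""
    if len(prev_action) != len(executed):
        raise ValueError("prev_action and executed must have the same length")
    return [a - e for a, e in zip(prev_action, executed)]


def q_input(state: Sequence[float], action: Sequence[float],
            ck: Optional[ConcurrentKnowledge] = None) -> List[float]:
    """Flat input for Q(s, a); ck=None reproduces the baseline state-action input."""
    features = list(state) + list(action)
    if ck is not None:
        # The architectural extension: append concurrent knowledge to the input.
        features += list(ck.prev_action) + [ck.t_as, ck.heartbeat] + list(ck.vtg)
    return features


def greedy_action(q_fn: Callable[[List[float]], float], state: Sequence[float],
                  candidates: Sequence[Sequence[float]],
                  ck: Optional[ConcurrentKnowledge] = None) -> Sequence[float]:
    """Return the candidate action with the highest estimated Q-value."""
    return max(candidates, key=lambda a: q_fn(q_input(state, a, ck)))
```

To mirror the feature-set comparisons described above, drop fields from the concatenation in q_input or append the continuous-time timestep t.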

Experimental Results

We evaluate our method on MuJoCo tasks used as toy ablation studies, on large-scale simulated grasping, and on real-world grasping. Concurrent knowledge models achieve task success comparable to baseline methods while acting 37% faster. As the demonstration videos show, concurrent knowledge models act quickly and smoothly, without pausing to think between actions.

Supplementary Video: Robot Demonstrations