Real World Fluid Directed Rigid Body Control via
Deep Reinforcement Learning

Mohak Bhardwaj†, Thomas Lampe*, Michael Neunert*, Francesco Romano*, Abbas Abdolmaleki*

Arunkumar Byravan*, Markus Wulfmeier*, Martin Riedmiller*, Jonas Buchli*

† - University of Washington * - Google DeepMind

Supplementary Material: PDF

Abstract

Recent advances in real-world applications of reinforcement learning (RL) have relied on the ability to accurately simulate systems at scale. However, domains such as fluid dynamical systems exhibit complex dynamic phenomena that are hard to simulate at high integration rates, limiting the direct application of modern deep RL algorithms to often expensive or safety-critical hardware. In this work, we introduce "Box o' Flows", a novel benchtop experimental control system for systematically evaluating RL algorithms in dynamic real-world scenarios. We describe the key components of the Box o' Flows, and through a series of experiments demonstrate how state-of-the-art model-free RL algorithms can synthesize a variety of complex behaviors via simple reward specifications. Furthermore, we explore the role of offline RL in data-efficient hypothesis testing by reusing past experiences. We believe that the insights gained from this preliminary study and the availability of systems like the Box o' Flows support the way forward for developing systematic RL algorithms that can be generally applied to complex, dynamical systems.

Smoke-based Visualization of Flow Patterns

dbof-flow-visu.mp4

We use a smoke generator to visualize the complex flow patterns generated by the valve openings in the Box o' Flows. The unsteady dynamics of the airflow cannot be modeled at the high integration rates required by reinforcement learning and model-predictive control approaches, forcing the use of simplified models in practice.

Videos of Learned Behaviors

Online RL: We use Maximum a Posteriori Policy Optimization (MPO), a state-of-the-art model-free RL algorithm, to learn a wide range of dynamic behaviors from minimally specified reward functions. In all tasks, the agent receives as input a history of tracker measurements indicating the pixel location of the center of each ball.
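To make the input format concrete, the sketch below shows one way such an observation could be assembled from a short history of tracker measurements. The history length, number of balls, and function names are our assumptions for illustration, not the exact system interface.

```python
import collections
import numpy as np

# Assumed constants for illustration only.
H = 5          # history length (assumption)
NUM_BALLS = 3  # number of tracked balls (assumption)

history = collections.deque(maxlen=H)

def make_observation(tracker_pixels):
    """tracker_pixels: array of shape (NUM_BALLS, 2) with (x, y) pixel centers.

    Returns a flat vector stacking the last H measurements,
    padding with the earliest measurement at episode start.
    """
    history.append(np.asarray(tracker_pixels, dtype=np.float32))
    while len(history) < H:
        history.appendleft(history[0])
    return np.concatenate([h.ravel() for h in history])  # shape (H * NUM_BALLS * 2,)
```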

maximize_orange.webm

Hovering in the Presence of Distractors

The task is to maximize the height of the orange ball while the other balls act as distractors.
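As a rough illustration of how minimally such a task can be specified, the hovering objective could reduce to a reward like the sketch below. The pixel convention (y grows downward) and the image height are assumptions, not the paper's exact specification.

```python
def hover_reward(orange_y, image_height=480):
    """Reward proportional to the orange ball's height above the image bottom.

    Assumes pixel y increases downward; image_height is an assumed constant.
    """
    return (image_height - orange_y) / image_height  # in [0, 1]
```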

orange_in_right_purple_in_left.webm

Rearrangement: Orange in Right, Purple in Left

The task is to rearrange the balls so that the orange ball ends up in the right half-plane and the purple ball in the left half-plane.
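A minimal sketch of a half-plane reward consistent with this task description; thresholding at the image mid-line and the image width are our assumptions.

```python
def rearrange_reward(orange_x, purple_x, image_width=640):
    """1.0 when the orange ball is in the right half and the purple ball is
    in the left half of the image, else 0.0. The hard threshold at the image
    mid-line is an assumption; the task only states the half-plane goal.
    """
    mid = image_width / 2
    return float(orange_x > mid and purple_x < mid)
```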

orange_over_purple_v5.webm

Stacking: Orange over Purple

The agent must learn a strategy to stack the orange ball over the purple one, using a reward based on the relative positions of the two balls.
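The exact reward is not reproduced here; the sketch below is a hypothetical shaping term that captures the idea of rewarding the orange ball for being horizontally aligned with, and above, the purple ball. The tolerance and functional form are our choices.

```python
def stack_reward(orange_xy, purple_xy, x_tol=30.0):
    """Hypothetical shaping: reward horizontal alignment with, and being above,
    the purple ball. Pixel y grows downward, so 'above' means a smaller
    y-coordinate. The tolerance and product form are assumptions.
    """
    ox, oy = orange_xy
    px, py = purple_xy
    aligned = max(0.0, 1.0 - abs(ox - px) / x_tol)  # 1 when directly above
    above = float(oy < py)                          # orange higher than purple
    return aligned * above
```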

Goal-Directed RL: We train an agent to learn a goal-directed policy that stabilizes a ball at randomly chosen pixel targets. This task is used to characterize the reachability of different target regions. Since we do not have access to a simulator, and it is a priori unknown what the capabilities of the real hardware are, the performance of model-free RL on this reaching task serves as an auxiliary metric.
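For illustration, a goal-directed reaching setup of this kind could sample targets and reward proximity as in the sketch below; the workspace bounds, distance scale, and exponential shaping are assumptions rather than the paper's exact specification.

```python
import numpy as np

def sample_target(x_range=(0, 640), y_range=(0, 480), rng=np.random):
    """Sample a random pixel target; the workspace bounds are assumptions."""
    return np.array([rng.uniform(*x_range), rng.uniform(*y_range)])

def reach_reward(ball_xy, target_xy, scale=100.0):
    """Dense reward that decays with pixel distance to the target."""
    dist = np.linalg.norm(np.asarray(ball_xy) - np.asarray(target_xy))
    return float(np.exp(-dist / scale))
```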

reaching_2.mp4
reaching_1.mp4

We visually analyze reachability by plotting a coarsely discretized heatmap of reaching errors for different target regions. Pixel intensity is proportional to the cumulative error over episodes whose target fell in that pixel's bin, where the error is the average distance between the ball and the target over the last 200 steps of an episode. The analysis shows that target locations closer to the bottom and center are generally easier to reach. Targets near the bottom right are harder to reach than those near the bottom left and bottom center, which reveals an imbalance in the airflow through the different nozzles. Interestingly, targets closer to the walls are also easily reachable, since the agent can better exploit the airflow near the walls.
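The heatmap described above can be computed as in the following sketch, which bins episodes by target location and accumulates the per-episode error (mean ball-target distance over the last 200 steps); the grid resolution and image extent are assumptions.

```python
import numpy as np

def reaching_error_heatmap(episodes, bins=(8, 6), extent=(640, 480)):
    """Accumulate per-episode reaching errors into a coarse spatial grid.

    episodes: list of (target_xy, ball_trajectory) pairs, where
              ball_trajectory is a (T, 2) array of pixel positions.
    The bin counts and image extent are assumptions.
    """
    heat = np.zeros(bins)
    for target_xy, traj in episodes:
        # Error: mean ball-target distance over the last 200 steps.
        tail = np.asarray(traj)[-200:]
        err = np.linalg.norm(tail - np.asarray(target_xy), axis=1).mean()
        # Bin the target location and accumulate the error.
        ix = min(int(target_xy[0] / extent[0] * bins[0]), bins[0] - 1)
        iy = min(int(target_xy[1] / extent[1] * bins[1]), bins[1] - 1)
        heat[ix, iy] += err
    return heat
```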