Seyed Kamyar Seyed Ghasemipour, Byron David, Daniel Freeman, Shixiang (Shane) Gu,
Satoshi Kataoka†, Igor Mordatch†
†equal advising
ICML'22 Accepted
Assembly of multi-part physical structures is both a valuable end product for autonomous robotics and a valuable diagnostic task for open-ended training of embodied intelligent agents. We introduce a naturalistic physics-based environment with a set of connectable magnet blocks inspired by children’s toy kits. The objective is to assemble blocks into a succession of target blueprints. Despite the simplicity of this objective, the compositional nature of building diverse blueprints from a set of blocks leads to an explosion of complexity in the structures that agents encounter. Furthermore, assembly stresses agents' multi-step planning, physical reasoning, and bimanual coordination. We find that the combination of large-scale reinforcement learning and graph-based policies is an effective recipe for training agents that not only generalize to complex unseen blueprints in a zero-shot manner, but even operate in a reset-free setting without being trained to do so. Through extensive experiments, we highlight the importance of large-scale training, structured representations, and the contributions of multi-task vs. single-task learning, as well as the effects of curriculums, and discuss qualitative behaviors of trained agents.
Our goal is to design a minimal, tractable assembly environment to study generalization in a naturalistic, multi-step, combinatorial, dynamic problem requiring bimanual coordination. We construct a three-dimensional environment containing a fixed set of 16 cuboid blocks of 6 different types. Blocks contain positive and negative magnet points, rendered as red and blue respectively, positioned on the block surface. Positive and negative magnets "snap" together when sufficiently close, and disconnect when adequate pulling force is applied. Magnets enable the creation of arbitrarily complex composed structures from the given building blocks. To simplify the problem, in lieu of robotic arms we opt for virtual grippers which can directly manipulate desired blocks. More specifically, each gripper can decide which block to move, and set its positional and rotational velocities. The use of direct manipulation abstracts away the challenges of grasping and manipulation with a robotic arm, and enables us to focus on research questions concerning higher-level assembly behaviors such as planning and generalization to unseen structures.
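As a rough sketch of what this direct-manipulation action interface looks like, each gripper outputs a block choice together with commanded velocities for that block. The names and exact dimensions below are illustrative, not the environment's actual API:

```python
import numpy as np

NUM_BLOCKS = 16   # fixed set of blocks in the environment
NUM_GRIPPERS = 2  # two virtual grippers for bimanual assembly

# One environment step's worth of (zero-initialized) actions: each gripper
# selects which of the NUM_BLOCKS blocks to manipulate and directly commands
# that block's positional and rotational velocities.
action = {
    "block_choice": np.zeros(NUM_GRIPPERS, dtype=np.int32),            # index in [0, NUM_BLOCKS)
    "linear_velocity": np.zeros((NUM_GRIPPERS, 3), dtype=np.float32),  # commanded positional velocity
    "angular_velocity": np.zeros((NUM_GRIPPERS, 3), dtype=np.float32), # commanded rotational velocity
}
```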
To specify the assembly task, we designed 165 blueprints (split into 141 train, 24 test) describing interesting structures to be built, although blueprints could potentially also be procedurally generated. The complexity of the created blueprints ranges from requiring only a single magnetic connection, up to challenging structures that make use of all 16 available blocks. The problem statement in our magnetic assembly environment is simple to describe: in each episode, the agent must assemble the blocks to create the desired blueprint. Each episode begins either with all blocks randomly scattered around the environment, or from a randomly sampled pre-constructed blueprint, with unused blocks dispersed on the ground. Episodes are 100 environment steps long, translating to 10 seconds of real-time interaction.
In the assembly task, observations pertaining to the blocks can be naturally organized into a directed graph, with each node containing information about a particular block, and each directed edge representing relative information about the pair of blocks it connects.
The information contained in each node is minimal: the z height of the block from the ground, and whether it was being held by each gripper in the previous timestep. The majority of observations are placed on the directed edges. An edge connecting two blocks contains information regarding: the relative position and orientation of their magnets that need to be connected, the change in relative position and orientation of the blocks needed to match the blueprint, the relative position of the blocks' centers of mass, whether the blocks are magnetically attached, and whether the blocks should be magnetically attached according to the blueprint. These observations could realistically be computed in a real-world setting as well, simply by obtaining each block's position and orientation.
For each gripper we include its orientation, positional and rotational velocities, and which block the gripper was holding in the previous timestep.
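As a concrete sketch, these observations map naturally onto a jraph.GraphsTuple, with block features on the nodes, relative features on the directed edges, and gripper features on the globals. The feature dimensions below are placeholders rather than the exact sizes used in our agents:

```python
import jax.numpy as jnp
import jraph

NUM_BLOCKS = 16

# Fully-connected directed graph over blocks (no self-edges).
senders, receivers = [], []
for i in range(NUM_BLOCKS):
    for j in range(NUM_BLOCKS):
        if i != j:
            senders.append(i)
            receivers.append(j)

NODE_DIM = 3     # e.g. z height + held-by-gripper flags (illustrative)
EDGE_DIM = 16    # e.g. relative magnet poses, blueprint deltas, attachment flags (illustrative)
GLOBAL_DIM = 20  # e.g. gripper orientations, velocities, previously-held blocks (illustrative)

observation_graph = jraph.GraphsTuple(
    nodes=jnp.zeros((NUM_BLOCKS, NODE_DIM)),    # per-block features
    edges=jnp.zeros((len(senders), EDGE_DIM)),  # per-edge relative features
    senders=jnp.asarray(senders),
    receivers=jnp.asarray(receivers),
    globals=jnp.zeros((1, GLOBAL_DIM)),         # "global node" carrying gripper observations
    n_node=jnp.asarray([NUM_BLOCKS]),
    n_edge=jnp.asarray([len(senders)]),
)
```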
Given that our magnetic assembly task can be naturally set up using graph-based observations, prior to extracting actions and critic values, we first encode inputs using a graph neural network architecture, specifically graph attention networks. The two inputs to our encoder are (1) a directed graph containing all block observations, and (2) a "global node" containing gripper observations.
We have found that, in addition to a graph neural network encoder, a key design choice is how to extract policy actions from the encoded inputs. From each block node we obtain two vectors: (1) a vector representing how the block should be moved if a gripper chooses to move it, and (2) a key vector for the block. From the global node's hidden features, we obtain one query vector per gripper. To obtain logits representing which gripper moves which block, we use dot-product attention between the block keys and the gripper query vectors.
To train our RL agents we also require critic value estimates, which we obtain by passing global features obtained from the graph encoder to a small MLP.
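A minimal Haiku sketch of these output heads is shown below. It omits the graph-attention encoder and uses hypothetical layer sizes, so it should be read as an illustration of the key/query attention mechanism and the critic head rather than our exact implementation:

```python
import haiku as hk
import jax
import jax.numpy as jnp

NUM_GRIPPERS = 2
KEY_DIM = 64   # illustrative
MOVE_DIM = 6   # positional + rotational velocity command per block (illustrative)

def policy_and_value_heads(block_features, global_features):
    """block_features: [num_blocks, hidden]; global_features: [hidden] (both post-encoder)."""
    # Per-block outputs: a candidate movement vector and a key vector.
    move_vectors = hk.Linear(MOVE_DIM, name="block_move")(block_features)
    block_keys = hk.Linear(KEY_DIM, name="block_key")(block_features)

    # One query vector per gripper, derived from the global (gripper) features.
    gripper_queries = hk.Linear(NUM_GRIPPERS * KEY_DIM, name="gripper_query")(global_features)
    gripper_queries = gripper_queries.reshape(NUM_GRIPPERS, KEY_DIM)

    # Dot-product attention: logits over which block each gripper manipulates.
    block_choice_logits = gripper_queries @ block_keys.T / jnp.sqrt(KEY_DIM)

    # Critic: a small MLP on the global features.
    value = hk.nets.MLP([256, 256, 1], name="critic")(global_features)
    return block_choice_logits, move_vectors, value

heads = hk.without_apply_rng(hk.transform(policy_and_value_heads))
# Usage (shapes illustrative):
# params = heads.init(jax.random.PRNGKey(0), jnp.zeros((16, 128)), jnp.zeros(128))
```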
We train our agents using Proximal Policy Optimization (PPO) and Generalized Advantage Estimation (GAE), and follow the practical PPO training advice of Andrychowicz et al. As will be shown below, one of the key ingredients in enabling the training of our magnetic assembly agents is the scale of training. Unless otherwise specified, our agents are trained for 1 billion environment timesteps, using 1 Nvidia V100 GPU for training and 3,000 preemptible CPUs for generating rollouts in the environment. 1 billion steps in our setup amounts to about 48 hours of training. The key libraries used for training are JAX, Jraph, Haiku, and Acme.
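For reference, the generalized advantage estimate used alongside PPO can be computed as follows; this is the standard textbook formulation, with illustrative discount and lambda values rather than our exact settings:

```python
import numpy as np

def generalized_advantage_estimate(rewards, values, dones, gamma=0.99, lam=0.95):
    """Standard GAE over one rollout.

    rewards, dones: float arrays of length T; values: length T + 1 (includes bootstrap value).
    """
    advantages = np.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        # TD residual at step t.
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        # Exponentially-weighted sum of residuals.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    returns = advantages + values[:-1]
    return advantages, returns
```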
The blueprints that we have designed range from very simple 2 block structures, up to complex blueprints containing all blocks. To train assembly agents, we have split blueprints into 141 training and 24 testing structures, and unless otherwise specified, agents are trained on the full training set of blueprints; in each episode, we sample a training blueprint and task the agent with creating that structure.
Episodes start from either (1) all blocks randomly dispersed on the ground, or (2) a randomly chosen pre-constructed blueprint structure with unused blocks randomly arranged on the ground. Resetting from blueprints increases the diversity of initial states, forces the agent to learn how to disassemble structures, and, as we found, enables a reset-free mode of operation in which the agent can continually construct, deconstruct, and reconstruct different blueprints. Unless otherwise specified, we reset from training blueprints with probability 0.2.
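In code, episode initialization amounts to a simple coin flip between the two start modes; the helper below is an illustrative sketch rather than our actual environment code:

```python
import numpy as np

BLUEPRINT_RESET_PROB = 0.2  # probability of starting from a pre-constructed blueprint
rng = np.random.default_rng(0)

def sample_initial_state(num_train_blueprints):
    """Pick how the next training episode starts (illustrative helper)."""
    if rng.random() < BLUEPRINT_RESET_PROB:
        # Start from a randomly chosen pre-constructed training blueprint,
        # with the unused blocks scattered on the ground.
        return "from_blueprint", int(rng.integers(num_train_blueprints))
    # Otherwise all blocks start randomly dispersed on the ground.
    return "scattered", None
```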
We have observed that throughout training, some blueprints can be quickly learned while others can be much more challenging. To emphasize focus on more challenging blueprints, in each episode we sample goal blueprints based on a curriculum that emphasizes currently unsolved blueprints.
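One simple way to implement such a curriculum is to sample blueprints with probability that decreases with their recent success rate; the specific weighting below is illustrative and not necessarily the exact scheme used in our training runs:

```python
import numpy as np

def sample_blueprint(rng, recent_success_rates, temperature=1.0, min_weight=0.05):
    """Sample a training blueprint index, emphasizing currently unsolved blueprints.

    recent_success_rates: array in [0, 1], one entry per training blueprint,
    e.g. a running average of recent evaluation outcomes.
    """
    weights = (1.0 - recent_success_rates) ** temperature + min_weight
    probs = weights / weights.sum()
    return int(rng.choice(len(probs), p=probs))

# Example: a blueprint solved 90% of the time is sampled far less often
# than blueprints the agent currently fails on.
rng = np.random.default_rng(0)
blueprint_idx = sample_blueprint(rng, np.array([0.9, 0.1, 0.5]))
```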
During training, we continuously evaluate trained policies, approximately every 10 minutes, by freezing the policy and computing the average success rate over 40 episodes. This continuous evaluation is executed on both training and test environments. In each evaluation cycle, we also generate a video to visualize the agents' behavior. Such visualizations have been a valuable asset in iterating on the design of our agents, observations, reward functions, and training setups.
The figure above presents the success rates of our agent (averaged across two runs) throughout training, on blueprints the agent was trained on as well as held-out structures. The first key observation is the compute scale necessary for effectively training our structured agents using PPO. The simplest 2 block structures can take up to 100 million steps to be reliably solved, while it can take up to 500 million environment steps until the first time some of the most complex blueprints are solved. The second observation is that after a long period of training, not only can agents reliably solve all training blueprints, but they can also generalize well to complex held-out blueprints.
As noted earlier, agents are trained to simultaneously learn to construct all blueprints in the training split. To understand the contribution of this "Multi-Task" training, we train three agents in a "Single-Task" setting: one for constructing a particular 6 block blueprint, one for a particular 12 block blueprint, and one for a particular 16 block blueprint. Our key observations are the following: (1) While the 6 and 12 block blueprints are eventually learned, the 16 block blueprint is not. (2) In the single-task setting, the 12 block blueprint requires approximately 500 million environment steps to be learned, while in the multi-task setting it is learned within 300 million steps. (3) The single-task agents can transfer to some blueprints of equal or lower complexity than they were trained on, but mostly fail to transfer to blueprints they were not trained to solve. This is in sharp contrast to the multi-task agents, which can transfer even to complex held-out blueprints. These results highlight the necessity of multi-task training, not only for generalization to unseen blueprints, but also for quickly and reliably solving complex tasks, despite the fact that agent architectures are well-matched to the problem domain.
Given that state information for the assembly task can be naturally organized into a graph representation, the use of graph neural networks imbues agents with an inductive bias that is well-matched to the domain. We compare to two variations of our agent architecture: (1) removing the attention in the graph layers, and (2) removing the relational inductive bias altogether by using a residual network encoder instead of graph neural networks.
The results in the table below clearly demonstrate the necessity of the relational inductive bias provided by graph neural networks, as well as the impact of the attention mechanism in the graph neural network layers.
We find that while our single-gripper agent finds unique strategies to complete some of the structures, as shown in the table above, its overall success rate is lower than that of a dual-gripper agent, particularly on the more complex blueprints. This indicates the necessity of using two grippers in our proposed Magnetic Block Assembly domain.
With probability 0.2, the initial state of an episode is set to be a randomly selected pre-constructed blueprint, with the remaining blocks dispersed on the ground. This choice has two advantages: (1) it provides an opportunity for agents to learn how to disassemble incorrect constructions, and (2) it enables the evaluation of our agents in a reset-free manner, where we continually task agents with constructing new blueprints without resetting the environment to an initial state.
We compare the success rate of two agents, one trained with and one without blueprint resets, in a reset-free setting. Specifically, within one reset-free episode we ask agents to build 10 consecutive blueprints without resetting to an initial state: once an agent successfully constructs a blueprint, or a maximum of 100 steps has elapsed, we change the target blueprint. As an additional challenge, we sample blueprints from the training-set structures that require a minimum of 12 blocks. We report the success rate aggregated across 50 reset-free episodes (i.e., 50 x 10 blueprint attempts in total).
When resetting from blueprints is disabled, our agent achieves a success rate of 69.4% +/- 17.0%. In contrast, with blueprint resets, the success rate increases to 93.1% +/- 7.5%. This is an exciting finding, as it demonstrates a case where episodic training enables agents to be deployed in the practically relevant reset-free setting.
Our results indicate that curriculums do not have a clear-cut benefit, but they may lead to improvements in generalization to blueprints in the held-out test set. Please refer to our paper for additional details and figures.
Due to the use of direct manipulation, agents can rapidly switch which block they are holding, which can result in maneuvers that are sometimes unrealistic and not achievable by physical robot grippers. Thus, if we want to transfer the success from our direct-manipulation environment to more realistic settings using robotic arms, it is important to understand how such unrealistic behaviors can be mitigated. To this end, after training an agent using the default training procedure, we modify the environment as follows: whenever a gripper chooses to change the object it is holding, we disable that gripper for 2 steps. After this change, we continue to train our agent.
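A sketch of this modification as a small bookkeeping rule applied before actions reach the simulator is shown below; the class and variable names are illustrative, not our actual environment code:

```python
import numpy as np

NUM_GRIPPERS = 2
SWITCH_DELAY_STEPS = 2  # steps a gripper stays disabled after switching blocks

class GripperSwitchDelay:
    """Tracks, per gripper, how long it remains disabled after changing blocks."""

    def __init__(self):
        self.prev_block = np.full(NUM_GRIPPERS, -1, dtype=np.int32)
        self.disabled_for = np.zeros(NUM_GRIPPERS, dtype=np.int32)

    def apply(self, block_choice, velocities):
        """block_choice: [grippers] block indices; velocities: [grippers, 6] commands."""
        for g in range(NUM_GRIPPERS):
            if block_choice[g] != self.prev_block[g]:
                # Gripper switched to a different block: disable it briefly.
                self.disabled_for[g] = SWITCH_DELAY_STEPS
                self.prev_block[g] = block_choice[g]
            if self.disabled_for[g] > 0:
                # Gripper is still "in transit": ignore its commands this step.
                velocities[g] = 0.0
                self.disabled_for[g] -= 1
        return velocities
```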
While the agent's success rate initially drops very significantly, within less than 100 million environment steps the agent recovers its strong performance. This is a small number of steps compared to the 2.5+ billion environment steps used to train the agent. Our results also demonstrate that training agents from scratch with gripper transition delays is a significantly more challenging problem, and even after 1.5 billion environment steps, the agent is still unable to make significant progress on many of the blueprints in the training set. These results suggest that an efficient approach to training practical agents is to first train them in the simplest settings, and then continue to finetune them in more realistic scenarios. Videos demonstrating behaviors of agents with and without delays can be viewed below.
Visualizing trained agents, we observe a number of interesting learned behaviors. Examples of such behaviors include: (1) Despite environment observations not providing fine-grained detail about free-space, agents appear to have learned robust collision avoidance skills. (2) When building complex structures, agents appear to first build separate smaller substructures, and subsequently attach the substructures to construct the full blueprint.
In the videos below we visualize the attention patterns learned by a trained agent. In each video, for a given blueprint structure, we illustrate the attention pattern learned by a particular layer's attention head. Each video consists of 16 copies of the same rollout episode, where in each copy we color a specific block in GREEN and visualize which blocks it is attending to in RED. A stronger red color indicates stronger attention to that block.
The learned attention pattern in this layer appears to be the following: Once the green block is attached to the main structure, it attends to the block being held by the green virtual gripper.
The learned attention pattern in this layer appears to be the following: If the green block is part of the blueprint structure, it attends to blocks that it is connected to, or is about to be connected to. If it is not part of the blueprint, it attends to other blocks that are also not part of the blueprint.
The learned attention pattern in this layer appears to be the following: The green block tends to attend to all the blocks that are in the main structure being constructed.
The learned attention pattern in this layer appears to be the following: The green block tends to attend to the blocks it needs to be connected to.
We introduced a new blueprint assembly environment for studying bimanual assembly of multi-part physical structures, and demonstrated the training of a single agent that can simultaneously solve all seen and unseen assembly tasks via a combination of large-scale RL, structured policies, and multi-task training. While our work showed that a solution to our problem exists, it is by no means efficient, requiring billions of environment steps. It is likely that by incorporating planning or hierarchical methods, the training time can be significantly shortened. Additionally, as accelerated simulation engines such as Brax and Isaac Gym mature, our agents may be trainable at a similar compute scale using much more modest hardware infrastructure.
Beyond more efficient training, in this work we chose to abstract away the complexities of manipulation and perception. Concurrently with this work, we have been investigating how bimanual robotic policies can be transferred to real-world robotic systems. Our current efforts in this direction can be viewed in the link above, with a video of our results presented below.