Tabletop Manipulation

[Edit: I wrote the blurb below in ~2010, well before target networks and DQN were invented. Naturally, being at DeepMind I train non-linear value nets all the time, so it made me laugh when I read this, and I decided to keep it around for posterity]

Splinter 1.0 (circa 2009)

Splinter 1 was my internal name for the first project I worked on in Mike Stilman's lab. Mike's lab had a ninja-turtles theme, and after too much work, I came up with SPLinTER as short for "Simultaneous Planning and Learning in a Tabletop Exploring Robot". It was a manipulation project on our Schunk arm, and had the rather lofty goal of solving generic manipulation tasks using the magic of reinforcement learning. In particular, I had recently read about TD-Gammon, and was excited about the prospect of finding meaningful abstractions in the nodes of a neural network. As a publishable project it was largely a dud, but since it was a lot of fun, and since it led to Splinter 2.0, I have no regrets.

Here's a blurb I found on my computer which pretty well summarizes what I was thinking at the time:

I chose to focus my research on the domain of object manipulation as a way of striking a balance between simplicity, applicability, and (theoretical) depth. As a research problem, this topic offers a set of challenges which are at the same time clear enough to imagine writing computer programs to solve, relevant enough to generate excitement (and funding!), and deep enough to use as models for exploring fundamental issues in representation and reasoning. For example, I'd like to see my robot discover for itself the concepts required to open a door, and later transfer this understanding to related tasks such as opening cabinets, chests, or books. The perceptual and kinematic concepts involved in these tasks are well understood, at least compared to higher cognitive phenomena like social behavior, yet the behaviors remain an elusive goal for roboticists. My goal, in a nutshell, is to create learning algorithms that are capable of generating structured knowledge representations over sensorimotor primitives through unsupervised exploration. I am currently exploring this possibility with a combination of techniques in reinforcement learning and function approximation, including TD learning and a variant of the venerable artificial neural network.

These issues were important enough to me that I switched to a machine learning lab, but I've also come a long way since. My oversight in Splinter 1 was not appreciating that the gap between low-level control and a high-level task space was too wide to be bridged by any known method. I'd initially hoped that the RL model, combined with a neural-network function approximator, would magically bridge this gap and leave me with a policy for pushing blocks that could be tailored to any reward function I came up with.

In fact, what I found was that the approximator tended to diverge or fluctuate in all but the simplest of tasks (see below for one working example). I have since come to understand this phenomenon thanks to the work of one of my lab mates, Peng Zang, whose dissertation was about scaling RL with function approximation. In short, the problem is that function approximators can behave erratically in the iterative, bootstrapped settings we often have in RL. Even if the target value function is in the hypothesis space of the approximator, the intermediate value functions needed to get there from a random starting point may not be. For further discussion, check out his thesis.
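To make that failure mode concrete, here's a minimal sketch (in Python/NumPy, not the original Splinter code) of semi-gradient TD(0) with a small nonlinear value network on a toy random-walk chain. The network size, step size, and task are illustrative stand-ins, but the loop of bootstrapped targets plus gradient steps is exactly the iterative setting described above.

```python
# A minimal sketch (not the original Splinter code): semi-gradient TD(0) with a
# small nonlinear value network on a toy random-walk chain. The network size,
# step size, and task here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N_STATES, GAMMA, ALPHA = 5, 0.99, 0.1

# One-hidden-layer value network: V(s) = w2 . tanh(W1 @ phi(s))
W1 = rng.normal(scale=0.5, size=(8, N_STATES))
w2 = rng.normal(scale=0.5, size=8)

def phi(s):
    """One-hot state features."""
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

def value(s):
    return w2 @ np.tanh(W1 @ phi(s))

def td_update(s, r, s_next, done):
    """Semi-gradient TD(0): nudge V(s) toward the bootstrapped target r + gamma*V(s')."""
    global W1, w2
    target = r + (0.0 if done else GAMMA * value(s_next))
    h = np.tanh(W1 @ phi(s))
    delta = target - w2 @ h
    grad_w2 = h                                    # dV(s)/dw2
    grad_W1 = np.outer(w2 * (1.0 - h**2), phi(s))  # dV(s)/dW1
    w2 += ALPHA * delta * grad_w2
    W1 += ALPHA * delta * grad_W1

# Random walk on a chain; reward 1 only for falling off the right end.
for episode in range(200):
    s = N_STATES // 2
    while True:
        s_next = s + rng.choice([-1, 1])
        done = s_next < 0 or s_next >= N_STATES
        r = 1.0 if s_next >= N_STATES else 0.0
        td_update(s, r, min(max(s_next, 0), N_STATES - 1), done)
        if done:
            break
        s = s_next

# Final value estimates (true values increase toward the right end of the chain).
print([round(float(value(s)), 3) for s in range(N_STATES)])
```

On a chain this small the estimates usually settle down; the trouble described above shows up as the state space, reward function, and network grow, when the intermediate value functions wander outside what the approximator can represent.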

Extracting Image Features

Value Function Instability

Splinter 2.0 (circa 2010)

This project was a follow-up to Splinter 1, but with a pivot from RL to motion planning and optimization. From its predecessor, I inherited the notion of posing a manipulation task in some functional form (a "clutter cost"), but left behind the model-free optimization. Rather than using a neural network and hoping for the best, I took a new approach with more theoretical guarantees. The intuition was simple: if a full policy over low-level actions is too difficult, then try a one-shot plan over higher-level actions. By providing the robot with a very small "basis set" of actions for pushing tabletop objects, I was able to use standard planning methods to compose them into sequences for solving arbitrary tasks.

The primary contribution of this work was in recognizing the relationship between the planning algorithm and the optimization problem. Since I wanted the robot to discover plans that lead to optimal tabletop configurations, I needed an algorithm that would explore the cost space quickly and coarsely at first, and then more finely given time. That is, I wanted to explore deeply without strong bias at first, and become progressively greedier over time. Clearly A* is not an appropriate choice, since under a flat heuristic it reduces to breadth-first search, which would concentrate its effort near the starting state. The solution I chose was to use Rapidly-Exploring Random Trees (RRT), modified to search over action primitives. Putting this together with my cost-based task model led to a planner that was both general-purpose (a task is just a cost function) and robust: compared to other methods that decouple the planning and optimization problems, my approach guarantees that all discovered states are reachable, since I explore the cost landscape using the actual actions the robot can use for the task.
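As a rough illustration of the idea (a hedged sketch, not the implementation from the paper), here's an RRT-style tree search whose edges are discrete push primitives and whose task is given only as a cost over configurations; the forward model, the primitive set, and the goal cost below are placeholder assumptions.

```python
# A hedged sketch of the core idea, not the paper's implementation: an RRT-style
# tree whose edges are discrete push primitives, with the task given only as a
# cost over tabletop configurations. `simulate`, the primitive set, and the goal
# cost are placeholder assumptions.
import numpy as np

PRIMITIVES = ["push_left", "push_right", "push_up", "push_down"]  # tiny basis set

def simulate(state, primitive):
    """Placeholder forward model: configuration after applying one push primitive."""
    deltas = {"push_left": (-0.05, 0.0), "push_right": (0.05, 0.0),
              "push_up": (0.0, 0.05), "push_down": (0.0, -0.05)}
    return state + np.array(deltas[primitive])

def cost(state):
    """The task as a cost function, e.g. distance of an object to a goal region."""
    return np.linalg.norm(state - np.array([0.5, 0.5]))

def rrt_over_primitives(start, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    nodes = [np.asarray(start, dtype=float)]
    parents, actions = {0: None}, {0: None}
    best = 0
    for _ in range(iters):
        sample = rng.uniform(0.0, 1.0, size=2)        # random target configuration
        near = min(range(len(nodes)),                 # nearest existing node
                   key=lambda i: np.linalg.norm(nodes[i] - sample))
        prim = min(PRIMITIVES,                        # primitive that best approaches the sample
                   key=lambda a: np.linalg.norm(simulate(nodes[near], a) - sample))
        nodes.append(simulate(nodes[near], prim))     # every node is reachable by construction
        idx = len(nodes) - 1
        parents[idx], actions[idx] = near, prim
        if cost(nodes[idx]) < cost(nodes[best]):      # track the lowest-cost configuration found
            best = idx
    plan, i = [], best                                # read back the primitive sequence
    while parents[i] is not None:
        plan.append(actions[i])
        i = parents[i]
    return list(reversed(plan))

print(rrt_over_primitives(start=[0.1, 0.9]))
```

Because the tree is grown only by simulating the robot's own primitives, the lowest-cost configuration it finds comes with a plan that reaches it, which is the robustness property mentioned above.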

I published this project at Humanoids 2010 in Nashville, and was very fortunate to receive a best paper award.