Heni Ben Amor1,2, Laura Graesser2, Atil Iscen2, David D'Ambrosio2, Saminda Abeyruwan2, Alex Bewley2, Yifan Zhou1, Kamalesh Kalirathinam1, Swaroop Mishra2, Pannag Sanketi2
1Interactive Robotics Lab, Arizona State University, 2Google DeepMind
We demonstrate the ability of large language models (LLMs) to perform iterative self-improvement of robot policies. An important insight of this paper is that LLMs have a built-in ability to perform (stochastic) numerical optimization and that this property can be leveraged for explainable robot policy search. Based on this insight, we introduce the SAS Prompt (Summarize, Analyze, Synthesize) – a single prompt that enables iterative learning and adaptation of robot behavior by combining the LLM’s ability to retrieve, reason and optimize over previous robot traces in order to synthesize new, unseen behavior. Our approach can be regarded as an early example of a new family of explainable policy search methods that are entirely implemented within an LLM. We evaluate our approach both in simulation and on a real-robot table tennis task.
SAS: Summarize, Analyze, Synthesize
Traditional methods that enable robot learning through self-improvement require a set of components such as (a) the identification of critical feature variables, (b) the design of a loss/reward function involving these features, and (c) an update rule to iteratively synthesize better parameters. We introduce the SAS Prompt – an approach for robot learning and self-improvement that implements all three of the above steps within a single LLM prompt. The SAS Prompt enables robots to understand and interpret previous robot behavior from in-context examples in order to perform policy search and synthesize new, unseen behavior. The result is a family of algorithms in which self-improvement and numerical optimization are performed through repeated calls to an LLM with an increasing context window. An example of the SAS prompt is shown below, followed by a minimal sketch of the resulting self-improvement loop:
User Objective: "Hit the ball to the right side of the table!"
Step 1: Create a table that summarizes each in-context example.
Step 2: From the table above give me the parameters that are closest to fulfilling the objective.
Step 3: Take these parameters and the summary table and analyze the effect of the control parameters. Let's think step by step!
Step 4: Based on this analysis, propose a new set of values for the control parameters which will bring us closer to the objective than any of the previous examples.
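To make the loop concrete, here is a minimal sketch of how the SAS prompt could drive self-improvement; it is illustrative, not the authors' released code. The callables `query_llm` and `run_episode` stand in for an LLM API and a robot/simulator rollout, and the episode format and the JSON output convention are assumptions made for this sketch.

```python
# Minimal sketch of the SAS self-improvement loop (illustrative only).
# `query_llm` and `run_episode` are injected callables standing in for an
# LLM API and a robot/simulator rollout, respectively.
import json

SAS_STEPS = """\
Step 1: Create a table that summarizes each in-context example.
Step 2: From the table above give me the parameters that are closest to fulfilling the objective.
Step 3: Take these parameters and the summary table and analyze the effect of the control parameters. Let's think step by step!
Step 4: Based on this analysis, propose a new set of values for the control parameters which will bring us closer to the objective than any of the previous examples."""

# Output convention added for easy parsing; an assumption made for this sketch.
OUTPUT_FORMAT = "Return the proposed parameters as a flat JSON object on the last line."


def sas_self_improvement(objective, initial_episodes, query_llm, run_episode, iterations=30):
    """Query the LLM with all previous episodes, execute its proposal, repeat."""
    episodes = list(initial_episodes)  # e.g. [{"params": {...}, "landing_position": [x, y]}, ...]
    for _ in range(iterations):
        examples = "\n".join(json.dumps(e) for e in episodes)
        prompt = (f'User Objective: "{objective}"\n'
                  f"In-context examples:\n{examples}\n\n{SAS_STEPS}\n{OUTPUT_FORMAT}")
        response = query_llm(prompt)
        # Parse the flat JSON object at the end of the response (sketch convention).
        proposal = json.loads(response[response.rfind("{"):response.rfind("}") + 1])
        episodes.append({"params": proposal, "landing_position": run_episode(proposal)})
    return episodes
```

Each iteration grows the in-context example set by one episode, which is why the context window size increases over the course of learning.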
We validate our methodology in extensive experiments on a robot table tennis control task, both in simulation and in the real world. In our experiments, the human user provides a task objective in natural language. The LLM then identifies robot control parameters that achieve this objective via retrieval and self-improvement.
🧑: "Hit the ball to the far left."
🧑: "Hit the ball as far right as possible."
🧑: "Hit the ball to the middle of the top edge of the table."
SAS: Visualizing the Self-Improvement Process
The following animations visualize the self-improvement process in simulation. At each iteration, the LLM is asked to analyze all previous executions of the robot and, in turn, synthesize new control parameters that bring us closer to the human objective. We provided three different self-improvement objectives, namely "Hit the ball to the far right!", "Hit the ball to the top edge!", and "Hit the ball to the left corner!", and ran 20 experiments for each objective. In each experiment, 30 iterations of self-improvement are conducted. The landing position of the ball at each iteration is depicted below. We can observe that the landing position gradually shifts towards the (S1) right edge, (S2) top edge, or (S3) left corner.
To better understand how each iteration of the learning process leads to new, updated robot parameters, we print below an example LLM response to an SAS prompt query. In the first step, the LLM summarizes all previous robot executions and extracts the critical information needed to evaluate the robot behavior. In the second step, the example(s) that best fit the human objective are automatically identified by the LLM. In the third step, the effect of each robot control parameter on robot performance is identified. Finally, given the identified impact of the parameters, new, updated parameters are synthesized that bring the robot closer to achieving the human objective.
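Because the response follows this fixed four-step structure, it doubles as an explanation of each update and can be logged stage by stage. The small helper below illustrates this; it assumes the response keeps literal "Step N:" headers, which is a formatting assumption rather than a guarantee about the paper's outputs.

```python
# Illustrative helper: split a SAS response into its four labeled steps
# (summary, retrieval, analysis, synthesis) for logging and inspection.
# Assumes literal "Step N:" headers in the response text.
import re

def split_sas_response(response: str) -> dict:
    parts = re.split(r"(?m)^Step (\d+):", response)
    # re.split with a capturing group yields [preamble, "1", text_1, "2", text_2, ...]
    return {f"step_{num}": text.strip() for num, text in zip(parts[1::2], parts[2::2])}
```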
SAS Response
SAS: Numerical Optimization inside the LLM
One of the key tenets of this paper is that LLMs are capable of numerical optimization and that this property can be exploited for self-improvement in robotics. We evaluate this on widely used optimization benchmarks (minimization), namely Rastrigin’s and Ackley’s functions, in both 2D and 8D. To avoid any potential for guessing the answer (e.g., the LLM generating (0, 0) because this is frequently the location of the optimum in benchmarks), we added a constant shift to both functions. For every algorithm, 50 experiments are performed, each with 100 update steps.
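For reference, the two shifted benchmark objectives can be written as follows. The shift value below is a placeholder, since the exact constant used in the experiments is not stated on this page.

```python
# Shifted Rastrigin and Ackley functions (minimization). SHIFT is a placeholder;
# the constant actually used in the paper's experiments is not given here.
import numpy as np

SHIFT = 2.5  # placeholder; moves the optimum away from the origin

def rastrigin(x, shift=SHIFT):
    z = np.asarray(x, dtype=float) - shift
    return 10.0 * z.size + np.sum(z**2 - 10.0 * np.cos(2.0 * np.pi * z))

def ackley(x, shift=SHIFT):
    z = np.asarray(x, dtype=float) - shift
    d = z.size
    return (-20.0 * np.exp(-0.2 * np.sqrt(np.sum(z**2) / d))
            - np.exp(np.sum(np.cos(2.0 * np.pi * z)) / d) + 20.0 + np.e)

# Both shifted functions have their global minimum (value 0) at x = SHIFT * ones(d).
```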
The following animations visualize the LLM-based optimization process. We notice how the LLM gradually moves the current best estimate closer and closer to the global minimum. Our results indicate that LLM-based numerical optimization is competitive with standard optimization algorithms (Adam, Nelder-Mead, gradient descent) in low-dimensional settings (up to the 8 dimensions studied in the paper).
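As a point of comparison, a classical baseline such as Nelder-Mead can be run on the same shifted objectives. The sketch below reuses the `rastrigin` function defined above and mirrors the 100-update-step budget; the exact baseline settings used in the paper may differ.

```python
# Hedged baseline sketch: Nelder-Mead on the shifted Rastrigin function in 8D,
# reusing `rastrigin` from the snippet above. Settings here are illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x0 = rng.uniform(-5.0, 5.0, size=8)             # random 8D starting point
result = minimize(rastrigin, x0, method="Nelder-Mead",
                  options={"maxiter": 100})      # mirrors the 100-step budget
print("best point:", result.x, "value:", result.fun)
```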
Numerical Optimization Quantitative Results
Related Works by this Team: