Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning

Abstract

Recent works have successfully leveraged Large Language Models' (LLMs) abilities to capture abstract knowledge about the world's physics to solve decision-making problems. Yet, the alignment between LLMs' knowledge and the environment can be wrong, limiting functional competence due to a lack of grounding. In this paper, we study an approach (named GLAM) to achieve this alignment through functional grounding: we consider an agent using an LLM as a policy that is progressively updated as the agent interacts with the environment, leveraging online Reinforcement Learning to improve its performance at solving goals. Using an interactive textual environment designed to study higher-level forms of functional grounding, and a set of spatial and navigation tasks, we study several scientific questions: 1) Can LLMs boost sample efficiency for online learning of various RL tasks? 2) How can they boost different forms of generalization? 3) What is the impact of online learning? We study these questions by functionally grounding several variants (size, architecture) of FLAN-T5.

Recently, LLMs were shown to capture aspects of the physical rules of our world, e.g. about space Patel and Pavlick [2022], colors Abdou et al. [2021] or even affordances between bodies and objects Ahn et al. [2022]. This form of prior knowledge has been exploited to suggest plans of action to solve goals in robotics Huang et al. [2022b], Ahn et al. [2022], Liang et al. [2022]. However, LLMs are known to suffer from a lack of grounding, which prevents them from properly dealing with the meaning of inter-related concepts and their use for functional competence in interactive environments Mahowald et al. [2023]. Indeed, the alignment between the statistical structures captured by such LLMs and a given environment can be very limited, or sometimes entirely wrong. This is partly due to 1) a training process (predicting next words) that is not directly incentivized to solve problems in an environment, 2) the lack of ability to intervene in the environment to identify causal structures, and 3) the lack of ability to learn from data collected by interacting with the environment [Bender and Koller, 2020, Bisk et al., 2020]. In such cases where the alignment between the LLM and its environment is wrong, methods like SayCan [Ahn et al., 2022] suffer because the LLM never learns from the outcomes of its decisions.


In the literature, language grounding has referred to various related objectives Thill et al. [2014]. First, symbol grounding can be formulated as the general problem of connecting a symbol system [Harnad, 1990], internal to an agent, to the environment, in such a way that internal processing of these symbols can be used to act appropriately in this environment.

One dimension of this problem is associating "elementary" symbols, such as the names of objects, with invariant structures in high-dimensional perceptual modalities such as vision Cangelosi et al. [2010], Wiriyathammabhum et al. [2016]. Such a grounding, called "direct grounding", has been extensively studied in the past, leading to various efficient methods [Alayrac et al., 2022, Radford et al., 2021, Lu et al., 2023], including in the context of robotic bodies Cangelosi and Stramandinoli [2018]. Another dimension is how to ground higher-order symbolic tokens, or abstract concepts, into elementary symbols, often through approaches such as distributional semantics Harris [1981], Boleda [2019]. This has been called "grounding transfer" Cangelosi and Stramandinoli [2018].

Beyond such mere associations, a key question about grounding is how internal processes that manipulate symbols can model, predict and control external physical and social processes: they need to be aligned with and constrained by these external dynamics and relational structures (at various levels of abstraction). This last notion of grounding, which we refer to here as "functional grounding", is relative to a particular environment, which may be the human physical environment, but also a more abstract interactive environment simulated on a computer (whose abstract physics can differ from that of human environments).


We propose the GLAM method: we use an LLM as the agent's policy in an interactive textual RL environment (BabyAI-Text), where the LLM is trained with online RL (PPO) to achieve language goals, enabling functional grounding.

Overview of the GLAM method: (a) BabyAI-Text provides a goal description for the current episode, as well as a description of the agent's observation and a scalar reward for the current step. (b) At each step, we gather the goal description and the observation into a prompt sent to our LLM. (c) For each possible action, we use the encoder to generate a representation of the prompt and compute the conditional probability of the tokens composing the action given the prompt. Once the probability of each action is estimated, we compute a softmax over these probabilities and sample an action from the resulting distribution; that is, the LLM is our agent's policy. (d) We use the reward returned by the environment to finetune the LLM with PPO. For this, we estimate the value of the current observation by adding a value head on top of our LLM. Finally, we backpropagate the gradient through the LLM (and its value head).
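To make steps (c) and (d) more concrete, the sketch below shows, in Python with PyTorch and Hugging Face transformers, how per-action probabilities can be obtained from an encoder-decoder LLM such as FLAN-T5, and how a value head can sit on top of it. This is a minimal illustration under assumptions of ours: the checkpoint name, prompt, action strings, and the value head's placement (here, an MLP over the mean-pooled encoder states) are illustrative choices rather than the paper's exact implementation, and the PPO update loop is omitted.

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Illustrative checkpoint, prompt and action set (not the paper's exact setup).
model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

actions = ["turn left", "turn right", "go forward", "pick up", "drop", "toggle"]
prompt = (
    "Goal of the agent: go to the red ball.\n"
    "Observation: You see a wall 3 steps forward, you see a red ball 2 steps left.\n"
    "Action of the agent:"
)

def action_distribution(prompt: str, actions: list) -> torch.Tensor:
    """Policy over a fixed action set: softmax of per-action log-likelihoods under the LLM."""
    enc = tokenizer(prompt, return_tensors="pt")
    scores = []
    for action in actions:
        # The tokenized action serves as decoder labels (T5 appends the end-of-sequence token).
        labels = tokenizer(action, return_tensors="pt").input_ids
        out = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels)
        # Sum the log-probabilities of the action's tokens given the prompt.
        token_logp = torch.log_softmax(out.logits, dim=-1)
        action_logp = token_logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1).sum()
        scores.append(action_logp)
    return torch.softmax(torch.stack(scores), dim=-1)

# Value head sketch: a small MLP over the mean-pooled encoder representation of the prompt.
# (The exact placement of the value head in the paper may differ.)
value_head = nn.Sequential(nn.Linear(model.config.d_model, 256), nn.Tanh(), nn.Linear(256, 1))

def state_value(prompt: str) -> torch.Tensor:
    enc = tokenizer(prompt, return_tensors="pt")
    hidden = model.get_encoder()(input_ids=enc.input_ids,
                                 attention_mask=enc.attention_mask).last_hidden_state
    return value_head(hidden.mean(dim=1)).squeeze(-1)

# One interaction step: sample an action from the policy distribution.
policy = action_distribution(prompt, actions)
action_idx = torch.multinomial(policy, num_samples=1).item()
print(f"Sampled action: {actions[action_idx]} "
      f"(p={policy[action_idx].item():.2f}, V(s)={state_value(prompt).item():.2f})")

Scoring each candidate action by the sum of its token log-probabilities (equivalently, the product of token probabilities) and normalizing across the action set is what lets a generative LLM act as a policy over a small, fixed action space; during training, PPO would backpropagate through both the LLM and the value head.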

We show how GLAM, which requires almost no environment-specific modifications of the LLM, drastically improves performance on RL tasks in this environment compared to zero-shot use of the LLM, to supervised finetuning, and to RL finetuning of non-pretrained LLMs. We also show how it boosts both sample efficiency and generalization abilities in zero-shot tests (both to new objects and to several new tasks).

In addition to these key results, we provide in-depth ablations showing the effect of several parameters (e.g. model size) on grounding. We believe this method can act as a milestone towards grounding and using LLMs in interaction with our world.

Recommended citation:

Carta, Thomas, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, and Pierre-Yves Oudeyer. Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning. In Proceedings of the 40th International Conference on Machine Learning, 3676–3713. PMLR, 2023. https://proceedings.mlr.press/v202/carta23a.html.


BibTeX:

@InProceedings{pmlr-v202-carta23a,
  title = {Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning},
  author = {Carta, Thomas and Romac, Cl\'{e}ment and Wolf, Thomas and Lamprier, Sylvain and Sigaud, Olivier and Oudeyer, Pierre-Yves},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages = {3676--3713},
  year = {2023},
  editor = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = {202},
  series = {Proceedings of Machine Learning Research},
  month = {23--29 Jul},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v202/carta23a/carta23a.pdf},
  url = {https://proceedings.mlr.press/v202/carta23a.html},
}