Multimodal Web Navigation with Instruction-Finetuned Foundation Models

Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixiang Shane Gu, Izzeddin Gur

International Conference on Learning Representations (ICLR 2024)

OpenReview, arXiv

Overview

The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and on domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision encoder with temporal and local perception on a large corpus of demonstrations. We empirically demonstrate that this recipe improves the agent's capabilities in grounded multimodal perception, HTML comprehension, and multi-step reasoning, outperforming prior works by a significant margin. On MiniWoB, we improve over the previous best offline methods by more than 45.8%, even outperforming the online-finetuned SoTA, humans, and GPT-4-based agents. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. Furthermore, WebGUM exhibits strong positive transfer to real-world planning tasks on Mind2Web. We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction.

Autonomous Web Navigation

Autonomous web navigation is a sequential decision-making problem in which an agent controls a computer or browses the Internet to satisfy given instructions. Our model takes a command for a web-based task as a natural language instruction (e.g., in an email client, "Find Gisele's email and forward it to Siana, please.") and uses multimodal observations of the computer interface to complete the task through a sequence of computer actions such as click and type.
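As a rough illustration of this interface, the sketch below models one step of the task with hypothetical Observation and Action containers; the class and field names are our own assumptions for exposition, not the paper's actual code.

```python
# Illustrative sketch of the web-navigation interface (names are assumptions).
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class Observation:
    instruction: str   # natural-language command, e.g. "Forward Gisele's email to Siana."
    html: str          # raw HTML of the current page
    screenshot: bytes  # rendered screenshot of the browser viewport (e.g., PNG bytes)


@dataclass
class Action:
    kind: Literal["click", "type"]  # discrete action type
    ref: str                        # identifier of the target element on the page
    text: Optional[str] = None      # text to enter for "type" actions


# An episode is a sequence of (observation, action) steps that fulfills the instruction.
step = (
    Observation(instruction="Forward Gisele's email to Siana.",
                html="<html>...</html>", screenshot=b"..."),
    Action(kind="click", ref="email_gisele"),
)
```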

Bottleneck in Current Web Navigation

Prior works have framed web navigation as online RL, learning the optimal action distribution with task-specific models trained from scratch. However, online RL requires massive trial and error and is often infeasible in practice, since failures in web navigation can have undesirable consequences (e.g., a wrong password may freeze the account, and sending an email to the wrong person could be problematic in a business setting). Offline training from static datasets is a promising alternative for the safe development of web agents, but its performance has been sub-optimal compared to online RL counterparts. Another issue is that many prior works cannot generalize from rich out-of-domain data, as they usually use specialized models to explicitly handle the hierarchical structure of the document object model (DOM) and its dependencies, for example with LSTMs, self-attention, or GNNs. Furthermore, many of them only output a fixed set of categorical actions, which is unfavorable for truly open-ended web navigation in the real world.

Web navigation via Grounded Understanding Models (WebGUM)

To overcome the bottlenecks above, we introduce a competitive offline learning recipe for autonomous web agents: we remove such web-specific architectures and cast web navigation into a visual question-answering format (text, image → text). This allows us to leverage pre-trained foundation models as rich prior knowledge about the web, and to learn capable agents even with purely offline training.
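The sketch below shows one plausible way to cast a single navigation step into this (text, image → text) format. The prompt layout and action syntax are illustrative assumptions, not the exact serialization used in the paper.

```python
# Hypothetical serialization of one step into text-to-text form (layout assumed).

def build_model_input(instruction: str, action_history: list[str], html: str) -> str:
    """Serialize instruction, previous actions, and HTML into a single text prompt."""
    history = " ".join(action_history) if action_history else "None"
    return f"Instruction: {instruction} History: {history} HTML: {html}"


def parse_action(output_text: str) -> dict:
    """Parse decoded text such as 'click id=forward' or 'type id=search text=shoes'.

    This naive parser assumes whitespace-free field values; it is only a sketch.
    """
    parts = output_text.split()
    action = {"kind": parts[0]}
    for field in parts[1:]:
        key, _, value = field.partition("=")
        action[key] = value
    return action


prompt = build_model_input(
    instruction="Find Gisele's email and forward it to Siana.",
    action_history=["click id=email_gisele"],
    html="<html>...</html>",
)
# The model consumes (prompt, screenshot) and emits an action string,
# e.g. "click id=forward", which parse_action turns back into an executable action.
```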

WebGUM is a multimodal encoder-decoder transformer model. It takes screenshots, the action history, the instruction, and HTML as inputs. Visual tokens carry temporal information from the two most recent steps and local information from 16 × 16 patches, extracted with a pre-trained vision transformer (ViT). The multimodal language-image tokens are fed into a pre-trained T5 encoder-decoder transformer, which then predicts executable actions in text format.
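A minimal sketch of this fusion with off-the-shelf Hugging Face components is given below: ViT patch embeddings are projected into the T5 embedding space and prepended to the instruction/HTML token embeddings. This is our own approximation under assumed choices (Base-size checkpoints, a linear projection, the example action string), not the released WebGUM implementation.

```python
# Sketch of ViT-patch + T5 token fusion (assumptions: model sizes, fusion scheme).
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5ForConditionalGeneration, ViTModel


class WebAgentSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.t5 = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
        # Project ViT patch features into the T5 embedding space.
        self.proj = nn.Linear(self.vit.config.hidden_size, self.t5.config.d_model)

    def forward(self, pixel_values, input_ids, attention_mask, labels=None):
        # Visual tokens: one embedding per 16x16 patch of the screenshot
        # (recent frames could likewise be stacked along the sequence dimension).
        patch_feats = self.vit(pixel_values=pixel_values).last_hidden_state
        visual_tokens = self.proj(patch_feats)                    # (B, P, d_model)
        text_tokens = self.t5.get_input_embeddings()(input_ids)  # (B, T, d_model)
        # Concatenate image and text tokens and feed them to the T5 encoder-decoder.
        inputs_embeds = torch.cat([visual_tokens, text_tokens], dim=1)
        visual_mask = torch.ones(visual_tokens.shape[:2], dtype=attention_mask.dtype)
        full_mask = torch.cat([visual_mask, attention_mask], dim=1)
        return self.t5(inputs_embeds=inputs_embeds, attention_mask=full_mask, labels=labels)


tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = WebAgentSketch()
enc = tokenizer("Instruction: forward the email. HTML: <html>...</html>", return_tensors="pt")
tgt = tokenizer("click id=forward", return_tensors="pt")  # example action string (assumed syntax)
out = model(pixel_values=torch.randn(1, 3, 224, 224),
            input_ids=enc.input_ids, attention_mask=enc.attention_mask,
            labels=tgt.input_ids)
print(out.loss)  # behavioral-cloning loss on the text action
```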

As the base language model, we leverage Flan-T5, an instruction-finetuned LLM trained on large-scale instruction-following problems and few-/zero-shot chain-of-thought examples across various domains, including reasoning and programming. Because web navigation is inherently an instruction-following task, such carefully instruction-finetuned models are expected to generalize well, improving alignment with user instructions and zero-shot reasoning in the web-navigation, interactive decision-making context.

Experiments on MiniWoB++

Even with a small 2.8K-episode dataset and a Base-size model (310M parameters), WebGUM significantly outperforms previous offline methods for web navigation (WebN-T5, CC-Net (SL)). By scaling the dataset and model size, WebGUM achieves a 94.2% success rate, exceeding the previous best offline model, WebN-T5, by over 45.8% and even surpassing the online-RL-finetuned SoTA, CC-Net (SL+RL), despite fully offline training and far less experience. Moreover, WebGUM surpasses humans and recent LLM-based agents such as RCI and AdaPlanner, even when they are built on GPT-4.

Experiments on WebShop

On the WebShop benchmark, an online-shopping website simulator, WebGUM achieves a 45.0% success rate, significantly outperforming not only simple baselines such as supervised imitation learning (IL), IL plus RL-finetuning, and WebN-T5 (by more than 15%), but also recent prompt-based LLM agents, including ReAct (i.e., PaLM-540B with a one-shot prompt and reasoning annotations), while our model has only 3 billion parameters.

Experiments on Mind2Web

On the Mind2Web benchmark, a real-world demonstration dataset, WebGUM transferred from MiniWoB (trained with 401K episodes) achieves superior performance to MindAct-Large/XL and even GPT-4 in all categories (cross-task/website/domain). Because both MindAct and WebGUM are based on Flan-T5, these results indicate that WebGUM exhibits strong positive transfer to real-world action prediction.

Example Videos on MiniWoB++

book-flight

click-button

click-shape

enter-text-dynamic

multi-layouts

click-checkboxes-soft

click-widget

grid-coordinate

multi-orderings

search-engine

click-checkboxes-transfer

count-shape

identify-shape

navigate-tree

social-media-all

click-dialog

enter-password

email-inbox-forward-nl

login-user-popup

social-media-some