STEVE-1: A Generative Model for Text-to-Behavior in Minecraft

arXiv | PDF | Code

Abstract

Constructing AI models that respond to text instructions is challenging, especially for sequential decision-making tasks. This work introduces an instruction-tuned Video Pretraining (VPT) model for Minecraft called STEVE-1, demonstrating that the unCLIP approach, utilized in DALL•E 2, is also effective for creating instruction-following sequential decision-making agents. STEVE-1 is trained in two steps: adapting the pretrained VPT model to follow commands in MineCLIP's latent space, then training a prior to predict latent codes from text. This allows us to finetune VPT through self-supervised behavioral cloning and hindsight relabeling, bypassing the need for costly human text annotations. By leveraging pretrained models like VPT and MineCLIP and employing best practices from text-conditioned image generation, STEVE-1 costs just $60 to train and can follow a wide range of short-horizon open-ended text and visual instructions in Minecraft. STEVE-1 sets a new bar for open-ended instruction following in Minecraft with low-level controls (mouse and keyboard) and raw pixel inputs, far outperforming previous baselines. We provide experimental evidence highlighting key factors for downstream performance, including pretraining, classifier-free guidance, and data scaling. All resources, including our model weights, training scripts, and evaluation tools, are made available for further research.
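To see how the self-supervised relabeling works, here is a minimal sketch in Python. It assumes per-frame MineCLIP visual embeddings and a simple segment-wise relabeling scheme; the function name, the (15, 200) horizon range, and the exact sampling strategy are illustrative stand-ins, not the released training code.

import numpy as np

def hindsight_relabel(frame_embeds, horizon_range=(15, 200), rng=None):
    # frame_embeds: (T, D) array of per-frame MineCLIP visual embeddings
    # for a single episode. Returns a (T, D) array of goal embeddings:
    # each timestep is paired with the embedding of a future frame from
    # the same episode, so no human text annotation is needed.
    rng = rng or np.random.default_rng()
    T, _ = frame_embeds.shape
    goals = np.empty_like(frame_embeds)
    t = 0
    while t < T:
        # Pick how far ahead the goal frame lies; clamp to the episode end.
        offset = int(rng.integers(horizon_range[0], horizon_range[1]))
        goal_t = min(t + offset, T - 1)
        # Every step in [t, goal_t] is trained via behavioral cloning to
        # reach the frame at goal_t.
        goals[t : goal_t + 1] = frame_embeds[goal_t]
        t = goal_t + 1
    return goals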

Introducing STEVE-1, a generative model for Minecraft that can follow open-ended text and visual instructions.

Approach

Our approach involves two models. First, we train the policy by finetuning VPT to achieve goals given by pretrained MineCLIP [17] visual embeddings using our gameplay dataset. Second, for the prior model, we train a Conditional Variational Autoencoder (CVAE) to sample MineCLIP visual embeddings given a text prompt. The combination of these two models enables our agent to follow text and visual instructions.
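At inference time, the two models are chained: a text prompt is embedded with MineCLIP's text encoder, the CVAE prior translates it into a visual goal embedding, and the finetuned VPT policy is conditioned on that embedding at every step. The sketch below illustrates this flow; the mineclip, prior, policy, and env objects and their methods (encode_text, sample, initial_state, act) are assumed interfaces for illustration, not the released API.

import torch

@torch.no_grad()
def follow_instruction(text_prompt, mineclip, prior, policy, env, max_steps=1000):
    # 1) Embed the instruction with MineCLIP's text encoder.
    text_embed = mineclip.encode_text(text_prompt)        # e.g. shape (1, 512)
    # 2) Sample a *visual* goal embedding from the CVAE prior.
    goal_embed = prior.sample(text_embed)
    # 3) Roll out the goal-conditioned (finetuned VPT) policy on raw pixels,
    #    producing low-level mouse-and-keyboard actions at each step.
    obs = env.reset()
    state = policy.initial_state()
    for _ in range(max_steps):
        action, state = policy.act(obs, goal_embed, state)
        obs, _, done, _ = env.step(action)
        if done:
            break

For visual instructions, the prior is simply skipped: the MineCLIP embedding of a short reference clip is passed to the policy directly as the goal embedding.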

Demo: Interactive Sessions with STEVE-1 Following Text Instructions

In these clips, we use text instructions in real-time to control STEVE-1 playing Minecraft.

Demo: Interactive Sessions with STEVE-1 Following Visual Instructions

In these clips, we use visual instructions in real-time to control STEVE-1 playing Minecraft.

Demo: STEVE-1 with Prompt Chaining

In these clips, we chain together different instructions to get STEVE-1 to accomplish tasks that require multiple parts. Halfway through each clip, we switch to the second prompt. A minimal code sketch of this chaining follows the examples below.

Gather dirt -> Build a tower

Gather wood -> Make wooden planks
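A minimal sketch of prompt chaining, using the same assumed interfaces as the inference sketch above (the steps_per_prompt value and method names are illustrative, not the released API):

import torch

@torch.no_grad()
def chain_prompts(prompts, mineclip, prior, policy, env, steps_per_prompt=600):
    # Condition the policy on each instruction in turn, e.g.
    # ["gather wood", "make wooden planks"], switching the goal embedding
    # partway through the episode while keeping the agent state and world.
    obs = env.reset()
    state = policy.initial_state()
    for prompt in prompts:
        goal_embed = prior.sample(mineclip.encode_text(prompt))
        for _ in range(steps_per_prompt):
            action, state = policy.act(obs, goal_embed, state)
            obs, _, done, _ = env.step(action)
            if done:
                return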

Citation

@article{lifshitz2023steve1,
      title={STEVE-1: A Generative Model for Text-to-Behavior in Minecraft},
      author={Shalev Lifshitz and Keiran Paster and Harris Chan and Jimmy Ba and Sheila McIlraith},
      year={2023},
      eprint={2306.00937},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Acknowledgements

All of the authors gratefully acknowledge funding for this research from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canada CIFAR AI Chairs Program (Vector Institute for Artificial Intelligence).  SL is supported by a Vector Institute internship and by an NSERC Discovery Grant. KP is supported by an NSERC PGS-D award.  HC is supported by an NSERC CGS-D award. JB acknowledges funding from the Canada CIFAR AI Chairs program, Fujitsu Japan, and an Amazon Research Award. In addition to NSERC and CIFAR (Vector Institute), SM acknowledges funding from Microsoft Research. We thank Silviu Pitis, Romi Lifshitz, Forest Yang, and Yongchao Zhou for their helpful comments; Alisa Wu and Ziming Chen for their contributions to the instruction dataset; and Finn Paster for the logo and graphic for the website. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute for Artificial Intelligence (www.vectorinstitute.ai/partners).