We propose Creative Agents, the first framework for handling creative tasks in an open-ended world. Using this framework, we implement a variety of embodied agents through different combinations of imaginators and controllers. Creative Agents is an initial attempt in this direction, aimed at raising awareness of building intelligent agents with creativity.
Figure 1. Overview of creative agents for open-ended creative tasks. A creative agent consists of two components: an imaginator and a controller. Given a free-form language instruction describing the creative task, the imaginator first generates an imagination in the form of text (via an LLM with Chain-of-Thought (CoT) prompting) or an image (via a diffusion model); the controller then realizes the imagination by executing actions in the environment, leveraging either the code-generation capability of a vision-language model (VLM) or a behavior-cloning (BC) policy learned from data. We implement three combinations of imaginator and controller: ① CoT+GPT-4, ② Diffusion+GPT-4V, and ③ Diffusion+BC.
Basic Requirements
MineDojo requires Ubuntu>=18.04, Python>=3.9, and Java JDK 1.8.0_171, while Voyager works on both Ubuntu>=20.04 and Windows and requires Java 17. On Windows 10, we recommend a virtual environment with Python>=3.10.
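To verify the toolchain before installing, you can check the versions on your PATH (a minimal sketch; assumes java and python3 are already installed):

java -version      # MineDojo expects JDK 1.8.0_171; Voyager expects Java 17
python3 --version  # >=3.9 for MineDojo; >=3.10 recommended on Windows 10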
Install Packages and Environments
Install MineDojo
You can download the modified MineDojo environment for Creative Agents here: Modified MineDojo Environment.
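If the download is a zip archive of the modified source, a typical way to install it in editable mode is shown below (the archive and directory names are our assumptions; follow the instructions shipped with the download if they differ):

unzip MineDojo.zip
cd MineDojo && pip install -e .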
Install Voyager
Please follow the Voyager Install Tutorial. Note that you should complete all four parts of it.
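For reference, the Python part of the Voyager setup typically looks like this (a sketch based on the Voyager repository; the tutorial is authoritative and also covers the Node.js and Minecraft-side setup):

git clone https://github.com/MineDojo/Voyager
cd Voyager
pip install -e .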
Install BC Controller
Download the code of the BC controller at https://doi.org/10.5281/zenodo.10346783, then extract the zip file into your working directory.
Install Dependencies and Download Models and Datasets (for Training)
To install packages for BC Controller, run:
pip install -r ./BC_Controller/requirements.txt
After downloading our models and datasets, merge any split archives by running:
cat NAME.zip.* > NAME.zip
Then unzip the archives, move the models to ./BC_Controller/models, and move the datasets to ./BC_Controller/datasets.
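Concretely, the unpack-and-move step might look like the following (NAME is a placeholder, and the archive's internal layout is our assumption; adjust the paths to whatever the zip actually contains):

unzip NAME.zip
mv models ./BC_Controller/models      # assumed archive layout
mv datasets ./BC_Controller/datasets  # assumed archive layout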
Get Started
Our models, code, and dataset have been released. You can find them at:
https://zenodo.org/records/10251970 (code of GPT-based controller)
https://doi.org/10.5281/zenodo.10346783 (code of BC controller)
https://doi.org/10.5281/zenodo.10275049 (dataset)
Before running CoT+GPT-4 or Diffusion+GPT-4V, please make sure you have an OpenAI API key with access to GPT-4 and GPT-4V(ision).
In our experiments, we use the Minecraft Official Launcher instead of Microsoft Azure, following this link. After you open your world to LAN from the in-game pause menu, Minecraft prints the port number in the chat; pass it as <LAN_PORT> below.
To test with CoT+GPT-4, run:
python cot_gpt4.py --api_key <API_KEY> --task <TASK_DESCRIPTION> --mc_port <LAN_PORT>
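For example, a hypothetical invocation (substitute your own API key, task description, and LAN port):

python cot_gpt4.py --api_key sk-XXXX --task "build a sandstone pyramid in the desert" --mc_port 55415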
To test with Diffusion+GPT-4V, run:
python diffusion_gpt4.py --api_key <API_KEY> --image_path <IMAGE_PATH> --mc_port <LAN_PORT>
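Similarly, for Diffusion+GPT-4V (the image path is a hypothetical example; point it at an image produced by the diffusion imaginator):

python diffusion_gpt4.py --api_key sk-XXXX --image_path ./imaginations/house.png --mc_port 55415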
To test with Diffusion+BC, run:
python ./BC_Controller/BC_pipline.py
Diffusion+BC results (images, voxels, voxel sequences, and the final building) will be saved in ./BC_Controller/results.
To use your own prompts for building, modify ./BC_Controller/results/validation_prompt.txt.
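For instance, validation_prompt.txt could contain free-form building descriptions such as the following (hypothetical examples; we assume one prompt per line):

a cozy wooden cabin with a chimney
a tall stone watchtower with a spiral staircase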
To train the image generation part of Diffusion+BC, run:
python ./BC_Controller/txt2img/train_text_to_image_lora.py
To train the voxel generation part of Diffusion+BC, run:
python ./BC_Controller/img2vox/runner.py
To train the sequence part of Diffusion+BC, run:
python ./BC_Controller/vox2seq/train.py
Showcases and Demonstrations
Figure 2. Examples of the language description, the generated visual imagination, and the created building for each variant of creative agents. The visual imaginations generated by the diffusion model show great diversity, an important manifestation of creativity.