Background

Let's say we want to bake a cake and we've lost part of the instructions. Now we need to figure out what happens between preheating the oven and putting the batter-filled baking pan in the oven, so we can successfully bake our cake. First, let's turn to GPT-3 for advice. Following the Chain of Thought method with GPT-3, we get the following result:

Preheat the oven → Prepare the batter → Put baking pan into the oven

While this is good information, it doesn't really represent the way we think. Humans are visual creatures, so when we're asked this question, we don't just deduce that we need to prepare the batter; we also visualize the process of preparing it. The text produced, however, is not visually descriptive, so it doesn't show that visual process. So, let's try to fix this by using images to create a Chain of Images (CoI) to determine the intermediate step.

This is also pretty good! However, the intermediate image generated is not very consistent with the images on either side. In addition, in more complex problems, the image alone won't give us enough information to understand the intermediate step, i.e. the logical gap.


It looks like the outputs from CoI (Chain of Images) and CoT (Chain of Thought) both have their downsides. The CoT output isn't very visually descriptive, and the CoI output isn't very consistent or grounded in the instructions. In addition, it might be clearer to humans if they can see both the text and the visual instructions when baking a cake.


To produce the ideal intermediate step, we propose using visuals to augment the Chain of Thought method and create multimodal infillings (text-image pairs that fill in logical gaps). To ensure consistency and novelty in the generated text and images, we use the multimodal information we already have to guide the generation (more on this in the VCoT Process section), and we use image captions to guide the generation of visually descriptive text. We call this method VCoT, or Visual Chain of Thought. Let's see what VCoT gives us!

Preheat the oven

Gather ingredients and stir them together in a mixing bowl.

Put baking pan into the oven

We can see that the generated image is much more consistent with the images surrounding it. In addition, each infilling has an image and visually descriptive text, which gives us enough information to understand the intermediate step. We can apply this method to fill in gaps with multimodal infillings for any kind of sequential information or dataset, augmenting the information we have, which can help improve many real-world tasks like summarization and storytelling. Now that it's clear that VCoT can give us good results when filling in logical gaps, let's take a closer look at the VCoT process!

The VCoT Process

The VCoT Process can be broken down into 3 steps: task unification, multipoint foveation, and generating multimodal infillings.
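Before going through the three steps, it may help to pin down the unit the pipeline passes around: every element of a sequence is a text-image pair. Below is a minimal sketch of that representation; the class and field names are purely illustrative and not taken from the authors' code.

```python
# Minimal sketch of the unit the pipeline operates on: a text-image pair.
# Names here are illustrative, not from the authors' implementation.
from dataclasses import dataclass
from typing import Optional
from PIL import Image

@dataclass
class Step:
    text: str                      # textual description of the step
    image: Optional[Image.Image]   # accompanying image (None until generated)
```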



Task unification: This step is concerned with reformatting the datasets by generating text-image pairs. For example, let's say we have these two steps from a WikiHow article about how to harvest honey:


Step 1: It is generally recommended to only harvest honey from a 3-part OATH hive if the hive has been established for 12-18 months and the area does not experience temperatures below 18 degrees Celsius.


Step 2: A hive tool can be used to separate the top box from the bottom two boxes of a beehive.



To use the VCoT method, we need multimodal information (text-image pairs), but the WikiHow steps above only give us text. So we generate images for each of the steps.

Step 1: It is generally recommended to only harvest honey from a 3-part OATH hive if the hive has been established for 12-18 months and the area does not experience temperatures below 18 degrees Celsius.
Step 2: A hive tool can be used to separate the top box from the bottom two boxes of a beehive.
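As a rough illustration of what task unification could look like in code, the sketch below generates an image for each text-only step with an off-the-shelf text-to-image model. Using Stable Diffusion through the diffusers library is an assumption for illustration, not necessarily the exact model behind our method.

```python
# Sketch: task unification — give each text-only WikiHow step a generated image
# so every step becomes a text-image pair. Stable Diffusion via diffusers is an
# illustrative choice of text-to-image model.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

steps = [
    "It is generally recommended to only harvest honey from a 3-part OATH hive "
    "if the hive has been established for 12-18 months and the area does not "
    "experience temperatures below 18 degrees Celsius.",
    "A hive tool can be used to separate the top box from the bottom two boxes "
    "of a beehive.",
]

# One generated image per step turns the text-only steps into text-image pairs.
pairs = [(text, pipe(text).images[0]) for text in steps]
```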

Multipoint Foveation: For this step, we determine the important details from each text-image pair produced by the task unification step. To get the image information, we use captions from an image-captioning model. An example of multipoint foveation is shown below:

For the two text-image pairs above, we generate the following foveation:

Harvesting honey from a 3-part OATH hive using a hive tool.
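In code, multipoint foveation might look like the sketch below: each image is captioned, and the captions plus step text are handed to a language model that distills them into one focused sentence. BLIP is an illustrative choice of captioning model, and call_llm is a hypothetical stand-in for whichever language model produces the foveation.

```python
# Sketch: multipoint foveation — caption each image, then ask a language model to
# distill the important shared details of the surrounding steps into one sentence.
# BLIP is an illustrative captioner; call_llm is a hypothetical text-generation hook.
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption(image):
    inputs = processor(image, return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

def foveate(pairs, call_llm):
    # pairs: list of (step_text, step_image) tuples from task unification
    lines = [f"Step text: {text}\nImage caption: {caption(img)}" for text, img in pairs]
    prompt = ("Summarize, in one sentence, the important details shared by these steps:\n\n"
              + "\n\n".join(lines))
    return call_llm(prompt)
```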



Generate Multimodal Infillings: Now, we pass in our foveation from Step 2 and the two text-image pairs generated in Step 1 to generate an intermediate "infilling" (a text-image pair that fills in the logical gap between the two steps passed in).


For example, once we pass in the foveation from Step 2 and the text-image pairs from Step 1, we generate Step 1.5:

Step 1: It is generally recommended to only harvest honey from a 3-part OATH hive if the hive has been established for 12-18 months and the area does not experience temperatures below 18 degrees Celsius.
Step 1.5: Carefully inspect the frames of the hive to assess the amount of honey that can be harvested.
Step 2: A hive tool can be used to separate the top box from the bottom two boxes of a beehive.

We can continue to generate infillings until we determine that there are no more logical gaps. For example, in the sequence above, we can continue the process recursively to generate Steps 1.25 and 1.75. In our method, we fix the recursive depth at 2.
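A minimal sketch of this recursive infilling step is below, assuming hypothetical call_llm and text_to_image hooks standing in for the language model and text-to-image model; with a depth of 2 it produces Steps 1.25, 1.5, and 1.75 between the two original steps.

```python
# Sketch: recursive multimodal infilling between two adjacent steps, with the
# recursive depth fixed at 2 as described above. call_llm and text_to_image are
# hypothetical hooks for the language model and text-to-image model.
def generate_infilling(left, right, foveation, call_llm, text_to_image):
    # left / right: (text, image) pairs on either side of the logical gap
    prompt = (f"Overall focus: {foveation}\n"
              f"Previous step: {left[0]}\n"
              f"Next step: {right[0]}\n"
              "Describe, in one visually descriptive sentence, the missing step in between:")
    text = call_llm(prompt)
    return (text, text_to_image(text))

def fill_gaps(left, right, foveation, call_llm, text_to_image, depth=2):
    # Returns the ordered infillings between left and right (e.g. 1.25, 1.5, 1.75).
    if depth == 0:
        return []
    mid = generate_infilling(left, right, foveation, call_llm, text_to_image)
    before = fill_gaps(left, mid, foveation, call_llm, text_to_image, depth - 1)
    after = fill_gaps(mid, right, foveation, call_llm, text_to_image, depth - 1)
    return before + [mid] + after
```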

A high-level picture of our method for generating infillings. These infillings can be used to improve a variety of downstream tasks, such as storytelling and summarization.


Results


In this figure, it's clear that the in-between image (Step t) generated by the VCoT process is much more consistent with Steps t-1 and t+1 than the image generated by the CoI (Chain of Images) baseline. The CoI baseline shows an image of a woman painting, which is not consistent with the workspace, desks, and computers in Steps t-1 and t+1. On the other hand, the VCoT image is much more consistent with the computers and desks in Steps t-1 and t+1. The text from VCoT is also more consistent with Steps t-1 and t+1 and flows better than the text generated through CoT (Chain of Thought).

In every example, we can see that the top image and text (generated through VCoT) are more consistent with, and more informative about, the images on the left and right. For example, in Example 1, the image generated with VCoT matches the home interior shown in the image on the right, while the home interior in the CoI (Chain of Images) baseline's image is not consistent with the left and right images.

Examples of Multimodal Infillings

