Illiterate DALL-E Learns to Compose


Gautam Singh¹ Fei Deng¹ Sungjin Ahn²


¹Rutgers University ²KAIST



ICLR 2022



Code | arXiv


CLEVR-Mirror Dataset Download

SLATE: SLot Attention TransformEr

Overview of our proposed model with respect to prior works.

DALL-E learns to map text prompts to images. It treats the words in the prompt as composable units for generating the desired novel image. These generations have shown impressive zero-shot generalization to novel text prompts, including the ability to combine abstract and somewhat unrelated concepts such as "Avocado" and "Chair" into plausible images. Crucially, DALL-E, equipped with the Image-GPT decoder, generates images with global consistency because each pixel predicted by the decoder depends non-linearly on all previously generated pixels and the input word embeddings.

Slot Attention: However, DALL-E requires text supervision for training, and this is a major limitation of the model. In contrast, Slot Attention provides an auto-encoding framework in which object slots act as the composable units and are inferred purely from raw images. However, Slot Attention uses a pixel mixture decoder to decode slots into images. In this decoder, each pixel is composed via a simple weighted sum of slot-wise predictions, computed without any dependence on other pixels or slots, which harms the global consistency and quality of the composed images.
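To make the contrast concrete, here is a minimal PyTorch-style sketch of a pixel mixture decoder. The `decoder` callable and the tensor shapes are illustrative assumptions rather than the exact Slot Attention implementation; the key point is that every pixel of the composed image is a weighted sum of slot-wise predictions that are decoded independently of each other.

```python
import torch
import torch.nn.functional as F

def mixture_decode(slots, decoder):
    """Pixel mixture decoding (illustrative sketch).

    slots:   (batch, num_slots, slot_dim) object slots.
    decoder: any per-slot decoder mapping one slot to a (4, H, W) map
             holding RGB channels plus an alpha logit.
    """
    B, K, D = slots.shape
    # Decode every slot independently of all other slots and pixels.
    maps = decoder(slots.reshape(B * K, D))            # (B*K, 4, H, W)
    maps = maps.reshape(B, K, 4, *maps.shape[-2:])
    rgb, alpha = maps[:, :, :3], maps[:, :, 3:]
    # Each pixel is a weighted sum of the slot-wise predictions,
    # with no interaction across pixels or across slots.
    weights = F.softmax(alpha, dim=1)                  # normalize over slots
    return (weights * rgb).sum(dim=1)                  # (B, 3, H, W) image
```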


Can we combine the best of both models?


Proposed Model: Our model, SLATE, combines the best of both models. Like Slot Attention, it is free of text-based supervision, and like DALL-E, it decodes novel image compositions with better global consistency and better quality using a decoder based on Image-GPT.
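The following is a rough sketch of this idea rather than the exact training code: the image is tokenized by a discrete VAE, object slots are inferred from the image, and a transformer is trained to predict the image tokens autoregressively, conditioned on the slots. The module names (`dvae`, `slot_encoder`, `transformer`) and the `cond=` argument are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def slate_decoder_loss(image, dvae, slot_encoder, transformer):
    """Conceptual SLATE decoding objective (illustrative sketch).

    dvae:         discrete VAE that maps an image to a sequence of tokens.
    slot_encoder: Slot Attention encoder producing object slots.
    transformer:  autoregressive decoder predicting the next image token
                  given all previous tokens and all slots.
    """
    tokens = dvae.tokenize(image)                 # (B, T) discrete image tokens
    slots = slot_encoder(image)                   # (B, K, D) object slots
    # Each token prediction attends to *all* previous tokens and all slots,
    # unlike the mixture decoder, where pixels are predicted independently.
    logits = transformer(tokens[:, :-1], cond=slots)   # (B, T-1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
```

At generation time, the same transformer samples tokens one by one conditioned on an arbitrary set of slots, and the discrete VAE decoder maps the sampled tokens back to pixels.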

UNSUPERVISED OBJECT DISCOVERY

SLATE implicitly discovers object slots from images, including images with complex textures, as illustrated below.

Illustration 1a. Object discovery in SLATE for texturally-simple datasets.

Illustration 1b. Comparison of object discovery in Textured-MNIST and CelebA. We note that our model produces more meaningful segmentations.

Illustration 1c. Comparison of object discovery in CLEVRTex. We note that our model produces more accurate segmentations of the objects.

BUILDING THE CONCEPT LIBRARY

The proposed model provides us with object-centric slot representations of a given image. However, when we humans gather experience and observe various objects, we build an internal library of those observed objects and concepts. This enables us to imagine novel scenes by composing the objects that we have seen in the past in novel ways. We would like to endow intelligent agents with the same ability. For this, we consider a set of images as the experience of the agent and build a library of concepts from it using our object-centric encoder. We then use the concepts from the library to imagine and compose novel scenes using our proposed decoder.


To build such a concept library, we take each image in the given set and extract its slots. Next, we apply K-Means to the aggregated set of slots to cluster them into concepts. This process is visualized below.
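A minimal sketch of this procedure is given below, assuming a trained slot encoder (the function names are illustrative) and using scikit-learn's KMeans; each resulting cluster is treated as one concept in the library.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_concept_library(images, slot_encoder, num_concepts):
    """Cluster slots from a set of images into a concept library (sketch)."""
    all_slots = []
    for image in images:
        slots = slot_encoder(image)            # (num_slots, slot_dim) per image
        all_slots.append(np.asarray(slots))
    all_slots = np.concatenate(all_slots, axis=0)      # aggregate all slots
    # Each cluster corresponds to one concept (e.g. "hair", "face", "dress").
    kmeans = KMeans(n_clusters=num_concepts, n_init=10).fit(all_slots)
    library = {c: all_slots[kmeans.labels_ == c] for c in range(num_concepts)}
    return library, kmeans
```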

Illustration 2. Building concept library using slots obtained from a set of given images.

BENEFITS OF OUR DECODER

Benefit I. Our decoder composes scenes with better global consistency than the mixture decoder

While Slot Attention with the pixel mixture decoder provides us with object-centric slots, these slots are optimized only to reconstruct the given image; nothing in the objective encourages them to be recomposable in other contexts. However, we would like to use the slots as re-usable concepts, that is, we would like to combine arbitrary slots obtained from arbitrary images and compose a novel scene. We test this ability both for Slot Attention with its pixel mixture decoder and for the proposed model.
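Concretely, recomposition amounts to mixing slots from different images before decoding. The sketch below illustrates the two cases shown in the figures, swapping a single slot and drawing a random slot prompt; `decoder.generate` and the tensor layout are assumptions made for illustration.

```python
import torch

def swap_slot(slots_a, slots_b, index, decoder):
    """Replace one slot of image A with the matching slot of image B,
    then decode the novel composition (illustrative sketch)."""
    composed = slots_a.clone()                 # (1, num_slots, slot_dim)
    composed[:, index] = slots_b[:, index]     # e.g. swap in the "hair" slot
    return decoder.generate(composed)          # render the composed scene

def random_slot_prompt(slot_bank, num_slots):
    """Build a slot prompt by drawing each slot position from a different,
    randomly chosen source image.

    slot_bank: (num_images, num_slots, slot_dim) slots from a set of images.
    """
    picks = torch.randint(0, slot_bank.size(0), (num_slots,))
    return torch.stack([slot_bank[picks[k], k] for k in range(num_slots)])
```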

Our Model


Slot Attention

Illustration 3. Top: Compositions from randomly drawn slot prompts in the Bitmoji dataset using our model. Bottom: Compositions from randomly drawn slot prompts using Slot Attention with the mixture decoder. We note that the compositions generated by the pixel mixture decoder are incoherent due to the lack of interaction between the pixels and the slots being composed. To see the true images of this dataset, click here. To see more generation samples from our model, click here.


Illustration 4. Out of Distribution Hair Slot Replacements in Bitmoji in Our Model. Top: Replacement of hair slot on a female face with male hair. Bottom: Replacement of hair slot on a male face with female hair. To see the true images of this dataset, click here.


Illustration 5. Dress Slot Replacements in Bitmoji in Our Model. Top: Replacement of the dress slot on a given image.


Illustration 6. Face Slot Replacements in Bitmoji in Our Model. Top: Replacement of the face slot on a given image.


Slot Attention

Illustration 7. Out of Distribution Hair Slot Replacements in Bitmoji using Slot Attention with the Mixture Decoder. Replacement of the hair slot on a female face with male hair. We note again that the replacement slots affect only the color of the hair, while its shape remains that of the source image. Overall, this produces an incoherent and undesired generation. To see the true images of this dataset, click here.


Our Model

Slot Attention

Illustration 8. Comparison of Compositions between Our Model and Slot Attention in 3D Shapes. Left: Generations from our model. Right: Generations from Slot Attention, which uses the pixel mixture decoder. We observe that with the pixel mixture decoder, the object shadows are inconsistent with the actual objects due to the lack of interaction between the pixels and the slots being composed. Our model is significantly more robust to this issue. To see the true images of this dataset, click here. To see more generation samples from our model, click here.

Benefit II. Our decoder renders images with better visual quality than the mixture decoder

When using a pixel mixture decoder for object-centric learning, previous works faced a dilemma between the decoding capacity of the object-wise decoder and the ability of the model to decompose the scene into objects. Our model resolves this dilemma: it not only discovers objects well but also renders the details of images with higher quality. We demonstrate below that our proposed decoder provides this benefit.

Our Model

Slot Attention

Illustration 9. Comparison of Compositional Generation between our model and Slot Attention with a pixel mixture decoder. We note that our powerful decoder is able to produce the composed image with significantly finer details and textures.


Out of Distribution (OOD) Generation: Further note that the models were trained only on images containing a single tower and were never shown two towers side by side. Nevertheless, we find that our decoder is able to generalize and render two-tower scenes. To see the true images of this dataset, click here. To see more samples generated by our model, click here.





Our Model


Slot Attention

Illustration 10. Comparison of Compositional Scene Generation for CLEVR-Mirror between our model and Slot Attention with the mixture decoder. Randomly drawn objects are added one by one. We note that in our model, the rendered reflections of the objects in the mirror are significantly clearer and more distinct than those drawn by the pixel mixture decoder.


Out of Distribution (OOD) Generation: Note that the models were trained on images containing 3-6 objects only. However, our decoder also generalizes to rendering 0-2 and 7-8 objects, settings never shown during training. To see the true images of this dataset, click here. To see more generation samples from our model, click here.