In this page, I briefly explain my recent experiments with Glow [1], a flow-based generative model recently proposed for image generation. To follow the terms used in this page, one needs to be familiar with the architecture and mathematical foundations of Glow.
Note: This page is rather informal and high-level and does not include the mathematical details. I am currently working on the project as my thesis, and will put the link to my thesis once it is complete.
When it comes to generative models for images, the names that pop up in everyone's mind are variational auto-encoders (VAEs) and generative adversarial networks (GANs). While these two are probably the most popular generative models, a third category, namely flow-based models, has gained much less attention, despite offering exact likelihood evaluation and exact inference (which VAEs lack) and a useful latent-space encoding (which GANs lack). Most importantly, they are efficient and parallelizable for image synthesis.
Flow-based models are built from a sequence of invertible transformations, namely the flow, which maps every point in data space (which has a very complex density) to latent space (which has a simple density, such as a unit Gaussian). Hence, they learn useful representations of data points and can be used for density estimation.
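To make the change-of-variables idea concrete, below is a minimal sketch of how a single invertible transformation yields an exact log-likelihood. It is not taken from any Glow implementation; the one-dimensional affine map and its parameters are made up purely for illustration.

```python
# A minimal, illustrative 1-D flow: z = f(x) = (x - b) / s with a standard Gaussian prior on z.
import numpy as np

s, b = 2.0, 0.5                # made-up parameters of the invertible map

def forward(x):                # data space -> latent space
    return (x - b) / s

def inverse(z):                # latent space -> data space
    return s * z + b

def log_prob(x):
    # Exact log-likelihood via the change-of-variables formula:
    #   log p_X(x) = log p_Z(f(x)) + log |df/dx|
    z = forward(x)
    log_pz = -0.5 * (z ** 2 + np.log(2 * np.pi))   # log-density of a unit Gaussian
    log_det = -np.log(abs(s))                      # here |df/dx| = 1/s
    return log_pz + log_det

print(log_prob(np.array([1.7])))   # exact density of a data point under the model
```

In a real flow such as Glow, f is a deep stack of invertible layers and the scalar log-determinant becomes a sum of per-layer log-determinants, but the likelihood computation follows exactly this pattern.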
The following image (from this blog post) might be helpful for understanding how they work. When defining transformations, the usual direction to consider is from x to z (the forward direction, as opposed to the reverse direction from z to x). However, the author of that post visualizes it the other way around, which makes no real difference; it is just notation.
Like GANs and VAEs, flow-based models can also be made conditional by conditioning their transformations on some auxiliary information during training (when learning the mapping between x and z), while still satisfying the invertibility condition by providing the same auxiliary information at synthesis time (the z-to-x mapping, obtained by sampling z and applying the transformations in reverse).
Experiments with the conditional variant of Glow [1] can be done on various datasets. As a proof of concept, I first ran an experiment on the MNIST dataset, and then extended the experiments to the Cityscapes dataset (images of street scenes), where image generation is much more challenging.
The easiest way to generate conditional digits on MNIST is to provide the label of the images when training the network. To make it even simpler, I provided the one-hot representation of the label only to the Affine Coupling layer (the most complex of the proposed transformations) in each Flow step. Hence, all the Affine Coupling transformations can observe the label of the image on which they are being trained.
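As a rough illustration of what such a label-conditional Affine Coupling layer can look like, here is a sketch in PyTorch. It operates on flattened vectors with a small fully connected network, whereas Glow uses convolutional networks, so the class, layer sizes, and interface are assumptions for illustration rather than the exact configuration I trained.

```python
# Illustrative label-conditional affine coupling layer (not the exact layer used in my experiments).
import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    def __init__(self, channels, label_dim=10, hidden=128):
        super().__init__()
        # The network sees one half of the input plus the one-hot label and
        # predicts a scale and a shift for the other half.
        self.net = nn.Sequential(
            nn.Linear(channels // 2 + label_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, channels),          # output splits into (log_scale, shift)
        )

    def forward(self, x, label_onehot):
        x1, x2 = x.chunk(2, dim=1)
        log_s, t = self.net(torch.cat([x1, label_onehot], dim=1)).chunk(2, dim=1)
        log_s = torch.tanh(log_s)                 # keep the scales well-behaved
        z2 = x2 * torch.exp(log_s) + t            # affine transform of the second half
        log_det = log_s.sum(dim=1)                # exact log-determinant of the Jacobian
        return torch.cat([x1, z2], dim=1), log_det

    def inverse(self, z, label_onehot):
        z1, z2 = z.chunk(2, dim=1)
        log_s, t = self.net(torch.cat([z1, label_onehot], dim=1)).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        x2 = (z2 - t) * torch.exp(-log_s)         # invert the affine transform exactly
        return torch.cat([z1, x2], dim=1)
```

Because the first half passes through unchanged, the same one-hot label can be fed to the layer in both directions, which is what keeps the transformation invertible while making it conditional.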
1. Interpolation
An interesting property of flow-based models, as explored in [1], is that the representations inferred by these models can be very useful in downstream tasks such as interpolating between two images. To do so, one infers the representations z1 and z2 of two desired images x1 and x2 by passing the images through the network in the forward direction (from x to z), and linearly interpolates new representations zi in between. These representations are then passed through the network in the reverse direction (from z to x) to generate new images in between. Note that this is done at inference (and not training) time.
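A minimal sketch of this procedure is below; model.forward (x to z) and model.inverse (z to x) are hypothetical method names standing in for whatever trained flow is available, and for a conditional model the same auxiliary label would be passed to both calls.

```python
# Latent-space interpolation at inference time (hypothetical model interface).
import torch

@torch.no_grad()
def interpolate(model, x1, x2, steps=8):
    z1 = model.forward(x1)                     # infer the representations of both endpoints
    z2 = model.forward(x2)
    images = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        z = (1 - alpha) * z1 + alpha * z2      # linear interpolation in latent space
        images.append(model.inverse(z))        # reverse pass back to data space
    return images
```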
Some results are as follows: the very first and last images in each row are from the dataset, and the middle images are generated by the linear interpolation discussed above.
2. New condition
With a similar methodology, one can choose the style of the images to be generated. To do so, the representation of the desired image, z_style, is inferred by a forward pass of the network. For synthesis, when we perform the reverse operations, rather than sampling from the base Gaussian distribution we use z_style together with our desired condition (a digit label 0-9) to generate new digits of that style. Essentially, we replace the randomness of sampling from the base distribution with the representation that we already inferred from the desired image. This representation essentially encodes the style of a digit (shape, thickness, etc.).
Some results are as follows: the leftmost image in each row is the one with the desired style, and the rest are the generated images with different conditions (digit labels).
There are, of course, some failure cases where the network is unable to infer a useful style from the desired image. One reason could be that these are hard samples whose styles are much less distinguishable than those of the successful cases. For instance, it is actually hard to draw a 4 that is similar in shape to the 1 in the second row, as this 1 has no real style beyond being a straight line; hence the network drew a 4 as similar as possible to this 1 by squeezing the generated 4.
3. Effect of resampling
It is known that the parts of z extracted (Gaussianized) after different Blocks of the multi-scale architecture (please refer to the Glow paper) have different interpretations: the parts of z closer to data space (extracted earlier) have more local, pixel-level, fine-grained effects on the generated image, while the parts of z closer to latent space (extracted later) have more global, abstract, coarse-grained effects on the generated image.
In order to see this effect, one infers the representation of a given image, re-samples one part of z (extracted after a Block), and keeps the rest unchanged. In the architecture that I used for training on MNIST, I used 3 Blocks, which produce z1 (closer to data space), z2 (middle), and z3 (closer to latent space), respectively.
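A sketch of this experiment is below, assuming the hypothetical model returns the multi-scale latent as a list [z1, z2, z3] (one tensor per Block) and accepts such a list in the reverse direction.

```python
# Re-sampling a single Block's portion of the latent while keeping the rest fixed.
import torch

@torch.no_grad()
def resample_part(model, x, part_index):
    zs = model.forward(x)                                     # [z1, z2, z3]
    new_zs = [torch.randn_like(z) if i == part_index else z   # re-sample only one portion,
              for i, z in enumerate(zs)]                      # keep the others unchanged
    return model.inverse(new_zs)
```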
Some results are as follows: in this example, I interpolated between two images (let's call their representations z and z') by interpolating only a specific portion of z (z1, z2, or z3), keeping the other portions unchanged (the first row is obtained by resampling all the portions of z). Note that z and its portions (z1, z2, and z3) all correspond to the leftmost image, and z' corresponds to the rightmost image in each row.
In the case of MNIST, the effect of z1 and z2 is so low-level that it is not even visually distinguishable. Changing z3, however, has global effects on the shape, thickness, etc. It might seem counter-intuitive, but it appears that most of the information about the style of the generated image is encoded in z3, which is responsible for the abstract changes in the image.
Generating images of Cityscapes (a dataset of street scenes with their semantic segmentations) is challenging, since the images contain many objects with a lot of variation. For generating real images conditioned on the segmentations (an image-to-image translation problem in computer vision), I use two Glows, one working in the segmentation domain and the other working in the data domain, connected to each other as depicted below. The Glow working in the data domain is conditioned on the output of the Glow working in the segmentation domain in all its layers. Here is the overall structure, where xA denotes a segmentation image and xB denotes a real image (from [2]):
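A very coarse sketch of the training-time forward pass of this setup is below; every class, method, and return value is hypothetical and only meant to show the direction of the conditioning, not the actual C-Flow [2] implementation.

```python
# Two coupled Glows: the data-domain flow is conditioned on the segmentation-domain flow.
import torch

def forward_pair(glow_seg, glow_data, x_seg, x_real):
    # Forward pass of the segmentation-domain Glow; keep its per-layer activations.
    z_seg, seg_acts, logdet_seg = glow_seg.forward(x_seg)
    # The data-domain Glow observes those activations as conditioning in all its layers.
    z_real, logdet_real = glow_data.forward(x_real, cond=seg_acts)
    # Both flows are trained with the usual exact log-likelihood objective.
    return z_seg, z_real, logdet_seg + logdet_real
```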
Here are some results of generating real images conditioned on segmentations with the architecture depicted above: the first block is the segmentations (conditions), the second block is the ground-truth real images, and the third block corresponds to the synthesized images.
Effect of temperature
It is known that, unlike GANs, likelihood-based models such as VAEs and flow-based models prefer diversity of the generated images over sample quality: they can generate more varied images at the cost of losing image quality, while GANs are able to generate sharper images but with much lower diversity (also known as the mode collapse problem). In order to observe this effect when working with Glow, one can reduce the std (referred to as temperature in this jargon) of the base Gaussian distribution (the latent space). This squeezing of the base distribution results in images in data space that are more realistic but less varied.
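A small sketch of temperature-scaled sampling, again with a hypothetical interface: the base Gaussian is squeezed simply by scaling the sampled z with the temperature before the reverse pass.

```python
# Sampling from the conditional Glow with a reduced-std (temperature-scaled) base distribution.
import torch

@torch.no_grad()
def sample_with_temperature(glow_data, cond, latent_shapes, temperature=0.7):
    # z ~ N(0, temperature^2 * I): temperature = 1 recovers the full prior,
    # temperature = 0 collapses every sample onto the mean of the Gaussian.
    zs = [temperature * torch.randn(shape) for shape in latent_shapes]
    return glow_data.inverse(zs, cond=cond)
```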
Some examples can be seen below: the first and second rows in each image block correspond to the segmentations and the corresponding real images, and the following rows show 5 different samples drawn with different temperatures.
One can see that with a higher temperature (a Gaussian with a larger std) the samples are more varied but of lower visual quality, while with a low temperature the samples look more visually appealing but less diverse. When working with models like Glow, one typically needs to reduce the temperature a bit to strike a balance between sample quality and sample diversity. Note that temperature = 0 (std = 0) corresponds to a single point in latent space (the mean of the Gaussian), and hence produces the same result every time.
[1] Kingma, Durk P., and Prafulla Dhariwal. "Glow: Generative flow with invertible 1x1 convolutions." Advances in Neural Information Processing Systems. 2018.
[2] Pumarola, Albert, et al. "C-Flow: Conditional Generative Flow Models for Images and 3D Point Clouds." arXiv preprint arXiv:1912.07009 (2019).