Introduction:
Variational Autoencoders (VAEs) for text generation were introduced by Bowman et al. (2016). Interpolating through the continuous latent space resulted in smooth transitions between sentences that remained meaningful, both grammatically and practically. However, the work also described the problem of posterior collapse.
Fig 1. RNN-based VAE for text generation. Image taken from Bowman et al. (2016)
This situation arises due to the nature of the encoders and decoders used. RNN-based decoders are generally competent enough to predict the next token autoregressively on their own. Referencing the architecture in the figure above, during training the input tokens are fed into the decoder along with the latent code. The decoder learns to reproduce the input tokens from the tokens alone in order to minimize the reconstruction loss, and the KL divergence term eventually vanishes. As a result, the decoder ignores the information in the latent space and loses its generative ability.
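To make the collapse concrete, below is a minimal PyTorch sketch of the VAE training objective, assuming a diagonal Gaussian posterior N(mu, sigma^2) and a standard normal prior; the function and argument names are illustrative, not the original implementation. Posterior collapse shows up as the KL term going to zero while the reconstruction term keeps improving.

```python
import torch
import torch.nn.functional as F

def vae_loss(logits, targets, mu, logvar, kl_weight=1.0):
    # Reconstruction term: token-level cross-entropy between the decoder's
    # predictions and the input sentence it is asked to reproduce.
    recon = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), reduction="sum"
    )
    # KL divergence between the approximate posterior N(mu, sigma^2)
    # and the standard normal prior N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Posterior collapse: kl -> 0 while recon keeps improving, i.e. the
    # decoder models the text from the tokens alone and ignores z.
    return recon + kl_weight * kl, recon, kl
```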
Setup:
In order to solve this problem, our approach was to make the decoder depend more on the latent space than on the input tokens. To this end, we introduced the concept of timestep-wise latent spaces, inspired by Chung et al. (2015) (https://arxiv.org/abs/1506.02216). The structure of the timestep-wise latent spaces is shown in Figure 2. The omission of the input tokens from the decoder is shown in Figure 3.
Fig 2. RNN decoder dependent on input tokens and timestep-wise latent spaces during training
Fig 3. RNN decoder dependent only on timestep-wise latent spaces during training
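As a rough illustration of Figures 2 and 3, the sketch below shows a GRU decoder that receives a latent vector at every timestep. The class and parameter names are assumptions for illustration, and the encoder that produces the per-timestep latents (e.g., a VRNN-style posterior as in Chung et al. (2015)) is omitted. With the token inputs provided it corresponds to Figure 2; with them omitted it corresponds to Figure 3.

```python
import torch
import torch.nn as nn

class TimestepLatentDecoder(nn.Module):
    """Hypothetical decoder conditioned on a latent z_t at every timestep."""

    def __init__(self, latent_dim, emb_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Input at each step is [z_t ; x_{t-1}] (Figure 2) or just z_t (Figure 3).
        self.rnn_tok = nn.GRU(latent_dim + emb_dim, hidden_dim, batch_first=True)
        self.rnn_lat = nn.GRU(latent_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, z_seq, tokens=None):
        # z_seq: (batch, T, latent_dim); tokens: (batch, T) previous tokens.
        if tokens is not None:
            inp = torch.cat([z_seq, self.embed(tokens)], dim=-1)
            h, _ = self.rnn_tok(inp)     # Figure 2: tokens + latents
        else:
            h, _ = self.rnn_lat(z_seq)   # Figure 3: latents only
        return self.out(h)               # (batch, T, vocab_size) logits
```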
Latent Diffusion Process and Approach:
Latent Diffusion Models (https://arxiv.org/abs/2309.06642) were introduced to run the diffusion process in a latent space rather than on the raw data. The denoising stage of a diffusion model is computationally expensive, so operating on latent representations, which are typically lower-dimensional than the original data, generally reduces the computational requirements.
We propose to run an LDM on the timestep-wise latent spaces and follow the Figure 2 architecture, combining non-autoregressive (nAR) and autoregressive (AR) behaviour in text generation. The latent sequence will be reconstructed in a nAR manner by the diffusion model, and the decoder will then produce the actual output tokens in an AR manner. Running the diffusion process in the latent space should also save training time and computational complexity compared to running it on the actual data.
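A hypothetical sampler for this two-stage scheme might look like the sketch below: the latent sequence is produced non-autoregressively by iterative denoising, and the tokens are then decoded autoregressively. The `denoiser` and `decoder.step` interfaces, the <bos> token id, and the single-call reverse-diffusion update are all assumptions for illustration, not a concrete LDM implementation.

```python
import torch

@torch.no_grad()
def generate(denoiser, decoder, seq_len, latent_dim, num_steps, batch=1):
    # Stage 1 (nAR): start from Gaussian noise over the whole latent sequence
    # and iteratively denoise it with the latent diffusion model.
    # `denoiser(z, t)` is assumed to return the updated latent sequence at
    # reverse-diffusion step t (schematic; a real step samples from the
    # learned reverse posterior).
    z = torch.randn(batch, seq_len, latent_dim)
    for t in reversed(range(num_steps)):
        z = denoiser(z, t)

    # Stage 2 (AR): decode tokens left to right, conditioning each step on the
    # denoised latent z_t and the previously emitted token (Figure 2 decoder).
    # `decoder.step(z_t, prev_token, state)` is assumed to return (logits, state).
    tokens, state = [], None
    prev = torch.zeros(batch, dtype=torch.long)   # assumed <bos> id = 0
    for t in range(seq_len):
        logits, state = decoder.step(z[:, t], prev, state)
        prev = logits.argmax(dim=-1)
        tokens.append(prev)
    return torch.stack(tokens, dim=1)             # (batch, seq_len) token ids
```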