Visual Paragraph Generation

Topic Generation Net

Sentence Generation Net

In this project, we investigate a 2-layer hierarchical Recurrent Neural Network (RNN) architecture for generating descriptive paragraphs from images. The key elements that characterize the generated paragraphs are:

A principal underlying theme shared by all sentences of the paragraph.
Thematic coherence between successive sentences.
Diversity of the generated paragraphs.

In broad strokes, the first layer of the network, called the Topic Generation Net, takes an encoding of the input image and generates a set of topic vectors, one for each sentence, together with a Global Topic Vector that captures the central theme of the paragraph. For the image encoding, we compose a visual feature vector that describes all the salient regions in the image. The second layer, called the Sentence Generation Net, takes the topic vectors along with the Global Topic Vector as input and generates the words of the paragraph.
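The data flow of the first layer can be illustrated with a minimal NumPy sketch. It is not the published implementation: the mean pooling of region features, the plain tanh recurrence, the averaging used to form the global topic vector, and all names (`topic_generation_net`, `init_topic_params`, the dimensions) are assumptions chosen only to show the shapes involved.

```python
import numpy as np

def init_topic_params(feat_dim, topic_dim, seed=0):
    """Randomly initialise the weights of the toy topic RNN (illustrative only)."""
    rng = np.random.default_rng(seed)
    return {
        "W_h": rng.normal(scale=0.1, size=(topic_dim, topic_dim)),
        "W_v": rng.normal(scale=0.1, size=(topic_dim, feat_dim)),
        "b": np.zeros(topic_dim),
    }

def topic_generation_net(region_feats, params, num_sentences):
    """Return one topic vector per sentence plus a global topic vector."""
    # Pool the region features into a single visual vector (assumption: mean pooling).
    visual = region_feats.mean(axis=0)
    hidden = np.zeros(params["W_h"].shape[0])
    topics = []
    for _ in range(num_sentences):
        # One plain tanh RNN step conditioned on the pooled visual vector.
        hidden = np.tanh(params["W_h"] @ hidden + params["W_v"] @ visual + params["b"])
        topics.append(hidden.copy())
    topics = np.stack(topics)
    # Global topic vector (assumption: simple average of the sentence topics).
    global_topic = topics.mean(axis=0)
    return topics, global_topic

rng = np.random.default_rng(1)
region_feats = rng.normal(size=(5, 8))  # 5 hypothetical regions, 8-dim features each
params = init_topic_params(feat_dim=8, topic_dim=4)
topics, global_topic = topic_generation_net(region_feats, params, num_sentences=3)
print(topics.shape, global_topic.shape)
```

In this sketch the Sentence Generation Net would consume `topics` row by row, with `global_topic` shared across all sentences.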

Central to our paragraph generation scheme is cross-sentence topic coherence. This is ensured by constructing Coherence Vectors, each of which encodes the previous sentence. Generation of a sentence begins with the Coupling Unit combining the topic vector for that sentence with the Global Topic Vector and the Coherence Vector from the previous sentence. The output of the Coupling Unit is then fed to the RNN of the Sentence Generation Net, which synthesizes the words of the sentence in the standard fashion.
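The coupling step described above can be sketched as follows. This is a hedged illustration, not the exact Coupling Unit of the paper: fusing the three vectors by concatenation followed by a single tanh layer is an assumption, and `encode_sentence` is a hypothetical stand-in for running the sentence RNN and encoding the sentence it produced.

```python
import numpy as np

def coupling_unit(topic, global_topic, coherence, W, b):
    """Fuse the three vectors into one input for the sentence RNN."""
    # Assumption: concatenate, then apply one learned projection with tanh.
    fused = np.concatenate([topic, global_topic, coherence])
    return np.tanh(W @ fused + b)

def couple_paragraph(topics, global_topic, W, b, encode_sentence):
    """Produce the coupled input for every sentence of a paragraph."""
    # The first sentence has no predecessor, so its coherence vector is zero.
    coherence = np.zeros_like(global_topic)
    coupled_inputs = []
    for topic in topics:
        coupled = coupling_unit(topic, global_topic, coherence, W, b)
        coupled_inputs.append(coupled)
        # A real model would generate the sentence here and encode it;
        # encode_sentence is a hypothetical placeholder for that step.
        coherence = encode_sentence(coupled)
    return np.stack(coupled_inputs)

rng = np.random.default_rng(0)
dim = 4
topics = rng.normal(size=(3, dim))       # three hypothetical sentence topics
global_topic = topics.mean(axis=0)       # placeholder global topic vector
W = rng.normal(scale=0.1, size=(dim, 3 * dim))
b = np.zeros(dim)
coupled_inputs = couple_paragraph(
    topics, global_topic, W, b, encode_sentence=lambda h: 0.5 * h
)
print(coupled_inputs.shape)
```

The point of the loop is the dependency it creates: the input to sentence i depends on an encoding of sentence i-1, which is what ties successive sentences together.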

We also introduce diversity into the generation process by training the model in a Variational Auto-Encoder (VAE) framework, which allows us to generate multiple paragraphs for the same input image.
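The source of this diversity is the sampling of a latent variable. The sketch below shows the generic VAE ingredients only, namely reparameterized sampling from a diagonal Gaussian and the KL term against a standard normal prior. Where the latent variable enters the network is not described here, so the function names and the toy values of `mu` and `log_var` are assumptions.

```python
import numpy as np

def sample_latents(mu, log_var, num_samples, rng):
    """Draw latent codes via the reparameterization z = mu + sigma * eps."""
    eps = rng.standard_normal((num_samples, mu.shape[0]))
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, diag(exp(log_var))) and N(0, I)."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])        # toy posterior mean (assumed values)
log_var = np.array([0.0, -2.0])   # toy posterior log-variance (assumed values)
latents = sample_latents(mu, log_var, num_samples=3, rng=rng)
kl = kl_to_standard_normal(mu, log_var)
print(latents.shape, kl > 0)
```

Each row of `latents` is a different latent code; decoding each one for the same image is what would yield a different paragraph.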

Code:

Link

Publication:

M. Chatterjee, A. Schwing, “Diverse and Coherent Paragraph Generation from Images”, European Conference on Computer Vision 2018 (ECCV 2018).