Synthetic video to language

Automatic linguistic description of objects, people, and their interactions in an open virtual world.

As part of the Vision in Words Project

Overview

Describing in words the visual world around us is an inherent ability for most humans but for machines this is a very complex and hard to solve problem. To automatically tell the story behind a video, we must employ comprehensive Computer Vision (CV) processing techniques to identify the entities depicted: landscapes, objects, persons, and their interactions. We must monitor all these entities through time and infer meaningful events which in turn can be described from many points of view using Natural Language Processing (NLP).

Such a system would have many real-life applications, from automatic video subtitling to video surveillance and assistants which describe the surroundings (or movies) for visually impaired people.

Previous work shows that grammatically correct sentences can be successfully created for videos using Deep Learning techniques. In our work we focus on the problem of synthetic video description and we aim to answer the following questions:

  1. Can we generate synthetic videos from an open-world realistic virtual environment and take advantage of the fact that inside it we have full control over all the entities displayed at any point in time, and therefore have the ground truth for all the visual information presented in the generated video? This approach of training systems in virtual environments such as KITTI or AI2Thor is already widely used in the literature, with very good results.

  2. Can we define a more meaningful intermediate representation for videos which would easily indicate the relevant objects and events that take place? Similar representations are automatically learned from real videos using Neural Networks but they are obfuscated in the high-level layers, have no meaning for people and can contain irrelevant noise which in turn produces incoherent sentences.

  3. Can we train a Neural Network on synthetic videos and successfully transfer it to real data?

In the scientific report below we present the problem at hand in detail, along with an in-depth review of related work, the progress we have made so far and our future directions.

Goals

The idea from which we started our work is that, in general, Artificial Neural Networks perform better when they have more training data, and that in a sufficiently large artificial environment one would be able to generate an exponentially larger space of human-object and human-human interactions. More than that, if we are programmatically generating all the entities in such a scenario, then we can also use the ground-truth information about them for a video, provided we store it in some sort of intermediate representation. Based on this, we formulate our objectives:

  • Our main objective is to develop a fully controllable virtual environment which can be used to generate synthetic videos with a meaningful ground-truth about the events that are taking place in the video at any time.

  • Our second objective is to create a new graph-based intermediate representation for the meaning of the visual information presented in a video.

  • Using the proposed system we will create a novel dataset which will contain more than 1000 hours of synthetic videos, their corresponding intermediate representation and textual description.

  • Finally, we plan to use the work from the previous objective to develop and train a Deep Learning model on virtual data for the problems of video description, video generation from text and to also test the trained models on real-data.

Research report. Synthetic video to language: Automatic linguistic description of objects, people, and their interactions in an open virtual world

Please access the .pdf version of the research report using this link.

Team

Scientific coordinator

Conf. Dr. Marius Leordeanu

E-Mail: leordeanu@gmail.com

Doctoral students

Nicolae Cudlenco

E-Mail: nicolae.cudlenco@gmail.com

Master students

Andrei-Marius Avram

E-Mail: avram.andreimarius@gmail.com

Samples

Funding

This research work is funded by UEFISCDI under Project Code PN-III-P1-1.1-TE-2016-2182. Please see the main website of the project here.

1. Introduction

Figure 1. System architecture. First, in I. Story Generator we randomly create one or multiple actors, spawn them in random locations, create random objects at those locations and allow the actors to perform actions by interacting among themselves or with the objects. II. Synthetic Video Generator converts the simulated story to video and saves it. III. Text Logger logs an automatically generated textual description while the story unfolds. IV. Spatio-Temporal Graph Generator creates an intermediate abstract representation of the visual information presented in the video. We plan to train an ANN to learn this representation and further use it to create the correlated textual description for a video. It should also be possible to use this representation for the decoding problem (generating video from text).

Acknowledgements This work has been funded by UEFISCDI under Project code PN-III-P1-1.1-TE-2016-2182.

Describing in words the visual world around us is an inherent ability for most humans but for machines this is a very complex and hard to solve problem. To automatically tell the story behind a video, we have to employ comprehensive Computer Vision (CV) processing techniques to identify the entities depicted: landscapes, objects, persons and their interactions. We have to monitor all of these entities through time and infer meaningful events which in turn can be described from many points of view using Natural Language Processing (NLP).

Such a system would have many real-life applications, from automatic video subtitling to video surveillance and assistants which describe the surroundings (or movies) for visually impaired people.

Even though until recently CV and NLP have been treated as two separate areas without many ways to benefit from each other, lately there has been an increased interest in bridging the gap between visual information and text. This led to the emergence of studies for problems at the intersection of CV and NLP, from Image Captioning [1, 2, 3, 4, 5], to Visual Question Answering [6, 7, 8, 9, 5] and even Image Generation from Natural Language [10]. Although these problems are far from being completely solved, they proved that automatic description of visual data is possible and provided motivation to take into account the temporal component and tackle the same problems on videos [11, 12, 13, 14, 15]. While there are multiple approaches for the problem of video description, our study only focuses on the latest Deep Learning techniques.

The approach to solving the video description problem is generally divided into two phases [11, 12, 13]:

    • The Encoding phase: extract data from the visual information and create an intermediate representation. This is done mostly with Convolutional Neural Networks (CNNs), but Long Short-Term Memory networks (LSTMs) and Conditional Random Fields (CRFs) are also used.

    • The Decoding phase: use the encoded data to generate the textual description. Recurrent Neural Networks (RNNs) and their variants (LSTMs, bi-directional RNNs) are usually used for this phase. A minimal sketch of this two-phase pipeline is given after this list.
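To make the two phases concrete, the following is a minimal sketch of an encoder-decoder captioner in PyTorch. The feature dimension, hidden size and vocabulary size are illustrative assumptions, and the model is not the architecture of any of the cited works.

```python
# Minimal sketch of the encode-decode pipeline described above (hypothetical
# shapes and vocabulary sizes; not the architecture of any cited paper).
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=10000):
        super().__init__()
        # Encoding phase: per-frame CNN features are summarized by an LSTM.
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Decoding phase: an LSTM generates the caption word by word.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim) CNN features, e.g. from a ResNet.
        _, (h, c) = self.encoder(frame_feats)   # video summary as the final state
        emb = self.embed(captions)              # (batch, n_words, hidden)
        dec, _ = self.decoder(emb, (h, c))      # condition the decoder on the video
        return self.out(dec)                    # per-step vocabulary logits
```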

Previous work shows that grammatically correct sentences can be successfully created for videos using Deep Learning techniques. In this paper we focus on the problem of synthetic video description and we aim to answer the following questions:

  1. Can we generate synthetic videos from an open-world realistic virtual environment and take advantage of the fact that inside it we have full control over all the entities displayed at any point in time, and therefore have the ground truth for all the visual information presented in the generated video? This approach of training systems in virtual environments such as KITTI [31] or AI2Thor [32] is already widely used in the literature, with very good results for problems such as autonomous vehicle driving [33], Embodied Question Answering [34] and many others. In this paper we propose such an environment, which can be used to generate synthetic videos, the ground-truth intermediate representation of the events that take place during a video and their textual description from multiple points of view.

  2. Can we define a more meaningful intermediate representation for videos which would easily indicate the relevant objects and events that take place? Similar representations are automatically learned from real videos using Neural Networks [11, 12, 13], but they are obfuscated in the high-level layers, have no meaning for people and can contain irrelevant noise, which in turn produces incoherent sentences. Our proposed representation not only addresses these issues but can also be used to go a step further and infer more abstract concepts presented in a video. Therefore, we can cluster the events which target the same entities into one super-event (e.g. a man enters a building, sits down at a desk, writes at a laptop for 8 hours, then exits the building can all be grouped into a single event named a man goes to work).

Table 1: Datasets with real-world videos for benchmarking video description methods [30].

There are many databases with real-world videos which were used in previous work on video description (Table 1). Most of these datasets are small, with a total video length under 100 hours, and generally have only short descriptions which attempt to characterize the whole video. Of all the datasets, only ActivityNet Captions [27] and VideoStory [29] are larger and contain more complex descriptions, with a sufficient number of words to cover most of the events presented in the videos. Our intuition, however, is that the real visual world is extremely large and complex, with vast categories of objects, many background scenes and a high variance in colors and shades. By training in a virtual world with a controlled environment, we should be able to remove a layer of complexity and in turn obtain better results. Starting from this idea, we propose a fully controllable, realistic, open-world virtual environment with a vast number of objects, people, possible actions and landscapes. We plan to use this environment to generate a very large database (over 1000 hours) of synthetic videos, fully annotated with textual descriptions and intermediate representations, with the ground truth for all the entities in each video.

Table 2: Datasets with videos from virtual environments. Only MMG was used for the problem of video captioning.

Despite our best efforts, we found only a few synthetic video captioning datasets (Table 2). Only three of them place people in realistic environments (PHAV, SoccER and VirtualHome). All the datasets are quite small, the largest being PHAV [35] with almost 40,000 videos, totaling 55 hours of recording.

The Moving MNIST GIFs (MMG) dataset [36] contains 20,000 videos showing one or two digits moving in 64 x 64 frames. This dataset was also used to generate GIFs given captions [39]. The textual descriptions explain how the digits move in the given frames (e.g. "Digit two is moving up and down and digit 0 is moving left and right.").

Soccer Event Recognition (SoccER) [37] includes complete positional data and annotations for more than 1.6 million atomic events and 9,000 complex events from virtual soccer games. These events are annotated with tuples (e.g. (ID, KickingTheBall, t, L = ⟨KickingPlayer, (KickedObject, b)⟩)) which can be translated into natural language. There are a total of 23 videos, each 25 minutes long. The authors also designed a two-tier system that first detects atomic events based on temporal, spatial and logical combinations of the detected objects, and then detects complex events as logical and temporal combinations of atomic and other complex events using Interval Temporal Logic.

The Procedural Human Action Videos (PHAV) dataset [35] contains procedurally generated, diverse, realistic and physically plausible human actions. It has a total of 39,982 videos, with more than 1,000 examples for each of 35 action categories. The dataset has 7 modalities (such as depth and segmentation), and only one of them contains textual annotations. The annotations include camera parameters, 3D and 2D bounding boxes, joint locations in screen coordinates (pose), and muscle information (including muscular strength, body limits and other physics-based annotations) for every person in a frame. The VirtualHome [38] dataset is closely related to our work because its videos are recorded in an indoor environment where multiple agents can interact with over 300 objects using 12 actions. To generate the videos, the authors used the MIT Scratch toolkit, where they defined the actions of each agent before recording. However, they manually labeled each video, in contrast with our work, where we automatically generate the textual description.

Different from all the aforementioned datasets, we started from a very popular video game whose environment we control programmatically. We selected a set of locations where the players can be spawned, with some randomly generated objects, and defined multiple points of interest in which the players can perform different actions. At any point in time our system is aware of every entity in the environment, of the identity of each individual, of what actions are being performed and of what objects are involved. We can also control the position and orientation of the camera. For each video we create the intermediate representation, which takes the form of a graph of events that contains information about objects and people, relations between them, the performed actions, and also spatial and temporal data.

Therefore, we make the following contributions:

    • We propose an unprecedented idea for video description, and create a fully controllable virtual environment which can be used to generate synthetic videos with ground-truth about the visual information presented in the video at any time.

    • We propose a new graph-based intermediate representation for the visual information depicted in a video.

    • Using the proposed system we are creating a novel dataset which will contain more than 1000 hours of synthetic videos, their corresponding intermediate representation and textual description.

2. Methods

The ideas from which we started our work are that, in general, Artificial Neural Networks perform better when they have more training data, and that in a sufficiently large artificial environment one would be able to generate an exponentially larger space of human-object and human-human interactions. More than that, if we are programmatically generating all the entities in such a scenario, then we can also use the ground-truth information about them for a video, provided we store it in some sort of intermediate representation. Starting from these ideas, and to further investigate the questions we posed in the previous section, we proposed an architecture with four modules (Figure 1). In the first module we create a short story with randomly created people and objects in random locations. The people perform a set of actions and therefore interact with other people or with objects. Next, module II, the Synthetic Video Generator, takes screenshots of the simulation and converts them to video. Module III, the Text Logger, generates a textual description for the environment, the people and the actions they perform. In the last module, IV, the Spatio-Temporal Graph Generator, an intermediate representation containing all the visual information from the video is created and saved.

2.1 The environment

Figure 2: MTA San Andreas. A total of 733 available objects, clustered into 32 categories.

Because developing a virtual world from scratch requires the efforts of a large team of software developers, many resources and a lot of time, we chose instead to make use of a widely known game: GTA San Andreas. To programmatically interact with the players, the objects and the inner logic of the game, we used the open-source GitHub project offered by Multi Theft Auto (MTA). MTA allows us to control most of the game mechanics, to generate specific objects in custom locations, to spawn players in custom locations and to trigger various player animations.

The MTA engine currently has 312 skins depicting people of various ages, from young to old, wearing various clothes. One thing that we noticed about the skins is that they are slightly biased towards young black people, which is understandable given the storyline of the game.

GTA San Andreas is a vast and realistic world. It allowed us to use a total of 733 objects that could be clustered into 32 different categories, as depicted in Figure 2. The game comes with 65 different animations and more than 70 contexts (e.g. Bar, Night club, House, Gym, Casino, Restaurant, Sea, River, Forest, Field, City and many more). Unlike the original game, which has AI-controlled pedestrians walking around the player, pizza workers, cars in traffic and others, MTA has none of the above, and therefore we had to implement any custom behavior ourselves.

2.2 Synthetic Data Generation

From an architectural point of view, the system is composed of four main modules: the Story Generator, the Synthetic Video Recorder, the Text Logger and the Spatio-Temporal Graph Generator. A generic overview of the architecture and of the interaction between modules is depicted in Figure 1. Each module is detailed in one of the following paragraphs.

I. Story Generator. The story generator is the main module of the system and its role is to create a story episode which will be used by the other modules to create the spatio-temporal graphs, the synthetic videos and the natural language descriptions. A story episode is a short narrative in a limited environment, like a house or a garden, that contains one or more agents that interact with each other or with the objects in the environment.

The episode is generated by first setting its points of interest (POIs), which are simply locations in the environment where an agent can perform various actions. These locations are in general strongly related to some objects and have a specific set of actions which can be performed using those objects. Then, when an episode is instantiated, each agent is spawned randomly in one of the POIs and starts moving randomly between POIs (choosing the shortest path between them using a predefined map and the A* algorithm), performing the actions specific to each POI. Usually, the actions in a POI have a chronological order and an agent is obliged to perform them in the specified order (e.g. "sit down" followed by "stand up"), but there can also be POIs where an agent chooses actions randomly (e.g. "smoke" or "answer phone"). Another type of action is one performed with other actors: at any valid point in time, an actor can move towards another actor and interact with her or him (e.g. talk, hug, kiss). The episode ends when a maximum number of actions is reached.
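For illustration, here is a minimal sketch of the kind of A* search that could route an agent between two POIs over the predefined map of intermediate locations. The `graph` and `coords` structures are hypothetical stand-ins for the map defined with our toolkit, not the project's actual implementation.

```python
# A* over a map of intermediate locations (hypothetical data structures).
import heapq
import math

def a_star(graph, coords, start, goal):
    """graph: {node: [neighbor, ...]}, coords: {node: (x, y)} used for the heuristic."""
    def h(n):  # straight-line distance to the goal
        (x1, y1), (x2, y2) = coords[n], coords[goal]
        return math.hypot(x1 - x2, y1 - y2)

    open_set = [(h(start), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while open_set:
        _, cost, node, path = heapq.heappop(open_set)
        if node == goal:
            return path                      # list of map nodes from start POI to goal POI
        for nb in graph[node]:
            step = math.hypot(coords[nb][0] - coords[node][0],
                              coords[nb][1] - coords[node][1])
            new_cost = cost + step
            if new_cost < best_cost.get(nb, float("inf")):
                best_cost[nb] = new_cost
                heapq.heappush(open_set, (new_cost + h(nb), new_cost, nb, path + [nb]))
    return None                              # no route between the two POIs
```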

In the case where there are multiple actors in the environment, the story generator does not assign the same skin (the player's build and visual appearance) to more than one agent and does not spawn more than one agent in a POI. We achieve this synchronization by simply putting an occupied flag on a POI or skin when it is assigned to an actor. The skin flag remains active for the entire duration of the episode, while the POI occupation flag is only active while an actor is at that location. These flags are also checked when asserting whether a move towards a new POI is valid, so that the tasks the actors perform do not collide.

Figure 3: Screenshots taken during the creation of two episodes using the graphical toolkit. We mark with v the vertexes denoting a region, followed by the name of the region (e.g., "livingroom" in the left picture and "gym main room" in the right picture). The video cameras are denoted by the "cam" label and are usually placed in the corners of the region. The POIs are depicted as green or red cylinders, green marking a finished POI and red an unfinished one. Each POI contains several actions that are enumerated below its label (e.g., "sit down, stand up" on the POI "4:livingroom sofa" in the left picture or "gets on, starts pedaling, gets off" on the POI "5:gym bike" in the right picture). If there are groups of actions that must be performed in a specific order, their order appears right below the label. There are also two special marks for actions: "mandatory", meaning that this action is chosen first, and "closed by", marking that a group of actions has finished. All the labels depicted in the two figures can be added using the command line interface and through interaction with the graphical environment.

agent and doesn’t spawn more than one agent in a POI. We achieve this synchronization by simply putting an occupied flag on the POI or skin when it is assigned to an actor. The skin flag remains active for the entire duration of the episode, while the POI occupation flag is only active while an actor is at that location. These flags are also used when asserting if the action to move towards a new POI is valid so that the tasks the actors perform would not collide.

The episode is also divided into regions that represent spatial areas such as rooms or hallways. They are used to track the position of an agent and of the objects, providing useful information for the text logger or the synthetic video generator.

To speed up the creation of an episode, we developed a tool that allows us to create an episode using the command line interface within the game. The tool allows us to walk with a player through the environment and define the POIs of an episode, the regions, the actions in each POI, the objects and a map of intermediate locations used by the A* algorithm. Moreover, we noticed that in general only a specific set of actions can be performed with each object, so we created templates for action-object tuples that can be inserted at a specific location. When the development of an episode finishes, we save it in a .json file that can be automatically loaded by the story generator. In Figure 3 we show two screenshots taken during episode creation using the toolkit. The red and green circles are POIs (green when at least one action is defined for that POI and red when it has none). The gym bikes and all their possible actions were defined as templates.
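As an illustration of what such a saved episode might look like, the snippet below builds a hypothetical episode definition in Python and writes it to a .json file. The field names are assumptions for illustration only, not the actual schema used by the toolkit.

```python
# Hypothetical episode definition saved as .json (assumed field names).
import json

episode = {
    "interior_id": 3,
    "regions": [{"name": "livingroom", "vertices": [[0, 0], [8, 0], [8, 6], [0, 6]]}],
    "cameras": [{"label": "cam1", "position": [0.5, 0.5, 2.0], "look_at": [4, 3, 1]}],
    "pois": [
        {
            "id": "4:livingroom sofa",
            "position": [5.0, 2.0, 0.0],
            "objects": ["sofa"],
            "actions": ["sit down", "stand up"],   # chronologically ordered actions
            "ordered": True,
        }
    ],
    # intermediate locations and connections used by the A* path search
    "map": {"nodes": {"n1": [1, 1], "n2": [4, 1]}, "edges": [["n1", "n2"]]},
}

with open("episode_livingroom.json", "w") as f:
    json.dump(episode, f, indent=2)
```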

The story generator is also responsible for setting the time of day and the weather, choosing the skins of the actors and setting the position of the camera.

II. Synthetic Video Generator. The synthetic videos are generated by taking screenshots every 33 ms within the GTA San Andreas graphics engine. When an episode ends, we run a script that creates an .avi video from all the screenshots taken. The screenshots are taken on the MTA client computer and sent over the Internet to the server. Due to bandwidth limitations, some screenshots (frames) may be dropped in the process, but most of them reach the server. This module also contains a mechanism that waits, after a story ends, until all the frames have been received by the server.
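A minimal sketch of how the received screenshots could be assembled into an .avi clip at roughly 30 fps (one frame every 33 ms). The file paths and codec are assumptions rather than the project's actual script.

```python
# Assemble saved screenshots into an .avi video (hypothetical paths and codec).
import glob
import cv2

frames = sorted(glob.glob("episode_0001/frame_*.png"))   # screenshots received from the client
first = cv2.imread(frames[0])
h, w = first.shape[:2]

writer = cv2.VideoWriter("episode_0001.avi",
                         cv2.VideoWriter_fourcc(*"XVID"), 30.0, (w, h))
for path in frames:
    img = cv2.imread(path)
    if img is not None:          # frames dropped in transit are simply skipped
        writer.write(img)
writer.release()
```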

III. Text Logger. The text logger's role is to automatically describe the generated narrative in natural language. This is achieved by logging each action performed by an agent using several rules. First, each sentence in the description must have the following structure: (Phrase link) (Subject) (Action) (Target Items). (Phrase link) is a word that links two phrases, such as "Then" or "Afterwards", and (Subject) represents the agent of the (Action) applied to the (Target Items). The first sentence of a generated text contains a description of the agent in the (Subject) tag, following the template: (Age) (Hair Color) (Race) (Gender) (Upper Body Clothes) (Lower Body Clothes) (Head Clothes), while all other (Subject) tags in the description use a gendered coreference to the character ("he" or "she"). The first sentence also has no (Phrase link).

To emphasise that an action lasts a longer period of time (e.g. "dance" or "eat"), it is described in the present continuous tense, and the following sentence starts with "When (Subject) finishes (Action), (Subject) ...". When an agent enters a region for the first time, the text logger produces a dynamic description of the region relative to the agent, by randomly selecting up to four objects within the region (e.g. "When he enters the kitchen he observes a fridge on his left and a table on his right"). To avoid repeating the same object in the description when it appears multiple times in the region (e.g., "he enters the living room where there is a sofa, sofa and a sofa"), we count the objects that have the same description and replace them accordingly (the previous example becomes "he enters the living room where there are three sofas"). By keeping track of the objects used by each agent, we can add a coreference between the used objects, marked by the word "again" before the object. Moreover, if the actor interacts with a different object of the same type, the word "another" is added before the object. Also, because there can be multiple agents of the same gender in the environment, the coreference could become ambiguous, so we give each agent a name by selecting a first name and a last name from the top 1000 first and last names in the United States, according to their gender. We use this name every time a description for an action begins, and we use its coreference further on until it ends.
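A hedged sketch of the sentence template described above, with a first-mention description followed by a gendered coreference. The vocabulary, field names and helper functions are illustrative only, not the logger's actual code.

```python
# Sentence template: (Phrase link) (Subject) (Action) (Target Items) - illustrative only.
import random

PHRASE_LINKS = ["Then", "Afterwards"]

def describe_agent(agent):
    # First mention: (Age) (Race) (Gender) plus a clothes description.
    return (f"{agent['age']} {agent['race']} {agent['gender']} "
            f"in {agent['upper_clothes']} and {agent['lower_clothes']}")

def log_action(agent, action, target, first_sentence):
    subject = describe_agent(agent) if first_sentence else \
        ("he" if agent["gender"] == "man" else "she")
    link = "" if first_sentence else random.choice(PHRASE_LINKS) + ", "
    return f"{link}{subject} {action} {target}.".capitalize()

agent = {"age": "a young", "race": "white", "gender": "man",
         "upper_clothes": "a yellow shirt", "lower_clothes": "black pants"}
print(log_action(agent, "sits down on", "the sofa", first_sentence=True))
print(log_action(agent, "stands up from", "the sofa", first_sentence=False))
```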

Figure 4: A simple example of a Spatio-Temporal graph for a video where a man named John first eats a banana in the kitchen, then kisses Mary who is also in the kitchen. The graph is comprised of atomic events (orange), events with properties (yellow) and relations between events (blue arrows).

IV. Spatio-Temporal Graph Generator. The intermediate world model between CV and NLP presented in this paper takes the form of a graph of events that are interlinked with spatio-temporal relations (Figure 4). An event is an encapsulation of either an action that modifies the structure of the environment or a fact about the objects that can be found in the environment. The spectrum of possibilities for an event is huge, ranging from the interaction of atoms to statements such as "the universe exists", all of them telling a story, but at a different scale. However, in this work we use the scale of macroscopic objects and we create simple events that can take up to several seconds, mainly due to the limitations of the game engine. For instance, a plausible example of an action event is "John eats a banana", and "John exists" is an example of a fact event.

Apart from the action, the event also contains links to the actors that undertake the action, links to the objects that take part in the event and the period of time during which it takes place. However, if an event is a fact (e.g. "an object exists") then it contains only the fact, the period of time and links to other events.

The edges of the graph establish spatio-temporal relationships between events. The spatial links mark the relative position between action events and atomic events from the camera perspective. It must be noted that the relative position is not static in time, so in order to capture this, we create edges from action events which are chronologically ordered in time. The temporal links establish the chronological order between action events by linking each of them with its previous, next or overlapping events.

The Spatio-Temporal Graph Generator is responsible for abstracting the narrative created by the Story Generator into a graph of events. It tracks the actions performed in the video, their duration and order, and links them with the corresponding objects and actors. The module is also responsible for tracking other relations between events, such as their order in time, whether they overlap, spatial proximity, and whether they share the same actors. At the end of the episode, the graph is saved as a .json file.
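A minimal sketch, under assumed field names, of how such a graph of events and spatio-temporal relations could be serialized to a .json file. It mirrors the example in Figure 4 rather than the exact schema used by the module.

```python
# Toy spatio-temporal graph for the Figure 4 example (assumed field names).
import json

events = [
    {"id": "e1", "type": "fact",   "fact": "John exists", "t": [0.0, 60.0]},
    {"id": "e2", "type": "action", "action": "eats", "actors": ["John"],
     "objects": ["banana"], "t": [2.0, 10.0]},
    {"id": "e3", "type": "action", "action": "kisses", "actors": ["John", "Mary"],
     "objects": [], "t": [12.0, 14.0]},
]

relations = [
    {"from": "e2", "to": "e3", "relation": "next"},        # temporal order
    {"from": "e2", "to": "e1", "relation": "same actor"},  # shared participant
    {"from": "e2", "to": "e3", "relation": "left of"},     # spatial, from the camera view
]

with open("episode_0001_graph.json", "w") as f:
    json.dump({"events": events, "relations": relations}, f, indent=2)
```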

3. Results

To validate our results, we developed episodes in various indoor locations (mostly houses) within the game, with the following interior ids: 1, 3, 5, 8, 10, 12. The locations corresponding to these interior ids can be viewed at http://weedarr.wikidot.com/interior. All episodes take place in a house environment, with the exception of the one with interior id 5, which takes place in a gym.

Figure 5: Examples of screenshots within MTA San Andreas depicting various actions of a single actor in two episodes, together with the automatically generated descriptions.

We show in Figure 5 several screenshots from two episodes (with ids 3 and 10), together with the generated descriptions. As can be observed, the graphics of the game are decent and reflect reality up to a certain level, missing detailed components such as the water from the water tap or the progressive disappearance of the eaten object.

However, although this level of granularity is present in the real world, most of the actions in videos can be inferred from the movement of the person in a specific context. Even in real life, such details might not be captured by the camera. As an example, answering a phone can be inferred from an agent putting a hand in a pocket and then placing it at his ear, without explicitly observing the phone. The inference would become more complicated if a description of the phone were required; this is a limitation of our environment.

The descriptions of the actions happening in the video are semantically rich and sufficiently long, with 194 words and 14 sentences for the first house (first row) and 247 words and 21 sentences for the second house (second row), respectively. Each of them contains a very precise description of the agent's appearance: (1) "A young man in a yellow and black shirt and black pants" and (2) "An old white man in a white shirt, gray three quarter pants and a gray hat". Also, each agent has a name: Myles PATEL in the first episode and Dustin BRADSHAW in the second, which is coreferenced in the rest of the sentences with the pronouns "he" or "him". A downside which can be observed is that the generated sentences may appear a little dull and not sound very natural, repeating connectives such as "then" or "afterwards".

Figure 6: Example of an episode depicting various actions in the case of multiple actors, and the corresponding generated description. Sentences related to each individual actor are written in the same color.

We further exemplify in Figure 6 the case of a multi-agent episode with 4 actors in a gym environment. The first sentence in the caption describes each of the four actors, in a random order, and assigns them names. Then, the system generates a description for a random number of actions for one of the actors, starting with the actor's name and then using the coreference. We emphasised with different colors (blue, green, yellow, red) the phrases that refer to the same character.

4. Conclusions and future work

In this paper we tackled the problem of video description and provided the following contributions:

  1. We created a system which successfully generates synthetic videos depicting various actions taking place in a realistic environment. Because we have full control over the environment, camera, objects, players and the actions they perform, we are able to generate very precise textual descriptions and an intermediate graph representation for the events depicted in the video.

  2. We proposed and defined a new intermediate representation, which can be read and understood by humans, by treating a video as a graph of events with properties, correlated through custom-defined relations.

Our current work paves the way for new ideas to be pursued in the future. We plan to further improve the existing synthetic data generation system so that random objects from the same category are spawned at the same location when an episode is run multiple times (e.g. a different kind of bed). We would also like to continue to implement episodes in all the available game contexts, using all the available animations and objects. We plan to create a large synthetic dataset by distributing the generation system over multiple Virtual Machines or Docker instances and generating samples using all the possible permutations of objects, contexts, actions and people (we estimate approximately 1 million combinations). Then, we plan to create a Deep Learning architecture which would learn to automatically infer the intermediate representation given a video, and use it to describe videos. Another interesting problem to investigate would be generating the intermediate representation and the synthetic video given the textual description. Last but not least, we plan to test how well the system would be able to transfer the knowledge obtained by training on synthetic videos, recorded in a virtual environment, to real-life data.

References

  1. Q. You, H. Jin, Z. Wang, C. Fang, J. Luo, Image captioning with semantic attention, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4651–4659.

  2. S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, V. Goel, Self-critical sequence training for image captioning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7008–7024.

  3. L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, T.-S. Chua, Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5659–5667.

  4. J. Lu, C. Xiong, D. Parikh, R. Socher, Knowing when to look: Adaptive attention via a visual sentinel for image captioning, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 375–383.

  5. P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and top-down attention for image captioning and visual question answering, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6077–6086.

  6. S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, D. Parikh, Vqa: Visual question answering, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 2425–2433.

  7. J. Lu, J. Yang, D. Batra, D. Parikh, Hierarchical question-image co-attention for visual question answering, in: Advances in neural information processing systems, 2016, pp. 289–297.

  8. H. Xu, K. Saenko, Ask, attend and answer: Exploring question-guided spatial attention for visual question answering, in: European Conference on Computer Vision, Springer, 2016, pp. 451–466.

  9. A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, M. Rohrbach, Multimodal compact bilinear pooling for visual question answering and visual grounding, arXiv preprint arXiv:1606.01847.

  10. S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, H. Lee, Generative adversarial text to image synthesis, Vol. 48 of Proceedings of Machine Learning Research, PMLR, New York, New York, USA, 2016, pp. 1060–1069. URL http://proceedings.mlr.press/v48/reed16.html

  11. J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell, Long-term recurrent convolutional networks for visual recognition and description, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 2625–2634.

  12. S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, K. Saenko, Translating videos to natural language using deep recurrent neural networks, arXiv preprint arXiv:1412.4729.

  13. L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, A. Courville, Describing videos by exploiting temporal structure, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 4507–4515.

  14. C. Hori, T. Hori, T.-Y. Lee, Z. Zhang, B. Harsham, J. R. Hershey, T. K. Marks, K. Sumi, Attention-based multimodal fusion for video description, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 4193–4202.

  15. J. Lei, L. Yu, M. Bansal, T. Berg, TVQA: Localized, compositional video question answering, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 1369–1379. doi:10.18653/v1/D18-1167. URL https://www.aclweb.org/anthology/D18-1167

  16. D. Chen, W. B. Dolan, Collecting highly parallel data for paraphrase evaluation, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 190–200.

  17. M. Rohrbach, S. Amin, M. Andriluka, B. Schiele, A database for fine grained activity detection of cooking activities, in: 2012 IEEE conference on computer vision and pattern recognition, IEEE, 2012, pp. 1194–1201.

  18. P. Das, C. Xu, R. F. Doell, J. J. Corso, A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 2634–2641.

  19. M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele, M. Pinkal, Grounding action descriptions in videos, Transactions of the Association for Computational Linguistics 1 (2013) 25–36.

  20. A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, B. Schiele, Coherent multi-sentence video description with variable level of detail, in: German conference on pattern recognition, Springer, 2014, pp. 184–195.

  21. A. Rohrbach, M. Rohrbach, N. Tandon, B. Schiele, A dataset for movie description, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3202–3212.

  22. A. Torabi, C. Pal, H. Larochelle, A. Courville, Using descriptive video services to create a large data source for video annotation research, arXiv preprint arXiv:1503.01070.

  23. J. Xu, T. Mei, T. Yao, Y. Rui, Msr-vtt: A large video description dataset for bridging video and language, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5288–5296.

  24. G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, A. Gupta, Hollywood in homes: Crowdsourcing data collection for activity understanding, in: European Conference on Computer Vision, Springer, 2016, pp. 510–526.

  25. K.-H. Zeng, T.-H. Chen, J. C. Niebles, M. Sun, Title generation for user generated videos, in: European conference on computer vision, Springer, 2016, pp. 609–625.

  26. L. Zhou, C. Xu, J. J. Corso, Towards automatic learning of procedures from web instructional videos, arXiv preprint arXiv:1703.09788.

  27. R. Krishna, K. Hata, F. Ren, L. Fei-Fei, J. Carlos Niebles, Dense-captioning events in videos, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 706–715.

  28. L. Zhou, Y. Kalantidis, X. Chen, J. J. Corso, M. Rohrbach, Grounded video description, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6578–6587.

  29. S. Gella, M. Lewis, M. Rohrbach, A dataset for telling the stories of social media videos, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 968–974.

  30. N. Aafaq, A. Mian, W. Liu, S. Z. Gilani, M. Shah, Video description: A survey of methods, datasets, and evaluation metrics, ACM Computing Surveys (CSUR) 52 (6) (2019) 1–37.

  31. A. Geiger, P. Lenz, C. Stiller, R. Urtasun, Vision meets robotics: The kitti dataset, The International Journal of Robotics Research 32 (11) (2013) 1231–1237.

  32. E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, A. Farhadi, Ai2-thor: An interactive 3d environment for visual ai, arXiv preprint arXiv:1712.05474.

  33. C. Chen, A. Seff, A. Kornhauser, J. Xiao, Deepdriving: Learning affordance for direct perception in autonomous driving, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2722–2730.

  34. A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, D. Batra, Embodied question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 2054–2063.

  35. C. Souza, A. Gaidon, Y. Cabon, A. Lopez, Procedural generation of videos to train deep action recognition networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2594–2604.

  36. N. Srivastava, E. Mansimov, R. Salakhutdinov, Unsupervised learning of video representations using lstms, in: International conference on machine learning, 2015, pp. 843–852.

  37. L. Morra, F. Manigrasso, G. Canto, C. Gianfrate, E. Guarino, F. Lamberti, Slicing and dicing soccer: automatic detection of complex events from spatio-temporal data, arXiv preprint arXiv:2004.04147.

  38. X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler, A. Torralba, Virtualhome: Simulating household activities via programs, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8494–8502.

  39. G. Mittal, T. Marwah, V. N. Balasubramanian, Sync-draw: automatic video generation using deep recurrent attentive architectures, in: Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 1096–1104.