Virtual Vision-to-Language System Architecture: I. The Story Generator randomly creates one or more actors, spawns them at random locations, creates random objects at those locations, and lets the actors perform actions by interacting with each other or with the objects. II. The Synthetic Video Generator converts the simulated story to video and saves it. III. The Text Logger records an automatically generated textual description as the story unfolds. IV. The Spatio-Temporal Graph Generator creates an intermediate abstract representation of the visual information presented in the video. We plan to train a neural network to learn this representation and then use it to generate the correlated textual description for a video. The same representation should also support the decoding problem (generating video from text).
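As a rough illustration of steps I and III, the sketch below randomly spawns actors and objects and logs a textual description of each interaction event. All actor, object, and action names, as well as the grid-based locations, are hypothetical placeholders chosen for this example; they are not the project's actual vocabulary or API.

```python
import random

# Hypothetical vocabularies for the simulated world (illustrative only).
ACTORS = ["dog", "child", "robot"]
OBJECTS = ["ball", "box", "tree"]
ACTIONS = ["picks up", "walks to", "looks at"]

def generate_story(num_actors=2, num_events=4, seed=0):
    """Spawn random actors and objects at random grid locations,
    then log a textual description of each random interaction event."""
    rng = random.Random(seed)
    # Spawn actors at random 2D locations; create one object at each location.
    actors = [(rng.choice(ACTORS), (rng.randint(0, 9), rng.randint(0, 9)))
              for _ in range(num_actors)]
    objects = [(rng.choice(OBJECTS), loc) for _, loc in actors]
    # Text Logger: one automatically generated sentence per event.
    log = []
    for t in range(num_events):
        actor_name, _ = rng.choice(actors)
        target_name, _ = rng.choice(objects + actors)
        action = rng.choice(ACTIONS)
        log.append(f"t={t}: the {actor_name} {action} the {target_name}")
    return actors, objects, log

actors, objects, log = generate_story(seed=42)
for line in log:
    print(line)
```

Seeding the generator makes each simulated story reproducible, which is convenient when the same story must be rendered to video and logged as text in separate passes.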
As part of the Vision in Words project, we are working on automatically generating video content and corresponding natural language descriptions in order to create a large dataset for training deep neural network systems. The goal is twofold. First, we want to find a common representation between vision and language in the form of graphs of events in space and time; the current literature at the intersection of vision and language lacks such a common representation. Second, we need a large amount of labeled training data, with pairs of the form (input video, complex linguistic description in natural language). We believe that a virtual environment offers the possibility of generating such training pairs automatically, allowing us to better study the connection between visual input and its description in natural language.
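To make the graph-of-events idea concrete, here is a minimal sketch of one possible spatio-temporal graph: nodes are (entity, time) pairs, action edges link a subject to its target at the same time step, and temporal edges link each entity to its next appearance. The event tuple schema and edge labels are assumptions made for this illustration, not the representation actually used in the project.

```python
def build_st_graph(events):
    """Build a toy spatio-temporal graph from a list of events.

    events: list of (time, subject, action, obj) tuples.
    Returns a set of (entity, time) nodes and a list of labeled edges."""
    nodes, edges = set(), []
    for t, subject, action, obj in events:
        nodes.add((subject, t))
        nodes.add((obj, t))
        # Action edge: subject interacts with obj at time t.
        edges.append(((subject, t), action, (obj, t)))
    # Temporal edges: link each entity to its next appearance in time.
    last_seen = {}
    for entity, t in sorted(nodes, key=lambda n: n[1]):
        if entity in last_seen:
            edges.append(((entity, last_seen[entity]), "next", (entity, t)))
        last_seen[entity] = t
    return nodes, edges

events = [(0, "child", "walks to", "ball"),
          (1, "child", "picks up", "ball")]
nodes, edges = build_st_graph(events)
```

A graph of this shape could serve both directions: a network can be trained to predict it from video and describe it in text, or to decode it back into a rendered scene.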
While this direction is largely the subject of future work (started as part of the Vision in Words project), the current preliminary results are very encouraging. We invite the reader to take a closer look at the site dedicated to the Synthetic Vision to Language work at this link.
A full technical description of the work can be found in the report here. We expect to submit it to a top vision-and-language conference or journal in the near future.