We have started to collect a large dataset of videos containing complex indoor activities for automatic video-to-language translation. The video sequences contain relatively complex actions and activities, involving one or several people in an indoor environment at the Institute of Mathematics of the Romanian Academy. The activities range from simple to complex. Simpler ones are performed by a single person and include entering and exiting a room, sitting or standing, picking up a book, reading, placing an object into a handbag, and writing on the blackboard. More challenging ones involve two or more people interacting in various ways: actors could talk to each other, shake hands, solve a problem at the blackboard, make a presentation, or pass by each other, among others. The main goal of collecting the dataset is to train models that automatically describe such videos in relatively complex, natural language.
We have already collected about 630 videos, each 30 seconds long, some of which are already annotated. The dataset and its annotations are available here.
More information on the dataset can also be found on the IMAR dataset website, with many more details coming soon.
Our novel work on a multi-level representation of language for translating visual content into natural language, along with the introduction of our new IMAR dataset, has recently been accepted for publication, with an oral presentation, at the prestigious International Conference on Computational Linguistics (COLING), 2020.
Simion-Vlad Bogolin, Ioana Croitoru, Marius Leordeanu, "A hierarchical approach to vision-based language generation: from simple sentences to complex natural language," accepted with oral presentation at the International Conference on Computational Linguistics (COLING), 2020.
The full oral presentation can be watched here.