The relation between visual interpretation and linguistic expression lies at the core of human intelligence. Understanding this relation from a computational perspective, and building systems that perform efficiently in the real world, would have a significant practical impact, with important applications in outdoor environments as well as assistive technologies for indoor scenes. In the Vision-in-Words project, our aim is to develop a real-time system for describing indoor videos in natural language, applied to daily activities that involve interactions between people and objects in indoor environments. Starting from our own strong baselines, with recent success at top international computer vision venues, we plan to develop a model that performs efficiently in real time, with potential for direct real-world applications in the homecare, health, and educational sectors. Our workplan is organized around three objectives, each building on a solid foundation established by our own work and research. The first objective of Vision-in-Words is to develop a system for real-time object detection in video. The second is to create an efficient system for recognizing human activities and their interactions with objects in such indoor environments. The third is to develop methods for a full vision-to-language translator of indoor videos.
In the scientific report, available below, we present the specific goals of our project, along with descriptions of our ongoing work and future directions. We also present our team of young researchers and professors in computer vision, with top results at both the national and international levels. Our strong research experience in object recognition in images and videos, our published and preliminary results, and the quality of the people involved will lead to a successful research project with results of the highest value.
The central goal of the Vision-in-Words project is to design and implement algorithms for a system that automatically produces natural-language descriptions, in the form of short stories, from indoor videos. At the same time, the system should be able to use language to improve visual interpretation. First, we want to use visual information to produce language: a description, in the form of a "story", of what is "seen". Second, we will use the higher-level context of the "story" to improve the visual perception of individual objects. Indoor videos are well suited to such a challenging task, as they provide an almost self-contained world in which the number of object classes is sufficiently small, yet the interactions are many and rich. From a practical point of view, a system with vision-to-language translation capabilities would have an immediate impact on everyday home robotics applications.
Assoc. Prof. Dr. Marius Leordeanu
Assoc. Prof. Dr. Bogdan Alexe
Prof. Dr. Radu Ionescu
PhD Student Ioana Croitoru
PhD Student Vlad Bogolin
PhD Student Nicolae Cudlenco
MSc Student Andrei Avram
PhD Student Iulia Duta
PhD Student Andrei Nicolicioiu
Assoc. Prof. Dr. Oana Balan
Prof. Nirvana Popescu
More details about our specific objectives, goals, and results can be found in our research report for 2018-2019, which is available below and also at this link.