The relation between visual interpretation and linguistic expression lies at the core of human intelligence. Understanding this relation from a computational perspective, and building systems that perform efficiently in the real world, would have a significant practical impact, with important applications in outdoor environments as well as assistive technologies for indoor scenes. In the Vision-in-Words project, our aim is to develop a real-time system for describing indoor videos in natural language, applied to daily activities that involve interactions between people and objects in indoor environments. Starting from our own strong baselines, with recent success at top international computer vision venues, we plan to develop a model that performs efficiently in real time, with potential for direct real-world applications in the homecare, health, and educational sectors. Our workplan is organized around three objectives, each building on a solid foundation established by our own work and research. The first objective of Vision-in-Words is to develop a system for real-time object detection in video. The second is to create an efficient system for recognizing human activities and their interactions with objects in such indoor environments. The third is to develop methods for a full vision-to-language translator of indoor videos.
In the scientific report, available below, we present the specific goals of our project, along with descriptions of our ongoing work and future directions. We also present our team of young researchers and professors in computer vision, with top results at both the national and international levels. Our strong research experience in object recognition in images and videos, our published and preliminary results, and the quality of the people involved will lead to a successful research project with results of the highest value.
The central goal of the Vision-in-Words project is to design and implement algorithms for a system that automatically produces natural-language descriptions, in the form of short stories, from indoor videos. At the same time, the system should be able to use language to improve visual interpretation. First, we want to use visual information to produce language: a description, in the form of a "story", of what is "seen". Second, we will use the higher-level context of the "story" to improve the visual perception of individual objects. Indoor videos are well suited to such a challenging task, as they provide an almost self-contained world in which the number of object classes is sufficiently small, yet the interactions are many and rich. From a practical point of view, a system with vision-to-language translation capabilities would have an immediate impact on everyday home robotics applications.
Assoc. Prof. Dr. Marius Leordeanu
Assoc. Prof. Dr. Bogdan Alexe
Prof. Dr. Radu Ionescu
PhD Student Ioana Croitoru
PhD Student Vlad Bogolin
PhD Student Nicolae Cudlenco
MSc Student Andrei Avram
PhD Student Iulia Duta
PhD Student Andrei Nicolicioiu
Assoc. Prof. Dr. Oana Balan
Prof. Nirvana Popescu
More details about our specific objectives, goals, and results can be found in our research report for 2018-2019, which is available below and also at this link.