Mihai Masala and Marius Leordeanu
Institute of Mathematics of the Romanian Academy; National University of Science and Technology POLITEHNICA Bucharest
Abstract
The task of describing video content in natural language is commonly referred to as video captioning. Unlike conventional video captions, which are typically brief and widely available, long-form paragraph descriptions in natural language are scarce. This limitation of current datasets is due, on one hand, to the expensive manual human annotation required and, on the other, to the highly challenging task of explaining the language formation process from the perspective of the underlying story, as a complex system of interconnected events in space and time. Through a thorough analysis of recently published methods and available datasets, we identify a general lack of published resources dedicated to the problem of describing videos in complex language, beyond descriptions formed as enumerations of simple captions. Furthermore, while state-of-the-art methods produce impressive results on the task of generating shorter captions from videos by direct end-to-end learning between the videos and text, the problem of explaining the relationship between vision and language is still beyond our reach. In this work, we propose a shared representation between vision and language, based on graphs of events in space and time, which can be obtained in an explainable and analytical way, to integrate and connect multiple vision tasks to produce the final natural language description. Moreover, we also demonstrate how our automated and explainable video description generation process can function as a fully automatic teacher to effectively train direct, end-to-end neural student pathways, within a self-supervised neuro-analytical system. We validate that our explainable neuro-analytical approach generates coherent, rich and relevant textual descriptions on videos collected from multiple varied datasets, using standard evaluation metrics (e.g., METEOR, ROUGE, BERTScore), human annotations and consensus from ensembles of state-of-the-art large Vision Language Models. We also validate the effectiveness of our self-supervised teacher-student approach in boosting the performance of end-to-end neural vision-to-language models.
Method
A modular, algorithmic, step-by-step outline of our proposed approach. Starting from the video, we first harness multiple vision tasks (i.e., action detection, semantic segmentation, depth estimation and tracking), followed by a second step of multimodality association. At this point the information is still at frame level, but it is aggregated and correlated across modalities. In the third step, we enforce space-time consistency and person unification to generate a video-level list of events, each containing the action, a unique person id, the location, the objects involved, and the start and end frames. In the fourth step, this list is directly converted into GEST nodes, and GEST edges are created to yield the final GEST graph. In the self-supervised learning module, we demonstrate how neural student networks (skip-like connections shown in blue, below) can learn from their corresponding GEST teacher path (or sub-path, above). While all such student pathways can learn from their respective GEST teacher paths, in our experiments we focus on the longest connection, from the original video input to the final text output, as illustrated in the self-supervised learning box in the figure.
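To make the four-step structure concrete, the following is a minimal, illustrative sketch of the pipeline, not the authors' implementation: the task-specific models are replaced by stubs, and all names introduced here (Event, GESTGraph, build_gest, detect_actions, etc.) are assumptions made for clarity.

```python
# Hypothetical sketch of the four-step GEST construction described above.
from dataclasses import dataclass, field


@dataclass
class Event:
    """One space-time event: who does what, where, with which objects, and when."""
    action: str
    person_id: int
    location: str
    objects: list
    start_frame: int
    end_frame: int


@dataclass
class GESTGraph:
    """Graph of Events in Space and Time: events are nodes, relations are edges."""
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (src_idx, dst_idx, relation)


# Step 1 models, stubbed here; in practice these would be pretrained networks.
def detect_actions(frame): return []    # action detection
def segment(frame): return None         # semantic segmentation
def estimate_depth(frame): return None  # depth estimation
def track(frame): return []             # multi-object tracking


def build_gest(video_frames):
    # Step 1: run the individual vision tasks on every frame.
    per_frame = [{
        "actions": detect_actions(f),
        "segmentation": segment(f),
        "depth": estimate_depth(f),
        "tracks": track(f),
    } for f in video_frames]

    # Step 2: multimodality association -- correlate the modalities while the
    # information is still at frame level (e.g. attach objects and locations
    # to the detected actors).
    associated = per_frame  # placeholder for the association step

    # Step 3: space-time consistency and person unification, producing a
    # video-level list of events (action, person id, location, objects, span).
    events = [Event("walk", 0, "street", ["bag"], 0, 30)]  # illustrative output

    # Step 4: events become GEST nodes; edges encode relations between them
    # (here only a simple "happens before" temporal ordering is shown).
    graph = GESTGraph(nodes=events)
    for i, a in enumerate(events):
        for j, b in enumerate(events):
            if i != j and a.end_frame <= b.start_frame:
                graph.edges.append((i, j, "next"))
    return graph
```

The resulting graph is what the downstream text generation step consumes; the choice of edge relations shown here (plain temporal ordering) is deliberately simplified.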
Text generation metrics evaluation
Human ranking
VLM-as-a-Jury as proxy for human preferences
Self-supervised teacher-student learning
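As described in the method outline above, the analytical GEST pathway can act as a fully automatic teacher for a direct end-to-end video-to-text student. The sketch below illustrates this idea under stated assumptions: descriptions produced by the teacher path serve as pseudo-labels, and a HuggingFace-style vision-to-text model interface (pixel_values, labels, .loss) is assumed for the student; graph_to_text is a placeholder for the GEST-to-description step, and build_gest refers to the sketch in the Method section. This is not the authors' code.

```python
# Hedged sketch of self-supervised teacher-student learning on unlabeled videos.
import torch


def graph_to_text(gest_graph):
    """Placeholder for the GEST-to-description step of the teacher path."""
    return "a person walks down the street carrying a bag"


def train_student(student, tokenizer, unlabeled_videos, lr=1e-5, epochs=1):
    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
    student.train()
    for _ in range(epochs):
        for video_frames in unlabeled_videos:
            # Teacher path: video -> GEST graph -> text (fully automatic).
            pseudo_caption = graph_to_text(build_gest(video_frames))
            labels = tokenizer(pseudo_caption, return_tensors="pt").input_ids

            # Student path: direct end-to-end video -> text, trained to
            # reproduce the teacher's description. Frame preprocessing into
            # a pixel tensor is omitted for brevity.
            loss = student(pixel_values=video_frames, labels=labels).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return student
```

In this setup the student corresponds to the longest teacher-student connection mentioned above, from the original video input to the final text output.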