Mihai Masala and Marius Leordeanu
Institute of Mathematics of the Romanian Academy; National University of Science and Technology POLITEHNICA Bucharest
The task of describing video content in natural language is commonly referred to as video captioning. Unlike conventional video captions, which are typically brief and widely available, long-form paragraph descriptions in natural language are scarce. This limitation of current datasets is due, on one hand, to the expensive manual human annotation required and, on the other, to the highly challenging task of explaining the language formation process from the perspective of the underlying story, as a complex system of interconnected events in space and time. Through a thorough analysis of recently published methods and available datasets, we identify a general lack of published resources dedicated to the problem of describing videos in complex language, beyond the level of descriptions in the form of enumerations of simple captions. Furthermore, while state-of-the-art methods produce impressive results on the task of generating shorter captions from videos by direct end-to-end learning between video and text, the problem of explaining the relationship between vision and language is still beyond our reach. In this work, we propose a shared representation between vision and language, based on graphs of events in space and time, which can be obtained in an explainable and analytical way, to integrate and connect multiple vision tasks and produce the final natural language description. Moreover, we demonstrate how our automated and explainable video description generation process can function as a fully automatic teacher to effectively train direct, end-to-end neural student pathways, within a self-supervised neuro-analytical system. We validate that our explainable neuro-analytical approach generates coherent, rich and relevant textual descriptions on videos collected from multiple varied datasets, using standard evaluation metrics (e.g., METEOR, ROUGE, BERTScore), human annotations and consensus from ensembles of state-of-the-art large Vision-Language Models. We also validate the effectiveness of our self-supervised teacher-student approach in boosting the performance of end-to-end neural vision-to-language models.
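For reference, the standard reference-based metrics mentioned above can be computed with common off-the-shelf packages. The snippet below is a minimal sketch only: the nltk, rouge-score and bert-score packages and the example sentences are illustrative choices, not the exact evaluation setup used in the paper.

```python
# Illustrative computation of METEOR, ROUGE-L and BERTScore for a single
# candidate/reference pair, using common open-source packages.
from nltk.translate.meteor_score import meteor_score  # may require nltk.download("wordnet")
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "a man walks into a classroom and closes the door"
candidate = "the person enters a classroom, then closes the door behind them"

# METEOR (NLTK expects tokenized inputs)
meteor = meteor_score([reference.split()], candidate.split())

# ROUGE-L F-measure
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

# BERTScore F1 (returns precision, recall, F1 tensors)
_, _, f1 = bert_score([candidate], [reference], lang="en")

print(f"METEOR={meteor:.3f}  ROUGE-L={rouge_l:.3f}  BERTScore-F1={f1.item():.3f}")
```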
A modular, algorithmic step-by-step outline of our proposed approach. Starting from the video, we harness multiple vision tasks (i.e., action detection, semantic segmentation, depth estimation and tracking), followed by a second step of multi-modality association. At this stage the information is still at frame level, but it is aggregated and correlated across modalities. In a third step, we enforce consistency in space and time and unify person identities, producing a list of events and transitioning to a video-level representation; each event contains the action, a unique person id, the location, the objects involved and the start and end frames. In the fourth step, this list is directly converted into GEST nodes, and GEST edges are created between them to yield the final GEST graph. In the self-supervised learning module, we demonstrate how neural student networks (skip-like connections shown in blue, below) can learn from their corresponding GEST teacher path (or sub-path, above). While all such student pathways can learn from their respective GEST teacher paths, in our experiments we focus on the longest connection, from the original video input to the final text output, as illustrated in the self-supervised learning box in the figure.
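To make steps three and four more concrete, below is a minimal, illustrative sketch of how a list of space-time consistent events could be turned into a GEST graph. The Event fields follow the caption above (action, unique person id, location, objects, start and end frames); the specific edge rules ("next" for temporal order, "same_actor" for events sharing a person) and the networkx-based implementation are simplifying assumptions made for illustration, not the exact procedure used in the paper.

```python
# Minimal sketch: list of events -> GEST-style graph (nodes = events, edges = relations).
from dataclasses import dataclass, field
import networkx as nx

@dataclass
class Event:
    action: str
    person_id: int
    location: str
    objects: list = field(default_factory=list)
    start_frame: int = 0
    end_frame: int = 0

def build_gest_graph(events):
    g = nx.DiGraph()
    # One GEST node per event, carrying its space-time attributes.
    for i, ev in enumerate(events):
        g.add_node(i, **vars(ev))
    # GEST edges: temporal "next" edges between events in temporal order,
    # plus "same_actor" edges linking events performed by the same person.
    for i, a in enumerate(events):
        for j, b in enumerate(events):
            if i == j:
                continue
            if a.end_frame <= b.start_frame:
                g.add_edge(i, j, relation="next")
            if a.person_id == b.person_id and i < j:
                g.add_edge(i, j, relation="same_actor")
    return g

events = [
    Event("hold cup", person_id=1, location="classroom", objects=["cup"], start_frame=0, end_frame=40),
    Event("walk across floor", person_id=1, location="classroom", start_frame=40, end_frame=90),
    Event("close door", person_id=2, location="classroom", objects=["door"], start_frame=60, end_frame=100),
]
graph = build_gest_graph(events)
print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
```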
VidIL: A young man prepares for an activity in his classroom. He stands at the door of an empty room filled with desks and chairs. The classroom appears clean and orderly. As the lights illuminate the room, a boy stands on a desk, surveying the classroom with curiosity.
VALOR: in the classroom a man in a white shirt came out of the classroom and walked out
COSA: the new classrooms are open to the public. a view of the main room. a view of the classroom. a view of the room.
VAST: a man walks into a classroom and opens the door to a classroom.
GIT2: a picture of a portrait of a man on a wall
mPLUG-2: a man opens a door and walks into a classroom
PDVC: a man is seen speaking to the camera and leads into a man holding a dart board. then a man is seen speaking to the camera and leads into a man throwing darts at a board. then the man then puts the dart on the board and then walks up to the camera.
GEST (own): In a classroom, the person holds a cup while standing next to a table. They then walk across the floor, bend down to pick something up, and stand up again. After that, the person closes a door and walks through it, entering another room. Meanwhile, another person closes the door, walks across the floor, and stands near a wall. Then, the other person holds a laptop while standing, and walks towards a chair.
GEST + VidIL: In a classroom setting, a young man stands by a table, holding a cup. He walks across the room, bends down to pick up an item, and then straightens up. Afterward, he closes the door as he enters the room. Meanwhile, another person closes the door behind them, walks across the floor, and stands still. This second person then picks up a laptop, stands for a moment, and proceeds to walk over to a chair. The classroom is well-lit, with several desks and chairs arranged neatly.
gpt-4o-2024-05-13 (20 frames): A cyclist is riding along a designated bike lane on a road. As the cyclist approaches a parked white car, the driver of the car opens the door into the bike lane without checking for oncoming traffic. The cyclist, unable to avoid the sudden obstacle, collides with the open car door. The impact causes the cyclist to fall off the bike, while the driver appears startled by the incident.
Notes: The collision between the cyclist and the car door is hallucinated.
Gemini-2.5: The video opens with a person cycling on a red bicycle, moving from left to right across the frame. They are riding on the side of a road, next to a parked white station wagon and a yellow car. The cyclist continues to pedal, maintaining a steady pace. The video concludes with the cyclist still in motion, positioned next to the parked vehicles.
Notes: The man opening the door and the cyclist swerving out of the way are not mentioned.
videollama3: A cyclist rides a bicycle with an oversized front wheel and a small rear wheel down a street. The cyclist is wearing a helmet and casual clothing. They are riding in the bike lane, next to a white car. There are trees and houses on either side of the street.
Notes: The man opening the door and the cyclist swerving out of the way are not mentioned.
gpt-4o-2024-05-13 (8 frames): In the video, a red marker is placed on a flat surface. Shortly after, a second red marker is added next to the first one. Finally, a blue marker is placed beside the two red markers, completing the sequence of events.
Notes: It misses the fact that the red markers are dropped and that one of the red markers is removed from the table.
gemini-2.5: The video opens with a single red marker lying on a wooden surface. Shortly after, a second red marker is added next to the first one. Then, a hand reaches in to place a blue marker on the surface, but accidentally pushes one of the red markers, causing it to fall off the table. The person quickly picks up the fallen marker and places it back on the table next to the other two. After a few seconds, the same hand picks up all three markers.
Notes: It hallucinates the pushing and picking up of the red markers, as well as the picking up of all three markers.
videollama3: The video begins with a view of a wooden surface. A red marker appears on the surface, followed by another red marker. The markers are placed parallel to each other. A hand then enters the frame and places a blue marker next to the red markers.
Notes: It catches the hand that places a marker, but it misses the removal.
gpt-4o-2024-05-13 (10 frames): In the video, a man is seated at a table with a pen placed in front of him. He is holding a notebook and begins by flipping through its pages. He carefully examines the contents, turning the pages one by one. After a while, he places the notebook down on the table and continues to look through it, seemingly reading or reviewing the notes written inside. The video captures his focused attention on the notebook as he methodically goes through the pages.
Notes: In this case, the model misses the part where the book is swapped for a notebook.
gemini-2.5: A man seated at a wooden desk is shown holding a notebook, flipping through its pages from front to back. He then places the notebook flat on the desk, with its pages open. A pen is visible on the desk throughout the video.
Notes: The model misses the part where the book is swapped for a notebook.
videollama3: A man is sitting at a table in an office, reading a magazine. He is wearing a white shirt and glasses. He flips through the pages of the magazine, then closes it and sets it aside. He picks up a notebook and begins to write in it.
Notes: The model hallucinates the writing part.
gpt-4o-2024-05-13 (20 frames): A man stands in front of a whiteboard filled with mathematical equations and diagrams. He is holding a marker and appears to be examining it closely. After a few moments, he begins to walk forward, away from the whiteboard. As he moves, the marker slips from his hand and falls to the floor. The video concludes with a close-up of the marker lying on the ground near his feet.
Notes: The model hallucinates the man walking away and is unable to understand that the marker "changes" color.
gemini-2.5: A man wearing glasses, a beige t-shirt, and brown shorts is standing in front of a whiteboard filled with equations. He is holding a red marker in his right hand. He then drops the marker on the floor.
Notes: The model does not notice the different marker on the floor.
videollama3: A man stands in front of a whiteboard filled with mathematical equations. He holds a marker and gestures with it as he speaks. After a moment, he drops the marker, which bounces on the floor before coming to a stop. The camera then focuses on the marker lying on the ground.
Notes: The model does not notice the different marker (different color) on the floor.
gpt-4o-2024-05-13 (40 frames): In the video, a person is seen packing items into a black backpack. Initially, the individual places a computer keyboard into the backpack. Following this, they pick up a small box from the table and also place it into the backpack. After ensuring that all items are securely packed, the person closes the backpack, puts it on their shoulder, and walks away from the table.
Notes: The model misses the mouse and the keys being picked up, including what happens with the keys (they do not go into the backpack).
gemini-2.5: A person in a white t-shirt and brown cargo shorts stands at a light-colored wooden desk, packing a black backpack. They pick up a black computer keyboard from the desk and place it into the backpack. Next, they pick up a small, white and green rectangular box and also place it into the backpack. Finally, they adjust the contents of the backpack and then lift it, carrying it away.
Notes: Same as above, missing the mouse and keys part.
videollama3: A man wearing a white shirt and khaki shorts stands at a table in a room with several tables and chairs. He is holding a black backpack and places it on the table. On the table, there is a keyboard, a mouse, a set of keys, and a small box. The man picks up the keyboard and places it inside the backpack. He then picks up the set of keys and places them inside the backpack as well. Finally, he picks up the small box and places it inside the backpack.
Notes: It recognizes the objects on the table, but still hallucinates the keys being put into the backpack.
From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach - [paper]
Explaining Vision and Language Through Graphs of Events in Space and Time - [paper]
GEST: the Graph of Events in Space and Time as a Common Representation between Vision and Language - [paper]