The goal of this work was to learn semantically relevant representations for instructional videos using both the video frames and the narration available from the hosting site for user-generated videos.
Existing research in video understanding focuses mostly on tasks such as classification, action detection, temporal activity detection, and retrieval. These tasks involve encoding a video into a latent representation, which usually requires a large number of annotations to learn effectively. Such a latent representation can be very useful for this variety of tasks, but it lacks interpretability and detailed semantic understanding. Including other modalities in the self-supervised learning process can improve the learned representations on downstream tasks across text, audio, and video.
In addition to using attention mechanisms for joint embeddings between modalities, this project allowed for a detailed investigation of the many self-supervised methods used in video representation learning for instructional videos. These methods range from video-only signals to multi-modal signals that involve audio, video, and text. Through this review, which will later be submitted as a survey paper, it was confirmed that contrastive objective functions were the most successful approaches when evaluated on downstream tasks such as action recognition and text-to-video retrieval.
To the right, you can see an example of the variety of self-supervised objective functions we directly experimented with for this specific work.
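To make the contrastive family of objectives concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired clip and narration embeddings. It is a generic illustration and not necessarily the exact objective used in this work: the function names, the temperature value of 0.07, and the use of plain Python lists instead of a tensor library are all assumptions made for readability.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matching pair."""
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)
    # sim[i][j]: cosine similarity of clip i and narration j, scaled by temperature.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(logits):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(logits):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(logits)

    sim_t = [list(col) for col in zip(*sim)]  # narration-to-clip direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))


# Hypothetical toy batch: two clips and their two narrations.
clips = [[1.0, 0.0], [0.0, 1.0]]
narrations = [[1.0, 0.0], [0.0, 1.0]]
loss = info_nce(clips, narrations)
```

The loss is small when each clip is most similar to its own narration and grows when the pairs are mismatched, which is the property that downstream tasks such as text-to-video retrieval rely on.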
Using attention mechanisms to learn joint embeddings between the narrations and visual features, we can generate graphs that capture the important relations found in a video clip. An example of the attention mechanism used is shown below.
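The figure referenced above shows the mechanism actually used. Purely as a generic illustration of the idea, the sketch below implements scaled dot-product cross-attention in which each narration word attends over a set of visual region features. It is not the project's implementation: the function names are hypothetical, and the learned query, key, and value projections that a real model would include are omitted for brevity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(word_embs, region_feats):
    """Each word (query) attends over visual regions (keys and values).

    Returns the attention weights and the attended visual feature per word.
    ASSUMPTION: words and regions already live in the same d-dimensional space.
    """
    d = len(region_feats[0])
    weights, attended = [], []
    for q in word_embs:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in region_feats
        ]
        w = softmax(scores)
        weights.append(w)
        attended.append(
            [sum(w[j] * region_feats[j][c] for j in range(len(region_feats)))
             for c in range(d)]
        )
    return weights, attended


# Hypothetical toy input: one word embedding and two visual region features.
weights, attended = cross_attention([[2.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

The attention weights give a per-word distribution over regions, and the attended vectors are one simple way to obtain a joint word-and-vision embedding.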
On the left is a visualization of the graphs for (a) how to install a shower head and (b) how to install a wood floor heating system. Each node’s color illustrates its importance by weight: the darker the node, the greater the importance. Edges are determined by the similarity between a pair of objects and a node embedding for a word that is an action or state. The frames shown are extracted from the videos and demonstrate some of the activities and relationships shown in the graphs.
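As a rough sketch of how edges like those described above could be scored, the code below compares each pair of object embeddings with each action or state word embedding and keeps the pairs whose similarity passes a threshold. Several details are assumptions made for illustration rather than the method used here: summarizing an object pair by the mean of its two embeddings, using cosine similarity, the threshold value, and every name and toy vector in the example.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def build_relation_edges(object_embs, action_embs, threshold=0.8):
    """Return (object_a, action, object_b, score) tuples that pass the threshold.

    ASSUMPTION: an object pair is represented by the mean of its embeddings.
    """
    edges = []
    for (name_a, emb_a), (name_b, emb_b) in combinations(object_embs.items(), 2):
        pair = [(x + y) / 2 for x, y in zip(emb_a, emb_b)]
        for action, act_emb in action_embs.items():
            score = cosine(pair, act_emb)
            if score >= threshold:
                edges.append((name_a, action, name_b, score))
    return edges


# Hypothetical toy embeddings for illustration only.
objects = {
    "wrench": [1.0, 0.0, 0.0],
    "shower head": [0.8, 0.6, 0.0],
    "floor": [0.0, 0.0, 1.0],
}
actions = {"tighten": [1.0, 0.3, 0.0]}
edges = build_relation_edges(objects, actions)
```

Node importance, which the visualization encodes as color intensity, is not modeled in this sketch.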