Procedure-Aware Pretraining for Instructional Video Understanding

CVPR 2023

1 Salesforce Research, 2 Rutgers University, 3 UT Austin

{hz289,mk1353}@cs.rutgers.edu, robertomm@cs.utexas.edu, {ssavarese,jniebles}@salesforce.com


Instructional videos depict humans demonstrating how to perform multi-step tasks such as cooking and repairing. Building good video representations from instructional videos is challenging because few annotated videos are available, which makes it hard to extract procedural knowledge such as the identity of the task (e.g., ‘make latte’) and of its steps (e.g., ‘pour milk’).


Our insight is that instructional procedures depict sequences of steps that repeat across instances of the same or different tasks, and that this structure can be well represented by a Procedural Knowledge Graph, where nodes are discrete steps and edges connect steps that occur sequentially in instructional activities. This graph can then be used to generate pseudo labels to train a video representation that encodes the procedural knowledge in a form that generalizes to multiple procedure understanding tasks.
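
To make the graph structure concrete, below is a minimal Python sketch of such a graph; the tasks and step names are hypothetical, and the real graph is built automatically at a much larger scale (Step 1 below).

from collections import defaultdict

class ProceduralKnowledgeGraph:
    """Nodes are discrete steps; a directed edge (u, v) records that step v
    was observed to follow step u in some instructional activity."""
    def __init__(self):
        self.successors = defaultdict(set)    # step -> steps observed to follow it
        self.predecessors = defaultdict(set)  # step -> steps observed to precede it

    def add_transition(self, u, v):
        self.successors[u].add(v)
        self.predecessors[v].add(u)

pkg = ProceduralKnowledgeGraph()
# Two hypothetical tasks that share the step 'pour milk':
for task_steps in [["pull espresso", "steam milk", "pour milk"],  # make latte
                   ["boil water", "steep tea", "pour milk"]]:     # make milk tea
    for u, v in zip(task_steps, task_steps[1:]):
        pkg.add_transition(u, v)

print(pkg.predecessors["pour milk"])  # contains 'steam milk' and 'steep tea'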


We call this Procedural Knowledge Graph-based pre-training procedure, and the resulting model, Paprika: Procedure-Aware PRe-training for Instructional Knowledge Acquisition. We evaluate Paprika on COIN and CrossTask on procedure understanding tasks such as task recognition, step recognition, and step forecasting. Paprika yields a video representation that improves over the state of the art: gains of up to 11.23% in accuracy across 12 evaluation settings.




Step 1: We automatically build a Procedural Knowledge Graph by combining the text and step information from a text-based procedural knowledge database (e.g., wikiHow) with the visual and step information from unlabeled instructional video datasets (e.g., HowTo100M). The resulting graph encodes procedural knowledge about tasks and steps, including the temporal order of steps and the relations between them.
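
As an illustration of Step 1, the sketch below assigns each segment of a video (in temporal order) to its most similar step node by cosine similarity, then adds a directed edge between the nodes of consecutive segments. The arrays segment_feats and step_feats are random placeholders for features that would come from a pre-trained video-text model; the one-nearest-step assignment and the edge rule are simplified assumptions, not the paper's exact matching procedure.

import numpy as np

rng = np.random.default_rng(0)
num_segments, num_steps, dim = 6, 10, 128
segment_feats = rng.normal(size=(num_segments, dim))  # one video's segments, in order
step_feats = rng.normal(size=(num_steps, dim))        # wikiHow step-headline embeddings

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between every video segment and every step node.
sims = l2_normalize(segment_feats) @ l2_normalize(step_feats).T
assigned = sims.argmax(axis=1)  # segment index -> step-node id

# Consecutive segments induce directed edges between their step nodes.
edges = {(int(u), int(v)) for u, v in zip(assigned, assigned[1:]) if u != v}
print(sorted(edges))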



Step 2: We propose four pre-training objectives that respectively focus on procedural knowledge about the step shown in a video, the tasks a step may belong to, the steps a task requires, and the general order of steps. These objectives are designed to teach a model to answer questions about the subgraph of the Procedural Knowledge Graph that a video segment may belong to. The Procedural Knowledge Graph produces pseudo labels for these questions, which serve as supervisory signals to adapt the video representations produced by a video foundation model for robust and generalizable procedure understanding.
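
A minimal PyTorch sketch of these four objectives as pseudo-label classification heads on top of adapted segment features. The head names, the vocabulary sizes, the two-slot previous/next encoding of step order, and the binary cross-entropy losses are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

NUM_STEPS, NUM_TASKS, FEAT_DIM = 1000, 200, 512  # illustrative vocabulary sizes

class PaprikaHeads(nn.Module):
    """One linear head per pre-training question about the PKG subgraph."""
    def __init__(self):
        super().__init__()
        self.heads = nn.ModuleDict({
            "segment_step": nn.Linear(FEAT_DIM, NUM_STEPS),      # which step node(s) is shown?
            "segment_task": nn.Linear(FEAT_DIM, NUM_TASKS),      # which task(s) may it belong to?
            "task_steps":   nn.Linear(FEAT_DIM, NUM_STEPS),      # which steps does that task need?
            "step_order":   nn.Linear(FEAT_DIM, 2 * NUM_STEPS),  # plausible previous/next steps
        })

    def forward(self, feats):
        return {name: head(feats) for name, head in self.heads.items()}

model = PaprikaHeads()
feats = torch.randn(4, FEAT_DIM)  # adapted features for a batch of video segments
logits = model(feats)
# The PKG provides multi-hot pseudo labels for each question; the zeros here
# are placeholders standing in for the graph-derived targets.
pseudo = {name: torch.zeros_like(out) for name, out in logits.items()}
loss = sum(nn.functional.binary_cross_entropy_with_logits(logits[name], pseudo[name])
           for name in logits)
loss.backward()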




Step 3: After pre-training with the proposed objectives and the Procedural Knowledge Graph, our procedure-aware model, Paprika, is frozen. We use Paprika to obtain video representations that serve as input to downstream models for multiple procedure understanding tasks.
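
The sketch below illustrates this usage pattern: a stand-in module plays the role of the frozen Paprika model, and only a lightweight head is trained on its fixed features. All shapes and the classification task are hypothetical.

import torch
import torch.nn as nn

FEAT_DIM, NUM_CLASSES = 512, 180
paprika = nn.Linear(768, FEAT_DIM)  # stand-in for the pre-trained Paprika model
for p in paprika.parameters():
    p.requires_grad = False         # frozen after pre-training

# Light downstream head, e.g., for task recognition.
head = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU(), nn.Linear(256, NUM_CLASSES))
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.randn(8, 768)                  # base video-foundation-model features
y = torch.randint(0, NUM_CLASSES, (8,))  # downstream task labels
with torch.no_grad():
    feats = paprika(x)                   # fixed procedure-aware representation
loss = nn.functional.cross_entropy(head(feats), y)
loss.backward()
optimizer.step()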


We evaluate our method on the challenging COIN and CrossTask datasets on three downstream tasks: procedural task recognition, step recognition, and step forecasting.

Regardless of the capacity of the downstream model (from a simple multi-layer perceptron to a powerful Transformer), our method yields a representation that outperforms the state of the art.

In particular, Paprika learns features rich in procedural knowledge that transfer to downstream tasks with only lightweight heads.


In conclusion:

We show how to learn a video representation that encodes procedural knowledge for procedure understanding in instructional videos. The key is to leverage a Procedural Knowledge Graph to inject procedural knowledge into the video representation, which improves state-of-the-art performance on several tasks.


@inproceedings{zhou2023paprika,
  title={Procedure-Aware Pretraining for Instructional Video Understanding},
  author={Zhou, Honglu and Martin-Martin, Roberto and Kapadia, Mubbasir and Savarese, Silvio and Niebles, Juan Carlos},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}