Step 1: We automatically build a Procedural Knowledge Graph by combining the task and step text from a text-based procedural knowledge database (e.g., wikiHow) with the visual and step information from unlabeled instructional video datasets (e.g., HowTo100M). The resulting graph encodes procedural knowledge about tasks and steps, including the temporal order of steps and the relations between them.
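The graph construction above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the task names, step names, and the two edge types (`step_to_tasks`, `next_steps`) are invented stand-ins for the node and edge structure described in the text.

```python
from collections import defaultdict

def build_pkg(task_to_steps, observed_step_orders):
    """Combine task-to-step structure (e.g., from wikiHow) with step-order
    sequences (e.g., mined from instructional videos) into one graph."""
    graph = {
        "step_to_tasks": defaultdict(set),  # which tasks require a given step
        "next_steps": defaultdict(set),     # temporal successor edges between steps
    }
    for task, steps in task_to_steps.items():
        for step in steps:
            graph["step_to_tasks"][step].add(task)
    for order in observed_step_orders:      # step sequences observed in videos
        for a, b in zip(order, order[1:]):
            graph["next_steps"][a].add(b)
    return graph

# Toy example with invented tasks and steps:
pkg = build_pkg(
    task_to_steps={"make tea": ["boil water", "steep tea"],
                   "make coffee": ["boil water", "brew coffee"]},
    observed_step_orders=[["boil water", "steep tea"],
                          ["boil water", "brew coffee"]],
)
print(sorted(pkg["step_to_tasks"]["boil water"]))  # ['make coffee', 'make tea']
```

A shared step such as "boil water" ends up linked to both tasks, which is what later lets the graph answer "which tasks might this step belong to?".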
Step 2: We propose four pre-training objectives that respectively focus on procedural knowledge about the step shown in a video segment, the tasks a step may belong to, the steps a task requires, and the general order of steps. These objectives are designed to let a model answer questions about the subgraph of the Procedural Knowledge Graph that a video segment belongs to. The Procedural Knowledge Graph produces pseudo labels for these questions, which serve as supervisory signals to adapt the video representations produced by a video foundation model for robust and generalizable procedure understanding.
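The pseudo-label derivation can be illustrated with a toy sketch. Assuming a video segment has already been matched to one step node in the graph, the four pseudo labels are read off the graph's structure; the graph contents and step names below are invented for illustration, not the paper's actual data or label format.

```python
# Toy Procedural Knowledge Graph (invented data):
task_to_steps = {"make tea": ["boil water", "steep tea"],
                 "make coffee": ["boil water", "brew coffee"]}
next_steps = {"boil water": {"steep tea", "brew coffee"}}

def pseudo_labels(matched_step):
    """Derive answers to the four pre-training questions for one segment."""
    step_id = matched_step                                    # 1) which step is shown
    tasks = {t for t, steps in task_to_steps.items()
             if matched_step in steps}                        # 2) tasks the step may belong to
    required = {s for t in tasks for s in task_to_steps[t]}   # 3) steps those tasks require
    successors = next_steps.get(matched_step, set())          # 4) plausible next steps
    return step_id, tasks, required, successors

step_id, tasks, required, successors = pseudo_labels("boil water")
```

Each of the four returned sets plays the role of a supervisory signal for one objective; in practice these would be converted into classification targets over graph nodes.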
Step 3: Our procedure-aware model, Paprika, is frozen after pre-training with the proposed objectives and the Procedural Knowledge Graph. We then use Paprika to obtain video representations that serve as input to downstream models for multiple procedure understanding tasks.
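The frozen-backbone workflow can be sketched in miniature: the pre-trained model only produces representations, and only the downstream head consumes them. Everything here is a hypothetical stand-in (the feature function, the head, and the weights are dummies), shown purely to make the division of labor concrete.

```python
def paprika_features(segment):
    """Stand-in for the frozen Paprika model: it is never updated and
    simply maps a video segment (here, a string) to a feature vector."""
    return [len(segment) * 0.1, segment.count("a") * 1.0]  # dummy features

def downstream_head(features, weights):
    """A trainable task-specific head that consumes the frozen features."""
    return sum(f * w for f, w in zip(features, weights))

# Only `weights` would be learned on a downstream task; the backbone stays fixed.
score = downstream_head(paprika_features("add water"), weights=[0.5, 2.0])
```

In a real system this corresponds to extracting features once with the frozen model and training lightweight heads per downstream task on top of them.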