Overview

  • Cognitive science has shown that humans consistently segment videos into meaningful chunks. This segmentation happens naturally, without pre-defined categories and without viewers being explicitly asked to do so.

  • Here, we study the task of Generic Event Boundary Detection (GEBD), which aims to detect generic, taxonomy-free event boundaries that segment a whole video into chunks (a minimal illustrative sketch of this output is given at the end of this overview). Details can be found in our paper: https://arxiv.org/abs/2101.10511

  • Some example event boundaries are shown in the figure on the right.

  • Generic event boundaries are

    • immediately useful for applications like video editing and summarization

    • a stepping stone towards long-form video modeling via reasoning over the temporal structure of the segmented units


  • We present more details of our dataset & annotation below; details of competition track 1 are presented on the corresponding separate webpage. More details & some visualization examples can be found in our white paper.

  • Notably, our Kinetics-GEBD has the largest number of boundaries (e.g., 32x that of ActivityNet, 8x that of EPIC-Kitchens-100); its boundaries are in-the-wild, open-vocabulary, cover generic event changes, and respect the diversity of human perception.

Examples of generic event boundaries: 1) a long jump is segmented at a shot cut, then between the actions Run, Jump, and Stand Up (dominant subject in red circle); 2) color/brightness changes; 3) a new subject appears.
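
For concreteness, the following minimal sketch shows how a set of detected boundary timestamps segments a video into chunks. It is only an illustration of the task output; the function name and data layout are our assumptions here, not the released Kinetics-GEBD annotation or evaluation format.

    # Illustrative sketch: turn detected GEBD boundary timestamps (in seconds)
    # into the chunks they induce. The names and layout are assumptions for
    # illustration, not the official Kinetics-GEBD annotation format.
    def boundaries_to_chunks(boundaries, video_duration):
        """Convert boundary timestamps into consecutive (start, end) chunks."""
        times = sorted(t for t in boundaries if 0.0 < t < video_duration)
        edges = [0.0] + times + [video_duration]
        return list(zip(edges[:-1], edges[1:]))

    # Example: three detected boundaries in a 10-second clip yield four chunks.
    print(boundaries_to_chunks([2.3, 5.1, 7.8], 10.0))
    # [(0.0, 2.3), (2.3, 5.1), (5.1, 7.8), (7.8, 10.0)]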


Overview

  • According to cognitive science, humans perceive videos in terms of different events, which are separated by status changes of the dominant subjects in the video. For example, in the figure below, humans perceive the process of “javelin sport” through action events such as “walking”, “running” and “throwing”. These events are triggered by the athlete’s status changes, such as the instantaneous change from “walking” to “running”.

  • The moments that trigger status changes of persons, objects, or scenes often convey useful and interesting information amid a large amount of repeated, static, or regular events. Therefore, developing an understanding of these status changes is another step towards more fine-grained and robust video understanding.

  • We denote those status changes by boundaries that segment a whole video into chunks. Among current task types, captioning could be one of the best touchstones for examining the correctness and human-likeness of video understanding. Hence, we study the task of Generic Event Boundary Captioning (GEBC), which aims to caption the generic, taxonomy-free event boundaries caused by status changes in videos. Details can be found in our paper: https://arxiv.org/abs/2204.00486

Dataset and Task Demonstration

  • Motivated by this idea, we build a new dataset called Kinetic-GEBC (Generic Event Boundary Captioning), which includes video boundaries indicating status changes in generic events. For every boundary, our Kinetic-GEBC provides the temporal location and a natural language description, which consists of the dominant Subject, the Status Before and the Status After the boundary.

  • In comparison with the most relevant datasets in video captioning, our Kinetic-GEBC is the first to target the captioning of generic event boundaries.

  • In this challenge, we focus on the task of Boundary Captioning, which aims to develop the understanding of status changes through captioning.

  • Provided with the timestamp of a boundary, the machine is required to generate a sentence describing the status change at the boundary.

  • An example is shown on the right side. Given the timestamp 00:03.94, the visual information around that timestamp is taken as input. The machine is then expected to output a caption consisting of three items: Subject, Status Before and Status After. In this example, the boundary is caused by the athlete's status change from walking to running.
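
To make the expected output format concrete, below is a minimal sketch of one Boundary Captioning sample and a possible prediction. The field names and the helper function are illustrative assumptions, not the released Kinetic-GEBC annotation schema or the official evaluation interface.

    # Illustrative sketch of one Boundary Captioning sample and prediction.
    # Field names are assumptions for illustration, not the Kinetic-GEBC schema.
    boundary_sample = {
        "video_id": "example_video",   # hypothetical identifier
        "boundary_time": 3.94,         # boundary timestamp in seconds (00:03.94)
    }

    # A prediction consists of the three caption items described above.
    prediction = {
        "subject": "the athlete",
        "status_before": "walking with the javelin",
        "status_after": "running up for the throw",
    }

    def format_caption(pred):
        """Join the three items into a single human-readable sentence."""
        return (f"{pred['subject']} changes from "
                f"{pred['status_before']} to {pred['status_after']}.")

    print(format_caption(prediction))  # "the athlete changes from ... to ..."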

Overview

  • “How can I run the microwave for 1 minute?” We often need assistance when dealing with new devices. Learning from the device’s instructional manual or video may require a lot of time and effort. Could we turn to AI assistants (e.g., AR glasses, robots) to teach us?

  • Ideally, the AI assistant should have the following comprehensive abilities:

  1. To learn about the devices, it should learn operations from the instructional manual/video;

  2. To understand the question, it should match the specific content of the instructional manual/video to the question;

  3. To teach users, it should provide step-by-step guidance, with both language instructions and references to visual locations;

As you can see, these requirements are more challenging than those of VQA and Visual Dialog tasks. Here, we define a new task called Affordance-centric Question-driven Task Completion (AQTC), where the AI assistant should learn from instructional videos and scripts to guide users step-by-step to achieve their goals.

  • An example is shown in the video below:

AQTC Task & AssistQ Dataset

  • AQTC expects the AI assistant to address affordance-centric questions by inferring from multimodal cues and generating step-by-step answers. Given an instructional video, the video script, the user-view image, the user's question, and the set of candidate answers, the AI should select the one correct answer at each step (a minimal sketch of this step-wise selection is given at the end of this section).

    • The answers are multimodal: an answer contains bounding boxes of buttons in the user-view image.

    • The answers are multi-step: a question is completed step-by-step.

  • To support this task, we constructed AssistQ, a new dataset comprising 529 question-answer samples derived from 100 newly filmed videos. Each question is to be completed with multi-step guidance inferred from visual and textual details.

    • More specifically, the videos span 25 common household appliances, such as microwaves and washing machines, and 53 distinct appliance brands. The average video duration is 115 seconds; videos of more complex devices are longer in order to capture their larger variety of functions.

    • In comparison to existing datasets in VQA, Visual Dialog, and Embodied QA, our AssistQ is unique in its egocentric-perspective videos, affordance-centric questions, and multi-step answers.
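
To make the step-wise answer selection concrete, here is a minimal sketch of the selection loop such an assistant could implement. All names (the sample fields, score_candidate, and the toy data) are illustrative assumptions, not the AssistQ data format or the official baseline.

    # Illustrative sketch of the AQTC step-wise answer selection loop.
    # All names below are assumptions for illustration, not the AssistQ
    # data format or the official baseline.
    import random

    def score_candidate(question, context, candidate, history):
        """Placeholder scorer; a real model would fuse the video, script,
        user-view image, question, and previously chosen steps."""
        return random.random()

    def answer_question(sample):
        """Pick one candidate answer per step, conditioning on earlier choices."""
        history = []
        for step_candidates in sample["candidates_per_step"]:
            scores = [score_candidate(sample["question"], sample["context"],
                                      cand, history)
                      for cand in step_candidates]
            chosen = step_candidates[scores.index(max(scores))]
            history.append(chosen)
        return history  # one chosen answer (text + button boxes) per step

    # Toy example with two steps and two candidate answers per step.
    toy = {
        "question": "How can I run the microwave for 1 minute?",
        "context": {"video": None, "script": None, "user_view": None},
        "candidates_per_step": [
            [{"text": "Press the power button", "boxes": [[10, 20, 60, 80]]},
             {"text": "Open the door", "boxes": [[5, 5, 40, 40]]}],
            [{"text": "Turn the timer knob to 1 minute", "boxes": [[70, 20, 120, 80]]},
             {"text": "Press start twice", "boxes": [[30, 90, 80, 140]]}],
        ],
    }
    print(answer_question(toy))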