EgoTaskQA: Understanding Human Tasks in Egocentric Videos
What is EgoTaskQA?
Understanding human tasks through video observations is an essential capability of intelligent agents. The challenge of building such a capability lies in the difficulty of generating a detailed understanding of situated actions, their effects on object states (i.e., state changes), and their causal dependencies. These challenges are further aggravated by the natural parallelism of multi-tasking and by partial observations in multi-agent collaboration. We introduce the EgoTaskQA benchmark for evaluating models' capabilities on these critical aspects in egocentric videos.
We design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others. We consider reasoning problems including descriptive (what status?), predictive (what will?), explanatory (what caused?), and counterfactual (what if?) queries to provide diagnostic analyses on spatial, temporal, and causal understandings of goal-oriented tasks.
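To make the question taxonomy concrete, below is a minimal sketch of how a single QA sample covering these four reasoning types could be represented in code. The field names (video_id, question, answer, reasoning_type, targets) and the example sample are illustrative assumptions for this page, not the released dataset schema.

# A minimal, hypothetical sketch of an EgoTaskQA-style QA sample.
# Field names and the example below are assumptions, not the official format.
from dataclasses import dataclass
from typing import List

REASONING_TYPES = ("descriptive", "predictive", "explanatory", "counterfactual")

@dataclass
class QASample:
    video_id: str        # egocentric video clip the question refers to
    question: str        # e.g., "What if the agent had not opened the fridge?"
    answer: str          # ground-truth answer string
    reasoning_type: str  # one of REASONING_TYPES
    targets: List[str]   # e.g., ["action_effects", "intents", "beliefs"]

    def __post_init__(self):
        assert self.reasoning_type in REASONING_TYPES, (
            f"unknown reasoning type: {self.reasoning_type}"
        )

# Example usage with a made-up sample:
sample = QASample(
    video_id="clip_0001",
    question="What if the agent had not opened the fridge?",
    answer="The milk would remain inside the fridge.",
    reasoning_type="counterfactual",
    targets=["action_effects"],
)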
The EgoTaskQA Benchmark
We introduce the EgoTaskQA benchmark that evaluates models' event understanding capabilities with goal-oriented questions. Some examples in the EgoTaskQA dataset are shown below:
For more details, you can explore the dataset here and check the visualizations in the supplementary material.
The Augmented LEMMA Dataset
In order to generate diverse questions on task execution details, we augment the LEMMA dataset with ground-truth annotations of object states, relationships, and agents' beliefs about others. More specifically, we annotate object states and relationships both before and after each action. We further annotate multi-agent relationships on object visibility and agents' awareness of others. Based on these annotations, we determine the causal relationships between actions to provide a complete dependency graph.
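The sketch below illustrates one way such annotations could be linked into a causal dependency graph: an action depends on an earlier action if the earlier action's post-states supply one of its required pre-states. This mirrors the description above but uses a hypothetical data layout, not the released annotation format.

# A minimal, hypothetical sketch of building an action dependency graph from
# before/after object-state annotations. Not the released annotation format.
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

@dataclass
class ActionAnnotation:
    action_id: str
    pre_states: Set[Tuple[str, str]]   # (object, state) required before the action
    post_states: Set[Tuple[str, str]]  # (object, state) holding after the action

def build_dependency_graph(actions: List[ActionAnnotation]) -> Dict[str, List[str]]:
    """Map each action to the earlier actions whose effects it depends on."""
    deps: Dict[str, List[str]] = {a.action_id: [] for a in actions}
    for i, later in enumerate(actions):
        for earlier in actions[:i]:  # actions assumed to be in temporal order
            if later.pre_states & earlier.post_states:
                deps[later.action_id].append(earlier.action_id)
    return deps

# Example: taking the milk requires the fridge to have been opened first.
open_fridge = ActionAnnotation("open_fridge", {("fridge", "closed")}, {("fridge", "open")})
take_milk = ActionAnnotation("take_milk", {("fridge", "open")}, {("milk", "held")})
print(build_dependency_graph([open_fridge, take_milk]))
# {'open_fridge': [], 'take_milk': ['open_fridge']}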
Code
View the code and instructions for reproducing the experiments on GitHub.
Bibtex
@inproceedings{jia2022egotaskqa,
  title     = {EgoTaskQA: Understanding Human Tasks in Egocentric Videos},
  author    = {Jia, Baoxiong and Lei, Ting and Zhu, Song-Chun and Huang, Siyuan},
  booktitle = {The 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks},
  year      = {2022}
}
Team
1 UCLA Center for Vision, Cognition, Learning, and Autonomy (VCLA)
2 Beijing Institute for General Artificial Intelligence (BIGAI)
3 Institute for Artificial Intelligence, Peking University
4 Department of Automation, Tsinghua University