LEMMA: A Multi-view Dataset for LEarning Multi-agent Multi-task Activities
Understanding and interpreting human actions is a long-standing challenge and a critical indicator of perception in artificial intelligence. However, several essential components of daily human activities are largely missing from prior literature, including goal-directed actions, concurrent multi-tasking, and collaboration among multiple agents. We introduce the LEMMA dataset to provide a single home for addressing these missing dimensions with meticulously designed settings, wherein the number of tasks and agents varies to highlight different learning objectives. We densely annotate the atomic-actions with human-object interactions to provide ground truth for the compositionality, scheduling, and assignment of daily activities. We further devise challenging compositional action recognition and action/task anticipation benchmarks with baseline models to measure the capability of compositional action understanding and temporal reasoning. We hope this effort will drive the machine vision community to examine goal-directed human activities and further study task scheduling and assignment in the real world.
We introduce the LEMMA dataset to explore the essence of complex human activities in a goal-directed, multi-agent, multi-task setting with ground-truth labels of compositional atomic-actions and their associated tasks. By limiting the scenarios to at most two multi-step tasks and two agents, we address human multi-task and multi-agent interactions in four scenarios: single-agent single-task (1 x 1), single-agent multi-task (1 x 2), multi-agent single-task (2 x 1), and multi-agent multi-task (2 x 2). In the 2 x 1 setting, task instructions are given to only one agent to resemble a robot-helping scenario, in the hope that the learned perception models can be applied to robotic tasks (especially in HRI) in the near future.
Both third-person views (TPVs) and first-person views (FPVs) were recorded to account for different perspectives of the same activities. We densely annotate atomic-actions (in the form of compositional verb-noun pairs), the task each atomic-action belongs to, and the spatial location of each participating agent (bounding boxes) to facilitate the learning of multi-agent multi-task scheduling and assignment.
Compositional Action Recognition
Human indoor activities are composed of fine-grained action segments with rich semantics, and interactions between humans and objects are highly purposive. From a verb as simple as "put", we can generate a plethora of combinations of objects and target places, such as "put cup onto table" and "put fork into drawer". Situations become even more challenging when objects are used as tools; for example, "put meat into pan using fork".
Motivated by the above observations, we propose the compositional action recognition benchmark on the collected LEMMA dataset with each object attributed to a specific semantic position in the action label. Specifically, we build 24 compositional action templates; see examples on the left. In these action templates, each noun could denote an interacting object, a target or a source location, or a tool used by a human agent to perform certain actions.
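As a rough illustration of what an action template with semantic positions could look like, the sketch below models one template as a verb plus a pattern with named slots. The class name, the slot names (`object`, `target`, `tool`), and the pattern wording are assumptions made for illustration; they are not the dataset's actual 24 templates or its annotation format.

```python
from dataclasses import dataclass
from string import Formatter


@dataclass(frozen=True)
class ActionTemplate:
    """Hypothetical compositional action template: a verb plus named noun slots."""
    verb: str
    pattern: str  # e.g. "put {object} into {target} using {tool}"

    def positions(self) -> list[str]:
        # Names of the semantic positions (slots) that appear in the pattern.
        return [name for _, name, _, _ in Formatter().parse(self.pattern) if name]

    def render(self, **nouns: str) -> str:
        # Fill every semantic position with a noun to obtain a full action label.
        return self.pattern.format(**nouns)


# Illustrative template only; slot names are assumptions.
PUT_WITH_TOOL = ActionTemplate("put", "put {object} into {target} using {tool}")

slots = PUT_WITH_TOOL.positions()
label = PUT_WITH_TOOL.render(object="meat", target="pan", tool="fork")
print(slots)
print(label)
```

Keeping the nouns keyed by position, rather than as a flat list, is what lets a model be scored on whether "fork" was recognized as the tool and not as the object being moved.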
In this challenge, we require computational models to correctly detect the ongoing concurrent action verbs as well as the nouns at their semantic positions.
We evaluate model performance on compositional action recognition in both FPVs and TPVs. Specifically, the model is asked to
predict multiple labels in verb recognition for concurrent actions (e.g., "watch tv" and "drink with cup" at the same time);
predict multiple labels in noun recognition for each semantic position given the verbs, representing interactions with multiple objects using the same action (e.g., "wash spoon, cup using sink").
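The two prediction targets above can be sketched as multi-hot vectors: one over verbs for concurrent actions, and one over nouns per semantic position. The vocabularies and the position names (`object`, `tool`) below are hypothetical stand-ins, and the encoding is one plausible choice rather than the format used by the baseline models.

```python
# Hypothetical vocabularies; the real dataset vocabularies are larger.
VERBS = ["watch", "drink", "wash", "put"]
NOUNS = ["tv", "cup", "spoon", "sink", "fork"]


def multi_hot(labels, vocab):
    """Return a 0/1 vector over vocab with a 1 for every label present."""
    index = {word: i for i, word in enumerate(vocab)}
    vector = [0] * len(vocab)
    for label in labels:
        vector[index[label]] = 1
    return vector


# Concurrent actions "watch tv" and "drink with cup": two verbs are active at once.
verb_target = multi_hot(["watch", "drink"], VERBS)

# "wash spoon, cup using sink": the (assumed) object position holds two nouns,
# while the (assumed) tool position holds one.
noun_targets = {
    "object": multi_hot(["spoon", "cup"], NOUNS),
    "tool": multi_hot(["sink"], NOUNS),
}

print(verb_target)
print(noun_targets)
```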
We show qualitative and quantitative results of our baseline models here:
Action and Task Anticipation
The most significant factor of human activities, as emphasized in this work, is their goal-directed, teleological stance. An in-depth understanding of goal-directed tasks demands the ability to predict latent goals, action preferences, and potential outcomes. To tackle these challenges, we propose the action and task anticipation benchmark on the collected LEMMA dataset. Specifically, we evaluate model performance on anticipating (i.e., predicting for the next action segment) actions and tasks with both FPV and TPV videos.
This benchmark provides both training and testing data in all four activity scenarios to study the goal-directed multi-task multi-agent problem. As there is an innate discrepancy in prediction difficulty among these four scenarios, we gradually increase the overall prediction difficulty, akin to a curriculum learning process, by setting the fraction of training videos to 3/4, 1/4, 1/4, and 1/4 for the 1 x 1, 1 x 2, 2 x 1, and 2 x 2 scenarios, respectively. Intuitively, with sufficient clean demonstrations of tasks in the 1 x 1 scenario, interpreting tasks in more complex settings (i.e., 1 x 2, 2 x 1, and 2 x 2) should be easier, thus requiring fewer training samples. We believe that such a design better tests the generalization ability of models.
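The per-scenario training fractions can be sketched as a small split helper. The fractions come from the description above; the scenario keys, the function name, and the rule for choosing which videos go to training (sorted order, rounded down) are assumptions, since the actual assignment of videos to splits is not described here.

```python
from fractions import Fraction

# Training fraction per scenario, as stated in the benchmark description.
TRAIN_FRACTION = {
    "1x1": Fraction(3, 4),
    "1x2": Fraction(1, 4),
    "2x1": Fraction(1, 4),
    "2x2": Fraction(1, 4),
}


def split_videos(video_ids, scenario):
    """Split one scenario's videos into (train, test).

    Assumption: a deterministic split in sorted order, rounding the training
    count down. The official split may be chosen differently.
    """
    ids = sorted(video_ids)
    n_train = int(len(ids) * TRAIN_FRACTION[scenario])
    return ids[:n_train], ids[n_train:]


videos = [f"v{i}" for i in range(8)]  # hypothetical video identifiers
train_1x1, test_1x1 = split_videos(videos, "1x1")
train_2x2, test_2x2 = split_videos(videos, "2x2")
print(len(train_1x1), len(test_1x1))
print(len(train_2x2), len(test_2x2))
```

With eight videos per scenario, this gives six training videos for 1 x 1 but only two for 2 x 2, so most of what a model sees of the harder scenarios is at test time.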
We evaluate models with similar metrics: precision, recall, and F1-score. Performance is evaluated separately for each scenario. Quantitative results of the baseline models are shown here.
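For multi-label predictions, precision, recall, and F1-score can be computed as in the sketch below. The averaging protocol is not specified in this text, so the code assumes micro-averaging over all action segments in a scenario; the function name and the example labels are hypothetical.

```python
def precision_recall_f1(predicted, truth):
    """Micro-averaged precision, recall, and F1 over label sets.

    predicted and truth are equal-length lists of sets, one set per segment.
    Micro-averaging is an assumption, not the benchmark's stated protocol.
    """
    true_positives = sum(len(p & t) for p, t in zip(predicted, truth))
    n_predicted = sum(len(p) for p in predicted)
    n_truth = sum(len(t) for t in truth)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_truth if n_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical predictions for two segments, grouped by scenario so that
# each scenario is scored separately.
results = {
    "1x2": (
        [{"watch", "drink"}, {"wash"}],   # predicted verb sets
        [{"watch"}, {"wash", "put"}],     # ground-truth verb sets
    ),
}

scores = {name: precision_recall_f1(p, t) for name, (p, t) in results.items()}
print(scores)
```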
Center for Vision, Cognition, Learning, and Autonomy (VCLA@UCLA)