Demo2Vec: Learning Object Affordances from Videos in the Wild

University of Southern California & Stanford University

(* indicates equal contributions)


Abstract

Watching expert demonstrations is an important way for humans and robots to reason about the affordances of unseen objects. Humans can summarize the information in a demonstration and transfer the learned knowledge when interacting with the same object in different scenarios. In this paper, we consider the problem of reasoning about object affordances through the feature embedding of demonstration videos. We design the Demo2Vec model, which learns to extract an embedding vector from a demonstration video and to predict the interaction region and the action label on a target image of the same object. We also collected and annotated the Online Product Review dataset for Affordance (OPRA) for training and evaluating the Demo2Vec model, on which our method outperforms various recurrent neural network baselines.

Model

(a) Our Demo2Vec model is composed of a demonstration encoder and an affordance predictor.

(b) The demonstration encoder summarizes the input demonstration video into an embedding vector, and the affordance predictor then uses this embedding to predict the affordances for the target image (i.e., the interaction heatmap and the action label).
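
To make this two-part design concrete, below is a minimal PyTorch sketch of the pipeline. It is not the released implementation: the module names, layer sizes, and number of action classes are assumptions chosen for readability, and the per-frame CNN and plain LSTM stand in for whatever backbones and recurrent encoders the full model uses.

import torch
import torch.nn as nn

class DemonstrationEncoder(nn.Module):
    """Summarizes a demonstration video into a single embedding vector.
    Sketch: per-frame CNN features aggregated by an LSTM; the final
    hidden state serves as the demonstration embedding."""
    def __init__(self, feat_dim=512, embed_dim=128):
        super().__init__()
        self.frame_cnn = nn.Sequential(               # stand-in for a pretrained backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU())
        self.rnn = nn.LSTM(feat_dim, embed_dim, batch_first=True)

    def forward(self, video):                         # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.frame_cnn(video.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.rnn(feats)
        return h[-1]                                  # (B, embed_dim)

class AffordancePredictor(nn.Module):
    """Predicts an interaction heatmap and an action label for a target
    image, conditioned on the demonstration embedding."""
    def __init__(self, embed_dim=128, num_actions=7):  # 7 action classes is an assumption
        super().__init__()
        self.img_cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.heatmap_head = nn.Conv2d(64 + embed_dim, 1, 1)
        self.action_head = nn.Linear(embed_dim, num_actions)

    def forward(self, image, embedding):              # image: (B, 3, H, W)
        feat = self.img_cnn(image)
        emb = embedding[:, :, None, None].expand(-1, -1, *feat.shape[2:])
        heatmap = self.heatmap_head(torch.cat([feat, emb], dim=1))
        return heatmap, self.action_head(embedding)

# Usage: embed a demonstration clip, then predict affordances on a target image.
encoder, predictor = DemonstrationEncoder(), AffordancePredictor()
demo = torch.randn(1, 16, 3, 64, 64)                  # a 16-frame demonstration clip
target = torch.randn(1, 3, 64, 64)                    # target image of the same object
heatmap, action_logits = predictor(target, encoder(demo))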

Online Product Review for Affordances (OPRA) Dataset

Our OPRA Dataset contains 11,505 demonstration clips and 2,512 object images scraped from 6 popular YouTube product review channels. Each entry in the dataset consists of a demonstration video, a target image, ten annotated points (shown as red dots on the images) representing the interaction region, and an action label (shown in purple boxes).
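
To make the annotation format concrete, the sketch below shows a hypothetical in-memory representation of one entry, together with one common way to render the ten annotated points into a dense ground-truth heatmap for training. The field names, file paths, and the Gaussian rendering are illustrative assumptions, not the dataset's actual schema.

import numpy as np

# Hypothetical representation of one OPRA entry (field names and paths are made up).
example_entry = {
    "video_clip": "clips/example_demo.mp4",        # demonstration video clip
    "target_image": "images/example_product.jpg",  # image of the same object
    "points": [(0.42, 0.31)] * 10,                 # ten annotated points, normalized (x, y)
    "action": "rotate",                            # action label
}

def points_to_heatmap(points, size=64, sigma=2.0):
    """Render annotated points into a heatmap by placing a Gaussian at each
    point (a common choice; the exact rendering used for OPRA may differ)."""
    ys, xs = np.mgrid[0:size, 0:size]
    heatmap = np.zeros((size, size), dtype=np.float32)
    for px, py in points:
        heatmap += np.exp(-((xs - px * size) ** 2 + (ys - py * size) ** 2) / (2 * sigma ** 2))
    return heatmap / heatmap.max()

gt_heatmap = points_to_heatmap(example_entry["points"])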

Example Demonstration Sequence

An example demonstration video segmented into clips, one per box, showing the sequential task of making a smoothie.

Predicted Affordances

Sample predicted heatmaps and action labels.
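
For qualitative inspection, a predicted heatmap can be overlaid on its target image. Below is a minimal matplotlib sketch; the function and argument names are illustrative and not part of any released code.

import matplotlib.pyplot as plt

def show_prediction(image, heatmap, action_label):
    """Overlay a predicted interaction heatmap on the target image."""
    plt.imshow(image)                               # image: (H, W, 3) array
    plt.imshow(heatmap, cmap="jet", alpha=0.5,      # stretch the heatmap over the image
               extent=(0, image.shape[1], image.shape[0], 0))
    plt.title(f"Predicted action: {action_label}")
    plt.axis("off")
    plt.show()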


Citation

@inproceedings{demo2vec2018cvpr,
author = {Fang, Kuan and Wu, Te-Lin and Yang, Daniel and Savarese, Silvio and Lim, Joseph J.},
title = {Demo2Vec: Reasoning Object Affordances From Online Videos},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2018}
}