Learning Object Affordances from Videos in the Wild

Our Demo2Vec model summarizes a demonstration video into a feature embedding and uses it to predict object affordances.


Watching expert demonstrations is an important way for humans and robots to reason about the affordances of unseen objects. Humans can summarize the information in a demonstration and transfer the learned knowledge when interacting with the same object in different scenarios. In this paper, we consider the problem of reasoning about object affordances through the feature embedding of demonstrations. We design the Demo2Vec model, which learns to extract embedding vectors from demonstration videos and predicts the interaction region and the action label on a target image of the same object. We also collected and annotated the Online Product Review dataset for Affordance (OPRA) for learning and evaluating the Demo2Vec model, on which our method outperforms various recurrent neural network baselines.

Video Overview


(a) Our Demo2Vec model is composed of a demonstration encoder and an affordance predictor.

(b) Demonstration encoder. The demonstration encoder summarizes the input demonstration video into an embedding vector. The affordance predictor then uses the embedding vector to predict the affordances for the target image (i.e., the interaction heatmap and the action label).
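
The two-stage interface described in (a) and (b) can be sketched roughly as follows. This is a toy illustration in plain Python, not the Demo2Vec implementation: the mean-pooling encoder, the dot-product heatmap scoring, and all names (`encode_demonstration`, `predict_affordance`, `action_weights`) are assumptions made for illustration, standing in for the learned networks.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def encode_demonstration(frame_features):
    """Summarize per-frame feature vectors into one embedding vector.

    Assumption: mean pooling stands in for the learned encoder.
    """
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


def predict_affordance(embedding, image_features, action_weights):
    """Return (heatmap, action_index) for a target image.

    image_features: H x W grid of feature vectors (hypothetical input).
    action_weights: one weight vector per action class (hypothetical).
    """
    h, w = len(image_features), len(image_features[0])
    # Score every image location against the demonstration embedding.
    flat = [dot(embedding, image_features[i][j])
            for i in range(h) for j in range(w)]
    probs = softmax(flat)
    heatmap = [probs[i * w:(i + 1) * w] for i in range(h)]
    # Score every action class against the same embedding.
    action_probs = softmax([dot(embedding, wv) for wv in action_weights])
    action = max(range(len(action_probs)), key=action_probs.__getitem__)
    return heatmap, action


# Usage with made-up 2-D features: two frames, a 2x2 image, two actions.
embedding = encode_demonstration([[1.0, 0.0], [3.0, 2.0]])
image = [[[0.0, 0.0], [1.0, 0.0]],
         [[0.0, 1.0], [0.0, 0.0]]]
heatmap, action = predict_affordance(embedding, image, [[0.0, 1.0], [1.0, 0.0]])
```

The point of the sketch is the data flow: the video is reduced to a single vector once, and both outputs (a heatmap normalized over image locations and a single action label) are conditioned on that vector.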

Online Product Review Dataset for Affordances (OPRA)

Samples from our dataset. Each data point consists of a video, an image, ten annotated points (shown as red dots on the images) representing the interaction region, and an action label (shown in purple boxes) for the human-object interaction centered at the indicated region(s).
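
One way to picture a data point is as a small record holding the fields listed in the caption. The sketch below also shows one plausible way to turn the ten annotated points into a dense interaction heatmap by summing a Gaussian around each point. The field names, the file paths, the `"hold"` label value, and the Gaussian smoothing with `sigma` are all assumptions for illustration and are not taken from the dataset's actual file format.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AffordanceSample:
    """Hypothetical container for one data point."""
    video_path: str                     # demonstration video clip
    image_path: str                     # target image of the same object
    points: List[Tuple[float, float]]   # ten annotated (x, y) points
    action: str                         # action label


def points_to_heatmap(points, width, height, sigma=1.0):
    """Sum a Gaussian around each point, then normalize to sum to 1.

    Assumption: Gaussian smoothing is one common choice, used here
    only to illustrate how sparse points can become a heatmap.
    """
    heat = [[0.0] * width for _ in range(height)]
    for px, py in points:
        for y in range(height):
            for x in range(width):
                d2 = (x - px) ** 2 + (y - py) ** 2
                heat[y][x] += math.exp(-d2 / (2 * sigma ** 2))
    total = sum(sum(row) for row in heat)
    return [[v / total for v in row] for row in heat]


# Usage with placeholder paths and ten identical points at (x=2, y=1).
sample = AffordanceSample(
    video_path="demo.mp4",      # placeholder
    image_path="target.jpg",    # placeholder
    points=[(2.0, 1.0)] * 10,
    action="hold",              # example value
)
target_heatmap = points_to_heatmap(sample.points, width=5, height=4)
```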

Sample Sequence of Data

A sample demonstration video, segmented into clips (one per box), showing the sequential task of making a smoothie. Because some demonstration videos are sequential in this way, they also allow for learning sequential action planning.

Sample Results

Sample predicted heatmaps and action labels.