Learning Object Affordances from Videos in the Wild
Our Demo2Vec model summarizes a demonstration video into a feature embedding and uses that embedding to predict object affordances.
Watching expert demonstrations is an important way for humans and robots to reason about the affordances of unseen objects. Humans can summarize the information in a demonstration and transfer the learned knowledge when interacting with the same object in different scenarios. In this paper, we consider the problem of reasoning about object affordances through the feature embedding of demonstrations. We design the Demo2Vec model, which learns to extract embedding vectors from demonstration videos and predicts the interaction region and the action label on a target image of the same object. We also collected and annotated the Online Product Review dataset for Affordance (OPRA) for training and evaluating the Demo2Vec model, on which our method outperforms various recurrent neural network baselines.
(a) Our Demo2Vec model is composed of a demonstration encoder and an affordance predictor.
(b) Demonstration encoder. The demonstration encoder summarizes the input demonstration video into an embedding vector. The affordance predictor then uses the embedding vector to predict the affordances for the target image (i.e., the interaction heatmap and the action label).
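The two-stage data flow described in the captions can be illustrated with a toy sketch. This is not the Demo2Vec implementation: mean pooling over frames is an assumed stand-in for the learned demonstration encoder, the dot-product heatmap and the linear action head are assumed stand-ins for the learned affordance predictor, and all names, sizes, and random features below are hypothetical. It shows only the interface: video features in, one embedding vector out, then a heatmap over the target image plus a distribution over action labels.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def encode_demonstration(frame_features):
    """Summarize per-frame feature vectors into one embedding vector.

    ASSUMPTION: mean pooling over time stands in for the learned encoder.
    """
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / num_frames for d in range(dim)]


def predict_affordance(embedding, image_features, action_weights):
    """Predict an interaction heatmap and action-label probabilities.

    image_features: H x W grid of feature vectors for the target image.
    action_weights: one weight vector per action class (hypothetical linear head).
    """
    height, width = len(image_features), len(image_features[0])
    # Score each image location by its similarity to the demonstration embedding.
    scores = [dot(embedding, image_features[r][c])
              for r in range(height) for c in range(width)]
    flat = softmax(scores)  # normalize so the heatmap sums to 1
    heatmap = [flat[r * width:(r + 1) * width] for r in range(height)]
    action_probs = softmax([dot(embedding, w) for w in action_weights])
    return heatmap, action_probs


# Toy sizes and random features, chosen arbitrarily for illustration.
rng = random.Random(0)
DIM, FRAMES, H, W, ACTIONS = 8, 5, 4, 6, 3


def random_vector():
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


video = [random_vector() for _ in range(FRAMES)]
image = [[random_vector() for _ in range(W)] for _ in range(H)]
weights = [random_vector() for _ in range(ACTIONS)]

embedding = encode_demonstration(video)
heatmap, action_probs = predict_affordance(embedding, image, weights)
predicted_action = max(range(ACTIONS), key=lambda i: action_probs[i])
```

In this sketch the heatmap is a probability distribution over image locations and the action output is a distribution over classes, so both can be compared against annotated interaction regions and action labels.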