Project Description

Wearable devices equipped with a camera, such as smart glasses and augmented reality headsets, can sense the world from the user’s point of view. Hence, they can enable user-centric applications and act as personal assistants able to always “follow” the user, see and understand the world from their perspective, and provide guidance on how to carry out specific tasks. To do so, wearable devices need to develop a long-term and holistic comprehension of the way the user interacts with the surrounding world, understanding current interactions with objects, keeping track of past interactions, and possibly anticipating future ones.

Previous works have generally modeled user-object interactions at the category level or in a class-agnostic fashion, and mostly from a static point of view (short clips or individual frames), which provides only a short-range understanding of interactions. Moreover, such approaches usually assume an offline scenario in which the algorithm keeps a possibly unbounded buffer of all observed video and can access it at any time for re-processing or search (e.g., to retrieve all video segments in which a given object appears). We observe that human understanding of object interactions tends instead to be instance-based (we care about specific object instances, e.g., this knife), long-range (we can recall the history of a given object, e.g., where did I use this knife before?) and opportunistic (where did I see the weight scale that I need now?), as opposed to strictly task-oriented (search the room for the mincing knife I need now). To achieve this kind of understanding, humans can 1) automatically discover objects which may be useful for future, possibly unknown tasks, 2) track them through time and space, and 3) monitor their relationships with other objects (e.g., “where did this object come from?”).

Inspired by humans’ ability to perform long-range user-object interaction understanding in a streaming scenario (i.e., the video is processed only once, when it is acquired, rather than stored for later re-use), we propose to develop a system able to 1) discover important objects to be tracked and highlight their relationships with other objects, 2) track the discovered objects in an instance-based, long-term fashion, 3) use information about the detected objects and the associated object tracks to form symbolic, high-level and compact memories describing past user interactions, and 4) exploit such memories to carry out downstream tasks. This will be implemented through the realization of a user-object interaction discovery module, a visual object tracking module, and a memory forming and downstream task exploitation module, as illustrated by the sketch below.

We observe that the proposed research goes towards the realization of a key enabling technology for understanding user-object interactions in a long-term fashion, which can be beneficial in the health domain, where wearable systems can be used for cognitive assessment and training.
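To make the streaming design and the notion of symbolic, compact memories more concrete, the following minimal Python sketch illustrates how the proposed modules could be composed: each frame is processed exactly once and then discarded, and only a compact per-instance record (identity, sparse track samples, relations with other objects) is retained for downstream queries. All names and interfaces here (ObjectMemory, discover, track, query_last_location, the relation vocabulary) are hypothetical illustrations under stated assumptions, not committed design choices of the proposal.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectMemory:
    # Hypothetical symbolic record for one discovered object instance.
    instance_id: int
    category: str                 # e.g., "knife"
    first_seen: float             # timestamp (seconds) of discovery
    last_seen: float
    track: list = field(default_factory=list)      # sparse (timestamp, bbox) samples
    relations: list = field(default_factory=list)  # (timestamp, relation, other_id)

def process_stream(frames, discover, track, memories):
    """Streaming loop: each frame is seen once; no buffer of past video is kept."""
    for timestamp, frame in frames:
        # 1) discovery module: propose object instances and relations (assumed API)
        detections, relations = discover(frame)
        # 2) tracking module: associate detections with existing instance identities
        for instance_id, category, bbox in track(frame, detections):
            mem = memories.setdefault(
                instance_id, ObjectMemory(instance_id, category, timestamp, timestamp)
            )
            mem.last_seen = timestamp
            mem.track.append((timestamp, bbox))  # could be subsampled to stay compact
        # 3) memory forming module: record symbolic relations between known instances,
        #    e.g., (3, "taken_from", 7) for "object 3 was taken from object 7"
        for subject_id, relation, object_id in relations:
            if subject_id in memories:
                memories[subject_id].relations.append((timestamp, relation, object_id))

def query_last_location(memories, instance_id):
    """4) downstream exploitation, e.g., 'where did I see the scale I need now?'"""
    timestamp, bbox = memories[instance_id].track[-1]
    return timestamp, bbox
```

In practice the discovery and tracking modules would be learned components, and the memory would be subsampled or summarized to remain compact over hours of video; the sketch only fixes the data flow among the four capabilities listed above.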