Conceptual diagram of the proposed Message Passing Framework (MPF): The proposed MPF stabilises the noisy prediction outputs of heterogeneous state-of-the-art vision modules and stores the core perceptual cues (hand grasping state, object information and human body pose) in a long-term memory that can be used for human-robot interaction.
In Human Robot Interaction (HRI) scenarios, robot systems benefit from understanding the user's state, actions and their effects on the environment, which enables better interactions. While specialised vision algorithms exist for different perceptual channels, such as objects, scenes, human pose and human actions, it is worth considering how interaction between them can improve each other's outputs. In computer vision, the individual prediction modules for these perceptual channels frequently produce noisy or unstable outputs, owing to the limited datasets used for training and the compartmentalisation of the perceptual channels. To stabilise vision prediction results in HRI, this paper presents a novel message passing framework in which the memory of individual modules is used to correct the others' outputs. The proposed framework applies common-sense rules of physics (such as the law of gravity) to reduce noise, and introduces a pipeline through which the modules effectively improve each other's outputs. The framework is used to analyse primitive human activities, such as grasping an object, in video captured from the perspective of a robot. Experimental results show that the proposed framework significantly reduces the output noise of the individual modules compared with running them independently. This pipeline can be used to measure human reactions when interacting with a robot in various HRI scenarios.
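To make the memory-based stabilisation idea concrete, the following minimal sketch (with hypothetical names, not the paper's code) majority-votes the labels predicted for the same tracked bounding box across frames, so a single-frame misclassification does not flip the stored label:

```python
# Hypothetical sketch of memory-based label stabilisation: per-box
# prediction history is kept and the most frequent label wins. Names
# and data layout are illustrative assumptions, not the authors' API.
from collections import Counter, defaultdict

class LabelMemory:
    def __init__(self):
        self.history = defaultdict(Counter)  # box_id -> label counts

    def update(self, box_id, predicted_label):
        """Record one frame's (possibly noisy) label for a tracked box."""
        self.history[box_id][predicted_label] += 1

    def stable_label(self, box_id):
        """Return the most frequent label seen so far for this box."""
        counts = self.history[box_id]
        return counts.most_common(1)[0][0] if counts else None

memory = LabelMemory()
for label in ["cup", "bowl", "cup", "cup"]:  # noisy per-frame predictions
    memory.update(box_id=0, predicted_label=label)
print(memory.stable_label(0))  # -> "cup"
```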
Overview of the proposed Message Passing Framework (MPF): The framework adaptively constructs the appropriate input type, either a frame or a video. The object detection outputs are then refined by the object memory filtering block, and the object predictions are stored in long-term memory for later updates. Once the refined object information is in memory, the common-sense physics reasoning block further removes the object detection module's noisy outputs at run time. For this, the MPF mainly uses the hand bounding box and hand state given by the hand-object state estimator; if the state estimator fails, the wrist joint position is used as alternative hand information. The body pose estimation results are also stabilised by cross-checking them against the output of video instance (body) segmentation. In this way, multiple heterogeneous modules exchange messages through a shared output memory to improve overall recognition performance.
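The control flow below is a hypothetical skeleton of the loop described above; the module calls are stand-in stubs rather than the authors' modules, and only the flow through the shared output memory (including the wrist-joint fallback when the hand-object state estimator fails) follows the overview:

```python
# Stand-in stubs simulating the heterogeneous vision modules.
def detect_objects(frame):       # e.g. a Mask R-CNN-style detector
    return [{"label": "cup", "box": (10, 20, 40, 60)}]

def estimate_hand_state(frame):  # hand-object state estimator
    return None                  # simulate a failure on this frame

def estimate_body_pose(frame):   # body pose estimator
    return {"wrist": (15, 25)}

class SharedMemory:
    """Long-term output memory shared between the modules."""
    def __init__(self):
        self.objects = []

    def update_objects(self, detections):
        self.objects = detections  # the real MPF merges with stored boxes

    def apply_physics_rules(self, hand_position):
        pass  # the real MPF prunes unsupported boxes (see figure below)

def process_frame(frame, memory):
    detections = detect_objects(frame)
    hand = estimate_hand_state(frame)
    pose = estimate_body_pose(frame)

    # Refine detections against the long-term object memory.
    memory.update_objects(detections)

    # Fallback: if the hand-object state estimator fails, use the
    # wrist joint position from the pose estimator instead.
    hand_position = hand or pose["wrist"]

    # Common-sense physics reasoning prunes noisy stored objects.
    memory.apply_physics_rules(hand_position)
    return memory.objects, hand_position, pose

memory = SharedMemory()
print(process_frame(frame=None, memory=memory))
```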
Youngkyoon Jang, Yiannis Demiris
Message Passing Framework for Vision Prediction Stability in Human Robot Interaction
IEEE International Conference on Robotics and Automation (ICRA), Philadelphia (PA), USA, May 23-27, 2022.
Download: [pdf], [BibTeX], [Demo], [Publicising Page]
Code: [Github] (coming soon)
Example of vision module stabilisation using the Message Passing Framework (MPF): (a) The original output of the object detection module (Mask R-CNN) shows incorrect predictions (e.g. the name and boundary of an object) and missing objects. (b) Visualisation of the object bounding boxes stored in memory -- the MPF uses temporal contextual information to display the representative object label among the multiple predictions made in the same bounding box. (c) The law of gravity -- the MPF removes stored object bounding boxes that are not located on top of furniture objects. (d) The hand-object coupled movement rule -- the MPF keeps an object bounding box when the object is not on furniture but a person is holding it. (e) Spatial redundancy check and elimination -- multiple bounding boxes for the same object are examined and only those closest to the current object detection results are kept. Even when the object detection results are noisy or missing at run time, the MPF's memory is reliably preserved in HRI scenarios.
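As a rough illustration of panels (c) and (d), the sketch below keeps a stored object box only when it rests on a furniture box or overlaps the hand box (i.e. the person is holding it). The geometry helpers and the exact "resting on" test are assumptions made for illustration, not the paper's implementation:

```python
# Hedged sketch of the gravity rule (c) and hand-object coupling rule (d).
# Boxes are (x1, y1, x2, y2) in image coordinates (y grows downwards).

def overlaps(a, b):
    """Axis-aligned overlap test between two boxes."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def rests_on(obj, furniture, tol=5):
    """Object bottom edge near the furniture top edge, horizontally aligned."""
    horizontally_aligned = obj[0] < furniture[2] and furniture[0] < obj[2]
    return horizontally_aligned and abs(obj[3] - furniture[1]) <= tol

def filter_memory(stored_objects, furniture_boxes, hand_box):
    kept = []
    for obj in stored_objects:
        supported = any(rests_on(obj["box"], f) for f in furniture_boxes)
        held = hand_box is not None and overlaps(obj["box"], hand_box)
        if supported or held:  # law of gravity + hand-object coupling
            kept.append(obj)
    return kept

table = (0, 100, 200, 160)
objs = [{"label": "cup", "box": (50, 60, 80, 100)},     # resting on the table
        {"label": "ghost", "box": (300, 10, 330, 40)}]  # floating: removed
print(filter_memory(objs, [table], hand_box=None))  # keeps only the cup
```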
This research was supported by the UKRI Node on Trust (EP/V026682/1) and a Royal Academy of Engineering Chair in Emerging Technologies.
Author URL: [Youngkyoon Jang]*, [Yiannis Demiris]
Affiliation URL: [Personal Robotics Lab] (Imperial College London)
PRL's software: [PRL SW]
UKRI TAS Node on Trust: [webpage] [twitter]
* This work was undertaken while Youngkyoon Jang was a research associate affiliated with Imperial College London.