Jiafei Duan1* , Samson Yu2* , Nicholas Tan3 , Li Yi4 , Cheston Tan2
Institute for Infocomm Research, A*STAR1 , Centre for Frontier AI Research, A*STAR2 , Department of Computer Science, NUS3 , Institute for Interdisciplinary Information Science, Tsinghua University4
Humans with an average level of social cognition can infer the beliefs of others based solely on nonverbal communication signals (e.g., gaze, gesture, pose, and context) exhibited during social interaction. This social cognitive ability to predict human beliefs and intentions is more important than ever for ensuring the safety of human-robot collaboration. This paper uses the combined knowledge of Theory of Mind (ToM) and Object-Context Relations to investigate methods for enhancing collaboration between humans and autonomous systems in environments where verbal communication is prohibited. To this end, we propose a challenging new multimodal 3D video dataset for assessing the ability of Artificial Intelligence (AI) systems to predict human belief states in an object-context scenario. The proposed dataset provides precise ground-truth labels of human belief states together with multimodal sensory inputs that replicate the nonverbal communication cues captured by human perception. We further evaluate the dataset with existing deep learning models and provide new insights into how the various input modalities and knowledge of the contextual objects affect the performance of the baseline models.
(A) Real-world examples of collaborative tasks that require inferring each other's beliefs via nonverbal communication. (B) An example of the instructions provided during data collection.
Examples of the BOSS dataset: (A) Frames from the third-person view of the interactions between the subjects, with their individual isolated inner voices. (B) A visualization of their annotated belief dynamics across the given context scenario. (C) The egocentric perception of the subjects. (D) Multimodal inputs obtained from sensors and post-processing: pose estimation, depth, object detection, gaze tracking, and hand gesture tracking.
BOSS Dataset
BOSS - A Benchmark for Human Belief Prediction in Object-context Scenarios - is a large-scale machine Theory of Mind dataset consisting of 900 recorded videos spanning 347,490 frames, captured from both the egocentric and third-person perspectives of our human test subjects (20 participants). The BOSS dataset was also collected with various multimodal sensory inputs (e.g., gaze, gesture, pose, and object detection).
An example overview of the left versus right subject's belief states for the contextual object class of lemon, across the 20 participant pairs.
Overview of the dataset statistics. (I) The frequency of all potential object-context matches. (II) The mean delayed belief update across all context object classes. (III) The distribution of sequence lengths across the context object classes.
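As a rough illustration of how a multimodal, per-frame-labelled dataset of this kind might be consumed, the sketch below shows a hypothetical PyTorch loader. The directory layout, file names (frame_*.jpg, modalities.pt, beliefs.json), and label format are illustrative assumptions and do not describe the released structure of BOSS.

```python
# Hypothetical loader sketch for a BOSS-style sample; the directory layout,
# file names, and label format are assumptions, not the released structure.
import json
from pathlib import Path

import torch
from torch.utils.data import Dataset
from torchvision.io import read_image


class BossStyleDataset(Dataset):
    """Returns (frames, multimodal features, per-frame belief labels) for one video."""

    def __init__(self, root: str):
        # One sub-directory per recorded interaction (assumed layout).
        self.videos = sorted(p for p in Path(root).iterdir() if p.is_dir())

    def __len__(self):
        return len(self.videos)

    def __getitem__(self, idx):
        video_dir = self.videos[idx]
        # Third-person RGB frames, assumed to be stored as frame_0000.jpg, frame_0001.jpg, ...
        frames = torch.stack(
            [read_image(str(p)).float() / 255.0 for p in sorted(video_dir.glob("frame_*.jpg"))]
        )
        # Pre-extracted modalities (pose, gaze, object detections), assumed as one tensor per frame.
        modalities = torch.load(video_dir / "modalities.pt")           # shape: (T, feature_dim)
        # Per-frame belief labels for the left and right subject, assumed as a JSON list.
        labels = json.loads((video_dir / "beliefs.json").read_text())  # [[left_t, right_t], ...]
        return frames, modalities, torch.tensor(labels, dtype=torch.long)
```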
Baseline Evaluation
To evaluate on our BOSS dataset, a model must predict the correct belief states of the two subjects in every frame. We carry out an ablative study on various baseline models (Random, CNN, CNN+Conv1D, CNN+GRU, CNN+LSTM) and report the belief-state prediction accuracy of each.
Belief-state prediction accuracy across the various deep learning baselines for different data categories.
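The sketch below illustrates one way a CNN+LSTM baseline of the kind evaluated above could be structured in PyTorch: a per-frame CNN backbone feeds an LSTM, and separate heads predict a belief state for each subject in every frame. The backbone choice, hidden size, and number of belief classes are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal CNN+LSTM baseline sketch; hyperparameters and head design are assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class CnnLstmBeliefPredictor(nn.Module):
    def __init__(self, num_belief_states: int, hidden_dim: int = 256):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()            # per-frame 512-d visual features
        self.backbone = backbone
        self.lstm = nn.LSTM(512, hidden_dim, batch_first=True)
        # One classification head per subject (left / right), predicting a belief state per frame.
        self.left_head = nn.Linear(hidden_dim, num_belief_states)
        self.right_head = nn.Linear(hidden_dim, num_belief_states)

    def forward(self, frames: torch.Tensor):
        # frames: (batch, time, channels, height, width)
        b, t, c, h, w = frames.shape
        feats = self.backbone(frames.view(b * t, c, h, w)).view(b, t, -1)
        temporal, _ = self.lstm(feats)          # temporally contextualised per-frame features
        return self.left_head(temporal), self.right_head(temporal)


# Usage example: 2 clips of 16 frames each, with a hypothetical 27 belief classes.
model = CnnLstmBeliefPredictor(num_belief_states=27)
left_logits, right_logits = model(torch.randn(2, 16, 3, 224, 224))
print(left_logits.shape)  # torch.Size([2, 16, 27])
```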