The dataset consists of 15 hours of audio and video data collected from a dyadic human-social robot conversation study [1] and a dyadic human-voice assistant conversation study [2]. The human-voice assistant conversation data comprises 11 hours and 17 minutes of audio and video recordings of 22 unique participants interacting with the voice assistant to complete one to three tasks: medical self-diagnosis (3 hours 25 minutes of data), planning a day in Edinburgh (4 hours 39 minutes), and discussing whether universities have their own police forces (3 hours 12 minutes). The human-social robot conversation data comprises 3 hours and 43 minutes of audio and video recordings of 23 unique participants interacting with a social robot to complete one or two tasks: picking items for a desert survival simulation (1 hour 47 minutes) and discussing whether the federal government should ban capital punishment (1 hour 55 minutes).
We collected video (camera facing the user’s face) and audio (user speech and robot speech) recordings of the interactions. To preserve user privacy, we used off-the-shelf state-of-the-art methods to extract multimodal behavioral features from the audio and video recordings:
Facial Features: We used the OpenFace 2.2.0 toolkit to extract the presence and intensity of 17 facial action units (AUs), for a total of 34 facial features per frame.
Head Pose Features: We used the OpenFace 2.2.0 toolkit to estimate the user's head pose (location and rotation).
Speech Features: We used the openSMILE toolbox to extract interpretable speech features, namely loudness and pitch, computed per analysis window (e.g., for MFCCs the toolbox creates frames of 25 ms length every 10 ms). We will also provide CLIP [3] embeddings of the transcripts. We used Google's speaker diarization to detect user and robot speaker turns in the audio recordings. Illustrative extraction sketches follow this list.
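To make the released feature streams concrete, the following minimal sketch shows one way to load OpenFace 2.2.0 output and to extract loudness and pitch tracks with the openSMILE Python package. The file paths are hypothetical and the dataset's released files may use a different layout; the column names follow the two toolkits' standard naming conventions.

```python
# Minimal sketch, assuming OpenFace's FeatureExtraction tool was run separately
# and produced its standard CSV output; file paths are hypothetical.
import pandas as pd
import opensmile

# --- Facial action units and head pose (OpenFace 2.2.0 CSV) ---
openface_df = pd.read_csv("participant01_openface.csv")       # hypothetical path
openface_df.columns = openface_df.columns.str.strip()         # OpenFace pads column names with spaces
au_cols = [c for c in openface_df.columns if c.startswith("AU")]       # AU*_r (intensity), AU*_c (presence)
pose_cols = [c for c in openface_df.columns if c.startswith("pose_")]  # pose_Tx..Tz (location), pose_Rx..Rz (rotation)
facial_and_pose = openface_df[["timestamp"] + au_cols + pose_cols]

# --- Interpretable speech features (openSMILE eGeMAPS low-level descriptors) ---
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)
lld = smile.process_file("participant01_user_audio.wav")      # hypothetical path
loudness_and_pitch = lld[["Loudness_sma3", "F0semitoneFrom27.5Hz_sma3nz"]]
```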
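For the transcript embeddings, the sketch below uses the Hugging Face implementation of CLIP's text encoder; the specific CLIP variant ("openai/clip-vit-base-patch32") and the example turns are assumptions for illustration, not necessarily the configuration used for the released embeddings.

```python
# Minimal sketch of embedding transcript turns with CLIP's text encoder.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

turns = [
    "I think I might have a sinus infection.",       # hypothetical user turn
    "Could you tell me more about your symptoms?",   # hypothetical agent turn
]
inputs = tokenizer(turns, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    text_embeddings = model.get_text_features(**inputs)  # shape: (num_turns, 512)
```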
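Speaker turns can be recovered with Google Cloud Speech-to-Text's speaker diarization, roughly as sketched below; the storage URI, audio encoding, and sample rate are placeholders and should be adapted to the actual recordings.

```python
# Minimal sketch of Google Cloud Speech-to-Text speaker diarization for a
# two-speaker (user + agent) recording; the GCS URI and audio settings are assumptions.
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,                     # assumed sample rate
    language_code="en-US",
    diarization_config=speech.SpeakerDiarizationConfig(
        enable_speaker_diarization=True,
        min_speaker_count=2,
        max_speaker_count=2,
    ),
)
audio = speech.RecognitionAudio(uri="gs://my-bucket/session01_audio.wav")  # hypothetical URI

response = client.long_running_recognize(config=config, audio=audio).result(timeout=600)

# The final result carries word-level speaker tags for the whole recording.
words = response.results[-1].alternatives[0].words
word_speaker_pairs = [(w.word, w.speaker_tag) for w in words]
```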
We annotated the videos using the Datavyu video annotation tool. We marked instances of robot mistakes and of user intention to correct for a mismatch between robot behavior and their expectation with binary labels (i.e., 1: present, 0: absent), recording when each robot mistake starts and ends and when the user displays an intention to disruptively interrupt. These labels are defined as follows:
Robot Mistake (0-absent, 1-present): The robot makes a mistake such as interrupting or not responding to the user, or responding with an error message or an utterance that is not appropriate for what the user has just said.
User intention to correct for a mismatch between robot behavior and their expectation (0-absent, 1-present): The user displays verbal or non-verbal behavior that signals an intention to correct for a mismatch between the robot's behavior and their expectation, such as a user-initiated disruptive interruption. A disruptive interruption occurs when the listener challenges the speaker's control and disrupts the conversational flow to express an opposing opinion, take the floor, change the subject, or summarize the speaker's point to end the turn and avoid unwanted information. Such behavior suggests a mismatch between the user's expectations and the robot's actual behavior. A sketch of aligning these interval annotations to frame-level labels follows below.
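As an illustration of how these interval annotations can be consumed, the sketch below converts (onset, offset) spans into frame-aligned binary label vectors. The annotation column names, frame rate, and recording length are assumptions for the example, not the released format.

```python
# Minimal sketch: turn interval annotations (seconds) into frame-level binary labels.
import numpy as np
import pandas as pd

FPS = 30.0  # assumed video frame rate


def intervals_to_frame_labels(intervals, num_frames, fps=FPS):
    """intervals: iterable of (start_sec, end_sec) spans during which the label is 1 (present)."""
    labels = np.zeros(num_frames, dtype=np.int8)          # 0 = absent
    for start_sec, end_sec in intervals:
        start_f = int(np.floor(start_sec * fps))
        end_f = min(int(np.ceil(end_sec * fps)), num_frames)
        labels[start_f:end_f] = 1                         # 1 = present
    return labels


# Hypothetical annotation export with columns: label, onset_sec, offset_sec.
ann = pd.DataFrame({
    "label": ["robot_mistake", "user_correction_intent"],
    "onset_sec": [12.4, 13.1],
    "offset_sec": [15.0, 14.2],
})
num_frames = int(120 * FPS)  # e.g., a 2-minute recording

robot_mistake = intervals_to_frame_labels(
    ann.loc[ann.label == "robot_mistake", ["onset_sec", "offset_sec"]].to_numpy(), num_frames)
correction_intent = intervals_to_frame_labels(
    ann.loc[ann.label == "user_correction_intent", ["onset_sec", "offset_sec"]].to_numpy(), num_frames)
```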
[1] Shiye Cao, Jiwon Moon, Amama Mahmood, Victor Nikhil Antony, Ziang Xiao, Anqi Liu, and Chien-Ming Huang. 2025. Interruption Handling for Conversational Robots. arXiv preprint arXiv:2501.01568 (2025).
[2] Amama Mahmood, Junxiang Wang, Bingsheng Yao, Dakuo Wang, and Chien-Ming Huang. 2025. User interaction patterns and breakdowns in conversing with LLM-powered voice assistants. International Journal of Human-Computer Studies 195 (2025), 103406.
[3] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. 8748–8763.