Dataset Info

Multimodal Social HRI Ruptures DB

We will use a dataset collected in our previous study [1, 2], in which we deployed a robotic positive psychology (PP) coach at a workplace over four weeks. During that period, around 920 hours of audio-video recordings of 23 employees were collected in dyadic interactions between a human and the mental well-being robotic coach during PP practice. All 23 employees signed an informed consent form. In accordance with the conditions of that consent, and to further respect the participants' anonymity, their data will not be released; only processed feature statistics and models will be shared upon signing this agreement. The shared feature statistics were obtained by processing the raw audio-visual recordings with offline state-of-the-art tools as follows:

1) Facial Features: We used the offline OpenFace 2.2.0 toolkit to extract the presence and intensity of 17 facial action units (AUs), yielding a total of 34 facial features per frame. These features are purely statistical representations capturing only the presence (binary, 0 or 1) or intensity (an integer ranging from 1 to 5) of facial gestures (e.g., AU1 corresponds to the inner brows moving up, AU2 to the outer brows moving up). A sketch of selecting these features from OpenFace output is given after this list.

2) Audio Features: We used the offline openSMILE toolbox to derive statistical and spectral speech features, such as loudness and pitch computed over a time window. These features are statistical representations of audio characteristics and do not include any ‘content’ or ‘speech’ data; see the second sketch after this list.

3) Body Features: Using the offline OpenPose toolbox, we extracted 25 2D body keypoints per frame to estimate the movement patterns of various body parts. Specifically, we compute the distance and velocity between body keypoints; these features, too, are purely statistical representations (see the third sketch after this list).

4) Labels: We annotated the videos using the ELAN video annotation tool, marking instances of user awkwardness and robot mistakes with binary labels (i.e., 1: present, 0: absent) and recording the times at which each display of user awkwardness or robot mistake starts and ends. These labels are defined in [1]; the final sketch after this list illustrates how such interval annotations map to per-frame labels.
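
The following is a minimal sketch of how per-frame AU features of this kind can be selected from an OpenFace 2.2.0 CSV export; the file name is a placeholder, and the column filtering relies on OpenFace's AUxx_c (presence) / AUxx_r (intensity) naming convention.

    # Sketch: select per-frame AU presence/intensity columns from an
    # OpenFace 2.2.0 CSV export ("openface_output.csv" is a placeholder).
    import pandas as pd

    df = pd.read_csv("openface_output.csv")
    df.columns = df.columns.str.strip()  # some OpenFace versions pad column names

    presence_cols = [c for c in df.columns if c.startswith("AU") and c.endswith("_c")]
    intensity_cols = [c for c in df.columns if c.startswith("AU") and c.endswith("_r")]

    au_features = df[presence_cols + intensity_cols]  # 34 AU features per frame
    print(au_features.shape)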
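
For the audio features, a minimal sketch using the opensmile Python package is shown below; the eGeMAPSv02 feature set, the Functionals level (statistics computed over a window), and the file name are assumptions, and the exact configuration used for the dataset may differ.

    # Sketch: extract statistical audio functionals (e.g., loudness, pitch)
    # over a window with openSMILE; feature set and file name are assumptions.
    import opensmile

    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.eGeMAPSv02,
        feature_level=opensmile.FeatureLevel.Functionals,
    )
    features = smile.process_file("interaction_audio.wav")
    print(features.filter(like="loudness").columns.tolist())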
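
The distance and velocity statistics for the body features can be computed roughly as follows; the sketch assumes OpenPose BODY_25 detections stacked into an array of shape (T, 25, 2), with random values standing in for real keypoints.

    # Sketch: pairwise distances and frame-to-frame velocities from
    # 25 2D body keypoints per frame (random placeholder data).
    import numpy as np

    keypoints = np.random.rand(300, 25, 2)  # (frames, keypoints, x/y)

    # Pairwise distances between keypoints within each frame: (T, 25, 25)
    diffs = keypoints[:, :, None, :] - keypoints[:, None, :, :]
    distances = np.linalg.norm(diffs, axis=-1)

    # Per-keypoint velocity as frame-to-frame displacement: (T-1, 25)
    velocities = np.linalg.norm(np.diff(keypoints, axis=0), axis=-1)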
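
Finally, a sketch of how interval annotations of this kind (e.g., exported from ELAN) can be rasterized into per-frame binary labels; the intervals, frame rate, and video length below are hypothetical.

    # Sketch: convert (start, end) annotation intervals in seconds into
    # per-frame binary labels (1: present, 0: absent); values are hypothetical.
    import numpy as np

    fps = 30.0
    n_frames = 9000
    robot_mistakes = [(12.4, 15.0), (120.7, 124.2)]  # hypothetical intervals (s)

    labels = np.zeros(n_frames, dtype=int)
    for start, end in robot_mistakes:
        labels[int(start * fps):int(end * fps) + 1] = 1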

Only feature statistics and baseline models will be released. Audio-video recordings will not be provided due to anonymity and ethical requirements.