Sensing human social behaviors in the wild requires setups (cameras, microphones, wearable sensors, etc.) that capture human interactions in naturalistic settings where people interact freely with one another. Even though many researchers have built advanced motion capture studios, these conditions are challenging to replicate in real-life scenarios such as a conference or meeting. The assumptions and design choices made during data collection have a direct impact on downstream analysis and method development.
ConfLab is a multimodal, multisensor dataset of in-the-wild free-standing social conversations. It records a real-life professional networking event at the international conference ACM Multimedia 2019. Involving 48 conference attendees, the dataset captures a diverse mix of status, acquaintance, and networking motivations. Modalities include overhead video, low-frequency audio, Bluetooth proximity readings, and inertial motion data. Representative socially relevant tasks enabled by this dataset include human pose estimation, speaking status detection, and conversation group detection. Visit ConfLab for more information.
The REWIND dataset is the first in-the-wild mingling dataset with high-quality raw audio, video, and acceleration data, together with automatic pose annotations and automatic speaking status labels. It enables research tasks that investigate no-audio speech activity segmentation from body movements.
Multimodal data synchronization: Current approaches to synchronizing multimodal data mostly rely on wired setups that emphasize precision and reliability, leveraging direct connections to ensure accurate data capture. However, the reliance on wired systems limits the flexibility and mobility of participants, restricting research to tightly controlled conditions. Since we emphasize the importance of recording human behaviors in naturalistic settings, we have contributed software-hardware interface solutions for synchronizing multimodal, multi-participant data streams.
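As a rough illustration of the kind of post-hoc alignment such a synchronization setup enables (a minimal sketch, not the interface described in the ACM-MM paper below), the snippet resamples a wearable accelerometer stream onto a video frame clock, assuming both streams already carry timestamps on a shared time base; all names, rates, and signals are hypothetical.

```python
import numpy as np

def align_to_reference(ref_times, stream_times, stream_values):
    """Resample a sensor stream onto a reference clock via linear interpolation.

    ref_times: reference timestamps in seconds (e.g., video frame times).
    stream_times: the stream's own timestamps, on the same shared time base.
    stream_values: the stream's samples (e.g., one accelerometer axis).
    """
    return np.interp(ref_times, stream_times, stream_values)

# Hypothetical numbers: a 60 Hz video clock and a ~56 Hz wearable accelerometer
# that starts 0.8 s later and drifts slightly against the video clock.
video_t = np.arange(0, 10, 1 / 60)
accel_t = 0.8 + np.arange(0, 9.0, 1 / 56) * 1.0005
accel_x = np.sin(2 * np.pi * 1.5 * accel_t)   # fake x-axis acceleration

accel_x_on_video_clock = align_to_reference(video_t, accel_t, accel_x)
print(accel_x_on_video_clock.shape)           # one sample per video frame
```

In practice, the difficult part addressed by the hardware-software interface is obtaining that shared time base across cameras and wearables in the first place; the resampling above is only the final, straightforward step.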
Continuous data annotation: Continuous-time annotation, where annotators label data while viewing continuous media (such as video, audio, or time series), has traditionally been used for annotating continuous-valued variables like arousal and valence in Affective Computing. In contrast, machine perception tasks are typically annotated frame by frame. For action recognition, annotators pinpoint the start and end frames of the target action via a GUI. However, given the length of the videos commonly found in social interaction datasets, this approach can be a time-consuming, frustrating, and expensive process. In this contribution, we propose a new technique that lets annotators follow a keypoint in a video with the mouse cursor.
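To make the idea concrete, here is a minimal sketch (hypothetical, not Covfee's actual API) of how a continuous mouse-cursor trace recorded during video playback could be turned into one keypoint position per video frame:

```python
import numpy as np

def cursor_trace_to_keypoints(trace, fps, duration_s):
    """Convert a continuous cursor trace into one (x, y) keypoint per frame.

    trace: list of (t, x, y) tuples logged while the annotator follows the
           keypoint with the mouse during playback.
    fps: video frame rate; duration_s: video length in seconds.
    Returns an array of shape (num_frames, 2).
    """
    t, x, y = np.array(trace).T
    frame_times = np.arange(int(duration_s * fps)) / fps
    return np.stack([np.interp(frame_times, t, x),
                     np.interp(frame_times, t, y)], axis=1)

# Hypothetical trace: cursor positions logged at ~30 Hz during playback.
trace = [(i / 30, 100 + i, 200 + 0.5 * i) for i in range(90)]
keypoints = cursor_trace_to_keypoints(trace, fps=60, duration_s=3.0)
print(keypoints.shape)  # (180, 2): one keypoint per video frame
```

The annotator never scrubs back and forth between frames; the tool samples the cursor continuously and the per-frame labels are recovered by interpolation afterwards.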
C. Raman#, J. Quiros#, S. Tan#, et al., ConfLab: a data collection concept, dataset, and benchmark for machine analysis of free-standing social interactions in the wild, Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS) – Datasets and Benchmarks Track, 2022.
J. Quiros, C. Raman, S. Tan, E. Gedik, L. Cabrera-Quiros, and H. Hung, REWIND dataset: privacy-preserving speaking status segmentation from multimodal body movement signals in the wild, arXiv preprint arXiv:2403.01229, 2024.
C. Raman#, S. Tan#, H. Hung, Modular multimodal-multisensor data acquisition and synchronization of audio, video, and wearable device data, Proceedings of the 28th ACM International Conference on Multimedia (ACM-MM), 2020, 3586-3594.
J. Quiros, S. Tan, C. Raman, L. Cabrera-Quiros, and H. Hung, Covfee: an extensible web framework for continuous-time annotation of human behavior, Understanding Social Behavior in Dyadic and Small Group Interactions, PMLR, 2022, 265-293.
#: Equal contribution