Improving Social Awareness Through DANTE: A Deep Affinity Network for Clustering Conversational Interactants
Mason Swofford, John Peruzzi, Nathan Tsoi, Sydney Thompson, Roberto Martín-Martín, Silvio Savarese, Marynel Vázquez
CSCW 2020
Automatic detection of conversational groups enables a rich set of intelligent, social computer interfaces. For example, group detection has traditionally enabled surveillance systems, socially-aware mobile systems, interactive displays, and exhibits. In the context of robotics, group detection is also essential for situated spoken language interaction, non-verbal robot behavior generation, and socially-aware robot navigation in human environments. However, detecting conversations in dynamic human environments is an intricate problem, requiring the perception of subtle aspects of social interactions.
In this work, we study the problem of visually recognizing situated group conversations by analyzing proxemics – people’s use of physical space. In particular, we study automatic recognition of the spatial patterns of human behavior that naturally emerge during group conversations, known as F-Formations.
Most prior work on visual F-Formation detection has focused on explicitly modeling properties of conversational group spatial arrangements. For instance, people tend to keep a social distance from one another during conversations and orient their bodies towards the center of their group. However, these approaches typically do not account for the malleability inherent in human spatial behavior. For example, people naturally adapt to crowded environments and modify their spatial formations by standing closer together if need be. Robustness to these complex scenarios is essential for reasoning about group conversations through spatial analysis in real applications.
In this work, we explore using the powerful approximation capabilities of Deep Learning to identify conversations and their members. To do this, we leverage a classical graph clustering algorithm (Dominant Sets) and view the F-Formation detection problem as finding sets of related nodes in an interaction graph. The nodes of the graph correspond to individuals in a scene with associated spatial features obtained through image processing. Each graph edge connects two nearby people and has an associated affinity (weight) that encodes the likelihood that they are conversing. Under this framing, the key challenge for F-Formation detection is to compute appropriate affinities for identifying groups. While prior work used simple heuristics to compute edge weights, we propose to learn a function that predicts these weights.
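To make the clustering step concrete, the following is a minimal sketch of Dominant Sets extraction via replicator dynamics on a symmetric affinity matrix. It is an illustration under stated assumptions, not the implementation used in the paper: the function names, the support threshold `support_eps`, and the simplified stopping rule `min_cohesion` are all hypothetical choices made for this example.

```python
def replicator_dynamics(A, iters=1000, tol=1e-8):
    """Iterate x_i <- x_i * (A x)_i / (x^T A x), starting from the uniform vector."""
    n = len(A)
    x = [1.0 / n] * n
    for _ in range(iters):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        cohesion = sum(x[i] * Ax[i] for i in range(n))
        if cohesion <= 0.0:  # no positive affinity left, nothing to update
            break
        new_x = [x[i] * Ax[i] / cohesion for i in range(n)]
        delta = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if delta < tol:
            break
    return x


def _cohesion(A, x):
    """Average internal affinity x^T A x of the weighted node set x."""
    n = len(A)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


def dominant_sets(A, support_eps=1e-4, min_cohesion=1e-6):
    """Peel off one dominant set at a time; leftover nodes become singletons.

    support_eps and min_cohesion are assumed values for this sketch only.
    """
    remaining = list(range(len(A)))
    groups = []
    while remaining:
        sub = [[A[i][j] for j in remaining] for i in remaining]
        x = replicator_dynamics(sub)
        if _cohesion(sub, x) <= min_cohesion:
            # Simplified stopping rule: no cohesive group is left.
            groups.extend([i] for i in remaining)
            break
        members = {remaining[k] for k, w in enumerate(x) if w > support_eps}
        groups.append(sorted(members))
        remaining = [i for i in remaining if i not in members]
    return groups


# Toy scene: people 0-2 converse, people 3-4 converse, weak cross affinities.
toy = [
    [0.00, 0.90, 0.90, 0.05, 0.05],
    [0.90, 0.00, 0.90, 0.05, 0.05],
    [0.90, 0.90, 0.00, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.00, 0.80],
    [0.05, 0.05, 0.05, 0.80, 0.00],
]
groups = dominant_sets(toy)
```

The quality of the recovered groups depends entirely on the affinities in the matrix, which is why the affinity function is the part worth learning.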
Our method receives as input spatial features (e.g., position x and orientation θ) for the social agents in a scene (a). This information is used to create an interaction graph (b) and to compute pair-wise affinities with DANTE (c). The affinities are assembled into an affinity matrix (d) to cluster nodes (e).
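Steps (a) through (d) can be sketched as follows. This is only an illustration: `distance_affinity` is a hypothetical stand-in heuristic, not DANTE, and the choice to symmetrize the matrix by averaging the two directed predictions is an assumption of this sketch rather than something stated in the text. A learned affinity model would be passed in through the same `affinity_fn` argument.

```python
import math


def distance_affinity(person_i, person_j, context, sigma=1.0):
    """Placeholder heuristic (NOT DANTE): Gaussian of the distance between two people.

    Each person is an (x, y, theta) tuple. Orientation and context are ignored
    here; a learned model would use both.
    """
    dx = person_i[0] - person_j[0]
    dy = person_i[1] - person_j[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))


def build_affinity_matrix(people, affinity_fn):
    """Assemble an n x n symmetric affinity matrix with a zero diagonal."""
    n = len(people)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Context: everyone in the scene other than the pair being scored.
            context = [p for k, p in enumerate(people) if k not in (i, j)]
            a_ij = affinity_fn(people[i], people[j], context)
            a_ji = affinity_fn(people[j], people[i], context)
            # Assumption of this sketch: average the two directions.
            A[i][j] = A[j][i] = 0.5 * (a_ij + a_ji)
    return A


# Two people facing each other one unit apart, plus a distant bystander.
people = [(0.0, 0.0, 0.0), (1.0, 0.0, math.pi), (5.0, 5.0, 0.0)]
A = build_affinity_matrix(people, distance_affinity)
```

The resulting matrix is what a graph clustering routine would consume in step (e).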
Our proposed novel affinity prediction function is termed DANTE, which stands for Deep Affinity NeTwork for clustEring conversational interactants. DANTE predicts the pairwise affinities between two people, while taking into account the social context of the other interactants. DANTE consists of 3 parts:
We conduct systematic evaluations of our proposed group detection approach on established benchmarks. We use 3 traditional conversational group detection datasets:
We additionally test how well our method generalizes by evaluating on a general group detection dataset:
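We evaluate our models using the T=1 F1 metric. Under the T=1 criterion, a predicted group counts as correctly identified only if all of its members are correctly identified, and the F1 score combines the precision and recall of the groups detected this way. See Section 4.2 of the paper for more details.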
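As a rough illustration of the metric, the sketch below computes a T=1 F1 score for a single frame by treating a predicted group as a true positive only when its member set exactly matches a ground-truth group. The function name is hypothetical, and details such as how singleton groups are handled or how scores are aggregated across frames are not covered here; Section 4.2 of the paper remains the reference for the exact protocol.

```python
def group_f1_exact(predicted, ground_truth):
    """F1 over groups for one frame, with exact-match (T=1) group correctness."""
    pred = {frozenset(g) for g in predicted}
    true = {frozenset(g) for g in ground_truth}
    tp = len(pred & true)  # groups whose membership matches exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# One of two predicted groups matches one of three ground-truth groups:
# precision = 1/2, recall = 1/3, so F1 = 0.4.
score = group_f1_exact([[0, 1, 2], [3, 4]], [[0, 1], [2], [3, 4]])
```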
In our experiments, we outperform prior state-of-the-art conversational group detection algorithms across all standard benchmark datasets and in our generalization experiment. We also experimented with an ablated version of DANTE, termed DANTE-NoContext, that did not have the Context Transform. It performed worse on the higher-quality datasets, Cocktail Party and SALSA, but better on the lower-quality dataset, Coffee Break. We attribute the latter result to DANTE-NoContext's simplicity and lower susceptibility to overfitting. However, the Context Transform clearly helps our algorithm's performance on the higher-quality datasets by allowing DANTE to aggregate global information.
Our group detection approach can be used to increase the social awareness of interactive systems. To demonstrate this in practice, we built an interactive system using the Robot Operating System (ROS). In our demonstration application, a table-top robot is used to identify F-Formations based on users’ spatial behavior relative to each other and its own spatial configuration in our lab environment. The main components of our interactive system are a robot arm with a screen face, and two RGB-D cameras. The robot and all sensors are connected to a nearby desktop computer, which processes data in real-time and controls the robot.