Improving Social Awareness Through DANTE: A Deep Affinity Network for Clustering Conversational Interactants
Mason Swofford, John Peruzzi, Nathan Tsoi, Sydney Thompson, Roberto Martín-Martín, Silvio Savarese, Marynel Vázquez
Automatic detection of conversational groups enables a rich set of intelligent, social computer interfaces. For example, group detection has traditionally enabled surveillance systems, socially-aware mobile systems, interactive displays, and exhibits. In the context of robotics, group detection is also essential for situated spoken language interaction, non-verbal robot behavior generation, and socially-aware robot navigation in human environments. However, detecting conversations in dynamic human environments is an intricate problem, requiring the perception of subtle aspects of social interactions.
In this work, we study the problem of visually recognizing situated group conversations by analyzing proxemics – people’s use of physical space. In particular, we study automatic recognition of the spatial patterns of human behavior that naturally emerge during group conversations, known as F-Formations.
Most prior work on visual F-Formation detection has focused on explicitly modeling properties of conversational group spatial arrangements. For instance, such models encode that people tend to keep a social distance from one another during conversations and orient their bodies towards the center of their group. However, these approaches do not typically account for the malleability inherent in human spatial behavior. For example, people naturally adapt to crowded environments and modify their spatial formations by interacting closer if need be. Robustness to these complex scenarios is essential for reasoning about group conversations through spatial analysis in real applications.
Group Detection Approach
In this work, we explore using the powerful approximation capabilities of Deep Learning to identify conversations and their members. To do this, we leverage a classical graph clustering algorithm (Dominant Sets) and view the F-Formation detection problem as finding sets of related nodes in an interaction graph. The nodes of the graph correspond to individuals in a scene with associated spatial features obtained through image processing. The graph edges connect two nearby people and have an associated affinity (weight) that encodes the likelihood that they are conversing. Under this framing, the key challenge for F-Formation detection is to compute appropriate affinities for identifying groups. While prior work used simple heuristics to compute edge weights, we propose to learn a function that predicts these weights.
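To make the graph framing concrete, the following is a minimal sketch of how pairwise affinities could be assembled into a symmetric affinity matrix. The function names, the `(x, y, theta)` feature layout, and the Gaussian distance heuristic are assumptions for illustration only. The heuristic stands in for the kind of hand-crafted edge weight that a learned predictor would replace; it is not DANTE.

```python
import numpy as np

def distance_affinity(features, i, j, sigma=1.0):
    """Hypothetical heuristic: affinity decays with the distance between i and j."""
    d = np.linalg.norm(features[i, :2] - features[j, :2])
    return float(np.exp(-d ** 2 / (2.0 * sigma ** 2)))

def build_affinity_matrix(features, pair_affinity):
    """Fill a symmetric N x N matrix with one affinity per pair of people.

    `pair_affinity` receives the full feature array, so a learned predictor
    could also look at the other people in the scene (the social context).
    """
    n = len(features)
    A = np.zeros((n, n))  # the diagonal stays zero: no self-affinity
    for i in range(n):
        for j in range(i + 1, n):
            # Assumption: one score per unordered pair, mirrored across the diagonal.
            A[i, j] = A[j, i] = pair_affinity(features, i, j)
    return A

# Three people as rows of (x, y, theta): two stand close, one is far away.
people = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 3.14],
                   [5.0, 5.0, 1.57]])
A = build_affinity_matrix(people, distance_affinity)
```

Any function with the same signature can be passed as `pair_affinity`, which is what makes it possible to swap a heuristic for a learned model without changing the clustering step.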
Our method receives as input spatial features (e.g., position x and orientation θ) for the social agents in a scene (a). This information is used to create an interaction graph (b) and to compute pair-wise affinities with DANTE (c). The affinities are assembled into an affinity matrix (d) to cluster nodes (e).
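The clustering step (e) can be sketched as follows. This is an illustrative implementation of Dominant Sets extraction using replicator dynamics, peeling off one group at a time. The parameters `tol`, `max_iter`, `cutoff`, and `min_cohesion`, as well as the stopping rule based on group cohesiveness, are assumptions chosen for this example and may differ from the procedure used in the paper.

```python
import numpy as np

def dominant_sets(affinity, tol=1e-6, max_iter=1000, cutoff=1e-4, min_cohesion=0.1):
    """Partition nodes into groups by repeatedly extracting a dominant set."""
    A = np.array(affinity, dtype=float)
    np.fill_diagonal(A, 0.0)
    remaining = list(range(len(A)))
    groups = []
    while remaining:
        sub = A[np.ix_(remaining, remaining)]
        if len(remaining) == 1 or sub.sum() == 0:
            groups.extend([i] for i in remaining)  # nothing left to group
            break
        # Start from the barycenter of the simplex (uniform weights).
        x = np.full(len(remaining), 1.0 / len(remaining))
        for _ in range(max_iter):
            Ax = sub @ x
            denom = x @ Ax
            if denom <= 0:
                break
            x_new = x * Ax / denom  # discrete-time replicator update
            converged = np.abs(x_new - x).sum() < tol
            x = x_new
            if converged:
                break
        cohesion = float(x @ sub @ x)
        members = [remaining[k] for k in np.flatnonzero(x > cutoff)]
        if cohesion < min_cohesion or not members:
            groups.extend([i] for i in remaining)  # weak leftovers become singletons
            break
        groups.append(members)
        remaining = [i for i in remaining if i not in members]
    return groups

# Toy scene: a tight triad, a dyad, and one loner with weak ties to everyone.
A = np.full((6, 6), 0.05)
A[:3, :3] = 0.9
A[3:5, 3:5] = 0.8
groups = dominant_sets(A)
```

In this toy scene the triad is extracted first because it is the most cohesive subset, then the dyad, and the remaining person is left as a singleton.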
Our proposed novel affinity prediction function is termed DANTE, which stands for Deep Affinity NeTwork for clustEring conversational interactants. DANTE predicts the pairwise affinities between two people, while taking into account the social context of the other interactants. DANTE consists of 3 parts:
- A Dyad Transform that computes a local feature representation for the pair of people (i, j). This is a multi-layer perceptron (MLP) which outputs a low-dimensional encoding.
- A Context Transform that computes a global feature representation for the social context of the dyad of interest. In order to handle a variable number of interactants, the Context Transform borrows ideas from the PointNet architecture. Each interactant's features are separately input into an MLP to obtain a feature encoding. We obtain our final global feature encoding by max pooling the interactants' features across the person dimension.
- A final combining and prediction layer that concatenates the local feature and global feature and inputs the result to an MLP which outputs the affinity prediction for the dyad.
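The three parts above can be sketched as a single forward pass. This is a minimal NumPy illustration with random, untrained weights. The layer sizes, the four-dimensional per-person feature vector, the concatenation of the two dyad members as input to the Dyad Transform, and the sigmoid output are all assumptions for this example, not the configuration reported in the paper.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random (untrained) weights for an MLP with the given layer sizes."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(x, layers):
    """Apply linear layers with ReLU between them (none after the last)."""
    for k, (W, b) in enumerate(layers):
        x = x @ W + b
        if k < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

def dante_forward(dyad, context, params):
    """Return an affinity in (0, 1) for one dyad given its social context.

    dyad:    array of shape (2, F), features of persons i and j
    context: array of shape (M, F), features of the other M people (M may be 0)
    """
    # Dyad Transform: encode the pair into a local feature vector.
    local = mlp(dyad.reshape(-1), params["dyad"])
    # Context Transform: shared MLP per person, then max pool over people.
    d_global = params["context"][-1][1].shape[0]
    if len(context):
        global_feat = mlp(context, params["context"]).max(axis=0)
    else:
        global_feat = np.zeros(d_global)  # assumption: empty context maps to zeros
    # Combine local and global features and predict the affinity.
    logit = mlp(np.concatenate([local, global_feat]), params["head"])
    return float(1.0 / (1.0 + np.exp(-logit[0])))

rng = np.random.default_rng(0)
F, d_local, d_global = 4, 8, 16  # hypothetical sizes
params = {
    "dyad": init_mlp(rng, [2 * F, 16, d_local]),
    "context": init_mlp(rng, [F, 16, d_global]),
    "head": init_mlp(rng, [d_local + d_global, 16, 1]),
}
dyad = rng.normal(size=(2, F))
context = rng.normal(size=(5, F))
score = dante_forward(dyad, context, params)
```

Because the context is max pooled across the person dimension, the output does not depend on the order in which the other people are listed, and the same weights handle scenes with any number of people.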
We conduct systematic evaluations of our proposed group detection approach on established benchmarks. We use 3 traditional conversational group detection datasets:
- Cocktail Party Dataset: Contains about 30 min. of video recordings of a cocktail party in a lab environment. The video shows 6 people conversing with one another and consuming drinks and appetizers. The party was recorded using four synchronized cameras installed in the corners of the room. Subjects’ positions were logged using a particle filter-based body tracker with head pose estimation. Conversational groups were annotated at 5 sec. intervals, resulting in 320 frames with ground truth group annotations.
- SALSA Dataset: 18 participants were recorded using multiple cameras and sociometric badges and then annotated at 3 second intervals over the course of 60 minutes, giving 1,200 total frames. The dataset consists of a poster presentation session and a cocktail party. Despite the differences in the structure of F-Formations that appear in these two settings, we treat SALSA as a single dataset to test generalization to different group formations.
- Coffee Break Dataset: Images were collected using a single camera outdoors. People engaged in small group conversations during coffee breaks. The number of people per frame varied from 6 to 14. People tracking is rough, with orientations only taking values of 0, 1.57, 3.14, and 4.71 radians. Compared to Cocktail Party and SALSA, the spatial features provided by Coffee Break are far noisier. A total of 119 frames have ground truth group annotations.
We additionally test how well our method generalizes by evaluating on a general group detection dataset:
- Friends Meet: 53 synthetic and real sequences of varying group types, including but not restricted to conversational groups. Keeping in line with prior work, we restrict our training and evaluation to the synthetic sequences. Prior work chose these sequences because the real sequences are not labeled by group type. It also removed queuing sequences from the data because queues are semantically and spatially different from the other group interactions in the dataset, e.g., groups of pedestrians that walk together towards a destination. Therefore, we present our results based on the 25 non-queuing synthetic sequences, with 200 annotated frames per sequence, for a total of 5,000 frames.
We evaluate our models using the F1 score at tolerance threshold T=1, under which a predicted group counts as correctly identified only if its membership matches a ground truth group exactly. See Section 4.2 of the paper for more details.
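As an illustration of the metric, the sketch below computes F1 for a single frame, treating a predicted group as correct only if it matches a ground truth group exactly, which is how the T=1 setting is commonly defined. The function name is hypothetical. How singletons are handled and how scores are aggregated across frames are left to the caller and are not specified here.

```python
def group_f1_exact(predicted, truth):
    """F1 over groups for one frame, counting only exact membership matches."""
    pred = {frozenset(g) for g in predicted}
    true = {frozenset(g) for g in truth}
    tp = len(pred & true)  # groups recovered with exactly the right members
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One of three predicted groups matches one of two ground truth groups.
f1 = group_f1_exact(predicted=[[0, 1, 2], [3, 4], [5]],
                    truth=[[0, 1, 2], [3, 4, 5]])
```

Here precision is 1/3 and recall is 1/2, so the frame scores an F1 of 0.4. Splitting a true group into two predicted groups is penalized twice: once as a missed group and once for each spurious prediction.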
In our experiments, we outperform prior state-of-the-art conversational group detection algorithms across all standard benchmark datasets and in our generalization experiment. We also experimented with an ablated version of DANTE, termed DANTE-NoContext, that omits the Context Transform. It performed worse on the higher-quality datasets, Cocktail Party and SALSA, but better on the lower-quality dataset, Coffee Break. We attribute the latter result to DANTE-NoContext's simplicity and lower susceptibility to overfitting. However, the Context Transform clearly helps our algorithm's performance on the higher-quality datasets by allowing DANTE to aggregate global information.
Our group detection approach can be used to increase the social awareness of interactive systems. To demonstrate this in practice, we built an interactive system using the Robot Operating System (ROS). In our demonstration application, a table-top robot identifies F-Formations based on users’ spatial behavior relative to each other and its own spatial configuration in our lab environment. The main components of our interactive system are a robot arm with a screen face and two RGB-D cameras. The robot and all sensors are connected to a nearby desktop computer, which processes data in real time and controls the robot.