Building embodied autonomous agents capable of participating in social interactions with humans is one of the main challenges in AI. Within the Deep Reinforcement Learning (DRL) field, this objective motivated multiple works on embodied language use. However, current approaches focus on language as a communication tool in very simplified and non-diverse social situations: the "naturalness" of language is reduced to the concept of high vocabulary size and variability. In this paper, we argue that aiming towards human-level AI requires a broader set of key social skills: 1) language use in complex and variable social contexts; 2) beyond language, complex embodied communication in multimodal settings within constantly evolving social worlds. We explain how concepts from cognitive sciences could help AI to draw a roadmap towards human-like intelligence, with a focus on its social dimensions. As a first step, we propose to expand current research to a broader set of core social skills. To do this, we present SocialAI, a benchmark to assess the acquisition of social skills of DRL agents using multiple grid-world environments featuring other (scripted) social agents. We then study the limits of a recent SOTA DRL approach when tested on SocialAI and discuss important next steps towards proficient social agents.
Dance
The agent does not manage to learn to imitate 3-step dances from visual and language observations.
CoinThief
Our agent is able to learn a sub-optimal policy, but it does not reliably manage to infer the correct number of coins visible to the NPC (47% accuracy).
DiverseExit
The agent learns a suboptimal policy: it ignores the teaching NPC and goes to a random door in every episode (25% accuracy).
Help
The agent that was trained in the Exiter role while observing its social partner execute the Helper role is unable to perform the Helper role at test time.
ShowMe
The agent fails to learn: it is unable to analyze the social peer's behavior to infer the correct button to press to enable exiting through the door.
SocialEnv
Unsurprisingly, our agent also fails to learn in our meta-environment, in which a random SocialAI environment is sampled for each new episode.
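The episode-level sampling underlying SocialEnv can be sketched as a thin meta-environment wrapper that draws a random sub-environment on every reset. This is a minimal illustration, not the benchmark's actual implementation: the class and environment names below are placeholders standing in for SocialAI's grid-world environments with scripted NPCs.

```python
import random

# Hypothetical stand-ins for SocialAI sub-environments; the real
# environments return grid-world observations and host scripted NPCs.
class DanceEnv:
    name = "Dance"
    def reset(self):
        return {"env": self.name, "step": 0}

class CoinThiefEnv:
    name = "CoinThief"
    def reset(self):
        return {"env": self.name, "step": 0}

class SocialEnv:
    """Meta-environment sketch: a random sub-environment is sampled
    for each new episode, so the agent never knows in advance which
    social task the next episode will pose."""
    def __init__(self, env_classes, seed=None):
        self.rng = random.Random(seed)
        self.envs = [cls() for cls in env_classes]
        self.current = None

    def reset(self):
        # Draw a new task uniformly at random at the start of each episode.
        self.current = self.rng.choice(self.envs)
        return self.current.reset()

meta = SocialEnv([DanceEnv, CoinThiefEnv], seed=0)
first_obs = meta.reset()
```

Because the task identity is resampled every episode, a policy must either infer the current task from observations or fall back on task-agnostic behavior, which helps explain why agents that already fail on individual environments also fail here.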
Simple Help (no zero-shot role swapping)
When the agent does not have to perform the role of its social peer zero-shot at test time, i.e. when role inference is no longer needed, it successfully learns to behave appropriately.
Simple CoinThief (with NPC-visible coins tagged)
When the social complexity of CoinThief is reduced by visually tagging the coins that are visible to the NPC (i.e. removing the need for field-of-view inference), PPO+Explo agents reach an across-seed mean performance of 74%.
Simple TalkItOut (no liar NPC)
When all NPCs are trustworthy and the social interaction is shortened (the agent only has to talk to the guide NPC), the PPO+Explo agent manages to master TalkItOut.