Rohan Paleja, Michael Munje, Kimberlee Chestnut Chang, Reed Jensen, and Matthew Gombolay
MIT Lincoln Laboratory, University of Texas at Austin, Georgia Institute of Technology
Neural Information Processing Systems (NeurIPS), 2024
In this paper, we transition from the conventional approach of crafting a teaming solution that aims for flawless out-of-the-box performance to a paradigm where end-users can actively interact with and program AI teammates, fostering a more dynamic and developmental interaction between humans and AI.
Human-human teams proceed through several stages before achieving maximal performance [Tuckman, 1965].
In the Forming stage, there is a drop in performance as team members are unfamiliar with each other and are still working out how they should collaborate.
In the Storming stage, team members continue learning about each other and begin to establish roles and strategies.
In the Norming stage, team performance begins to improve as members learn to collaborate harmoniously.
In the Performing stage, team members have established roles and strategies and are at peak performance.
We explore the question: How can we facilitate human-robot teams to reach the performing stage?
Visualization of the stages Tuckman describes a team moving through before reaching high performance.
Here, we look at two prior frameworks for producing collaborative agents and show that the AIs trained via these approaches are rigid and exhibit individualized behaviors, missing out on collaborative teaming strategies that could ultimately result in higher team scores.
To produce a collaborative AI teammate, Human-Aware PPO [Carroll et al., 2019] fine-tunes simulated teammates with human data so that the AI trains with “human-like” agents.
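To make the human-modeling step concrete, below is a minimal behavioral-cloning sketch (our own illustration, not the Human-Aware PPO code): a small network is fit to logged (state, action) pairs, and the resulting "human-like" model would then serve as the fixed partner while the AI teammate is trained with PPO. The class name, dimensions, and synthetic data are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HumanBCPolicy(nn.Module):
    """Small MLP that imitates human (state, action) pairs via behavioral cloning."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        return self.net(states)  # action logits

def behavior_clone(states, actions, state_dim=96, num_actions=6, epochs=50):
    """Fit the BC policy by minimizing cross-entropy to the demonstrated actions."""
    policy = HumanBCPolicy(state_dim, num_actions)
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(epochs):
        loss = F.cross_entropy(policy(states), actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy

# Synthetic stand-in for logged human gameplay (real data would come from human demonstrations).
demo_states = torch.randn(512, 96)
demo_actions = torch.randint(0, 6, (512,))
human_model = behavior_clone(demo_states, demo_actions)
# The PPO teammate would then be trained in the environment with `human_model`
# acting as the frozen partner policy.
```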
In the Simple Ring domain (shown to the right), a simple high-performing collaborative strategy would be to minimize agent movement via efficient handoffs using the middle counter.
Simple Ring Domain
Human-Human gameplay emulating this preferred strategy is very successful.
When attempting to use this strategy with an agent, we see that the AI gets confused and remains still for the majority of the episode. This unsuccessful collaboration receives a score of 0.
If we instead figure out the AI's strategy a priori, which is largely based on individual gameplay, we can still work with the AI, but the result is suboptimal and may not be what the human wants. This individualized coordination results in minor success, achieving a low score of 40.
Next, we look at Fictitious Co-Play [Strouse et al., 2021], a framework that produces collaborative agents without human data. It does so by training an agent with a population of diverse synthetic partners to create an AI that can collaborate with human players of varying skill. We train agents in the Optional Collaboration domain (shown on the right), which incentivizes collaboration and resource sharing by making mixed-ingredient dishes (combining onions and tomatoes) worth more than single-ingredient dishes.
Optional Collaboration Domain
We compare the following:
FCP agent evaluated with its synthetic training partner
FCP agent evaluated with real human subjects
Individual Coordination Heuristic (each agent makes single-ingredient dishes and serves them)
Collaborative Heuristic (each agent shares resources to make mixed-ingredient dishes)
Different Teaming Evaluations in the Optional Collaboration Domain
We find that all approaches underperform a simple collaborative heuristic (a critical, negative result for learning-based methods). As FCP trains an agent to work well with a population of partners, roughly a third of which are completely random agents, we posit that the teammate agent must compensate and exhibit individualized behavior.
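As a rough sketch of this training setup (not the original FCP implementation), the snippet below shows how a partner pool might be assembled from self-play checkpoints saved at initialization (effectively random), mid-training, and convergence, with the FCP agent paired against a randomly sampled partner each episode. The RL update itself is omitted, and all names are our own assumptions.

```python
import random

class PartnerCheckpoint:
    """Placeholder for a self-play checkpoint; real FCP stores trained policies."""
    def __init__(self, agent_id: int, stage: str):
        self.agent_id = agent_id
        self.stage = stage  # "init" (near-random), "mid", or "final"

def build_partner_pool(num_selfplay_runs: int = 8):
    pool = []
    for agent_id in range(num_selfplay_runs):
        # Each self-play run contributes checkpoints at several skill levels,
        # so roughly a third of the pool behaves close to randomly.
        for stage in ("init", "mid", "final"):
            pool.append(PartnerCheckpoint(agent_id, stage))
    return pool

def train_fcp_agent(pool, episodes: int = 1000):
    """Outer loop: each episode, the FCP agent is paired with a random partner."""
    for _ in range(episodes):
        partner = random.choice(pool)
        # rollout_and_update(fcp_agent, partner)  # RL update omitted in this sketch
        _ = partner

pool = build_partner_pool()
train_fcp_agent(pool)
```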
Here, we attempt to bridge this gap in performance by providing the ability to interact with and visualize agent models.
We start by training two separate agents jointly via single-agent PPO. One agent represents the human policy and the other is the collaborative agent we are attempting to create. To support generating an interpretable AI policy, we propose an architecture, the Interpretable Discrete Control Tree (IDCT). The IDCT is a differentiable decision tree model -- a neural network architecture that takes the topology of a decision tree (DT). We refer readers to our paper for complete details of the model architecture and its training procedure.
Importantly, the resultant representation after training is that of a simple decision tree with categorical probability distributions at each leaf node.
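The paper specifies the IDCT architecture and training procedure precisely; the snippet below is only a simplified, hedged illustration of the general idea of a differentiable decision tree with categorical leaf distributions. Each internal node applies a sigmoid gate to the state, path probabilities weight the leaves, and every leaf holds learnable action logits, so the whole structure is trainable end-to-end and can afterwards be read out as a crisp tree with a categorical distribution at each leaf. All dimensions and names are our own assumptions.

```python
import torch
import torch.nn as nn

class SoftDecisionTree(nn.Module):
    """Fixed-depth differentiable decision tree with categorical leaf distributions.
    A simplified stand-in for the IDCT described in the paper."""
    def __init__(self, state_dim: int, num_actions: int, depth: int = 3):
        super().__init__()
        self.depth = depth
        self.num_inner = 2 ** depth - 1          # internal (decision) nodes
        self.num_leaves = 2 ** depth             # leaf nodes
        self.weights = nn.Parameter(torch.randn(self.num_inner, state_dim) * 0.1)
        self.biases = nn.Parameter(torch.zeros(self.num_inner))
        self.leaf_logits = nn.Parameter(torch.zeros(self.num_leaves, num_actions))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Return a categorical action distribution as a mixture over leaves."""
        batch = state.shape[0]
        # Probability of routing "right" at every internal node.
        gate = torch.sigmoid(state @ self.weights.t() + self.biases)  # (batch, num_inner)
        leaf_prob = torch.ones(batch, 1)
        for level in range(self.depth):
            start = 2 ** level - 1                 # first node index at this level
            g = gate[:, start:start + 2 ** level]  # gates for nodes at this level
            # Each current path splits into (left, right) children.
            leaf_prob = torch.stack([leaf_prob * (1 - g), leaf_prob * g], dim=-1)
            leaf_prob = leaf_prob.reshape(batch, -1)
        # Mix the per-leaf categorical distributions by path probability.
        return leaf_prob @ torch.softmax(self.leaf_logits, dim=-1)  # (batch, num_actions)

tree = SoftDecisionTree(state_dim=96, num_actions=6)
probs = tree(torch.randn(4, 96))  # one action distribution per input state
```

After training, a crisp tree can be recovered by, for example, keeping the dominant feature at each node and thresholding its gate, while retaining the categorical distribution stored at each leaf.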
Overview of our collaborative AI Teammate Generation Policy Modification Scheme
After training an IDCT model, we utilize a post-hoc contextual pruning algorithm that allows us to simplify large IDCT models while exactly preserving their behavior by accounting for:
Node Hierarchy
Impossible Subspaces of the State Space
This gives us the benefit of training large tree-based models, which greatly improves ease of training, while still being able to simplify the resultant model into a smaller, equivalent representation.
Across both our domains, we find that we can reduce tree policies from 256 leaves to three and two leaves, respectively.
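The exact contextual pruning algorithm is detailed in the paper; the sketch below only illustrates the core idea under our own simplifying assumptions: propagate the interval of feature values that can actually reach each node (from ancestor splits and known feature bounds) and collapse any branch that this reachable subspace makes impossible. The node representation and example features are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Dict, Tuple

@dataclass
class Node:
    # Internal node: tests "state[feature] > threshold"; leaf node: carries an action.
    feature: Optional[int] = None
    threshold: float = 0.0
    left: Optional["Node"] = None    # taken when state[feature] <= threshold
    right: Optional["Node"] = None   # taken when state[feature] >  threshold
    action: Optional[str] = None

    @property
    def is_leaf(self) -> bool:
        return self.action is not None

def prune(node: Node, bounds: Dict[int, Tuple[float, float]]) -> Node:
    """Remove branches the reachable state subspace (per-feature intervals) can
    never enter, e.g. because an ancestor split or a known feature range already
    fixes the outcome of this node's test."""
    if node.is_leaf:
        return node
    lo, hi = bounds.get(node.feature, (float("-inf"), float("inf")))
    if lo > node.threshold:          # test is always True here -> keep right child only
        return prune(node.right, bounds)
    if hi <= node.threshold:         # test is always False here -> keep left child only
        return prune(node.left, bounds)
    left_bounds = dict(bounds); left_bounds[node.feature] = (lo, min(hi, node.threshold))
    right_bounds = dict(bounds); right_bounds[node.feature] = (max(lo, node.threshold), hi)
    return Node(feature=node.feature, threshold=node.threshold,
                left=prune(node.left, left_bounds),
                right=prune(node.right, right_bounds))

# Example: feature 0 is a binary indicator bounded in [0, 1], so the inner
# re-test against 1.5 can never go right and that branch is pruned away.
tree = Node(feature=0, threshold=0.5,
            left=Node(action="get_onion"),
            right=Node(feature=0, threshold=1.5,
                       left=Node(action="place_on_counter"),
                       right=Node(action="unreachable")))
simplified = prune(tree, bounds={0: (0.0, 1.0)})
```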
We focus on creating collaborative agents across two domains, Forced Coordination and Optional Collaboration.
Forced Coordination: Users team with an AI from which they are separated by a barrier and must pass items over it in a timely manner. Here, the agents are forced to collaborate.
Optional Collaboration: In this domain, the team can operate individually or collaboratively. Collaboration is incentivized through a higher reward for mixed-ingredient dishes over single-ingredient dishes.
Visualization of Each Domain Utilized in our Human-Subjects Study
Trained IDCT Policies in Each Domain
After training, users repeatedly team with an AI and interactively reprogram their interpretable AI teammate.
Users have several capabilities in creating an effective teammate, including modifying the tree structure by adding or removing decision nodes, changing state features the tree is conditioned on, and modifying actions and/or their respective probabilities at leaf nodes.
Visualization of Possible Modification Options
Visualization of Interface to Modify IDCT Policies
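To give a flavor of these edits (this is our own minimal sketch, not the study's interface code, and the feature and action names are hypothetical), the snippet below represents a crisp tree policy as nested nodes and applies the three kinds of modification listed above: changing the feature a decision node tests, adding structure by replacing a leaf with a new decision node, and re-weighting a leaf's action probabilities.

```python
from dataclasses import dataclass, field
from typing import Optional, Dict

@dataclass
class TreeNode:
    # Decision node: tests a named state feature; leaf node: categorical action distribution.
    feature: Optional[str] = None
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None
    action_probs: Dict[str, float] = field(default_factory=dict)

# A tiny hand-written policy: if the teammate is holding an onion, go to the pot.
policy = TreeNode(
    feature="holding_onion",
    left=TreeNode(action_probs={"get_onion": 0.8, "wait": 0.2}),
    right=TreeNode(action_probs={"go_to_pot": 1.0}),
)

# 1) Change the state feature a decision node is conditioned on.
policy.feature = "holding_tomato"

# 2) Add structure: replace a leaf with a new decision node and two leaves.
policy.left = TreeNode(
    feature="pot_is_full",
    left=TreeNode(action_probs={"get_tomato": 1.0}),
    right=TreeNode(action_probs={"serve_dish": 1.0}),
)

# 3) Re-weight the action probabilities at a leaf (kept normalized to sum to 1).
policy.right.action_probs = {"go_to_pot": 0.6, "place_on_counter": 0.4}
```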
We conduct a between-subjects user study that seeks to understand how users interact with an AI across repeated play under different factors.
Research Questions:
How does human-machine teaming performance vary across factors?
How does team development vary across factors?
Independent Variable: The Teaming Method
IV1-C1: Human-Led Policy Modification: After interacting with the agent (one teaming episode), the user can modify the policy via the GUI, allowing the user to update decision nodes and action nodes in the tree as well as tune affordances. Upon completion, the user can visualize the updated policy in its tree form prior to the next interaction.
IV1-C2: AI-Led Policy Modification: After interacting with the agent, the AI utilizes recent gameplay to fine-tune a human gameplay model via Behavioral Cloning and performs reinforcement learning for five minutes to optimize its own policy to better support the human teammate. Upon completion of policy optimization, the user can visualize the updated AI policy in its interpretable tree form prior to the next interaction. This is similar to Human-Aware PPO, adapted to an online setting.
IV1-C3: Static Policy - Interpretability: After interacting with the agent, the user can visualize the AI's policy in its interpretable tree form prior to the next interaction. Throughout this condition, the AI's policy is static.
IV1-C4: Static Policy - Black-Box: After interacting with the agent, the user does not see the AI's policy. Here, the AI policy is the same as in IV1-C3, but the human no longer has direct insight into the model.
IV1-C5: Static Policy - Fictitious Co-Play [Strouse et al., 2021]: The user teams with an AI maintaining a static black-box, neural network (NN) policy trained across a diverse partner set. As this is a baseline, we utilize an NN rather than the legible IDCT policy used in conditions IV1-C1 through IV1-C4.
Comparison Across Experiment Conditions
Per-Condition Experiment Flow for Each Teaming Episode
Users start with a teaming episode. Upon its completion, users conduct a condition-specific action, such as modifying the policy tree or visualizing it. They then take a NASA-TLX survey before continuing to the next teaming interaction. After finishing four teaming rounds, users evaluate the AI they teamed with. This procedure is carried out in each domain, with domain order randomized.
Overview of Experiment Procedure
In Forced Coordination, the IDCT policy converged to a policy with an average reward of 315.22 ± 14.59, and the neural network policy converged to an average reward of 403.16 ± 16.08 evaluated over 50 teaming simulations with the synthetic human teammate the policy was trained with.
In Optional Collaboration, the IDCT policy converged to a policy with an average reward of 171.46 ± 18.89, and the neural network policy converged to an average reward of 295.02 ± 1.86.
A consequent confound, owing to the current gap in performance between interpretable and black-box models, is that the NN policy outperforms the IDCT policy in both domains.
To assess team coordination performance, we look at the maximum reward achieved by participants while teaming with an AI.
In the first domain, we see Fictitious Co-Play outperformed all other approaches.
In the second domain, we see that 1) white-box approaches supported with policy modification can outperform white-box approaches alone, and 2) FCP outperformed conditions IV1-C2, IV1-C3, and IV1-C4.
Compared to the heuristics introduced above, we find that while users can, on average, coordinate better with the Fictitious Co-Play agent than with the conditions leveraging tree models, all approaches underperform the collaborative heuristic in the second domain.
To assess team development, we look at the change in participant rewards across iterations in each domain.
In Forced Coordination, we see that performance first decreases (following Tuckman's forming and storming stages) and then begins to increase (showing a transition into the norming stage). In the future, we would like to evaluate a larger number of iterations to see if the behavior would continue to trend upward.
In Optional Collaboration, we see that white-box approaches with policy modification benefited team improvement over repeated play, facilitating the norming stage of Tuckman’s model.
We find a significant relationship between improvement and familiarity with decision trees (p < 0.01).
Performance Data in Each Domain across Teaming Iterations
The creation of white-box learning approaches that can produce interpretable collaborative agents that achieve competitive initial performance to that of black-box agents.
The design of learning schemes to support the generation of collaborative AI behaviors rather than individual coordination.
The creation of mixed-initiative interfaces that enable users, who may vary in ability and experience, to improve team collaboration across and within interactions.
The evaluation of teaming over a larger number of interactions.