Analyzing Reward Functions via Trajectory Alignment
Calarina Muslimani, Suyog Chandramouli, Serena Booth, W. Bradley Knox, Matthew E. Taylor
Reward design in reinforcement learning (RL) is often overlooked, with the assumption that a well-defined reward is readily available. However, reward functions can be challenging to design and prone to reward hacking, potentially leading to unintended or dangerous consequences in real-world applications. To create safe RL agents, reward alignment is crucial. We define reward alignment as the process of designing reward functions that preserve the preferences of a human stakeholder. In practice, reward functions are designed with training performance as the primary measure of success; this measure, however, may not reflect alignment. This work studies the practical implications of reward design on alignment. Specifically, we (1) propose a reward alignment metric, the Trajectory Alignment coefficient, that measures the similarity between the preference orderings of a human stakeholder and the preference orderings induced by a reward function, (2) use this metric to quantify the prevalence and extent of misalignment in human-designed reward functions, and (3) examine how misalignment affects the efficacy of these human-designed reward functions in terms of training performance.
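The abstract does not give the coefficient's formula, but it describes a similarity between two preference orderings over the same trajectories. The sketch below is one plausible reading, a Kendall's-tau-style agreement between a stakeholder's trajectory scores and the returns induced by the designed reward; the function name, scoring scheme, and tie handling are assumptions, not the paper's definition.

```python
# Hedged sketch: a Kendall's-tau-style agreement between a stakeholder's preference
# ordering over trajectories and the ordering induced by a reward function.
# The exact definition of the Trajectory Alignment coefficient may differ.
from itertools import combinations

def trajectory_alignment(human_scores, returns):
    """Agreement in [-1, 1] between two orderings of the same trajectories.

    human_scores : stakeholder preference scores, one per trajectory
    returns      : returns of the same trajectories under the designed reward
    """
    concordant, discordant = 0, 0
    for i, j in combinations(range(len(human_scores)), 2):
        h = human_scores[i] - human_scores[j]
        r = returns[i] - returns[j]
        if h * r > 0:
            concordant += 1
        elif h * r < 0:
            discordant += 1
        # ties on either side are ignored in this simplified sketch
    total = concordant + discordant
    return 0.0 if total == 0 else (concordant - discordant) / total

# Perfect alignment yields 1.0, a fully reversed ordering yields -1.0.
print(trajectory_alignment([3, 2, 1], [30.0, 12.5, 4.0]))   # 1.0
print(trajectory_alignment([3, 2, 1], [4.0, 12.5, 30.0]))   # -1.0
```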
Attractive by Design: How The Attractiveness Halo Effect Shapes AI Perception
Doh Miriam, Aditya Gulati, Nuria M Oliver
Humans are subject to numerous cognitive biases when making decisions. One such bias is the attractiveness halo effect, i.e., the tendency to associate positive traits, such as intelligence or trustworthiness, with individuals who are perceived as attractive. While this bias has been studied extensively in humans, there is limited work studying the existence of such an attractiveness bias in the content generated by AI systems. This work-in-progress paper investigates the presence of the attractiveness halo effect in text-to-image (T2I) generative AI models, specifically examining how T2I models associate "attractiveness" with positive traits, such as intelligence, trustworthiness, and sociability, while linking "unattractiveness" to negative attributes. Through preliminary experiments generating over 12,000 face images labeled with various traits across gender and race categories, we measure the similarity between images associated with attractiveness and other traits by computing centroid distances in the feature embedding space. Initial findings indicate the presence of a halo effect, similar to that observed in humans, where images deemed attractive are more closely associated with positive than with negative traits. These results suggest that T2I models embed an attractiveness bias. However, the extent of these associations varies across demographic groups, with notable differences based on gender and race. This study underscores the potential of generative AI to replicate our own biases, perpetuating societal stereotypes, with important implications for model development and for the application of these models in downstream tasks.
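The centroid-distance measurement described above can be illustrated with a small sketch: embeddings of generated face images (random placeholders here) are grouped by trait label, and the distance between the "attractive" centroid and each other trait centroid is compared. The embedding model, dimensionality, and distance metric are assumptions; the paper's actual feature space is not reproduced.

```python
# Hedged sketch of centroid distances in a feature embedding space.
import numpy as np

def centroid(embeddings):
    return np.mean(np.asarray(embeddings), axis=0)

def centroid_distance(group_a, group_b):
    return float(np.linalg.norm(centroid(group_a) - centroid(group_b)))

rng = np.random.default_rng(0)
# Placeholder embeddings standing in for image features grouped by trait label.
embeddings_by_trait = {
    "attractive": rng.normal(0.0, 1.0, size=(100, 512)),
    "intelligent": rng.normal(0.1, 1.0, size=(100, 512)),
    "untrustworthy": rng.normal(0.5, 1.0, size=(100, 512)),
}

ref = embeddings_by_trait["attractive"]
for trait, emb in embeddings_by_trait.items():
    if trait != "attractive":
        # A smaller distance indicates a closer association with "attractive".
        print(trait, round(centroid_distance(ref, emb), 3))
```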
Can ChatGPT Predict What I Think? Exploring Transformer Models' Prediction of Human Information Processes
Meghna Bhadra, Marco Ragni
While AI and machine learning often aim to solve problems optimally, cognitive modeling focuses on describing general patterns of human information processing. This paper investigates the application of transformer-based models for the automatic generation of cognitive models tailored to individual users. Using a dataset of 50 participants solving 64 syllogistic reasoning tasks, we demonstrate that transformer architectures can predict individual responses with accuracy comparable to the best cognitive models. Our methodology not only leverages transformers, whose internal processes are opaque, but also aims to generate interpretable algorithms. Our approach achieves approximately 47% accuracy, placing it second to the best state-of-the-art cognitive model at 49%, while outperforming several established cognitive models.
Cognitive AI-Driven Recommendations for Improving Human Search Behavior in Optimal Stopping
Erin H. Bugbee, Cleotilde Gonzalez
People frequently encounter optimal stopping decisions in their lives, requiring them to decide when to stop a search of options and make a selection. In these situations, people often struggle to balance exploration and exploitation, leading to suboptimal decisions. This research investigates whether cognitive AI-driven recommendations can improve decision making in optimal stopping tasks. We propose an experiment using an Instance-Based Learning (IBL) model to generate individualized recommendations in an optimal stopping task. Participants encounter sequences of options and must decide whether to continue exploring or stop the search and select the current option. Recommendations may be provided for every option according to theoretically optimal thresholds, or based on the IBL model's predictions of whether an individual will deviate from optimal behavior. We hypothesize that cognitive AI-driven recommendations provided based on an individual's observed behavior will result in better decision-making outcomes than always providing an optimal recommendation, and that both types of recommendations will lead to better outcomes than no recommendation, because compliance will be greater and the recommendations will help align behavior with the optimal strategy. Our results will inform how recommendation systems can be used to guide people in sequential decision-making tasks by bridging cognitive modeling and AI to help people make better decisions.
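The "theoretically optimal thresholds" mentioned above can be illustrated with a generic stopping problem: options are values drawn from Uniform(0, 1), the goal is to maximize the selected value, and the optimal policy stops whenever the current option exceeds the expected value of continuing. This is only an illustrative stand-in; the task, option distribution, and the IBL model used in the study are not reproduced here.

```python
# Hedged sketch of threshold-based stop/continue recommendations for a generic
# optimal stopping task with Uniform(0, 1) option values.
def continuation_values(n_options):
    """v[k] = expected value achievable with k options still to come."""
    v = [0.0] * (n_options + 1)
    v[1] = 0.5                       # forced to accept the last option
    for k in range(2, n_options + 1):
        # E[max(X, v[k-1])] for X ~ Uniform(0, 1)
        v[k] = (1.0 + v[k - 1] ** 2) / 2.0
    return v

def recommend(option_value, options_remaining, v):
    """Recommend 'stop' if the current option beats the value of continuing."""
    return "stop" if option_value >= v[options_remaining - 1] else "continue"

v = continuation_values(10)
print(recommend(0.9, options_remaining=10, v=v))   # 'stop'
print(recommend(0.6, options_remaining=10, v=v))   # 'continue'
```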
Cognitive Models Improve Machine-based Inference of Latent Motives [Spotlight]
Anderson K. Fitch, Peter D. Kvam
The ability to make inferences about another person's latent states from their behavior is integral to how people behave in social situations, yet is lacking in most artificial intelligence (AI) systems. The present study tests the capacity of cognitive models to assess latent motives by evaluating different AIs tasked with inferring a human player's intent during a continuous control task. Neural networks were trained by (a) directly using observable information or (b) selecting important features by estimating the parameters of a generative model of movement behavior inspired by approach-avoidance theory. Comparisons of classifier accuracy suggest that latent model parameters predict a participant's intent at a level exceeding human performance. Furthermore, classifier performance was best when model-based inferences were combined with summary statistics about behavior, yielding faster and more stable network training compared to networks that had no manual feature extraction. Equipping AI with cognitive models is a promising avenue for developing explainable, accurate, and trustworthy systems.
Controllable Complementarity: Subjective Preferences in Human-AI Collaboration [Spotlight]
Chase McDonald, Cleotilde Gonzalez
Although much existing work in human-AI interaction and teaming focuses on optimizing objective performance, there is a growing need to understand subjective human preferences in these interactions. To that end, we investigate human preferences for controllability in a shared workspace task where humans collaborate with AI. We introduce Interpretable Behavior Conditioning (IBC), a reinforcement learning training algorithm to enable humans to control the behaviors of their AI partners. In an initial experiment, we validate the robustness of IBC in producing effective AI policies when controls are disabled and hidden, relative to standard self-play policies. In a second experiment, we expose the controls to humans, demonstrating that participants perceive their AI partners as more effective and enjoyable to interact with when they can dictate the AI’s behavior. These findings underscore the importance of designing AI systems that prioritize not only task performance but also subjective quality of human-AI collaboration. Our results have implications for broadening the scope of human-AI complementarity, where AI can complement humans in terms of not only objective outcomes but also subjective preferences.
Decision Making and Theory of Mind for Human-in-the-Loop Settings
Sammie Katt, Samuel Kaski
Humans are part of many real-world applications, either as part of the problem or (preferably) of its solution. In practice, these applications treat humans as a passive data source, either as ground truth or as a faithful representation of the human's belief over the ground truth. This is seen, for example, when fine-tuning large language models with human feedback or in personalized recommender systems. This treatment is not only incorrect in existing applications (users are known to adjust their feedback according to their understanding of the system they are interacting with) but also excludes more collaborative settings from being tackled. We propose a richer decision-making framework for human-AI systems, based on two crucial missing components: the objective of the human teammates and their belief over the system they are interacting with. We demonstrate that it is a natural representation for human-AI collaboration in both toy and realistic problems.
Developmentally informed large language models for interdisciplinary collaboration
Yaxin Liu, Adam Green, Stella F. Lourenco
Recent advances in large language models (LLMs) have achieved remarkable success on a variety of linguistic and reasoning tasks. However, current pre-training approaches rely predominantly on massive, static corpora of adult-level text, overlooking the incremental, developmental processes by which human knowledge emerges and evolves. Drawing inspiration from developmental robotics, cognitive modeling, and curriculum learning, we argue for integrating developmental milestones into LLM training. By systematically structuring training inputs according to these benchmarks, LLMs could become powerful computational platforms for testing developmental hypotheses in cognitive science. We discuss the practical challenges of assembling suitable datasets, highlight existing resources, and examine how such models could foster deeper collaboration between AI researchers and developmental scientists.
Evaluating the Rationality of AI Decision Making Using the Transitivity Axiom [Spotlight]
Kiwon Song, James Jennings, Clintin Stober
Fundamental choice axioms, such as transitivity of preference, provide testable conditions for determining whether human decision making is rational, i.e., consistent with a utility representation. Recent work has demonstrated that AI systems trained on human data can exhibit similar reasoning biases as humans and that AI can, in turn, bias human judgments through AI recommendation systems. We evaluate the rationality of AI responses via a series of choice experiments originally designed to test transitivity of preference in humans. We considered ten versions of Meta's Llama 2 and 3 LLM models. We applied Bayesian model selection to evaluate whether these AI-generated choices violated two prominent models of transitivity. We found that the Llama 2 and 3 models generally satisfied transitivity, but when violations did occur, they appeared only in the Chat/Instruct versions of the LLMs. We argue that rationality axioms, such as transitivity of preference, can be useful for evaluating and benchmarking the quality of AI-generated decision making.
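One simple, testable condition in this family is weak stochastic transitivity over pairwise choice proportions, sketched below; the paper's Bayesian model selection over formal transitivity models is considerably more sophisticated, so this is only an illustrative check, not the authors' method.

```python
# Hedged sketch: flag triples (a, b, c) that violate weak stochastic transitivity,
# i.e., a is chosen over b and b over c at least half the time, but a is chosen
# over c less than half the time.
from itertools import permutations

def weak_stochastic_transitivity_violations(p):
    """p[(a, b)] = proportion of trials in which a was chosen over b."""
    items = {x for pair in p for x in pair}
    violations = []
    for a, b, c in permutations(items, 3):
        if (p.get((a, b), 0.5) >= 0.5 and p.get((b, c), 0.5) >= 0.5
                and p.get((a, c), 0.5) < 0.5):
            violations.append((a, b, c))
    return violations

choice_proportions = {
    ("A", "B"): 0.7, ("B", "A"): 0.3,
    ("B", "C"): 0.6, ("C", "B"): 0.4,
    ("A", "C"): 0.4, ("C", "A"): 0.6,   # intransitive cycle A > B > C > A
}
# Prints the three rotations of the intransitive cycle.
print(weak_stochastic_transitivity_violations(choice_proportions))
```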
Insights from Behavioral Experiments on Conformity in Human-AI Collaboration
May Kristine Jonson Carlon, Julian Matthews, Yasuo Kuniyoshi
Traditional automated systems relied on paradigms where machines often played leader-like roles in decision-making. With recent advancements in artificial intelligence (AI), the need for greater human oversight, particularly in knowledge-intensive tasks where AI is complex and often opaque, has led to calls for more human-centric approaches. While justified, this shift can lead to inefficiencies, such as underutilizing AI expertise in areas of its strength or human errors from over-relying on flawed guidance. This study examines group decision-making in mixed human-AI settings, focusing on the interplay of perceived competence, group dynamics, and dissent. Using group dynamics experiments previously shown to elicit conformity to the majority, we will explore how dissent from humans or AI influences group decision outcomes. By uncovering the mechanisms behind conformity and dissent, this research aims to inform the design of systems that balance trust in AI with independent human judgment, facilitating effective collaboration.
Leveraging Large Language Models to Measure Gender Representation Bias in Gendered Language Corpora
Erik Derner, Sara Sansalvador de la Fuente, Yoan Gutierrez, Paloma Moreda Pozo, Nuria M Oliver
Language corpora are used in a variety of natural language processing (NLP) tasks, such as training large language models (LLMs). Biases present in text corpora, reflecting sociolinguistic patterns, can lead to the perpetuation and amplification of societal inequalities. The phenomenon of gender bias is particularly pronounced in gendered languages like Spanish or French, where grammatical structures inherently encode gender, making the bias analysis more challenging. A first step in quantifying gender bias in text entails computing biases in gender representation, i.e., differences in the prevalence of words referring to males vs. females. Existing methods to measure gender representation bias in text corpora have mainly been proposed for English and do not generalize to gendered languages due to the intrinsic linguistic differences between English and gendered languages. This paper introduces a novel methodology that leverages the contextual understanding capabilities of LLMs to quantitatively measure gender representation bias in Spanish corpora. By utilizing LLMs to identify and classify gendered nouns and pronouns in relation to their reference to human entities, our approach provides a robust analysis of gender representation bias in gendered languages. We empirically validate our method on four widely-used benchmark datasets, uncovering significant gender prevalence disparities with a male-to-female ratio ranging from 4:1 to 6:1. These findings highlight the presence of gender biases in LLM training data, which can, in turn, adversely affect human-AI interactions. Our methodology contributes to the development of more equitable language technologies, aiming to reduce biases in LLMs and improve fairness in human-LLM collaboration.
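The counting step behind the male-to-female ratio can be sketched as follows. A hypothetical `classify_gender` callable stands in for the LLM that decides whether a Spanish noun or pronoun refers to a male or female human entity (or to neither); only the ratio computation is shown, and the prompt design and LLM choice from the paper are not reproduced.

```python
# Hedged sketch of the gender-representation ratio computation.
def classify_gender(token, context):
    """Placeholder for an LLM call returning 'male', 'female', or 'none'."""
    raise NotImplementedError("plug in an LLM-backed classifier here")

def gender_representation_ratio(tokens_with_context, classify=classify_gender):
    counts = {"male": 0, "female": 0}
    for token, context in tokens_with_context:
        label = classify(token, context)
        if label in counts:
            counts[label] += 1
    female = max(counts["female"], 1)           # avoid division by zero
    return counts["male"] / female, counts

# Example with a trivial stand-in classifier:
demo = [("él", "él llegó tarde"),
        ("profesora", "la profesora explicó"),
        ("niño", "el niño corre")]
ratio, counts = gender_representation_ratio(
    demo, classify=lambda t, c: {"él": "male", "profesora": "female", "niño": "male"}[t])
print(ratio, counts)   # 2.0 {'male': 2, 'female': 1}
```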
Performance Optimization of Ratings-Based Reinforcement Learning
Evelyn Rose, Devin White, Mingkang Wu, Vernon Lawhern, Nicholas R Waytowich, Yongcan Cao
This paper explores multiple optimization methods to improve the performance of rating-based reinforcement learning (RbRL). RbRL, a method based on human ratings, infers reward functions in reward-free environments so that standard reinforcement learning, which requires a reward function, can subsequently be used for policy learning. Specifically, RbRL minimizes a cross-entropy loss that quantifies the differences between human ratings and the estimated ratings derived from the inferred reward; a low loss therefore indicates a high degree of consistency between human and estimated ratings. Despite its simple form, RbRL has several hyperparameters and can be sensitive to their settings. It is therefore critical to conduct comprehensive experiments to understand the impact of these hyperparameters on the performance of RbRL. This paper is a work in progress, providing users with some general guidelines on how to select hyperparameters in RbRL.
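The cross-entropy objective described above can be sketched as follows. Here the estimated rating distribution is a softmax over negative distances between a segment's normalized return (under the learned reward) and assumed rating-class anchors; the exact parameterization in RbRL may differ, so the anchors and temperature below are illustrative assumptions.

```python
# Hedged sketch of a cross-entropy loss between human ratings and estimated ratings.
import numpy as np

def estimated_rating_probs(normalized_return, class_anchors, temperature=0.1):
    """Softmax over negative distances to assumed rating-class anchors."""
    logits = -np.abs(normalized_return - np.asarray(class_anchors)) / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def rating_cross_entropy(human_ratings, normalized_returns, class_anchors):
    loss = 0.0
    for rating, ret in zip(human_ratings, normalized_returns):
        probs = estimated_rating_probs(ret, class_anchors)
        loss -= np.log(probs[rating] + 1e-12)
    return loss / len(human_ratings)

# Three rating classes anchored at low / medium / high normalized return.
anchors = [0.2, 0.5, 0.8]
print(rating_cross_entropy(human_ratings=[0, 2, 1],
                           normalized_returns=[0.15, 0.9, 0.55],
                           class_anchors=anchors))
```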
Personalizing Exposure Therapy via Reinforcement Learning
Athar MahmoudiNejad, Matthew Guzdial, Pierre Boulanger
Personalized therapy, in which a therapeutic practice is adapted to an individual patient, can lead to improved health outcomes. Typically, this is accomplished by relying on a therapist's training and intuition along with feedback from a patient. However, this requires the therapist to become an expert on any technological components, such as in the case of Virtual Reality Exposure Therapy (VRET). While there exist approaches to automatically adapt therapeutic content to a patient, they generally rely on hand-authored, pre-defined rules, which may not generalize to all individuals. In this paper, we propose an approach to automatically adapt therapeutic content to patients based on physiological measures. We implement our approach in the context of virtual reality arachnophobia exposure therapy, and rely on experience-driven procedural content generation via reinforcement learning (EDPCGRL) to generate virtual spiders to match an individual patient. Through a human subject study, we demonstrate that our system significantly outperforms a more common rules-based method, highlighting its potential for enhancing personalized therapeutic interventions.
Preference Learning of Latent Decision Utilities with a Human-like Model of Preferential Choice
Sebastiaan De Peuter, Shibei Zhu, Yujia Guo, Andrew Howes, Samuel Kaski
Preference learning methods make use of models of human choice in order to infer the latent utilities that underlie human behaviour. However, accurate modeling of human choice behavior is challenging due to a range of context effects that arise from how humans contrast and evaluate options. Cognitive science has proposed several models that capture these intricacies but, due to their intractable nature, work on preference learning has, in practice, had to rely on tractable but simplified variants of the well-known Bradley-Terry model. In this paper, we take one state-of-the-art intractable cognitive model and propose a tractable surrogate that is suitable for deployment in preference learning. We then introduce a mechanism for fitting the surrogate to human data and extend it to account for data that cannot be explained by the original cognitive model. We demonstrate on large-scale human data that this model produces significantly better inferences on static and actively elicited data than existing Bradley-Terry variants. We further show in simulation that when using this model for preference learning, we can significantly improve utility in a range of real-world tasks.
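For reference, the Bradley-Terry baseline that the abstract contrasts with models the probability of choosing one option over another as a logistic function of the difference in latent utilities, as in the minimal sketch below; the paper's tractable surrogate of a cognitive model is more elaborate and is not reproduced here.

```python
# Hedged sketch of the standard Bradley-Terry / logit choice model.
import math

def bradley_terry_choice_prob(utility_a, utility_b):
    """P(choose A over B) as a logistic function of the utility difference."""
    return 1.0 / (1.0 + math.exp(-(utility_a - utility_b)))

print(bradley_terry_choice_prob(1.2, 0.4))   # ~0.69: A is preferred more often than not
```

Context effects such as attraction or compromise cannot be captured by this form, since the choice probability depends only on the two options' utilities, which is the limitation the proposed surrogate targets.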
RbRL2.0: Integrated Reward and Policy Learning for Rating-based Reinforcement Learning
Mingkang Wu, Devin White, Vernon Lawhern, Nicholas R Waytowich, Yongcan Cao
Reinforcement learning (RL), a common tool in decision making, learns policies from collected experiences based on their associated cumulative returns, without otherwise distinguishing among them. In contrast, humans often learn to distinguish among different levels of performance and extract the underlying trends to improve their decision making. Motivated by this, this paper proposes a novel RL method that mimics humans' decision-making process by differentiating among collected experiences for effective policy learning. The main idea is to extract important directional information from experiences with different performance levels, named ratings, so that policies can be updated to deviate appropriately from experiences with different ratings. Specifically, we propose a new policy loss function that penalizes distribution similarities between the current policy and failed experiences with different ratings, and assigns different weights to the penalty terms based on the rating classes. Meanwhile, reward learning from these rated samples can be combined with the new policy loss, yielding integrated reward and policy learning from rated samples. Optimizing the integrated reward and policy loss leads to the discovery of directions for policy improvement that maximize cumulative rewards while penalizing similarity to the lowest-rated experiences the most and to the highest-rated experiences the least. To evaluate the effectiveness of the proposed method, we present results for experiments on a few typical environments that show improved convergence and overall performance over the existing rating-based reinforcement learning method, which uses reward learning only.
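The integrated objective described above can be sketched as a reward-learning loss combined with a policy penalty that weights similarity to experiences from each rating class, penalizing resemblance to low-rated behavior most. The weights, the KL-based similarity measure, and the way the two terms are combined below are assumptions for illustration, not the exact RbRL2.0 formulation.

```python
# Hedged sketch of an integrated reward-and-policy loss over rated experiences.
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete action distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def integrated_loss(reward_loss, policy_dist, rated_action_dists, class_weights):
    """Similarity (negative divergence) to each rating class is penalized,
    with larger weights on lower rating classes."""
    penalty = 0.0
    for rating, dist in rated_action_dists.items():
        penalty -= class_weights[rating] * kl(policy_dist, dist)
    return reward_loss + penalty

policy = [0.1, 0.2, 0.7]
rated = {0: [0.6, 0.3, 0.1],    # lowest-rated experiences
         2: [0.1, 0.2, 0.7]}    # highest-rated experiences
weights = {0: 1.0, 2: 0.1}      # penalize similarity to low ratings most
print(integrated_loss(reward_loss=0.4, policy_dist=policy,
                      rated_action_dists=rated, class_weights=weights))
```

Minimizing this loss pushes the policy away from the action distributions of low-rated experiences while leaving resemblance to high-rated experiences largely unpenalized.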
Selecting the Best AI Model in AI-Assisted Decision-Making Tasks: Balancing Accuracy and Confidence Discrimination
ZhaoBin Li, Mark Steyvers
In scenarios where human decision-making is augmented by AI systems, selecting the optimal AI model to maximize the accuracy of hybrid decisions remains a critical challenge. This study considers the importance of confidence discrimination—the ability of an AI to assign confidence scores that reliably differentiate between correct and incorrect predictions. This paper introduces a formal framework to evaluate the trade-offs between accuracy and confidence discrimination in hybrid human-AI decision-making contexts. We analytically derive conditions under which an AI model with lower accuracy but higher confidence discrimination can lead to higher combined decision accuracy when paired with an idealized human decision-maker. This work underscores the importance of balancing accuracy and confidence discrimination in AI-assisted decision making, advancing our understanding of human-AI complementarity.
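The trade-off between accuracy and confidence discrimination can be illustrated with a small simulation under a simple deferral rule: the human follows the AI when its confidence is high and decides independently otherwise. The thresholds, accuracies, and two-point confidence model below are illustrative assumptions, not the paper's formal framework.

```python
# Hedged sketch: hybrid accuracy under a confidence-based deferral rule.
import random

random.seed(0)

def simulate(ai_accuracy, conf_correct, conf_wrong, human_accuracy,
             threshold=0.7, trials=100_000):
    correct = 0
    for _ in range(trials):
        ai_right = random.random() < ai_accuracy
        confidence = conf_correct if ai_right else conf_wrong
        if confidence >= threshold:                 # defer to the AI
            correct += ai_right
        else:                                       # human decides alone
            correct += random.random() < human_accuracy
    return correct / trials

# Higher accuracy but no confidence discrimination (same confidence right or wrong):
print(simulate(ai_accuracy=0.80, conf_correct=0.75, conf_wrong=0.75, human_accuracy=0.70))
# Lower accuracy but strong confidence discrimination:
print(simulate(ai_accuracy=0.75, conf_correct=0.90, conf_wrong=0.40, human_accuracy=0.70))
```

In this toy setting the second, less accurate AI yields higher hybrid accuracy (about 0.92 versus 0.80), because its confidence reliably signals when the human should take over.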
Sequential Preference Elicitation for Utility Maximisation in Auction Games [Spotlight]
Xiaomei Mi, Jianhong Wang, Samuel Kaski
In auction mechanism design, current research primarily targets automated design with clearly defined objectives, such as maximising revenue, excluding human involvement. However, when objectives are not explicit, these automatic methods fall short. In such cases, we propose to help the auction mechanism designer solve their design problem by introducing an AI assistant that interactively recommends auction rules and shares the same interface to the auction environment as the designer. The cooperating pair formed by the designer and the AI assistant can be conceptualised as a ‘centaur’, an entity with common external actions. In this setting, the centaur's internal interactions iterate three steps: (1) the AI assistant recommends an action to the auctioneer; (2) the auctioneer accepts or rejects the recommendation; (3) the AI assistant updates its belief about the auctioneer's goal based on the auctioneer's decision. Compared to the auctioneer making decisions independently, the AI assistant mitigates the auctioneer's bounded rationality by offering advice that identifies good actions the auctioneer might otherwise miss. Simulated experiments on repeated auctions, where the auctioneer has tacit preferences, show that the AI assistant improves performance over that of bounded-rational agents designing auctions without assistance.
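The three-step loop can be sketched as follows: the assistant holds a belief over a small set of candidate designer goals, recommends the action that is best under its current belief, and performs a Bayesian update after observing acceptance or rejection. The goal set, action space, utilities, and likelihood model below are illustrative placeholders, not the paper's setup.

```python
# Hedged sketch of the recommend / accept-or-reject / belief-update loop.
def best_action(belief, utilities, actions):
    """Step 1: recommend the action with the highest expected utility under the belief."""
    return max(actions, key=lambda a: sum(belief[g] * utilities[g][a] for g in belief))

def update_belief(belief, utilities, actions, recommended, accepted, noise=0.2):
    """Step 3: Bayes update assuming the designer mostly accepts recommendations
    that are best for their true goal and rejects those that are not."""
    new_belief = {}
    for goal, prob in belief.items():
        goal_best = max(actions, key=lambda a: utilities[goal][a])
        agrees = (recommended == goal_best)
        likelihood = (1 - noise) if (accepted == agrees) else noise
        new_belief[goal] = prob * likelihood
    total = sum(new_belief.values())
    return {g: p / total for g, p in new_belief.items()}

actions = ["raise_reserve_price", "lower_reserve_price"]
utilities = {"maximize_revenue": {"raise_reserve_price": 1.0, "lower_reserve_price": 0.2},
             "maximize_participation": {"raise_reserve_price": 0.1, "lower_reserve_price": 0.8}}
belief = {"maximize_revenue": 0.5, "maximize_participation": 0.5}

rec = best_action(belief, utilities, actions)                             # step 1
belief = update_belief(belief, utilities, actions, rec, accepted=False)   # steps 2-3
print(rec, belief)   # rejection shifts belief toward maximize_participation
```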
SHARPIE: A Platform for Conducting Experiments on the Interaction of Reinforcement Learning Agents and Humans
Hüseyin Aydın, Kevin Godin-Dubois, Libio Goncalves Braz, Floris den Hengst, Kim Baraka, Mustafa Mert Çelikok, Andreas W.M. Sauter, Shihan Wang, Frans A Oliehoek
Reinforcement learning (RL) offers a general approach for modeling and training AI agents, including human-AI interaction scenarios. In this paper, we propose SHARPIE (Shared Human-AI Reinforcement Learning Platform for Interactive Experiments) to address the need for a generic framework to support experiments with RL agents and humans. Its modular design consists of a versatile wrapper for RL environments and algorithm libraries, a participant-facing web interface, logging utilities, and deployment on popular cloud and participant-recruitment platforms. It empowers researchers to study a wide variety of research questions related to the interaction between humans and RL agents, including those related to reward specification and learning, action delegation, preference elicitation, user modeling, and human-AI teaming. Underlying the platform is a generic interface for human-RL interactions that we hope will standardize the field of study on RL in human contexts.
Towards Neural Network based Cognitive Models of Dynamic Decision-Making by Humans
Changyu Chen, Shashank Reddy Chirra, Maria José Ferreira, Cleotilde Gonzalez, Arunesh Sinha, Pradeep Varakantham
Modeling human cognitive processes in dynamic decision-making tasks has long been an endeavor in AI because such models can make AI systems more intuitive and personalized, mitigate human biases, and enhance training in simulation. Some initial work has attempted to utilize neural networks (and large language models) but often assumes one common model for all humans and aims to emulate human behavior in aggregate. However, the behavior of each human is distinct, heterogeneous, and relies on specific past experiences in certain tasks. For instance, consider two individuals responding to a phishing email: one who has previously encountered and identified similar threats may recognize it quickly, while another without such experience might fall for the scam. In this work, we build on Instance-Based Learning (IBL), which posits that human decisions are based on similar situations encountered in the past. However, IBL relies on simple, fixed-form functions to capture the mapping from past situations to current decisions. To address this, we propose two new attention-based neural network models that use open-form, non-linear functions to model distinct and heterogeneous human decision-making in dynamic settings. We experiment with two distinct datasets gathered from human-subject experiments: one focusing on detection of phishing emails by humans and another in which humans act as attackers in a cybersecurity setting and decide on an attack option. We conduct extensive experiments with our two neural network models, IBL, and GPT-3.5, and demonstrate that the neural network models significantly outperform IBL in representing human decision-making while providing similar interpretability of human decisions as IBL. Overall, our work yields promising results for further use of neural networks in cognitive modeling of human decision making.
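The connection between IBL and attention can be illustrated with a toy sketch: IBL predicts decisions by blending the outcomes of similar past instances, and an attention layer generalizes this fixed similarity function to a learned one. The code below shows only the shared structure, softmax attention over a memory of past situations; the actual architectures and datasets in the paper are not reproduced, and the feature vectors are made up.

```python
# Hedged sketch: softmax-attention-weighted blending of past outcomes (IBL-style).
import numpy as np

def attention_blend(query, memory_keys, memory_values, temperature=1.0):
    """Return the attention-weighted average of past outcomes and the weights."""
    scores = np.asarray(memory_keys) @ np.asarray(query) / temperature
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ np.asarray(memory_values), weights

# Memory of past situations (feature vectors) and the decisions observed there.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
values = [1.0, 1.0, 0.0]           # e.g., 1 = flagged the email as phishing
prediction, weights = attention_blend(query=[0.95, 0.05],
                                      memory_keys=keys, memory_values=values)
print(round(float(prediction), 3), np.round(weights, 3))
```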