As large language models (LLMs) are increasingly used to model and augment collective decision-making, it is critical to examine their alignment with human social reasoning. We present an empirical framework for assessing collective alignment, in contrast to prior work at the individual level. Using the Lost at Sea social psychology task, we conduct a large-scale online experiment (N = 748), randomly assigning groups to leader elections with either visible demographic attributes (e.g., name, gender) or pseudonymous aliases. We then simulate matched LLM groups conditioned on the human data, benchmarking Gemini 2.5, GPT-4.1, Claude Haiku 3.5, and Gemma 3. LLM behaviors diverge: some models mirror human biases, while others mask these biases and attempt to compensate for them. We empirically demonstrate that human-AI alignment in collective reasoning depends on context, cues, and model-specific inductive biases. Understanding how LLMs align with collective human behavior is critical to advancing socially aligned AI, and demands dynamic benchmarks that capture the complexities of collective reasoning.
Whether individuals feel confident about their own actions, choices, or statements being correct, and how these confidence levels differ between individuals, are two key primitives for countless behavioral theories and phenomena. In cognitive tasks, individual confidence is typically measured as the average of reports about choice accuracy, but how reliably these averages characterize within- and between-individual confidence remains surprisingly undocumented. Here, we perform a large-scale resampling exercise in the Confidence Database to investigate the reliability of individual confidence estimates, and of comparisons across individuals' confidence levels. Our results show that confidence estimates are more stable than their choice-accuracy counterparts, reaching a reliability plateau after roughly 50 trials, regardless of a number of task-design characteristics. While these results constitute a reliability upper bound for task-based confidence measures, and thereby leave open the question of the reliability of the construct itself, they characterize the robustness of past and future task designs.
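To make the resampling logic concrete, here is a minimal sketch in Python of how a reliability-versus-trial-count curve of this kind can be estimated; the synthetic data, sample sizes, and noise levels are illustrative assumptions standing in for the Confidence Database, not the authors' actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Confidence Database: each participant has a
# latent mean confidence; single-trial reports are noisy around it.
n_subjects, n_trials = 200, 400
latent = rng.normal(0.7, 0.1, size=n_subjects)  # true per-person confidence
reports = np.clip(
    latent[:, None] + rng.normal(0, 0.25, size=(n_subjects, n_trials)), 0, 1
)

def reliability(n, n_boot=200):
    """Correlation across participants of mean confidence computed from two
    disjoint random subsamples of n trials each (split-half style)."""
    rs = []
    for _ in range(n_boot):
        idx = rng.permutation(n_trials)
        a = reports[:, idx[:n]].mean(axis=1)
        b = reports[:, idx[n:2 * n]].mean(axis=1)
        rs.append(np.corrcoef(a, b)[0, 1])
    return float(np.mean(rs))

for n in (5, 10, 25, 50, 100):
    print(f"trials per estimate: {n:3d}  reliability: {reliability(n):.2f}")
```

In this split-half formulation, the curve flattens once per-person measurement noise becomes small relative to between-person differences, which is the plateau the abstract refers to.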
Men tend to self-select into leadership positions more often than women. This chapter explores how confidence contributes to this phenomenon through two laboratory experiments. In Study 1, we measure several dimensions of confidence and examine their link with willingness to lead (WTL) in different leadership contexts. For both genders, we find that confidence is a stronger predictor of WTL when the leader's role is to advise followers rather than to make decisions on their behalf. We then show that the confidence–WTL relationship in advisory roles reflects a causal mechanism, using a second experiment (Study 2) that exogenously shifts relative confidence via noisy performance feedback layered onto our baseline design. We also show that although this causal link holds for both genders, its strength may vary with the type of signal, reflecting heterogeneous responses to positive versus negative feedback. Together, these findings highlight how confidence and leadership context interact to shape gender differences in leadership ambition.
Women remain markedly underrepresented in leadership positions, shaping who holds influence, whose voices are heard, and how collective decisions are made. While these disparities are well documented, it is not well understood when and how they emerge in interactive group decision-making. Because gender is visible in most contexts, it remains unclear to what extent observed disparities reflect behavioral responses to gender cues and gender-based evaluative biases, versus more internal mechanisms that operate even when gender is concealed. We address this question through a large-scale online experiment (N = 816) in which participants engage in collective reasoning before electing a group leader. Groups are randomly assigned to interact either under full identification (gendered avatars and pronouns) or full anonymity (animal identities). When gender is visible, men are significantly more likely to be elected as leaders despite equal task performance. Under anonymity, this gender gap is no longer significant. However, anonymity does not eliminate all gendered dynamics. We identify a two-stage mechanism: women are less likely to volunteer for leadership in both conditions, and when gender is visible, peer evaluations further disadvantage them during elections. Importantly, even when gender is hidden, peers infer leadership intent from communication styles. Through linguistic analyses, we show that the language and communication cues associated with higher perceived leadership potential are more frequently used by men, suggesting that communication itself can sustain bias even in the absence of explicit gender cues. Our findings highlight the value of online environments for disentangling the social, behavioral, and linguistic mechanisms underlying persistent gender gaps in group decision-making.
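As a purely illustrative sketch of this kind of linguistic analysis (in Python; the feature set, synthetic data, and model are hypothetical assumptions, not the study's actual pipeline), one can relate simple per-participant communication features to election outcomes with a logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic stand-in for chat transcripts: per-participant linguistic
# features (the feature names are illustrative, not the study's set).
n = 500
X = np.column_stack([
    rng.normal(120, 30, n),     # words written during discussion
    rng.normal(0.05, 0.02, n),  # rate of hedging terms ("maybe", "I think")
    rng.normal(0.03, 0.01, n),  # rate of directive phrases ("let's", "we should")
])
# Simulated election outcome, loosely tied to the directive-phrase rate.
p = 1 / (1 + np.exp(-(X[:, 2] - 0.03) * 200))
elected = rng.binomial(1, p)

# Standardize so coefficients are comparable across features.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
model = LogisticRegression().fit(Xz, elected)
for name, coef in zip(["verbosity", "hedging", "directive"], model.coef_[0]):
    print(f"{name:10s} standardized log-odds coefficient: {coef:+.2f}")
```

A positive standardized coefficient marks a cue that predicts being elected; comparing how often each gender uses the cues with large coefficients is then one way to test whether communication style can carry bias even under anonymity.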
This study investigates whether repeated experiences of rejection from leadership positions reduce future leadership entry and whether these dynamics differ by gender. I design a multi-round experiment in which participants repeatedly decide whether to apply for a leadership role, face acceptance or rejection through a performance-based lottery, and subsequently decide whether to reapply. I document three main findings. First, rejection discourages subsequent applications, but gender differences emerge only after repeated rejection: women are significantly less likely than men to persist after being rejected twice, whereas responses to an initial rejection are similar across genders. Second, rejection leads to substantial downward revisions in confidence. These changes matter for persistence, especially for women: conditional on similar confidence shocks, women’s reapplication decisions respond more strongly to changes in absolute confidence, although confidence updating does not fully account for the gender gap observed after repeated rejection. Finally, I show that providing counterfactual positive feedback to non-candidates increases subsequent entry, particularly among women. The findings have implications for the design of selection and feedback processes aimed at sustaining participation in leadership.
In the existing literature, "confidence" refers to a variety of phenomena and dimensions, each often measured in study-specific ways. Methodologically, a few studies have raised concerns about existing measures of confidence at both the theoretical and empirical levels (e.g., Olsson, 2014; Klayman et al., 1999), but without providing clear guidance on how to improve them. An important step towards addressing this question is the Confidence Database (Rahnev et al., 2020), which aggregates data from multiple studies; however, it remains limited to a single overconfidence dimension (overestimation). In economics, a recent study incidentally reported that overplacement and overconfidence may correlate differently with various dimensions of risk attitude, raising questions about the internal validity of a general concept of confidence (Dean & Ortoleva, 2019). Empirical work also shows inconsistencies: general confidence levels vary with task type, task difficulty, and the definition of overconfidence. Our aim is to address these inconsistencies and unknowns in several ways. By identifying key features of confidence and examining their relationships, we may reconcile conflicting evidence. Identifying latent factors underlying confidence will allow us to link them to personality traits and real-life behaviors, providing insights into confidence effects. This will offer critical evidence, currently lacking, on the internal and external validity of confidence measurements. Methodologically, we will address these open questions using a test/retest design on a representative general-population sample: by collecting data at two time points, we will both assess the generalizability and stability of confidence elicitation and mitigate potential biases from traditional student samples.