Understanding the underrepresentation of women in leadership roles is crucial for addressing gender inequality. This experimental project explores how gender differences in confidence contribute to the gender gap in willingness to lead (WTL) through two studies. In the first study, we examine WTL across two distinct leadership environments. In one, leaders’ payoffs depend on their ability to influence followers’ decisions, while in the other, leaders bear responsibility for making decisions on behalf of the group. While the gender gap in WTL is more pronounced in the responsibility condition, we find that confidence predicts WTL only in the influence condition, for both men and women. This suggests that confidence plays a larger role in leadership aspirations when leaders provide guidance rather than make binding decisions. Based on these findings, we designed a second study to further investigate the causal impact of confidence on WTL within the context of influence-based leadership. In a controlled laboratory setting, we vary participants’ confidence levels to measure their direct effect on leadership aspirations. Our results offer new evidence on how confidence and leadership context jointly shape gender differences in leadership ambitions.
Whether individuals feel confident that their own actions, choices, or statements are correct, and how these confidence levels differ between individuals, are two key primitives for countless behavioral theories and phenomena. In cognitive tasks, individual confidence is typically measured as the average of reports about choice accuracy, but the reliability of the resulting characterization of within- and between-individual confidence remains surprisingly undocumented. Here, we perform a large-scale resampling exercise in the Confidence Database to investigate the reliability of individual confidence estimates and of comparisons across individuals’ confidence levels. Our results show that confidence estimates are more stable than their choice-accuracy counterparts, reaching a reliability plateau after roughly 50 trials, regardless of a number of task design characteristics. While they constitute a reliability upper bound for task-based confidence measures, and thereby leave open the question of the reliability of the construct itself, these results characterize the robustness of past and future task designs.
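To make the idea of a resampling-based reliability estimate concrete, the following is a minimal sketch of one common approach, split-half reliability on random trial subsamples. It is an illustration under stated assumptions, not the procedure used in the study: the abstract does not specify the estimator, the function names (`split_half_reliability`, `simulate_ratings`) are hypothetical, and the ratings are simulated rather than drawn from the Confidence Database.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def split_half_reliability(ratings, n_trials, n_resamples=200, seed=0):
    """Average split-half correlation of mean confidence across participants.

    ratings: list of per-participant lists of trial-level confidence reports.
    n_trials: number of trials drawn (without replacement) per participant.
    """
    rng = random.Random(seed)
    half = n_trials // 2
    correlations = []
    for _ in range(n_resamples):
        first, second = [], []
        for trials in ratings:
            sample = rng.sample(trials, n_trials)  # random subsample of trials
            first.append(statistics.fmean(sample[:half]))
            second.append(statistics.fmean(sample[half:]))
        correlations.append(pearson(first, second))
    return statistics.fmean(correlations)


def simulate_ratings(n_participants=60, n_trials=200, seed=1):
    """Hypothetical data: a stable individual confidence level plus trial noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_participants):
        trait = rng.gauss(0.7, 0.1)  # assumed between-individual spread
        data.append([trait + rng.gauss(0.0, 0.2) for _ in range(n_trials)])
    return data


if __name__ == "__main__":
    ratings = simulate_ratings()
    for n in (10, 50, 100):
        print(n, round(split_half_reliability(ratings, n), 3))
```

Under these simulated assumptions, the estimate rises as more trials are sampled per participant, which is the kind of reliability-by-trial-count curve the abstract describes. The specific plateau location depends on the assumed noise levels and should not be read as a replication of the reported result.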
In the existing literature, “confidence” refers to a variety of phenomena and dimensions, each of which is often measured in study-specific ways. Methodologically, a few studies have raised concerns about existing measures of confidence, at both the theoretical and empirical levels (e.g., Olsson 2014; Klayman et al. 1999), but without providing clear guidance about how to improve them. An important step towards addressing this question is the Confidence Database (Rahnev et al. 2020), which performed a meta-analysis over multiple studies. However, it remains limited to a single overconfidence dimension (overestimation). In economics, a recent study incidentally reported that over-placement and overconfidence might correlate differently with various dimensions of risk attitude, raising questions about the internal validity of a general concept of confidence (Dean & Ortoleva 2019). Empirical work also shows inconsistencies: general confidence levels vary with task type, difficulty, and the definition of overconfidence. Our aim is to address these inconsistencies and unknowns in several ways. By recognizing key features of confidence and examining their relationships, we may reconcile conflicting evidence. Identifying latent factors underlying confidence will allow us to link them to personality traits and real-life behaviors, providing insights into confidence effects. This will offer critical evidence on the internal and external validity of confidence measurements, which is currently lacking. Methodologically, we will address open questions using a test/re-test design on a representative general population sample. By collecting these data at two time points, we will both assess the generalizability and stability of confidence elicitation and mitigate potential biases from traditional student samples.
This study investigates whether repeated experiences of rejection from leadership positions reduce future leadership ambitions and whether these dynamics differ by gender. We design an experiment in which participants repeatedly apply for leadership roles, experience acceptance or rejection, and decide whether to reapply in subsequent rounds. By leveraging a randomized selection mechanism, in which leader appointment is based either on performance or on random assignment, we can further study gender differences in the interpretation of rejection. Prior research suggests that women react more negatively to failure and are more likely than men to attribute rejection to a lack of ability. Based on these results, we hypothesize that, over multiple rounds, a gender gap in willingness to lead (WTL) will emerge and widen, as women disproportionately exit the leadership pipeline following rejection. If confirmed, these findings could contribute to the broader understanding of gender disparities in leadership aspirations and inform policy interventions designed to sustain women’s leadership trajectories by mitigating the discouraging effects of early setbacks.
Large language models (LLMs) are increasingly used to simulate human decision-making, yet their behavior in collective social contexts, where identity cues can shape group dynamics, remains underexplored. We manipulate pseudonymity in both human and LLM-based groups to evaluate the role of identity cues in collective decision-making. First, we present empirical results from a large-scale online leader selection task (Lost at Sea, N=748) with identified and pseudonymous treatment conditions. We then compare human behavior to LLM-based simulacra from Gemini 2.5, GPT-4.1, and Haiku 3.5 to assess how identity visibility affects alignment. Humans exhibited both a gender gap and an optimal leader gap in self-nomination and peer selection, which narrowed under pseudonymity. Some models closely mirrored these phenomena, while others appeared more meritocratic when identity cues were present, leveraging identity signals to compensate for social bias and electing more optimal leaders than humans as a result. When identity cues were removed, all models exhibited a male-skewed leadership preference. These findings highlight that behavioral alignment is not just a technical challenge, but a sociotechnical choice between mirroring biased human behavior and striving for normative improvements at the cost of divergence.