Center for Neuroscience and Behavior, English
Honors Computer Engineering
Individualized Studies, Entrepreneurship
Honors Linguistics, Media and communication
Honors Linguistics, French
The following is an image of the poster presented at the 2026 Undergraduate Research Forum.
Large language models (LLMs) generate human-like descriptions, but may reflect underlying social biases in language.
Adjectives are a useful lens because they encode both sentiment (positive/negative) and gender associations (masculine, feminine, neutral).
Comparing GPT’s adjective choices to human judgments provides a way to evaluate how closely model outputs align with human interpretations.
This study focuses on alignment in sentiment and gender coding.
Adapted from Williams & Bennett (1975).
We took the 57 adjectives that 75% of their participants agreed were masculine or feminine and re-tested them.
Task: Participants classified each adjective by gender and by sentiment
Gender: masculine/feminine/neutral
Sentiment: positive/negative/neutral
Analysis: We used the same 75% and 60% consensus thresholds as the previous study.
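As an illustration of how a consensus threshold can be applied, the sketch below labels an adjective with its most frequent response only when that response's share reaches the threshold. The function name `consensus_label`, the use of "at least" (>=) for the cutoff, and the sample ratings are assumptions made for this example; they are not the study's code or data.

```python
from collections import Counter

def consensus_label(responses, threshold=0.75):
    """Return the most common label if its share of responses meets
    the threshold (assumed inclusive); otherwise return None."""
    if not responses:
        return None
    label, count = Counter(responses).most_common(1)[0]
    return label if count / len(responses) >= threshold else None

# Hypothetical ratings for illustration only (not study data).
ratings = ["masculine"] * 8 + ["neutral"] * 2    # 80% masculine
split = ["feminine"] * 13 + ["neutral"] * 7      # 65% feminine

print(consensus_label(ratings, 0.75))  # meets the 75% threshold
print(consensus_label(split, 0.75))    # falls short of 75%
print(consensus_label(split, 0.60))    # meets the 60% threshold
```

Under these assumptions, an adjective can fail the stricter 75% threshold and still receive a label at 60%.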
LLM: GPT
Task: completion task
Design of the experiment:
two factors: gender and age
gender: 6 different names (e.g. Julia, Rachel, Rebecca)
age: 3 ages (29, 49, 69)
Prompts followed the structure:
{Name} is a {Age}-year-old {Occupation}. {Pronoun} is ____.
35 unique occupations for variation (e.g. Babysitter, Doctor, Journalist)
List of 57 adjectives to use for completion
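The design above can be sketched as a crossing of names, ages, and occupations filled into the prompt template. This is a minimal illustration, not the study's script: it uses only the three names and three occupations listed here, and the pronoun assigned to each name, the full crossing of factors, and the helper name `build_prompts` are assumptions.

```python
from itertools import product

TEMPLATE = "{name} is a {age}-year-old {occupation}. {pronoun} is ____."

# Only the examples listed on the poster; the study used 6 names
# and 35 occupations. Pronouns per name are assumed.
NAMES = {"Julia": "She", "Rachel": "She", "Rebecca": "She"}
AGES = [29, 49, 69]
OCCUPATIONS = ["Babysitter", "Doctor", "Journalist"]

def build_prompts(names, ages, occupations):
    """Cross every name, age, and occupation into one prompt each."""
    return [
        TEMPLATE.format(name=name, age=age, occupation=occ, pronoun=pronoun)
        for (name, pronoun), age, occ in product(names.items(), ages, occupations)
    ]

prompts = build_prompts(NAMES, AGES, OCCUPATIONS)
print(len(prompts))  # 3 names x 3 ages x 3 occupations
print(prompts[0])
```

If all factors were fully crossed, the complete design of 6 names, 3 ages, and 35 occupations would yield 6 x 3 x 35 = 630 prompts.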
GPT’s adjective choices weakly align with human classifications of gender, but more strongly align with those of sentiment.
When prompted to assign an adjective to a male target, GPT was more likely to choose one that is deemed traditionally masculine.
Meanwhile, for female targets, GPT was more likely to choose an adjective that is not deemed traditionally feminine.
Gender coding varied across age conditions, whereas sentiment remained consistently positive.
Overall, GPT tended to produce positive adjectives more reliably than gender-congruent language.
The model showed a notable limitation for female targets, often selecting adjectives not strongly associated with femininity.
These results suggest GPT may prioritize generally favorable or competence-related descriptors over socially gendered ones.
Because some human gender-label categories had limited data, these findings should be interpreted with caution.
GPT’s adjective choices aligned more strongly with human sentiment than with human gender judgments.
The model consistently produced positive descriptions.
Alignment was stronger for male targets than for female targets, with additional variation across age conditions.
These findings suggest that LLMs may reproduce stable positivity biases while showing weaker and less consistent alignment with human gendered interpretations of language.
[1] Zhao, J., Ding, Y., Jia, C., Wang, Y., & Qian, Z. (2024). Gender bias in Large Language Models across multiple languages. arXiv preprint arXiv:2403.00277.
[2] Williams, J. E., & Bennett, S. M. (1975). The definition of sex stereotypes via the adjective check list. Sex Roles, 1(4), 327-337.
Critical Thinking: Our team gathered and analyzed information from a diverse set of sources in order to propose and then research our topic.
Equity + Inclusion: By evaluating systematic gender and age bias in LLMs, our team demonstrated an awareness of and willingness to engage with issues relating to Equity + Inclusion.
Teamwork: Throughout the research process, our team collaborated in a fast-paced environment while respecting diverse personalities and sharing responsibilities.
Institutional Review Board Approval