He is Strong, She is Gentle: Gender Bias in Adjective Associations of Large Language Models

Research Team

Dr. So Young Lee

Center for Neuroscience and Behavior, English

Leah Parparov

Honors Computer Engineering

Andrew Rutt

Individualized Studies, Entrepreneurship

Megan Weaver

Honors Linguistics, Media and Communication

Maelynn Geoppinger

Honors Linguistics, French

The following is an image of the poster presented at the 2026 Undergraduate Research Forum.

Background & Research Question

Large language models (LLMs) generate human-like descriptions, but may reflect underlying social biases in language.
Adjectives are a useful lens because they encode both sentiment (positive/negative) and gender associations (masculine, feminine, neutral).
Comparing GPT’s adjective choices to human judgments provides a way to evaluate how closely model outputs align with human interpretations.
This study focuses on alignment in sentiment and gender coding.

Experiment 1 (Human)

Adapted from Williams & Bennett (1975).
We re-tested the 57 adjectives that 75% of their participants had agreed were masculine or feminine.
Task: Participants classified each adjective by gender and by sentiment.
- Gender: masculine/feminine/neutral
- Sentiment: positive/negative/neutral.
Analysis: We used the same 75% and 60% consensus thresholds as the previous study.
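
As a rough illustration of how a consensus threshold can be applied, the sketch below assigns a label to an adjective only when the most common response reaches a given share of all responses. The function name and the example responses are hypothetical and are not taken from the study; the poster does not specify how the thresholds were computed, so this is one plausible reading rather than the team's actual procedure.

```python
from collections import Counter

def consensus_label(responses, threshold=0.75):
    """Return the most common label if its share meets the threshold, else None."""
    if not responses:
        return None
    label, count = Counter(responses).most_common(1)[0]
    return label if count / len(responses) >= threshold else None

# Hypothetical responses for a single adjective (not data from the study).
example = ["masculine"] * 8 + ["neutral"] * 2
print(consensus_label(example, threshold=0.75))  # 8 of 10 agree, so a label is returned
print(consensus_label(example, threshold=0.90))  # share is below 0.90, so None
```

Under this reading, the same responses can yield a label at the 60% threshold and no label at the 75% threshold, which is why the two thresholds are reported separately.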

Experiment 2 (LLMs)

Model: GPT
Task: completion task
Design of the experiment:
- two factors: gender and age
  - gender: 6 different names (e.g., Julia, Rachel, Rebecca)
  - age: 3 ages (29, 49, 69)
Prompts followed the structure:
{Name} is a {Age}-year-old {Occupation}. {Pronoun} is ____.
35 unique occupations for variation (e.g. Babysitter, Doctor, Journalist)
List of 57 adjectives to use for completion
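
The prompt template above can be filled in by crossing names, ages, and occupations. The following is a minimal sketch that uses only the names, ages, and occupations quoted on the poster; the full lists of 6 names and 35 occupations are not given, so they are not reproduced here. The pronoun assigned to each name and the helper name `build_prompts` are assumptions made for illustration.

```python
from itertools import product

# Only values quoted on the poster; the complete lists are not given there.
# The pronoun paired with each name is an assumption for this sketch.
NAMES = {"Julia": "She", "Rachel": "She", "Rebecca": "She"}
AGES = [29, 49, 69]
OCCUPATIONS = ["babysitter", "doctor", "journalist"]

def build_prompts(names, ages, occupations):
    """Fill the template once for every name x age x occupation combination."""
    prompts = []
    for (name, pronoun), age, occupation in product(names.items(), ages, occupations):
        prompts.append(f"{name} is a {age}-year-old {occupation}. {pronoun} is ____.")
    return prompts

prompts = build_prompts(NAMES, AGES, OCCUPATIONS)
print(len(prompts))   # 3 names x 3 ages x 3 occupations
print(prompts[0])
```

With the study's full design, the same crossing would be applied to all 6 names, 3 ages, and 35 occupations, with the model asked to fill the blank from the list of 57 adjectives.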

Discussion

GPT’s adjective choices weakly align with human classifications of gender, but more strongly align with those of sentiment.
When prompted to assign an adjective to a male target, GPT was more likely to choose one deemed traditionally masculine.
For female targets, by contrast, GPT was more likely to choose an adjective not deemed traditionally feminine.
Gender coding varied across age conditions, whereas sentiment remained consistently positive.
Overall, GPT tended to produce positive adjectives more reliably than gender-congruent language.
The model showed a notable limitation for female targets, often selecting adjectives not strongly associated with femininity.
These results suggest GPT may prioritize generally favorable or competence-related descriptors over socially gendered ones.
Because some human gender-label categories had limited data, these findings should be interpreted with caution.

Conclusion

GPT’s adjective choices aligned more strongly with human sentiment than with human gender judgments.
The model consistently produced positive descriptions.
Alignment was stronger for male targets than for female targets, with additional variation across age conditions.
These findings suggest that LLMs may reproduce stable positivity biases while showing weaker and less consistent alignment with human gendered interpretations of language.

Selected References

[1] Zhao, J., Ding, Y., Jia, C., Wang, Y., & Qian, Z. (2024). Gender bias in Large Language Models across multiple languages. arXiv preprint arXiv:2403.00277.

[2] Williams, J. E., & Bennett, S. M. (1975). The definition of sex stereotypes via the adjective check list. Sex Roles, 1(4), 327-337.

NACE Career Readiness Competencies

Critical Thinking: Our team gathered and analyzed information from a diverse set of sources in order to propose and then research our topic.

Equity + Inclusion: By evaluating systematic gender and age bias in LLMs, our team demonstrated an awareness of and willingness to engage with issues relating to Equity + Inclusion.

Teamwork: Throughout the research process, our team exercised the ability to collaborate with other team members in a fast-paced environment while respecting diverse personalities and sharing responsibilities.

Research Compliance Protocols

Institutional Review Board Approval
