Moral Foundations of Large Language Models

Marwa Abdulhai, Gregory Serapio-Garcia, Clément Crepy,
Daria Valter, John Canny, Natasha Jaques

UC Berkeley, University of Cambridge, Google Research

arXiv | Code | BibTeX

Overview

Moral foundations theory (MFT) is a psychological assessment tool that decomposes human moral reasoning into five factors, including care/harm, liberty/oppression, and sanctity/degradation. People vary in the weight they place on these dimensions when making moral decisions, in part due to their cultural upbringing and political ideology. Because large language models (LLMs) are trained on datasets collected from the internet, they may reflect the biases present in such corpora. Our contributions are: (1) measuring the moral foundations exhibited by popular LLMs and comparing them to those of human political and cultural groups; (2) assessing how consistent these moral foundations remain across different prompting contexts; (3) showing that prompts can be deliberately selected to push a model toward a particular moral foundation; and (4) demonstrating that the moral foundation a model adopts affects its behavior on a downstream donation task.

We hope these findings help illustrate the potential risks and unintended consequences of LLMs assuming a particular moral stance. 

Method

We conduct a series of experiments analyzing the moral foundations of LLMs as a lens into the values they have encoded from training data and may reflect in unforeseen tasks.
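
Concretely, these experiments rely on administering the Moral Foundations Questionnaire (MFQ) to each model and aggregating its answers into per-foundation scores (the MFQ scores referred to below). The following is a minimal sketch of such a scoring loop, assuming a hypothetical query_llm helper and a heavily abbreviated item list; it is not the exact questionnaire or API used in the paper.

# Minimal sketch of MFQ-style scoring for an LLM. `query_llm` and the
# abbreviated item lists are hypothetical placeholders.
from statistics import mean

FOUNDATION_ITEMS = {
    "care": ["Whether or not someone suffered emotionally"],
    "fairness": ["Whether or not some people were treated differently than others"],
    "loyalty": ["Whether or not someone's action showed love for their country"],
    "authority": ["Whether or not someone showed a lack of respect for authority"],
    "sanctity": ["Whether or not someone violated standards of purity and decency"],
}

def query_llm(prompt: str) -> int:
    """Placeholder: return the model's 0-5 relevance rating for one item."""
    raise NotImplementedError

def mfq_scores(preamble: str = "") -> dict:
    """Average the model's ratings within each foundation."""
    return {
        foundation: mean(query_llm(preamble + item) for item in items)
        for foundation, items in FOUNDATION_ITEMS.items()
    }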

Question 1: Do moral foundations exhibited by LLMs demonstrate a cultural or political bias?

We apply t-SNE to reduce moral foundations scores to two dimensions and plot the location of different human populations alongside the LLMs. Each LLM is given either no prompt (the default model) or a political prompt. Human data is shown in blue and comes from psychology studies of human participants in different demographics (anonymous online participants, US participants, and Korean participants), who self-reported their political affiliation (Graham et al., 2009; Kim et al., 2012).
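
As an illustration of this projection step, the sketch below embeds five-dimensional MFQ score vectors in two dimensions with scikit-learn's t-SNE; the labels and score vectors are illustrative placeholders, not the values reported in the paper.

# Sketch: embed 5-dimensional MFQ score vectors (humans + LLM prompts) in 2-D
# with t-SNE. The values below are illustrative placeholders only.
import numpy as np
from sklearn.manifold import TSNE

labels = ["human_liberal", "human_conservative", "gpt3_default", "gpt3_political_prompt"]
mfq_vectors = np.array([
    [3.7, 3.8, 2.1, 2.0, 1.5],   # columns: care, fairness, loyalty, authority, sanctity
    [3.0, 3.1, 3.2, 3.3, 3.0],
    [3.2, 3.3, 2.8, 2.9, 2.6],
    [3.1, 3.0, 3.1, 3.2, 2.9],
])

# Perplexity must be smaller than the number of points.
embedding = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(mfq_vectors)
for name, (x, y) in zip(labels, embedding):
    print(f"{name}: ({x:.2f}, {y:.2f})")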

Figure: MFQ scores across self-reported political affiliation. (a) Anonymous-participant human study from Graham et al. (2009) vs. (b) GPT-3 DaVinci2 (Brown et al., 2020).


We compute the absolute difference between the moral foundation scores of each LLM and the scores measured for a range of political affiliations in human studies of anonymous participants (Graham et al., 2009) and of US and Korean participants (Kim et al., 2012). The lowest value for each model, indicating the closest human group, is bolded.
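
A sketch of this comparison, with illustrative placeholder values standing in for the published human MFQ profiles:

# Sketch: compare an LLM's MFQ profile to human reference profiles by mean
# absolute difference across the five foundations. The human values here are
# illustrative placeholders, not the scores from Graham et al. / Kim et al.
import numpy as np

human_profiles = {
    "liberal":      np.array([3.7, 3.8, 2.1, 2.1, 1.4]),
    "moderate":     np.array([3.3, 3.4, 2.8, 2.9, 2.6]),
    "conservative": np.array([3.0, 3.1, 3.1, 3.3, 3.0]),
}

def closest_affiliation(llm_profile: np.ndarray) -> tuple:
    """Return the human group whose MFQ profile is nearest to the LLM's."""
    errors = {
        group: float(np.mean(np.abs(llm_profile - profile)))
        for group, profile in human_profiles.items()
    }
    best = min(errors, key=errors.get)
    return best, errors[best]

print(closest_affiliation(np.array([3.2, 3.3, 2.9, 3.0, 2.7])))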

Question 2: Do LLMs remain consistent with their moral foundations across different contexts?

We assess consistency in moral foundations by prompting the LLM with 50 randomly sampled book dialogues from the BookCorpus dataset (Zhu et al., 2015) and observing the resulting distribution of moral foundations scores.
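
A sketch of this consistency probe, assuming hypothetical sample_dialogues and mfq_scores helpers (the latter as sketched earlier):

# Sketch: prepend randomly sampled dialogue snippets to the questionnaire and
# summarize the spread of the resulting per-foundation scores.
# `sample_dialogues` and `mfq_scores` are hypothetical placeholders.
from statistics import mean, stdev

def sample_dialogues(n: int) -> list:
    """Placeholder: draw n random dialogue snippets from a corpus such as BookCorpus."""
    raise NotImplementedError

def mfq_scores(preamble: str) -> dict:
    """Placeholder: per-foundation MFQ scores with `preamble` prepended (see earlier sketch)."""
    raise NotImplementedError

def consistency_summary(n_prompts: int = 50) -> dict:
    per_foundation = {}
    for dialogue in sample_dialogues(n_prompts):
        for foundation, value in mfq_scores(dialogue + "\n").items():
            per_foundation.setdefault(foundation, []).append(value)
    return {
        f: {"mean": mean(vals), "std": stdev(vals)}
        for f, vals in per_foundation.items()
    }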

Question 3: Can we reliably change the moral reasoning of the LLM in predictable ways?

We select, for each moral foundation, the prompt that maximizes the model's score on that specific moral foundation.
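
A sketch of this selection step, assuming a small hypothetical pool of candidate prompts and the mfq_scores placeholder from the earlier sketches:

# Sketch: among a pool of candidate prompts, pick the one that maximizes the
# model's score on each individual foundation. The candidate prompts and
# `mfq_scores` are hypothetical placeholders.
def mfq_scores(preamble: str) -> dict:
    """Placeholder: per-foundation MFQ scores with `preamble` prepended (see earlier sketch)."""
    raise NotImplementedError

CANDIDATE_PROMPTS = [
    "You are a deeply compassionate person.",
    "You believe loyalty to your group matters above all.",
    "You believe rules and traditions must be respected.",
]

def maximizing_prompts(candidates=CANDIDATE_PROMPTS) -> dict:
    """Map each foundation to the candidate prompt with the highest score."""
    scored = {p: mfq_scores(p + "\n") for p in candidates}
    foundations = next(iter(scored.values())).keys()
    return {
        f: max(candidates, key=lambda p: scored[p][f])
        for f in foundations
    }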

Question 4: Do different moral foundations lead to different behavior in downstream tasks?

We show the prompt that was found to maximize the model's weight on each moral foundation. We then show that, on the downstream donation task, the donation amount output by an LLM differs significantly depending on the moral foundation scores it obtains.
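
A sketch of the downstream measurement, assuming a hypothetical generate helper and scenario wording; the paper's actual donation task prompt may differ.

# Sketch: measure downstream behaviour by asking the steered model how much it
# would donate and parsing the dollar amount from its free-text reply.
# `generate` and the scenario text are hypothetical placeholders.
import re

def generate(prompt: str) -> str:
    """Placeholder for an LLM text-completion call."""
    raise NotImplementedError

DONATION_SCENARIO = (
    "A charity asks you to donate to help victims of a natural disaster. "
    "You have $100. How many dollars do you donate? Answer with a number."
)

def donation_amount(steering_prompt: str):
    """Return the first dollar amount in the model's reply, or None if absent."""
    reply = generate(steering_prompt + "\n" + DONATION_SCENARIO)
    match = re.search(r"\$?\s*(\d+(?:\.\d+)?)", reply)
    return float(match.group(1)) if match else None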

BibTeX


@misc{abdulhai2023moral,
      title={Moral Foundations of Large Language Models},
      author={Marwa Abdulhai and Gregory Serapio-Garcia and Clément Crepy and Daria Valter and John Canny and Natasha Jaques},
      year={2023},
      eprint={2310.15337},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}