Here is a tentative list of topics that we plan to cover this semester. Please scroll down for the list of papers associated with these topics.
How to align LLMs to human judgment?
Social Intelligence in LLMs
Impact of human-LLM alignment
Can LLMs reason like humans?
Do LLMs understand human culture?
Can LLMs work with humans?
How to make LLMs safer for societal deployment?
The following is the recommended reading list for individual topics. It is only a suggestion; presenters are free to choose other relevant or interesting papers. Because of time constraints, we will not cover all of the papers listed here.
How to align LLMs to human judgment?
Instruction tuning
Jason Wei et al., “Finetuned Language Models Are Zero-Shot Learners”, arXiv 2021
Po-Nien Kung, Fan Yin, Di Wu, Kai-Wei Chang, and Nanyun Peng, "Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks", EMNLP 2023
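For presenters new to this area: instruction tuning is plain supervised fine-tuning on (instruction, response) pairs, i.e., next-token prediction on the response conditioned on the instruction. A rough sketch of the objective (generic notation, not tied to any one paper above):
\[
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,\, y) \sim \mathcal{D}} \Big[ \sum_{t} \log \pi_\theta\big(y_t \mid x,\, y_{<t}\big) \Big]
\]
where x is the instruction (plus any input), y is the reference response, and π_θ is the language model being tuned.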
RLHF
Ouyang et al., "Aligning language models to follow instructions", OpenAI 2022
Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier, "A Survey of Reinforcement Learning from Human Feedback", arXiv 2023
Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, Abhinav Rastogi, "RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback", arXiv 2023
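For orientation: the RL stage of RLHF (as in Ouyang et al. above) roughly maximizes a learned reward while penalizing divergence from the reference (SFT) policy:
\[
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ r_\phi(x, y) \big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[ \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big]
\]
Here r_φ is a reward model fit to human preference comparisons, π_ref is the SFT model, and β sets the strength of the KL penalty. This is a simplified sketch; it omits details such as the pretraining-mix term used in InstructGPT.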
DPO and friends
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn, "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", NeurIPS 2023
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston, “Self-Rewarding Language Models”, arXiv 2024
Huayu Chen, Guande He, Lifan Yuan, Ganqu Cui, Hang Su, Jun Zhu, "Noise Contrastive Alignment of Language Models with Explicit Rewards", arXiv 2024
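For orientation: DPO replaces the separate reward model and RL step with a single classification-style loss over preference pairs. In the (approximate) notation of Rafailov et al.:
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]
where y_w and y_l are the preferred and dispreferred responses for prompt x, σ is the logistic sigmoid, and β plays a role analogous to the RLHF KL coefficient.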
URIAL
Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, Yejin Choi, "The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning", ICLR 2024
Retrieval-augmented ICL
Xiaochuang Han, "In-context alignment: Chat with vanilla language models before finetuning", arXiv:2308.04275, 2023
Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng, Yuandong Tian, "RLCD: Reinforcement Learning from Contrastive Distillation for Language Model Alignment", ICLR 2024
Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. "Large language models can self-improve", arXiv preprint arXiv:2210.11610, 2022.
Improving Alignment
Jiahao Wang, Bolin Zhang, Qianlong Du, Jiajun Zhang, Dianhui Chu, “A Survey on Data Selection for LLM Instruction Tuning”, arXiv 2024 (a comprehensive survey of sample-selection strategies)
Lichang Chen, Jiuhai Chen, Chenxi Liu, John Kirchenbauer, Davit Soselia, Chen Zhu, Tom Goldstein, Tianyi Zhou, Heng Huang, “OPTune: Efficient Online Preference Tuning”, arXiv 2024 (sample selection to improve the efficiency of online DPO)
Xinyu Lin, Wenjie Wang, Yongqi Li, Shuo Yang, Fuli Feng, Yinwei Wei, Tat-Seng Chua, “Data-efficient Fine-tuning for LLM-based Recommendation”, SIGIR 2024 (applies a sample-selection strategy to item recommendation)
Liyan Tang, Philippe Laban, Greg Durrett, “MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents”, arXiv 2024 (synthesizes data for fact verification, then uses it to fine-tune smaller models)
Saumya Gandhi, Ritu Gala, Vijay Viswanathan, Tongshuang Wu, Graham Neubig, “Better Synthetic Data by Retrieving and Transforming Existing Datasets”, arXiv 2024 (retrieves similar human-written data to improve the diversity and quality of synthetic data)
Social Intelligence in LLMs
Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M. Dai, Diyi Yang, Soroush Vosoughi, "Training Socially Aligned Language Models on Simulated Social Interactions", ICLR 2024
Natalie Shapira, Mosh Levy, Hossein Seyed Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, Maarten Sap & Vered Shwartz, "Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models", EACL 2024
A. Rao Vijjini, R. R. Menon, S. Srivastava, S. Chaturvedi, "SocialGaze: Improving the Integration of Human Social Norms in Large Language Models", EMNLP Findings 2024
Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig & Maarten Sap, "SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents", ICLR 2024
Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Le Bras, Gunhee Kim, Yejin Choi & Maarten Sap, "FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions", EMNLP 2023
Ashutosh Dwivedi, Pradhyumna Lavania, Ashutosh Modi, "EtiCor: Corpus for Analyzing LLMs for Etiquettes", EMNLP 2023
Impact of human-LLM alignment
Michael J. Ryan, William Held, Diyi Yang, "Unintended Impacts of LLM Alignment on Global Representation", ACL 2024
Gabrielle Kaili-May Liu, "Perspectives on the Social Impacts of Reinforcement Learning with Human Feedback", arXiv 2023
Can LLMs reason like humans?
Mohsen Fayyaz, Fan Yin, Jiao Sun, Nanyun Peng, “Evaluating Human Alignment and Model Faithfulness of LLM Rationale”, arXiv 2024
Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi, "How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs", ACL 2024
Andreas Opedal, Alessandro Stolfo, Haruki Shirakami, Ying Jiao, Ryan Cotterell, Bernhard Schölkopf, Abulhair Saparov, Mrinmaya Sachan, "Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners?", ICML 2024
Do LLMs understand human culture?
Badr AlKhamissi, Muhammad ElNokrashy, Mai Alkhamissi, Mona T. Diab, "Investigating Cultural Alignment of Large Language Models", ACL 2024
Amith Ananthram, Elias Stengel-Eskin, Carl Vondrick, Mohit Bansal, Kathleen McKeown, "See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding", arXiv 2024
Samuel Cahyawijaya, Delong Chen, Yejin Bang, Leila Khalatbari, Bryan Wilie, Ziwei Ji, Etsuko Ishii, Pascale Fung, "High-Dimension Human Value Representation in Large Language Models", arXiv 2024
Nithish Kannen, Arif Ahmad, Marco Andreetto, Vinodkumar Prabhakaran, Utsav Prabhu, Adji Bousso Dieng, Pushpak Bhattacharyya, Shachi Dave, "Beyond Aesthetics: Cultural Competence in Text-to-Image Models", arXiv 2024
Akshita Jha, Vinodkumar Prabhakaran, Remi Denton, Sarah Laszlo, Shachi Dave, Rida Qadri, Chandan K. Reddy, Sunipa Dev, "ViSAGe: A Global-Scale Analysis of Visual Stereotypes in Text-to-Image Generation", ACL 2024
Mukul Bhutani, Kevin Robinson, Vinodkumar Prabhakaran, Shachi Dave, Sunipa Dev, "SeeGULL Multilingual: a Dataset of Geo-Culturally Situated Stereotypes", arXiv 2024
Akshita Jha, Aida Mostafazadeh Davani, Chandan K. Reddy, Shachi Dave, Vinodkumar Prabhakaran, Sunipa Dev, "SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models", ACL 2023
Can LLMs work with humans?
Ella Li, Taiwei Shi, Caleb Ziems, Min-Yen Kan, Nancy F. Chen, Zhengyuan Liu, Diyi Yang, "CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation", EMNLP 2023
Shashank Gupta, Vaishnavi Shrivastava, Ameet Deshpande, Ashwin Kalyan, Peter Clark, Ashish Sabharwal, and Tushar Khot, "Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs", ICLR 2024
Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan, "Toxicity in ChatGPT: Analyzing Persona-assigned Language Models", EMNLP Findings 2023
Tilman Beck, Hendrik Schuff, Anne Lauscher, Iryna Gurevych, "Sensitivity, Performance, Robustness: Deconstructing the Effect of Sociodemographic Prompting", EACL 2024 (Social Impact Award)
Kaitlyn Zhou, Jena D. Hwang, Xiang Ren, Nouha Dziri, Dan Jurafsky, Maarten Sap, "Rel-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance", arXiv 2024
Kaitlyn Zhou, Jena D Hwang, Xiang Ren & Maarten Sap, "Relying on the Unreliable: The Impact of Language Models' Reluctance to Express Uncertainty", ACL 2024
Shehzaad Dhuliawala, Vilém Zouhar, Mennatallah El-Assady, Mrinmaya Sachan, "A Diachronic Perspective on User Trust in AI under Uncertainty", EMNLP 2023
Peng Cui, Vilém Zouhar, Xiaoyu Zhang, Mrinmaya Sachan, "How to Engage Your Readers? Generating Guiding Questions to Promote Active Reading", ACL 2024
How to make LLMs safer for societal deployment?
San Kim, Gary Lee, "Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents", NAACL Findings 2024
Luiza Pozzobon, Beyza Ermis, Patrick Lewis, Sara Hooker, "GOODTRIEVER: Adaptive Toxicity Mitigation with Retrieval-augmented Models", EMNLP Findings 2023
Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, Nanyun Peng, "On Prompt-Driven Safeguarding for Large Language Models", ICML 2024
Xiaochen Li, Zheng-Xin Yong, Stephen H. Bach, “Preference Tuning For Toxicity Mitigation Generalizes Across Languages”, arXiv 2024
Mengru Wang, Ningyu Zhang, Ziwen Xu, Zekun Xi, Shumin Deng, Yunzhi Yao, Qishen Zhang, Linyi Yang, Jindong Wang, Huajun Chen, "Detoxifying Large Language Models via Knowledge Editing", ACL 2024
Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, Yaodong Yang, "BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset", NeurIPS 2023
Yuntao Bai et al., "Constitutional AI: Harmlessness from AI Feedback", arXiv 2022
Shangbin Feng, Chan Young Park, Yuhan Liu, Yulia Tsvetkov, "From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models", ACL 2023 (analyzes the effect of pretraining data on political bias)
Yuanshun Yao, Xiaojun Xu, Yang Liu, "Large Language Model Unlearning", ICLR 2024 (an example of applying unlearning to LLMs to reduce the generation of copyrighted or harmful content)
Here we will maintain a list of papers that present relevant datasets/resources. This document will be populated with YOUR help. Please feel free to start contributing.