Accepted Papers can be found here: https://openreview.net/group?id=ICML.cc/2024/Workshop/MFHAIA
RLHF and IIA: Perverse Incentives Wanqiao Xu · Shi Dong · Xiuyuan Lu · Grace Lam · Zheng Wen · Benjamin Van Roy
AI Alignment with Changing and Influenceable Reward Functions Micah Carroll · Davis Foote · Anand Siththaranjan · Stuart Russell · Anca Dragan
Modeling the Plurality of Human Preferences via Ideal Points Daiwei Chen · Yi Chen · Aniket Rege · Ramya Vinayak
Hummer: Towards Limited Competitive Preference Dataset Li Jiang · Yusen Wu · Junwu Xiong · Jingqing Ruan · Yichuan Ding · Qingpei Guo · Zujie Wen · Jun Zhou · Xiaotie Deng
Prompt Optimization with Human Feedback Xiaoqiang Lin · Zhongxiang Dai · Arunesh Sinha · See-Kiong Ng · Patrick Jaillet · Bryan Kian Hsiang Low
Scalable Oversight by Accounting for Unreliable Feedback Shivam Singhal · Cassidy Laidlaw · Anca Dragan
Preference Learning Algorithms Do Not Learn Preference Rankings Angelica Chen · Sadhika Malladi · Lily Zhang · Xinyi Chen · Richard Zhang · Rajesh Ranganath · Kyunghyun Cho
MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences Souradip Chakraborty · Jiahao Qiu · Hui Yuan · Alec Koppel · Furong Huang · Dinesh Manocha · Amrit Singh Bedi · Mengdi Wang
Note that the oral presentations listed above are also assigned a poster session below.
Scalably Solving Assistance Games Cassidy Laidlaw · Eli Bronstein · Timothy Guo · Dylan Feng · Lukas Berglund · Justin Svegliato · Stuart Russell · Anca Dragan
Preference Elicitation for Offline Reinforcement Learning Alizée Pace · Bernhard Schölkopf · Gunnar Rätsch · Giorgia Ramponi
AI Alignment with Changing and Influenceable Reward Functions Micah Carroll · Davis Foote · Anand Siththaranjan · Stuart Russell · Anca Dragan
Learning to Assist Humans without Inferring Rewards Vivek Myers · Evan Ellis · Benjamin Eysenbach · Sergey Levine · Anca Dragan
Reinforcement Learning from Human Text Feedback: Learning a Reward Model from Human Text Input Belen Martin Urcelay · Andreas Krause · Giorgia Ramponi
Multi-Agent Imitation Learning: Value is Easy, Regret is Hard Jingwu Tang · Gokul Swamy · Fei Fang · Steven Wu
Models That Prove Their Own Correctness Noga Amit · Shafi Goldwasser · Orr Paradise · Guy Rothblum
Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment Yuu Jinnai · Tetsuro Morimura · Kaito Ariu · Kenshi Abe
PIPER: Primitive-Informed Preference-based Hierarchical Reinforcement Learning via Hindsight Relabeling Utsav Singh · Wesley A. Suttle · Brian Sadler · Vinay Namboodiri · Amrit Singh Bedi
Modeling the Plurality of Human Preferences via Ideal Points Daiwei Chen · Yi Chen · Aniket Rege · Ramya Vinayak
Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms Rafael Rafailov · Yaswanth Chittepu · Ryan Park · Harshit Sikchi · Joey Hejna · William Knox · Chelsea Finn · Scott Niekum
DPM: Dual Preferences-based Multi-Agent Reinforcement Learning Sehyeok Kang · Yongsik Lee · Se-Young Yun
MultiScale Policy Learning for Alignment with Long Term Objectives Richa Rastogi · Yuta Saito · Thorsten Joachims
Towards Aligning Language Models with Textual Feedback Saüc Abadal · Shehzaad Dhuliawala · Keerthiram Murugesan · Mrinmaya Sachan
Bootstrapping Language Models with DPO Implicit Rewards Changyu Chen · Zichen Liu · Chao Du · Tianyu Pang · Qian Liu · Arunesh Sinha · Pradeep Varakantham · Min Lin
MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences Souradip Chakraborty · Jiahao Qiu · Hui Yuan · Alec Koppel · Furong Huang · Dinesh Manocha · Amrit Singh Bedi · Mengdi Wang
Filtered Direct Preference Optimization Tetsuro Morimura · Mitsuki Sakamoto · Yuu Jinnai · Kenshi Abe · Kaito Ariu
Optimal Design for Human Feedback Subhojyoti Mukherjee · Anusha Lalitha · Kousha Kalantari · Aniket Anand Deshmukh · Ge Liu · Yifei Ma · Branislav Kveton
Aligning Crowd Feedback via Distributional Preference Reward Modeling Dexun Li · Cong Zhang · Kuicai Dong · Derrick Goh Xin Deik · Ruiming Tang · Yong Liu
New Desiderata for Direct Preference Optimization Xiangkun Hu · Tong He · David Wipf
Accelerating Best-of-N via Speculative Rejection Ruiqi Zhang · Momin Haider · Ming Yin · Jiahao Qiu · Mengdi Wang · Peter Bartlett · Andrea Zanette
A Theoretical Framework for Partially Observed Reward-States in RLHF Chinmaya Kausik · Mirco Mutti · Aldo Pacchiano · Ambuj Tewari
Weak-to-Strong Extrapolation Expedites Alignment Chujie Zheng · Ziqi Wang · Heng Ji · Minlie Huang · Nanyun Peng
Inverse Reinforcement Learning from Demonstrations for LLM Alignment Hao Sun · M. van der Schaar
RLHF and IIA: Perverse Incentives Wanqiao Xu · Shi Dong · Xiuyuan Lu · Grace Lam · Zheng Wen · Benjamin Van Roy
Is a Good Description Worth a Thousand Pictures? Reducing Multimodal Alignment to Text-Based, Unimodal Alignment Amin Memarian · Touraj Laleh · Irina Rish · Ardavan S. Nobandegani
Generalizing Offline Alignment Theoretical Paradigm with Diverse Divergence Constraints Haoyuan Sun · Yuxin Zheng · Yifei Zhao · Yongzhe Chang · Xueqian Wang
Comparing Comparisons: Informative and Easy Human Feedback with Distinguishability Queries Xuening Feng · Zhaohui Jiang · Timo Kaufmann · Eyke Hüllermeier · Paul Weng · Yifei Zhu
Learning the eye of the beholder: Statistical modeling and estimation for personalized color perception Xuanzhou Chen · Austin Xu · Jingyan Wang · Ashwin Pananjady
Informed Meta-Learning Katarzyna Kobalczyk · M. van der Schaar
Off-Policy Evaluation from Logged Human Feedback Aniruddha Bhargava · Lalit Jain · Branislav Kveton · Ge Liu · Subhojyoti Mukherjee
Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation Katie Collins · Najoung Kim · Yonatan Bitton · Verena Rieser · Shayegan Omidshafiei · Yushi Hu · Sherol Chen · Senjuti Dutta · Minsuk Chang · Kimin Lee · Youwei Liang · Georgina Evans · Sahil Singla · Gang Li · Adrian Weller · Junfeng He · Deepak Ramachandran · Krishnamurthy Dvijotham
Concept-Based Interpretable Reinforcement Learning with Limited to No Human Labels Zhuorui Ye · Stephanie Milani · Fei Fang · Geoff Gordon
Uncertainty-aware Preference Alignment in Reinforcement Learning from Human Feedback Sheng Xu · Bo Yue · Hongyuan Zha · Guiliang Liu
Language Alignment via Nash-learning and Adaptive feedback Ari Azarafrooz · Farshid Faal
Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping Haoyu Wang · Guozheng Ma · Ziqiao Meng · Zeyu Qin · Li Shen · Zhong Zhang · Bingzhe Wu · Liu Liu · Yatao Bian · Tingyang Xu · Xueqian Wang · Peilin Zhao
Efficient Inverse Reinforcement Learning without Compounding Errors Nicolas Espinosa Dice · Gokul Swamy · Sanjiban Choudhury · Wen Sun
Revisiting Successor Features for Inverse Reinforcement Learning Arnav Kumar Jain · Harley Wiltzer · Jesse Farebrother · Irina Rish · Glen Berseth · Sanjiban Choudhury
DPO Meets PPO: Reinforced Token Optimization for RLHF Han Zhong · Guhao Feng · Wei Xiong · Xinle Cheng · Li Zhao · Di He · Jiang Bian · Liwei Wang
AMBER: An Entropy Maximizing Environment Design Algorithm for Inverse Reinforcement Learning Paul Nitschke · Lars L. Ankile · Eura Nofshin · Siddharth Swaroop · Finale Doshi-Velez · Weiwei Pan
Stochastic Concept Bottleneck Models Moritz Vandenhirtz · Sonia Laguna · Ričards Marcinkevičs · Julia Vogt
Hummer: Towards Limited Competitive Preference Dataset Li Jiang · Yusen Wu · Junwu Xiong · Jingqing Ruan · Yichuan Ding · Qingpei Guo · Zujie Wen · Jun Zhou · Xiaotie Deng
Is poisoning a real threat to LLM alignment? Maybe more so than you think Pankayaraj Pathmanathan · Souradip Chakraborty · Xiangyu Liu · Yongyuan Liang · Furong Huang
Distributional Preference Alignment of LLMs via Optimal Transport Igor Melnyk · Youssef Mroueh · Brian Belgodere · Mattia Rigotti · Apoorva Nitsure · Mikhail Yurochkin · Kristjan Greenewald · Jiri Navratil · Jarret Ross
Scalable Oversight by Accounting for Unreliable Feedback Shivam Singhal · Cassidy Laidlaw · Anca Dragan
Enhancing Intent Understanding for Ambiguous prompt: A Human-Machine Co-Adaption Strategy Yangfan He · Yuxuan Bai · Tianyu Shi
Relatively Rational: Learning Utilities and Rationalities Jointly from Pairwise Preferences Taku Yamagata · Tobias Oberkofler · Timo Kaufmann · Viktor Bengs · Eyke Hüllermeier · Raul Santos-Rodriguez
Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment Zhaofeng Wu · Ananth Balashankar · Yoon Kim · Jacob Eisenstein · Ahmad Beirami
Preference Learning Algorithms Do Not Learn Preference Rankings Angelica Chen · Sadhika Malladi · Lily Zhang · Xinyi Chen · Richard Zhang · Rajesh Ranganath · Kyunghyun Cho
Order-Optimal Instance-Dependent Bounds for Offline Reinforcement Learning with Preference Feedback Zhirui Chen · Vincent Tan
Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization Hritik Bansal · Ashima Suvarna · Gantavya Bhatt · Nanyun Peng · Kai-Wei Chang · Aditya Grover
Aligning Large Language Models with Representation Editing: A Control Perspective Lingkai Kong · Haorui Wang · Wenhao Mu · Yuanqi Du · Yuchen Zhuang · Yifei Zhou · Yue Song · Rongzhi Zhang · Kai Wang · Chao Zhang
Cross-Domain Knowledge Transfer for RL via Preference Consistency Ting-Hsuan Huang · Ping-Chun Hsieh
Adversarial Multi-dueling Bandits Pratik Gajane
REBEL: Reinforcement Learning via Regressing Relative Rewards Zhaolin Gao · Jonathan Chang · Wenhao Zhan · Owen Oertell · Gokul Swamy · Kianté Brantley · Thorsten Joachims · Drew Bagnell · Jason Lee · Wen Sun
Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents David Hyland · Tomáš Gavenčiak · Lancelot Da Costa · Conor Heins · Vojtech Kovarik · Julian Gutierrez · Michael Wooldridge · Jan Kulveit
Towards Safe Large Language Models for Medicine Tessa Han · Aounon Kumar · Chirag Agarwal · Himabindu Lakkaraju
Query Design for Crowdsourced Clustering: Effect of Cognitive Overload and Contextual Bias Yi Chen · Ramya Vinayak
"You just can’t go around killing people'' Explaining Agent Behavior to a Human Terminator Uri Menkes · Ofra Amir · Assaf Hallak
Prompt Optimization with Human Feedback Xiaoqiang Lin · Zhongxiang Dai · Arunesh Sinha · See-Kiong Ng · Patrick Jaillet · Bryan Kian Hsiang Low