Workshop Schedule
All times are in local Vienna time (CEST).
08:50--09:00 Opening remarks
09:00--09:30 Invited Talk #1: Mike Lewis (Meta AI)
Title: Bridging the Gap Between Pre-training Data and Alignment
Abstract: Large Language Models are first pre-trained to predict unlabelled web documents, but are then typically used in scenarios such as few-shot learning and instruction following, leaving a significant gap between training and inference. I will describe two scalable approaches to reducing this gap by augmenting pre-training data with additional structure. First, I will introduce In-Context Pre-training, which mimics in-context learning during pre-training by training language models on sequences of closely related documents. Then, I will describe Instruction Backtranslation, an approach that augments web documents with automatically inferred instructions for generating them, providing a highly scalable source of instruction-following data. Together, these methods can improve the utility of base models and reduce the need for a separate alignment phase of training.
09:30--09:45 Best Paper Oral Presentation #1: Ken Liu (Stanford University)
Title: Does Data Contamination Make a Difference? Insights from Intentionally Contaminating Pre-training Data For Language Models
09:45--10:00 Best Paper Oral Presentation #2: Sachin Goyal & Pratyush Maini (CMU)
Title: The Science of Data Filtering: Data Curation cannot be Compute Agnostic
10:00--11:00 Coffee Break & Poster Session I
11:00--11:30 Invited Talk #2: Ludwig Schmidt (Anthropic, Stanford, and U Washington)
Title: A data-centric view on reliable generalization: From ImageNet to LAION-5B & DataComp
11:30--11:45 Best Paper Oral Presentation #3: Hritik Bansal (UCLA)
Title: VideoCon: Robust Video-Language Alignment via Contrast Captions
11:45--12:00 Best Paper Oral Presentation #4: Luxi He (Princeton University)
Title: What's in Your "Safe" Data?: Identifying Benign Data that Breaks Safety
12:00--13:00 Lunch
13:00--13:30 Invited Talk #3: Eric Wallace (OpenAI)
Title: Making “GPT-Next” Trustworthy Through Data
Abstract: I’ll talk about three recent directions from OpenAI to make our next generation of models more responsible, trustworthy, and secure. First, I will briefly outline the “Media Manager”—a tool to enable content owners to specify how they want their works to be included/excluded from AI training. Next, I will do a deep dive on prompt injections and how we can mitigate them by teaching LLMs to follow instructions in a hierarchical manner. Finally, I will discuss the tensions that exist between developer access and security, whereby providing access to LM output probabilities can allow adversaries to reveal the hidden size of black-box models.
13:30--13:45 Best Paper Oral Presentation #5: Lukas Struppek (TU Darmstadt)
Title: Exploiting Cultural Biases via Homoglyphs in Text-to-Image Synthesis
13:45--14:00 Best Paper Oral Presentation #6: Jiaqi Ma (UIUC)
Title: Computational Copyright: Towards A Royalty Model for AI Music Generation Platforms
14:00--15:00 Coffee Break & Poster Session II
15:00--15:30 Invited Talk #4: Nicolas Papernot (University of Toronto & Vector Institute)
Title: Characterizing Machine Unlearning through Definitions and Implementations
Abstract: The talk presents open problems in the study of machine unlearning. The need for machine unlearning, i.e., obtaining a model one would get without training on a subset of data, arises from privacy legislation and as a potential solution to data poisoning or copyright claims. The first part of the talk discusses approaches that provide exact unlearning: these approaches output the same distribution of models as would have been obtained by training without the subset of data to be unlearned in the first place. While such approaches can be computationally expensive, we discuss why it is difficult to relax the guarantee they provide to pave the way for more efficient approaches. The second part of the talk asks if we can verify unlearning. Here we show how an entity can claim plausible deniability when challenged about an unlearning request that was claimed to be processed, and conclude that at the level of model weights, being unlearnt is not always a well-defined property. Instead, unlearning is an algorithmic property.
15:30--16:00 Invited Talk #5: Luke Zettlemoyer (U Washington/Meta)
Title: Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
Abstract: Existing language model (LM) training regimes entangle compute, data, and parameters, requiring expensive synchronous communication with massive supercomputers. This talk introduces a new algorithm called Branch-Train-Merge (BTM) that asynchronously trains LMs that are fundamentally modular. In BTM, components (or experts) of the LM are specialized to distinct domains in the training corpus, and experts are conditionally updated based on the domain of the incoming document. We show how BTM enables LMs that are rapidly customizable (with the ability to mix, add, or remove experts after training), embarrassingly parallel (requiring no communication between experts), and sparse (needing only a few experts active at a time for inference). Key to our proposal is exploring what constitutes the domains to which experts specialize, as well as reflecting on the data sources used to train LMs. Our new techniques chart a path towards collaborative and iterative LM development, where anyone can contribute and maintain experts at modest computational cost.
16:00--16:30 Panel Discussion
[Panelists: Hanna Hajishirzi, Ludwig Schmidt, Eric Wallace, Remi Denton]
16:30--16:35 Closing remarks
Speakers - Invited Talks
(Scheduled order)
Meta AI
Mike Lewis is a research scientist at Meta AI, where he is currently the pre-training lead for Llama 3. Prior projects include the Cicero Diplomacy agent and the BART and RoBERTa pretrained language models. Previously he was a postdoc at the University of Washington (working with Luke Zettlemoyer), and he holds a PhD from the University of Edinburgh (advised by Mark Steedman). He received a Best Paper Award at EMNLP 2016, Best Resource Paper at ACL 2017, and a Best Paper Honourable Mention at ACL 2018. His work has been extensively covered in the media, with varying levels of accuracy.
Anthropic, Stanford, and University of Washington
Ludwig Schmidt is a member of the technical staff at Anthropic and an assistant professor at the University of Washington (on leave) and Stanford University (incoming). Ludwig’s research interests revolve around the empirical foundations of machine learning, often with a focus on datasets, reliable generalization, and large models. Recently, Ludwig’s research group contributed to open-source machine learning by creating OpenCLIP, OpenFlamingo, and the LAION-5B dataset. Ludwig completed his PhD at MIT and was a postdoc at UC Berkeley. Ludwig’s research received a New Horizons Award at EAAMO, Best Paper Awards at ICML & NeurIPS, a Best Paper finalist at CVPR, and the Sprowls Dissertation Award from MIT.
OpenAI
Eric Wallace is a research scientist at OpenAI, where he studies the theory and practice of building trustworthy, secure, and private machine learning models. He did his PhD work at UC Berkeley, where he was supported by the Apple Scholars in AI Fellowship and had his research recognized by various awards (EMNLP, PETS). Prior to OpenAI, Eric interned at Google Brain, AI2, and FAIR.
University of Toronto & Vector Institute
Nicolas Papernot is an Assistant Professor at the University of Toronto, in the Department of Electrical and Computer Engineering and the Department of Computer Science. He is also a faculty member at the Vector Institute, where he holds a Canada CIFAR AI Chair, and a faculty affiliate at the Schwartz Reisman Institute. He was named an Alfred P. Sloan Research Fellow in Computer Science in 2022 and a Member of the College of the Royal Society of Canada in 2023. His research interests are at the intersection of security, privacy, and machine learning. His research has been cited in the press, including the BBC, New York Times, Popular Science, The Atlantic, the Wall Street Journal, and Wired. He currently serves as a Program Committee Chair of the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), which he co-founded in 2023. He earned his Ph.D. in Computer Science and Engineering at the Pennsylvania State University, working with Prof. Patrick McDaniel and supported by a Google PhD Fellowship. Upon graduating, he joined Google Brain for a year; he continues to spend time at Google DeepMind.
University of Washington and Meta
Luke Zettlemoyer is a Professor in the Paul G. Allen School of Computer Science & Engineering at the University of Washington, and a Research Director at Meta. His research focuses on empirical methods for natural language semantics, and involves designing machine learning algorithms, introducing new tasks and datasets, and, most recently, studying how to best develop self-supervision signals for pre-training. His honors include being named an ACL Fellow as well as winning a PECASE award, an Allen Distinguished Investigator award, and multiple best paper awards. Luke received his PhD from MIT and was a postdoc at the University of Edinburgh.
Panelists - Panel Discussion
(alphabetical order)
Google
Dr. Remi Denton (they/them) is a Staff Research Scientist at Google, within the Technology, AI, Society, and Culture team, where they study the sociocultural impacts of AI technologies and the conditions of AI development. Their recent research centers on emerging text- and image-based generative AI, with a focus on data considerations and representational harms. Prior to joining Google, Remi received their PhD in Computer Science from the Courant Institute of Mathematical Sciences at New York University, where they focused on unsupervised learning and generative modeling of images and video. Prior to that, they received their B.S. in Computer Science and Cognitive Science at the University of Toronto. Though trained formally as a computer scientist, Remi draws ideas and methods from multiple disciplines and is drawn towards highly interdisciplinary collaborations in order to examine AI systems from a sociotechnical perspective. They've published in multiple top-tier venues spanning social science and computing disciplines, including Big Data & Society, CSCW, FAccT, and NeurIPS.
University of Washington and AI2
Hanna Hajishirzi is a Torode Family Associate Professor at the University of Washington and a Senior Director of NLP at AI2. Her research spans different areas in NLP and AI, most recently the science of language models and language models for science. Her honors include an NSF CAREER Award, a Sloan Fellowship, an Allen Distinguished Investigator Award, an Intel Rising Star Award, and a UIUC Alumni Award. She has also received a best paper award and several honorable mention paper awards.
Anthropic, Stanford, and University of Washington
Ludwig Schmidt is a member of the technical staff at Anthropic and an assistant professor at the University of Washington (on leave) and Stanford University (incoming). Ludwig’s research interests revolve around the empirical foundations of machine learning, often with a focus on datasets, reliable generalization, and large models. Recently, Ludwig’s research group contributed to open-source machine learning by creating OpenCLIP, OpenFlamingo, and the LAION-5B dataset. Ludwig completed his PhD at MIT and was a postdoc at UC Berkeley. Ludwig’s research received a New Horizons Award at EAAMO, Best Paper Awards at ICML & NeurIPS, a Best Paper finalist at CVPR, and the Sprowls Dissertation Award from MIT.
OpenAI
Eric Wallace is a research scientist at OpenAI, where he studies the theory and practice of building trustworthy, secure, and private machine learning models. He did his PhD work at UC Berkeley, where he was supported by the Apple Scholars in AI Fellowship and had his research recognized by various awards (EMNLP, PETS). Prior to OpenAI, Eric interned at Google Brain, AI2, and FAIR.