07:30 - 08:45 am
☕ Morning Coffee
08:45 - 09:00 am
Opening Remarks & Overview
09:00 - 09:45 am
Keynote
Nicholas Carlini (Anthropic) "Attacking and Defending Language Model Systems"
Abstract: Language models are now increasingly deployed as part of larger systems. This unfortunately increases the scope of attacks that are possible, as adversaries can now take advantage of the ways in which models do not behave like perfect ideal functions. But the same considerations also open up the potential for new defenses. Instead of designing models to be completely secure in isolation, we can start to consider defenses that work by designing security around an untrusted and vulnerable model. This talk discusses several attacks and defenses that result from this new design, and concludes by highlighting open challenges and potential opportunities.
Bio: Nicholas Carlini is a research scientist at Anthropic working at the intersection of machine learning and computer security, and for this work has received best paper awards from USENIX Security, ICML, and IEEE S&P. He received his PhD from UC Berkeley under David Wagner.
09:45 - 10:15 am
☕ Break
10:15 - 11:45 pm
Systems-level Defenses
Chair: Drew Zagieboylo (NVIDIA)
Talks:
Nicolas Papernot (University of Toronto) "Unlearning methods are not a good fit for output censorship"
Abstract: The need for machine unlearning, i.e., obtaining a model one would get without training on a subset of data, arises from privacy legislation and as a potential solution to data poisoning or copyright claims. As we present different approaches to unlearning, it becomes clear that they fail to answer the following question: how can end users verify that unlearning was successful? Indeed, we show how an entity can claim plausible deniability when challenged about an unlearning request that was claimed to be processed, and conclude that at the level of model weights, being unlearnt is not always a well-defined property. Put another way, we find that unlearning is an algorithmic property. Taking a step back, we draw lessons for the broader area of trustworthy machine learning.
Bio: Nicolas Papernot is an Assistant Professor at the University of Toronto, in the Department of Electrical and Computer Engineering, the Department of Computer Science, and the Faculty of Law. He is also a faculty member at the Vector Institute where he holds a Canada CIFAR AI Chair, and a faculty affiliate at the Schwartz Reisman Institute. In addition, he is a Co-Director of the Canadian AI Safety Institute (CAISI) Research Program at CIFAR. His research interests span the security and privacy of machine learning. Some of his group’s recent projects include generative model collapse, cryptographic auditing of ML, private learning, proof-of-learning, and machine unlearning. Nicolas is an Alfred P. Sloan Research Fellow in Computer Science, a Member of the College of the Royal Society of Canada, and a Schmidt Sciences AI2050 Early Career Fellow. He is also a recipient of the McCharles Prize for Early Career Research Distinction and the ACM SIGSAC Outstanding Early-Career Researcher Award. His work on differentially private machine learning was awarded an outstanding paper at ICLR 2022 and a best paper at ICLR 2017. He co-created the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) and co-chaired its first two editions in 2023 and 2024. He is currently the program co-chair of the IEEE Symposium on Security and Privacy (Oakland). Nicolas earned his Ph.D. at the Pennsylvania State University, working with Prof. Patrick McDaniel and supported by a Google PhD Fellowship. Upon graduating, he spent a year at Google Brain (now Google DeepMind) where he still spends some of his time.
Eric Wallace (OpenAI) "Making “GPT-Next” Robust"
Abstract: I’ll talk about three recent directions from OpenAI to make our next-generation of models more trustworthy and secure. First, I will do a deep dive into chain-of-thought reasoning models and how we can align them with human preferences using deliberative alignment. Next, I will discuss how to mitigate prompt injections and jailbreaks by teaching LLMs to follow instructions in a hierarchical manner. Finally, I will discuss the tensions that exist between open model access and system security, whereby providing access to LM output probabilities can allow adversaries to reveal the hidden size of black-box models.
Bio: Eric Wallace is a research scientist at OpenAI, where he studies the theory and practice of building safe, trustworthy, and private LLMs. He did his PhD at UC Berkeley, where he was supported by the Apple Scholars in AI Fellowship and had his research recognized by various awards (EMNLP, PETS). Prior to OpenAI, Eric interned at Google Brain, AI2, and FAIR.
Chuan Guo (FAIR - Meta) "A strategic vision for mitigating prompt injection attacks"
Abstract: LLM-integrated applications are vulnerable to adversarial manipulation via prompt injection attacks. This type of attack is unique to LLMs and is particularly acute for frontier models with strong instruction-following abilities, and thus requires new mitigation strategies. In this talk, we will discuss recent advances in both model-level and system-level mitigations, as well as how these mitigations can be effective combined to achieve defense in depth.
Bio: Chuan Guo is a research scientist in the Fundamental AI Research (FAIR) team at Meta. His research focuses on AI security and privacy, with recent topics including adversarial robustness, privacy-preserving machine learning, and security of LLM-integrated applications. He received his PhD from Cornell University, advised by Kilian Weinberger and Karthik Sridharan.
Small-group discussion session
11:45 - 01:00 pm
🍴Lunch
01:00 - 02:30 pm
Attacks on Agentic Systems
Chair: Ivan Evtimov (Meta)
Talks:
Johann Rehberger (Embrace the Red) "Exploiting Computer-Use Agents - Attacks and Mitigations"
Abstract: This presentation will demonstrate how prompt injection attacks can compromise agentic systems—such as OpenAI’s Operator and Anthropic’s Claude—with potentially disastrous consequences. The talk will expose critical vulnerabilities that threaten user privacy, system integrity, and the future of AI-driven automation. The talk will also cover current mitigation strategies and forward-looking recommendations and thoughts,. To kick it all off, the talk will start with a striking demonstration showing persistent Command and Control of compromised ChatGPT instances - achieved entirely through prompt injection.
Bio: Johann Rehberger has over twenty years of experience in threat modeling, risk management, penetration testing, and red teaming. As part of his many years at Microsoft, Johann established a Red Team in Azure Data and led the program as Principal Security Engineering Manager. He also built out a Red Team at Uber, and currently is Red Team Director at Electronic Arts. Additionally, Johann enjoys providing training and was an instructor for ethical hacking at the University of Washington in Seattle. He is the author of the book "Cybersecurity Attacks – Red Team Strategies", and holds a master’s in computer security from the University of Liverpool.
Chaowei Xiao (University of Wisconsin-Madison) "Towards Secure and Safe LLM Agents"
Abstract: We are witnessing a paradigm shift in AI, transitioning from deep learning models to the era of Large Language Models (LLMs). This shift signifies a transformative advancement in AI, enabling it to be applied to diverse real-world safety-critical applications. Despite these impressive achievements, a fundamental question remains: is AI truly ready for safe, and secure use? In this talk, I will demonstrate how my research addresses this question. I will introduce two principles for measuring the worst-case performance of AI and for designing secure and safe AI models. Additionally, I will highlight why securing AI requires more than just focusing on model-level robustness by illustrating security vulnerabilities from a broader system perspective. Finally, I will outline my vision for ensuring AI security through comprehensive, integrated model-level and system-level approaches.
Bio: Chaowei Xiao is an Assistant Professor at the University of Wisconsin–Madison. His research focuses on building secure and trustworthy AI systems. He has received several prestigious awards, including the Schmidt Science AI2050 Early Career Award, the Impact Award from Argonne National Laboratory, and various industry faculty awards. His work has been recognized with several paper awards including the USENIX Security Distinguished Paper Award (2024), the International Conference on Embedded Wireless Systems and Networks (EWSN) Best Paper Award (2021), and the MobiCom Best Paper Award (2014), ACM Gordon Bell Prize Finalist (2024) and ACM Gordon Bell Special Prize for HPC-Based COVID-19 (2023). Dr. Xiao's research has been cited over 15k times and has been featured in multiple media outlets such as Nature, Wired, Fortune, and The New York Times. Additionally, one of his research outputs was exhibited at the London Science Museum.
Arman Zharmagambetov (FAIR - Meta) "GCG++: Automated Red-Teaming in the Age of Agentic AI"
Abstract: In recent years, automatic red-teaming for LLMs has seen significant growth. A plethora of methods, such as GCG and its variations, have become widely adopted, often complementing manual red-teaming efforts to evaluate open-source and industry LLMs against diverse attacks. However, the transition to agentic LLMs/VLMs raises questions about the applicability of these methods. Can existing approaches be easily extended, or do we need novel attack types? Prior research presents mixed results, partly due to intrinsic differences between chatbot settings and agentic modes, such as long-horizon tasks and complex environments. This talk explores potential strategies, including optimization-based (e.g. GCG), RL-based (e.g. multi-agent interplay), and beyond.
Bio: Arman Zharmagambetov is a research scientist in the Fundamental AI Research (FAIR) team at Meta. His research primarily focuses on machine learning and optimization, recently exploring their application in enhancing the security and robustness of AI systems.
Small-group discussion session
02:30 - 03:00 pm
☕ Break
03:00 - 04:30 pm
Privacy issues in Agents
Chair: Niloofar Mireshghallah (University of Washington)
Talks:
Eugene Bagdasarian (UMass Amherst CICS) "Contextual Defenses for Privacy-Conscious Agents"
Abstract: Personal AI agents acting in the real world require access to our devices and sensitive data, creating significant privacy risks. Can we build agents that users can trust? This talk introduces a novel framework for designing “privacy-conscious” agents resistant to data theft attacks through prompt injection and jailbreaking. We first analyze the threats, presenting powerful "context hijacking" attacks based on Contextual Integrity (CI) theory designed to manipulate agent behavior. We then show through evaluations on synthetic personal data how integrating privacy reasoning demonstrably improves agent security. Drawing inspiration from classic systems security, we demonstrate how reference monitors and air-gap-inspired techniques can effectively shield agents from untrusted inputs. Finally, we highlight how our recent work advances contextual security and outline key opportunities for the community to build the future of secure, privacy-conscious AI agents.
Bio: Eugene is an Assistant Professor at UMass Amherst CICS. Eugene's work focuses on security and privacy in emerging AI-based systems and agentic use-cases under real-life conditions and attacks.
Saeed Mahloujifar (FAIR - Meta) "How much can language models memorize?"
Abstract: Memorization of training data presents a persistent challenge in machine learning, with particularly concerning implications for language modeling tasks. As foundation models continue to advance, defining memorization becomes increasingly nuanced, requiring careful consideration of both theoretical soundness and practical implications. This talk critically examines existing definitions of memorization in foundation language models and proposes a novel alternative that addresses their limitations. We establish essential criteria for an effective definition: Applicable to instances of models and data, clear distinction between generalization and memorization, and efficient measurability. Drawing inspiration from Kolmogorov complexity, we introduce a compression-based definition that satisfies these criteria. Our presentation demonstrates practical methods to approximate this notion and presents experimental findings on how language models memorize their training data. These insights contribute to our understanding of model behavior and how that is tied to the capacity of models for memorization.
Bio: Saeed Mahloujifar is a Research Scientist in the Fundamental AI Research (FAIR) team at Meta. His research interests are on theoretical foundations of privacy and security for AI systems and their interplay with cryptography.
Small-group discussion session
04:30 - 05:00pm
Concluding Remarks