Talk Title: The Frontiers and Challenges of Reasoning LLMs
Abstract: Recent advances in reasoning-capable Large Language Models (LLMs) have yielded impressive benchmark improvements and transformed commercial AI products. These successes have been particularly pronounced where clear objectives and verification mechanisms can be established, such as in mathematics and programming. But where do reasoning LLMs currently fall short? How will the dynamics of human-AI collaboration evolve? What are the ongoing impacts on learning and education? In this talk, I will endeavor to address at least some of these questions.
Bio: Nick Haber is an Assistant Professor at the Stanford Graduate School of Education and, by courtesy, Computer Science. After receiving his PhD in mathematics, with a focus on partial differential equation theory, he worked as a postdoctoral fellow at Stanford in both the Wall Lab (chiefly on the Autism Glass Project) and the NeuroAI Lab (on building curiosity into artificial intelligence, as well as on cognitive models).
Talk Title: Context is King: Unpacking the Generalizability Fallacy in Deep Learning
Abstract: For over twenty years, the deep learning community has celebrated each new model that surpasses its predecessors on standard benchmarks, often with the implicit assumption that these performance gains guarantee success in novel applications. Yet this optimism repeatedly unravels when models are deployed in the real world, exposing a critical blind spot: context is not just a detail; it is the foundation of robustness and performance. In this talk, we will discuss the persistent fallacy that a model's benchmark superiority ensures generalizability. We will explore how contextual factors, ranging from shifting data distributions to domain-specific constraints, consistently challenge the universality of even the most advanced architectures. Drawing on real-world examples and empirical insights from one of the world's preeminent healthcare institutions, this talk will highlight why ignoring context undermines applied deep learning and propose strategies to rethink the ML lifecycle for truly adaptive, resilient systems. Context isn't a footnote; it's the key to unlocking deep learning's potential beyond the lab.
Bio: John Kalantari is the Chief Technology Officer of YRIKKA and an Assistant Professor at the University of Minnesota. He previously served as Director of AI at the Mayo Clinic, holding appointments in the Department of Surgery, the Department of Quantitative Health Sciences, and the Center for Individualized Medicine. He is also the founder of the Biomedical Artificial General Intelligence Lab (BAGIL) at Mayo Clinic, an interdisciplinary group focused on developing digital health tools and predictive models to improve patient care and expand healthcare access through causal machine learning and reinforcement learning. At YRIKKA, he leads pioneering advancements in multi-modal generative AI, emphasizing the quantification of model uncertainty and robustness in high-stakes applications such as national defense and healthcare. His work bridges the gap between AI research and critical real-world implementations, pushing the boundaries of generative models to handle diverse data modalities and enhance decision-making in complex environments.
Talk Title: Automating Scientific Discovery: How Far Are We?
Abstract: In this talk, I'll discuss the emergent field of using frontier models such as LLMs to automate scientific discovery and AI research itself. I will first describe the goals of this research area, its various subproblems, proposed approaches, and early work in this space. Despite the hype, flashy news articles, and some recent works with bold claims, I will provide empirical evidence that models still struggle with many aspects of scientific discovery. I argue that this remains an open problem, and it is unclear whether the current AI paradigm is enough to achieve the long-term ambition of this research agenda. I will then introduce MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. MLGym-Bench consists of 13 open-ended AI research tasks from diverse domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real-world AI research skills: generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task. I will demonstrate how MLGym makes it easy to add new tasks, integrate and evaluate models or agents, generate synthetic data at scale, and develop new learning algorithms for training agents on AI research tasks. Finally, I will discuss our findings from evaluating frontier LLMs on MLGym-Bench, highlighting the limitations of current models at conducting AI research, as well as avenues for future work.
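The "Gym environment" framing in the abstract refers to the standard reinforcement-learning interface: reset() yields an initial observation, and step(action) returns an observation, a reward, and a done flag. As a rough illustration of that contract, here is a minimal, self-contained sketch; the environment class, action names, and reward scheme below are hypothetical toys for exposition, not MLGym's actual API.

```python
# Minimal sketch of a Gym-style interaction loop for a research-like task.
# All names here (ToyResearchEnv, the action strings, the reward scheme)
# are hypothetical illustrations, NOT MLGym's actual API.
import random


class ToyResearchEnv:
    """A toy environment where an agent 'runs experiments' to improve a score.

    Mirrors the standard Gym contract: reset() returns an initial observation;
    step(action) returns (observation, reward, done, info).
    """

    ACTIONS = ["generate_idea", "run_experiment", "analyze_results"]

    def __init__(self, max_steps: int = 10, seed: int = 0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)

    def reset(self):
        self.steps = 0
        self.best_score = 0.0
        return {"task": "improve validation accuracy", "best_score": self.best_score}

    def step(self, action: str):
        assert action in self.ACTIONS, f"unknown action: {action}"
        self.steps += 1
        # In this toy setup, only 'run_experiment' can improve the score.
        new_score = self.rng.random() if action == "run_experiment" else 0.0
        # Reward is the improvement over the best score seen so far.
        reward = max(0.0, new_score - self.best_score)
        self.best_score = max(self.best_score, new_score)
        done = self.steps >= self.max_steps
        obs = {"task": "improve validation accuracy", "best_score": self.best_score}
        return obs, reward, done, {"step": self.steps}


# A random-policy rollout, as an RL training loop would drive it.
env = ToyResearchEnv()
obs, done = env.reset(), False
while not done:
    action = random.choice(ToyResearchEnv.ACTIONS)
    obs, reward, done, info = env.step(action)
print(f"best score after {env.max_steps} steps: {obs['best_score']:.3f}")
```

In a real setting the actions would presumably be far richer (editing code, launching training runs, inspecting results), but the reward-as-improvement pattern is a common way a Gym-style loop turns open-ended research progress into an RL training signal.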
Bio: Roberta Raileanu is a Research Scientist at Meta and an Honorary Lecturer at UCL. She earned her PhD in Computer Science from NYU, where she worked on generalization in deep reinforcement learning. Roberta also holds a degree in Astrophysics from Princeton University. Currently, she works on augmenting foundation models with planning, reasoning, and decision-making abilities by training them from feedback and interaction with external tools, environments, humans, and other AI agents.
Talk Title: Multilingual and Multicultural Fault Lines: Systemic Weaknesses in Language Technologies
Abstract: Despite recent breakthroughs in deep learning and large language models (LLMs), language technologies continue to exhibit critical failure modes in multilingual and multicultural contexts. This talk provides examples of these systemic weaknesses across a wide range of language technologies, drawn from real-world case studies and benchmark evaluations. We highlight how language models often underperform or behave unpredictably when dealing with non-dominant languages, dialects, or culturally specific content. These failures, ranging from dangerous mistranslations to culturally insensitive chatbot responses, reveal persistent gaps in data, modeling, and evaluation. We call for deeper attention to inclusion, representativeness, and equity in language AI development.
Bio: Sunayana is a Principal Researcher at Microsoft Research India in Bangalore, where she has worked since completing her PhD at Carnegie Mellon University in 2015. She is passionate about making AI inclusive to everyone, and her current focus is on improving the evaluation and performance of Large Language Models on non-English languages. In addition to her research, Sunayana has served for the last two years as the director of the MSR India Research Fellow program, which hosts ~65 young researchers and prepares them for careers in research, engineering, and entrepreneurship. Sunayana is an active contributor to the field: she publishes regularly and serves on organizing committees for NLP conferences such as ACL, EMNLP, and CoLM.
Talk Title: Beyond Benchmarks: Why Classification is Still Hard in Practice
Abstract: Standard benchmarks often suggest that tasks like image classification are largely "solved". However, practitioners deploying models in industry regularly encounter scenarios where these solutions fail, especially in critical fields like online safety and content moderation, where there is a long tail of special cases. This talk examines why performance in practice frequently falls short of benchmark expectations. We'll investigate key challenges such as: 1) efficiently obtaining nuanced, specialized data; 2) understanding critical evaluation limitations; and 3) efficiently training specialized models. Finally, we'll assess how Large Language Models (LLMs) can potentially assist, while also highlighting their limitations. Attendees will gain insights into diagnosing failures and developing robust strategies for tackling niche classification tasks in the real world.
Bio: Otilia Stretcu is a Senior Research Scientist at Google Research in Mountain View, California, working on methods and tools that enable non-AI practitioners to efficiently train and deploy AI models for specialized applications using only domain knowledge. This work spans multiple areas, including large language models, active learning, few-shot learning, and knowledge distillation. Previously, she was a PhD student at Carnegie Mellon University, co-advised by Prof. Tom Mitchell and Prof. Barnabás Póczos. Her PhD research focused on developing algorithms for curriculum learning, semi-supervised learning, and graph-based learning, and applying them to problems in health and neuroscience. Prior to her PhD, Otilia received an MPhil from the University of Cambridge, UK, where she was a Gates Cambridge scholar, and a BEng from Politehnica University of Timisoara, Romania.
Talk Title: The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search
Abstract: We introduce The AI Scientist-v2, an end-to-end agentic system capable of autonomously conducting scientific research, including hypothesis generation, experimentation, analysis, and manuscript writing. We demonstrated its capability by having one of its three AI-generated papers successfully navigate peer review at the ICLR workshop "I Can't Believe It's Not Better: Challenges in Applied Deep Learning". While this highlights AI's potential in scientific discovery, challenges remain: verifying AI outputs is time-intensive, and the current system struggles to generate genuinely novel, high-impact hypotheses and to justify design decisions with deep domain expertise. This study, conducted with IRB approval and the organizers' cooperation, underscores the urgent need for community norms and transparency around AI-generated scientific content, both to ensure responsible development and to maintain the integrity of peer review in the face of a potential influx of such papers.
Bio: Yutaro Yamada is a research scientist at Sakana AI. He previously earned a PhD in Statistics and Data Science from Yale University, supported by the Masason Fellowship. His research covers language processing, computer vision, and machine learning, with a current focus on AI agents.
Names are arranged in alphabetical order.