Keynote Talks

Breaking the Memory Wall for Foundation Models: From Algorithm to Silicon

Abstract

As foundation models continue to scale, the dominant bottleneck is shifting from compute to memory—capacity, bandwidth, and data movement. This trend is especially acute for emerging agentic AI systems, where long contexts, persistent state, and multi-step reasoning place unprecedented pressure on the memory hierarchy. This talk argues that breaking the memory wall requires a coordinated effort across the entire stack, and offers a unified perspective spanning algorithms, systems, architecture, and silicon.

The talk first considers how memory demand can be reduced at its source, through algorithm- and system-level techniques that reshape how models store and move data during inference. It then turns to memory-centric accelerator design, showing how such techniques can be co-designed with the underlying hardware rather than treated in isolation. Finally, it looks toward silicon realization and the future of memory-centric accelerators purpose-built for foundation models and agent systems. The unifying thread—from memory compression to scheduling to memory-centric architecture and specialized chips—points toward a practical path for sustaining the continued scaling of foundation models.

Bio

Yiran Chen is the John Cocke Distinguished Professor of Electrical and Computer Engineering at Duke University. He serves as the Principal Investigator and Director of the NSF AI Institute for Edge Computing Leveraging Next Generation Networks (Athena), the Director of the Institute for AI Engineering (IAIE), and the Co-Director of the Duke Center for Computational Evolutionary Intelligence (DCEI). His research group focuses on innovations in emerging memory and storage systems, machine learning and neuromorphic computing, and edge AI. Dr. Chen has authored over 700 publications and holds 96 U.S. patents. His work has received widespread recognition, including two Test-of-Time Awards and 15 Best Paper/Poster Awards. He is the recipient of the IEEE Circuits and Systems Society’s Charles A. Desoer Technical Achievement Award and the IEEE Computer Society’s Edward J. McCluskey Technical Achievement Award. He also serves as the inaugural Editor-in-Chief of the IEEE Transactions on Circuits and Systems for Artificial Intelligence (TCASAI) and the founding Chair of the IEEE Circuits and Systems Society’s Machine Learning Circuits and Systems (MLCAS) Technical Committee. Dr. Chen is a Fellow of the AAAS, ACM, IEEE, and NAI, and a member of the European Academy of Sciences and Arts.

The Path to Inference Efficiency

Abstract

Agentic AI is moving out of demos and into daily use, creating enormous demand for efficient inference: higher throughput, lower latency, and better efficiency in both dollars and joules. Meeting these targets requires rethinking the full inference + tools stack, from the specialized silicon that runs the models, to the system software that compiles, schedules, and serves them at scale, to the model architectures that determine what must be computed in the first place. In this talk, we will examine these layers with an eye toward the next major advances in hardware architecture, and how systems and algorithms can be co-designed to fully exploit them. Large gains in inference efficiency will come not from isolated improvements, but from treating hardware, systems, and models as an integrated stack.

Bio

Christos Kozyrakis is a computer architecture researcher at NVIDIA and the Leonard Bosack and Sandy K Lerner Professor of Engineering at Stanford University. His research focuses on hardware and software infrastructure for AI, as well as the use of AI for hardware and software design. He holds a PhD degree from the University of California at Berkeley and a BS degree from the University of Crete. He is a fellow of the ACM and the IEEE. He has received the IEEE Harry H Goode award, the ACM SIGARCH Maurice Wilkes award, the NSF Career Award, the ISCA Influential Paper Award, the ASPLOS Influential Paper Award, the HPCA Test of Time award, the SoCC Test of Time award, the Okawa Foundation Research Grant, the Noyce Family Faculty Scholarship, the Willard R. and Inez Kerr Bell Faculty Scholarship, and faculty awards from IBM, Google, and Microsoft.

The New Golden Age for the Computer Architect

Abstract

We are often told that we are in a golden age for computer architecture, but this moment is something more profound: a golden age for the computer architect. Advances in AI are collapsing the traditional boundaries between architecture, software, and hardware implementation, enabling architects to move from idea to realization across the full stack. Yet, despite this promise, today’s AI systems still struggle to reason about architectural design due to missing abstractions and outdated tooling. In this talk, I will argue that the next era of innovation is not just about better hardware, but about redefining how we design systems: introducing new machine-understandable abstractions, building AI-native tools, and embracing system-level thinking in a world of increasingly specialized and heterogeneous architectures. As AI workloads diversify and hardware fragments, the role of the architect expands—from optimizing components to defining design spaces, composing systems, and shaping the software–hardware interface. This is not just a technological shift, but a redefinition of what it means to be a computer architect.

Bio

Professor Yakun Sophia Shao is The Class of 1951 Chair Associate Professor in the Electrical Engineering and Computer Sciences department at the University of California, Berkeley. Her research interests are in computer architecture, with a special focus on specialized accelerators, heterogeneous architectures, and agile VLSI design methodology. Previously, she was a Senior Research Scientist at NVIDIA Research and received her Ph.D. degree in 2016 from Harvard University. She is a recipient of the 2025 Outstanding Teaching Award in EE, the 2024 Anita Borg Early Career Award, a Sloan Research Fellowship, an NSF CAREER Award, the 2022 IEEE TCCA Young Computer Architect Award, a Google Faculty Rising Stars Award in Systems Research, a Google Research Scholar Award, a Facebook Research Award, an Okawa Foundation Research Grant, and the inaugural Dr. Sudhakar Yalamanchili Award. Her work has been awarded multiple Best Paper Awards and Distinguished Artifact Awards at major hardware conferences.

Efficient and Scalable Agentic AI with Heterogeneous Systems

Abstract

AI agents are emerging as a dominant workload in a wide range of applications, promising to be the vehicle that delivers the benefits of AI to enterprises and consumers. Unlike conventional software or static inference, agentic workloads are dynamic and structurally complex. These agents are often directed graphs of compute and I/O operations that span multi-modal data input and conversion (e.g., speech to text), data processing and context gathering (e.g., privacy filtering, vector DB lookups), LLM inference, and tool calls. To scale AI agent usage, we need efficient and scalable deployment and agent-serving infrastructure. Today, however, the vast majority of these workloads are deployed on homogeneous, high-end, single-vendor infrastructure, which is often expensive and limits broad rollout.

To tackle this challenge, we present a system design for dynamic orchestration of AI agent workloads on heterogeneous compute infrastructure spanning CPUs and accelerators, both from different vendors and across different performance tiers within a single vendor. The system delivers several building blocks: a framework for planning and optimizing agentic AI execution graphs using cost models that account for the compute, memory, and bandwidth constraints of different hardware; an MLIR-based compilation system that can decompose AI agent execution graphs into granular operators and generate code for different hardware options; and a dynamic orchestration system that can place the granular components across a heterogeneous compute infrastructure and stitch them together while meeting an end-to-end SLA. Our design thus performs a system-level total cost of ownership (TCO) optimization, and our results show that leveraging a heterogeneous infrastructure can deliver significant TCO benefits.

Bio

Zain Asgar is Co-Founder/CEO of Gimlet Labs. Zain was previously the GM/GVP of Pixie & Open Source at New Relic, through an acquisition of Pixie Labs, where he was the Co-Founder/CEO. Zain is also an Adjunct Professor of Computer Science at Stanford University and was an Entrepreneur in Residence at Benchmark before co-founding Pixie. He has a Ph.D. from Stanford and has helped build at-scale data and AI/ML at Google AI, Trifacta, and NVIDIA.
