From Supercomputers to Massive AI Infrastructure: Challenges & Opportunities
February 17, 2026
9:00 AM - 10:00 AM PST
Online
About the Session
Supercomputers and cloud data centers represent the largest scale of HPC and enterprise computing infrastructure today, requiring facilities with roughly 100MW of power. However, AI needs are driving us to new extremes of infrastructure scale, and of power. The transition from 100MW facilities to 1–2GW "super-campuses" in the 2026–2027 window represents a fundamental shift in how we architect and operate AI infrastructure. This massive leap in scale introduces complex delivery dynamics, as the multi-year phased rollout of a 2GW site inevitably collides with the aggressive 9–15 month refresh cycles of cutting-edge accelerators. Navigating this requires a departure from traditional monolithic deployments toward a strategy that accounts for heterogeneous hardware tiers arriving mid-construction.
Operationalizing these campuses also demands a rigorous approach to lifecycle management and "crop rotation." While the high-radix, low-latency interconnects of a concentrated gigawatt-scale footprint are optimized for large-scale collective communication in training workloads, they are often suboptimal for the distributed redundancy and latency requirements of production inference. Furthermore, because hardware failure rates scale roughly linearly with component count, the sheer volume of components at this scale makes "goodput" the ultimate metric of success. Maintaining productive compute time for the world's largest jobs now depends on evolving our fault-tolerance and recovery protocols to handle a twenty-fold increase in server and fabric interruptions compared to the megawatt-scale era.
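The session's framing of "goodput" can be made concrete with a simple back-of-the-envelope model. The sketch below is not from the talk; the estimated_goodput function and all numbers in it are hypothetical placeholders. It estimates the fraction of wall-clock time a job spends on productive compute under a basic checkpoint/restart model, and shows how a roughly twenty-fold increase in interruption rate erodes goodput if checkpointing and recovery costs stay fixed.

def estimated_goodput(mtbf_hours, checkpoint_interval_hours,
                      checkpoint_cost_hours, restart_cost_hours):
    """Fraction of wall-clock time spent on productive compute.

    Assumes failures arrive on average every mtbf_hours, and each failure
    loses half a checkpoint interval of work plus a fixed restart cost.
    """
    # Overhead paid every interval regardless of failures: writing the checkpoint.
    checkpoint_overhead = checkpoint_cost_hours / checkpoint_interval_hours
    # Expected loss per failure, amortized over the mean time between failures.
    loss_per_failure = checkpoint_interval_hours / 2 + restart_cost_hours
    failure_overhead = loss_per_failure / mtbf_hours
    return max(0.0, 1.0 - checkpoint_overhead - failure_overhead)

# Hypothetical numbers: a fleet with ~20x more components sees ~20x more
# interruptions, i.e. job-level MTBF drops from ~48 hours to ~2.4 hours.
for label, mtbf in [("megawatt-scale era", 48.0), ("gigawatt-scale era", 2.4)]:
    g = estimated_goodput(mtbf, checkpoint_interval_hours=1.0,
                          checkpoint_cost_hours=0.05, restart_cost_hours=0.5)
    print(f"{label}: {g:.1%} goodput")

Under these illustrative assumptions, goodput drops from roughly 93% to roughly 53%, which is why the abstract emphasizes evolving fault-tolerance and recovery protocols rather than simply accepting higher interruption rates.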
Join us to learn about the challenges and opportunities of next-generation AI infrastructure, why it is needed for AI, and how it will benefit science and innovation.
Speaker
Benjamin Treynor Sloss
Chief Programs Officer and Vice President, Engineering, Google
Benjamin Treynor Sloss joined Google in 2003 to lead Google's nascent Site Reliability team. He has led the development and operations of Google's network, data centers, and the production software infrastructure and operations for Google’s internal and user-facing services. More recently, Ben was responsible for Google site reliability engineering (SRE), networking, data centers, infrastructure supply chain and operations, and demand management worldwide.
In 2025 Ben was promoted to the role of chief programs officer for Google, leading multi-year Google-wide efforts including data center efficiency, AI diffusion, infrastructure capital structures, and long-term capacity and supply assurance. Earlier in his career, Ben held roles ranging from senior software engineer to vice president of engineering and R&D at SEVEN Networks, E.piphany, and Versant Object Technology. Ben started his career as a software engineer at Oracle at age 17.