Keynotes

Prof. Babak Falsafi

Post-Moore Server Architecture [VIDEO]

Cloud providers are building infrastructure at unprecedented speed. Demand for processing, communicating, and storing data has outpaced the conventional growth of digital platforms. Meanwhile, the silicon technologies we have relied on for the past several decades, which drove the exponential growth in IT, have slowed in scaling and will soon come to a halt. While demand for data-centric IT continues to increase and platform scalability has reached diminishing returns, the basic architecture of a modern server blade still dates back to the CPU-centric desktop PC of the 1980s: it manages memory at hardware speeds but accesses the network and storage through the OS, legacy software stacks, and peripheral interfaces. This talk will make the case for a clean-slate co-design of post-Moore server software and hardware.

Bio: Babak Falsafi is a Professor in the School of Computer and Communication Sciences and the founding director of the EcoCloud research center at EPFL. He has worked on server architecture over the years, with contributions impacting industrial products and platforms including the WildFire/WildCat NUMA machines by Sun Microsystems, memory prediction technologies in IBM BlueGene and ARM cores, and server evaluation methodologies in use by AMD, HPE and Google PerfKit. His recent work on scale-out server processor design laid the foundation for the first generation of Cavium ThunderX. He is a Fellow of ACM and IEEE.

Prof. Dan Ports, Microsoft Research

Accelerating Distributed Systems with In-Network Computation [VIDEO]

Recent advances in accelerators have yielded major improvements in single-node performance, but distributed systems performance now lags far behind. Can we build a new accelerator for distributed systems and close this gap? In this talk, I'll argue that in-network computation can serve as a distributed systems accelerator. New programmable switches and NICs can place small amounts of computation directly in the network fabric, which lets us speed up common communication patterns in distributed systems and reach new levels of performance.

I'll describe three systems that use in-network acceleration to speed up classic communication and coordination challenges. First, I'll show how to speed up state machine replication using a network sequencing primitive. The ordering guarantees it provides allow us to design a new consensus protocol, Network-Ordered Paxos, with extremely low performance overhead. Second, I'll show that even a traditionally compute-bound workload -- ML training -- is increasingly network-bound. Our new system, SwitchML, alleviates this bottleneck by accelerating a common communication pattern using a programmable switch. Finally, I'll show that using in-network computation to manage the migration and replication of data, in a system called Pegasus, allows us to load-balance a key-value store to achieve high utilization and predictable performance in the face of skewed workloads.
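As a rough illustration of the network sequencing idea mentioned above, the sketch below simulates a sequencer that stamps each message with a consecutive number, so that replicas can agree on order and notice dropped messages without exchanging extra messages in the common case. This is not the Network-Ordered Paxos protocol itself: the `Sequencer` and `Replica` classes and their methods are hypothetical names chosen for this example, and a real system would run the sequencer in a programmable switch and follow gap detection with a recovery protocol.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


@dataclass
class Sequencer:
    """Hypothetical stand-in for an in-network sequencer (e.g. a switch)."""
    next_seq: int = 1

    def stamp(self, payload: Any) -> Tuple[int, Any]:
        # Attach the next consecutive sequence number to the message.
        stamped = (self.next_seq, payload)
        self.next_seq += 1
        return stamped


@dataclass
class Replica:
    """Hypothetical replica that applies messages in sequence order."""
    expected: int = 1
    log: List[Optional[Any]] = field(default_factory=list)
    gaps: List[int] = field(default_factory=list)

    def receive(self, stamped: Tuple[int, Any]) -> bool:
        seq, payload = stamped
        if seq < self.expected:
            # Duplicate or stale message: ignore it.
            return False
        # Any skipped sequence numbers reveal dropped messages. Here we only
        # record the gap and leave a placeholder slot in the log; a real
        # protocol would coordinate with other replicas to fill it.
        for missing in range(self.expected, seq):
            self.gaps.append(missing)
            self.log.append(None)
        self.log.append(payload)
        self.expected = seq + 1
        return True
```

In this sketch, a replica that receives every stamped message ends up with a gap-free log, while a replica that misses one sees the jump in sequence numbers and knows exactly which slot needs recovery.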

Bio: Dan Ports is a Principal Researcher at Microsoft Research and an Affiliate Assistant Professor in Computer Science and Engineering at the University of Washington. Dan's background is in distributed systems research, and more recently he has focused on how to use new datacenter technologies like programmable networks to build better distributed systems. He leads the Prometheus project at MSR, which uses this co-design approach to build practical high-performance distributed systems. Dan received a Ph.D. from MIT (2012). His research has been recognized with best paper awards at NSDI and OSDI.