CXL Panel Summary
By: Professors Andrew Quinn and Pankaj Mehra
Compute Express Link (CXL) is an emerging industry standard that leverages fifth-generation and later PCI Express (PCIe) physical layers, introducing new protocols and arbitration mechanisms for memory expansion and accelerator coherence over wide serial links and fabrics built from them. On November 16, 2022, the CXL Special Interest Group (SIG) at UC Santa Cruz hosted a panel to discuss trends, problems, and opportunities arising from CXL-enabled disaggregation in the data center. Our panel included thought leaders covering CXL-related topics spanning low-level cache and micro-architecture concerns up to system-level changes in applications and algorithms. Panelists included (click on a panelist's name to see slides where available):
Pankaj Mehra, Elephance Memory & UC Santa Cruz (moderating)
Suresh Mahanty, SK hynix
KC Ryoo, Samsung
Craig Hampel, Rambus
Frank Hady, Intel
Priya Duraisamy, Google
Phil Bernstein, Microsoft
Daniel Bittman, UC Santa Cruz
Sharad Singhal, HPE
Ike Nassi, TidalScale & UC Santa Cruz
Andrew Quinn, UC Santa Cruz
The panel focused on CXL-attached memory as opposed to CXL-attached accelerators. CXL-attached memory comes with a number of challenges: it increases DRAM's memory-path latency compared to traditional interfaces (e.g., DDR), expands the complexity of the memory hierarchy by exposing a wider range of non-uniform memory access latencies, and imposes complex security and translation responsibilities on devices. In light of these challenges, the panel focused on two broad topics. First, what advantages does CXL bring that incentivize its use in spite of these challenges? Second, what techniques can we use to reduce or eliminate the challenges associated with high-latency CXL memory?
Below, we describe these two topics and also discuss the potential for near-memory compute provided through CXL and resource disaggregation.
A CXL Advantage—Memory Capacity and Shared Memory Pools
CXL-attached memory will bring a massive growth in the memory capacity of our servers, with individual servers gaining access to TBs or PBs of memory (KC Ryoo). With such large memory sizes, applications that rely on massive datasets and are traditionally disk-bound, such as machine learning or big-data analytics, could instead store their datasets in memory (Suresh Mahanty, Sharad Singhal). In-memory datasets provide large improvements to current application performance (Ike Nassi), and are likely to improve further if we redesign applications and algorithms for in-memory datasets. As an example, we should expect major gains on the pervasive shuffle operations underlying database joins, map-reduce computation, and in-memory frameworks such as Spark by holding data online and operating on it near-memory in large pools of fabric-attached memory, as sketched below. New policies for allocating a database system’s buffer cache in shared memory pools will drive greater data reuse (Phil Bernstein, Sharad Singhal).
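To make the shuffle point concrete, below is a minimal C sketch of the idea: if every worker can map the same fabric-attached pool, a shuffle can exchange small (offset, length) descriptors instead of copying the data itself. The pool is simulated with an ordinary malloc'd buffer, and the two-partition setup is purely illustrative rather than anything a panelist presented.

    /* Sketch: a shuffle that exchanges descriptors instead of data.
     * A malloc'd buffer stands in for a fabric-attached pool that every
     * worker has mapped into its address space. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct { size_t offset; size_t length; } partition_desc;

    #define POOL_BYTES (1 << 20)

    int main(void) {
        char *pool = malloc(POOL_BYTES);   /* stand-in for fabric-attached memory */
        if (!pool) return 1;
        size_t cursor = 0;

        /* "Map" side: each worker writes its partition into the pool and
         * records only a descriptor. */
        const char *partitions[2] = { "records for reducer 0", "records for reducer 1" };
        partition_desc desc[2];
        for (int i = 0; i < 2; i++) {
            size_t len = strlen(partitions[i]) + 1;
            memcpy(pool + cursor, partitions[i], len);
            desc[i] = (partition_desc){ .offset = cursor, .length = len };
            cursor += len;
        }

        /* "Reduce" side: the shuffle moves only descriptors; the data is
         * read in place from the shared pool, with no bulk copy. */
        for (int i = 0; i < 2; i++)
            printf("reducer %d reads %zu bytes at offset %zu: %s\n",
                   i, desc[i].length, desc[i].offset, pool + desc[i].offset);

        free(pool);
        return 0;
    }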
Making PB-scale memory capacity practical requires deploying shared memory pools. Otherwise, data centers would incur a high degree of memory stranding, in which large swaths of deployed memory go unused [see Pond]. In addition to enabling large single-machine capacity, shared memory pools also enable us to scale memory independently from compute, allow bandwidth to be configured relative to capacity more flexibly than DDR, reduce total memory usage across the datacenter by sharing in-memory data across heterogeneous processors, and improve fault tolerance by reducing out-of-memory errors and accelerating host reboots (Priya Duraisamy, Craig Hampel). With third-generation CXL (CXL 3.0) we will have the option of building global fabric-attached memory (G-FAM), provided that the many CXL switch developments in progress across the industry yield the right mix of radix and latency so that the bisection bandwidth of the fabric significantly exceeds a single CPU’s memory bandwidth. With this, remote memory operations will see unprecedented acceleration, and applications whose performance is dominated by data movement today will see great speedups at lower latency and power consumption (Sharad Singhal).
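The stranding argument is easy to quantify. The back-of-the-envelope C sketch below uses made-up per-host capacities and demands: with host-local memory only, one host's surplus is unusable by its over-committed neighbors, whereas a shared pool strands only the datacenter-wide surplus.

    /* Back-of-the-envelope illustration of memory stranding; all numbers
     * are invented for illustration. */
    #include <stdio.h>

    int main(void) {
        double local_gb[4]  = { 512, 512, 512, 512 };   /* installed per host */
        double demand_gb[4] = { 700, 300, 650, 200 };   /* workload demand    */

        double stranded = 0, installed = 0, demand = 0;
        for (int i = 0; i < 4; i++) {
            installed += local_gb[i];
            demand    += demand_gb[i];
            if (demand_gb[i] < local_gb[i])
                stranded += local_gb[i] - demand_gb[i]; /* surplus no neighbor can use */
        }

        printf("host-local only: %.0f GB stranded of %.0f GB installed\n", stranded, installed);
        printf("shared pool    : %.0f GB stranded\n", installed > demand ? installed - demand : 0);
        return 0;
    }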
Alas, shared memory pools come with additional challenges beyond the latency of far memory. Shared memory pools transfer significant root-of-trust, reliability, and security responsibilities from a single host’s hypervisor or operating system to the hardware and software deployed on the shared memory pool device. The blast radius of a single application failure expands in both speed and reach, because faulty state can propagate through memory pools. Security challenges increase as well, especially as we think about side channels formed by the caches, prefetching, and speculation that might be employed on individual shared memory pools (see below). Memory pooling across servers and workloads also sets up nasty multi-tenancy and optimization challenges: achieving both performance isolation and high capacity utilization while satisfying diverse and dynamic capacity-bandwidth ratio requests (Phil Bernstein, Priya Duraisamy).
Data-center architects will be spoilt for choice with memory tiers
We typically think of CXL as adding a single new layer to the memory hierarchy and have largely considered how we can cope with the increased latency at that new CXL tier. However, architectural trends suggest that the memory hierarchy is diversifying not only toward CXL and disaggregation, but also toward in-package DRAM (Frank Hady). Consequently, the systems of tomorrow will have access to a multitude of memory technologies involving a combination of in-package, direct-connect, and disaggregated memory.
Diverse memory hierarchies offer potential advantages in power, performance, and cost. But there are more questions than answers: What applications can tolerate higher memory latency? Can we hide or alleviate the memory latency challenges? How can we make the complexity of an expanded memory hierarchy mostly invisible to applications? How should memory hierarchy management be split between hardware and software? (Frank Hady, Priya Duraisamy, Ike Nassi)
Caches will play a crucial role in mitigating the high latency of accessing data in the lower levels of an expanding memory hierarchy (Frank Hady, Nathan K.). Each CXL-attached device (memory pool or flash device) seems poised to provide an independent cache with its own policies. How can we coordinate such caches? It seems especially important, but challenging, to support prefetching without cache pollution, bounded coherency, and speculative execution at a distance.
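How far caching can go is captured by a simple average-memory-access-time calculation. The latencies in the C sketch below are illustrative assumptions (100 ns for direct-attached DRAM, 300 ns for a CXL-attached pool), not measurements of any particular device.

    /* Effective access latency with a host-side cache in front of a
     * CXL-attached tier: hits are served at local-DRAM speed, misses go
     * to the pool.  Latencies are assumed, not measured. */
    #include <stdio.h>

    int main(void) {
        double local_ns = 100.0;   /* assumed direct-attached DRAM latency */
        double cxl_ns   = 300.0;   /* assumed CXL-attached pool latency    */

        for (double hit = 0.0; hit <= 1.0; hit += 0.25) {
            double effective_ns = hit * local_ns + (1.0 - hit) * cxl_ns;
            printf("hit rate %.2f -> effective latency %.0f ns\n", hit, effective_ns);
        }
        return 0;
    }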
A CXL Proposition—Software-Defined Infrastructure and Near-Memory Compute
The panel discussion provided hope about the increased efficiency and reduced cost that we should expect from a resource-disaggregated world. But it also identified major challenges that must be overcome before we can realize that efficiency. These challenges are mainly centered on policies and mechanisms for resource management (e.g., managing isolation in shared memory pools, coordinating caching policies across processors and memory devices, etc.).
Software-defined infrastructure is a well-studied approach to enabling innovation and, ultimately, to solving such management challenges. In particular, we envision near-memory compute resources, deployed on CXL-attached memory devices, enabling hypervisors, operating systems, data frameworks, or even individual applications to dictate resource management policies. This approach will enable policy customization; for example, prefetching across caches throughout a memory hierarchy could be customized at runtime based on configuration parameters, the application’s access pattern, and the observed load on devices in a shared memory pool (see the sketch below). Moreover, we see value in revisiting old ideas in memory management, such as stored procedures (Phil Bernstein), compressed caches and memory (ZipPads), and object-based memory organizations (Pankaj Mehra). Once we get past the well-understood, basic challenges of compute offload, such as parallelism and translation on the device side, we foresee a blossoming of new memory models and memory-centric algorithms (e.g., WiscSort over the BRAID model, offloaded shuffles).
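As a thought experiment, runtime policy customization might look like the C sketch below. The prefetch_policy struct and pool_set_prefetch_policy() call are hypothetical; no such interface exists in the CXL specification or in any vendor SDK today. They are meant only to show the shape of a knob an application could turn per access pattern.

    /* Hypothetical sketch of an application handing a prefetch policy to
     * a near-memory compute element on a shared pool. */
    #include <stddef.h>

    struct prefetch_policy {
        size_t degree;            /* lines fetched per trigger               */
        size_t distance;          /* how far ahead of the access stream      */
        size_t pollution_budget;  /* max cached bytes this tenant may hold   */
        int    sequential_only;   /* disable prefetch for pointer chasing    */
    };

    /* Imagined device call: in a real system this might be an ioctl, a CXL
     * mailbox command, or a vendor library entry point.  Stubbed out here
     * so the sketch compiles and runs. */
    int pool_set_prefetch_policy(int pool_handle, const struct prefetch_policy *p) {
        (void)pool_handle; (void)p;
        return 0;
    }

    void tune_for_scan(int pool_handle) {
        struct prefetch_policy p = {
            .degree = 8, .distance = 4,
            .pollution_budget = 1 << 20, .sequential_only = 1,
        };
        pool_set_prefetch_policy(pool_handle, &p);  /* chosen per access pattern */
    }

    int main(void) {
        tune_for_scan(3);   /* 3 is a placeholder handle, not a real device */
        return 0;
    }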
Near-memory compute goes beyond customized resource management and also enables offloading of system and application logic. For example, pointer chasing performs best when offloaded to near-memory compute on devices (Phil Bernstein), as the sketch below illustrates. Such delegation could also enforce security-based memory isolation between applications sharing a memory pool, or allow an application to register callbacks that group its private data into whole objects to be flushed from a cache upon context switches or promotion/demotion in the cache hierarchy.
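The pointer-chasing argument can be made concrete with a small C sketch: walking a linked list from the host costs one far-memory round trip per node, while a near-memory engine could walk the whole list and return a single answer. The device_walk_list() function below is a stand-in simulated in-process, not a real offload API.

    /* Why pointer chasing favors near-memory compute. */
    #include <stdio.h>
    #include <stdlib.h>

    struct node { int key; struct node *next; };

    /* Host-side walk: every dereference of n->next is a far-memory access. */
    int host_walk(struct node *head, int key) {
        int hops = 0;
        for (struct node *n = head; n; n = n->next) {
            hops++;
            if (n->key == key) { printf("host walk: %d round trips\n", hops); return 1; }
        }
        return 0;
    }

    /* Stand-in for an offloaded walk executed next to the memory: the host
     * sends one request and gets one reply, regardless of list length. */
    int device_walk_list(struct node *head, int key) {
        for (struct node *n = head; n; n = n->next)
            if (n->key == key) return 1;
        return 0;
    }

    int main(void) {
        struct node *head = NULL;
        for (int i = 9; i >= 0; i--) {            /* build a 10-node list */
            struct node *n = malloc(sizeof *n);
            n->key = i; n->next = head; head = n;
        }
        host_walk(head, 9);                        /* ~10 far-memory round trips */
        printf("offloaded walk: 1 round trip, found=%d\n", device_walk_list(head, 9));
        while (head) { struct node *n = head; head = head->next; free(n); }
        return 0;
    }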
There is strong evidence that operating systems are beginning to evolve their memory management subsystems to accommodate dynamically detected changes in memory capacity (Pankaj Mehra). To go beyond developing a bespoke resource management solution for each application and/or configuration, we need devices and systems to expose new memory abstractions and programming models that ease programming and maximize the performance of future applications (Daniel Bittman). Many industry participants underscored the ease of transparently porting existing/POSIX applications as crucial to the near-term adoption of CXL, with only a short time needed to realize its first benefits (Frank Hady, KC Ryoo, Suresh Mahanty, Priya Duraisamy). These aren’t inconsistent goals; they simply differ in what industry and academia need to do.
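One reason transparent porting is plausible: on Linux, CXL-attached memory is typically exposed as a CPU-less NUMA node, so existing NUMA interfaces already let applications (or the kernel's tiering policies) place data on it. The minimal libnuma sketch below assumes, purely for illustration, that node 1 happens to be the CXL-backed node; compile with -lnuma.

    /* Placing a buffer on a CXL-backed NUMA node with libnuma. */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA interfaces not available\n");
            return 1;
        }
        int cxl_node = 1;                     /* assumption: node 1 is CXL-backed */
        if (cxl_node > numa_max_node()) {
            fprintf(stderr, "node %d not present\n", cxl_node);
            return 1;
        }
        size_t sz = 64UL << 20;               /* 64 MiB */
        void *buf = numa_alloc_onnode(sz, cxl_node);
        if (!buf) { perror("numa_alloc_onnode"); return 1; }
        memset(buf, 0, sz);                   /* fault the pages in on that node */
        printf("placed %zu MiB on NUMA node %d\n", sz >> 20, cxl_node);
        numa_free(buf, sz);
        return 0;
    }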