CXL SIG
This group includes external (industry) participants.
[12/13/2022] Notes and slides from our Industry Panel on Memory Disaggregation held Nov 16, 2022 are now online.
CXL SIG Google Drive folder
2023
On Spring Break. Meetings will resume April 4th at a new time: Tuesdays 2-3 PM.
Mar 20 (Monday): an update from Andrew/Pooneh on how taking a tiered view of data heat and latency tolerance shows that data-intensive applications may be able to utilize Pond-style lower tiers quite well.
Mar 13
Build upon the excellent discussion of in-memory key-value stores suitable for disaggregated memory by continuing to characterize which ideas from recent work are suitable for CXL (meaning they can exploit the hardware load-store data path) versus those that primarily target RDMA.
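To make the load-store versus RDMA distinction concrete, below is a minimal simulation (my own illustration, not taken from any of the papers discussed) of walking a far-memory hash-bucket chain, the core access pattern in these key-value stores: over CXL each hop is an ordinary pointer dereference, while an RDMA-based design pays one explicit READ round trip per hop. The remote_read helper is a hypothetical stand-in for an RDMA READ verb, used only to count round trips.

```c
/*
 * Sketch: why load-store access matters for disaggregated key-value stores.
 * A hash-bucket chain in far memory can be walked directly over CXL (each
 * hop is just a load), whereas an RDMA-based design must issue one explicit
 * READ round trip per hop.  remote_read() is a stand-in for an RDMA READ,
 * used only to count round trips in this local simulation.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct node { unsigned key; unsigned value; struct node *next; };

static int rdma_round_trips;

/* Stand-in for an RDMA READ: copy one remote node into a local buffer. */
static void remote_read(void *local, const void *remote, size_t len)
{
    memcpy(local, remote, len);
    rdma_round_trips++;
}

/* CXL-style lookup: far memory is load-store addressable, so just chase pointers. */
static int lookup_cxl(const struct node *head, unsigned key, unsigned *out)
{
    for (const struct node *n = head; n; n = n->next)
        if (n->key == key) { *out = n->value; return 1; }
    return 0;
}

/* RDMA-style lookup: every hop must be fetched before it can be examined. */
static int lookup_rdma(const struct node *head, unsigned key, unsigned *out)
{
    struct node local;
    for (const struct node *remote = head; remote; remote = local.next) {
        remote_read(&local, remote, sizeof(local));
        if (local.key == key) { *out = local.value; return 1; }
    }
    return 0;
}

int main(void)
{
    /* Build a 4-deep bucket chain standing in for a far-memory bucket. */
    struct node *head = NULL;
    for (unsigned k = 0; k < 4; k++) {
        struct node *n = malloc(sizeof(*n));
        n->key = k; n->value = k * 100; n->next = head; head = n;
    }

    unsigned v;
    lookup_cxl(head, 0, &v);
    printf("CXL path: value %u via direct pointer chasing\n", v);
    lookup_rdma(head, 0, &v);
    printf("RDMA path: value %u after %d READ round trips\n", v, rdma_round_trips);
    return 0;
}
```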
Mar 6
Continue the discussion on remoteable pointers with a deep dive into Fusee (FAST'23) and WASM (WebAssembly), led by Yiwei Yang
We will be weighing implementation ideas against three critical requirements of remoteable pointers (a minimal sketch follows this list):
Must work from the source as pointers even when the memory is far (for the most part, this requires no CXL-specific implementation)
Must work at the device for offloading pointer chasing to CXL memory device or pre-CXL memory node
Must work at newly started compute without the friction of serialization-deserialization for independent scaling of memory and compute
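One common way to satisfy all three requirements is to encode pointers as (region id, offset) and resolve them against a per-context table. The sketch below is purely illustrative (the names and layout are mine, not the design of Fusee, Carbink, or AIFM):

```c
/*
 * A remoteable pointer encoded as (region id, offset) instead of a raw
 * virtual address.  Any party that can map the region -- the source host,
 * a CXL memory device doing offloaded pointer chasing, or a freshly started
 * compute instance -- resolves it against its own base address, so no
 * serialization/deserialization pass is needed.
 */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t region_id; uint64_t offset; } rptr_t;

/* Per-context table mapping region ids to locally valid base addresses.
 * On a host this would come from mapping the CXL range; on a device from
 * its own address map.  Here it is just a small array for illustration. */
#define MAX_REGIONS 8
static void *region_base[MAX_REGIONS];

static void *rptr_resolve(rptr_t p)
{
    return (char *)region_base[p.region_id] + p.offset;
}

int main(void)
{
    static char shared_region[4096];   /* stand-in for a CXL-backed region */
    region_base[1] = shared_region;

    /* Producer stores data and hands out only (region, offset). */
    rptr_t p = { .region_id = 1, .offset = 128 };
    *(int *)rptr_resolve(p) = 42;

    /* A different context with its own mapping resolves the same rptr_t. */
    printf("consumer reads %d via remoteable pointer\n", *(int *)rptr_resolve(p));
    return 0;
}
```

Because the pointer carries no host-specific virtual address, it can be handed to a CXL device for offloaded pointer chasing or picked up by newly started compute without a serialization pass.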
Feb 27
We focused on remoteable pointers seen in prior art such as Carbink and AIFM
We went around the room to see what other works have recently shown good implementations, and Fusee from Huawei and WASM were brought up.
Feb 20 (Monday) 1pm
Grad Student Researcher Lokesh Jaliminche led our discussion on "Impact of CXL on Computational Storage"
Feb 13 (Monday) 1pm
We discussed the USENIX OSDI'22 Carbink paper linked at the bottom of this page
Feb 6 (Monday) 1 pm
Yiwei discussed the rooflines from one of the SC22 presentations, continuing the talk from previous weeks.
Then, we talked about Computational CXL-Memory Solution for Accelerating Memory-Intensive Applications by Sim et al., which appeared in IEEE Computer Architecture Letters, Vol. 22, No. 1 (2023). It addresses how best to combine near-data processing and memory interleaving by architecting a simple load balancer behind low-bandwidth CXL links, aiming for the best of both data-processing bandwidth and performance/Watt, with k-Nearest Neighbor as the representative memory-intensive workload.
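As a toy illustration of the general pattern (not the paper's actual architecture or load balancer), the sketch below interleaves a dataset across several simulated CXL memory expanders, lets each expander's near-data compute reduce its shard to a single candidate, and merges only those small per-device results at the host, so the bulk of the data never crosses the low-bandwidth links:

```c
/*
 * Toy near-data-processing pattern: the dataset is interleaved across
 * several CXL memory expanders, each expander's compute reduces its shard
 * to a local nearest neighbor, and only tiny per-device results cross the
 * low-bandwidth CXL links to be merged at the host.
 */
#include <float.h>
#include <stdio.h>

#define DEVICES 4
#define POINTS_PER_DEVICE 1024
#define DIM 8

static float shard[DEVICES][POINTS_PER_DEVICE][DIM]; /* stand-in for device DRAM */

/* Runs "on the device": scan the local shard and return the index and
 * squared distance of its nearest neighbor, so only ~16 bytes cross the link. */
static void device_nn(int dev, const float *query, int *best_idx, float *best_d2)
{
    *best_d2 = FLT_MAX;
    for (int i = 0; i < POINTS_PER_DEVICE; i++) {
        float d2 = 0.0f;
        for (int k = 0; k < DIM; k++) {
            float d = shard[dev][i][k] - query[k];
            d2 += d * d;
        }
        if (d2 < *best_d2) { *best_d2 = d2; *best_idx = i; }
    }
}

int main(void)
{
    /* Fill shards with a simple deterministic pattern. */
    for (int d = 0; d < DEVICES; d++)
        for (int i = 0; i < POINTS_PER_DEVICE; i++)
            for (int k = 0; k < DIM; k++)
                shard[d][i][k] = (float)((d * POINTS_PER_DEVICE + i + k) % 97);

    float query[DIM] = { 3, 1, 4, 1, 5, 9, 2, 6 };

    /* Host-side merge of the per-device partial results. */
    int best_dev = -1, best_idx = -1;
    float best_d2 = FLT_MAX;
    for (int d = 0; d < DEVICES; d++) {
        int idx = -1; float d2;
        device_nn(d, query, &idx, &d2);
        if (d2 < best_d2) { best_d2 = d2; best_dev = d; best_idx = idx; }
    }
    printf("nearest neighbor: device %d, point %d, squared distance %.1f\n",
           best_dev, best_idx, best_d2);
    return 0;
}
```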
Jan 30 (Monday) 1 pm
Yiwei continued presenting about the CXL booth at Supercomputing 2022.
Jan 23 (Monday) 1 pm
Agenda: a quick update of CXL news and a quick roundtable to hear suggestions about talks people want to present and papers they want discussed this quarter (seven remaining meetings).
Jan 16 (Monday) at 1pm
Graduate student Yiwei Yang (advised by Andrew Quinn) will discuss the design of his CXL Memory simulator and his learnings at Supercomputing 2022.
Jan 10 (Tues) Antonio Barbalace talk at 1PM (Zoom link above)
TITLE
Rethinking Systems Software for Emerging Data Center Hardware
ABSTRACT
Today's data center hardware is increasingly heterogeneous, including several special-purpose and reconfigurable accelerators that sit alongside the central processing unit (CPU). Emerging platforms also include heterogeneous memory – directly attached, NUMA, and over a peripheral bus. Furthermore, processing units (CPUs and/or accelerators) pop up in storage devices, network cards, and along the memory hierarchies (near-data processing architectures) – introducing hardware topologies that didn't exist before!
Existing, traditional systems software has been designed and developed under the assumption that a single computer hosts a single CPU complex with directly attached memory, or NUMA. Therefore, there is one operating system running per computer, and software is compiled to run on a specific CPU complex. However, on emerging platforms this no longer applies, because each different processing unit requires its own operating system and applications, which are not compatible with each other, making a single platform look like a distributed system – even when CPU complexes are tightly coupled. This makes programming hard and hinders a whole set of performance optimizations. Therefore, this talk argues that new systems software is needed to better support emerging non-traditional hardware topologies, and introduces new operating system and compiler design(s) to achieve easier programming and full exploitation of system performance.
BIO
Antonio Barbalace is a Senior Lecturer (Associate Professor) at the School of Informatics of the University of Edinburgh, Scotland. Before that, he was an Assistant Professor in the Computer Science Department at Stevens Institute of Technology, New Jersey. Prior to that, he was a Principal Research Scientist and Manager at the Huawei German Research Center in Munich, Germany. He was a Research Assistant Professor, and before that a Postdoc, at the ECE Department, Virginia Tech, Virginia. He earned a PhD in Industrial Engineering from the University of Padova, Italy, and an MS and BS in Computer Engineering from the same university.
Antonio Barbalace’s research interests include all aspects of system software, embracing hypervisors, operating systems, runtime libraries, and compilers/linkers, for emerging highly-parallel and heterogeneous computer architectures, including near data processing platforms and new generation interconnects with coherent shared memory. His research seeks answers about how to architect or re-architect the entire software stack to ease programmability, portability, enable improved performance and energy efficiency, determinism, fault tolerance, and security. His research work appeared at top systems venues including EuroSys, ASPLOS, VEE, ICDCS, Middleware, EMSOFT, HotOS, HotPower, and OLS.
WEBSITE
http://www.barbalace.it/antonio/
CXL SIG celebrates Daniel's graduation. Congratulations, Dr. Bittman!
Rare Talent. Dissertation Award-level work. Foundational. Bold. Superlatives just flew in the closed-door session of the committee. I have watched Daniel take an idea and commit to it. He has been an inspiration to his fellow grad students. And he has done justice to the often ignored "Ph." part of the Ph.D. degree.
Daniel's contribution to the rapidly evolving world of memory was recognized at the prestigious USENIX ATC in 2020 with a Best Presentation award. But I have realized the importance of his work in action as I work closely with major SaaS analytics vendors, major semiconductor memory suppliers, and the world's leading virtualization researchers.
As "memoryness" spreads beyond RDMA in space through disaggregation and in time through persistent memory, the need to rescue translation contexts from process abstraction has become paramount. Daniel has done that by placing a foreign object table in every memory object and done for memory what S3 did for storage. We at Elephance -- where Daniel is a cofounder -- believe in this idea deeply and are committed to bring it to the world of CXL, working closely with our customers and partners in industry and government and with our growing network of academic and research collaborators.
Pankaj Mehra, President and CEO
Elephance Memory, Inc.
2022
So how bad is CXL latency, really? Find out in our newsfeed.
We will kick off 2023 with a talk by Antonio Barbalace from the University of Edinburgh.
The final meeting of CXL SIG for 2022 was on Dec 6 where Yiwei Yang summarized near-memory processing for genomics.
Learnings from Existing Disaggregated Memory Systems
On Nov 29, Pankaj will discuss the recently published Samsung study of simulating a CXL-attached DRAM-swapped-to-SSD configuration as far memory and how well workloads cope with it. See a quick summary in our CXL in the News subpage. (slides)
On Nov 22, Andrew Quinn discussed Pond. Pond builds on a previously released manuscript from the same authors. The system studies memory pooling for increasing DRAM utilization in data centers and thereby reducing the cost of using and maintaining main memory. In particular, Pond looks at implementing memory pools using the CXL standard. The paper first analyzes cloud production traces to show that small-scale memory pooling (i.e., across only 8-16 sockets) is sufficient to achieve most of the cost benefits of memory pooling. It then shows that a machine learning model can accurately predict the memory allocation size required for a black-box application. Pond would decrease DRAM costs by 7% with performance that is within 1-5% of standard systems (i.e., same-NUMA-node allocations).
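As a rough sketch of the allocation-split idea (the predictor, the safety margin, and the numbers below are hypothetical, not Pond's actual model or mechanism), a prediction of how much of a VM's memory will actually be touched can drive how much of its request is served from local DRAM versus the CXL pool:

```c
/*
 * Simplified, hypothetical allocation-split logic: keep the predicted hot
 * portion of a VM's memory (plus a safety margin) in local DRAM and serve
 * the remainder from the CXL pool.  A mispredicted VM could be given an
 * all-local allocation at its next placement.
 */
#include <stdio.h>

struct placement { unsigned local_gib; unsigned pool_gib; };

static struct placement place_vm(unsigned request_gib,
                                 double predicted_touched_fraction,
                                 double safety_margin)
{
    double local_frac = predicted_touched_fraction + safety_margin;
    if (local_frac > 1.0) local_frac = 1.0;
    struct placement p;
    p.local_gib = (unsigned)(request_gib * local_frac + 0.5);
    p.pool_gib  = request_gib - p.local_gib;
    return p;
}

int main(void)
{
    /* Example: a 64 GiB VM predicted to touch 60% of its memory. */
    struct placement p = place_vm(64, 0.60, 0.10);
    printf("64 GiB VM: %u GiB local DRAM, %u GiB CXL pool\n",
           p.local_gib, p.pool_gib);
    return 0;
}
```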
On Nov 8th, Pankaj led the discussion of the Microsoft Research position paper "Disaggregation and the Application" (HotOS 2020) by Sebastian Angel from UPenn and Mihir Nanavati and Siddhartha Sen from MSR. (paper)
On Oct 25th, Pankaj led the first of several discussions on existing disaggregated memory systems used by cloud service providers.
Workloads
On Nov 1st, we had another presentation from Pooneh about results from her research on workloads and working sets, which she first shared briefly on Sep 6.
Microarchitecture Co-evolution with CXL
On October 4th and 18th, we revisited a topic (see this folder) we started discussing on Sep 13, when Andrew brought up the "A Case Against (Most) Context Switches" paper (link). From reading that paper we learned how CPU microarchitecture could evolve in response to longer memory latencies. That made me wonder what else is out there, and one new and one old paper jumped to the top of my reading pile.
Linux HMM scope, co-evolution with CXL, and limitations
Discussion of Sep 20 led by Yiwei (SMDK & HMSDK)
Yiwei discussed the stack below.
On Sep 27, 2022 Yiwei updated with details showing where jemalloc and libnuma hook into SMDK. We need to run this by Samsung collaborators for accuracy and to better understand their roadmap (link to Yiwei's SMDK deep dive presentation)
Frank Hady (Intel) asked whether the Linux stack needs to be modified to use CXL. HMM is evolving to get there. Current stacks use a variety of emulation techniques and I/O mechanisms, as well as PMDK and DAX remnants. A native CXL Linux stack will emerge from these experiences.
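For reference, here is a minimal sketch of what the application-facing side of such a stack can look like today, assuming the CXL expander is exposed to Linux as a CPU-less NUMA node (the model SMDK builds on); the node number below is an assumption to be checked with numactl -H:

```c
/*
 * Allocate a buffer on an assumed CXL-backed NUMA node using libnuma.
 * The node id (CXL_NODE) is an assumption for illustration; verify the
 * actual topology with `numactl -H`.  Link with -lnuma.
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define CXL_NODE 2            /* assumed node id of the CXL-backed memory */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available on this system\n");
        return 1;
    }
    if (CXL_NODE > numa_max_node()) {
        fprintf(stderr, "node %d does not exist here\n", CXL_NODE);
        return 1;
    }

    size_t len = 64UL << 20;  /* 64 MiB */
    void *buf = numa_alloc_onnode(len, CXL_NODE);  /* bind pages to the CXL node */
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    memset(buf, 0, len);      /* touch the pages so they are actually placed */
    printf("allocated %zu MiB on NUMA node %d\n", len >> 20, CXL_NODE);
    numa_free(buf, len);
    return 0;
}
```

The jemalloc and libnuma hooks Yiwei traced in SMDK aim to make this kind of node-targeted placement transparent to unmodified applications.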
Discussion of Sep 13 led by James/Pankaj/Andrew (Future CPUs; CXL software ecosystem more broadly than just HMM)
Paper discussed: A Case Against (Most) Context Switches (link)
UCSC-only content discussed: Workloads spreadsheet (UCSC-internal link)
Discussion of Sep 6 led by Pooneh (HMM)
Past Topics
Why CXL? Will it make sense despite added latency and cost in the memory access path relative to DDR DRAM?
Document WIP