2024
CXL SIG on holiday.
May 21: Lokesh will lead the discussion of "Low-overhead General-purpose Near-Data Processing in CXL Memory Expanders", a paper from POSTECH and SK Hynix. paper
May 14: Pankaj will lead a discussion on new topics.
May 7: Esteban Ramos will discuss the paper "Novel Composable and Scaleout Architectures Using Compute Express Link" by D. D. Sharma of Intel
Apr 30: Jayjeet will discuss "emucxl: an emulation framework for CXL-based disaggregated memory applications". paper
Apr 23: Yiwei will lead the discussion of updated results, recently reported by industry authors, on interleaving DDR and CXL pooled memory on Astera Labs hardware. paper The main results, using Astera Labs CXL parts on an Emerald Rapids based server, showed that the bandwidths of DDR and CXL add up effectively under interleaving.
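The claim that DDR and CXL bandwidths "add up" under interleaving can be sanity-checked with simple arithmetic. The sketch below uses illustrative numbers (not the figures from the reported results): combined bandwidth is maximized when the interleave ratio matches each tier's share of peak bandwidth, and is otherwise limited by whichever tier saturates first.

```python
# Illustrative sketch: effective bandwidth when accesses are interleaved
# across DDR and CXL-attached memory. Numbers are assumptions, not the
# measured Astera Labs results.

def interleaved_bandwidth(bw_ddr, bw_cxl, frac_cxl):
    """Combined bandwidth when a fraction frac_cxl of traffic goes to CXL.
    Each tier saturates when its share of traffic reaches its own peak,
    so the slower-to-saturate tier caps the total."""
    frac_ddr = 1.0 - frac_cxl
    cap_ddr = bw_ddr / frac_ddr if frac_ddr else float("inf")
    cap_cxl = bw_cxl / frac_cxl if frac_cxl else float("inf")
    return min(cap_ddr, cap_cxl)

# Example: 300 GB/s of DDR plus 60 GB/s of CXL, interleaved 5:1,
# yields roughly 360 GB/s -- the two bandwidths add up.
print(interleaved_bandwidth(300, 60, 1 / 6))
```

A mismatched ratio (say, 1:1 interleave across unequal tiers) would instead bottleneck on the CXL side, which is why matched interleave ratios matter in these measurements.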
Apr 9: We discussed the latest updates on CXL products and prototypes.
Apr 2: Tim Pezarro (Senior Product Manager at Microchip in Burnaby, Canada) joined us remotely to speak about their smart memory controllers (https://www.microchip.com/en-us/products/memory/smart-memory-controllers).
Mar 26: Yiwei Yang led the discussion of "Salus: Efficient Security Support for CXL-Expanded GPU Memory" and "PIMFlow: Compiler and Runtime Support for CNN Models on Processing-in-Memory DRAM"
Mar 5: CXL-ANNS. Pankaj and/or Jayjeet will talk about the KAIST paper on billion-scale approximate nearest neighbor search over CXL, from USENIX ATC 2023. link
Feb 20: Pankaj walked the team through GPU memory usage during long-context inference in Transformer-based LLMs
Feb 13: Lokesh followed up his Dec 5 presentation with a related, more recent paper from Apple, "LLM in a flash: Efficient Large Language Model Inference with Limited Memory": link
Feb 7: SPECIAL INDUSTRY SESSION ON MEMORY POOLING
[5] Presenters' intro and opening remarks, Pankaj Mehra
[10+10] HotNets'23 paper "A Case against CXL Memory Pooling" link (Philip Levis from Google, Stanford)
[20+5] IEEE Micro paper "Design Tradeoffs in CXL Based Memory Pools for Cloud Platforms" by Berger, et al. (Daniel Berger from Microsoft, UW)
[10] Moderated Q&A: Short, written, clarifying questions only. Feel free to presubmit to moderator for sharing with authors
Jan 30: Allen led a discussion on Arm's CMS presentation about area considerations of snoop filters in CXL SoCs.
Jan 23: We discussed Database Kernels: Seamless Integration of Database Systems and Fast Storage via CXL link
Jan 16: Yiwei discussed SDM: Sharing-enabled Disaggregated Memory System with Cache Coherent Compute Express Link link
Jan 9: Pooneh S. presented the gShard paper from Hot Chips 32 (2020) link We discussed the compute-memory tradeoff causing nearly 40 percent of activations to be recomputed.
2023
Dec 5: Lokesh will present a paper from Kioxia about using XLFLASH in the GPU's memory hierarchy because supposedly GPU algorithms for graph traversal are more latency tolerant than CPU-oriented algorithms. Shintaro Sano, Yosuke Bando, Kazuhiro Hiwada, Hirotsugu Kajihara, Tomoya Suzuki, Yu Nakanishi, Daisuke Taki, Akiyuki Kaneko, and Tatsuo Shiozawa. 2023. GPU Graph Processing on CXL-Based Microsecond-Latency External Memory. In Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W '23). Association for Computing Machinery, New York, NY, USA, 962–972. https://doi.org/10.1145/3624062.3624173 link
Nov 28: Pankaj will catch the team up on OCP CMS activities planned for 2024 and the project on Acceleration Interfaces he'll lead.
Nov 14: Our own Achilles Benetopoulos will discuss "A Cloud-Scale Characterization of Remote Procedure Calls" from SOSP'23 link
Nov 7: Open discussion: how transparent is transparent page placement, and what are its hidden costs?
Oct 31: Our own Yiwei Yang will present the SOSP paper titled "Memtis: Efficient Memory Tiering with Dynamic Page Classification and Page Size Determination", covering profile-guided and hardware-assisted tiering. link
Oct 24: OCP Global Summit and Samsung Memory Tech Day recap
Oct 17: No meeting due to OCP
Oct 10: Yiwei Yang will present Partial Failure Resilient Distributed Memory. link From the paper's abstract: "CXL-SHM, an automatic distributed memory management system based on reference counting. The reference count maintenance in CXL-SHM is implemented with a special era-based non-blocking algorithm. Thus, there are no blocking synchronization, memory leak, double free, and wild pointer problems, even if some participating clients unexpectedly fail without freeing their possessed memory references."
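The key idea behind surviving client failures without leaks is to track each client's reference contributions separately, so a crashed client's references can be revoked wholesale. The sketch below is a simplified single-process model of that idea (names and structure are ours, not the paper's; CXL-SHM's actual algorithm is non-blocking and lives in shared memory):

```python
# Simplified model: per-client reference counts let a manager revoke a
# dead client's references in one sweep, so nothing leaks when a client
# fails without freeing. This is NOT the paper's era-based non-blocking
# algorithm -- just an illustration of the failure-resilience property.
from collections import defaultdict

class FailureResilientRefCounter:
    def __init__(self):
        # object -> client -> outstanding references
        self.counts = defaultdict(lambda: defaultdict(int))

    def acquire(self, obj, client):
        self.counts[obj][client] += 1

    def release(self, obj, client):
        self.counts[obj][client] -= 1
        self._maybe_free(obj)

    def client_failed(self, client):
        # Revoke every reference the dead client held.
        for obj in list(self.counts):
            self.counts[obj].pop(client, None)
            self._maybe_free(obj)

    def _maybe_free(self, obj):
        if obj in self.counts and sum(self.counts[obj].values()) <= 0:
            del self.counts[obj]   # stand-in for returning memory to the pool

rc = FailureResilientRefCounter()
rc.acquire("chunk0", "A"); rc.acquire("chunk0", "B")
rc.release("chunk0", "A")
rc.client_failed("B")              # chunk0 is reclaimed despite B crashing
print("chunk0" in rc.counts)       # False
```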
Oct 3: Yiwei Yang will present her work on CXLMemUring: A Hardware Software Co-design Paradigm for Asynchronous and Flexible Parallel CXL Memory Pool Access. Link to the paper
Sep 26: A Sea of Accelerators? Whether, and what, should be accelerated for the data-intensive work of some of the largest services in the world. Vidushi Dadu of Google will open our Fall Quarter meetings by exploring the characteristics of these workloads and the potential for accelerating them. Link to the paper.
==
June 13: Pankaj led the discussion of the "Unified Buffer: Compiling Image Processing and Machine Learning Applications to Push-Memory Accelerators" paper from Stanford. We discussed the trade-offs of push and pull memory approaches, and how adopting the push-memory view could help simplify data movement in systems.
May 30: Discussion of "A case for CXL-centric Server Processors" paper from Georgia Tech led by Pankaj. From the paper: "replacing all DDR interfaces to the processor with the more pin-efficient CXL interface."
We haven't done this in a while, so we -- Yiwei and Pankaj -- will take a moment to review the bandwidth (and therefore capacity) per pin advantage of CXL versus DDR interfaces, while also considering Serdes area. Interestingly, one of these links claims DDR memory uses 380 pins per channel and the other, 288 (the right answer), even though both are posted on the CXL Consortium website.
Pankaj will briefly recap the latency advantages of the new native CXL IPs versus traditional PCIe, which reduce RTT latency to 40 ns (down from 100 ns) by adopting a different Serdes implementation.
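The per-pin comparison is easy to run as a back-of-envelope calculation. The figures below are illustrative assumptions (DDR5-4800 single-channel bandwidth, a rough signal-pin estimate for a x8 PCIe 5.0 link), not numbers from the discussion:

```python
# Back-of-envelope bandwidth-per-pin comparison of DDR vs CXL interfaces.
# All figures are illustrative assumptions; exact pin counts vary by
# package, generation, and how power/ground pins are counted.
ddr5_pins = 288            # pins per DDR5 DIMM connector (the "right answer" above)
ddr5_bw   = 38.4           # GB/s for one DDR5-4800 channel

cxl_lanes = 8              # CXL over PCIe 5.0, x8 link at 32 GT/s
cxl_pins  = cxl_lanes * 4  # rough estimate: 2 differential pairs per lane
cxl_bw    = 32.0           # GB/s per direction for a x8 Gen5 link

print(f"DDR5: {ddr5_bw / ddr5_pins:.2f} GB/s per pin")
print(f"CXL : {cxl_bw  / cxl_pins:.2f} GB/s per pin")
```

Even with conservative assumptions, the serialized CXL link delivers several times the bandwidth per signal pin, which is the crux of the "CXL-centric server processor" argument; the trade is added Serdes area and latency.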
May 23: Pankaj previewed his International Supercomputing Conference 2023 (Hamburg) Exacomm Workshop talk on "Principles for Optimizing Data Movement in Emerging Memory Hierarchies."
May 16 Discussion of IEEE Micro paper "Design Tradeoffs in CXL Based Memory Pools for Cloud Platforms" by Berger and Ernst
May 2 (Tuesday): Priya Duraiswamy (Google), lead author of the ASPLOS'23 TMTS (Transparent Memory Tiering System) paper, will lead the discussion of their new work on a two-tier memory system in which the slow tier holds about 25 percent of the memory with minimal impact on performance. They use job classification to identify the jobs that can effectively use slower memory, and proactively and stably move data into the cold tier, with a demonstrated ability to maintain a low promotion rate and thus a low expected access latency across tiers.
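The expected-latency argument is worth making concrete: even with a quarter of capacity in the slow tier, average access latency stays near DRAM as long as classification and promotion keep the slow-tier *access* rate low. The latencies below are illustrative placeholders, not TMTS measurements:

```python
# Sketch of the expected access latency across two memory tiers.
# Latency numbers are illustrative assumptions, not TMTS results.
def expected_latency(p_slow, lat_fast_ns=100, lat_slow_ns=300):
    """Average latency when a fraction p_slow of accesses hit the slow tier."""
    return (1 - p_slow) * lat_fast_ns + p_slow * lat_slow_ns

# 25% of CAPACITY in the slow tier, but only ~5% of ACCESSES land there:
print(expected_latency(0.05))   # 110.0 ns -- a 10% latency premium
```

This capacity-vs-access distinction is why a low promotion rate matters: promotions are what keep hot pages (and hence most accesses) in the fast tier.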
Apr 25 (No meeting due to moderator on travel) --
Apr 18: Discussion of asynchronous access to far memory, led by Yiwei
April 11 (Tuesday): "Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices" by Yan Sun, Yifan Yuan, Zeduo Yu, Reese Kuper, Ipoom Jeong, Ren Wang, and Nam Sung Kim
The paper carefully evaluates how best to use page interleaving between a large DDR DRAM and a small CXL DRAM. It advocates using the more pipelinable non-temporal stores on SPR processors, as well as offloading far-memory manipulation to the new DSA. We compared this with more proactive approaches, such as the one Priya will describe at our May 2 meeting, and found the latter to be the more likely path for hyperscalers.
Mar 20 (Monday): An update from Andrew/Pooneh on how taking a tiered view of data heat and latency tolerance shows that data-intensive applications may be able to utilize Pond-style lower tiers quite well.
Mar 13
Mar 6
Continue the discussion on remoteable pointers by deep-diving on FUSEE (FAST '23) and WASM (WebAssembly), led by Yiwei Yang
We will be weighing implementation ideas against three critical requirements of remoteable pointers:
Must work from the source as pointers even when the memory is far (requires essentially no extra implementation in CXL)
Must work at the device for offloading pointer chasing to CXL memory device or pre-CXL memory node
Must work at newly started compute without the friction of serialization/deserialization, enabling independent scaling of memory and compute
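One way to meet all three requirements is to make the pointer a position-independent (region, offset) pair rather than a virtual address, so any party that maps the region (source host, CXL device, or a freshly started compute node) can chase it without swizzling or serialization. The sketch below is hypothetical; the names and 8-byte handle layout are our illustration, not a design from FUSEE or WASM:

```python
# Hypothetical sketch: a "remoteable pointer" as a position-independent
# (region_id, offset) pair. Because the handle itself is just bytes, it
# can live inside the shared region, and any mapper can dereference it.
import struct

REGIONS = {}                      # region_id -> bytearray (stands in for a CXL-mapped segment)

def rptr(region_id, offset):
    # Pack the handle into 8 bytes so it can be stored in shared memory.
    return struct.pack("<II", region_id, offset)

def deref(handle):
    region_id, offset = struct.unpack("<II", handle)
    return REGIONS[region_id], offset

# A linked node stored entirely as offsets: any host (or device doing
# offloaded pointer chasing) that maps region 1 can follow the link.
REGIONS[1] = bytearray(64)
REGIONS[1][16:24] = rptr(1, 32)   # node at offset 16 links to offset 32

buf, off = deref(REGIONS[1][16:24])
print(off)                        # 32
```

The design choice this illustrates: the "zero serialization" property falls out because the handle is already its own wire format, so a new compute node attaches to the region and dereferences immediately.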
Feb 27
We focused on remoteable pointers as seen in prior art such as Carbink and AIFM
We went around the room to see which other works have recently shown good implementations; FUSEE from Huawei and WASM were brought up
Feb 20 (Monday) 1pm
Feb 13 (Monday) 1pm
Feb 6 (Monday) 1 pm
Yiwei discussed the rooflines from one of the SC22 presentations continuing her talk.
Then, we talked about "Computational CXL-Memory Solution for Accelerating Memory-Intensive Applications" by Sim et al., which appeared in IEEE Computer Architecture Letters, Vol. 22, No. 1 (2023). It addresses how best to combine near-data processing and memory interleaving by architecting a simple load balancer behind low-bandwidth CXL links, getting the best of both data-processing bandwidth and performance/Watt, in the context of k-Nearest Neighbor as the representative memory-intensive workload.
The agenda is a quick update on CXL news and a quick roundtable to hear suggestions about talks people want to present and papers they want discussed this quarter (remaining seven meetings).
Graduate student Yiwei Yang (advised by Andrew Quinn) will discuss the design of her CXL memory simulator and her learnings at Supercomputing 2022.