[July 17, 2024] Speculation Techniques to Tackle Data Dependency in CPU Microarchitecture (Why Apple CPUs are faster than others)
Speaker: Dr. Jungi Jeong (CPU Architect @ Google)
[August 23, 2023] Designing and Building New OS for Cloud Services
Speaker: Prof. Youngjin Kwon @ KAIST
Abstract: In this talk, I will share my experience building two OSes for emerging cloud services. Memory disaggregation has reshaped the landscape of cloud systems by physically separating compute and memory nodes, improving utilization. Early kernel paging-based approaches offer a transparent virtual memory abstraction for remote memory but suffer from expensive page fault handling. We revisit paging-based approaches and challenge the assumption that paging itself limits their performance; we posit that the overhead of paging-based approaches is not a fundamental limitation. We propose DiLOS, a new library operating system (LibOS) specialized for paging-based memory disaggregation. We have revamped the page fault handler to do away with the swap cache and incorporated known techniques in our prefetcher, page manager, and communication module for performance optimization. Furthermore, we provide APIs to augment the LibOS with application semantics.

OS containers have become a foundational component of cloud systems, encapsulating the kernel, user libraries, and applications to reduce operational costs and enhance manageability. While OS containers present the illusion of isolated kernel code and state per process, they share the same underlying kernel, raising concerns about security and fault isolation. Previous solutions address these isolation concerns with virtual-machine-based systems that leverage hardware isolation, but this approach often introduces significant performance overhead. In response, we introduce a new approach, CofferOS, that leverages Rust's safety features to enhance container isolation. We introduce the Coffer abstraction, implemented as a class, which guarantees that instances never directly access the code and state of others; this isolation principle is achieved by encapsulating kernel code within each Coffer instance. CofferOS containerizes kernels and processes within Coffer instances, strengthening security and fault isolation compared to traditional OS containers.
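To make the swap-cache point concrete, here is a minimal sketch of a page fault path without a swap cache, in the spirit the abstract describes. All names are hypothetical, not the DiLOS API: on a fault, the page is fetched from the remote memory node and mapped directly, with no intermediate cache layer.

```python
# Sketch of a disaggregated-memory fault path that skips the swap
# cache (illustrative names only, not the DiLOS API).
PAGE_SIZE = 4096

class RemoteMemory:
    """Stands in for the memory node reachable over the network."""
    def __init__(self):
        self.pages = {}                        # page frame number -> bytes
    def fetch(self, pfn):
        return self.pages.get(pfn, bytes(PAGE_SIZE))
    def store(self, pfn, data):
        self.pages[pfn] = bytes(data)

class AddressSpace:
    def __init__(self, remote):
        self.remote = remote
        self.mapped = {}                       # pfn -> local page contents
    def access(self, addr):
        pfn, offset = divmod(addr, PAGE_SIZE)
        if pfn not in self.mapped:             # page fault
            # A Linux-style swap path would first insert the page into
            # a swap cache; here the fetched page is mapped directly.
            self.mapped[pfn] = bytearray(self.remote.fetch(pfn))
        return self.mapped[pfn][offset]
    def evict(self, pfn):
        # Write back to the memory node and unmap locally.
        self.remote.store(pfn, self.mapped.pop(pfn))

aspace = AddressSpace(RemoteMemory())
aspace.access(0x1000)                          # first touch: fetch + direct map
aspace.evict(1)
```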
[May 25, 2023] On-device Continual Learning over Memory Hierarchy: A System Researcher’s Standpoint
Speaker: Prof. Myeongjae Jeon @ UNIST
Sponsor: Ajou-DREAM BK21 Colloquium
Abstract: Continual Learning (CL) is an emerging machine learning paradigm for edge devices that learn from a continuous stream of tasks. To avoid forgetting knowledge from previous tasks, episodic memory (EM) methods exploit a subset of past samples while learning from new data. Despite promising results, prior studies are predominantly simulation-based and do not adequately address the growing demand for both EM capacity and system efficiency in practical setups. In this talk, I will discuss CarM, a system our research team has developed to meet this demand by employing hierarchical EM management as a key design principle. CarM keeps EM in high-speed RAM to ensure system efficiency and takes advantage of abundant storage to preserve past experiences, mitigating forgetting through efficient sample migration between memory and storage. Recently, we further improved CarM for cost-effectiveness, i.e., achieving high model accuracy without compromising energy efficiency, making it more suitable for energy-sensitive edge devices. To gain insights into achieving cost-effectiveness, we first explore the design space of CarM. Miro, our new system runtime, carefully integrates these insights into CarM, enabling dynamic CL system configuration based on resource states for high cost-effectiveness. Miro also performs online profiling on parameters with clear accuracy-energy trade-offs and adapts them to optimal values with minimal overhead. CL system research is still in its infancy, with numerous research problems to overcome to ensure deployability. I look forward to engaging in deeper and broader conversations during this seminar.
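The hierarchical EM principle can be sketched in a few lines. This is an illustrative toy, not CarM's actual design: new samples enter a small RAM tier, evictions are archived to a large storage tier, and periodic migration swaps samples between the tiers so replay sees the breadth of history at RAM speed.

```python
# Toy two-tier episodic memory in the spirit of CarM (sizes and
# policies are placeholders, not CarM's actual ones).
import random

class HierarchicalEM:
    def __init__(self, ram_cap, storage_cap):
        self.ram, self.storage = [], []
        self.ram_cap, self.storage_cap = ram_cap, storage_cap
        self.seen = 0

    def insert(self, sample):
        # New samples enter the fast RAM tier; what RAM evicts is not
        # lost but archived in the large storage tier.
        self.ram.append(sample)
        if len(self.ram) > self.ram_cap:
            self._archive(self.ram.pop(0))

    def _archive(self, sample):
        # Reservoir sampling keeps storage an unbiased subset of history.
        self.seen += 1
        if len(self.storage) < self.storage_cap:
            self.storage.append(sample)
        elif random.randrange(self.seen) < self.storage_cap:
            self.storage[random.randrange(self.storage_cap)] = sample

    def migrate(self, k):
        # Swap k RAM-resident samples with k drawn from storage, so
        # replay sees old tasks without paying storage latency.
        k = min(k, len(self.ram), len(self.storage))
        for _ in range(k):
            self._archive(self.ram.pop(random.randrange(len(self.ram))))
        self.ram.extend(random.sample(self.storage, k))

    def replay_batch(self, n):
        return random.sample(self.ram, min(n, len(self.ram)))

em = HierarchicalEM(ram_cap=64, storage_cap=4096)
for i in range(10_000):
    em.insert(("sample", i))
    if i % 100 == 99:
        em.migrate(k=16)
print(len(em.ram), len(em.storage), em.replay_batch(4))
```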
[Apr 27, 2023] Out-Of-Order BackProp: An Effective Scheduling Technique for Deep Learning
Speaker: Prof. Jiwon Seo @ Hanyang Univ.
Sponsor: Ajou-DREAM BK21 Colloquium
Abstract: Neural network training requires a large amount of computation, so GPUs are often used for acceleration. While they improve performance, GPUs are underutilized during training. This talk presents out-of-order (ooo) backprop, an effective scheduling technique for neural network training. By exploiting the dependencies of gradient computations, ooo backprop enables their executions to be reordered to make the most of the GPU resources. We show that GPU utilization in single- and multi-GPU training can be commonly improved by applying ooo backprop and prioritizing critical operations. We propose three scheduling algorithms based on ooo backprop. For single-GPU training, we schedule with multi-stream ooo computation to mask the kernel launch overhead. In data-parallel training, we reorder the gradient computations to maximize the overlap of computation and parameter communication; in pipeline-parallel training, we prioritize critical gradient computations to reduce pipeline stalls. We evaluate our optimizations with twelve neural networks and five public datasets. Compared to the respective state-of-the-art training systems, our algorithms improve training throughput by 1.03--1.58× for single-GPU training, 1.10--1.27× for data-parallel training, and 1.41--1.99× for pipeline-parallel training.
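A toy illustration of the dependency structure the abstract exploits, as I read it: a layer's output gradient is on the critical path to earlier layers, while its weight gradient only feeds the parameter update, so weight gradients can run out of order and their all-reduces can overlap the remaining computation. The schedule below is a schematic, not the paper's algorithm.

```python
# Schematic ooo-backprop schedule for the data-parallel case: output
# gradients first (critical path), weight gradients reordered after,
# each followed by an async all-reduce that overlaps later compute.

def ooo_schedule(num_layers):
    ops = []
    # Critical path: propagate output gradients from the last layer
    # down to the first without waiting on any weight gradient.
    for l in reversed(range(num_layers)):
        ops.append(("grad_output", l))
    # Weight gradients run afterwards, out of their usual order; each
    # all-reduce is issued immediately so communication overlaps the
    # remaining weight-gradient computations.
    for l in reversed(range(num_layers)):
        ops.append(("grad_weight", l))
        ops.append(("all_reduce_async", l))
    return ops

for op in ooo_schedule(3):
    print(op)
```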
[Mar 30, 2023] AI 서비스를 위한 클라우드 자원관리의 이슈와 해법
Speaker: Prof. Euiseong Seo @ SKKU
Sponsor: Ajou-DREAM BK21 Colloquium
Abstract: Because performing AI training and serving on self-managed servers is disadvantageous in terms of cost and resource efficiency, many companies have recently been migrating to cloud AI services. In cloud AI services, the cloud vendor manages the resources for training and serving. Consequently, the vendor stands to gain substantially by handling the training and serving workloads of its diverse tenants in a resource-efficient manner. This talk introduces the scheduling inefficiencies that arise in training clusters and approaches to resolving them, as well as techniques for operating serving clusters that improve not only resource efficiency but also the increasingly important energy efficiency. Finally, I will present recent research results on improving the performance efficiency of serving clusters through dynamic batching.
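Dynamic batching, the final topic, admits a compact sketch. This is a generic version of the technique, not the speaker's system: requests accumulate until either a maximum batch size or a latency deadline is hit, then are dispatched in one model pass.

```python
# Minimal dynamic-batching serving loop (generic sketch; thresholds
# and the model call are placeholders).
import queue
import time

def serve(requests: queue.Queue, run_model, max_batch=8, max_wait_s=0.01):
    while True:
        batch = [requests.get()]                 # block for the first request
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                            # deadline hit: dispatch now
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        run_model(batch)                         # one GPU pass per batch
```

The trade-off the loop exposes is the usual one: a larger `max_batch` raises GPU efficiency, while a smaller `max_wait_s` bounds per-request latency.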
[Jan 20, 2023] Algorithm-Hardware-Software Co-Design to Build Specialized Systems
Speaker: Prof. Jongse Park @ KAIST
Sponsor: Ajou-DREAM BK21 Colloquium
Abstract: Modern retrospective video analytics systems leverage a cascade architecture to mitigate the bottleneck of computing deep neural networks (DNNs). However, existing cascades suffer from two limitations: (1) the decoding bottleneck is either neglected or circumvented at a significant compute and storage cost for pre-processing; and (2) the systems are specialized for temporal queries and lack spatial query support. In this talk, I will present CoVA, a novel cascade architecture that splits the cascade computation between the compressed domain and the pixel domain to address the decoding bottleneck while supporting both temporal and spatial queries. CoVA organizes the analysis into three major stages, of which the first two are performed in the compressed domain and the last in the pixel domain. First, CoVA detects occurrences of moving objects (called blobs) over a set of compressed frames (called tracks). Then, using the track results, CoVA prudently selects a minimal set of frames to obtain label information and decodes only those frames to compute the full DNNs, alleviating the decoding bottleneck. Lastly, CoVA associates tracks with labels to produce the final analysis results, on which users can process both temporal and spatial queries.
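The frame-selection step lends itself to a small sketch. The greedy set cover below is an illustrative stand-in; the abstract does not specify CoVA's actual selection algorithm. Given tracks in the compressed domain, we pick a small set of frames whose decoding suffices to label every track, then propagate labels along the tracks.

```python
# Hypothetical minimal-frame selection: cover all tracks with as few
# decoded frames as possible, then label tracks from DNN detections.

def select_frames(tracks):
    """tracks: {track_id: set(frame_ids)} -> small covering frame set."""
    uncovered, chosen = set(tracks), set()
    while uncovered:
        # Greedily pick the frame appearing in the most unlabeled tracks.
        best = max({f for t in uncovered for f in tracks[t]},
                   key=lambda f: sum(f in tracks[t] for t in uncovered))
        chosen.add(best)
        uncovered -= {t for t in uncovered if best in tracks[t]}
    return chosen

def label_tracks(tracks, decode, dnn):
    labels = {}
    for frame in select_frames(tracks):
        detections = dnn(decode(frame))          # full DNN, decoded frames only
        for t, frames in tracks.items():
            if frame in frames and t not in labels:
                labels[t] = detections
    return labels

tracks = {"t1": {3, 7}, "t2": {7, 9}, "t3": {12}}
print(select_frames(tracks))                     # e.g. {7, 12}
```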
[Aug 04, 2022] DAOS: Data Access-aware Operating System
Speaker: Dr. SeongJae Park @ Amazon (Kernel Development Engineer)
Sponsor: Ajou-DREAM BK21 Colloquium
Abstract: In data-intensive workloads, data placement and memory management are inherently difficult: the programmer and the operating system have to choose between (combinations of) DRAM and storage, replacement policies, and paging sizes. Efficient memory management is based on fine-grained data access patterns driving placement decisions. Current solutions in this space cannot be applied to general workloads and production systems due to either unrealistic assumptions or prohibitive monitoring overheads. To overcome these issues, we introduce DAOS, an open-source system for general data access-aware memory management. DAOS provides a data access monitoring framework that makes practical, best-effort trade-offs between overhead and accuracy. The memory management engine of DAOS allows users to implement their own access-aware management with no code, just simple configuration schemes. For system administrators, DAOS provides a runtime system that auto-tunes the schemes for user-defined objectives in finite time. We evaluated DAOS on commercial production systems as well as state-of-the-art benchmarks. DAOS achieves up to 12% performance improvement and 91% memory savings. DAOS has been upstreamed and is available in the Linux kernel.
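The overhead/accuracy trade-off can be illustrated with a toy model of region-based monitoring. The names and policy here are hypothetical (the real kernel mechanism samples page-table access bits): each region costs one check per interval, and regions of similar hotness are merged so monitoring overhead stays bounded regardless of memory size.

```python
# Toy region-based access monitor: fixed per-interval cost via one
# sampled page per region, plus merging of similar neighbors.
import random

class Region:
    def __init__(self, start, end):
        self.start, self.end, self.nr_accesses = start, end, 0

def monitor_interval(regions, was_accessed):
    for r in regions:
        page = random.randrange(r.start, r.end)   # one sampled page per region
        if was_accessed(page):
            r.nr_accesses += 1                    # whole region assumed alike

def merge_similar(regions, threshold):
    merged = [regions[0]]
    for r in regions[1:]:
        if abs(r.nr_accesses - merged[-1].nr_accesses) <= threshold:
            merged[-1].end = r.end                # coalesce similar neighbors
        else:
            merged.append(r)
    return merged

regions = [Region(0, 512), Region(512, 1024)]     # page-number ranges
for _ in range(10):
    monitor_interval(regions, lambda page: page < 512)  # first half is hot
for r in merge_similar(regions, threshold=2):
    print(r.start, r.end, r.nr_accesses)
```

A configuration scheme of the kind the abstract mentions would then act on these counters, e.g. "reclaim regions not accessed for N intervals."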
[Feb 25, 2021] Fixing the pothole on the road: Improving address translation to keep up performance (Online Zoom)
Speaker: Prof. Chang Hyun Park @ Uppsala University
Sponsor: Ajou-DREAM BK21 Colloquium
Abstract: The availability of large pages has dramatically improved the efficiency of address translation for applications that use large contiguous regions of memory. However, large pages can be difficult to allocate due to fragmented memory, non-movable pages, or the need to split a large page into regular pages when part of the large page is forced to have a different permission status from the rest. Furthermore, they can be expensive due to memory bloating caused by sparse accesses to application data. In this work, we enable the allocation of large 2MB pages even in the presence of fragmented physical memory via perforated pages. Perforated pages permit the OS to punch 4KB page-sized holes in the physical address range allocated to a large page and re-map them to other addresses as needed. This not only lets the system benefit from large pages in the presence of fragmentation, but also allows different permissions to exist within a large page, enhancing sharing flexibility. In addition, it allows unused parts of a large page to be used elsewhere, mitigating memory bloating. To minimize changes to the system, perforated pages reuse the 4KB-level page table entries to store the hole locations and translate holes into regular 4KB pages. By enabling large pages in the presence of physical memory fragmentation, perforated pages increase the applicability and resulting benefits of large pages with only minor changes to the hardware and OS. In this work, we evaluate the effectiveness of perforated pages with timing simulations under diverse and realistic fragmentation scenarios.
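A toy software model of the translation path makes the mechanism concrete. The field layout is illustrative (in hardware, hole locations live in the reused 4KB-level page table entries): non-hole offsets translate with the large-page base, while holes fall through to ordinary 4KB mappings.

```python
# Toy perforated-page translation: a 2MB mapping plus a map of punched
# 4KB holes that redirect to regular 4KB frames.
PAGE_4K, PAGE_2M = 4096, 2 * 1024 * 1024
SUBPAGES = PAGE_2M // PAGE_4K                      # 512 subpages per 2MB page

class PerforatedPage:
    def __init__(self, virt_base, phys_base):
        self.virt_base, self.phys_base = virt_base, phys_base
        self.holes = {}                            # subpage index -> 4KB frame

    def punch_hole(self, index, new_frame):
        """Remap one 4KB subpage elsewhere (e.g. different permissions)."""
        self.holes[index] = new_frame

    def translate(self, vaddr):
        offset = vaddr - self.virt_base
        assert 0 <= offset < PAGE_2M
        idx = offset // PAGE_4K
        if idx in self.holes:                      # hole: regular 4KB mapping
            return self.holes[idx] * PAGE_4K + offset % PAGE_4K
        return self.phys_base + offset             # contiguous 2MB mapping

pp = PerforatedPage(virt_base=0x200000, phys_base=0x40000000)
pp.punch_hole(index=3, new_frame=0x12345)
print(hex(pp.translate(0x200000 + 3 * PAGE_4K + 0x10)))  # via the hole
print(hex(pp.translate(0x200000 + 0x10)))                # via the 2MB base
```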
[Oct 17, 2019] Storage Systems
Speaker: Jinki Han @ PetaIO
Sponsor: CSE Colloquium
[Sep 26, 2019] Stream Analytics on High Bandwidth Hybrid Memory
Speaker: Prof. Myeongjae Jeon @ UNIST
Sponsor: CSE Colloquium
Abstract: Stream analytics has an insatiable demand for memory and performance. Emerging hybrid memories combine commodity DDR4 DRAM with 3D-stacked High Bandwidth Memory (HBM) DRAM to meet such demands. However, achieving this promise is challenging because (1) HBM is capacity-limited and (2) HBM boosts performance best for workloads with sequential access and high parallelism. At first glance, stream analytics appears to be a particularly poor match for HBM because it has high capacity demands and its most demanding computations, data grouping operations, use random access. In this talk, I will present the design and implementation of StreamBox-HBM, a stream analytics engine that exploits hybrid memories to achieve scalable high performance. StreamBox-HBM performs data grouping with sequential-access sorting algorithms in HBM, in contrast to the random-access hashing algorithms commonly used in DRAM. StreamBox-HBM uses HBM solely to store Key Pointer Array (KPA) data structures that contain only partial records (keys and pointers to full records) for grouping operations. It dynamically creates and manages prodigious data and pipeline parallelism, choosing when to allocate KPAs in HBM. It dynamically optimizes for both the high bandwidth and limited capacity of HBM and the limited bandwidth and high capacity of standard DRAM. StreamBox-HBM achieves 110 million records per second and 238 GB/s memory bandwidth while effectively utilizing all 64 cores of Intel's Knights Landing, a commercial server with hybrid memory. It outperforms stream engines that use sequential-access algorithms without KPAs by 7× and stream engines that use random-access algorithms by an order of magnitude in throughput. To the best of our knowledge, StreamBox-HBM is the first stream engine optimized for hybrid memories.
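The KPA idea reduces to a short sketch: grouping is done by sorting a compact array of (key, pointer) pairs, which is sequential-access-friendly and small enough to fit HBM, while full records stay in DRAM. A schematic illustration, not the StreamBox-HBM implementation:

```python
# Grouping via a Key Pointer Array: sort (key, index) pairs, then one
# sequential pass yields the groups; full records are touched only to
# materialize results.
from itertools import groupby

def group_by_kpa(records, key_of):
    # Build the KPA: keys plus indices ("pointers") into full records.
    kpa = sorted((key_of(r), i) for i, r in enumerate(records))
    # A single sequential pass over the sorted KPA produces the groups.
    return {k: [records[i] for _, i in grp]
            for k, grp in groupby(kpa, key=lambda kp: kp[0])}

records = [("ads", 3), ("news", 1), ("ads", 9), ("mail", 2)]
print(group_by_kpa(records, key_of=lambda r: r[0]))
# {'ads': [('ads', 3), ('ads', 9)], 'mail': [('mail', 2)], 'news': [('news', 1)]}
```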
[May 16, 2019] Memory-Centric System Architecture for Data-Driven Computing
Speaker: Prof. Gwangsun Kim @ POSTECH
Sponsor: CSE Colloquium
Abstract: Computing systems face unprecedented challenges in efficiently processing the large amounts of data demanded by data-intensive applications such as big data analytics and machine learning. To achieve high system performance for such applications, it is important to remove the bottlenecks that exist in current systems for memory accesses as well as communication between different processors and accelerators (e.g., GPUs). Meanwhile, recently developed 3D-stacked memory devices such as the Hybrid Memory Cube not only provide high memory bandwidth but also present new opportunities in designing the system interconnect, since such memories can form a memory network. In this talk, I will propose a new system interconnect design, referred to as the Memory-Centric Network (MCN), that leverages the memory network to address the bandwidth bottlenecks of conventional processor-centric network designs. Moreover, the MCN can be extended to interconnect the memory devices of compute accelerators and host processors, creating a Unified Memory Network that addresses the PCIe bottleneck in the system while removing the need for memory copies between processors and accelerators. In addition, to overcome the processor interface bandwidth bottleneck, I will propose Near-Data Processing through the memory network. Lastly, I will discuss future research directions in memory systems based on new technologies currently in development.