What is the challenge?
The complexity of manycore System-on-Chips (SoCs) is growing faster than our ability to manage their overall energy consumption. Further, as SoC design moves towards 3D architectures, the cores' power density increases, leading to unacceptably high peak chip temperatures.
Our solution:
We consider the optimization problem of dynamic power management (DPM) in manycore SoCs under an allowable performance penalty (say, 5%) and an admissible peak chip temperature. We employ a machine learning (ML) based DPM policy, which selects the voltage/frequency (V/F) levels for different clusters of cores as a function of application workload features such as core computation and inter-core traffic. We propose a novel learning-to-search (L2S) framework to automatically identify an optimized sequence of DPM decisions from a large combinatorial space for joint energy-thermal optimization of one or more given applications. The optimized DPM decisions are then given to a supervised learning algorithm to train a DPM policy that mimics the corresponding decision-making behavior.
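As a minimal sketch of the two-stage L2S idea (search for an optimized decision sequence, then train a policy to mimic it), the toy Python below exhaustively searches per-epoch V/F choices against an illustrative energy/delay/thermal cost and fits a nearest-neighbor policy on the result. The cost model, V/F levels, and workload numbers are assumptions for illustration only, not the framework's actual components.

```python
import itertools

# Illustrative assumptions, not the actual L2S models.
VF_LEVELS = [0.8, 1.0, 1.2]   # normalized voltage/frequency settings

def cost(vf, load):
    """Toy per-epoch cost: dynamic energy + delay + thermal penalty."""
    energy = vf ** 2 * load                    # dynamic power scales ~ V^2
    delay = 2.0 * load / vf                    # higher frequency -> less delay
    thermal = 5.0 * max(0.0, vf * load - 1.0)  # penalize hot operating points
    return energy + delay + thermal

def search_best_sequence(loads):
    """Oracle step: search the combinatorial space of per-epoch V/F
    decisions for the minimum-cost sequence (exhaustive here)."""
    return min(itertools.product(VF_LEVELS, repeat=len(loads)),
               key=lambda seq: sum(cost(vf, ld) for vf, ld in zip(seq, loads)))

def train_policy(loads, oracle_seq):
    """Supervised step: mimic the oracle decisions with a 1-nearest-
    neighbor lookup over the workload feature (here, just the load)."""
    examples = list(zip(loads, oracle_seq))
    return lambda load: min(examples, key=lambda ex: abs(ex[0] - load))[1]

loads = [0.3, 0.9, 0.5, 1.1]              # per-epoch workload features
oracle = search_best_sequence(loads)      # optimized DPM decision sequence
policy = train_policy(loads, oracle)      # DPM policy mimicking the oracle
```

Note how the searched sequence throttles to the lowest V/F level only for the heaviest epoch, where the toy thermal penalty kicks in; the learned policy then reproduces that behavior from the workload feature alone.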
Results achieved:
Our experiments on two different manycore architectures, designed using wireless interconnects and monolithic 3D integration, demonstrate that the principles behind the L2S framework apply to more than one configuration. Moreover, L2S-based DPM policies achieve up to 30% energy-delay product savings and reduce the peak chip temperature by up to 17K compared to state-of-the-art ML methods, for an allowable performance overhead of only 5%.
What is the challenge?
Large-scale manycore System-on-Chips (SoCs) need to satisfy the conflicting objectives of maximizing performance and minimizing energy consumption for dynamically changing applications.
Our solution:
We consider the problem of dynamic power management (DPM) in large manycore SoCs for applications unseen at runtime. We employ a machine learning (ML) based DPM policy, which selects the voltage/frequency (V/F) levels for different clusters of cores as a function of application features such as core computation and inter-core traffic. We propose a novel uncertainty-aware online learning framework to learn a DPM policy that can adapt to unseen applications at runtime. It relies on two key ideas. First, an entropy-based uncertainty measure is used to distinguish between seen and unseen system states. Second, we employ conformal prediction to compute uncertain V/F sets for unseen system states. We perform a bounded search over the uncertain V/F configurations using power/performance models to identify the best V/F configurations that minimize the energy-delay product (EDP), and create supervised examples for online learning.
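The seen/unseen decision logic above can be sketched as follows. The entropy threshold, nonconformity scores, calibration values, and EDP model below are illustrative assumptions, not the framework's actual quantities: a low-entropy (seen) state trusts the policy's prediction, while a high-entropy (unseen) state falls back to a conformal candidate set and a bounded model-based search.

```python
import math

VF_LEVELS = [0.8, 1.0, 1.2]  # normalized V/F settings (illustrative)

def entropy(probs):
    """Entropy of the policy's predicted V/F distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def conformal_set(scores, calibration_scores, alpha=0.1):
    """Split conformal prediction: keep every V/F level whose
    nonconformity score is within the (1 - alpha) calibration quantile."""
    cal = sorted(calibration_scores)
    k = min(len(cal) - 1, math.ceil((1 - alpha) * (len(cal) + 1)) - 1)
    return [vf for vf, s in zip(VF_LEVELS, scores) if s <= cal[k]]

def edp_model(vf, load):
    """Toy energy-delay-product model (an assumption, not the paper's)."""
    return (vf ** 2 * load) * (load / vf)

def select_vf(probs, scores, calibration_scores, load, ent_threshold=0.9):
    if entropy(probs) < ent_threshold:            # seen state: trust policy
        return VF_LEVELS[probs.index(max(probs))]
    candidates = conformal_set(scores, calibration_scores)  # unseen state
    # Bounded search: evaluate only the uncertain candidates with the model.
    return min(candidates or VF_LEVELS, key=lambda vf: edp_model(vf, load))
```

The V/F configuration chosen on the unseen path, paired with the observed state, would then serve as a supervised example for the online learner.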
Results achieved:
Our experiments on a 64-core system show that the EDP is reduced by up to 50% and 60% compared to existing online imitation learning and reinforcement learning methods, respectively.
What is the challenge?
Resistive random-access memory (ReRAM) based processing-in-memory (PIM) architectures are used extensively to accelerate inferencing/training with convolutional neural networks (CNNs). Three-dimensional (3D) integration is an enabling technology to integrate many PIM cores on a single chip.
Our solution:
We propose the design of a thermally efficient dataflow-aware monolithic 3D (M3D) NoC architecture referred to as TEFLON to accelerate CNN inferencing without creating any thermal bottlenecks.
Results achieved:
TEFLON reduces the energy-delay product (EDP) by 42%, 46%, and 45% on average compared to a conventional 3D mesh NoC for systems with 36, 64, and 100 PIM cores, respectively. TEFLON also reduces the peak chip temperature by 25K and improves the inference accuracy by up to 11% compared to a solely performance-optimized SFC-based counterpart for inferencing with diverse deep CNN models using the CIFAR-10/100 datasets on a 3D system with 100 PIM cores.
What is the challenge?
Processing-In-Memory (PIM) architectures have emerged as an attractive computing paradigm for accelerating Deep Neural Network (DNN) training and inferencing. However, a plethora of PIM devices exists (e.g., ReRAM, FeFET, PCM, MRAM, and SRAM), and each of these devices offers advantages and drawbacks in terms of power, latency, area, and non-idealities.
Our solution:
A heterogeneous architecture that combines the benefits of multiple devices in a single platform can enable energy-efficient and high-performance DNN training and inference. Three-dimensional (3D) integration enables the design of such a heterogeneous architecture, where multiple planar tiers consisting of different PIM devices are integrated into a single platform. In this work, we propose the HuNT framework, which hunts for (finds) an optimal mapping of DNN layers, together with the planar tier configurations, for a 3D heterogeneous architecture.
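The core of such a layer-to-tier mapping search can be sketched as below: minimize total energy over all assignments of layers to device tiers, subject to an accuracy budget on accumulated device non-idealities. All device characteristics, layer statistics, and the exhaustive search itself are illustrative stand-ins; the actual framework's models and optimizer differ.

```python
import itertools

# Made-up per-device numbers for the sketch: energy per MAC (pJ) and a
# non-ideality penalty that degrades accuracy-sensitive layers.
DEVICES = {
    "ReRAM": {"energy": 1.0, "penalty": 0.03},
    "FeFET": {"energy": 0.6, "penalty": 0.05},
    "SRAM":  {"energy": 2.0, "penalty": 0.00},
}

LAYER_OPS = [4e6, 8e6, 2e6]          # MACs per layer (toy workload)
LAYER_SENSITIVITY = [0.9, 0.2, 0.5]  # accuracy sensitivity per layer

def mapping_cost(mapping, max_accuracy_loss=0.05):
    """Total energy of a layer-to-tier mapping; infeasible (infinite)
    if accumulated non-ideality impact exceeds the accuracy budget."""
    loss = sum(DEVICES[d]["penalty"] * s
               for d, s in zip(mapping, LAYER_SENSITIVITY))
    if loss > max_accuracy_loss:
        return float("inf")
    return sum(DEVICES[d]["energy"] * ops
               for d, ops in zip(mapping, LAYER_OPS))

def hunt():
    """Exhaustively search tier assignments for the cheapest feasible
    mapping (the real framework uses a smarter optimizer)."""
    return min(itertools.product(DEVICES, repeat=len(LAYER_OPS)),
               key=mapping_cost)
```

With these toy numbers the search avoids putting the cheapest device everywhere: the most accuracy-sensitive layer lands on a low-penalty tier even though that costs more energy, which is exactly the heterogeneity trade-off the framework exploits.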
Results achieved:
Overall, our experimental results demonstrate that the HuNT-enabled 3D heterogeneous architecture achieves up to 10x and 3.5x improvements over homogeneous and existing heterogeneous PIM-based architectures, respectively, in terms of energy efficiency (TOPS/W). Similarly, the proposed HuNT-enabled architecture outperforms existing homogeneous and heterogeneous architectures by up to 8x and 2.4x, respectively, in terms of compute efficiency (TOPS/mm2), without compromising the final DNN accuracy.
What is the challenge?
Operation Unit (OU)-based configurations enable the design of energy-efficient and reliable ReRAM crossbar-based Processing-In-Memory (PIM) architectures for Deep Neural Network (DNN) inferencing. To exploit sparsity and tackle crossbar non-idealities, matrix-vector multiplication (MVM) operations are computed at a much finer granularity than a full crossbar, over regions referred to as OUs. However, determining a suitable OU size for a given DNN workload presents a non-trivial challenge, as DNN layers exhibit different levels of sparsity and have varying impacts on the overall predictive accuracy.
Our solution:
In this paper, we propose a framework for designing heterogeneous OU-based PIM accelerators. The OU configurations vary based on the characteristics of the neural layers and the time-dependent conductance drift of PIM devices due to repeated inference runs.
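The OU-granular computation underlying this approach can be sketched as a tiled matrix-vector multiply in which only one OU-sized region of the crossbar is active per step, and all-zero tiles are skipped to exploit sparsity. This is a behavioral sketch only; the tile shape, skip rule, and counting are illustrative, not the accelerator's actual control logic.

```python
def ou_mvm(matrix, vector, ou_rows, ou_cols):
    """Matrix-vector multiply computed one Operation Unit (OU) at a
    time: an ou_rows x ou_cols region is active per step, and
    all-zero OUs are never issued (sparsity exploitation)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    result = [0.0] * n_rows
    active_ous = 0
    for r0 in range(0, n_rows, ou_rows):
        for c0 in range(0, n_cols, ou_cols):
            tile = [matrix[r][c0:c0 + ou_cols]
                    for r in range(r0, min(r0 + ou_rows, n_rows))]
            if all(w == 0 for row in tile for w in row):
                continue                    # sparse tile: no OU activated
            active_ous += 1
            for i, row in enumerate(tile):  # partial sums for this OU
                result[r0 + i] += sum(w * vector[c0 + j]
                                      for j, w in enumerate(row))
    return result, active_ous
```

Smaller OUs both limit the analog non-idealities seen per operation and expose more skippable all-zero tiles, which is why the best OU size varies with each layer's sparsity.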
Results achieved:
Our experimental results demonstrate that sparsity-aware, layer-wise heterogeneous OU-based PIM computation reduces latency and energy by 34% and 73% on average, respectively, compared to state-of-the-art homogeneous OU-based architectures, without compromising predictive accuracy.
What is the challenge?
ReRAM-based Processing-In-Memory (PIM) architectures enable energy-efficient Deep Neural Network (DNN) inferencing. However, ReRAM crossbars suffer from various non-idealities that affect the overall inferencing accuracy. To address this, matrix-vector-multiplication (MVM) operations are computed by activating a subset of the full crossbar, referred to as an Operation Unit (OU). However, OU configurations vary with the neural layers' features, such as sparsity and kernel size, and with their impact on predictive accuracy.
Our solution:
In this paper, we consider the problem of learning appropriate layer-wise OU configurations in ReRAM crossbars for unseen DNNs at runtime such that the performance is maximized without loss in predictive accuracy. We develop a machine learning (ML) based framework called Odin, which selects the OU sizes for different neural layers as a function of the neural layer features and time-dependent ReRAM conductance drift.
Results achieved:
Our experimental results demonstrate that the energy-delay-product (EDP) is reduced by up to 8.7x over state-of-the-art homogeneous OU configurations without compromising predictive accuracy.
What is the challenge?
Graph Neural Networks (GNNs) are made up of multiple layers, with each layer comprising different compute kernels that involve weight vectors and the adjacency matrix of the input graph dataset. These layers exhibit varying features such as sparsity, storage requirements, and impact on predictive accuracy. Non-volatile memory (NVM)-based 3D Processing-In-Memory (PIM) architectures offer a promising approach to accelerate GNN inferencing. However, NVM device-based crossbars suffer from various non-idealities that affect the overall predictive accuracy.
Our solution:
In this work, we consider the problem of finding a suitable mapping of GNN layers to PIM-based processing elements (PEs) in a 3D manycore architecture such that the impact of crossbar non-idealities on predictive accuracy is minimized. We develop a framework called GINA, which leverages a low-cost, approximate Hessian-based methodology to automatically determine the GNN layers that are critical for accuracy and find a suitable GNN-layer-to-PE mapping. To tackle non-idealities and to exploit sparsity at the crossbar level, a subset of the full crossbar is activated in a cycle, referred to as an Operation Unit (OU). However, OU configurations vary with the above-mentioned GNN layer features, time-dependent conductance drift, and the input graph dataset. GINA learns to optimize the OU configuration for unseen datasets as a function of the GNN layer features and time-dependent conductance drift.
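The idea of cheaply ranking layers by accuracy criticality can be sketched with a finite-difference proxy for the Hessian diagonal: layers where the loss curves sharply under weight perturbation are the ones to shield from non-ideal crossbars. The scalar loss below is a toy stand-in for a model's validation loss, and this proxy is only one way to approximate Hessian information, not necessarily GINA's exact method.

```python
def loss(weights):
    """Toy scalar loss over per-layer weights (a stand-in for a GNN's
    validation loss; made up for illustration)."""
    w1, w2, w3 = weights
    return 4.0 * w1 ** 2 + 0.5 * w2 ** 2 + 1.5 * w3 ** 2

def curvature(weights, layer, eps=1e-3):
    """Finite-difference estimate of the loss's second derivative
    w.r.t. one layer's weight: a cheap Hessian-diagonal proxy."""
    up = list(weights); up[layer] += eps
    dn = list(weights); dn[layer] -= eps
    return (loss(up) + loss(dn) - 2.0 * loss(weights)) / eps ** 2

def rank_layers(weights):
    """Order layers from most to least accuracy-critical; the most
    critical layers would be mapped to the least non-ideal PEs."""
    curv = {i: curvature(weights, i) for i in range(len(weights))}
    return sorted(curv, key=curv.get, reverse=True)
```

Each curvature estimate needs only a few extra loss evaluations per layer, which is what keeps a Hessian-based criticality analysis low-cost compared to computing the full Hessian.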
Results achieved:
Our experimental results demonstrate that the GINA-enabled 3D PIM architecture reduces latency and energy by 7.4x and 13x on average, respectively, compared to state-of-the-art PIM architectures, without compromising predictive accuracy.