Research Projects

Tabularization for Practical NN-Based Prefetching

Attention-based neural networks (NNs) have demonstrated their effectiveness in accurate memory access prediction, an essential step in data prefetching. However, the substantial computational overhead of these models results in high inference latency, limiting their feasibility as practical prefetchers. To close this gap, we propose a new approach based on tabularization that significantly reduces model complexity and inference latency without sacrificing prediction accuracy. Our novel tabularization methodology takes as input a distilled yet highly accurate attention-based model for memory access prediction and efficiently converts its expensive matrix multiplications into a hierarchy of fast table lookups. As an exemplar of this approach, we develop DART, a prefetcher composed of a simple hierarchy of tables. With a modest 0.09 drop in F1-score, DART eliminates 99.99% of the arithmetic operations of the large attention-based model and 91.83% of those of the distilled model, accelerating inference by 170× over the large model and 9.4× over the distilled model.
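
To make the tabularization idea concrete, the sketch below shows one way a linear layer's matrix multiplication can be replaced by per-subspace prototype tables: the input is split into chunks, each chunk is mapped to a learned prototype, and precomputed prototype-times-weight partial products are summed via table lookups. This is a simplified illustration, not DART's actual implementation; the dimensions, prototype counts, and use of k-means are assumptions.

```python
# Simplified sketch of replacing a linear layer y = xW with table lookups,
# in the spirit of tabularization. All sizes below are illustrative.
import numpy as np
from sklearn.cluster import KMeans

D_IN, D_OUT = 64, 32            # layer dimensions (assumed for illustration)
NUM_SUBSPACES, PROTOS = 8, 16   # split the input into 8 chunks, 16 prototypes each
rng = np.random.default_rng(0)

W = rng.standard_normal((D_IN, D_OUT))        # trained weight matrix
X_train = rng.standard_normal((1000, D_IN))   # calibration activations

sub = D_IN // NUM_SUBSPACES
codebooks, tables = [], []
for s in range(NUM_SUBSPACES):
    chunk = X_train[:, s * sub:(s + 1) * sub]
    km = KMeans(n_clusters=PROTOS, n_init=4, random_state=0).fit(chunk)
    codebooks.append(km)
    # Precompute each prototype's partial product with the matching rows of W:
    # tables[s] has shape (PROTOS, D_OUT).
    tables.append(km.cluster_centers_ @ W[s * sub:(s + 1) * sub, :])

def lookup_matmul(x):
    """Approximate x @ W with one table lookup per subspace plus additions."""
    y = np.zeros(D_OUT)
    for s in range(NUM_SUBSPACES):
        idx = codebooks[s].predict(x[s * sub:(s + 1) * sub][None, :])[0]
        y += tables[s][idx]     # fast lookup replaces the dot products
    return y

x = rng.standard_normal(D_IN)
print(np.linalg.norm(lookup_matmul(x) - x @ W))   # approximation error
```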

Domain-Specific ML Prefetcher for Graph Analytics

Existing Machine Learning (ML) prefetchers encounter challenges with phase transitions and irregular memory accesses in graph processing. We propose MPGraph, an ML-based Prefetcher for Graph analytics using domain-specific models. MPGraph introduces three novel optimizations: soft detection of phase transitions, phase-specific multi-modality models for access delta and page predictions, and chain spatio-temporal prefetching (CSTP) for prefetch control. For practical implementation, we compress the prediction models to reduce storage and latency overheads. Even with the compressed models, MPGraph shows significantly higher accuracy and coverage than BO, along with a 3.58% IPC improvement.
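
The sketch below illustrates how soft phase detection and phase-specific models could fit together: each phase owns its own predictor, and the final prediction weights the per-phase outputs by the soft phase probabilities instead of hard-switching at a boundary. The predictor classes, the placeholder detector, and all sizes are assumptions standing in for MPGraph's learned models.

```python
# Structural sketch: soft phase probabilities weighting phase-specific models.
import numpy as np

NUM_PHASES, NUM_DELTA_CLASSES = 3, 128   # assumed sizes for illustration

class PhaseModel:
    """Stub for a phase-specific multi-modality model (delta/page heads)."""
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)
    def predict_delta_logits(self, history):
        return self.rng.standard_normal(NUM_DELTA_CLASSES)

phase_models = [PhaseModel(s) for s in range(NUM_PHASES)]

def soft_phase_probs(history):
    # Placeholder soft detector: a real detector scores the recent access
    # window against learned phase signatures rather than hard-switching.
    scores = np.array([len(history) % (p + 2) for p in range(NUM_PHASES)], float)
    return np.exp(scores) / np.exp(scores).sum()

def predict_delta(history):
    probs = soft_phase_probs(history)
    logits = sum(p * m.predict_delta_logits(history)
                 for p, m in zip(probs, phase_models))
    return int(np.argmax(logits))        # predicted delta class to prefetch

print(predict_delta([0x1000, 0x1040, 0x1080]))
```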

Reinforced Ensemble Framework for Data Prefetching

Most prefetchers in the literature are efficient only for specific memory access patterns, which restricts their utility to specialized applications; they do not perform well on hybrid applications with multifarious access patterns. We therefore propose ReSemble, a Reinforcement Learning (RL) based adaptive enSemble framework that enables multiple prefetchers to complement each other on hybrid applications. The RL-trained ensemble controller takes the prefetch suggestions of all prefetchers as input, selects the best suggestion dynamically, and learns online to maximize cumulative rewards collected from prefetch hits and misses. Using a simple multilayer perceptron as the controller, our ensemble framework achieves an average accuracy of 85.27% and coverage of 44.22%, leading to a 31.02% IPC improvement; it outperforms state-of-the-art individual prefetchers by 8.35%–26.11% and SBP, a state-of-the-art non-RL ensemble prefetcher, by 5.69%.
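
The sketch below shows the shape of such an ensemble controller: a small MLP scores the candidate suggestions from the base prefetchers, one is sampled, and the choice is reinforced online by a hit/miss reward. The feature layout, reward values, and the bandit-style policy-gradient update are assumptions for illustration, not ReSemble's exact training procedure.

```python
# Minimal sketch of an RL ensemble controller over base-prefetcher suggestions.
import torch
import torch.nn as nn

NUM_PREFETCHERS, FEAT_DIM = 4, 8      # assumed: 4 base prefetchers, 8 features each

controller = nn.Sequential(            # simple MLP controller
    nn.Linear(NUM_PREFETCHERS * FEAT_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, NUM_PREFETCHERS),    # one score per prefetcher's suggestion
)
opt = torch.optim.SGD(controller.parameters(), lr=1e-2)

def select_and_update(features, reward_fn):
    """Pick a suggestion via a softmax policy, then apply a REINFORCE-style update."""
    logits = controller(features.flatten())
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                     # which prefetcher to trust
    reward = reward_fn(action.item())          # e.g. +1 for a prefetch hit, -1 for a miss
    loss = -dist.log_prob(action) * reward     # policy-gradient step
    opt.zero_grad()
    loss.backward()
    opt.step()
    return action.item()

# Toy usage: pretend prefetcher 2 always gives the right suggestion.
feats = torch.randn(NUM_PREFETCHERS, FEAT_DIM)
chosen = select_and_update(feats, lambda a: 1.0 if a == 2 else -1.0)
print("selected prefetcher:", chosen)
```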

Fine-Grained Address Segmentation for Data Prefetching

Existing ML-based approaches build on models for text prediction, treating prefetching as a classification problem over sequences. However, the vast and sparse memory address space leads to a large vocabulary, which makes this modeling impractical. The number and order of outputs required for multiple cache line prefetching also differ from text prediction. We propose TransFetch, a novel way to model prefetching. To reduce the vocabulary size, we use fine-grained address segmentation as input. To predict unordered sets of future addresses, we use delta bitmaps for multiple outputs. We apply an attention-based network to learn the mapping between inputs and outputs. Prediction experiments show that address segmentation achieves a higher F1-score than delta inputs and page & offset inputs. Prefetching simulations show that TransFetch outperforms state-of-the-art prefetchers.
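
The sketch below illustrates the two data representations described above: splitting an address into fine-grained bit segments for model input, and encoding an unordered set of future cache-line deltas as a bitmap label. The segment width and bitmap range are assumed values, not TransFetch's exact configuration.

```python
# Minimal sketch of fine-grained address segmentation and delta-bitmap labels.
import numpy as np

SEG_BITS = 4          # width of each address segment (assumed)
ADDR_BITS = 64
BITMAP_RANGE = 64     # label covers deltas in [-64, +64) cache lines (assumed)
LINE_BITS = 6         # 64-byte cache lines

def segment_address(addr):
    """Split a 64-bit address into 16 four-bit segments (fine-grained input tokens)."""
    return np.array([(addr >> (i * SEG_BITS)) & ((1 << SEG_BITS) - 1)
                     for i in range(ADDR_BITS // SEG_BITS)], dtype=np.int64)

def delta_bitmap(current_addr, future_addrs):
    """Mark which cache-line deltas to future accesses fall inside the window."""
    bitmap = np.zeros(2 * BITMAP_RANGE, dtype=np.int8)
    cur_line = current_addr >> LINE_BITS
    for a in future_addrs:
        d = (a >> LINE_BITS) - cur_line
        if -BITMAP_RANGE <= d < BITMAP_RANGE:
            bitmap[d + BITMAP_RANGE] = 1   # unordered multi-label target
    return bitmap

addr = 0x7ffe_1234_5678
print(segment_address(addr)[:4])
print(delta_bitmap(addr, [addr + 64, addr + 256]).nonzero()[0])
```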

Transformer for Memory Access Prediction

Data prefetching is a technique that can hide memory latency by fetching data before it is needed by a program. Prefetching relies on accurate memory access prediction, a task to which machine learning based methods are increasingly applied. Unlike previous approaches that learn from deltas or offsets and perform a single access prediction, we develop TransforMAP, based on the powerful Transformer model, which can learn from the whole address space and perform multiple cache line predictions. We propose using the binary representation of memory addresses as model input, which avoids information loss and removes the need for a token table in hardware. We design a block index bitmap to collect unordered future page offsets under the current page address as learning labels. As a result, our model can learn temporal patterns as well as spatial patterns within a page. In a practical implementation, this approach has the potential to hide prediction latency because it prefetches multiple cache lines likely to be used over a long horizon.
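
The sketch below illustrates the input and label construction described above: the model input is the binary representation of the address, and the label is a block index bitmap of future offsets within the current page. The page and block sizes (4 KiB pages, 64 B blocks) are assumptions for illustration.

```python
# Minimal sketch of binary address inputs and block index bitmap labels.
import numpy as np

ADDR_BITS, PAGE_BITS, BLOCK_BITS = 64, 12, 6
BLOCKS_PER_PAGE = 1 << (PAGE_BITS - BLOCK_BITS)   # 64 blocks per 4 KiB page

def address_to_binary(addr):
    """Binary token sequence of the full address (no vocabulary table required)."""
    return np.array([(addr >> i) & 1 for i in range(ADDR_BITS)], dtype=np.int64)

def block_index_bitmap(current_addr, future_addrs):
    """Set a bit for every future block offset that stays within the current page."""
    bitmap = np.zeros(BLOCKS_PER_PAGE, dtype=np.int8)
    page = current_addr >> PAGE_BITS
    for a in future_addrs:
        if (a >> PAGE_BITS) == page:
            bitmap[(a >> BLOCK_BITS) & (BLOCKS_PER_PAGE - 1)] = 1
    return bitmap

addr = 0x0040_2a80
print(address_to_binary(addr)[:16])
print(block_index_bitmap(addr, [addr + 0x40, addr + 0x180, addr + 0x2000]).nonzero()[0])
```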

Clustering-driven Meta-LSTM for Access Prediction

We propose clustering-driven compact LSTM models that can predict the next memory access with high accuracy. We introduce a novel clustering approach, called Delegated Model Clustering, that reliably clusters applications. For each cluster, we train a compact meta-LSTM model that can quickly adapt to any application in the cluster. Prior LSTM-based work on access prediction has used orders of magnitude more parameters and developed one model per application (trace). While one specialized model per application can yield higher accuracy, it is not a scalable approach. In contrast, our models can predict for a class of applications, trading off specialization for a more generalizable compact meta-model at the cost of a few retraining steps at runtime.
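
The sketch below shows the cluster-then-route structure: applications are grouped into clusters and each cluster owns one compact meta-model that serves every application assigned to it. The KMeans step over per-application trace features is only a stand-in for routing purposes; it is not the Delegated Model Clustering procedure itself, and the features and cluster count are assumptions.

```python
# Structural sketch: cluster applications, then route each one to its
# cluster's compact meta-model.
import numpy as np
from sklearn.cluster import KMeans

NUM_CLUSTERS = 3
rng = np.random.default_rng(0)

# Assumed per-application features (e.g. delta-distribution statistics).
app_features = {f"app{i}": rng.standard_normal(8) for i in range(13)}

X = np.stack(list(app_features.values()))
clustering = KMeans(n_clusters=NUM_CLUSTERS, n_init=4, random_state=0).fit(X)

# One compact meta-model per cluster (placeholders for trained meta-LSTMs).
meta_models = {c: f"meta_lstm_{c}" for c in range(NUM_CLUSTERS)}

def model_for(app_name):
    """Route an application to the meta-model of its cluster."""
    idx = list(app_features).index(app_name)
    return meta_models[clustering.labels_[idx]]

print(model_for("app5"))
```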

RNN-Augmented Prefetcher

The rapid development of Big Data, coupled with the slowing of Moore's law, has made memory performance a bottleneck in the von Neumann architecture. Machine learning has the potential to address these memory performance issues, specifically through data access prediction. While recent work on memory access prediction has used recurrent neural networks (RNNs), a framework that utilizes such predictions inside a prefetcher has been lacking. We introduce the RNN Augmented Offset Prefetcher (RAOP) framework, which consists of two parts: an RNN-based predictor and an offset prefetching module. By leveraging the RNN-predicted access as a temporal reference, RAOP improves prefetching performance by executing offset prefetching for both the current address and the RNN-predicted address.
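
The sketch below illustrates the prefetch-issue step: the same offset list is applied at both the current access and the RNN-predicted access, using the prediction as a temporal anchor. The offset values and the RNN stub are illustrative assumptions, not RAOP's tuned offsets or trained model.

```python
# Minimal sketch of offset prefetching anchored at both the current and
# the RNN-predicted address.
LINE = 64                      # cache line size in bytes (assumed)
OFFSETS = [1, 2, 4, 8]         # candidate line offsets (assumed)

def rnn_predict_next(history):
    # Stub standing in for the RNN-based predictor: here, naive stride reuse.
    return history[-1] + (history[-1] - history[-2]) if len(history) > 1 else history[-1]

def raop_prefetches(history):
    current = history[-1]
    predicted = rnn_predict_next(history)
    targets = set()
    for base in (current, predicted):      # offset prefetching from both anchors
        for off in OFFSETS:
            targets.add(base + off * LINE)
    return sorted(targets)

print(raop_prefetches([0x1000, 0x1100, 0x1200]))
```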

Meta-LSTM Models for Memory Access Prediction 

While recent deep learning models have performed well on sequence prediction problems, they are far too heavy in terms of model size and inference latency to be practical for data prefetching. Here, we propose extremely compact LSTM models that can predict the next memory access with high accuracy. Prior LSTM-based work on access prediction has used orders of magnitude more parameters and developed one model per application (trace). While one specialized model per application can yield higher accuracy, it is not a scalable approach. In contrast, our models can predict for a class of applications, trading off specialization for a more generalizable compact meta-model at the cost of a few retraining steps at runtime. Our experiments on 13 benchmark applications demonstrate that three compact meta-models can achieve accuracy close to that of the specialized models, using only a few batches of retraining for the majority of applications.
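
The sketch below shows what a compact next-delta LSTM and its few-step runtime adaptation could look like. The model size, optimizer, and number of adaptation batches are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch: a compact LSTM meta-model fine-tuned for a few batches on
# the target application's recent accesses.
import torch
import torch.nn as nn

class CompactLSTM(nn.Module):
    """A small next-delta classifier standing in for a compact meta-model."""
    def __init__(self, num_deltas=256, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(num_deltas, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_deltas)
    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.head(h[:, -1])          # predict the next access delta

def adapt(meta_model, batches, steps=5, lr=1e-3):
    """A few retraining steps on the target application's trace at runtime."""
    opt = torch.optim.Adam(meta_model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _, (seq, target) in zip(range(steps), batches):
        opt.zero_grad()
        loss_fn(meta_model(seq), target).backward()
        opt.step()
    return meta_model

# Toy usage with random delta sequences.
model = CompactLSTM()
batches = [(torch.randint(0, 256, (16, 8)), torch.randint(0, 256, (16,)))
           for _ in range(5)]
adapt(model, iter(batches))
print(sum(p.numel() for p in model.parameters()), "parameters")
```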