System Performance

Publications

Shivani Tripathy, Debiprasanna Sahoo, Manoranjan Satpathy and Madhu Mutyam

ACM International Conference on Supercomputing 2020

https://dl.acm.org/doi/abs/10.1145/3392717.3392766


Fuzzy Fairness Controller for NVMe SSDs

Modern NVMe SSDs are widely deployed in diverse domains due to characteristics like high performance, robustness, and energy efficiency. It has been observed that interference among concurrently running workloads affects their overall response times very differently in these devices, which leads to unfairness. Workload intensity is a dominant factor influencing this interference. Prior works use a threshold value to characterize a workload as high-intensity or low-intensity; this type of characterization is limited because it conveys no information about the degree of low or high intensity.

A data cache in an SSD controller - usually based on DRAM - plays a crucial role in improving device throughput and lifetime. However, the degree of parallelism at this level is limited compared to the SSD back-end, which consists of several channels, chips, and planes. Therefore, the impact of interference can be more pronounced at the data cache level. To the best of our knowledge, no prior work has addressed the fairness issue at the data cache level. In this work, we address this issue by proposing a fuzzy logic-based fairness control mechanism. A fuzzy fairness controller characterizes the degree of flow intensity (i.e., the rate at which requests are generated) of a workload and assigns priorities to the workloads accordingly. We implement the proposed mechanism in the MQSim framework and observe that our technique improves the fairness, weighted speedup, and harmonic speedup of the SSD by 29.84%, 11.24%, and 24.90% on average over the state of the art, respectively. The peak gains in fairness, weighted speedup, and harmonic speedup are 2.02x, 29.44%, and 56.30%, respectively.
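
To make the mechanism concrete, the following is a minimal sketch of a fuzzy flow-intensity-to-priority mapping, assuming triangular membership functions, a three-rule base, and a weighted-average defuzzifier; the breakpoints, rules, and priority scale are illustrative assumptions, not the controller described in the paper.

#include <cstdio>
#include <initializer_list>

// Triangular membership function peaking at b over the interval (a, c).
static double tri(double x, double a, double b, double c) {
    if (x <= a || x >= c) return 0.0;
    return (x < b) ? (x - a) / (b - a) : (c - x) / (c - b);
}

// Map a workload's flow intensity (request rate normalized to [0, 1]) to a
// scheduling priority in [0, 1]; low-intensity flows get higher priority so
// that they are not starved by high-intensity ones.
double fuzzy_priority(double intensity) {
    // Fuzzification: degree to which the flow is low / medium / high.
    double low  = tri(intensity, -0.4, 0.0, 0.5);
    double med  = tri(intensity,  0.1, 0.5, 0.9);
    double high = tri(intensity,  0.5, 1.0, 1.4);

    // Assumed rule base: low -> 0.9, medium -> 0.5, high -> 0.1 priority.
    // Defuzzification by weighted average of the rule outputs.
    double num = low * 0.9 + med * 0.5 + high * 0.1;
    double den = low + med + high;
    return den > 0.0 ? num / den : 0.5;
}

int main() {
    for (double i : {0.05, 0.3, 0.5, 0.8, 0.95})
        std::printf("intensity %.2f -> priority %.2f\n", i, fuzzy_priority(i));
    return 0;
}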

Shivani Tripathy, Debiprasanna Sahoo, Manoranjan Satpathy

32nd IEEE International Conference on VLSI Design and 18th International Conference on Embedded Systems (VLSID), 2019

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8711244


Multidimensional Grid Aware Address Prediction for GPGPU

GPGPUs are predominantly used as accelerators for general-purpose data-parallel applications. Most GPU applications are likely to exhibit regular memory access patterns. It has been observed that warps within a thread block show striding behaviour in the memory accesses corresponding to the same load instruction. However, determining this inter-warp stride at thread block boundaries is not trivial. We observed that thread blocks along different dimensions have different stride values. Leveraging this observation, we characterize the relationship between the memory address references of warps from different thread blocks. Based on this relationship, we propose a multidimensional grid-aware address predictor that takes advantage of SM-level concurrency to correctly predict the memory address references of future thread blocks well in advance. Our technique is cooperative: information, once learned, is shared with all the SMs. Compared with the CTA-aware technique, our predictor improves average prediction coverage by 36% while delivering comparable prediction accuracy.
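
As an illustration, the sketch below captures the core of such a predictor, assuming the address observed by thread block (0, 0) serves as the base and per-dimension strides are learned from two neighbouring blocks; the training interface and table layout are illustrative assumptions, not the exact predictor proposed in the paper.

#include <cstdint>
#include <cstdio>

struct GridStridePredictor {
    uint64_t base    = 0;   // address seen for thread block (0, 0)
    int64_t  strideX = 0;   // learned inter-block stride along grid dim x
    int64_t  strideY = 0;   // learned inter-block stride along grid dim y
    bool     trained = false;

    // Learn per-dimension strides from the addresses observed for three
    // reference blocks: (0,0), (1,0) and (0,1), all for the same load PC.
    void train(uint64_t addr00, uint64_t addr10, uint64_t addr01) {
        base    = addr00;
        strideX = static_cast<int64_t>(addr10) - static_cast<int64_t>(addr00);
        strideY = static_cast<int64_t>(addr01) - static_cast<int64_t>(addr00);
        trained = true;
    }

    // Predict the address that thread block (bx, by) will reference.
    uint64_t predict(int bx, int by) const {
        return base + static_cast<int64_t>(bx) * strideX
                    + static_cast<int64_t>(by) * strideY;
    }
};

int main() {
    GridStridePredictor p;
    // Hypothetical addresses observed by blocks (0,0), (1,0) and (0,1)
    // for one load PC over a row-major 2D array.
    p.train(0x10000000, 0x10000100, 0x10004000);
    std::printf("predicted addr for block (3,2): 0x%lx\n",
                (unsigned long)p.predict(3, 2));
    return 0;
}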

Debiprasanna Sahoo, Swaraj Sha, Manoranjan Satpathy, Madhu Mutyam

IEEE Computer Architecture Letters (CAL)

https://ieeexplore.ieee.org/document/8437148/

ReDRAM: A Reconfigurable DRAM Cache for GPGPUs

A GPU-based DRAM cache system contains both a TLB and a DRAM cache tag array, which perform virtual-to-physical and physical-to-DRAM-cache address translation, respectively. Furthermore, existing designs do not exploit the opportunity of allocating store-before-load data directly in GPU DRAM, which could save multiple CPU-GPU transactions.

This design optimizes the DRAM cache in GPUs in two ways:

1. A tagless DRAM cache design that removes the tag check operation entirely from the critical path.

2. A reconfiguration of the DRAM cache as a heterogeneous unit that acts both as a cache and as an allocatable region for store-before-load data (a sketch of this lookup path follows).
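
A minimal sketch of the tagless lookup path under these two points is given below, assuming a TLB-like table that maps virtual page numbers directly to GPU DRAM cache frames and a flag that marks store-before-load pages as allocatable; the page size, table layout, and interfaces are illustrative assumptions rather than the ReDRAM design itself.

#include <cstdint>
#include <cstdio>
#include <optional>
#include <unordered_map>

constexpr uint64_t kPageBits = 12;   // assume 4 KiB pages

struct FrameEntry {
    uint64_t frame;       // GPU DRAM cache frame number
    bool allocatable;     // true: store-before-load data owned by GPU DRAM
};

class TaglessDramCache {
public:
    // Translate a virtual address straight to a DRAM cache address, or
    // nothing on a miss; no tag comparison sits on this path.
    std::optional<uint64_t> lookup(uint64_t vaddr) const {
        auto it = map_.find(vaddr >> kPageBits);
        if (it == map_.end()) return std::nullopt;
        return (it->second.frame << kPageBits) |
               (vaddr & ((1u << kPageBits) - 1));
    }

    // Install a mapping; store-before-load pages are marked allocatable so
    // they need never be fetched from CPU memory first.
    void install(uint64_t vpage, uint64_t frame, bool storeBeforeLoad) {
        map_[vpage] = {frame, storeBeforeLoad};
    }

private:
    std::unordered_map<uint64_t, FrameEntry> map_;
};

int main() {
    TaglessDramCache cache;
    cache.install(/*vpage=*/0x4000, /*frame=*/7, /*storeBeforeLoad=*/true);
    if (auto d = cache.lookup((0x4000ull << kPageBits) | 0x2C))
        std::printf("hit: DRAM cache addr 0x%llx\n", (unsigned long long)*d);
    return 0;
}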

Shivani Tripathy, Debiprasanna Sahoo, Manoranjan Satpathy

ACM/IEEE International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), ESWEEK, 2018

https://dl.acm.org/doi/10.5555/3283552.3283564

Locking Lines in Tag Cache to Improve Access Optimization for DRAM Caches

Emerging 3D stacking technologies have enabled the use of DRAM as the last-level cache of CPUs. Several DRAM cache designs have been proposed in the literature as part of this design space exploration. While the debate on the design trade-offs between block-based and page-based DRAM caches continues, we discuss an orthogonal problem related to the cache block allocation policy. We believe ours is the first work to analyze this aspect of the DRAM cache.


Debiprasanna Sahoo, Swaraj Sha, Manoranjan Satpathy, Madhu Mutyam and Laxmi Narayan Bhuyan

23rd ACM Asia and South Pacific Design Automation Conference (ASP-DAC), 2018

https://dl.acm.org/citation.cfm?id=3201652

CAMO: A Novel Cache Management Organization for GPGPUs

GPGPUs are now commonly used as co-processors of CPUs for the computation of data-parallel and throughput-intensive algorithms. However, the memory available in GPGPUs is limited for many applications of interest, and such applications continuously demand more memory. Techniques like multi-streaming and pinned memory are frequently employed to mitigate this issue to some extent; however, they either suffer from latency overhead or increase programming complexity. GPUdmm uses GPU DRAM as a cache of CPU memory; the key problems in this design are an inefficient memory access data path and tag access overhead. In this context, we present CAMO, a novel cache memory organization for GPGPUs that addresses the limitations of the pinned memory technique and GPUdmm. First, it uses GPU DRAM as a victim cache of the LLC, which improves performance by delivering data to the SMs faster. Second, it uses ATCache, a CPU-based DRAM cache tag management technique, which reduces the number of DRAM cache accesses. We implement CAMO within the GPGPU-Sim framework and show that, compared with pinned memory, its average performance improves by a factor of 1.87x, with a peak improvement of 4.67x. In addition, CAMO outperforms GPUdmm by 15.9% on average, with a maximum speedup of 80%.
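
The resulting data path can be sketched roughly as below, assuming a GPU DRAM victim cache filled on LLC evictions and a small on-chip tag cache (in the spirit of ATCache) that answers most hit/miss queries before any DRAM-resident tag access; the capacities, replacement policy, and interfaces are illustrative assumptions, not the CAMO or ATCache implementations.

#include <cstdint>
#include <cstdio>
#include <list>
#include <unordered_map>
#include <unordered_set>

// Small on-chip cache of recently resolved DRAM-cache tag lookups.
struct TagCache {
    std::unordered_map<uint64_t, bool> outcome;   // block -> last hit/miss
    bool knows(uint64_t b) const { return outcome.count(b) != 0; }
    bool isHit(uint64_t b) const { return outcome.at(b); }
    void fill(uint64_t b, bool hit) { outcome[b] = hit; }
};

class VictimDramCache {
public:
    explicit VictimDramCache(size_t capacityBlocks) : cap_(capacityBlocks) {}

    // Blocks evicted from the LLC are installed here instead of being
    // dropped, so later misses can be served from GPU DRAM.
    void installVictim(uint64_t block, TagCache& tc) {
        if (tags_.count(block)) return;
        if (lru_.size() == cap_) {            // evict the oldest victim
            tc.fill(lru_.back(), false);
            tags_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(block);
        tags_.insert(block);
        tc.fill(block, true);
    }

    // LLC-miss probe: the on-chip tag cache is checked first; only on a
    // tag-cache miss do we pay for a DRAM-resident tag access.
    bool probe(uint64_t block, TagCache& tc) const {
        if (tc.knows(block)) return tc.isHit(block);
        bool hit = tags_.count(block) != 0;   // DRAM tag array access
        tc.fill(block, hit);
        return hit;
    }

private:
    size_t cap_;
    std::list<uint64_t> lru_;
    std::unordered_set<uint64_t> tags_;
};

int main() {
    TagCache tc;
    VictimDramCache vc(2);
    vc.installVictim(0xA0, tc);
    std::printf("probe 0xA0: %s\n",
                vc.probe(0xA0, tc) ? "hit in GPU DRAM" : "miss");
    return 0;
}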


Debiprasanna Sahoo, Manoranjan Satpathy and Madhu Mutyam

30th IEEE International Conference on VLSI Design and 16th International Conference on Embedded Systems (VLSID), 2017

http://ieeexplore.ieee.org/document/7884754/

An Experimental Study on Dynamic Bank Partitioning of DRAM in Chip Multiprocessors

Concurrent execution of multiple applications on chip multiprocessors leads to interference at different levels of shared resources, such as the banks of a DRAM. A few studies in the literature suggest that the DRAM banks can be dynamically partitioned among the running processes. However, detailed performance statistics are not available in the existing literature. In this paper, we present a comparative study of static and dynamic DRAM bank allocation algorithms on a large number of SPEC CPU 2006 benchmarks. We conclude that, although per-application performance increases for a few applications, there is no benefit in overall system performance.
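
For reference, the sketch below shows one simple way bank partitioning can be expressed, assuming a page-colouring-style assignment of disjoint bank subsets to applications and a miss-rate-proportional heuristic for dynamic repartitioning; the bank count and heuristic are illustrative assumptions, not the exact algorithms compared in the paper.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr int kNumBanks = 16;

struct Partition {
    std::vector<int> banks;      // banks owned by this application
};

// Choose the DRAM bank for an application's n-th allocated page by cycling
// through the banks in its partition (the static-partitioning case).
int bankFor(const Partition& p, uint64_t pageIndex) {
    return p.banks[pageIndex % p.banks.size()];
}

// Dynamic partitioning: periodically resize partitions in proportion to
// each application's measured row-buffer miss rate (one possible heuristic).
std::vector<Partition> repartition(const std::vector<double>& missRate) {
    double total = 0;
    for (double m : missRate) total += m;
    std::vector<Partition> parts(missRate.size());
    int next = 0;
    for (size_t a = 0; a < missRate.size(); ++a) {
        int share = std::max(1, (int)(kNumBanks * missRate[a] / total));
        for (int i = 0; i < share && next < kNumBanks; ++i)
            parts[a].banks.push_back(next++);
    }
    // Leftover banks (from rounding) go to the last application.
    while (next < kNumBanks) parts.back().banks.push_back(next++);
    return parts;
}

int main() {
    auto parts = repartition({0.6, 0.3, 0.1});   // three co-running apps
    for (size_t a = 0; a < parts.size(); ++a)
        std::printf("app %zu owns %zu banks; page 5 -> bank %d\n",
                    a, parts[a].banks.size(), bankFor(parts[a], 5));
    return 0;
}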