Agile-DRAM: Agile Trade-Offs in Memory Capacity, Latency, and Energy for Data Centers (2024). Proceedings of the International Symposium on High Performance Computer Architecture (HPCA).
Jaeyoon Lee, Wonyeong Jung, Dongwhee Kim, Daero Kim, Junseung Lee, and Jungrae Kim
Abstract
Data centers frequently face significant memory under-utilization due to factors such as infrastructure overprovisioning, inefficient workload scheduling, and limited server configurations. This paper introduces Agile-DRAM, a novel DRAM architecture that addresses this issue by flexibly converting the under-utilized memory capacity into lower latency and reduced power consumption. Through minor modifications to the conventional DRAM architecture, Agile-DRAM supports multiple operational modes: low-latency, low-power, and the default max-capacity mode. Notably, Agile-DRAM facilitates agile transitions between these modes at runtime in response to workload fluctuations in data centers. Evaluation results demonstrate that the low-latency mode can boost single-core execution speed by up to 25.8% and reduce energy usage by up to 22.4%. Similarly, the low-power mode can reduce DRAM standby and self-refresh power by 31.6% and 85.7%, respectively.
Agile-DRAM
Figure 1. Existing bitline and BLSA structures.
Figure 2. The bitline structure and operations of Agile-DRAM.
Figure 3. The global decoder of Agile-DRAM.
Agile-DRAM is a novel DRAM architecture enabling agile and efficient trade-offs among capacity, performance, and power consumption. By introducing a straightforward yet effective modification to the conventional DRAM structure, it provides support for both low-latency and low-power modes along with the traditional max-capacity mode.
One of Agile-DRAM’s standout features is its capability to transition dynamically between these modes without any data loss. This adaptability lets the system cater to a diverse range of application requirements, which is particularly beneficial in data center environments characterized by underutilized memory capacity. Despite these advantages, Agile-DRAM adds only negligible chip area overhead (a mere few hundred logic gates), making it a feasible and attractive option for DRAM manufacturers seeking a single design that is applicable across both high-end and low-cost markets.
A. Agile-DRAM Structure
This section outlines the modifications necessary for implementing Agile-DRAM. It begins with a discussion of the mirrored mat structure designed to pair two mats together, followed by a description of minor adjustments to the global decoder and the command decoder. These changes facilitate the simultaneous activation of paired rows and enable the storage of information with opposite charge levels.
A conventional open-bitline structure connects half of the bitlines to upper BLSAs and the other half to lower BLSAs (Figure 1a). In this structure, however, grouping mats together results in a chaining effect (for example, the 1st mat shares BLSAs with the 2nd mat, which in turn shares with the 3rd mat, and so on).
Agile-DRAM breaks this chaining effect by introducing a mirrored mat structure (Figure 2). This new structure mirrors one of the mats to place two rows of BLSA in the middle, as opposed to the conventional open-bitline structure that positions one row at the top and another in the middle. Then it connects all bitlines from the top and bottom mats to the BLSAs in the middle. Bitlines in odd positions are linked to an upper BLSA (triangles A and C in Figure 2), whereas bitlines in even positions are connected to a lower BLSA (triangles B and D). This rearrangement does not increase the area but marginally extends the bitline length by the height of the BLSA, which is unlikely to affect DRAM timing parameters in cycles. With these changes, all cells in the paired mats connect to the BLSA rows in the middle, ensuring that paired mats do not share BLSAs with other mats.
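The difference between the chained and the mirrored arrangement can be illustrated with a small model. The sketch below is not taken from the paper: the function name `blsa_sharing_groups` and the mat numbering are hypothetical, and the model assumes that a conventional open-bitline array places one shared BLSA row between every two adjacent mats, whereas the mirrored structure places BLSAs only between mats 2k and 2k+1.

```python
def blsa_sharing_groups(num_mats, mirrored):
    """Group mats that are transitively linked through shared BLSA rows.

    Hypothetical model: mats are numbered 0..num_mats-1 from top to bottom.
    """
    if mirrored:
        # Mirrored structure: BLSAs sit only between mats (2k, 2k+1).
        links = [(m, m + 1) for m in range(0, num_mats - 1, 2)]
    else:
        # Conventional open bitline: every adjacent pair shares a BLSA row.
        links = [(m, m + 1) for m in range(num_mats - 1)]

    # Simple label merging: mats with the same label share BLSAs.
    group = list(range(num_mats))
    for a, b in links:
        old, new = group[b], group[a]
        group = [new if g == old else g for g in group]

    groups = {}
    for mat, label in enumerate(group):
        groups.setdefault(label, []).append(mat)
    return list(groups.values())


if __name__ == "__main__":
    print(blsa_sharing_groups(6, mirrored=False))
    print(blsa_sharing_groups(6, mirrored=True))
```

Under these assumptions, the conventional layout yields a single group spanning all mats (the chaining effect), while the mirrored layout yields independent two-mat groups.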
In this structure, simultaneous activation of the two mats links two cells at the same column position to a central BLSA via the bitline and its complement (bitline-bar). If the cells hold opposite charge levels, the BLSA can sense the stored information more readily than in a conventional DRAM, where the complementary bitline stays at the reference voltage level (1/2 VDD). This new structure allows Agile-DRAM to utilize complementary rows without incurring the disadvantages of unused capacity [1] or halving the row size [2].
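The sensing benefit can be made concrete with an idealized charge-sharing calculation. This is a textbook first-order model, not the paper's circuit analysis: it ignores leakage, coupling, and sense-amplifier offset, and the supply voltage and the cell-to-bitline capacitance ratio used below are illustrative assumptions only.

```python
def charge_share_voltage(v_precharge, v_cell, c_cell, c_bitline):
    """Bitline voltage after ideal charge sharing with a single cell."""
    return (c_bitline * v_precharge + c_cell * v_cell) / (c_bitline + c_cell)


def sense_margins(vdd=1.1, c_cell=1.0, c_bitline=4.0):
    """Return (conventional, complementary) differential BLSA inputs in volts.

    vdd and the capacitance values are assumed, illustrative numbers.
    """
    v_ref = vdd / 2  # precharge / reference level (1/2 VDD)

    # Conventional: one cell storing '1' perturbs its bitline while the
    # other BLSA input stays at the reference level.
    v_bl = charge_share_voltage(v_ref, vdd, c_cell, c_bitline)
    conventional = v_bl - v_ref

    # Paired rows: a cell storing the opposite value ('0') pulls the
    # complementary bitline below the reference at the same time.
    v_blb = charge_share_voltage(v_ref, 0.0, c_cell, c_bitline)
    complementary = v_bl - v_blb

    return conventional, complementary


if __name__ == "__main__":
    conv, comp = sense_margins()
    print(f"conventional: {conv:.3f} V, complementary: {comp:.3f} V")
```

In this idealized model the differential input seen by the BLSA doubles when the paired cells hold opposite charge levels, independent of the chosen capacitance ratio.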
Another modification for Agile-DRAM is to the global decoder (Figure 3). In a conventional setup, the global decoder accepts the higher bits of a row address and selects a subarray. Agile-DRAM modifies the decoder to simultaneously select paired subarrays in the low-latency and low-power modes. This modification adds an OR gate to the true and complementary signals of the least significant subarray address (addr[3] with 1024 × 1024 mats) so that paired subarrays and their wordlines (WLs) can be activated in parallel.
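The decoder change can be sketched behaviorally as follows. The sketch is an assumption-laden illustration rather than the circuit in Figure 3: the function `select_subarrays` is hypothetical, it indexes the subarray address on its own (so bit 0 is the least significant subarray address bit), and it assumes that a paired-mode signal is ORed into both the true and the complementary line of that bit.

```python
def select_subarrays(subarray_addr, num_bits, paired_mode):
    """Return the subarray indices selected by a modeled global decoder."""
    selected = []
    for idx in range(2 ** num_bits):
        match = True
        for bit in range(num_bits):
            addr_bit = (subarray_addr >> bit) & 1
            true_line = addr_bit == 1   # true signal of this address bit
            comp_line = addr_bit == 0   # complementary signal
            if bit == 0 and paired_mode:
                # Assumed OR gates: the mode signal asserts both lines of
                # the least significant bit, so that bit becomes don't-care.
                true_line = comp_line = True
            wants_one = (idx >> bit) & 1
            match &= true_line if wants_one else comp_line
        if match:
            selected.append(idx)
    return selected


if __name__ == "__main__":
    print(select_subarrays(5, 3, paired_mode=False))
    print(select_subarrays(5, 3, paired_mode=True))
```

With the mode signal deasserted the model selects exactly one subarray, as in a conventional decoder; with it asserted, the two subarrays that differ only in the least significant bit are selected together.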
Agile-DRAM introduces a mode register and delay logic into the command decoder to facilitate agile mode switching (Section V). The delay logic imposes a time gap between the activation of paired mats during a mode transition. During a transition from the max-capacity mode to another mode, Agile-DRAM first activates the row containing the data to be retained. It then uses the delay logic to wait for several cycles, ensuring that the row fully drives the shared bitlines. Finally, it activates the other row to store complementary information in the unused row, a process similar to RowClone [4]. This strategy enables a non-destructive mode transition that does not risk data loss and allows agile mode switching without causing significant delay.
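The transition sequence described above can be summarized in a minimal behavioral sketch. It is not the paper's implementation: the function name, the command labels, and the default `delay_cycles` value are hypothetical (the text only says "several cycles"), and rows are modeled as plain lists of bits.

```python
def transition_to_paired_mode(rows, keep, spare, delay_cycles=4):
    """Model the non-destructive transition out of max-capacity mode.

    rows: dict mapping a row name to a list of bits.
    keep: row whose data must be retained.
    spare: paired, unused row that receives the complementary data.
    Returns the issued command trace.
    """
    trace = [("ACT", keep)]

    # Step 1: the retained row is activated and the BLSA latches it. One
    # bitline carries the data, the other carries its complement.
    bitline = list(rows[keep])
    bitline_bar = [1 - b for b in bitline]

    # Step 2: the delay logic waits until the bitlines are fully driven.
    trace.extend(("WAIT", None) for _ in range(delay_cycles))

    # Step 3: the paired row is activated and captures the complement,
    # similar in spirit to an in-DRAM row copy.
    trace.append(("ACT", spare))
    rows[spare] = bitline_bar
    return trace


if __name__ == "__main__":
    rows = {"top": [1, 0, 1, 1], "bottom": [0, 0, 0, 0]}
    print(transition_to_paired_mode(rows, "top", "bottom"))
    print(rows)
```

The retained row is never overwritten in this model; only the unused row changes, which mirrors the non-destructive property claimed for the transition.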
In summary, Agile-DRAM requires only a minor addition to the existing circuitry, primarily confined to bank-level and chip-level peripheral logic. This leads to a significant reduction in area overhead compared to previous methods, whose mat-level modifications incur significant overheads because a DRAM chip contains a large number of mats.
Evaluation results can be found in our paper.
Conclusion
Data centers have contended with significant memory underutilization. Repurposing this idle resource for other benefits has posed a considerable challenge due to the strict service level objectives of service providers and the cost sensitivity of DRAM vendors. To address this issue, this paper introduces Agile-DRAM, a novel DRAM architecture.
Agile-DRAM capitalizes on temporarily under-utilized memory capacity to enhance system performance and energy efficiency. It can dynamically adjust to a variety of work conditions by transitioning between different operating modes: max-capacity, low-latency, and low-power. For servers with low memory utilization, the low-latency mode allows for accelerated application performance. In contrast, servers with very low utilization can shift to the low-power mode to curtail their standby power consumption. The Agile-DRAM architecture ensures a smooth transition between these modes, eliminating any disruption to existing workloads or the need for data rewriting. In addition to these substantial benefits, Agile-DRAM incurs minimal area overhead.
With the aforementioned benefits, Agile-DRAM emerges as a highly promising solution for data center operators to tackle the severe issue of memory under-utilization while promoting operational performance and energy efficiency. Simultaneously, the high area efficiency of Agile-DRAM provides an affordable solution to cost-sensitive DRAM vendors. Therefore, Agile-DRAM effectively bridges the needs of both data center operators and DRAM vendors, providing an innovative and practical approach to optimizing memory utilization and overall system performance.
References
[1] F. Bai, S. Wang, X. Jia, Y. Guo, B. Yu, H. Wang, C. Lai, Q. Ren, and H. Sun, “A low-cost reduced-latency DRAM architecture with dynamic reconfiguration of row decoder,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 31, no. 1, pp. 128–141, 2023.
[2] H. Luo, T. Shahroodi, H. Hassan, M. Patel, A. G. Yağlıkçı, L. Orosa, J. Park, and O. Mutlu, “CLR-DRAM: A low-cost DRAM architecture enabling dynamic capacity-latency trade-off,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020, pp. 666–679.
[3] J. Dean and L. A. Barroso, “The tail at scale,” Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013.
[4] V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, “RowClone: Fast and energy-efficient in-DRAM bulk data copy and initialization,” in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-46. New York, NY, USA: Association for Computing Machinery, 2013, pp. 185–197. [Online]. Available: https://doi.org/10.1145/2540708.2540725