Abstract
DRAM vendors introduced On-Die Error Correction Codes (OD-ECC) to correct errors internally. Most OD-ECCs are based on Single Error Correction (SEC) and correct individual bit errors. However, recent soft error experiments on HBM2 reveal that DRAM frequently experiences multi-bit errors, which calls for a stronger OD-ECC. This paper introduces a novel OD-ECC, EPA ECC, designed to correct frequently observed multi-bit error patterns. The key innovation of EPA ECC is to construct multi-bit symbols aligned with common error patterns and to apply Reed-Solomon (RS) codes that correct severe errors without increasing the redundancy ratio. Our evaluation demonstrates that EPA ECC provides higher memory reliability than the current SEC-DED OD-ECC without significant system performance degradation.
Motivation
Table 1. Soft Error Patterns [1].
Figure 1. Error Pattern Examples on HBM2E OD-ECC Block.
This section analyzes the HBM2 error patterns reported by high-energy beam testing, which motivated this study. The study in [1] conducted a high-energy neutron beam injection experiment on NVIDIA V100 GP-GPUs with 32GB HBM2. The high-energy particles (9.8e5 neutrons/cm^2/second) generate soft errors in the memory, and [1] classified the observed errors into seven severity levels based on the number of bit errors and their locations within a memory access. Figure 1 presents the error patterns.
1-bit errors are errors that involve only a single bit within an access, while 1-byte errors involve more than one bit in the same byte, transferred side-by-side. 1-pin errors involve more than one bit error from a single pin over time. Errors of two or three bits that do not belong to a single byte or a single pin are classified as 2-bit errors and 3-bit errors, respectively. Errors that affect more than three bits are classified as 1-beat errors if they affect only one beat of the transfer, or as 1-entry errors if they affect more than one beat.
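The classification rules above can be sketched as a small function. This is a minimal illustration, not the classifier used in [1]: the representation of an error as a set of (beat, pin) coordinates, the assumption that a byte is eight adjacent pins within one beat, and the name `classify_error` are all hypothetical choices made here.

```python
def classify_error(flipped_bits):
    """Classify the flipped bits of one memory access into a severity level.

    flipped_bits: iterable of (beat, pin) pairs, one per erroneous bit.
    Assumption (not from the source): a byte is 8 adjacent pins in one beat.
    """
    bits = set(flipped_bits)
    if not bits:
        return "no error"
    beats = {beat for beat, _ in bits}
    pins = {pin for _, pin in bits}
    byte_ids = {(beat, pin // 8) for beat, pin in bits}

    if len(bits) == 1:
        return "1-bit"
    if len(byte_ids) == 1:      # several bits, all in the same byte
        return "1-byte"
    if len(pins) == 1:          # several bits, same pin over time
        return "1-pin"
    if len(bits) == 2:
        return "2-bit"
    if len(bits) == 3:
        return "3-bit"
    # More than three bits, not confined to one byte or one pin.
    return "1-beat" if len(beats) == 1 else "1-entry"
```

For example, `classify_error({(0, 8), (0, 9), (0, 15)})` falls into the 1-byte class under the assumed layout, while the two adjacent bits `{(0, 7), (0, 8)}` straddle a byte boundary and are counted as a 2-bit error.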
Table 1 presents the soft error probabilities by severity. 1-bit errors were the most common type, accounting for 73.98% of all observed errors. However, a significant portion (22.56%) of the errors were 1-byte errors. Together, these two types accounted for 96.54% of all observed errors, while the remaining errors were classified as 1-pin (0.19%), 2-bit (0.11%), 3-bit (0.03%), 1-beat (0.90%), and 1-entry (2.23%) errors.
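The arithmetic behind these shares can be checked with a few lines of Python. The percentages are the values quoted above from [1]; the dictionary and function names are illustrative only, and treating "correctable share" as a plain sum of pattern probabilities is a simplification.

```python
# Share of observed soft errors per pattern, in percent (values quoted from [1]).
ERROR_SHARE_PCT = {
    "1-bit": 73.98,
    "1-byte": 22.56,
    "1-pin": 0.19,
    "2-bit": 0.11,
    "3-bit": 0.03,
    "1-beat": 0.90,
    "1-entry": 2.23,
}

def corrected_share(correctable_patterns):
    """Sum the shares of the patterns a given code is assumed to correct."""
    return sum(ERROR_SHARE_PCT[p] for p in correctable_patterns)

bit_only = corrected_share(["1-bit"])                # single-bit correction only
bit_and_byte = corrected_share(["1-bit", "1-byte"])  # single-byte correction
residual = 100.0 - bit_and_byte                      # patterns left uncorrected
```

Under this simplification, a code that corrects only 1-bit patterns covers 73.98% of the observed errors, and one that also corrects 1-byte patterns covers 96.54%.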
These results provide important insights into the types of errors that can occur in HBM2. The high incidence of 1-byte errors may indicate that high-energy particle strikes often upset multiple nearby cells due to the higher cell density, or that they often hit a peripheral circuit (such as a row decoder) and affect multiple cells connected to a wordline. In either case, the analysis guided us to develop a strategy for mitigating the severe and frequent 1-byte errors. To this end, this paper proposes a novel ECC code for HBM2E that is aligned with the soft error patterns classified in [1].
EPA ECC
Figure 2. Overview of EPA ECC
The reported error patterns indicate that the current SEC-based ECC schemes [2] are insufficient to provide high memory reliability against soft errors. This section presents a novel ECC scheme, called Error-Pattern-Aligned ECC (EPA ECC), to provide robust protection against soft errors. By correcting the dominant 1-bit and 1-byte errors, EPA ECC can correct 96.54% of the observed errors, whereas the current SEC-DED can cover only 73.98%. Despite these reliability benefits, EPA ECC does not increase decoding latency significantly, because it uses systematic RS codes and an efficient hardware implementation.
A. Code Layout
Figure 2 shows the structure of EPA ECC. The code layout keeps the same memory access block size as the current scheme (256-bit data, 32-bit S-ECC redundancy, and 24-bit OD-ECC redundancy). Instead of increasing redundancy, EPA ECC treats each byte as an 8-bit symbol, resulting in a non-binary ECC. The original 288 bits of data and S-ECC redundancy thus become 36 data symbols, and the 24-bit OD-ECC redundancy becomes 3 redundant symbols. This transformation improves the error-correcting capability without increasing redundancy.
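The symbol construction can be illustrated with a short sketch. The bit-to-symbol mapping used here (symbol i holds bits 8i to 8i+7 of the block) and the helper names `to_symbols` and `symbols_hit` are assumptions for illustration; the actual mapping in EPA ECC follows Figure 2.

```python
DATA_BITS, SECC_BITS, ODECC_BITS = 256, 32, 24  # block sizes from the text
SYMBOL_BITS = 8

def to_symbols(value, n_bits):
    """Split an n_bits-wide integer into 8-bit symbols, lowest byte first."""
    assert n_bits % SYMBOL_BITS == 0
    return [(value >> (SYMBOL_BITS * i)) & 0xFF
            for i in range(n_bits // SYMBOL_BITS)]

def symbols_hit(error_mask, n_bits):
    """Return the indices of the symbols touched by a bit-error mask."""
    return [i for i, s in enumerate(to_symbols(error_mask, n_bits)) if s]

n_data_symbols = len(to_symbols(0, DATA_BITS + SECC_BITS))  # 288 bits
n_check_symbols = len(to_symbols(0, ODECC_BITS))            # 24 bits
```

With this mapping, a three-bit error confined to one byte, such as the mask `0b1011 << 16`, touches a single symbol, whereas two adjacent bits that straddle a byte boundary (`0b11 << 7`) touch two symbols. This is why the symbol boundaries need to line up with the observed byte-level error patterns.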
EPA ECC then applies RS codes to the entire access block. The 3-symbol redundancy allows 1-symbol correction and 2-symbol detection, so EPA ECC can correct all single-byte errors and detect all double-byte errors within the access block. By using the larger block size and applying shortened RS codes, EPA ECC offers stronger protection at the same redundancy ratio than 3 individual SEC-DED OD-ECC codes.
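A single-symbol-correcting, double-symbol-detecting shortened RS code over 36 data symbols and 3 check symbols can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the hardware design evaluated in this paper: the field polynomial 0x11D, the generator roots α^0, α^1, α^2, the placement of check symbols at positions 0 to 2, and all function names are choices made here.

```python
K, R = 36, 3          # data symbols and check symbols (from the text)
N = K + R             # shortened RS(39, 36) over GF(2^8)
PRIM = 0x11D          # assumed primitive polynomial for GF(2^8)

# Exponent and logarithm tables; EXP is doubled so products need no modulo.
EXP, LOG = [0] * 510, [0] * 256
_x = 1
for _i in range(255):
    EXP[_i], LOG[_x] = _x, _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]

def gf_mul(a, b):
    """Multiply two GF(2^8) elements."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

# Generator g(x) = (x + a^0)(x + a^1)(x + a^2), coefficients lowest degree first.
GEN = [1]
for _k in range(R):
    nxt = [0] * (len(GEN) + 1)
    for _i, c in enumerate(GEN):
        nxt[_i] ^= gf_mul(c, EXP[_k])
        nxt[_i + 1] ^= c
    GEN = nxt

def encode(data):
    """Systematic encoding: returns [3 check symbols] + [36 data symbols]."""
    assert len(data) == K
    rem = [0] * R + list(data)            # m(x) * x^3
    for i in range(N - 1, R - 1, -1):     # polynomial long division by g(x)
        coef = rem[i]
        if coef:
            for j in range(R + 1):
                rem[i - R + j] ^= gf_mul(coef, GEN[j])
    return rem[:R] + list(data)

def syndromes(word):
    """Evaluate the received polynomial at a^0, a^1, a^2."""
    s = [0] * R
    for i, sym in enumerate(word):
        for k in range(R):
            s[k] ^= gf_mul(sym, EXP[(k * i) % 255])
    return s

def decode(word):
    """Return (word, status); status is 'ok', 'corrected', or 'detected'."""
    s0, s1, s2 = syndromes(word)
    if s0 == s1 == s2 == 0:
        return list(word), "ok"
    # A single error e at position j gives syndromes (e, e*a^j, e*a^(2j)).
    if s0 and s1 and s2 and gf_mul(s1, s1) == gf_mul(s0, s2):
        pos = (LOG[s1] - LOG[s0]) % 255
        if pos < N:                       # must fall inside the shortened code
            fixed = list(word)
            fixed[pos] ^= s0
            return fixed, "corrected"
    return list(word), "detected"
```

Because three check symbols give a minimum distance of 4, any error confined to one symbol is corrected regardless of how many of its 8 bits flipped, and any error spanning exactly two symbols is reported as detected rather than miscorrected. Errors spanning three or more symbols are outside this guarantee.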
EPA ECC cannot correct two or three independent bit errors that belong to different symbols, whereas the current OD-ECC scheme can correct some of them using its 3 SEC-DED codes. However, the likelihood of two or more independent errors within the same block is quite low, as evidenced by the beam-testing results (Table 1): 0.11% for 2-bit errors and an even lower 0.03% for 3-bit errors.
Given the low probability of multiple independent errors occurring within the same block, EPA ECC provides a good trade-off between error-correction capability and redundancy overhead. Moreover, our evaluation (details can be found in our paper) shows that EPA ECC provides near 100% detection of 3-bit errors, whereas the current 3× SEC-DED often miscorrects 3-bit errors that belong to the same word, resulting in silent data corruption (SDC). The benefits of stronger protection against single-byte errors and the detection of double-byte errors outweigh the drawback of not correcting certain instances of two or three independent bit errors in different symbols.
Conclusion
This paper proposes EPA ECC, an error-pattern-aligned OD-ECC that improves memory reliability. EPA ECC can correct the dominant soft errors in HBMs, reducing DUE by 4.6× and SDC by 4980× relative to the SEC-DED OD-ECC [2]. EPA ECC is a novel ECC code for HBM2E that improves memory reliability at an acceptable system performance degradation (around 1%).
References
[1] M. B. Sullivan, “Characterizing and mitigating soft errors in GPU DRAM,” in MICRO, 2021.
[2] K. C. Chun, “A 16-GB 640-GB/s HBM2E DRAM with a data-bus window extension technique and a synergetic on-die ECC scheme,” IEEE JSSC, 2020.