Abstract
Memory reliability is key to attaining resilience at scale, where effective error detection and correction are essential for ensuring data integrity and preventing system failures. Technologies such as On-Die Error Correction Code (OD-ECC) and Rank-Level ECC (RL-ECC) have evolved independently, each mitigating various errors and enhancing reliability. However, an optimal combination of OD-ECC and RL-ECC could lead to further improvements. In particular, the optimal scheme and mapping between OD-ECC with Bounded Fault (BF) applied and RL-ECC in DDR5 have yet to be explored. This paper investigates the importance of scheme consistency for reliability in memory sub-systems and provides guidelines for RL-ECC design for system companies.
Methodology
A. On-Die ECC Bound Scheme
DDR5's OD-ECC, which requires enhanced RAS (Reliability, Availability, and Serviceability) features, implements a BF design to prevent multiple faults from being mis-corrected into different bounds. Because of DDR5's limited block size, there are two bound sizes: 16 bits and 32 bits. These two sizes are further divided into five schemes (Table 1). The division takes into account the x4 devices commonly used in servers and DDR5's Burst Length (BL), resulting in schemes of (# of DQ x BL) 1x16, 2x8, and 4x4 for 16-bit bounds, and 2x16 and 4x8 for 32-bit bounds.
Table 1. On-Die Bound Scheme
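The five bound schemes can be made concrete with a small sketch. It assumes that one x4 device burst is modelled as a grid of 4 DQs by 16 beats (64 bits) and that bounds tile this grid without overlap; the names `bound_index` and `bounds` are hypothetical helpers, not part of any DDR5 specification.

```python
from itertools import product

NUM_DQ = 4      # x4 device, as considered in the text
BURST_LEN = 16  # ASSUMPTION: one burst modelled as 16 beats (BL16)

# (DQ width, beats) of each bound, taken from the schemes listed above.
OD_BOUND_SCHEMES = {
    "1x16": (1, 16), "2x8": (2, 8), "4x4": (4, 4),  # 16-bit bounds
    "2x16": (2, 16), "4x8": (4, 8),                 # 32-bit bounds
}

def bound_index(dq, beat, scheme):
    """Return the index of the bound that contains bit (dq, beat)."""
    width, length = OD_BOUND_SCHEMES[scheme]
    bounds_per_row = NUM_DQ // width
    return (beat // length) * bounds_per_row + (dq // width)

def bounds(scheme):
    """Group all bits of one device burst by the bound they fall into."""
    groups = {}
    for dq, beat in product(range(NUM_DQ), range(BURST_LEN)):
        groups.setdefault(bound_index(dq, beat, scheme), []).append((dq, beat))
    return groups

for name, (width, length) in OD_BOUND_SCHEMES.items():
    print(name, "->", len(bounds(name)), "bounds of", width * length, "bits")
```

Under these assumptions, every 16-bit scheme splits the burst into four bounds and every 32-bit scheme into two; only the shape of each bound differs.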
B. Rank-Level ECC Scheme
RL-ECC, capable of achieving chip-level error correction, is used in servers requiring high reliability. We base our approach on Reed-Solomon codes with symbol sizes of 8 bits and 16 bits, balancing overhead against latency. As with the OD bounds, we divide the schemes into three based on (# of DQ x BL count per symbol), as shown in Table 2. Fig. 1 depicts the orthogonal operation of the OD and RL schemes in the memory sub-system.
Table 2. Rank-Level Scheme
Figure 1. OD-ECC and RL-ECC combinations scheme in (b) OD-Bound Scheme in Table 1. (c) Partial RL-ECC Scheme in Table 2. In this paper, 30 combinations are created using the matrices in (b) and (c).
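One way to picture the 30 combinations is to count, for each pairing, how many RL symbols a single OD bound overlaps within one chip. The sketch below is illustrative only: the text gives the five OD bound shapes, but the six RL symbol shapes used here (three per symbol size) are an assumption chosen so that the shapes multiply out to 8 or 16 bits, and `symbols_touched_by_bound` is a hypothetical helper.

```python
from itertools import product

NUM_DQ, BURST_LEN = 4, 16  # ASSUMPTION: one x4 device burst, 4 DQs x 16 beats

# OD bound shapes (DQ width, beats) as listed in Section A.
OD_BOUNDS = {"1x16": (1, 16), "2x8": (2, 8), "4x4": (4, 4),
             "2x16": (2, 16), "4x8": (4, 8)}

# ASSUMPTION: these symbol shapes are NOT given in the text; they are
# plausible (DQ width, beats) layouts for 8-bit and 16-bit symbols.
RL_SYMBOLS = {"8b-1x8": (1, 8), "8b-2x4": (2, 4), "8b-4x2": (4, 2),
              "16b-1x16": (1, 16), "16b-2x8": (2, 8), "16b-4x4": (4, 4)}

def region_index(dq, beat, shape):
    """Index of the tile of the given shape that contains bit (dq, beat)."""
    width, length = shape
    return (beat // length) * (NUM_DQ // width) + (dq // width)

def symbols_touched_by_bound(od_shape, rl_shape):
    """Count the distinct RL symbols overlapped by the first OD bound."""
    touched = set()
    for dq, beat in product(range(NUM_DQ), range(BURST_LEN)):
        if region_index(dq, beat, od_shape) == 0:
            touched.add(region_index(dq, beat, rl_shape))
    return len(touched)

combinations = {
    (od, rl): symbols_touched_by_bound(OD_BOUNDS[od], RL_SYMBOLS[rl])
    for od in OD_BOUNDS for rl in RL_SYMBOLS
}
print(len(combinations), "combinations")
```

In this model, a pairing whose shapes match (for example a 1x16 bound with a 1x16 symbol) confines a bound to one symbol, while a mismatched pairing spreads the same bound over several symbols.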
C. Error Scenario
We randomly inject errors into data blocks to emulate error scenarios. Error injection is based on five fault models: Single, Chip-Level, Pin-Aligned, Multi-Bit Bounded (MBBE), and Word. Errors are generated and injected into the data blocks according to these models, with each bit within the affected region having a 50% chance of being flipped. For each fault model, we assess whether the errors result in Detectable and Correctable Errors (DCE), Detectable but Uncorrectable Errors (DUE), or Silent Data Corruption (SDC).
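The injection and classification steps described above might be sketched as follows. This is a minimal illustration rather than the simulator used in the paper: the data layout (a dictionary keyed by `(dq, beat)`), the example `pin_region`, and the decoder interface implied by `classify` (decoded data plus an "uncorrectable" flag) are all assumptions.

```python
import random

NUM_DQ, BURST_LEN = 4, 16  # ASSUMPTION: one x4 device burst, 4 DQs x 16 beats

def inject(data, region, rng, flip_prob=0.5):
    """Return a copy of `data` in which each bit of `region` flips with flip_prob."""
    corrupted = dict(data)
    for pos in region:
        if rng.random() < flip_prob:
            corrupted[pos] ^= 1
    return corrupted

def classify(original, decoded, flagged_uncorrectable):
    """Map a (hypothetical) decoder outcome to DCE, DUE, or SDC."""
    if flagged_uncorrectable:
        return "DUE"   # error detected but not corrected
    if decoded == original:
        return "DCE"   # data restored (this sketch also counts "no bit flipped" here)
    return "SDC"       # wrong data returned without any flag

# ASSUMPTION: a pin-aligned fault is modelled as one DQ across all beats.
pin_region = [(0, beat) for beat in range(BURST_LEN)]

clean = {(dq, beat): 0 for dq in range(NUM_DQ) for beat in range(BURST_LEN)}
faulty = inject(clean, pin_region, random.Random(1))
print(sum(faulty.values()), "bits flipped in the pin-aligned region")
```

Because each bit flips independently with probability 0.5, a trial can occasionally flip no bits at all; a fuller simulation would need to decide whether to discard or count such trials.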
Result
We found that the scheme illustrated as an example in DDR5 (a narrow, elongated OD-ECC bound limited to a width of 1 DQ) and the currently most widely used RL scheme, represented by Chipkill (a wider scheme with a symbol width of 4 DQs), are not the most effective options.
Figure 2. Comparison of correction coverage for each scheme. MBBE scenarios are injected into two separate chips. Each bar represents the OD-ECC bound.