How can a system be protected against soft errors?

The basic strategy against soft errors is redundancy. Suppose a processor calculates 4+3. Without a soft error, the result is 4+3=7. Now suppose a soft error corrupts the processor during the calculation so that the result becomes 4+3=3. Without any protection, the processor carries on with subsequent calculations, unaware that the previous result was wrong.

The simplest solution is to repeat the calculation. Suppose the processor calculates 4+3 twice. If a soft error corrupts one of the two calculations, the processor ends up with two results, 4+3=3 and 4+3=7. By comparing them, the processor can detect the corruption, because identical calculations produced mismatching results. This double redundancy is called DMR (dual modular redundancy). DMR can usually detect a single soft error.
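The comparison step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how a real processor implements DMR: the `add` helper, its `flip_bit` parameter, and `dmr_add` are hypothetical names, and the soft error is simulated by flipping one bit of the result (flipping bit 2 of 7 yields 3, matching the example above).

```python
def add(a, b, flip_bit=None):
    """Add two integers; optionally flip one bit of the result.

    The bit flip is a simulated soft error (an assumption for illustration).
    """
    result = a + b
    if flip_bit is not None:
        result ^= 1 << flip_bit  # injected fault
    return result


def dmr_add(a, b, fault_in_copy=None, flip_bit=2):
    """Run the same addition twice and compare the two results.

    fault_in_copy selects which copy (0 or 1) receives the simulated fault;
    None means both copies run fault-free.
    Returns (first_result, second_result, results_match).
    """
    first = add(a, b, flip_bit if fault_in_copy == 0 else None)
    second = add(a, b, flip_bit if fault_in_copy == 1 else None)
    return first, second, first == second


print(dmr_add(4, 3))                   # (7, 7, True)
print(dmr_add(4, 3, fault_in_copy=0))  # (3, 7, False)
```

A mismatch tells the caller that something went wrong, but not which of the two values to trust.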

While DMR can detect a soft error, it can hardly correct one: with only two results (4+3=3 and 4+3=7), the processor cannot determine which one is correct. Now suppose the processor performs the calculation (4+3) three times. If a soft error corrupts one calculation, the processor has three results (two 4+3=7 and one erroneous 4+3=3). By majority vote, the processor can choose 4+3=7 as the correct result. This triple redundancy is called TMR (triple modular redundancy).
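Majority voting over three copies can be sketched the same way. As before, this is a minimal illustration rather than a definitive implementation: `add`, `flip_bit`, and `tmr_add` are hypothetical names, and the soft error is simulated by flipping a single bit of one copy's result.

```python
from collections import Counter


def add(a, b, flip_bit=None):
    """Add two integers; optionally flip one bit of the result.

    The bit flip is a simulated soft error (an assumption for illustration).
    """
    result = a + b
    if flip_bit is not None:
        result ^= 1 << flip_bit  # injected fault
    return result


def tmr_add(a, b, fault_in_copy=None, flip_bit=2):
    """Run the same addition three times and return the majority result.

    fault_in_copy selects which copy (0, 1, or 2) receives the simulated
    fault; None means all three copies run fault-free.
    """
    results = [
        add(a, b, flip_bit if i == fault_in_copy else None)
        for i in range(3)
    ]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        # All three copies disagree, so no majority exists.
        raise RuntimeError("no majority among the three results")
    return value


print(tmr_add(4, 3))                   # 7
print(tmr_add(4, 3, fault_in_copy=1))  # 7
```

With one corrupted copy the vote is two against one, so the erroneous 3 is outvoted and 7 is returned.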

Redundancy can be implemented at the hardware level or the software level. Hardware-level redundancy, which replicates hardware components or entire processors, is effective against soft errors. However, it incurs additional hardware cost because the hardware itself must be modified, and it is hard to apply to systems that have already been manufactured. Software-level redundancy, which replicates the software instead, can protect systems against soft errors without modifying the hardware. It can be applied at various granularities, from the assembly (instruction) level to the entire application level.