Chapter 3 General concepts on faults

To make a system more reliable, and hence more available, we need either to remove all faults from the system or to tolerate them so that no failure is triggered even when a fault condition exists.

1. Classification of faults

Faults can be classified based on when, where, what, why, and how they occur.

Byzantine fault: a fault for which we cannot predict when, why, or how it will happen. We might still know where it will happen and what its consequences are.

A Byzantine fault is often caused by a Heisenbug: a bug that is exposed only under certain timing and load conditions. For example, a pointer corruption bug may disappear when the program runs under gdb.

Intermittent fault: a fault whose timing is unpredictable. In software it is often due to race conditions.

Reproducible fault: a fault that can be reproduced by following specified steps or procedures under specified system conditions, including load, configuration, etc.

Operational and human fault: a fault caused by a human operator, e.g., typing in a wrong value for an IP address.

Other classifications include:

temporary or permanent faults, internal or external faults, active or latent faults.

2. Fault handling

The following areas are often involved when dealing with faults.

Fault avoidance/prevention: preventive/corrective maintenance, masking correctable faults.

Fault detection/prediction: voting systems, value range checking, data integrity checking, data comparison, timing checks (heartbeat).
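Three of the detection techniques listed above (voting, value range checking, and heartbeat timing checks) can be sketched in a few lines. This is only a minimal illustration: the function names majority_vote, in_range, and missed_heartbeats, and the idea of passing timestamps in as plain numbers, are assumptions made for the example and do not come from any particular system.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of replicas, or None.

    None signals that no majority exists, which is itself a detected fault.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def in_range(value: float, low: float, high: float) -> bool:
    """Value range check: True when the reading lies within [low, high]."""
    return low <= value <= high


def missed_heartbeats(last_seen: Dict[str, float], now: float, timeout: float) -> List[str]:
    """Timing check: names of components whose last heartbeat is older than timeout."""
    return sorted(name for name, seen in last_seen.items() if now - seen > timeout)
```

For instance, majority_vote([7, 7, 9]) selects 7 and masks the disagreeing replica, while three mutually different outputs yield None and should be escalated to diagnosis.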

Fault diagnosis: local/global diagnosis, granularity of diagnosis, single/multiple failure modes, self-diagnosis, online diagnosis.

Fault isolation: establish the fault domain and take the culprit out of service, using physical or logical isolation.
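Logical isolation, in the sense of taking the culprit out of service, might look like the sketch below. The ServicePool class, its method names, and the rule "isolate after max_failures consecutive failures" are all assumptions chosen for illustration; a real system would pick its own isolation policy.

```python
from typing import Dict, List, Set


class ServicePool:
    """Tracks members of a pool and removes a faulty one from service."""

    def __init__(self, members: List[str], max_failures: int = 3) -> None:
        self._failures: Dict[str, int] = {m: 0 for m in members}
        self._isolated: Set[str] = set()
        self._max_failures = max_failures

    def report(self, member: str, ok: bool) -> None:
        """Record the outcome of one request handled by member."""
        if ok:
            self._failures[member] = 0  # a success resets the failure streak
            return
        self._failures[member] += 1
        if self._failures[member] >= self._max_failures:
            self._isolated.add(member)  # logical isolation: stop routing to it

    def in_service(self) -> List[str]:
        """Members that may still receive work."""
        return [m for m in self._failures if m not in self._isolated]
```

Once a member is isolated, the rest of the pool keeps serving, and the isolated member can be handed over to diagnosis and repair.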

Fault recovery: forward recovery, e.g., re-sending a message.
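Forward recovery by re-sending can be sketched as a bounded retry loop. The send callable and its convention of returning True on acknowledgement are hypothetical stand-ins for whatever transport is actually used.

```python
from typing import Callable


def send_with_retry(send: Callable[[bytes], bool], message: bytes, max_attempts: int = 3) -> int:
    """Re-send message until send() reports an acknowledgement.

    Returns the number of attempts used. Raises RuntimeError when every
    attempt fails, so the caller can fall back to another recovery method.
    """
    for attempt in range(1, max_attempts + 1):
        if send(message):
            return attempt
    raise RuntimeError(f"message not acknowledged after {max_attempts} attempts")
```

The loop moves the system forward past a transient fault without restoring any earlier state. Re-sending is only safe when the receiver can tolerate duplicates, which is an assumption this sketch leaves to the caller.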

Backward recovery: roll back to a known state; pessimistic, checkpoint, reset, restart, reboot, rebalance/re-route, journaling.

Checkpoint data should sit outside the fault domain.
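A minimal sketch of checkpoint-based backward recovery follows. CheckpointStore is a hypothetical stand-in for storage outside the fault domain (in practice another process, host, or disk); here it only keeps a deep copy so that later corruption of the live state cannot reach the checkpoint. The apply_update helper and its validate callback are likewise assumptions for the example.

```python
import copy
from typing import Any, Callable, Dict, Optional

State = Dict[str, Any]


class CheckpointStore:
    """Stand-in for checkpoint storage that lives outside the fault domain."""

    def __init__(self) -> None:
        self._saved: Optional[State] = None

    def save(self, state: State) -> None:
        self._saved = copy.deepcopy(state)  # no data shared with the live state

    def load(self) -> State:
        if self._saved is None:
            raise LookupError("no checkpoint available")
        return copy.deepcopy(self._saved)


def apply_update(state: State, store: CheckpointStore, key: str, value: Any,
                 validate: Callable[[State], bool]) -> bool:
    """Checkpoint, apply the update, and roll back if validation fails."""
    store.save(state)
    state[key] = value
    if not validate(state):
        state.clear()
        state.update(store.load())  # backward recovery to the known state
        return False
    return True
```

Taking the checkpoint before every update is the simplest policy; a real system would usually checkpoint less often and combine this with a journal.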

Recovery domain: make the domain as small as possible, and separate it physically (memory space) and logically (no shared data).

Outside the recovery domain: supervision and monitoring are needed.

Fault repair: fix the faulty component; offline diagnosis, replacement, download, patch, etc.

Fault notification/reporting: in-band, out-of-band, in-line.