chapter 4: fault tolerance and redundancy

single point of failure (SPOF): hardware failure, OS failure, application failure, operational failure, environmental failure.

avoiding a SPOF demands redundancy.

redundancy: a redundant component is one that can be connected to the same inputs and can provide the same outputs as another component.

dependency and redundancy: a dependency can indicate a SPOF, and a component is redundant only when all of its dependencies are redundant.
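
The dependency rule above can be sketched as a walk over a dependency map. This is only an illustration: the role names, the replica counts, and the function `find_spofs` are hypothetical, and a role is treated as redundant simply when it has at least two replicas.

```python
def find_spofs(replicas, depends_on, root):
    """Return every role reachable from root that has fewer than two replicas.

    replicas:   role name -> number of interchangeable instances (assumed data)
    depends_on: role name -> list of roles it depends on (assumed data)
    """
    spofs, seen, stack = set(), set(), [root]
    while stack:
        role = stack.pop()
        if role in seen:
            continue  # already inspected, also protects against cycles
        seen.add(role)
        if replicas.get(role, 0) < 2:
            spofs.add(role)
        stack.extend(depends_on.get(role, ()))
    return spofs


# Hypothetical layout: two web servers and two databases share one switch.
replicas = {"web": 2, "db": 2, "switch": 1, "power": 2}
depends_on = {"web": ["db", "switch"], "db": ["switch", "power"]}
spofs = find_spofs(replicas, depends_on, "web")
print(spofs)
```

In this layout the web tier looks redundant on its own, yet the shared switch still shows up as a SPOF because every path depends on it.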

spatial/physical redundancy: switchover

temporal/time redundancy: ACK/NACK, retry
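
A minimal sketch of time redundancy, assuming a channel function that returns True for ACK and False for NACK or timeout. The names `send_with_retry` and `make_flaky_channel` and the retry limit are illustrative, not taken from any particular protocol.

```python
import time


def send_with_retry(send, message, max_attempts=3, delay_s=0.0):
    """Repeat send(message) until it is ACKed; return the attempt number used."""
    for attempt in range(1, max_attempts + 1):
        if send(message):  # True means ACK
            return attempt
        if attempt < max_attempts:
            time.sleep(delay_s)  # wait before retrying after a NACK
    raise TimeoutError(f"no ACK after {max_attempts} attempts")


def make_flaky_channel(failures):
    """Build a fake channel that NACKs the first `failures` sends, then ACKs."""
    state = {"left": failures}

    def send(message):
        if state["left"] > 0:
            state["left"] -= 1
            return False
        return True

    return send


attempts_used = send_with_retry(make_flaky_channel(2), "hello")
print(attempts_used)
```

The same operation is repeated in time instead of being duplicated in space, so a transient fault is masked but a permanent one only turns into a timeout.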

information redundancy: duplicate data in one message.
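
As a hedged illustration of information redundancy, the sketch below appends a CRC32 checksum to each message. Note that this swaps literal duplication for derived check bits, a common variant of the same idea; the frame layout and the function names `encode` and `decode` are assumptions made for the example.

```python
import zlib


def encode(payload: bytes) -> bytes:
    """Append a 4-byte CRC32 of the payload (the redundant information)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def decode(frame: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum does not match."""
    if len(frame) < 4:
        raise ValueError("frame too short")
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) != received:
        raise ValueError("checksum mismatch")
    return payload


frame = encode(b"hello")
print(decode(frame))
```

A checksum only detects corruption; the receiver still needs time redundancy (a NACK and a retry) or an error-correcting code to recover the data.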

parallel redundancy: 2 or more spare parts (N-modular redundancy, TMR (triple modular redundancy), DMR (dual modular redundancy), voting or split brain), joined by a common component which may fail too.
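
A minimal sketch of majority voting over replica outputs, assuming the outputs are hashable values that can be compared for equality. The function name `vote` is illustrative. The voter itself plays the role of the common component that may fail too.

```python
from collections import Counter


def vote(outputs):
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    return value


# Three replicas (TMR): one faulty output is outvoted.
print(vote([42, 42, 41]))
```

With two replicas (DMR) a disagreement leaves no majority, so the fault can be detected but not masked; `vote([1, 2])` raises in this sketch.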

standby redundancy: a standby part, hot/cold/warm standby

failover cluster: stateful (logical data corruption, software patch), (active/passive/standby, hot spare).

service group: dependency, start/restart, liveness check (timeout chain for hung process/hardware), failover ping-pong, migration, activation/deactivation during error conditions.
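
The liveness check can be sketched as a heartbeat monitor with a timeout. This is a simplified illustration: the class `HeartbeatMonitor`, the node name, and the 5-second timeout are assumptions, and the clock is injected so the example stays deterministic.

```python
class HeartbeatMonitor:
    """Declare a node dead when no heartbeat arrived within timeout_s."""

    def __init__(self, timeout_s, now):
        self.timeout_s = timeout_s
        self.now = now  # callable returning the current time in seconds
        self.last_seen = {}

    def beat(self, node):
        """Record a heartbeat from node at the current time."""
        self.last_seen[node] = self.now()

    def is_alive(self, node):
        """A node is alive only if it has beaten recently enough."""
        seen = self.last_seen.get(node)
        return seen is not None and self.now() - seen <= self.timeout_s


clock = [0.0]  # fake clock, advanced by hand
monitor = HeartbeatMonitor(timeout_s=5.0, now=lambda: clock[0])
monitor.beat("node-a")
clock[0] = 3.0
alive_at_3 = monitor.is_alive("node-a")
clock[0] = 6.0
alive_at_6 = monitor.is_alive("node-a")
print(alive_at_3, alive_at_6)
```

The timeout is the trade-off: a short one risks failing over a node that is merely slow, while a long one delays detection of a hung process.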

problems: failover success rate, migration taking too long, liveness check taking too long, deactivation in an error condition, not using journaling, cleanup after crash, or migration.

load balancing cluster (server farm): simple stateless web HTTP service, DNS and DS, IP load balancer, reverse proxy, hardware scheduling.
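
A rough sketch of how a load balancer in front of stateless servers might skip a failed backend, assuming simple round-robin scheduling. The class `RoundRobinBalancer` and the backend names are hypothetical; real balancers add health probes, weights, and connection tracking.

```python
import itertools


class RoundRobinBalancer:
    """Hand out backends in turn, skipping those marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._cycle = itertools.cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend, trying each backend at most once."""
        for _ in range(len(self.backends)):
            backend = next(self._cycle)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backend available")


lb = RoundRobinBalancer(["a", "b", "c"])
first_round = [lb.pick() for _ in range(3)]
lb.mark_down("b")
after_failure = [lb.pick() for _ in range(4)]
print(first_round, after_failure)
```

Because the service is stateless, any remaining backend can take the next request and no failover of state is needed.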

fail fast: isolate a dead node to avoid split brain syndrome, using a quorum device (disk heartbeat).
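
The quorum idea can be sketched as a strict-majority vote count. The example assumes a two-node cluster where each node has one vote and the quorum device adds a third; the functions `has_quorum` and `decide` are illustrative and do not describe any specific cluster product.

```python
def has_quorum(votes_seen: int, total_votes: int) -> bool:
    """True when strictly more than half of all configured votes are visible."""
    return votes_seen > total_votes // 2


def decide(node_votes_visible: int, holds_quorum_device: bool, total_votes: int = 3) -> str:
    """Decide whether this side of a partition keeps running or fails fast."""
    votes = node_votes_visible + (1 if holds_quorum_device else 0)
    return "stay up" if has_quorum(votes, total_votes) else "fail fast"


# The interconnect is cut: each node sees only its own vote.
winner = decide(node_votes_visible=1, holds_quorum_device=True)
loser = decide(node_votes_visible=1, holds_quorum_device=False)
print(winner, loser)
```

Only one side can hold the quorum device, so at most one side reaches a majority and the other stops itself instead of writing to shared data.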

is redundant "cheap hardware/software" better than a single expensive hardware/software?
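
One way to reason about this question is simple availability arithmetic, sketched below. The availability figures are made-up numbers chosen for illustration, and the formulas assume that components fail independently, which shared power, software bugs, or operator errors can easily violate.

```python
def parallel(*availabilities):
    """Availability of redundant parts: the group fails only if all parts fail."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


def series(*availabilities):
    """Availability of a chain: the chain fails if any part fails."""
    all_up = 1.0
    for a in availabilities:
        all_up *= a
    return all_up


# Hypothetical figures: cheap node 99%, expensive node 99.9%, switch 99.9%.
cheap_pair = parallel(0.99, 0.99)
expensive_single = 0.999
cheap_pair_behind_switch = series(cheap_pair, 0.999)
print(cheap_pair, expensive_single, cheap_pair_behind_switch)
```

Under these assumed figures the cheap pair beats the single expensive node on its own, but loses the advantage once a non-redundant switch sits in front of it.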

granularity of "SPOF" and "redundancy" level.

keep it simple and straightforward: do one thing and do it well, fewer dependencies, virtualization.

redundancy management hardware/software is typically not redundant itself.