chapter 4 fault tolerance and redundancy
single point of failure (SPOF): hardware failure, OS failure, application failure, operational failure, environmental failure.
avoiding SPOFs demands redundancy.
redundancy: a redundant component is one that can be connected to the same inputs and can provide the same outputs as another component.
dependency and redundancy: a dependency can indicate a SPOF, and a component is redundant only when all of its dependencies are redundant.
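example (sketch): the rule that a component is redundant only when all of its dependencies are redundant can be checked recursively. the inventory below (component names, instance counts, dependency lists) is hypothetical, and the dependency graph is assumed to be acyclic.

```python
def is_redundant(name, instances, depends_on, path=()):
    """True only if `name` has at least 2 instances and every dependency is redundant too."""
    if name in path:
        raise ValueError(f"dependency cycle at {name}")
    if instances[name] < 2:
        return False  # a single instance is a SPOF by itself
    return all(
        is_redundant(dep, instances, depends_on, path + (name,))
        for dep in depends_on[name]
    )


def non_redundant(instances, depends_on):
    """Sorted names of components that are not effectively redundant."""
    return sorted(n for n in instances if not is_redundant(n, instances, depends_on))


# hypothetical inventory: instance count per component and what each one depends on
instances = {"web": 2, "db": 2, "san": 1, "power": 2}
depends_on = {"web": ["db"], "db": ["san", "power"], "san": ["power"], "power": []}

print(non_redundant(instances, depends_on))
```

here "web" and "db" each have two instances, yet neither counts as redundant because both ultimately depend on the single "san".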
spatial/physical redundancy: switchover
temporal/time redundancy: ACK/NACK, retry
information redundancy: duplicate data in one message.
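example (sketch): temporal and information redundancy often work together. below, a CRC32 appended to each message lets the receiver detect corruption and answer NACK, and the sender retries until it gets an ACK. the function names, the one-bit-flip channel model and the attempt limit are assumptions made for illustration, not a real protocol.

```python
import random
import zlib


def frame(payload: bytes) -> bytes:
    """Append a CRC32 so corruption can be detected (information redundancy)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def receive(data: bytes):
    """Return (True, payload) as ACK, or (False, None) as NACK on a bad checksum."""
    if len(data) < 4:
        return False, None
    payload, crc = data[:-4], int.from_bytes(data[-4:], "big")
    if zlib.crc32(payload) != crc:
        return False, None
    return True, payload


def noisy_channel(data: bytes, rng: random.Random, error_rate: float) -> bytes:
    """Hypothetical channel: flips one bit with probability `error_rate`."""
    if rng.random() < error_rate:
        i = rng.randrange(len(data))
        data = data[:i] + bytes([data[i] ^ 0x01]) + data[i + 1:]
    return data


def send_with_retry(payload: bytes, rng: random.Random, error_rate=0.5, max_attempts=5):
    """Resend until ACK (temporal redundancy). Returns (payload, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        ok, received = receive(noisy_channel(frame(payload), rng, error_rate))
        if ok:
            return received, attempt
    raise TimeoutError(f"no ACK after {max_attempts} attempts")
```

usage: `send_with_retry(b"hello", random.Random(1))` returns the payload together with the number of attempts it took; the retry limit keeps a permanently broken channel from blocking the sender forever.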
parallel redundancy: 2 or more spare parts (N-modular redundancy, TMR, DMR, voting or split brain),
joined by a common component which may fail too.
standby redundancy: a standby part, hot/cold/warm standby
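example (sketch): voting in N-modular redundancy can be written as a strict-majority vote over replica outputs. the `vote` function is a hypothetical helper; it plays the role of the common component that joins the replicas, so it is itself a potential point of failure.

```python
from collections import Counter


def vote(outputs):
    """Return the strict-majority value among replica outputs, or raise if there is none."""
    if not outputs:
        raise ValueError("no replica outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replicas")
    return value


# TMR: one faulty replica out of three is outvoted
print(vote([42, 42, 41]))
```

with TMR one bad replica is masked. with DMR, `vote([1, 2])` raises, because two replicas can detect a disagreement but cannot tell which side is right.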
failover cluster: stateful (logical data corruption, software patch), (active/passive/standby, hot spare),
service group: dependency, start/restart, liveness check (timeout chain for hung process/hardware), failover ping-pong, migration,
activation/deactivation during error condition,
problems: failover success rate, migration taking too long, liveness check taking too long, deactivation in error condition,
not using journaling, cleanup after crash, or migration,
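example (sketch): a service group monitor with a heartbeat-based liveness check and a guard against failover ping-pong. the class, its parameters (`timeout`, `max_failovers`, `window`) and the idea of passing the clock in as `now` are assumptions for illustration; deactivating the service on the failed node, which the notes list as a problem area, is not modelled here.

```python
class ServiceGroup:
    def __init__(self, nodes, timeout=3.0, max_failovers=2, window=60.0):
        self.nodes = list(nodes)
        self.active = self.nodes[0]
        self.timeout = timeout              # max heartbeat silence before a node counts as dead
        self.max_failovers = max_failovers  # failovers allowed within `window` seconds
        self.window = window
        self.last_seen = {n: 0.0 for n in self.nodes}
        self.failover_times = []

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def alive(self, node, now):
        return now - self.last_seen[node] <= self.timeout

    def check(self, now):
        """One monitoring step; returns the active node afterwards (None = group stopped)."""
        if self.active is None or self.alive(self.active, now):
            return self.active
        # ping-pong guard: forget old failovers, then refuse if too many are recent
        self.failover_times = [t for t in self.failover_times if now - t < self.window]
        if len(self.failover_times) >= self.max_failovers:
            self.active = None  # stop and wait for an operator
            return None
        standby = [n for n in self.nodes if n != self.active and self.alive(n, now)]
        if not standby:
            self.active = None
            return None
        self.failover_times.append(now)
        self.active = standby[0]
        return self.active
```

the timeout trades detection speed against false failovers: a short timeout reacts quickly to a hung node, a long one tolerates slow but healthy nodes.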
load balancing cluster (server farm): simple stateless web/HTTP service, DNS and DS, IP load balancer, reverse proxy, hardware scheduling,
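example (sketch): for stateless services a load balancer can simply rotate over the backends and skip those marked unhealthy. the class and method names are hypothetical; how health is detected (for example by periodic probes) is left out.

```python
from itertools import cycle


class RoundRobinBalancer:
    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend in round-robin order."""
        for _ in range(len(self.backends)):
            backend = next(self._ring)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backend")
```

because requests carry no server-side state, losing one backend only reduces capacity; no failover of state is needed.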
failfast: isolate dead node to avoid split brain syndrome, using quorum device (disk heartbeat)
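example (sketch): quorum-based fail-fast after a network split. each node has one vote and the quorum device adds one vote to the partition that holds it; a partition keeps running only with a strict majority of all votes. the function name and the "run"/"halt" labels are assumptions, and real cluster software uses its own vote accounting.

```python
def decide(partitions, quorum_device_holder=None):
    """Map each node to 'run' or 'halt' given the partitions seen after a split.

    partitions: list of sets of node names.
    quorum_device_holder: node that currently owns the quorum device, if any.
    """
    total = sum(len(p) for p in partitions)
    if quorum_device_holder is not None:
        total += 1  # the quorum device contributes one extra vote
    result = {}
    for part in partitions:
        votes = len(part) + (1 if quorum_device_holder in part else 0)
        action = "run" if votes * 2 > total else "halt"
        for node in part:
            result[node] = action
    return result


# two-node cluster split in half; node "a" holds the quorum device
print(decide([{"a"}, {"b"}], quorum_device_holder="a"))
```

without the quorum device, `decide([{"a"}, {"b"}])` halts both nodes: neither side has a majority, which is safe against split brain but leaves the service down.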
is redundant "cheap hardware/software" better than a single expensive hardware/software?
granularity of "SPOF" and "redundancy" level.
keep it simple and straightforward: do one thing and do it well, fewer dependencies, virtualization,
redundancy management hardware/software is typically not redundant itself
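example (sketch): the cheap-redundant versus single-expensive question can be explored with the standard availability formulas, assuming independent failures. the availability figures used here (0.99, 0.999) are made-up illustration values, not measurements.

```python
def parallel_availability(a: float, n: int) -> float:
    """Availability of n independent replicas when one working replica is enough."""
    return 1 - (1 - a) ** n


def series_availability(*parts: float) -> float:
    """Availability of a chain in which every part must work."""
    result = 1.0
    for p in parts:
        result *= p
    return result


cheap_pair = parallel_availability(0.99, 2)          # two cheap boxes
expensive_single = 0.999                             # one expensive box
manager = 0.999                                      # non-redundant redundancy manager
cheap_pair_with_manager = series_availability(cheap_pair, manager)

print(cheap_pair, expensive_single, cheap_pair_with_manager)
```

under these assumed numbers the cheap pair alone beats the expensive single box, but once the non-redundant manager is placed in series the combination ends up slightly below the single box. the result depends on the granularity at which SPOFs are counted.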