Chapter 2: Basic approaches for HA

Let's go back to the calculation of availability.

availability = reliability / ( reliability + serviceability ) = MTBF / ( MTBF + MTTR ) = uptime / ( uptime + downtime ) = 1 / ( 1 + downtime/uptime )

Clearly, we need to make uptime as long as possible and downtime as short as possible. Those are the two basic approaches.
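To make the formula concrete, here is a minimal Python sketch. The function name `availability` and all MTBF/MTTR figures are made up for illustration; they do not describe any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), i.e. uptime / (uptime + downtime)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: fails every 1000 hours, takes 10 hours to repair.
base = availability(1000.0, 10.0)

# Approach 1: halve the downtime (MTTR 10 -> 5 hours).
shorter_downtime = availability(1000.0, 5.0)

# Approach 2: double the uptime (MTBF 1000 -> 2000 hours).
longer_uptime = availability(2000.0, 10.0)

print(f"baseline:         {base:.6f}")
print(f"shorter downtime: {shorter_downtime:.6f}")
print(f"longer uptime:    {longer_uptime:.6f}")
```

Both changes improve on the baseline by the same amount, because availability depends only on the ratio downtime/uptime, as the last form of the formula shows.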

1. Make downtime as short as possible.

There are many aspects to this approach, including:

1. Make hardware modular so that repairing or replacing a bad part is very fast. This reduces what is sometimes called "corrective maintenance downtime": the downtime incurred when a part has failed and must be repaired.

2. Make system maintenance windows as short as possible. Adding lubricant to some hardware parts is an example. This kind of planned downtime is sometimes called "preventive maintenance downtime", because the maintenance is done to prevent future failures.

3. Reduce operator mistakes. It has been estimated that 45% of failures in IT services are due to operator mistakes or errors. Adopting Information Technology Infrastructure Library (ITIL) best practices can reduce them.

4. Make software modular so that a bug fix can be applied without a reboot, at the cost of only the shorter service disruption of a process restart. Many virtualization technologies can help here.

5. Support In-Service Software Upgrade (ISSU), so that software can be upgraded while the system stays in service.
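The items above all cut a different slice of the same yearly downtime. As a rough sketch, the downtime can be split into categories and summed. The class `DowntimeBudget`, its field names, and every number below are hypothetical, chosen only to show how replacing a reboot with a process restart shrinks one slice.

```python
from dataclasses import dataclass, replace

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


@dataclass(frozen=True)
class DowntimeBudget:
    """Yearly downtime in hours, split by cause (all values hypothetical)."""
    corrective_hours: float      # repairing or replacing failed parts
    preventive_hours: float      # planned maintenance windows
    operator_error_hours: float  # outages caused by operator mistakes
    software_patch_hours: float  # disruption while applying bug fixes

    def total(self) -> float:
        return (self.corrective_hours + self.preventive_hours
                + self.operator_error_hours + self.software_patch_hours)

    def availability(self) -> float:
        # uptime / (uptime + downtime) over one year
        return (HOURS_PER_YEAR - self.total()) / HOURS_PER_YEAR


# Assumed: 12 patches a year, each needing a 10-minute reboot (2 hours total).
with_reboot = DowntimeBudget(4.0, 2.0, 3.0, 2.0)

# Assumed: the same 12 patches via a 30-second process restart (0.1 hours total).
with_restart = replace(with_reboot, software_patch_hours=0.1)

print(f"reboot:  {with_reboot.total():.1f} h down, {with_reboot.availability():.5f}")
print(f"restart: {with_restart.total():.1f} h down, {with_restart.availability():.5f}")
```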

2. Make uptime as long as possible.

Uptime is a direct measure of system reliability, and there is a whole discipline, reliability engineering, devoted to making a system as reliable as possible. There are two basic approaches.

2a. Get rid of faults and failures.

Many reliability techniques are related to Failure Mode and Effects Analysis (FMEA): identify the ways a system can fail, then design to avoid them.

There can be several iterations between identifying failures and redesigning.
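One common FMEA convention, not described in the text above, is to score each failure mode for severity, occurrence, and detection (typically 1 to 10 each) and multiply the scores into a Risk Priority Number (RPN), so that each redesign iteration can start with the riskiest item. The sketch below assumes that convention; the failure modes and scores are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (almost always detected) .. 10 (hard to detect)

    @property
    def rpn(self) -> int:
        """Risk Priority Number = severity * occurrence * detection."""
        return self.severity * self.occurrence * self.detection


def rank_by_risk(modes: list[FailureMode]) -> list[FailureMode]:
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical failure modes and scores.
modes = [
    FailureMode("fan bearing wear", severity=6, occurrence=5, detection=3),
    FailureMode("power supply failure", severity=9, occurrence=3, detection=2),
    FailureMode("memory leak in daemon", severity=5, occurrence=6, detection=7),
]

ranked = rank_by_risk(modes)
for mode in ranked:
    print(f"{mode.rpn:4d}  {mode.name}")
```

After a redesign, the scores are re-estimated and the list is ranked again, which is one way to picture the iterations mentioned above.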

2b. Tolerate faults and failures.

In many cases, not every failure can be identified in advance, and we have to expect a failure at some point down the road. In such cases, we can design the system to tolerate a fault and continue to run even if a subsystem fails. This is the so-called "fault tolerance".

A fault-tolerant system is designed with enough margin in every aspect of its subsystems that a small deviation from the reference or baseline value will not cause the whole system to fail.

Tolerating a Single Point of Failure (SPOF) is sometimes very difficult. A SPOF can bring down the whole system. In such cases, redundancy may be the inevitable solution: let the primary subsystem fail and have a backup subsystem take over, so that the whole system still appears failure-free to external users.
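As a small illustration of design margin, the check below treats a reading as acceptable while it stays within a fractional margin of its baseline. The function name `within_margin`, the 12 V baseline, and the 5% margin are assumptions for the example only.

```python
def within_margin(measured: float, baseline: float, margin_fraction: float) -> bool:
    """True if `measured` deviates from `baseline` by at most the given fraction."""
    return abs(measured - baseline) <= margin_fraction * abs(baseline)


# Hypothetical supply rail: 12 V baseline, designed to tolerate +/- 5%.
small_deviation_ok = within_margin(12.4, 12.0, 0.05)   # 0.4 V off, margin is 0.6 V
large_deviation_ok = within_margin(13.0, 12.0, 0.05)   # 1.0 V off, margin is 0.6 V

print(small_deviation_ok, large_deviation_ok)
```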
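The primary/backup idea can be sketched in a few lines of Python. Everything here is hypothetical: `call_with_failover`, `primary`, and `backup` are illustrative names, and a real system would also need failure detection, state synchronization, and a way to fail back. The helper `parallel_availability` uses the standard formula for redundant units, under the assumptions that the units fail independently and that failover is instantaneous and always works.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when the primary and every backup have failed."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order; return the first successful answer."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # this replica failed, move on to the next one
            errors.append(exc)
    raise AllReplicasFailed(errors)


def parallel_availability(*availabilities: float) -> float:
    """Availability of redundant units: 1 minus the product of unavailabilities."""
    unavailability = 1.0
    for a in availabilities:
        unavailability *= 1.0 - a
    return 1.0 - unavailability


def primary() -> str:
    raise ConnectionError("primary is down")  # simulated failure


def backup() -> str:
    return "served by backup"


result = call_with_failover([primary, backup])
combined = parallel_availability(0.99, 0.99)  # two hypothetical 99% units

print(result)
print(f"{combined:.4f}")
```

The external caller only sees the return value of `call_with_failover`, so the failure of the primary stays hidden as long as at least one replica answers.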