chapter 1: basic concepts of HA

1. some basic definitions

Defect: a variation between the actual and the expected result in the product/software.

Fault: a condition or defect that may or may not cause a system to fail in performing its required function. A fault is internal to the system.

Failure: the inability of a system or component to perform its required functions within specified performance requirements. A failure is external and visible to system users.

Availability: continued system functionality without reduced capabilities, regardless of whether faults or failures occur. A more classic and formal definition: the probability that a system is available to perform its specified functionality under specified conditions during a specified period of time.

High Availability: a system with availability of 99.999% ( "five nines" ) or higher.

Reliability: the probability that a system can perform its specified functionality under specified conditions during a specified period of time.

Serviceability: the probability that a system can be repaired to restore its specified functionality within a specified period of time.

Disaster Recovery: recovering from a major system outage with possibly reduced capabilities. Sometimes also called service continuity. It is often associated with a recovery time objective ( RTO ) and a recovery point objective ( RPO ).

Fault Tolerance: continued system functionality, with possibly reduced capabilities, in the event of failures or faults within subsystems. There should be no system outage for a fault-tolerant system.

MTBF ( Mean Time Between Failures ): the average time a system can continue to run before it hits a failure. It is often quoted as MTTF ( Mean Time To Failure ) as well.

MTTR ( Mean Time To Repair ): the average time to repair a system so that it can perform its specified functionality again. It is often quoted as "Mean Time To Recovery" as well.

Annual Failure Rate ( AFR ): the expected number of failures per year for a system.
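
Since MTBF and AFR both measure how often a system fails, one can be derived from the other. Below is a minimal sketch in Python, assuming a constant failure rate; the 100,000-hour MTBF is a made-up illustrative value.

    HOURS_PER_YEAR = 365 * 24  # 8760

    def afr_from_mtbf(mtbf_hours: float) -> float:
        """expected number of failures per year, given MTBF in hours"""
        return HOURS_PER_YEAR / mtbf_hours

    print(afr_from_mtbf(100_000))  # 0.0876, i.e. an AFR of about 8.8%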

SLA ( Service Level Agreement ): a contract between a company and its customers specifying how the product should perform under various conditions.

2. relationship among reliability, serviceability and availability

Reliability is typically quantified as MTBF. It corresponds to system uptime.

Serviceability is typically quantified as MTTR. It is part of system downtime.

Availability is typically quantified as MTBF / ( MTBF + MTTR ). It is the percentage of time the system is up.

availability = reliability / ( reliability + serviceability ) = MTBF / ( MTBF + MTTR ) = uptime / ( uptime + downtime ) = 1 / ( 1 + downtime/uptime ).
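
As a quick illustration, the formula can be computed directly. A minimal sketch in Python; the MTBF and MTTR values below are made-up illustrative numbers.

    def availability(mtbf_hours: float, mttr_hours: float) -> float:
        """availability = MTBF / ( MTBF + MTTR )"""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # a system that runs 10,000 hours between failures
    # and takes 2 hours to repair
    a = availability(10_000, 2)
    print(f"{a:.6f}")  # 0.999800, i.e. 99.98% availability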

The above equation has several implications.

2a. reliability and serviceability are foundational for availability.

In fact, they are foundational for "highly available systems" and "mission-critical systems".

An HA system demands that the ratio between downtime and uptime be as small as possible. It does not constrain uptime or downtime alone. Telecom systems are typical HA systems.

A mission-critical system demands that uptime be long enough to finish the mission. It does not care much about downtime: once the mission is finished, the downtime can be very long, since the system is no longer on a mission. A space shuttle or a parachute is an example of a mission-critical system.

A "safety-critical systems" is a bit different from HA, or mission-critical. It does not demand too much on uptime or downtime. Instead, it demands on failure handling with fail-fast to quickly isolate the failure domain and fail-safe to fail to a safe mechanism such that system will not do any damage. We don't care what system can perform its functionality or not any more once it is dropped into fail-safe mode.

2b. availability refers to a "time frame".

If we have a system with availability of 99.999% for a month, the total downtime in that month must be at most 0.001% of the month, which is about 26 seconds. If we have a system with availability of 99.999% for a year, total downtime must be at most 0.001% of the year, which is about 315 seconds.
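
A quick sketch in Python to verify these downtime budgets:

    def downtime_budget_seconds(availability: float, period_seconds: int) -> float:
        """maximum downtime allowed while still meeting the availability target"""
        return (1.0 - availability) * period_seconds

    MONTH = 30 * 24 * 3600  # 2,592,000 seconds
    YEAR = 365 * 24 * 3600  # 31,536,000 seconds

    print(downtime_budget_seconds(0.99999, MONTH))  # ~25.9 seconds
    print(downtime_budget_seconds(0.99999, YEAR))   # ~315.4 seconds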

Apparently, the same availability, i.e. the same downtime/uptime ratio, applied to different periods of time requires different absolute uptime and downtime. However, a shorter time interval does not necessarily mean it is harder to build such a system. On the contrary, many simple systems can run for a month without any failure at all ( and then take a few days of downtime ), but very few systems can sustain 99.999% availability for a year. Annual Failure Rate ( AFR ) is a good number to quantify this as well.

2c. ratio between downtime and uptime is important.

Availability is calculated as 1 / ( 1 + downtime/uptime ), so the ratio between downtime and uptime determines the availability. However, this does not mean we don't care about total uptime or total downtime: the longer the uptime and the shorter the downtime, the better.

2d. shortening downtime is as important as extending uptime.

uptime * 10 / ( uptime * 10 + downtime ) = uptime / ( uptime + downtime * 0.1 ).

The above equation shows that extending uptime 10-fold is equivalent to cutting downtime by 90%. In many cases, extending uptime 10 times is more difficult than cutting downtime by 90%.
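
A small numeric check in Python, with made-up uptime and downtime values:

    uptime, downtime = 1000.0, 10.0  # hours, illustrative

    ten_x_uptime = (uptime * 10) / (uptime * 10 + downtime)
    tenth_downtime = uptime / (uptime + downtime * 0.1)

    print(ten_x_uptime, tenth_downtime)  # both ~0.999001: the same availability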

3. relationship between fault tolerance and availability

By definition, fault tolerance refers to continued system functionality, with possibly reduced capabilities, in the event of failures or faults within subsystems. There should be no system outage for a fault-tolerant system.

A fault-tolerant system will continue to run even when a subsystem experiences failures. From an external point of view, the whole system did not experience a failure. This effectively extends the system uptime and hence the system reliability.

Extending reliability is one aspect of improving a system's availability; refer back to the availability calculation:

availability = reliability / ( reliability + serviceability ) = MTBF / ( MTBF + MTTR ) = uptime / ( uptime + downtime ) = 1 / ( 1 + downtime/uptime ).
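
As a rough illustration, suppose fault tolerance masks enough subsystem failures to grow the system-level MTBF tenfold while MTTR stays the same. A minimal sketch in Python with made-up numbers:

    def availability(mtbf: float, mttr: float) -> float:
        return mtbf / (mtbf + mttr)

    mttr = 4.0  # repair time in hours, unchanged in both cases

    print(availability(1_000, mttr))   # ~0.996016, without fault tolerance
    print(availability(10_000, mttr))  # ~0.999600, faults masked, 10x MTBF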