
Reliability Modeling

Simple Reliability Models
Modeling the reliability of a simple system without repair is straightforward. A combination of series and parallel probabilities is adequate, and it can be evaluated with several simple methods, including the following:

1) Closed form equations for series/parallel probabilities
2) Reliability Block Diagram (RBD)
3) Fault Tree Analysis (FTA)
4) Failure State Diagram (FSD)
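As a sketch of method (1), the series and parallel closed forms for independent components can be written directly. The component reliabilities below are illustrative, not from the original text:

```python
# Closed-form reliability for independent components (illustrative values).

def series_reliability(rs):
    """Series system: every component must work, so R = product of R_i."""
    r = 1.0
    for ri in rs:
        r *= ri
    return r

def parallel_reliability(rs):
    """Parallel system: at least one component must work,
    so R = 1 - product of (1 - R_i)."""
    q = 1.0
    for ri in rs:
        q *= (1.0 - ri)
    return 1.0 - q

# Two components, each with reliability 0.9:
r_series = series_reliability([0.9, 0.9])      # 0.9 * 0.9 = 0.81
r_parallel = parallel_reliability([0.9, 0.9])  # 1 - 0.1 * 0.1 = 0.99
```

The same two formulas underlie RBD and FTA evaluation; an RBD multiplies series blocks and complements parallel blocks, while a fault tree does the complementary arithmetic on failure probabilities.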

Systems with complex success (or failure) logic cannot be modeled as a simple RBD or FSD. Examples include m-out-of-n logic and hardware configurations that change as a function of time. Assessing the reliability of these systems requires modeling events as dependencies. Changing configurations are common in nuclear cooling systems, in which the number of heat removal systems decreases as the reactor heat decreases; redundancy therefore increases.
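When the n components are identical and independent, m-out-of-n success logic does have a simple closed form (a binomial sum). A minimal sketch, with illustrative values:

```python
from math import comb

def m_of_n_reliability(m, n, r):
    """Probability that at least m of n identical, independent
    components are working, each with reliability r."""
    return sum(comb(n, k) * r**k * (1.0 - r)**(n - k)
               for k in range(m, n + 1))

# 2-out-of-3 voting with r = 0.9:
# C(3,2)*0.81*0.1 + C(3,3)*0.729 = 0.243 + 0.729 = 0.972
r_2oo3 = m_of_n_reliability(2, 3, 0.9)
```

The harder cases the text refers to are those where the components are not identical, or where n itself changes over time; those require dependency modeling rather than a single formula.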

The failure state diagram to the right shows a heat removal system for a nuclear reactor. It has one dependent event, Offsite Power (see "Treatment of Dependencies in Reliability Analyses").

Modeling reliability that includes repairs of subsystems is more complicated. If failure rates and restoration rates are constant, then Markov models can be employed. With a Markov model you can estimate the percentage of time spent in each state, the probability of failure when repairs are possible, and the expected number of minutes (or seconds) of downtime. However, Markov models can easily become too complex. To the right is a model for all combinations of three pieces of hardware. While this is very easy to evaluate, imagine the complexity that results when there are 10 or 20 states.
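As a minimal sketch of the Markov approach (not the model in the figure): a single repairable unit with constant failure rate `lam` and repair rate `mu` has two states, and its steady-state probabilities follow from solving the balance equations. The rates below are assumptions for illustration; the same linear-algebra pattern scales to larger state spaces:

```python
import numpy as np

# Two-state Markov model of one repairable unit (illustrative rates, per hour).
# State 0 = up, state 1 = down. Q is the generator matrix.
lam, mu = 1e-4, 1e-1

Q = np.array([[-lam,  lam],
              [  mu,  -mu]])

# Steady state: solve pi @ Q = 0 with the normalization sum(pi) = 1.
# Stack the normalization row onto Q^T and solve by least squares.
A = np.vstack([Q.T, np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

availability = pi[0]   # fraction of time in the "up" state; analytically mu/(lam+mu)
downtime_min_per_year = (1.0 - availability) * 365.0 * 24.0 * 60.0
```

For n pieces of hardware the state space grows as 2^n, which is why merging states (discussed below) matters in practice.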

It is possible to reduce the number of states by consolidation or "merging" as long as you follow certain rules carefully. Also, implied in any Markov model is a repair process that includes the possible number of simultaneous repairs and the specific groups of hardware that are repaired in a single repair action. This is discussed in "Implied Service Strategies in Availability Assessments".

When failure rates or restoration rates are not constant (i.e., failures are not exponentially distributed), Monte Carlo simulation is necessary. A Monte Carlo simulation can assess the probability of a set of events occurring simultaneously regardless of sequence, or model a sequence of events occurring in a specific order.

A good example is the distribution of time to reconstruct data on a hard disk drive (HDD) in a redundant array of independent disks (RAID). The capacity of the HDD and the data transfer rate set a minimum restore time: it takes a finite amount of time to fill a 1 terabyte (TB) HDD with data. The RAID architecture allows the rate of reconstruction to vary based on the other demands of reading and writing, but the software can set a limit on the maximum time allowed. An exponential distribution cannot capture this, because its rate is constant: the probability of occurrence in a time interval t1 is the same whether t1 begins in 10 seconds or in 10 years. The figure below is a pictorial representation of the sequential Monte Carlo simulation developed for analyzing RAID reliability with restoration; see "Enhanced Reliability Modeling of RAID Storage Systems".
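A hedged sketch of the idea (not the cited paper's model): a two-disk mirror where disk lifetimes are exponential but the rebuild time is not, having a hard minimum set by capacity and transfer rate and a software-imposed maximum. All rates and times below are assumed for illustration:

```python
import random

random.seed(1)

MTBF_H = 1.0e6          # assumed mean time to failure per disk, hours
MISSION_H = 5 * 8760    # 5-year mission time
REBUILD_MIN_H = 10.0    # minimum rebuild time (capacity / transfer rate)
REBUILD_MAX_H = 48.0    # software-imposed maximum rebuild time

def one_trial():
    """Simulate one mission; return True if data loss occurs."""
    t = 0.0
    while t < MISSION_H:
        # Time until the first of the two disks fails (rate is 2/MTBF).
        t += random.expovariate(2.0 / MTBF_H)
        if t >= MISSION_H:
            return False
        # Rebuild time is bounded, not exponential.
        rebuild = random.uniform(REBUILD_MIN_H, REBUILD_MAX_H)
        # Data loss if the surviving disk fails during the rebuild window.
        if random.expovariate(1.0 / MTBF_H) < rebuild:
            return True
        t += rebuild  # rebuild completed; full redundancy restored
    return False

trials = 100_000
p_loss = sum(one_trial() for _ in range(trials)) / trials
```

Swapping `random.uniform` for an exponential draw is exactly the simplification the paragraph above warns against: it ignores the physical minimum rebuild time.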

Including restoration in a system reliability or availability analysis is frequently misunderstood and performed incorrectly. If you want to design a system with the minimum level of redundancy to meet a specific reliability target, the architecture and the model must be as accurate as possible. You can't afford to let poor modeling drive you to a poor design.