

Fault-Tolerant Telecommunication System Patterns

by
Michael Adams, James Coplien, Robert Gamoke, Robert Hanmer, Fred Keeve, Keith Nicodemus
AT&T Bell Laboratories

Copyright ©1995 AT&T. All rights reserved.

Introduction
These patterns form part of a much larger pattern catalogue in use at AT&T. The patterns presented here form a small, partial pattern language within that larger collection. We chose them for their interconnectedness and the diversity of their authorship, and because they are probably well known to the telecommunications programming community. Many of these patterns work in other domains, but for now we take telecommunications designers as our audience.

Reliability and human factors are among the characteristics that set telecommunications software apart. Many switching systems, including the ones referred to in these patterns, are designed to be in continuous operation with the requirement that they be out of service no more than two hours in forty years. In many cases this requirement limits the design choices that can be made.

The systems must also be designed to minimize the demands placed on human maintenance personnel. This leads both to automated recovery within the systems themselves and to remote computer systems that monitor and control the switching equipment.

Many thanks to Gerard Meszaros of BNR, who served as the PLoP/95 shepherd for these patterns.


Glossary

1A: A central processor for telecommunications systems.

1B: A second-generation central processor based on the 1A architecture.

4ESS™, 5ESS®: Members of the AT&T Electronic Switching System product lines.

Application: The portions of the system's software that relate to its call-processing functionality.

Call Store: The memory stores used for static or dynamic data.

CC: Central control, the central processor complex, a 1A or a 1B.

FIT: Failures in a Trillion, a measurement of the failure rate of hardware components (one component failure in 10^9 hours).

OOS: Out-of-Service.

PC: Processor Configuration, the initialization and recovery mechanisms, independent of the application, that deal with the common underlying hardware/software platform. The term is also used as a verb, referring to a level of system reboot.

Phase: A level of system recovery escalation.

Program Store: The memory stores used for program text.

Stored Program Control: A term used to differentiate central-control-based switching from the older relay- and crossbar-based systems.

Transient Fault: A condition that is transitory in nature. It appears and disappears. Lightning might produce transient errors.


Pattern: Minimize Human Intervention

Problem: History has shown that people cause the majority of problems in continuously running systems (wrong actions, wrong systems, wrong button).

Context: High-reliability continuous-running digital systems, where downtime, human-induced or otherwise, must be minimized.

Forces: Humans are truly intelligent; machines aren't. Humans are better at detecting patterns of system behavior, especially among seemingly random occurrences separated by time. (People Know Best)

Machines are good at orchestrating a well-thought-out global strategy; humans aren't.

Humans are fallible; computers are often less fallible.

Humans feel a need to intervene if they can't see that the system is making serious attempts at restoration. Human reaction and decision times are very slow (by orders of magnitude) compared to computer processors.

A quiet system is a dead system.

Human operators get bored with ongoing surveillance and may ignore or miss critical events.

Events, normal processing or failures, are happening so quickly that inclusion of the human operator is infeasible.

Solution: Let the machine try to do everything itself, deferring to the human only as an act of desperation and last resort.

Resulting Context: A system less susceptible to human error. This will make the system's customers happier. In many administrations, the system operator's compensation is based on system availability, so this strategy actually improves the lot of the operator.

Application of this pattern leads to a system where patterns such as Riding Over Transients, SICO First and Always, and Try All Hardware Combos apply to provide the system with the ability to proceed automatically.

Rationale: Empirically, a disproportionate fraction of high-availability system failures are operator errors, not primary system errors. By minimizing human intervention, the overall system availability can be improved. Human intervention can be reduced by building in strategies that counter human tendencies to act rashly; see patterns like Fool Me Once, Leaky Bucket Counters and Five Minutes of No Escalation Messages.

Notice the tension between this pattern and People Know Best.

Author: Robert Hanmer, Mike Adams, 1995/03/23


Pattern: People Know Best

Problem: How do you balance automation with human authority and responsibility?

Context: High-reliability continuous-running systems, where the system itself tries to recover from all error conditions.

Forces: People have a good subjective sense of the passage of time, and how it relates to the probability of a serious failure, or how it will be perceived by the customer.

The system is set up to recover from failure cases. (Minimize Human Intervention)

People feel a need to intervene.

Most system errors can be traced to human error.

Solution: Assume that people know best, particularly the maintenance folks. Design the system to allow knowledgeable users to override the automatic controls.

Example: As you escalate through the 64 states of Processor Configuration (Try All Hardware Combos), a human who understands what's going on can intervene and stop it.

Resulting Context: People feel empowered; however, they also are responsible for their actions.

This is an absolute rule: people feel a need to intervene. There is no perfect solution for this problem, and the pattern cannot resolve all the forces well. Fool Me Once is a partial solution, in that it doesn't give the human a chance to intervene.

Rationale: "There is no try; there is only do or fail." (Yoda, in Star Wars)

Consider the input command to unconditionally restore a unit. What does "unconditional" mean? Let's say that the system thinks that the unit is powered down; what should happen when the operator asks for the unit to be restored unconditionally? Answer: try to restore it anyhow, no excuses allowed; the fault detection hardware can always detect the powered-down condition and generate an interrupt for the unit out of service. Why might the operator want to do this? Because it may be a problem not with the power, but with the sensor that wrongly reports the power is off.
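
As a hedged illustration of this rationale, the sketch below shows one way an "unconditional" restore command might bypass the software pre-check while leaving the fault-detection hardware as the backstop. The names (restore_unit, power_scan, the unconditional flag) are hypothetical and are not taken from the 1A/1B software.

    /* Hypothetical sketch: an "unconditional" restore skips the software
     * pre-check and trusts the human.  If the unit really has no power,
     * the fault-detection hardware will still interrupt and mark it OOS. */
    #include <stdbool.h>
    #include <stdio.h>

    struct unit {
        int  id;
        bool power_reported_on;   /* what the (possibly faulty) sensor says */
    };

    static bool power_scan(const struct unit *u)
    {
        return u->power_reported_on;          /* sensor reading; may be wrong */
    }

    int restore_unit(struct unit *u, bool unconditional)
    {
        if (!unconditional && !power_scan(u)) {
            printf("unit %d: restore denied, power appears to be off\n", u->id);
            return -1;                        /* normal path: believe the sensor */
        }
        printf("unit %d: restore attempted\n", u->id);
        return 0;                             /* no excuses: try it anyhow */
    }

    int main(void)
    {
        struct unit u = { 7, false };         /* sensor claims the power is off */
        restore_unit(&u, false);              /* automatic path refuses */
        restore_unit(&u, true);               /* operator override tries anyway */
        return 0;
    }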

Notice the tension between this pattern and Minimize Human Intervention.

Author: Robert Gamoke, 1995/03/24


Pattern: Five Minutes of No Escalation Messages

Problem: Rolling in console messages: the human-machine interface is saturated with error reports that may be rolling off the screen, or the system may be consuming resources just to display them.

Context: Any continuous-running, fault-tolerant system with escalation, where transient conditions may be present.

Forces: There is no sense in wasting time or reducing level of service trying to solve a problem that will go away by itself.

Many problems work themselves out, given time.

You don't want the switch using all of its resources displaying messages.

You don't want to panic the user by making them think the switch is out of control (Minimize Human Intervention).

The only user action related to the escalation messages may be inappropriate to the goal of preserving system sanity.

There are other computer systems monitoring the actions taken. These systems can deal with a great volume of messages.

Solution: When taking the first action down the scenario that could lead to an excess number of messages, display a message. Then periodically display an update message. If the abnormal condition ends, display a message that everything is back to normal. Do not display a message for every change in state.

Continue machine-to-machine communication of status and actions throughout this period.

For example, when the 4ESS switch enters the first level of system overload, post a user message. Post no more messages for five minutes, even if there is additional escalation. At the end of five minutes, display a summary of the current status. When the condition clears, display an appropriate message.
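
A minimal sketch of this throttling logic follows, assuming a five-minute quiet window; the function and message names are hypothetical, and the real 4ESS implementation is not reproduced here. Machine-to-machine reporting is deliberately left unthrottled.

    /* Hypothetical sketch of the five-minute message throttle.  Operator
     * messages are suppressed during the quiet window; machine-to-machine
     * status reporting continues for every event. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    #define QUIET_WINDOW (5 * 60)              /* seconds of operator silence */

    static time_t window_start;
    static bool   in_abnormal_state;

    static void report_downstream(const char *event)    /* never throttled */
    {
        printf("M2M: %s\n", event);
    }

    void on_escalation_event(const char *event)
    {
        time_t now = time(NULL);

        report_downstream(event);
        if (!in_abnormal_state) {
            printf("OPERATOR: abnormal condition began: %s\n", event);
            in_abnormal_state = true;
            window_start = now;
        } else if (difftime(now, window_start) >= QUIET_WINDOW) {
            printf("OPERATOR: condition persists; current status: %s\n", event);
            window_start = now;                /* start the next quiet window */
        }
        /* otherwise: stay silent toward the operator */
    }

    void on_condition_cleared(void)
    {
        report_downstream("condition cleared");
        if (in_abnormal_state) {
            printf("OPERATOR: condition has cleared; system normal\n");
            in_abnormal_state = false;
        }
    }

    int main(void)
    {
        on_escalation_event("overload level 1");
        on_escalation_event("overload level 2");    /* silent toward the operator */
        on_condition_cleared();
        return 0;
    }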

Resulting Context: The system operator won't panic from seeing too many messages. Machine-to-machine messages and measurements keep a record for later evaluation and keep the system's actions visible to the people who can deal with them. For the 4ESS overload example, measurement counters continue to track overload dynamics; some downstream support systems track these counters.

Other messages, not related to the escalating situation that is producing too many messages, will be displayed as though the system were normal. Thus the normal functioning of the system is not adversely affected by the volume of escalation messages.

Note the conflict with People Know Best.

Rationale: Don't freak the user, because the only resort for an on-site user to 4ESS overload is "Cancel Overload Controls", which tells the system to ignore its overload controls and act as though there is no overload.

This is a special case of Aggressive versus Tentative.

Author: Robert Hanmer, Mike Adams, 1995/03/23


Pattern: Riding Over Transients

Alias: Make sure problem really exists

Problem: How do you know whether a problem will work itself out or not?

Context: A fault-tolerant application where some errors, overload conditions, etc. may be transient. The system can escalate through recovery strategies, taking more drastic action at each step. A typical example is a fault-tolerant telecommunications system using static traffic engineering, where you want to check for overload or transient faults.

Forces: You want to catch faults and problems.

There is no sense in wasting time or reducing level of service trying to solve a problem that will go away by itself.

Many problems work themselves out, given time.

Solution: Don't react immediately to detected conditions. Make sure the condition really exists by checking several times, or use Leaky Bucket Counters to detect a critical number of occurrences in a specific time interval. For example: by averaging over time or just by waiting a while, give transient faults a chance to pass.
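
As a rough sketch of the "check several times" variant (the counter variant is covered under Leaky Bucket Counters), the fragment below re-samples a detected condition a few times before reacting; the threshold, delay, and function names are illustrative assumptions only.

    /* Hypothetical sketch: re-check a detected condition several times
     * before reacting, giving a transient fault the chance to pass. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>          /* sleep(); POSIX, used only for this example */

    #define CONFIRMATIONS 3      /* number of re-checks before acting */
    #define RECHECK_DELAY 2      /* seconds between re-checks */

    bool condition_really_exists(bool (*check)(void))
    {
        for (int i = 0; i < CONFIRMATIONS; i++) {
            if (!check())
                return false;    /* it cleared on its own: ride over it */
            sleep(RECHECK_DELAY);
        }
        return true;             /* still present: treat it as a real problem */
    }

    static bool overload_detected(void)
    {
        return false;            /* placeholder for a real detector */
    }

    int main(void)
    {
        if (condition_really_exists(overload_detected))
            printf("condition confirmed: take recovery action\n");
        else
            printf("transient condition: no action taken\n");
        return 0;
    }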

Resulting Context: Errors can be resolved with truly minimal effort, because the effort is expended only if the problem really exists. The system can roll through problems without its users noticing and without bothering the machine operator to intervene (as in the pattern Minimize Human Intervention).

Rationale: This pattern detects "temporally dense" events. Think of the events as spikes on a time line. If a small number of spikes (specified by a threshold) occur together (where "together" is specified by the interval), then the error is a transient. Used by Leaky Bucket Counters, Five Minutes of No Escalation Messages, and many others.

Author: James O. Coplien


Pattern: Leaky Bucket Counters

Problem: How do you deal with transient faults?

Context: Fault-tolerant system software that must deal with failure events. Failures are tied to episode counts and frequencies.

One example from 1A/1B processor systems in AT&T telecommunication products: As memory words (dynamic RAM) got weak, the memory module would generate a parity error trap. Examples include both 1A processor dynamic RAM and 1B processor static RAM.

Forces: You want a hardware module to exhibit hard failures before taking drastic action. Some failures come from the environment, and should not be blamed on the device.

Solution: A failure group has a counter that is initialized to a predetermined value when the group is initialized. The counter is decremented for each fault or event (usually faults) and incremented on a periodic basis; however, the count is never incremented beyond its initial value. There are different initial values and different leak rates for different subsystems: for example, the leak interval is a half-hour for the 1A memory (store) subsystem. The strategy for 1A dynamic RAM specifies that the first failure in a store (within the timing window) causes the store to be taken out of service, diagnosed, and then automatically restored to service. On the second, third, and fourth failure (within the window), the store is left in service. For the fifth episode within the timing window, take the unit out of service, diagnose it, and leave it out.

If the episode transcends the interval, it's not transient: the leak rate is faster than the refill rate, and the pattern indicates an error condition. If the burst is more intense than expected (it exceeds the error threshold) then it's unusual behavior not associated with a transient burst, and the pattern indicates an error condition.
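
The counter mechanism itself is simple; the sketch below is a minimal, hypothetical rendering of what is described above (a threshold of five, a periodic refill), omitting the diagnose-and-restore policy applied on the first failure. The names and the tick-driven refill are assumptions, not the 1A/1B implementation.

    /* Hypothetical sketch of a leaky bucket counter: the count starts at a
     * threshold, leaks down by one per fault, refills by one per leak
     * interval (never above the threshold), and an empty bucket means the
     * faults are too dense to be treated as transient. */
    #include <stdbool.h>
    #include <stdio.h>

    #define BUCKET_THRESHOLD 5          /* fifth episode in the window trips it */

    struct leaky_bucket {
        int count;                      /* current credit */
        int threshold;                  /* initial and maximum credit */
    };

    void bucket_init(struct leaky_bucket *b, int threshold)
    {
        b->count = threshold;
        b->threshold = threshold;
    }

    /* Call once per fault event.  Returns true when the unit should be taken
     * out of service and left out (the episode is not a transient). */
    bool bucket_on_fault(struct leaky_bucket *b)
    {
        if (b->count > 0)
            b->count--;
        return b->count == 0;
    }

    /* Call once per leak interval (e.g. every half hour for the 1A store
     * subsystem).  The resource is considered sane again once the count
     * re-attains its initial value. */
    void bucket_on_tick(struct leaky_bucket *b)
    {
        if (b->count < b->threshold)
            b->count++;
    }

    int main(void)
    {
        struct leaky_bucket store;
        bucket_init(&store, BUCKET_THRESHOLD);

        for (int fault = 1; fault <= 5; fault++) {
            if (bucket_on_fault(&store))
                printf("fault %d: bucket empty, take the store out of service\n", fault);
            else
                printf("fault %d: still within the bucket\n", fault);
        }
        return 0;
    }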

Resulting Context: A system where errors are isolated and handled (by taking devices out of service), but where transient errors (e.g., room humidity) don't cause unnecessary out of service action.

Rationale: The history is instructive. In old call stores (1A memories that contained dynamic data), why did we collect data? For old call stores, the field-replaceable unit (FRU) was a circuit pack, while the failure group was a store comprising 12 or 13 packs. We needed to determine which pack was bad. Memory was spread across 7 circuit packs; the transient bit was only one bit, not enough to isolate the failure. By recording data from four events, we were better able to pinpoint (with 90% accuracy) which pack was bad, so the machine operator didn't have to change 7 packs.

Why go five failures before taking a unit out of service? By collecting failure data on the second, third, and fourth time, you are making sure you know the characteristics of the error, and are reducing the uncertainty about the FRU. By the fifth time, you know it's sick, and need to take it out of service.

Periodically increasing the count on the store creates a sliding time window. The resource is considered sane when the counter (re-)attains its initialized value. Humidity, heat, and other environmental problems cause transient errors which should be treated differently (i.e., pulling the card does no good).

See, for example, Fool Me Once, which uses simple leaky bucket counters.

This is a special case of the pattern Riding Over Transients.

Strategy alluded to on pp. 2003-4 of BSTJ XLIII 5(10), Sept. 1964.

Author: Robert Gamoke, 1995/03/14


Pattern: SICO First and Always

Problem: Making a system highly available and resilient in the face of hardware and software faults and transient errors.

Context: Systems where the ability to do meaningful work is of utmost importance, but rare periods of partial application functionality can be tolerated. For example, the 1A/1B processor-based 4ESS switch from AT&T.

Forces: Bootstrapping is initialization.

A high-availability system might require (re)initialization at any time to ensure system sanity.

The System Integrity Control Program (SICO) coordinates system integrity.

System integrity must be in control during bootstrap.

The focus of operational control changes from bootstrap to the executive control during normal call processing.

Application functioning is very important.

System integrity takes processor time but that is acceptable in this context.

The system is composed of proprietary elements, for which design criteria may be required of all developers.

Hardware is designed for fault tolerance, which reduces the occurrence of hardware errors.

Solution: Give system integrity the ability and power to re-initialize the system whenever system sanity is threatened by error conditions. The same system integrity should oversee both the initialization process and the normal application functionality so that initialization can be restarted if it runs into errors.

Resulting Context: In short, System Integrity Control has a major role during bootstrapping, after which it hands control over to the executive scheduler, which in turn lets System Integrity Control regain control for short periods of time on a periodic basis.

See also Audit-Derivable Constants After Recovery.

Rationale: During a recovery event (phase or bootstrap), SICO calls processor initialization software first, peripheral initialization software second, then application initialization software, and finally transfers to executive control. Unlike a classic computer program where initialization takes place first, and "normal execution" second, the SICO architecture does not place software initialization as the highest level function. System integrity is at an even higher level than system initialization.

The architecture is based on a base level cycle in the executive control. After bootstrapping, the first item in the base cycle is SICO (though this is different code than that run during bootstrapping). So after the SICO part of bootstrapping is done, the base level part of SICO is entered each base level cycle to monitor the system on a periodic basis.
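
A toy sketch of this control structure is given below: SICO drives bootstrap (processor, peripheral, then application initialization), hands control to the executive's base-level cycle, and a base-level SICO entry runs first on every pass. All names and the demonstration schedule are hypothetical.

    /* Hypothetical sketch of the control flow described above.  The stubs
     * stand in for the real subsystem code; the "trouble every fourth pass"
     * schedule exists only to make the demonstration terminate. */
    #include <stdbool.h>
    #include <stdio.h>

    static void init_processor(void)   { printf("processor initialization\n"); }
    static void init_peripherals(void) { printf("peripheral initialization\n"); }
    static void init_application(void) { printf("application initialization\n"); }
    static void process_calls(void)    { printf("  base level: call processing\n"); }

    static int cycles;

    /* Base-level SICO entry: watches timers, overload control, and audit
     * control, and returns true when (re)initialization is needed. */
    static bool sico_base_level(void)
    {
        printf("  base level: SICO monitor\n");
        return ++cycles % 4 == 0;
    }

    int main(void)
    {
        for (int boot = 0; boot < 2; boot++) {
            /* SICO owns bootstrap: processor, peripherals, application, in order. */
            init_processor();
            init_peripherals();
            init_application();

            /* Executive control: base-level cycle with SICO entered first. */
            while (!sico_base_level())
                process_calls();

            printf("SICO requested recovery; re-initializing\n");
        }
        return 0;
    }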

System integrity must be alert for failures during both bootstrap and normal base-level operation. There is a system integrity monitor in the base level that watches timers, as well as overload control and audit control (not to run audits, but to ask audits whether there are error conditions). These monitors check in with SICO to report software and hardware failures and, potentially, to request initialization.

During bootstrap and initialization, system integrity employs a number of similar mechanisms to monitor the system, for example Analog Timers, Boot Timers, Try All Hardware Combos, and others.

Much of the rationale comes from AUTOVON, Safeguard, missile guidance systems, and other high-reliability real-time projects from early AT&T stored program control experience. See the Bell System Technical Journal Vol. 56 No. 7, Sept. 1977, pp. 1145-7, 1163-7.

Author: Robert Hanmer


Pattern: Try All Hardware Combos

Problem: The Central Controller (CC) has several configurations. There are many possible paths through CC subsystems depending on the configuration. How do you select a workable configuration in light of a faulty subsystem?

Context: Highly fault-tolerant computing complexes such as the 1B processor.

The processing complex has a number of duplicated subsystems. Each one consists of a CC, a set of call stores, a call store bus, a set of program stores, a program store bus, and an interface bus. Major subsystems are duplicated with standby units to increase reliability, not to provide distributed processing capabilities. There are 64 possible configurations of these subsystems, given fully duplicated sparing. Each configuration is said to represent a configuration state.

The system is brought up in stages. First, you need memory units working; then you need to talk to the disk, so you can pump stuff into memory, which allows you to run programs to pump the rest of the stores, so that code can recover other units. Second, after the base system is configured and refreshed from disk, you can bring up the application.

Forces: You want to catch and remedy single, isolated errors.

You also want to catch errors that aren't easily detected in isolation, but which result from interaction between modules.

You sometimes must catch multiple concurrent errors.

The CC can't sequence subsystems through configurations, since it may be faulty itself.

The machine should recover by itself without human intervention (Minimize Human Intervention).

Solution: Maintain a 64-state counter in hardware. We call this the configuration counter. There is a table that maps from that counter onto a configuration state; in the 1A, it's in the hardware; in the 1B, it's in the Boot ROM. Every time the system fails to get through a Processor Configuration (PC) to a predetermined level of stability, it restarts the system with the next value of the configuration counter.

In 5ESS®, there is a similar 16-counter. It first tries all side zero (a complete failure group), then all side one (the other failure group), hoping to find a single failure. The subsequent counting states look for more insidious problems that come from interactions between members of these coarse failure groups.
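
A rough sketch of the counter-driven search follows. The configuration table contents, the stability test, and all names are placeholders (the real mapping is fixed in 1A hardware or 1B Boot ROM, along the lines of Table 1 below), and on the real machines the counter itself survives the reboot.

    /* Hypothetical sketch: a configuration counter indexes a table of
     * subsystem configurations; every PC attempt that fails to reach the
     * predetermined level of stability advances the counter and retries
     * with the next combination. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_CONFIGS 64

    struct config {
        int cc_active;       /* which Central Control side is active (0 or 1) */
        int ps_active;       /* which program store side is active */
        int bus_active;      /* which bus side is active */
    };

    /* Placeholder table; on the real machines the mapping is fixed in
     * hardware (1A) or Boot ROM (1B). */
    static struct config config_table[NUM_CONFIGS];

    static unsigned config_counter;   /* survives reboots on the real machine */

    static bool attempt_pc(const struct config *c)
    {
        (void)c;
        return config_counter >= 3;   /* pretend the fourth combination is sane */
    }

    void processor_configuration(void)
    {
        for (;;) {
            unsigned state = config_counter % NUM_CONFIGS;
            printf("PC attempt with configuration state %u\n", state);
            if (attempt_pc(&config_table[state]))
                return;               /* reached the required level of stability */
            config_counter++;         /* next time, try a different combination */
        }
    }

    int main(void)
    {
        processor_configuration();
        return 0;
    }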

Resulting Context: The system can deal with any number of concurrent faults provided that there is at most one fault per subsystem.

The state will increment when a reboot (PC) fails.

Sometimes the fault won't be detected right away after the reboot sequence (i.e. not until more than 30 seconds after the resumption of normal activities). This problem is addressed in Fool Me Once.

Sometimes, going through all 64 states isn't enough; see Don't Trust Anyone and Analog Timer.

Rationale: This design is based on the FIT rates of the original hardware, and on the extreme caution of first-generation stored program control switching system developers.

Note that the pattern Blind Search apparently violates this pattern, because it uses a store to hold the identity of the out-of-service module; this is addressed with the pattern Multiple Copies of Base Store.

TABLE 1. Configurations Established by Emergency Action Switching (Status of Units after a Switch Performed by the Indicated State). From p. 2006 of BSTJ XLIII 5(10), Sept. 1964.


PC State    CC0    CC1    PS0    PS1    Bus0    Bus1    Other Stores

X000        U      U      U      U      U       U       U
X001        C      C      U      U      U       U       U
X010        U      U      U      U      C       C       U
X011        U      U      A      S      A       S       T
X100        U      U      A      S      S       A       T
X101        U      U      S      A      S       A       T
X110        U      U      S      A      A       S       T
X111        U      U      S      A      A       S       T



X: Don't care

A: Active

S: Standby

U: Unchanged

C: Complemented

T: Marked as having trouble

Author: Robert Gamoke, 1995/03/24; 5ESS information, Fred Keeve, 1995/04/14.

See BSTJ XLIII, No. 5, Part 1, pp. 2005-2009.


Pattern: Fool Me Once

Problem: Sometimes the fault causing a Processor Configuration (PC) is very intermittent (usually triggered by software, such as diagnostics). After a recovery in PC completes, users expect the configuration state display to disappear from the system's human control interface and the system to be sane. If the configuration state continues to be displayed for more than 30 seconds, users become concerned that the system may still have a problem. But if the system in fact trips on another fault, it may reboot itself (take a phase) and re-initiate the initialization sequence using the same configuration as the previous time (or, worse, start over at the beginning of the configuration sequence). This raises the probability that the system will loop in reboots ("roll in recovery") and never attempt different configurations.

Context: Highly available systems using redundancy, employing the pattern Try All Hardware Combos.

You're going through Try All Hardware Combos. The system finds an ostensibly sane state and progresses 30 seconds into initialization, beyond boot and into the application. The application claims to know that the hardware is sane if it can get 30 seconds into initialization (using a leaky bucket counter). When the system reaches this state, it resets the configuration counter. However, a latent error can cause a system fault after the configuration counter has been reset. The system no longer "knows" that it is in PC escalation, and retries the same configuration that has already been proven not to work.

Forces: It's hard to set a universally correct interval for a leaky bucket counter; sometimes 30 seconds is too short.

The application (and the customer) would be upset if the leaky bucket counter were set too long (for example, to half an hour; the customer doesn't want to wait half an hour for a highly reliable system to clear its fault status).

Some errors take a long time to appear, even though they're fundamental hardware errors (e.g., an error in program store that isn't accessed until very late in the initialization cycle, or until a hardware fault is triggered by a diagnostic run out of the scheduler).

People's expectations are among the most important forces at work here. In spite of the potential latencies for some classes of fault, the application and the user feel assured that the system must be sane if it's been exercised for 30 seconds.

Solution: The first time the application tells PC that "all is well", believe it, and reset the configuration counter. The second and subsequent times within a longer time window, ignore the request.

The first request to reset the configuration counter indicates that the application's 30 second leaky bucket counter says that everything is fine. Set up a half-hour leaky bucket counter to avoid being fooled. If the application tries to reset the 64-state configuration counter twice in a half hour, ignore it. This would indicate recurring failures that result in reboots.
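
The guard can be sketched as below, assuming the half-hour window is simply measured from the last accepted reset; the counter variable, time source, and names are illustrative only.

    /* Hypothetical sketch of the Fool Me Once guard: accept the first
     * "all is well" report and reset the configuration counter; ignore a
     * second request within the half-hour window, so a latent fault forces
     * the next PC onto a fresh configuration. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    #define GUARD_WINDOW (30 * 60)        /* seconds: the longer, half-hour window */

    unsigned      config_counter = 5;     /* stands in for the 64-state hardware counter */
    static time_t last_reset;             /* time of the last accepted reset */
    static bool   reset_accepted_once;

    /* Called by the application once it has run 30 seconds past initialization
     * (its own, shorter leaky bucket counter). */
    void application_reports_all_is_well(void)
    {
        time_t now = time(NULL);

        if (reset_accepted_once && difftime(now, last_reset) < GUARD_WINDOW) {
            printf("reset request ignored: already believed once this window\n");
            return;                       /* the counter keeps advancing on the next PC */
        }
        config_counter = 0;               /* believe the application the first time */
        last_reset = now;
        reset_accepted_once = true;
        printf("configuration counter reset\n");
    }

    int main(void)
    {
        application_reports_all_is_well();    /* first report: accepted */
        application_reports_all_is_well();    /* within the window: ignored */
        return 0;
    }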

Resulting Context: Any subsequent latent failures will cause the configuration counter to advance, guaranteeing that the next PC will use a fresh configuration. For a single subsystem error that's taking the system down, this strategy will eventually reach a workable configuration. Once the system is up, schedule diagnostics to isolate the faulty unit. See also People Know Best. The system will be able to handle repetitive failures outside the shorter window, thereby reinforcing Minimize Human Intervention.

Rationale: See the forces. It's better to escalate to exceptionally extravagant strategies like this, no matter how late, if it eventually brings the system back on line. The pattern has been found to be empirically sound.

Author: Robert Gamoke, 1995/03/24
