Fault-Tolerant Telecommunication System Patterns
by
Michael Adams, James Coplien, Robert Gamoke, Robert Hanmer, Fred
Keeve, Keith Nicodemus
AT&T Bell Laboratories
Copyright ©1995 AT&T. All rights reserved.
Introduction
These patterns form part of a much larger pattern
catalogue in use at AT&T. The patterns presented here form
a small partial pattern language within the larger collection
of patterns. We chose them because of their interconnectedness,
the diversity of their authorship, and because they are probably
well-known to the telecommunications programming community. Many
of these patterns work in other domains, but for now, we take
telecommunications designers as our audience.
Among the distinguishing characteristics of telecommunications software
are its reliability and its human factors. Many switching systems,
including the ones referred to in these patterns, are designed
to be in continuous operation, with the requirement that they be
out of service no more than two hours in forty years. This requirement
in many cases limits the design choices that can be made.
The systems must also be designed so that the human maintenance
personnel requirements are optimized. This can lead to automated
systems, as well as to remote computer systems that monitor
and control the switching equipment.
Many thanks to Gerard Meszaros of BNR, who served as the PLoP/95
shepherd for these patterns.
Glossary
1A: A central processor for telecommunications systems.
1B: A second-generation central processor based on the
1A architecture.
4ESS™, 5ESS®: Members of the AT&T
Electronic Switching System product lines.
Application: The portions of the systems software that
relate to its call processing functionality.
Call Store: The memory stores used for static or dynamic
data.
CC: Central control, the central processor complex, a
1A or a 1B.
FIT: Failures in a Trillion, a measurement of the failure
rate of hardware components (one component failure in 10^9 hours).
OOS: Out-of-Service.
PC: Processor Configuration, the initialization and recovery
mechanisms independent of the application that deal with the common
underlying hardware/software platform. The term is also used as
a verb, a synonym for a level of system reboot.
Phase: A level of system recovery escalation.
Program Store: The memory stores used for program text.
Stored Program Control: A term used to differentiate between
central control based switching and the older relay and crossbar
based systems.
Transient Fault: A condition that is transitory in nature.
It appears and disappears. Lightning might produce transient
errors.
Pattern: Minimize Human Intervention
Problem: History has shown that people cause the majority
of problems in continuously running systems (wrong actions, wrong
systems, wrong button).
Context: High-reliability continuous-running digital systems,
where downtime, human-induced or otherwise, must be minimized.
Forces: Humans are truly intelligent; machines aren't.
Humans are better at detecting patterns of system behavior, especially
among seemingly random occurrences separated by time. (People
Know Best)
Machines are good at orchestrating a well thought-out, global
strategy, and humans aren't.
Humans are fallible; computers are often less fallible.
Humans feel a need to intervene if they can't see that the system
is making serious attempts at restoration. Human reaction and
decision times are very slow (by orders of magnitude) compared
to computer processors.
A quiet system is a dead system.
Human operators get bored with ongoing surveillance and may ignore
or miss critical events.
Events, normal processing or failures, are happening so quickly
that inclusion of the human operator is infeasible.
Solution: Let the machine try to do everything itself,
deferring to the human only as an act of desperation and last
resort.
Resulting Context: A system less susceptible to human error.
This will make the system's customers happier. In many administrations,
the system operator's compensation is based on system availability,
so this strategy actually improves the lot of the operator.
Application of this pattern leads to a system where patterns such
as Riding Over Transients, SICO First and Always
and Try All Hardware Combos apply to provide the system
with the ability to proceed automatically.
Rationale: Empirically, a disproportionate fraction of
high-availability system failures are operator errors, not primary
system errors. By minimizing human intervention, the overall system
availability can be improved. Human intervention can be reduced
by building in strategies that counter human tendencies to act
rashly; see patterns like Fool Me Once, Leaky Bucket
Counters and Five Minutes of No Escalation Messages.
Notice the tension between this pattern and People Know Best.
Author: Robert Hanmer, Mike Adams, 1995/03/23
Pattern: People Know Best
Problem: How do you balance automation with human authority
and responsibility?
Context: High-reliability continuous-running systems, where
the system itself tries to recover from all error conditions.
Forces: People have a good subjective sense of the passage
of time, and how it relates to the probability of a serious failure,
or how it will be perceived by the customer.
The system is set up to recover from failure cases. (Minimize
Human Intervention)
People feel a need to intervene.
Most system errors can be traced to human error.
Solution: Assume that people know best, particularly the
maintenance folks. Design the system to allow knowledgeable
users to override the automatic controls.
Example: As you escalate through the 64 states of Processor
Configuration (Try All Hardware Combos), a human who understands
what's going on can intervene and stop it.
Resulting Context: People feel empowered; however, they
also are responsible for their actions.
This is an absolute rule: people feel a need to intervene. There
is no perfect solution for this problem, and the pattern cannot
resolve all the forces well. Fool Me Once is a partial
solution, in that it doesn't give the human a chance to intervene.
Rationale: "There is no try; there is only do or fail." (Yoda,
in Star Wars.)
Consider the input command to unconditionally restore a unit.
What does "unconditional" mean? Let's say that the system
thinks that the unit is powered down; what should happen when
the operator asks for the unit to be restored unconditionally?
Answer: try to restore it anyhow, no excuses allowed; the fault
detection hardware can always detect the powered-down condition
and generate an interrupt for the unit out of service. Why might
the operator want to do this? Because it may be a problem not
with the power, but with the sensor that wrongly reports the power
is off.
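The unconditional restore described above can be sketched as follows. This is a minimal illustration, not the switch's actual command handling: the function name `restore`, its parameters, and the refusal of a conditional request for a unit believed to be powered down are all assumptions made for the example.

```python
def restore(unit, believed_powered, attempt_restore, unconditional=False):
    """Try to return `unit` to service (hypothetical sketch).

    believed_powered: what the system's sensors currently report.
    attempt_restore:  callable that really tries the restore and returns
                      True on success; it stands in for the hardware, whose
                      fault detection catches a genuinely powered-down unit.
    """
    if not believed_powered and not unconditional:
        # Assumed behavior of a conditional request: trust the sensor.
        return "refused: unit believed powered down"
    # Unconditional: no excuses, try anyhow and let the hardware decide.
    return "in service" if attempt_restore(unit) else "out of service"
```

With a faulty power sensor, `restore("store 3", False, hardware, unconditional=True)` still reaches the hardware, which is the case the operator cares about.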
Notice the tension between this pattern and Minimize Human
Intervention.
Author: Robert Gamoke, 1995/03/24
Pattern: Five Minutes of No Escalation Messages
Problem: Rolling in console messages: the human-machine
interface is saturated with error reports that may be rolling
off the screen, or consuming resources just for the intense displaying
activity.
Context: Any continuous-running, fault-tolerant system
with escalation, where transient conditions may be present.
Forces: There is no sense in wasting time or reducing level
of service trying to solve a problem that will go away by itself.
Many problems work themselves out, given time.
You don't want the switch using all of its resources displaying
messages.
You don't want to panic the user by making them think the switch
is out of control (Minimize Human Intervention).
The only user action related to the escalation messages may be
inappropriate to the goal of preserving system sanity.
There are other computer systems monitoring the actions taken.
These systems can deal with a great volume of messages.
Solution: When taking the first action down the scenario
that could lead to an excess number of messages, display a message.
Then periodically display an update message. If the abnormal
condition ends, display a message that everything is back to normal.
Do not display a message for every change in state.
Maintain continuous machine-to-machine communication of status
and actions throughout this period.
For example, when the 4ESS switch enters the first level of system
overload, post a user message. Post no more messages for 5 minutes,
even if there is additional escalation. At the end of 5 minutes,
display a status message indicating the current status. When
the condition clears, display an appropriate message.
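A minimal sketch of this message policy, assuming a caller that supplies timestamps in seconds and calls `tick` periodically. The class and method names are hypothetical and the message texts are invented for illustration; only the policy (first message, silence during the quiet period, periodic status, all-clear, unthrottled machine log) comes from the description above.

```python
class EscalationConsole:
    """Throttle operator-facing escalation messages (hypothetical sketch)."""

    def __init__(self, quiet_period=300.0):
        self.quiet_period = quiet_period  # seconds; 5 minutes in the 4ESS example
        self.active = False               # is an abnormal condition in progress?
        self.last_shown = None            # time of the last operator message
        self.level = 0                    # most recent escalation level
        self.shown = []                   # messages displayed to the operator
        self.machine_log = []             # every state change, for support systems

    def report(self, now, level):
        """Record an escalation state change observed at time `now`."""
        self.machine_log.append((now, level))  # machine-to-machine: never throttled
        self.level = level
        if level == 0:
            if self.active:
                self.active = False
                self._show(now, "overload cleared")
        elif not self.active:
            self.active = True
            self._show(now, f"overload level {level} entered")
        # Further escalation while active is logged but not displayed.

    def tick(self, now):
        """Periodic call; shows one status update per quiet period."""
        if self.active and now - self.last_shown >= self.quiet_period:
            self._show(now, f"status: overload level {self.level}")

    def _show(self, now, text):
        self.last_shown = now
        self.shown.append((now, text))
```

Keeping `machine_log` separate from `shown` reflects the split between what the operator sees and what the monitoring systems receive.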
Resulting Context: The system operator won't panic from
seeing too many messages. Machine-to-machine messages and measurements
keep a record for later evaluation and keep the
system's actions visible to people who can deal with them. For the
4ESS overload example, measurement counters continue to track
overload dynamics; some downstream support systems track these
counters.
Other messages, not related to the escalating situation that is
producing too many messages, will be displayed as though the system
were normal. Thus the normal functioning of the system is not
adversely affected by the volume of escalation messages.
Note the conflict with People Know Best.
Rationale: Don't freak the user, because the only recourse
available to an on-site user during 4ESS overload is "Cancel Overload
Controls", which tells the system to ignore its overload
controls and act as though there is no overload.
This is a special case of Aggressive versus Tentative.
Author: Robert Hanmer, Mike Adams, 1995/03/23
Pattern: Riding Over Transients
Alias: Make sure problem really exists
Problem: How do you know whether a problem will work itself
out or not?
Context: A fault-tolerant application where some errors,
overload conditions, etc. may be transient. The system can escalate
through recovery strategies, taking more drastic action at each
step. A typical example is a fault tolerant telecommunication
system using static traffic engineering, where you want to check
for overload or transient faults.
Forces: You want to catch faults and problems.
There is no sense in wasting time or reducing level of service
trying to solve a problem that will go away by itself.
Many problems work themselves out, given time.
Solution: Don't react immediately to detected conditions.
Make sure the condition really exists by checking several times,
or use Leaky Bucket Counters to detect a critical number
of occurrences in a specific time interval. For example: by averaging
over time or just by waiting a while, give transient faults a
chance to pass.
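The "check several times" variant of this solution might look like the sketch below. The function name, the default of three checks, and the rule that a single clean sample means "transient" are assumptions for illustration; a real system would tune these per condition.

```python
import time


def condition_persists(probe, checks=3, interval=0.0, sleep=time.sleep):
    """Sample `probe()` up to `checks` times, `interval` seconds apart.

    Returns True only if every sample reports the condition; a single
    clean sample is treated as evidence that the fault was transient.
    `probe` and `sleep` are injected so the policy can be tested without
    real hardware or real delays.
    """
    for attempt in range(checks):
        if not probe():
            return False      # condition went away: ride over it
        if attempt < checks - 1:
            sleep(interval)   # give a transient a chance to pass
    return True
```

Recovery action is taken only when `condition_persists` returns True, so a condition that clears between samples costs nothing beyond the checks themselves.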
Resulting Context: Errors can be resolved with truly minimal
effort, because the effort is expended only if the problem really
exists. It allows the system to roll through problems without
its users noticing, or without bothering the machine operator
to intervene (as in the pattern Minimize Human Intervention).
Rationale: This pattern detects "temporally dense"
events. Think of the events as spikes on a time line. If a small
number of spikes (specified by a threshold) occur together (where
"together" is specified by the interval), then the error
is a transient. Used by Leaky Bucket Counters, Five
Minutes of No Escalation Messages, and many others.
Author: James O. Coplien
Pattern: Leaky Bucket Counters
Problem: How do you deal with transient faults?
Context: Fault-tolerant system software that must deal
with failure events. Failures are tied to episode counts and frequencies.
One example from 1A/1B processor systems in AT&T telecommunication
products: As memory words (dynamic RAM) got weak, the memory module
would generate a parity error trap. Examples include both 1A processor
dynamic RAM and 1B processor static RAM.
Forces: You want a hardware module to exhibit hard failures
before taking drastic action. Some failures come from the environment,
and should not be blamed on the device.
Solution: A failure group has a counter that is initialized
to a predetermined value when the group is initialized. The counter
is decremented for each fault or event (usually faults) and incremented
on a periodic basis; however, the count is never incremented beyond
its initial value. There are different initial values and different
leak rates for different subsystems: for example, it is a half-hour
for the 1A memory (store) subsystem. The strategy for 1A dynamic
RAM specifies that the first failure in a store (within the timing
window) causes the store to be taken out of service, diagnosed,
and then automatically restored to service. On the second, third,
and fourth failure (within the window) you just leave it in service.
For the fifth episode within the timing window, take the unit
out of service, diagnose it and leave it out.
If the episode transcends the interval, it's not transient: the
leak rate is faster than the refill rate, and the pattern indicates
an error condition. If the burst is more intense than expected
(it exceeds the error threshold) then it's unusual behavior not
associated with a transient burst, and the pattern indicates an
error condition.
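The counter and the five-episode store strategy can be sketched as follows. Names are hypothetical, and two details are assumptions rather than facts from the 1A design: that the counter regains one unit per refill period, and that the half-hour figure is that period.

```python
class LeakyBucketCounter:
    """Counter decremented per fault and refilled periodically (sketch)."""

    def __init__(self, initial=5, refill_period=1800.0, start=0.0):
        self.initial = initial              # predetermined starting value
        self.count = initial
        self.refill_period = refill_period  # seconds; assumed one unit per period
        self.last_refill = start

    def _refill(self, now):
        periods = int((now - self.last_refill) // self.refill_period)
        if periods > 0:
            # Never increment beyond the initial value.
            self.count = min(self.initial, self.count + periods)
            self.last_refill += periods * self.refill_period

    def fault(self, now):
        """Record one fault at time `now`; return the remaining count."""
        self._refill(now)
        self.count = max(0, self.count - 1)
        return self.count

    def sane(self, now):
        """The resource is sane once the counter regains its initial value."""
        self._refill(now)
        return self.count == self.initial


def store_action(counter, now):
    """Map a store failure to the action described for 1A dynamic RAM."""
    remaining = counter.fault(now)
    failures = counter.initial - remaining
    if remaining == 0:
        return "remove, diagnose, leave out of service"
    if failures == 1:
        return "remove, diagnose, restore"
    return "leave in service, record failure data"
```

Five faults in quick succession walk through all three actions; faults spaced more widely than the refill period never drain the counter.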
Resulting Context: A system where errors are isolated and
handled (by taking devices out of service), but where transient
errors (e.g., room humidity) don't cause unnecessary out of service
action.
Rationale: The history is instructive: In old call stores
(1A memories that contained dynamic data), why did we collect
data? For old call stores, the field replaceable unit (FRU) was
a circuit pack, while the failure group was a store comprising
12 or 13 packs. We needed to determine which pack is bad. Memory
may be spread across 7 circuit packs; the transient bit was only
one bit, not enough to isolate the failure. By recording data
from four events, we were better able to pinpoint (90% accuracy)
which pack was bad, so the machine operator didn't have to change
7 packs.
Why go five failures before taking a unit out of service? By collecting
failure data on the second, third, and fourth time, you are making
sure you know the characteristics of the error, and are reducing
the uncertainty about the FRU. By the fifth time, you know it's
sick, and need to take it out of service.
Periodically increasing the count on the store creates a sliding
time window. The resource is considered sane when the counter
(re-)attains its initialized value. Humidity, heat, and other
environmental problems cause transient errors which should be
treated differently (i.e., pulling the card does no good).
See, for example, Fool Me Once, which uses simple leaky
bucket counters.
This is a special case of the pattern Riding Over Transients.
Strategy alluded to in pp. 2003-4 of BSTJ XLIII 5(10), Sept. 1964.
Author: Robert Gamoke, 1995/03/14
Pattern: SICO First and Always
Problem: Making a system highly available and resilient
in the face of hardware and software faults and transient errors.
Context: Systems where the ability to do meaningful work
is of utmost importance, but rare periods of partial application
functionality can be tolerated. For example, the 1A/1B processor-based
4ESS switch from AT&T.
Forces: Bootstrapping is initialization.
A high-availability system might require (re)initialization at
any time to ensure system sanity.
The System Integrity Control Program (SICO) coordinates system
integrity.
System integrity must be in control during bootstrap.
The focus of operational control changes from bootstrap to the
executive control during normal call processing.
Application functioning is very important.
System integrity takes processor time but that is acceptable in
this context.
The system is composed of proprietary elements, for which design
criteria may be required of all developers.
The hardware is designed for fault tolerance, which reduces the
occurrence of hardware errors.
Solution: Give system integrity the ability and power
to re-initialize the system whenever system sanity is threatened
by error conditions. The same system integrity should oversee
both the initialization process and the normal application functionality
so that initialization can be restarted if it runs into errors.
Resulting Context: In short, System Integrity Control has
a major role during bootstrapping, after which it hands control
over to the executive scheduler, which in turn lets System Integrity
Control regain control for short periods of time on a periodic
basis.
See also Audit-Derivable Constants After Recovery.
Rationale: During a recovery event (phase or bootstrap),
SICO calls processor initialization software first, peripheral
initialization software second, then application initialization
software, and finally transfers to executive control. Unlike a
classic computer program where initialization takes place first,
and "normal execution" second, the SICO architecture
does not place software initialization as the highest level function.
System integrity is at an even higher level than system initialization.
The architecture is based on a base level cycle in the executive
control. After bootstrapping, the first item in the base cycle
is SICO (though this is different code than that run during bootstrapping).
So after the SICO part of bootstrapping is done, the base level
part of SICO is entered each base level cycle to monitor the
system on a periodic basis.
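The control structure described so far can be sketched roughly as below. This is a simplification under stated assumptions: all function names are hypothetical, initialization failures are modeled as `RuntimeError`, and the real system's phases and escalation levels are collapsed into a simple retry loop.

```python
def sico_recover(processor_init, peripheral_init, application_init,
                 max_attempts=3):
    """Run the initialization steps in SICO's order, restarting on failure.

    Returns the number of attempts needed. Integrity control sits above
    initialization, so an error in any step restarts the whole sequence.
    """
    order = (processor_init, peripheral_init, application_init)
    for attempt in range(1, max_attempts + 1):
        try:
            for step in order:
                step()
        except RuntimeError:
            continue          # restart initialization from the beginning
        return attempt        # ready to transfer to executive control
    raise RuntimeError("initialization did not complete")


def base_cycle(sico_check, jobs):
    """One pass of the executive's base-level cycle; SICO runs first.

    Returns False when integrity is in doubt, in which case the caller
    is expected to re-enter sico_recover instead of running the jobs.
    """
    if not sico_check():
        return False
    for job in jobs:
        job()
    return True
```

The point of the shape is that the same authority wraps both phases: `sico_recover` during bootstrap, and the first slot of every `base_cycle` afterward.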
System integrity must be alert to watch for failures during both
bootstrap and normal base level operation. There is a system
integrity monitor in the base level to watch timers as well as
overload control and audit control (not to run audits, but to
ask audits if there are error conditions). These are checking
in with SICO to report software and hardware failures and potentially
request initialization.
During bootstrap and initialization, system integrity employs
a number of similar mechanisms to monitor the system. For example,
Analog Timers, Boot Timers, Try All Hardware
Combos and others.
Much of the rationale comes from AUTOVON, Safeguard, missile guidance
systems, and other high-reliability real-time projects from early
AT&T stored program control experience. See the Bell System
Technical Journal Vol. 56 No. 7, Sept. 1977, pp. 1145-7, 1163-7.
Author: Robert Hanmer
Pattern: Try All Hardware Combos
Problem: The Central Controller (CC) has several configurations.
There are many possible paths through CC subsystems depending
on the configuration. How do you select a workable configuration
in light of a faulty subsystem?
Context: Highly fault-tolerant computing complexes such
as the 1B processor.
The processing complex has a number of duplicated subsystems.
Each one consists of a CC, a set of call stores, a call store
bus, a set of program stores, a program store bus, and an interface
bus. Major subsystems are duplicated with standby units to increase
reliability, not to provide distributed processing capabilities.
There are 64 possible configurations of these subsystems, given
fully duplicated sparing. Each configuration is said to represent
a configuration state.
The system is brought up in stages. First, you need memory units
working; then you need to talk to the disk, so you can pump stuff
into memory, which allows you to run programs to pump the rest
of the stores, so that code can recover other units. Second,
after the base system is configured and refreshed from disk, you
can bring up the application.
Forces: You want to catch and remedy single, isolated
errors.
You also want to catch errors that aren't easily detected in isolation,
but which result from interaction between modules.
You sometimes must catch multiple concurrent errors.
The CC can't sequence subsystems through configurations, since
it may be faulty itself.
The machine should recover by itself without human intervention
(Minimize Human Intervention).
Solution: Maintain a 64-counter in hardware. We call this
the configuration counter. There is a table that maps from that
counter onto a configuration state; in the 1A, it's in the hardware;
in the 1B, it's in the Boot ROM. Every time the system fails to
get through a Processor Configuration (PC) to a predetermined
level of stability, it restarts the system with a successive value
of the configuration counter.
In 5ESS®, there is a similar 16-counter. It first tries all
side zero (a complete failure group), then all side one (the other
failure group), hoping to find a single failure. The subsequent
counting states look for more insidious problems that come from
interactions between members of these coarse failure groups.
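A sketch of the counter-driven search follows. One assumption needs stating loudly: the mapping here is a plain binary encoding (one bit per duplicated subsystem, six subsystems, hence 2^6 = 64 states), chosen only because it is easy to read. The real 1A and 1B mappings are table-driven and ordered differently. The subsystem names follow the Context section; `attempt_pc` is a hypothetical stand-in for a Processor Configuration attempt.

```python
SUBSYSTEMS = ("CC", "call stores", "call store bus",
              "program stores", "program store bus", "interface bus")


def configuration(state):
    """Map a counter value (0..63) to side 0 or 1 for each subsystem.

    Assumed binary encoding for illustration; not the real hardware table.
    """
    return {name: (state >> bit) & 1 for bit, name in enumerate(SUBSYSTEMS)}


def try_all_combos(attempt_pc, start=0):
    """Step the configuration counter until a PC attempt succeeds.

    attempt_pc(config) returns True when the system reaches the
    predetermined level of stability. Returns the successful counter
    value, or None if all 64 states fail.
    """
    for offset in range(64):
        state = (start + offset) % 64
        if attempt_pc(configuration(state)):
            return state
    return None
```

Because each subsystem's side is chosen independently, any set of faults with at most one bad side per subsystem is avoided by some counter value.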
Resulting Context: The system can deal with any number
of concurrent faults provided that there is at most one fault
per subsystem.
The state will increment when a reboot (PC) fails.
Sometimes the fault won't be detected right away after the reboot
sequence (i.e. not until more than 30 seconds after the resumption
of normal activities). This problem is addressed in Fool Me
Once.
Sometimes, going through all 64 states isn't enough; see Don't
Trust Anyone and Analog Timer.
Rationale: This design is based on the FIT rates of the
original hardware, and on the extreme caution of first-generation
stored program control switching system developers.
Note that the pattern Blind Search apparently violates
this pattern, because it uses a store to hold the identity of
the out-of-service module; this is addressed with the pattern
Multiple Copies of Base Store.
TABLE 1. Configurations Established by Emergency Action Switching
(Status of Units after a Switch Performed by the Indicated State).
From p. 2006 of BSTJ XLIII 5(10), Sept. 1964.
PC State  CC0  CC1  PS0  PS1  Bus0  Bus1  Other Stores
X000      U    U    U    U    U     U     U
X001      C    C    U    U    U     U     U
X010      U    U    U    U    C     C     U
X011      U    U    A    S    A     S     T
X100      U    U    A    S    S     A     T
X101      U    U    S    A    S     A     T
X110      U    U    S    A    A     S     T
X111      U    U    S    A    A     S     T
X: Don't care
A: Active
S: Standby
U: Unchanged
C: Complemented
T: Marked as having trouble
Author: Robert Gamoke, 1995/03/24; 5ESS information, Fred
Keeve, 1995/04/14.
See BSTJ XLIII No. 5, Part 1, p. 2005-2009.
Pattern: Fool Me Once
Problem: Sometimes the fault causing a Processor Configuration
(PC) is very intermittent (usually triggered by software, such
as diagnostics). After a recovery in PC completes, users expect
the configuration state display to disappear from the system human
control interface, and the system to be sane. If the configuration
state display remains visible for more than 30 seconds,
users become concerned that the system may still have a problem.
But if the system in fact trips on another fault, it may reboot
itself (take a phase) and re-initiate the initialization sequence
using the same configuration as it did the previous time (or,
worse, at the beginning of the configuration sequence) which raises
the probability that the system will loop in reboots ("roll
in recovery") and never attempt different configurations.
Context: Highly available systems using redundancy, employing
the pattern Try All Hardware Combos.
You're going through Try All Hardware Combos. The system
finds an ostensibly sane state and progresses 30 seconds into
initialization, beyond boot and into the application. The application
claims to know that the hardware is sane if it can get 30 seconds
into initialization (using a leaky bucket counter). When the system
reaches this state, it resets the configuration counter. However,
a latent error can cause a system fault after the configuration
counter has been reset. The system no longer "knows"
that it is in PC escalation, and retries the same configuration
that has already been proven to not work.
Forces: It's hard to set a universally correct interval
for a leaky bucket counter; sometimes, 30 seconds is too
short. The application (and customer) would be upset if the leaky
bucket counter were set too long (for example, for half an hour;
the customer doesn't want to wait a half hour for a highly reliable
system to clear its fault status). Some errors take a long time
to appear, even though they're fundamental hardware errors (e.g.,
an error in program store that isn't accessed until very late
in the initialization cycle, or until a hardware fault is triggered
by a diagnostic run out of the scheduler). People's expectations
are among the most important forces at work here. In spite of
the potential latencies for some classes of fault, the application
and user feel assured that the system must be sane if it's been
exercised for 30 seconds.
Solution: The first time the application tells PC that
"all is well", believe it, and reset the configuration
counter. The second and subsequent times within a longer time
window, ignore the request.
The first request to reset the configuration counter indicates
that the application's 30 second leaky bucket counter says
that everything is fine. Set up a half-hour leaky bucket counter
to avoid being fooled. If the application tries to reset the 64-state
configuration counter twice in a half hour, ignore it. This
would indicate recurring failures that result in reboots.
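The reset policy can be sketched as below. This is an illustration under assumptions: names are hypothetical, time is passed in as seconds, and the half-hour leaky bucket counter is simplified to a single timestamp comparison.

```python
class ConfigCounterGuard:
    """Honor the first "all is well" reset; ignore repeats within a window."""

    def __init__(self, window=1800.0):
        self.window = window        # seconds; half an hour in the text
        self.config_counter = 0     # position in the 64-state sequence
        self.last_reset = None      # time of the last honored reset

    def pc_failed(self):
        """A Processor Configuration attempt failed: try the next state."""
        self.config_counter = (self.config_counter + 1) % 64

    def request_reset(self, now):
        """Application reports sanity at time `now`; return True if honored."""
        if self.last_reset is not None and now - self.last_reset < self.window:
            return False            # fooled once already; keep the counter
        self.config_counter = 0
        self.last_reset = now
        return True
```

When a request is ignored the counter keeps its value, so a later failure advances to a configuration that has not yet been tried.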
Resulting Context: Any subsequent latent failures will
cause the configuration counter to advance, guaranteeing that
the next PC will use a fresh configuration. For a single subsystem
error that's taking the system down, this strategy will eventually
reach a workable configuration. Once the system is up, schedule
diagnostics to isolate the faulty unit. See also People Know
Best. The system will be able to handle repetitive failures
outside the shorter window thereby reinforcing Minimize Human
Intervention.
Rationale: See the forces. It's better to escalate to exceptionally
extravagant strategies like this, no matter how late, if it eventually
brings the system back on line. The pattern has been found to
be empirically sound.
Author: Robert Gamoke, 1995/03/24