Accurately Predicting System Success and Failure

Success and failure of a system are often the result of complex relationships among the parts (or nodes) of the system, each of which operates at some reasonable fraction of full efficiency. Naturally, the goal is 100% success for any system. How can the outcome be predicted in advance? And if the outcome can be predicted, how can the nodes in the system be adjusted or modified to increase the probability of success?

Some of the worst failures are system failures; that is, they originate from the interaction of subsystem deficiencies which by themselves do not produce an end system failure but together can and do. Four catastrophic civil space system failures were of this variety: Apollo 1, Apollo 13, Challenger, and the Hubble Space Telescope. Tom Clancy fans might remember that such a failure almost caused World War III in his novel Debt of Honor. In each of these cases, had any one of the deficiencies not existed, the catastrophic or nearly catastrophic end result could not have occurred. In other words, though each deficiency was necessary, none was sufficient for end failure. Failure review boards should operate under the rule that until a diagnosis indicates that the set of presumed causes is both necessary and sufficient - and that no other such set exists - the discovery-and-fix process is incomplete and high-quality performance of the system is not assured.

It should be noted that fault tolerance through redundancy has proved less desirable than its proponents had hoped. Because fault-tolerant designs "hide" single faults by working around them, faults accumulate until an overall system failure occurs. Diagnostics then become much more difficult, and symptoms become intertwined. Certification of complete repair cannot be guaranteed because successful-but-partial operation again hides undetected (tolerated) faults. The problem is most evident in servicing modern, microprocessor-rich automobile controls. System engineers know that fault avoidance is preferable to fault tolerance in system design.

Current research on mathematical models is revealing that internet connectivity, power distribution, pharmaceutical production processes, clinical trial operations, and even successful INDs and NDAs share common network graph properties.

Network Analysis Model

Fig. 1a. Fig. 1b. Fig. 1c.

Figure 1a shows two hypothetical power distribution networks and the connections between them, based on geographical location. The figure exemplifies a situation in which an initial contamination and subsequent failure of only one distribution center may lead to an iterative cascade of failures that fragments both networks, producing systemic failure. (Periodically, the US power grid has suffered major failures in population centers like New York and along the eastern corridor for such reasons.) Obviously, pharmaceutical production processes can easily be modeled as a similar network of nodes, as can multisite, global clinical trials.

Furthermore, an IND can also be modeled as a network of nodes. Some nodes are scientific and some are regulatory, but each has defined functions and performance standards. Success results from a properly connected network without systemic failure. Figure 1b summarizes the sections of an IND that can be modeled as interconnected network nodes. Of course, each of these sections has a number of subsections that add granularity to the network of nodes. For an isolated single network, a significant number of the network nodes must be randomly removed before the network breaks down. However, when the dependencies between the networks are taken into account, removal of only a small fraction of nodes can result in the complete fragmentation of the entire system.

To model interdependent networks, we consider for the sake of simplicity, and without loss of generality, two networks, A and B, with the same number of nodes, N. For example, A might be the preclinical data nodes in an IND, while B might be the clinical nodes. The functioning of node Ai (i = 1, 2, …, N) in network A depends on the ability of node Bi in network B to supply a critical operation, and vice versa. If node Ai stops functioning owing to some degree of failure, node Bi stops functioning. Similarly, if node Bi stops functioning, then node Ai stops functioning. We denote such dependence by a bidirectional link, Ai ↔ Bi, that defines a one-to-one correspondence between the nodes of network A and the nodes of network B. Within network A, the nodes are randomly connected by A-links with degree distribution PA(k), where the degree, k, of each node is defined as the number of A-links connected to that node in network A. Analogously, within network B, the nodes are randomly connected by B-links with degree distribution PB(k).
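
To make this construction concrete, the following minimal Python sketch (assuming the networkx library; N, the average degrees, and all variable names are illustrative choices, not part of the model as published) builds two randomly connected networks of equal size and the one-to-one A↔B dependency links:

    import random
    import networkx as nx

    N = 1000            # number of nodes in each network (illustrative)
    k_A, k_B = 4, 4     # assumed average degrees of networks A and B

    # Erdos-Renyi graphs: an edge probability of <k>/(N - 1) gives the desired average degree.
    A = nx.gnp_random_graph(N, k_A / (N - 1))
    B = nx.gnp_random_graph(N, k_B / (N - 1))

    # One-to-one bidirectional dependency links A_i <-> B_i; a random permutation
    # makes the pairing independent of each network's internal wiring.
    pairing = list(range(N))
    random.shuffle(pairing)
    depends_on = {("A", i): ("B", pairing[i]) for i in range(N)}
    depends_on.update({("B", pairing[i]): ("A", i) for i in range(N)})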

In simple simulation runs, one begins by randomly removing a fraction, 1 − p, of the nodes of network A and removing all the A-links connected to these removed nodes. Because of the dependence between the networks, all the nodes in network B that are connected to the removed A-nodes by A↔B links must also be removed. Any B-links connected to the removed B-nodes are then also removed. As nodes and links are sequentially removed, each network begins to fragment into connected components, which are called clusters. The clusters in network A and the clusters in network B are different because each network is connected differently. A set of nodes, a, in network A and the corresponding set of nodes, b, in network B form a mutually connected set if (1) each pair of nodes in a is connected by a path that consists of nodes belonging to a and links of network A, and (2) each pair of nodes in b is connected by a path that consists of nodes belonging to b and links of network B. A mutually connected set is called a mutually connected cluster if it cannot be enlarged by adding other nodes while still satisfying these conditions. Only mutually connected clusters are potentially functional.

To identify these mutually connected clusters, the a1-clusters are defined as the clusters of network A remaining after a fraction 1 − p of the A-nodes are removed as the result of a malfunction. This state of the networks is the first stage in the cascade of failures. Next we define the b1-sets as the sets of B-nodes that are connected to a1-clusters by A↔B links. According to the definition of mutually connected clusters, all the B-links connecting different b1-sets must be removed. Because the two networks are connected differently, each b1-set may split into several clusters, which we define as b2-clusters. The b1-sets that do not split, and hence coincide with a1-clusters, are mutually connected. This state of the networks is the second stage in the cascade of failures. In the third stage, we determine all the a3-clusters in a similar way, and in the fourth stage we determine all the b4-clusters. We continue this process until no further splitting and link removal can occur. This process leads to a percolation phase transition for the two interdependent networks at a critical threshold, p = pc, which is significantly larger than the equivalent threshold for a single network. As in classical network theory, the giant mutually connected component is defined to be the mutually connected cluster spanning the entire network. Below pc there is no giant mutually connected component, whereas above pc a giant mutually connected cluster exists.
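
A hedged sketch of this cascade in Python (again assuming networkx; the function names are illustrative) is given below. For simplicity it takes the dependency pairing to be the identity, Ai ↔ Bi, which for randomly wired networks is statistically equivalent to a random pairing, and it tracks only the giant cluster at each stage, which, as argued below, is the only cluster that matters for functionality:

    import random
    import networkx as nx

    def giant_component(G, nodes):
        """Return the nodes of the largest connected component of G restricted to `nodes`."""
        sub = G.subgraph(nodes)
        if sub.number_of_nodes() == 0:
            return set()
        return max(nx.connected_components(sub), key=len)

    def mutual_giant_component(A, B, p):
        """Fraction of nodes remaining in the giant mutually connected cluster after
        randomly removing a fraction 1 - p of the A-nodes (dependency links A_i <-> B_i)."""
        N = A.number_of_nodes()
        survivors = set(random.sample(range(N), int(p * N)))   # stage 1: keep a fraction p of A
        while True:
            survivors = giant_component(A, survivors)          # prune to the giant cluster in A
            new_survivors = giant_component(B, survivors)      # prune its partners to the giant cluster in B
            if new_survivors == survivors:                     # no further splitting: cascade has stopped
                return len(new_survivors) / N
            survivors = new_survivors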

Our insight based on percolation theory is that when the network is fragmented, the nodes belonging to the giant component connecting a finite fraction of the network are still functional, whereas the nodes that are part of the remaining small clusters become nonfunctional. Therefore, for interdependent networks only the giant mutually connected cluster is of interest. The probability that two neighboring A-nodes are connected by A↔B links to two neighboring B-nodes scales as 1/N. Hence, at the end of the cascade process of failures, above pc only very small mutually connected clusters and one giant mutually connected cluster exist, in contrast to traditional percolation, wherein the cluster size distribution obeys a power law. When the giant component exists, the interdependent networks preserve their functionality; if it does not exist, the networks split into small fragments that cannot function on their own.

The model is first applied to the case of two Erdős–Rényi networks with average degrees ⟨kA⟩ and ⟨kB⟩. A random fraction, 1 − p, of the nodes in network A is removed and the iterative process of forming a1-, b2-, a3-, …, b2k-, and a2k+1-clusters is followed as described above. After each stage, n, in the cascade of failures, we determine the fraction of nodes, µn, in the largest an- or bn-cluster. At the end of the process, µn converges to µ, the probability that a randomly chosen node belongs to the largest mutually connected cluster. As N→∞, the probability µ converges to a well-defined function, µ∞(p), which has a single step discontinuity at the threshold p = pc, where it changes from zero for p < pc to µ∞(pc) > 0 for p ≥ pc. This behavior is characteristic of a first-order phase transition, quite different from a second-order phase transition such as that characterizing percolation of a single network, where µ∞(p) is a continuous function at p = pc. For finite N and p close to pc, the giant mutually connected component exists in a particular network realization with probability P(p,N). As N→∞, P(p,N) converges to a Heaviside step function, θ(p − pc), which changes discontinuously from zero for p < pc to one for p > pc. Simulation results for the value of pc agree with the analytical results presented below.
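
A rough numerical illustration of this first-order jump, reusing the cascade sketch above (numpy assumed; the range of p, the number of runs, and the network size are arbitrary choices), might look like the following. For finite N the transition is smoothed, but near the analytical threshold derived below (pc ≈ 2.4554/⟨k⟩ for two Erdős–Rényi networks) the measured fraction µ should drop abruptly rather than continuously:

    import numpy as np

    # Reuses A, B (two <k> = 4 Erdos-Renyi networks of N nodes) and
    # mutual_giant_component() from the sketches above.
    p_values = np.linspace(0.45, 0.85, 21)
    runs = 10                        # average over several random removals per value of p

    for p in p_values:
        mu = np.mean([mutual_giant_component(A, B, p) for _ in range(runs)])
        print(f"p = {p:.2f}   mu = {mu:.3f}")   # expect an abrupt drop near p ~ 0.61 for <k> = 4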

For two interdependent scale-free networks with power-law degree distributions, PA(k) = PB(k) ∝ k^(−λ), the existence criteria for the giant component are quite different from those in a single network. For a single scale-free network with λ ≤ 3, a giant component exists for every non-zero value of p. However, for interdependent scale-free networks, the giant component does not exist below the critical value pc ≠ 0, even for 2 < λ ≤ 3.

In the case of a single network, pc is smaller for a broader degree distribution. In sharp contrast, for interdependent networks a broader degree distribution results in a larger value of pc because high-degree nodes of one network can depend on low-degree nodes of the other. The hubs (defined as nodes of exceptionally large degree) that have a dominant role in the robustness of a single network become vulnerable when a cascade of failures occurs in two interdependent networks. Moreover, a broader distribution with the same average degree implies that there are more low-degree nodes. Because the low-degree nodes are more easily disconnected, the advantage of a broad distribution for a single network becomes a disadvantage for interdependent networks. This behavior is demonstrated by comparing simulation results for several scale-free networks with different λ values, an Erdős–Rényi network, and a random regular network, all with an average degree of ⟨k⟩ = 4. The simulation results are in full agreement with analytical results and show that pc is indeed higher for a broader distribution.
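
As a partial, hedged check of this claim, the cascade sketch above can be run on networks with different degree distributions. The fragment below compares only the Erdős–Rényi and random regular cases (scale-free graphs are omitted for brevity); the collapse threshold it estimates is crude, and the 5% cutoff is an arbitrary choice, but the narrower (random regular) distribution should lose its mutual giant component at a smaller p than the Erdős–Rényi network, consistent with the text:

    import numpy as np
    import networkx as nx

    N, k = 2000, 4
    builders = {
        "Erdos-Renyi":    lambda: nx.gnp_random_graph(N, k / (N - 1)),
        "random regular": lambda: nx.random_regular_graph(k, N),
    }

    for name, make in builders.items():
        A, B = make(), make()
        # Crude p_c estimate: largest p (scanning downward) at which the mutual giant
        # component has already shrunk below 5% of the nodes.
        for p in np.arange(1.0, 0.30, -0.02):
            mu = np.mean([mutual_giant_component(A, B, p) for _ in range(3)])
            if mu < 0.05:
                print(f"{name}: mutual giant component collapses near p ~ {p:.2f}")
                break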

Next, the model of interdependent networks is solved using the mathematical technique of generating functions. The generating functions are defined for network A; similar equations describe network B. Generating functions of the degree distributions, GA0(z) = Σk PA(k) z^k, are introduced. Analogously, generating functions of the underlying branching processes, GA1(z) = G′A0(z)/G′A0(1), are also introduced.
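
For the Erdős–Rényi case used later, the degree distribution is Poisson, and these generating functions reduce to a particularly simple closed form. The short sympy check below (sympy is an assumption; the same result is easily obtained by hand) confirms that GA0(z) = exp[⟨kA⟩(z − 1)] and that GA1 coincides with GA0 for Poisson degrees:

    import sympy as sp

    z, k_avg = sp.symbols("z k_avg", positive=True)
    k = sp.symbols("k", integer=True, nonnegative=True)

    # Poisson degree distribution P(k) = exp(-<k>) <k>^k / k!  (Erdos-Renyi, large N)
    P = sp.exp(-k_avg) * k_avg**k / sp.factorial(k)

    G0 = sp.summation(P * z**k, (k, 0, sp.oo))         # G_A0(z) = sum_k P_A(k) z^k
    G1 = sp.diff(G0, z) / sp.diff(G0, z).subs(z, 1)    # G_A1(z) = G'_A0(z) / G'_A0(1)

    print(sp.simplify(G0))    # exp(k_avg*(z - 1)), possibly in an equivalent form
    print(sp.simplify(G1))    # the same expression: G_A1 = G_A0 for Poisson degrees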

Random removal of a fraction, 1 − p, of the nodes will change the degree distribution of the remaining nodes, so the generating function of the new distribution is equal to the generating function of the original distribution with argument 1 − p(1 − z). Denote the subsets of nodes remaining after the random removal of a fraction 1 − p of the nodes as A0 ⊂ A and B0 ⊂ B, and note that there is a one-to-one correspondence between nodes in A0 and nodes in B0, established by the A↔B links. As the total number of nodes in network A is N, the number of nodes in A0 and B0 is N0 = pN. The fraction of nodes that belong to the giant component of network A0 is gA(p) = 1 − GA0[1 − p(1 − fA)], where fA is a function of p that satisfies the transcendental equation fA = GA1[1 − p(1 − fA)]. Analogous equations exist for network B.
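
For a concrete Erdős–Rényi example (where, as the sympy check above suggests, GA0(z) = GA1(z) = exp[⟨kA⟩(z − 1)]), the transcendental equation for fA can be solved by simple fixed-point iteration. The following sketch is one hedged way to do it; the function name and tolerances are illustrative:

    import numpy as np

    def g_fraction(p, k_avg, tol=1e-12, max_iter=10000):
        """g_A(p) = 1 - G_A0[1 - p(1 - f_A)] for an Erdos-Renyi network with average
        degree k_avg, with f_A obtained from f_A = G_A1[1 - p(1 - f_A)] by iteration."""
        G = lambda z: np.exp(k_avg * (z - 1.0))     # G_A0 = G_A1 for Poisson degrees
        f = 0.0                                     # starting guess for f_A
        for _ in range(max_iter):
            f_new = G(1.0 - p * (1.0 - f))
            if abs(f_new - f) < tol:
                break
            f = f_new
        return 1.0 - G(1.0 - p * (1.0 - f))

    print(g_fraction(1.0, 4))   # ~0.98: the giant component of a single intact <k> = 4 ER graph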

Using the generating function approach, the fraction, µn, of the nodes in the giant component after stage n in the cascade of failures obeys a simple recursion relation, and there is good agreement between simulations and theory.

To determine the final size of the giant mutually connected component, recall that the fraction of nodes in the giant mutually connected component, µ, is the limit of the sequence µn as n→∞. This limit must satisfy µ2m+1 = µ2m = µ2m−1 because the cluster is not further fragmented. This condition leads to the following system of two equations in the two unknowns x and y, where µ = x gB(x) = y gA(y):

x = gA(y)p,   y = gB(x)p.   (1)

The system of equations (1) has one trivial solution, x = 0 and y = 0, for any value of p, corresponding to a giant mutually connected component of zero size. If p is large enough, there also exists a solution for which the giant mutually connected component has non-zero size. Excluding y from these equations yields a single equation,

x = gA[gB(x)p]p,   (2)

which can be solved graphically as the intersection of the straight line y = x and the curve y = gA[gB(x)p]p. When p is small enough, the curve increases very slowly and does not intersect the straight line (except at the origin, where the intersection corresponds to the trivial solution). A nontrivial solution first emerges in the critical case, p = pc, in which the line is tangent to the curve at a single point, x = xc, where the two have equal derivatives. The tangency condition is therefore

1 = p² g′A[gB(x)p] g′B(x),   (3)

which, together with equation (2), yields the solution for pc and the critical size of the giant mutually connected component, µ(pc) = xc gB(xc).

In the case of two Erdős–Rényi networks, the problem can be solved explicitly. Then GA1(z) = GA0(z) = exp[⟨kA⟩(z − 1)] and GB1(z) = GB0(z) = exp[⟨kB⟩(z − 1)], and the system of transcendental equations (2) and (3) for the critical value p = pc can be expressed in terms of elementary functions. In the simple case ⟨kA⟩ = ⟨kB⟩ = ⟨k⟩, the critical parameters can be expressed in terms of the nontrivial root f = fA = fB ≈ 0.28467 of the equation

f = exp[(f − 1)/(2f)].

Then pc = [2⟨k⟩f(1 − f)]^(−1) ≈ 2.4554/⟨k⟩ and µ(pc) = (1 − f)/(2⟨k⟩f) ≈ 1.2564/⟨k⟩. The simulations of Erdős–Rényi networks agree with the theory.
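
These closed-form values are easy to verify numerically. A minimal check (scipy and numpy are assumptions) finds the nontrivial root and reproduces the quoted constants:

    import numpy as np
    from scipy.optimize import brentq

    # Nontrivial root of f = exp[(f - 1)/(2f)] on (0, 1); f = 1 is the trivial root.
    f = brentq(lambda f: f - np.exp((f - 1.0) / (2.0 * f)), 1e-6, 1.0 - 1e-9)

    k = 4.0                                    # example average degree <k>
    p_c = 1.0 / (2.0 * k * f * (1.0 - f))      # p_c = [2<k> f (1 - f)]^(-1)
    mu_c = (1.0 - f) / (2.0 * k * f)           # mu(p_c) = (1 - f)/(2<k> f)

    print(f"f       = {f:.5f}")     # ~0.28467
    print(f"p_c     = {p_c:.4f}")   # ~0.6139, i.e. 2.4554/<k> with <k> = 4
    print(f"mu(p_c) = {mu_c:.4f}")  # ~0.3141, i.e. 1.2564/<k>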

The known result for a single scale-free network, namely that pc→0 as N→∞ for λ ≤ 3, is not valid for two interdependent scale-free networks, where instead pc is finite for any λ > 2. Analysis of the behavior of the generating functions as z→1 shows that as x→0 the right-hand side of equation (2) can be approximated by a power law, Cx^η, where C is a constant and

η = 1/[(3 − λA)(3 − λB)].

For 2 < λA < 3 and 2 < λB < 3, η > 1. Thus, the curve y = gA[gB(x)p]p always passes below the line y = x as x→0, and for sufficiently small values of p there is no nontrivial solution, which means that the giant mutually connected component is absent. Hence, there is a percolation phase transition at some finite p = pc > 0.

The model presented here captures the important phenomenon of a cascade of failures in interdependent networks that results in the first-order percolation phase transition. The model can be generalized to the case of three or more interdependent networks (e.g., preclinical, clinical, and regulatory sections of an IND), to the case in which the A↔B links connecting the networks are unidirectional rather than bidirectional (e.g., preclinical experiments are not refined or executed on the basis of clinical results), and to the case in which a node from network A can depend on more than one node from network B (e.g., a clinical node depends upon input from a regulatory node as well as a preclinical node). All these generalizations can be treated analytically by using generating functions, provided the networks are randomly connected and uncorrelated. In real life, however, generating functions cannot be used because the networks are never randomly connected and uncorrelated. In such cases, explicit simulation of every step in the evolution of the network operation is required.