Accurately Predicting System Success and Failure

Success and failure of a system are often the result of complex relationships among the parts (or nodes) of the system, each of which operates at some reasonable fraction of full efficiency. Naturally, the goal is 100% success for any system. How can the outcome be predicted in advance? And if the outcome can be predicted, how can the nodes in the system be adjusted or modified to increase the probability of success?

Some of the worst failures are system failures; that is, they originate from the interaction of subsystem deficiencies which by themselves do not produce an end system failure but together can and do. Four catastrophic civil space system failures were of this variety: Apollo 1, Apollo 13, Challenger, and the Hubble Space Telescope. Tom Clancy fans might remember that such a failure almost caused World War III in his novel Debt of Honor. In each of these cases, had any one of the deficiencies not existed, the catastrophic or nearly catastrophic end result could not have occurred. In other words, though each deficiency was necessary, none was sufficient for end failure. Failure review boards should operate under the rule that until a diagnosis indicates that the set of presumed causes is both necessary and sufficient - and that no other such set exists - the discovery-and-fix process is incomplete and high-quality performance of the system is not assured.

It should be noted that fault tolerance through redundancy has proved less desirable than its proponents had hoped. Because fault-tolerant designs "hide" single faults by working around them, faults accumulate until an overall system failure occurs. Diagnostics then become much more difficult, and symptoms become intertwined. Certification of complete repair cannot be guaranteed because successful-but-partial operation again hides undetected (tolerated) faults. The problem is most evident in servicing modern, microprocessor-rich automobile controls. System engineers know that fault avoidance is preferable to fault tolerance in system design.

Current research on mathematical models is revealing that internet connectivity, power distribution, pharmaceutical production processes, clinical trial operations, and even successful INDs and NDAs share common network graph properties.

Network Analysis Model

Fig. 1a. Fig. 1b. Fig. 1c.

Figure 1a shows two hypothetical power distribution networks and the connections between them, based on geographical location. The figure exemplifies a situation in which an initial contamination and subsequent failure of only one distribution center may lead to an iterative cascade of failures that fragments both networks, producing systemic failure. (Periodically, the US power grid has suffered major failures in population centers like New York and along the eastern corridor for such reasons.) Obviously, pharmaceutical production processes can easily be modeled as a similar network of nodes, as can multisite, global clinical trials.

Furthermore, an IND can also be modeled as a network of nodes. Some nodes are scientific and some are regulatory, but each has defined functions and performance standards. Success results from a properly connected network without systemic failure. Figure 1b summarizes the sections of an IND that can be modeled as interconnected network nodes. Of course, each of these sections has a number of subsections that add granularity to the network of nodes. For an isolated single network, a significant number of the network nodes must be randomly removed before the network breaks down. However, when the dependencies between the networks are taken into account, removal of only a small fraction of nodes can result in the complete fragmentation of the entire system.

To model interdependent networks, we consider for the sake of simplicity, and without loss of generality, two networks, A and B, with the same number of nodes, N. For example, A might be the preclinical data nodes in an IND, while B might be the clinical nodes. The functioning of node Ai (i = 1, 2, …, N) in network A depends on the ability of node Bi in network B to supply a critical operation, and vice versa. If node Ai stops functioning owing to some degree of failure, node Bi stops functioning. Similarly, if node Bi stops functioning, then node Ai stops functioning. We denote such dependence by a bidirectional link, Ai ↔ Bi, that defines a one-to-one correspondence between the nodes of network A and the nodes of network B. Within network A, the nodes are randomly connected by A-links with degree distribution PA(k), where the degree, k, of each node is defined as the number of A-links connected to that node in network A. Analogously, within network B, the nodes are randomly connected by B-links with degree distribution PB(k).
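
To make this construction concrete, the following minimal Python sketch (assuming the networkx library; N, the average degrees, and all variable names are illustrative choices, not part of the model as published) builds two randomly connected networks of equal size and the one-to-one A↔B dependency links:

    import random
    import networkx as nx

    N = 1000            # number of nodes in each network (illustrative)
    k_A, k_B = 4, 4     # assumed average degrees of networks A and B

    # Erdos-Renyi graphs: an edge probability of <k>/(N - 1) gives the desired average degree.
    A = nx.gnp_random_graph(N, k_A / (N - 1))
    B = nx.gnp_random_graph(N, k_B / (N - 1))

    # One-to-one bidirectional dependency links A_i <-> B_i; a random permutation
    # makes the pairing independent of each network's internal wiring.
    pairing = list(range(N))
    random.shuffle(pairing)
    depends_on = {("A", i): ("B", pairing[i]) for i in range(N)}
    depends_on.update({("B", pairing[i]): ("A", i) for i in range(N)})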

In simple simulation runs, one begins by randomly removing a fraction, 1 − p, of the nodes of network A and removing all the A-links connected to these removed nodes. Because of the dependence between the networks, all the nodes in network B that are connected to the removed A-nodes by A↔B links must also be removed. Any B-links connected to the removed B-nodes are then also removed. As nodes and links are sequentially removed, each network begins to fragment into connected components, which are called clusters. The clusters in network A and the clusters in network B are different because each network is connected differently. A set of nodes, a, in network A and the corresponding set of nodes, b, in network B form a mutually connected set if (1) each pair of nodes in a is connected by a path that consists of nodes belonging to a and links of network A, and (2) each pair of nodes in b is connected by a path that consists of nodes belonging to b and links of network B. A mutually connected set is called a mutually connected cluster if it cannot be enlarged by adding other nodes while still satisfying these conditions. Only mutually connected clusters are potentially functional.

To identify these mutually connected clusters, the a1-clusters are defined as the clusters of network A remaining after a fraction 1 − p of the A-nodes are removed as the result of a malfunction. This state of the networks is the first stage in the cascade of failures. Next we define the b1-sets as the sets of B-nodes that are connected to a1-clusters by A↔B links. According to the definition of mutually connected clusters, all the B-links connecting different b1-sets must be removed. Because the two networks are connected differently, each b1-set may split into several clusters, which we define as b2-clusters. The b1-sets that do not split, and hence coincide with a1-clusters, are mutually connected. This state of the networks is the second stage in the cascade of failures. In the third stage, we determine all the a3-clusters in a similar way, and in the fourth stage we determine all the b4-clusters. We continue this process until no further splitting and link removal can occur. This process leads to a percolation phase transition for the two interdependent networks at a critical threshold, p = pc, which is significantly larger than the equivalent threshold for a single network. As in classical network theory, the giant mutually connected component is defined to be the mutually connected cluster spanning the entire network. Below pc there is no giant mutually connected component, whereas above pc a giant mutually connected cluster exists.
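
A hedged sketch of this cascade in Python (again assuming networkx; the function names are illustrative) is given below. For simplicity it takes the dependency pairing to be the identity, Ai ↔ Bi, which for randomly wired networks is statistically equivalent to a random pairing, and it tracks only the giant cluster at each stage, which, as argued below, is the only cluster that matters for functionality:

    import random
    import networkx as nx

    def giant_component(G, nodes):
        """Return the nodes of the largest connected component of G restricted to `nodes`."""
        sub = G.subgraph(nodes)
        if sub.number_of_nodes() == 0:
            return set()
        return max(nx.connected_components(sub), key=len)

    def mutual_giant_component(A, B, p):
        """Fraction of nodes remaining in the giant mutually connected cluster after
        randomly removing a fraction 1 - p of the A-nodes (dependency links A_i <-> B_i)."""
        N = A.number_of_nodes()
        survivors = set(random.sample(range(N), int(p * N)))   # stage 1: keep a fraction p of A
        while True:
            survivors = giant_component(A, survivors)          # prune to the giant cluster in A
            new_survivors = giant_component(B, survivors)      # prune its partners to the giant cluster in B
            if new_survivors == survivors:                     # no further splitting: cascade has stopped
                return len(new_survivors) / N
            survivors = new_survivors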

Our insight based on percolation theory is that when the network is fragmented, the nodes belonging to the giant component connecting a finite fraction of the network are still functional, whereas the nodes that are part of the remaining small clusters become nonfunctional. Therefore, for interdependent networks only the giant mutually connected cluster is of interest. The probability that two neighboring A-nodes are connected by A↔B links to two neighboring B-nodes scales as 1/N. Hence, at the end of the cascade process of failures, above pc only very small mutually connected clusters and one giant mutually connected cluster exist, in contrast to traditional percolation, wherein the cluster size distribution obeys a power law. When the giant component exists, the interdependent networks preserve their functionality; if it does not exist, the networks split into small fragments that cannot function on their own.

The model is first applied to the case of two Erdős–Rényi networks with average degrees ⟨kA⟩ and ⟨kB⟩. A random fraction, 1 − p, of the nodes in network A is removed and the iterative process of forming a1-, b2-, a3-, …, b2k-, and a2k+1-clusters is followed as described above. After each stage, n, in the cascade of failures, we determine the fraction of nodes, µn, in the largest an- or bn-cluster. At the end of the process, µn converges to µ, the probability that a randomly chosen node belongs to the largest mutually connected cluster. As N→∞, the probability µ converges to a well-defined function, µ∞(p), which has a single step discontinuity at the threshold p = pc, where it changes from zero for p < pc to µ∞(pc) > 0 for p ≥ pc. This behavior is characteristic of a first-order phase transition, quite different from a second-order phase transition such as that characterizing percolation of a single network, where µ∞(p) is a continuous function at p = pc. For finite N and p close to pc, the giant mutually connected component exists in a particular network realization with probability P(p,N). As N→∞, P(p,N) converges to a Heaviside step function, θ(p − pc), which changes discontinuously from zero for p < pc to one for p > pc. Simulation results for the value of pc agree with the analytical results presented below.
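
A rough numerical illustration of this first-order jump, reusing the cascade sketch above (numpy assumed; the range of p, the number of runs, and the network size are arbitrary choices), might look like the following. For finite N the transition is smoothed, but near the analytical threshold derived below (pc ≈ 2.4554/⟨k⟩ for two Erdős–Rényi networks) the measured fraction µ should drop abruptly rather than continuously:

    import numpy as np

    # Reuses A, B (two <k> = 4 Erdos-Renyi networks of N nodes) and
    # mutual_giant_component() from the sketches above.
    p_values = np.linspace(0.45, 0.85, 21)
    runs = 10                        # average over several random removals per value of p

    for p in p_values:
        mu = np.mean([mutual_giant_component(A, B, p) for _ in range(runs)])
        print(f"p = {p:.2f}   mu = {mu:.3f}")   # expect an abrupt drop near p ~ 0.61 for <k> = 4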

For two interdependent scale-free networks with power-law degree distributions, PA(k) = PB(k) ∝ k^(−λ), the existence criteria for the giant component are quite different from those in a single network. For a single scale-free network with λ ≤ 3, a giant component exists for every non-zero value of p. However, for interdependent scale-free networks, the giant component does not exist below the critical value pc ≠ 0, even for 2 < λ ≤ 3.

In the case of a single network, pc is smaller for a broader degree distribution. In sharp contrast, for interdependent networks a broader degree distribution results in a larger value of pc because high-degree nodes of one network can depend on low-degree nodes of the other. The hubs (defined as nodes of exceptionally large degree) that have a dominant role in the robustness of a single network become vulnerable when a cascade of failures occurs in two interdependent networks. Moreover, a broader distribution with the same average degree implies that there are more low-degree nodes. Because the low-degree nodes are more easily disconnected, the advantage of a broad distribution for a single network becomes a disadvantage for interdependent networks. This behavior is demonstrated by comparing simulation results for several scale-free networks with different λ values, an Erdős–Rényi network, and a random regular network, all with an average degree of ⟨k⟩ = 4. The simulation results are in full agreement with analytical results and show that pc is indeed higher for a broader distribution.
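
As a partial, hedged check of this claim, the cascade sketch above can be run on networks with different degree distributions. The fragment below compares only the Erdős–Rényi and random regular cases (scale-free graphs are omitted for brevity); the collapse threshold it estimates is crude, and the 5% cutoff is an arbitrary choice, but the narrower (random regular) distribution should lose its mutual giant component at a smaller p than the Erdős–Rényi network, consistent with the text:

    import numpy as np
    import networkx as nx

    N, k = 2000, 4
    builders = {
        "Erdos-Renyi":    lambda: nx.gnp_random_graph(N, k / (N - 1)),
        "random regular": lambda: nx.random_regular_graph(k, N),
    }

    for name, make in builders.items():
        A, B = make(), make()
        # Crude p_c estimate: largest p (scanning downward) at which the mutual giant
        # component has already shrunk below 5% of the nodes.
        for p in np.arange(1.0, 0.30, -0.02):
            mu = np.mean([mutual_giant_component(A, B, p) for _ in range(3)])
            if mu < 0.05:
                print(f"{name}: mutual giant component collapses near p ~ {p:.2f}")
                break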

Next, the model of interdependent networks is solved using the mathematical technique of generating functions. The generating functions are defined for network A; similar equations describe network B. Generating functions of the degree distributions, GA0(z) = Σk PA(k) z^k, are introduced. Analogously, generating functions of the underlying branching processes, GA1(z) = G′A0(z)/G′A0(1), are also introduced.
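
For the Erdős–Rényi case used later, the degree distribution is Poisson, and these generating functions reduce to a particularly simple closed form. The short sympy check below (sympy is an assumption; the same result is easily obtained by hand) confirms that GA0(z) = exp[⟨kA⟩(z − 1)] and that GA1 coincides with GA0 for Poisson degrees:

    import sympy as sp

    z, k_avg = sp.symbols("z k_avg", positive=True)
    k = sp.symbols("k", integer=True, nonnegative=True)

    # Poisson degree distribution P(k) = exp(-<k>) <k>^k / k!  (Erdos-Renyi, large N)
    P = sp.exp(-k_avg) * k_avg**k / sp.factorial(k)

    G0 = sp.summation(P * z**k, (k, 0, sp.oo))         # G_A0(z) = sum_k P_A(k) z^k
    G1 = sp.diff(G0, z) / sp.diff(G0, z).subs(z, 1)    # G_A1(z) = G'_A0(z) / G'_A0(1)

    print(sp.simplify(G0))    # exp(k_avg*(z - 1)), possibly in an equivalent form
    print(sp.simplify(G1))    # the same expression: G_A1 = G_A0 for Poisson degrees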

Random removal of a fraction, 1 − p, of the nodes will change the degree distribution of the remaining nodes, so the generating function of the new distribution is equal to the generating function of the original distribution with argument 1 − p(1 − z). Denote the subsets of nodes remaining after the random removal of a fraction 1 − p of the nodes as A0 ⊂ A and B0 ⊂ B, and note that there is a one-to-one correspondence between nodes in A0 and nodes in B0, established by the A↔B links. As the total number of nodes in network A is N, the number of nodes in A0 and B0 is N0 = pN. The fraction of nodes that belong to the giant component of network A0 is gA(p) = 1 − GA0[1 − p(1 − fA)], where fA is a function of p that satisfies the transcendental equation fA = GA1[1 − p(1 − fA)]. Analogous equations exist for network B.
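
For a concrete Erdős–Rényi example (where, as the sympy check above suggests, GA0(z) = GA1(z) = exp[⟨kA⟩(z − 1)]), the transcendental equation for fA can be solved by simple fixed-point iteration. The following sketch is one hedged way to do it; the function name and tolerances are illustrative:

    import numpy as np

    def g_fraction(p, k_avg, tol=1e-12, max_iter=10000):
        """g_A(p) = 1 - G_A0[1 - p(1 - f_A)] for an Erdos-Renyi network with average
        degree k_avg, with f_A obtained from f_A = G_A1[1 - p(1 - f_A)] by iteration."""
        G = lambda z: np.exp(k_avg * (z - 1.0))     # G_A0 = G_A1 for Poisson degrees
        f = 0.0                                     # starting guess for f_A
        for _ in range(max_iter):
            f_new = G(1.0 - p * (1.0 - f))
            if abs(f_new - f) < tol:
                break
            f = f_new
        return 1.0 - G(1.0 - p * (1.0 - f))

    print(g_fraction(1.0, 4))   # ~0.98: the giant component of a single intact <k> = 4 ER graph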

Using the generating function approach, the fraction, µn, of the nodes in the giant component after stage n in the cascade of failures obeys a simple recursion relation, and there is good agreement between simulations and theory.

To determine the final size of the giant mutually connected component, recall that the fraction of nodes in the giant mutually connected component, µ, is the limit of the sequence µn as n→∞. This limit must satisfy µ2m+1 = µ2m = µ2m−1 because the cluster is not further fragmented. This condition leads to the following system of two equations in the two unknowns x and y, where µ = x gB(x) = y gA(y):

x = gA(y)p,   y = gB(x)p.   (1)

The system of equations (1) has one trivial solution, x = 0 and y = 0, for any value of p, corresponding to a giant mutually connected component of zero size. If p is large enough, there also exists a solution for which the giant mutually connected component has non-zero size. Excluding y from these equations yields a single equation,

x = gA[gB(x)p]p,   (2)

which can be solved graphically as the intersection of the straight line y = x and the curve y = gA[gB(x)p]p. When p is small enough, the curve increases very slowly and does not intersect the straight line (except at the origin, where the intersection corresponds to the trivial solution). A nontrivial solution first emerges in the critical case, p = pc, in which the line is tangent to the curve at a single point, x = xc, where the two have equal derivatives. The tangency condition is therefore

1 = p² g′A[gB(x)p] g′B(x),   (3)

which, together with equation (2), yields the solution for pc and the critical size of the giant mutually connected component, µ(pc) = xc gB(xc).

In the case of two Erdős–Rényi networks, the problem can be solved explicitly. Then GA1(z) = GA0(z) = exp[⟨kA⟩(z − 1)] and GB1(z) = GB0(z) = exp[⟨kB⟩(z − 1)], and the system of transcendental equations (2) and (3) for the critical value p = pc can be expressed in terms of elementary functions. In the simple case ⟨kA⟩ = ⟨kB⟩ = ⟨k⟩, the critical parameters can be expressed in terms of the nontrivial root f = fA = fB ≈ 0.28467 of the equation

f = exp[(f − 1)/(2f)].

Then pc = [2⟨k⟩f(1 − f)]^(−1) ≈ 2.4554/⟨k⟩ and µ(pc) = (1 − f)/(2⟨k⟩f) ≈ 1.2564/⟨k⟩. The simulations of Erdős–Rényi networks agree with the theory.
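
These closed-form values are easy to verify numerically. A minimal check (scipy and numpy are assumptions) finds the nontrivial root and reproduces the quoted constants:

    import numpy as np
    from scipy.optimize import brentq

    # Nontrivial root of f = exp[(f - 1)/(2f)] on (0, 1); f = 1 is the trivial root.
    f = brentq(lambda f: f - np.exp((f - 1.0) / (2.0 * f)), 1e-6, 1.0 - 1e-9)

    k = 4.0                                    # example average degree <k>
    p_c = 1.0 / (2.0 * k * f * (1.0 - f))      # p_c = [2<k> f (1 - f)]^(-1)
    mu_c = (1.0 - f) / (2.0 * k * f)           # mu(p_c) = (1 - f)/(2<k> f)

    print(f"f       = {f:.5f}")     # ~0.28467
    print(f"p_c     = {p_c:.4f}")   # ~0.6139, i.e. 2.4554/<k> with <k> = 4
    print(f"mu(p_c) = {mu_c:.4f}")  # ~0.3141, i.e. 1.2564/<k>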

The known result for a single scale-free network, namely that pc→0 as N→∞ for λ ≤ 3, is not valid for two interdependent scale-free networks, where instead pc is finite for any λ > 2. Analysis of the behavior of the generating functions as z→1 shows that as x→0 the right-hand side of equation (2) can be approximated by a power law, Cx^η, where C is a constant and

η = 1/[(3 − λA)(3 − λB)].

For 2 < λA < 3 and 2 < λB < 3, η > 1. Thus, the curve y = gA[gB(x)p]p always passes below the line y = x as x→0, and for sufficiently small values of p there is no nontrivial solution, which means that the giant mutually connected component is absent. Hence, there is a percolation phase transition at some finite p = pc > 0.

The model presented here captures the important phenomenon of a cascade of failures in interdependent networks that results in the first-order percolation phase transition. The model can be generalized to the case of three or more interdependent networks (e.g., preclinical, clinical, and regulatory sections of an IND), to the case in which the A↔B links connecting the networks are unidirectional rather than bidirectional (e.g., preclinical experiments are not refined or executed on the basis of clinical results), and to the case in which a node from network A can depend on more than one node from network B (e.g., a clinical node depends upon input from a regulatory node as well as a preclinical node). All these generalizations can be treated analytically by using generating functions, provided the networks are randomly connected and uncorrelated. In real life, however, generating functions cannot be used because the networks are never randomly connected and uncorrelated. In such cases, explicit simulation of every step in the evolution of the network operation is required.