Would You Like Constant Failure Rate?

Reliability Management of Failure Rates:

Convert Failure Rates between Operating Hours and Calendar Hours

Abstract

This screed contains two parts:

In the first part, a random operating rate per calendar hour converts a constant calendar-time failure rate into an inconstant operating-time failure rate function of age. In the second part, the objective is to smooth the calendar-time failure rate to achieve an approximately constant failure rate, using down-time following operating failures; i.e., convert any operating-time failure rate function into a constant calendar-time failure rate.

Convert a constant failure rate in calendar hours to operating hours

Someone asked, "if you can give me quick explanation: For Example, EPRD 2014 part, Category : IC, Subcategory : Digital, Subtype1 : JK, Failure Rate (FPMH) = 0.083632 per (million) calendar hour! How do you convert to operational hour?" I.e., life T has exponential distribution in calendar (million) hours.

Good question. I did an example assuming the rate R = operating hours/calendar hour was distributed Uniform[0, 0.5]. The failure rate function of T/R (T = calendar-hour time to failure) is

(23.914 + (–23.914 – 0.99999x)Exp[–0.041816x])/(x(23.914 – 23.914Exp[–0.041816x])), where x = operating-hour value of T/R.

TOO BAD, SO SAD! If the operating rate R is random, then converting a constant calendar-time failure rate results in an operating-hour failure rate that is NOT CONSTANT! For the example, the operating-time failure rate function of T/R decreases almost linearly from 0.0218 (at age 0) to 0.018 (at age ~20 million operating hours) (figure 1).

Figure 1. Failure rates in operating and calendar time scales; x-axis is operating time
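
For the curious, here is a minimal Python sketch (mine, not EPRD's) that evaluates that failure rate function via the survival function P[T/R > x] = (1 – Exp[–0.041816x])/(0.041816x), which is algebraically the same expression:

```python
import numpy as np

lam, b = 0.083632, 0.5   # calendar failure rate (per million hours); upper limit of R
a = lam * b              # 0.041816; note 1/a = 23.914

def operating_failure_rate(x):
    """Failure rate of T/R at x million operating hours, derived from the
    survival function P(T/R > x) = (1 - exp(-a*x))/(a*x) for independent
    T ~ Exponential(lam) and R ~ Uniform[0, b]."""
    u = a * x
    return (1 - np.exp(-u) - u * np.exp(-u)) / (x * (1 - np.exp(-u)))

for x in (0.01, 5.0, 10.0, 20.0):
    print(f"x = {x:5}: failure rate = {operating_failure_rate(x):.4f}")
# Output decreases almost linearly toward ~0.018 by x = 20 (figure 1).
```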

An inconstant failure rate (FR) function means the simple series-system MTBF prediction, 1/(FR1 + FR2 + ... + FRk) for k independent components, is wrong! It is wronger if the system is not in series. It is even wronger if parts do not all have the same distribution of operating hours per calendar hour. It is wrongest if parts' operating hours per calendar hour are differently distributed, statistically dependent, or all of the above.

I wonder how EPRD came up with Failure Rate (FPMH) = 0.083632 per million calendar hours for an IC?

How to get a constant failure rate in calendar time?

The final test of truth is ridicule. Very few dogmas have ever faced it and survived. – H. L. Mencken

Motivations for constant calendar-time failure rate

Could there be a place in reliability management for a constant failure rate equal to 1/MTBF? Many people believe reliability is MTBF, act accordingly, argue about twitches in Annualized Failure Rate (AFR), adjust forecasts to resemble actual demands, and suffer the consequences of errors due to assuming constant failure rates. Other people make untenable, mathematically convenient assumptions about operating- and calendar-time reliability and suffer the consequences of errors due to untested distribution assumptions [Vergenz]. Reliability guidance documents approximate inconstant system failure rate functions with constants [AOpen, Lee and Ingegneri, Gamatronic, NASA], and the author of the NASA document was offended when I pointed this out. He replied, “We do not feel that adding complexity to this process is justified at this time.” The rest of us endure and try to manage the consequences of random failures, with warranty reserves, excess and sometimes-obsolescent spares inventories, recalls, “velocity management,” service tiger teams, and ??? [Bowles, McLinn, ReliaSoft, Schenkelberg]. Why not make life easier on others and ourselves by managing operating reliability so that there is a constant, calendar-time failure rate?

Reliability-Centered Maintenance (RCM) asserts that the optimal maintenance policy for parts with constant failure rates is to leave them alone and replace them on failure [Kamins and McCall, Nowlan and Heap]. Their documents are vague about whether those failure rates are in operating time or calendar time. Kamins and McCall said, “You need failure data—information on which parts failed and at what age.” If the operating-time failure rate is constant, then constant operating time per calendar hour will yield a constant calendar-time failure rate too. RCM is casual about verifying both constant operating-time failure rate and constant operating time per calendar hour. Even if the operating-time failure rate is constant, wouldn’t it be nice if the calendar-time failure rate were constant too?

We live one hour per calendar hour, but some products and their components operate intermittently, less than one hour per calendar hour, randomly. Their failures could occur at inconvenient times for humans. Their inconstant calendar-time failure rates screw up operations, service, maintenance, and spares inventory by introducing unnecessary variability.  

Misoversimplified spare-parts demand forecasts assume demand rates are constant: the demand forecast is Σ(installed base)*(demand rate). The constant-demand-rate assumption is embedded in the Wilson square-root lot-size formula, EOQ reorder points, and the Defense Logistics Agency model [Long and Engberson]. Even newsboy and (s, S) inventory models for random demands assume expected demands are constant over time. Such models incur unnecessary service and maintenance labor costs due to fluctuations in actual demands driven by varying usage, ages, and unreliability. Furthermore, spares inventory incurs holding and backorder costs due to excess inventory or stock-outs. When demands arise from unreliability of products or their service parts, these costs could be reduced or avoided by reliability management to achieve constant demand rates.

The first operations research problem I learned was how to smooth production and service labor costs: the costs of hiring, training, and worker layoffs induced by variable demands [Kramarz and Michaud]. One way to smooth production and service costs is “demand leveling.” In service, maintenance, and spares inventory contexts, demand leveling could be achieved by adjusting operating hours per calendar hour to yield approximately constant failure rates per calendar hour.

A Kelly AFB engine shop manager said, "Build that sucker [F100-PW-100 engine] so it doesn't come back for 600 [operating] hours." Imagine you ran the US Air Force. Part of your job would be to produce the “flying-hour program” [Air Force Instructions 11-101 and 102, 1 November 2002 and 30 August 2011, http://static.e-publishing.af.mil/production/1/af_a3/publication/afi11-101/afi11-101.pdf and http://static.e-publishing.af.mil/production/1/af_a3_5/publication/afi11-102/afi11-102.pdf]. The flying-hour program tells how many operating hours each MAJCOM and their aircraft and crews fly per month or calendar quarter. Sure, you have to produce a flying-hour program that meets tactical and training requirements, but you must also recognize the costs of jacking the flying-hour program around too much. Huge budgets for labor, fuel, parts, and facilities depend on the flying-hour program (e.g., $20,000/hour for a B-52 way back then). Except for expensive engines and their modules, for which the US Air Force uses actuarial forecasts, the US Air Force forecasts service parts’ costs using

(installed base)*(cost per failure)*(failure rate/operating hour)*(flying-hours).

Tyler Hess’ 2009 AFIT Thesis says, “Inaccurate [cost] estimates result in budget risks and undermine the ability of Air Force leadership to allocate resources efficiently.” “…the current forecasting method’s assumption of a proportional relationship between cost and flying hours is inappropriate and the relationship is actually inelastic.” [By inelastic, I believe he meant that cost is not proportional to flying-hours.]

Lewis Hogge’s 2012 AFIT thesis starts by saying, “The USAF generally does not know the reliability of its fielded repairable systems.” In fact, the USAF does know actuarial rates a(t), but only for engines and major engine modules tracked by serial numbers for operating hours and cycles, and it uses actuarial forecasts Σa(t)n(t), where n(t) is the installed base of age t.

For years, the US Air Force sent me a heavy, blue book containing approved part failure rates. I finally wrote and asked them to stop sending it. I’d been trying to show the US Air Force how to estimate and use actuarial forecasts Σa(t)*n(t), with age-specific failure rates a(t), for all service parts, not just expensive engines and modules tracked by serial number for hours and cycles. (n(t) is the installed base of age t in the forecast interval.) Spares demand forecasts and stock-level recommendations could be more accurate and precise, using installed base and spares consumption data required by GAAP, without tracking parts by serial number, hours, and cycles.
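
Here's a minimal sketch of the difference, with hypothetical age-specific failure rates a(t) and installed base n(t) (made-up numbers, for illustration only):

```python
# Actuarial forecast sum(a(t)*n(t)) vs. flat-rate forecast (installed base)*(rate).
a = [0.02, 0.05, 0.09, 0.14]      # hypothetical failures/unit/period, ages 0..3
n = [400, 300, 200, 100]          # hypothetical installed base of each age

actuarial = sum(at * nt for at, nt in zip(a, n))
flat = sum(n) * (sum(a) / len(a))  # constant-rate forecast using the average rate

print(f"actuarial forecast: {actuarial:.1f} failures")   # 55.0
print(f"flat-rate forecast: {flat:.1f} failures")        # 75.0
# The two differ whenever a(t) varies with age and ages are not uniform.
```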

Why not manage the random rate R = operating hours per calendar hour of products and parts, so that the calendar-time failure rate and consequent demand rates are constant in calendar time? That would achieve demand leveling and improve the accuracy and precision of the misoversimplified demand forecast. Imagine how constant output rates would simplify coordination in a sequential work-station production line or in a multi-echelon production, service, and repair organization like the US Air Force Materiel Command. This is a serious suggestion; I’m not kidding.

Randomize operation rate R so that failure rate function of T/R is constant in calendar time? (T is time-to-failure in operating hours.)

The problem is to find a distribution function of the random operating rate R so that the calendar-time failure rate function of T/R is constant for any distribution of operating-time-to-failure T. That’s not a problem if R is a constant and the operating-time failure rate of T is constant; then R = (constant calendar-time failure rate)/(failure rate per operating hour), for whatever constant failure rate per calendar hour you want, as long as it’s less than the failure rate per operating hour.

If T is random, then there could be distributions of R and parameter values that make the calendar-time failure rate equal to a constant, as long as that constant is less than the operating-time failure rate(s). With independent, uniform distributions of T and R, T ~ Uniform[0, θ] in operating hours and R ~ Uniform[0, b], the uniform distribution upper limit should be b = 2cθ/(1 + cx) at age x! I.e., the operating proportion of time should be distributed uniformly on 0 < R < b(x). With an independent, exponential distribution of T, T ~ Exponential[λ] in operating hours and R ~ Uniform[0, b], the distribution parameter should be b = (cx – 1 – ProductLog[Exp[cx – 1](cx – 1)])/(λx) at age x! (ProductLog[z] is the solution for w in z = wExp[w].) Again, the random, uniform distribution of operating hours per calendar hour depends on the age x of the unit being operated.
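
Here's a sketch of these age-dependent upper limits b(x) in Python. SciPy's lambertw plays the role of ProductLog; the parameter values are illustrative, and using the k = –1 branch of Lambert W (rather than the principal branch, which returns the trivial b = 0 when cx < 1) is my reading of the formula:

```python
import numpy as np
from scipy.special import lambertw

def b_uniform(x, c, theta):
    """Upper limit of R ~ Uniform[0, b] at age x when T ~ Uniform[0, theta]:
    b = 2*c*theta/(1 + c*x)."""
    return 2 * c * theta / (1 + c * x)

def b_exponential(x, c, lam):
    """Upper limit of R ~ Uniform[0, b] at age x when T ~ Exponential[lam]:
    b = (c*x - 1 - ProductLog[Exp[c*x - 1]*(c*x - 1)])/(lam*x), using the
    k = -1 branch of Lambert W for the nonzero solution (assumes c*x < 1)."""
    u = c * x - 1
    return (u - lambertw(u * np.exp(u), -1).real) / (lam * x)

c, theta, lam = 0.05, 10.0, 1.0   # illustrative target rate and T parameters
for x in (1.0, 5.0, 10.0):
    print(f"x = {x:4}: b_uniform = {b_uniform(x, c, theta):.3f}, "
          f"b_exponential = {b_exponential(x, c, lam):.3f}")
```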

Manage down-times to yield constant failure rate in calendar time?

OK, I was not entirely serious. It seems pretty dumb to operate a random proportion of time R, and it seems really stupid when the distribution of R depends on the age of the unit being operated. The variance around the chosen constant calendar-time failure rate could be annoying and defeat the demand-leveling objective of smoothing the calendar-time failure rate.

If the distribution of R must depend on the age of the unit being operated, why not let the observed failure rates determine down-times after the failures at operating times t(1), t(2),…? If the calendar-time failure rate 1/t(1) is greater than a desired constant calendar-time failure rate c, then stop operation until calendar time t’(1) = t(1) + 1/(c+d), where d is your choice (choosing down-time 1/c – t(1), provided 1/c – t(1) > 0, makes the observed rate exactly c). Then run after t’(1) until the next failure at t’(1) + t(2), where t(2) is the operating time to the second failure. If the calendar-time failure rate 2/(t’(1) + t(2)) > c, then stop operation until t’(2) = t’(1) + t(2) + 1/(c+d). Etc.

For deterministic failure times Δt in operating hours, the recursion yields failure rates FR(k):

FR(1) = 1/(Δt + 1/(c+d)) and

FR(k) = k/t’(k) = k/(t’(k–1) + Δt + 1/(c+d)), k = 2, 3,…

What is the value of d? Each cycle adds Δt + 1/(c+d) calendar hours, so setting the asymptotic rate 1/(Δt + 1/(c+d)) equal to c gives d = c²Δt/(1 – cΔt), with a little help from algebra. This yields an increasing calendar-time failure rate function that is asymptotically the constant c. Are there any ways to accelerate convergence? Could other down-time functions besides 1/(c+d) work? Sure. I tried power and exponential functions. However, they overshoot (figure 2).

Figure 2. Demand-leveled failure rates at failure times. Failure rate is 0.1 per operating hour. “Linear Smooth…” uses d = c²Δt/(1 – cΔt); “Power” uses d^k; “2-Power” uses a·d^b; and “Exponential” uses d·Exp[–d].
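
Here's a minimal sketch of the deterministic recursion (my code, illustrative parameters). With d = c²Δt/(1 – cΔt), each cycle takes Δt + 1/(c+d) = 1/c calendar hours, so the failure rate at failure epochs is c; any other fixed d settles at 1/(Δt + 1/(c+d)) instead:

```python
c, dt = 0.05, 10.0                  # target calendar rate; deterministic operating TBF
d = c**2 * dt / (1 - c * dt)        # = 0.05; down-time after each failure is 1/(c+d)

def fr_sequence(d, n=100):
    """Calendar-time failure rate k/t'(k) at each failure epoch."""
    t, rates = 0.0, []
    for k in range(1, n + 1):
        t += dt + 1.0 / (c + d)     # operate for dt, then stand down for 1/(c+d)
        rates.append(k / t)
    return rates

print(fr_sequence(d)[-1])           # 0.05: cycle length dt + 1/(c+d) equals 1/c
print(fr_sequence(0.2)[-1])         # ~0.0714: wrong d settles at 1/(dt + 1/(c+d))
```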

What if failure times Δt in operating hours are random variables? I ran exponential and Weibull simulations of 100 failures to find the d that achieves calendar-time failure rate c and got approximately the same answer as for deterministic Δt. See figures 3 and 4 and table 1. Table 1 shows how close the optimum d is to 0.05 = c²E[Δt]/(1 – cE[Δt]) (c = 0.05, E[Δt] = 10) for the distributions simulated.

Figures 3 and 4. Demand leveling for Δt ~ Exponential[1/10] and Δt ~ Weibull[10.562333, 10], with MTBFs of 10 operating-time units. X-axis is calendar time units

Table 1. Alternative time-between-failures distributions and demand-leveling parameters. All have means of 10 operating-time units. Exponential and Weibull entries are simulations of 100 failures, so simulated distribution parameters vary from those in the table.

The near-equality of d-values for alternative distributions of operating times between failures leads to the conjecture that the same reliability operation management, with d = c²E[T]/(1 – cE[T]), stochastically achieves convergence to the desired constant failure rate c in calendar time, at least for times to failure with relatively peaked probability density functions, for a renewal process in operating time. “Proof is left as an exercise.”
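
A simulation sketch of the conjecture (my code, not the author's figures): apply down-time 1/(c+d) with d = c²E[T]/(1 – cE[T]) to random operating times between failures and watch the observed calendar-time failure rate settle near c. I read Weibull[10.562333, 10] as scale 10.562333 and shape 10:

```python
import numpy as np

rng = np.random.default_rng(0)
c, mean_t = 0.05, 10.0                  # target calendar rate; E[T] in operating hours
d = c**2 * mean_t / (1 - c * mean_t)    # = 0.05, so down-time 1/(c+d) = 10

draws = {"Exponential[1/10]": lambda n: rng.exponential(mean_t, n),
         # assumed reading: Weibull[10.562333, 10] = scale 10.562333, shape 10
         "Weibull[10.562333, 10]": lambda n: 10.562333 * rng.weibull(10.0, n)}

for name, draw in draws.items():
    tbf = draw(100)                     # 100 operating times between failures
    t = np.cumsum(tbf + 1.0 / (c + d))  # calendar time at each failure epoch
    fr = np.arange(1, 101) / t          # observed calendar-time failure rates
    print(f"{name}: FR(100) = {fr[-1]:.4f} (target c = {c})")
```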

Let the record show…

I proposed maintenance of an alternating renewal process (http://www.math.uah.edu/stat/renewal/Alternating.html) so that iid up-down cycles were the same length, stochastically. In other words, make repair times depend on the previous time to failure, aka opportunistic maintenance. This was intended to smooth workflow and buffer requirements among successive Solyndra solar tube workstations. I found bivariate distributions so that workstation cycle times Z = X + Y have distributions F(z) = ∫P[Y ≤ z – x|X = x]dF(x), where the integral is from 0 to z [Kakubava, George], and the joint distributions of workstations' X and Y were dependent in a particular way. For example, find the joint distribution function so that the failure rate function in calendar time is c = ∫f[z – x|X = x]dF(x)/(1 – ∫P[Y ≤ z – x|X = x]dF(x)), or find the joint distribution so that E[X + Y] has some specified value given the marginal distributions of X and Y (copulas). I can’t find that exercise, but I could reproduce it if you want.

Stochastically optimizing down-times following random failures

It is tempting to try d(k) = E[c²Δt(k–1)/(1 – cΔt(k–1))] as a constant for every k, because using a random d(k) seems a nuisance. This isn’t a game-theory contest between adversaries, although some learning about the distribution of Δt(k) should take place as observations are acquired. This is really a problem in stochastic optimization. Imagine each product or part as input to a G/G/1 service system with recycling, in which the service time depends on the previous input time. This seems to resemble packet-switching attempts to enforce a maximum rate or to smooth the output rate [Whitt, Cidon et al.].

I tried Excel Solver’s Evolutionary method to optimize a simulation, but that’s not stochastic optimization. The problem is to find a functional form or value of d to achieve an asymptotically constant failure rate c, assuming the underlying operating times between failures Δt(k) are stationary. I.e., solve

FR(1) = 1/(Δt(0) + 1/(c+d)) and

FR(k) = k/t’(k) = k/(t’(k–1) + Δt(k–1) + 1/(c+d)), k = 2, 3,…

where Δt(k) is a recurrent stochastic process, so that FR(k) converges to a constant calendar-time failure rate c in some sense. I tried the cross-entropy method (https://en.wikipedia.org/wiki/Cross-entropy_method) for stochastic optimization, but the method didn’t converge in four steps.
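
Here's a minimal cross-entropy-method sketch for this problem (mine, not the four-step attempt above): sample candidate d values, score each by the squared deviation of the simulated terminal failure rate from c, and refit the sampling distribution to the elite:

```python
import numpy as np

rng = np.random.default_rng(1)
c, mtbf, n_fail = 0.05, 10.0, 100

def terminal_rate(d):
    """Simulate the recursion with Dt(k) ~ Exponential[1/10] and down-time
    1/(c+d) after every failure; return FR(n_fail), the last calendar rate."""
    t = np.sum(rng.exponential(mtbf, n_fail) + 1.0 / (c + d))
    return n_fail / t

# Cross-entropy method in one dimension: sample d, keep the elite, refit, repeat.
mu, sigma = 0.1, 0.05
for _ in range(20):
    d_samples = np.abs(rng.normal(mu, sigma, 50))       # keep d > 0
    losses = [(terminal_rate(d) - c) ** 2 for d in d_samples]
    elite = d_samples[np.argsort(losses)[:10]]          # best 20%
    mu, sigma = elite.mean(), elite.std() + 1e-6
print(f"d = {mu:.4f}; compare c^2*E[Dt]/(1 - c*E[Dt]) = "
      f"{c**2 * mtbf / (1 - c * mtbf):.4f}")
```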

Solving d = E[c²Δt/(1 – cΔt)] with random Δt can be done for some distributions.

T (aka Δt) ~ Uniform[a, b] => d = (–ac + bc – Log[1 – ac] + Log[1 – bc])/(a – b) = –(bc + Log[1 – bc])/b when a = 0

T ~ Exponential[λ] => d = –(1/2)Exp[–λ/c](2c Exp[λ/c] – 2λ ExpIntegralEi[λ/c] + λ Log[1/c] + 2λ Log[–c] – λ Log[c]), where ExpIntegralEi[z] = –∫Exp[–t]/t dt, integrated from –z to infinity (principal value).

Perhaps it would be interesting to find the d that minimizes (E[FR(k)] – c)² asymptotically for random lives. FR(k) = k/(kΔt + k/(c+d)) = 1/(Δt + 1/(c+d)), and its expected value depends on the distribution of T (aka Δt). I.e., solve E[1/(Δt + 1/(c+d))] = c for d. This yields,

T (aka Δt) ~ Uniform[0, b] => d = (Exp[bc] – 1 – bc)/b

Mathematica gave the mean formula for the exponential distribution but couldn’t solve the equation for d. Mathematica couldn’t find the mean for the Weibull or lognormal distributions. Numerical methods sometimes work.

T (aka Δt) ~ Exponential[1/10] (MTBF = 10) => d = 0.0275744 for c = 0.05.
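
That value can be checked numerically. For T ~ Exponential[λ], E[1/(T + a)] = λExp[λa]E1[λa] with a = 1/(c+d), where E1 is the exponential integral; here's a sketch using SciPy, plus a check of the Uniform[0, b] closed form above:

```python
import numpy as np
from scipy.special import exp1
from scipy.optimize import brentq

c, lam = 0.05, 1 / 10   # desired calendar-time rate; exponential rate (MTBF = 10)

def mean_rate(d):
    """E[1/(T + 1/(c+d))] for T ~ Exponential[lam]:
    lam * exp(lam*a) * E1(lam*a) with a = 1/(c+d)."""
    a = 1.0 / (c + d)
    return lam * np.exp(lam * a) * exp1(lam * a)

d = brentq(lambda d: mean_rate(d) - c, 1e-6, 1.0)
print(d)                 # ~0.0276, matching the 0.0275744 in the text

# Check the Uniform[0, b] closed form d = (Exp[b c] - 1 - b c)/b:
b = 20.0
d_u = (np.exp(b * c) - 1 - b * c) / b
a = 1.0 / (c + d_u)
print(np.log((a + b) / a) / b)  # E[1/(T+a)] for T ~ Uniform[0, b]; equals c = 0.05
```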

Conclusions

Converting a constant calendar-time failure rate, with random operating hours per calendar hour, yields an inconstant operating-time failure rate function. That blows up simple reliability and MTBF predictions that depend on constant failure rates.

It’s stupid to follow failures with random down-times to achieve a constant failure rate in calendar time, because random down-times increase the variability of cycle time. Any operating-time failure rate function can be converted to an approximately constant calendar-time failure rate using down-time following failures; that down-time depends on the currently observed calendar-time failure rate and the desired constant calendar-time failure rate. If operating times between failures Δt are deterministic, then the down-time is 1/(c+d) with d = c²Δt/(1 – cΔt). If Δt are random, then d = E[c²Δt/(1 – cΔt)] may be OK; alternatively, solving E[1/(Δt + 1/(c+d))] = c for d is possible for some distributions, including empirical ones.

Send data on field reliability and operating rates per unit time (please specify units) to pstlarry077@gmail.com, and I will send back nonparametric field reliability and failure rate function estimates, operating-rate distribution estimates, and conversions between operating and calendar time scales. If TBFs and down-times alternate, I will check for failure-rate smoothness, estimate the asymptotic constant failure rate, and try to help you achieve it. If you include cost data on hiring, separations, and downtime, I will try to optimize the process.

References

AOpen, “Reliability Prediction of AOpen DX2G Plus,” AOpen Component Solutions, http://www.aopen.com/products/server/pdf/DX2G_MTBF.pdf

Bowles, John, “Caution, Constant Failure-Rate Models may be Hazardous to your Design,” IEEE Transactions on Reliability, Volume: 51, Issue: 3, Sept. 2002, http://ieeexplore.ieee.org/document/1028412/

Cidon, Israel, Roch Guerin, Asad Khamisy, and Moshe Sidi, “On Queues with Interdependent Interarrival and Service Times,” doi:10.1017/S0269964800004198, 1996, https://pdfs.semanticscholar.org/1348/eb530bacf163cf693b944371966a72090876.pdf

Gamatronic, “Comparing Series and Parallel Redundant UPS Systems,” Gamatronic Electronic Industries, Ltd., http://www.gamatronic.co.il/downloads/ser_par.pdf

George, L. L., "The Superposition of an Alternating Poisson Square Wave Process and a Poisson Shock Process," NATO Advanced Study Institute on Statistical Extremes, Probability Problems in Seismic Risk, Vimeiro, Portugal, Sept. 1983

Hess, Tyler, “Cost Forecasting Models for the Air Force Flying Hour Program,” AFIT/GCA/ENV/09-M07, Air Force Institute of Technology, March 2009, http://www.dtic.mil/dtic/tr/fulltext/u2/a499658.pdf

Hogge, Lewis J., “Effective Measurement of Reliability of Repairable Air Force Systems,” AFIT/GSE/ENV/12-S02DL, Air Force Institute of Technology, September 2012, https://pdfs.semanticscholar.org/a005/9387c34417acc0a6223be7eaed2dbb3d9476.pdf

Kakubava, Revaz, “Analysis of Alternating Renewal Processes with Depended [sic] Components,” R&RATA, # 1 (Vol.1) 2008, March, http://www.gnedenko‑forum.org/Journal/2008/012008/RATA_1_2008-08.pdf

Kamins, Milton, and J. J. McCall, Jr., “Rules for Planned Replacement of Aircraft and Missile Parts,” RAND Research Memorandum RM-2810-PR, November 1961

Kramarz, Francis and Marie-Laure Michaud, “The Shape of Hiring and Separation Costs,” Discussion Paper 1170, June 2004, http://ftp.iza.org/dp1170.pdf

Lee, Lydia, and Antonino Ingegneri, “Swift X-Ray Telescope Reliability,” NASA GSFC, http://www.swift.psu.edu/xrt/Documents/ICDR/27_ICDR_Reliability.pdf, May 2001

Long, William S. and Douglas H. Engberson, “The Effect of Violations of the Constant Demand Assumption on the Defense Logistics Agency Requirements Model,” AFIT/GLM/ILAL/94S-15, 1994, http://www.dtic.mil/dtic/tr/fulltext/u2/a285272.pdf

McLinn, James A., “Constant Failure Rate-A Paradigm in Transition,” Quality and Reliability Engineering International, Vol. 6, Issue 4, Sept./Oct. 1990, pp. 237-241, http://onlinelibrary.wiley.com/doi/10.1002/qre.4680060405/abstract

NASA, “Active Redundancy,” NASA Preferred Reliability Practices No. PD-ED-1216, NASA Headquarters, Washington, DC, http://www.hq.nasa.gov/office/codeq/relpract/n1216.pdf

Nowlan, F. Stanley and Howard F. Heap, “Reliability-Centered Maintenance,” United Airlines, San Francisco, Dec. 1978 

ReliaSoft, “Limitations of the Exponential Distribution for Reliability Analysis,” http://www.reliasoft.com/newsletter/4q2001/exponential.htm

Schenkelberg, Fred, “The Constant Failure Rate Myth,” http://nomtbf.com/2015/09/the-constant-failure-rate-myth/

Vergenz, Peter, “Why I Hate MTBF, Like Failure Rates, and Love Life Data (Weibull) Analysis,” www.linkedin.com, July 16, 2016, https://www.linkedin.com/pulse/why-i-hate-mtbf-like-failure-rates-love-life-data-weibull-vergenz

Whitt, Ward, “Stabilizing performance in a single-server queue with time-varying arrival rate,” Queueing Systems, (2015) 81:341–378, DOI 10.1007/s11134-015-9462-x, http://www.columbia.edu/~ww2040/stable_single_QUESTA_2015.pdf