Why Kill Controls?

“The effects of chance are the most accurately calculable, and the least doubtful of all factors in the evolutionary situation.” R. A. Fisher, ca. 1953

Aug. 17, 2020 This article was inspired by Stephen Fisher, Warroad, Minnesota, who caught COVID-19, was hospitalized 42 days, 18 days on ventilator, treated with plasma antibodies, and recovered.

Nov. 13, 2020 Changed “accept null hypothesis” to “do not reject null hypothesis” thanks to Patrick Giuliano.

Dec. 30, 2020 Placebo efficacy and non-representative clinical trials   

Abstract

Abbott Laboratories clinical trials statisticians complained that they had a hard time finding enough sample subjects to test treatments or drugs for some diseases. David Moore (former statistics group leader of Abbott Laboratories) said, “We’re lucky to find 100 subjects with the disease, and we have to split them into control and treatment blocks.”

The FDA says, “Real-world data and real-world evidence are playing an increasing role in health care decisions.” https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence. Why not compare treated sample life statistics with untreated population statistics?

There is at least one clinical trial to see whether plasma-antibody treatment improves the corona virus case fatality rate (deaths/cases) [Joyner et al.]. Imagine a clinical trial to see whether treatment for corona virus prolongs lifetime (survival) or reduces time to recovery:

Ho: the survival function of a treated sample is the same as that of the untreated population, vs.

Ha: the survival function of the treated sample is stochastically better than that of the population: P[Life>t|treated sample] ≥ P[Life>t|untreated population] for all t, with strict inequality for some t.

Vaccine efficacy is computed as 1-Risk(vaccinated)/Risk(unvaccinated), where Risk() is infections/sample size. Pfizer received emergency use authorization by showing COVID-19 vaccine efficacy = 1-(8/21500)/(162/21728) = 95.01%. Suppose that, instead of comparing with the sample of 21728 unvaccinated subjects, we compare with the Risk() of the unvaccinated US population. Then placebo (saline) efficacy = 1-(162/21728)/(17.8M/328.2M) = 1-0.007456/0.054235 = 86.25%! The difference between the trial's unvaccinated case rate, 0.75%, and the US population case rate, 5.42%, indicates that Pfizer’s sample is not representative. Others recognize this problem [Averitt et al.]. What are the consequences? COVID-19 vaccine efficacy relative to the population may be closer to 99% than 95%: 1-(8/21500)/(17.8M/328.2M) = 99.31%. Translating clinical trial evidence into medical practice may be facilitated by comparing representative samples with populations, treated vs. untreated, which also avoids the ethical dilemma of killing controls.
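The efficacy arithmetic above can be checked with a few lines of Python. This is only a sketch of the formula 1-Risk(treated)/Risk(reference) using the counts quoted in the text; the function and variable names are illustrative assumptions.

```python
def risk(cases, n):
    """Attack rate: infections divided by the number of subjects at risk."""
    return cases / n


def efficacy(risk_treated, risk_reference):
    """Efficacy = 1 - Risk(treated) / Risk(reference)."""
    return 1.0 - risk_treated / risk_reference


# Counts quoted in the text.
risk_vaccinated = risk(8, 21500)
risk_placebo = risk(162, 21728)
risk_us_population = risk(17.8e6, 328.2e6)

vaccine_vs_placebo = efficacy(risk_vaccinated, risk_placebo)
placebo_vs_population = efficacy(risk_placebo, risk_us_population)
vaccine_vs_population = efficacy(risk_vaccinated, risk_us_population)

print(f"vaccine vs. placebo arm:    {vaccine_vs_placebo:.2%}")
print(f"placebo vs. US population:  {placebo_vs_population:.2%}")
print(f"vaccine vs. US population:  {vaccine_vs_population:.2%}")
```

The second line of output is the "placebo efficacy": a saline injection appears to protect only because the trial subjects had a much lower case rate than the US population.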

Typical randomized clinical trial hypothesis tests presume similar data from randomly treated and untreated samples. Treated sample lifetime data differ from untreated population case and death or recovery count data, although both contain survival function information. Treated sample subjects produce (censored) life data, times from infection to death or recovery, by patient name or unique identifier. The Kaplan-Meier nonparametric maximum likelihood estimator could be used to estimate the treated survival function, P[Life>t|Treated]. The untreated population produces case and death or recovery counts, without lifetime data. Periodic cohort case and death or recovery counts are statistically sufficient to make nonparametric population estimates of survival functions, P[Life>t|Untreated], https://sites.google.com/site/fieldreliability/corona-virus-survival-analysis/.
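For readers who have not computed one by hand, here is a minimal sketch of the Kaplan-Meier product-limit estimator for censored life data. The six lifetimes are hypothetical, and the function name and data layout are assumptions made for illustration.

```python
from collections import Counter


def kaplan_meier(times, observed):
    """Return [(t, S(t))] at each distinct time with an observed endpoint.

    times:    time from infection to endpoint (death or recovery) or to censoring
    observed: True if the endpoint was observed, False if the subject was censored
    """
    deaths = Counter(t for t, d in zip(times, observed) if d)
    exits = Counter(times)  # everyone who leaves the risk set at time t
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        if deaths[t]:
            surv *= 1.0 - deaths[t] / at_risk  # product-limit step
            curve.append((t, surv))
        at_risk -= exits[t]  # remove events and censored subjects
    return curve


# Hypothetical treated sample: days to endpoint, False = censored.
sample_times = [3, 5, 5, 8, 10, 12]
sample_observed = [True, True, False, True, False, True]
km_curve = kaplan_meier(sample_times, sample_observed)
for t, s in km_curve:
    print(t, round(s, 4))
```

Subjects censored at the same time as an event are counted as at risk for that event, which is the usual convention.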

The clinical trial hypothesis test could be done by comparing a sample survival function estimate with a population survival function estimate: Kolmogorov-Smirnov maximum absolute difference, likelihood ratio, or other test statistics. The FDA would call this a “single-arm” trial and the population an “external” control. Dan Moore (biostatistician) says, “The FDA does accept ‘historical controls’ as a comparison to treated in phase II trials. You have to show that there has been no change in your endpoint over chronological time.” [LeBlanc and Tangen 2012, Belin et al. 2017, Dean et al. 2020, and others] Death is a clear endpoint, but recovery from corona virus may not be as clear.
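The Kolmogorov-Smirnov statistic mentioned above is simply the largest vertical gap between two survival step functions. A minimal sketch follows, assuming each estimate is stored as a sorted list of (time, S(time)) pairs; the two curves are hypothetical, and the censored-data adjustment of critical values [Koziol and Byar] is not included.

```python
from bisect import bisect_right


def step_value(curve, t):
    """Value at time t of a right-continuous step survival function.

    curve is a sorted list of (time, S(time)); S is 1 before the first time.
    """
    times = [c[0] for c in curve]
    i = bisect_right(times, t)
    return 1.0 if i == 0 else curve[i - 1][1]


def ks_distance(curve_a, curve_b):
    """Maximum absolute difference between two step survival functions.

    Both functions are constant between jumps, so it is enough to compare
    them at every jump time of either curve.
    """
    grid = sorted({t for t, _ in curve_a} | {t for t, _ in curve_b})
    return max(abs(step_value(curve_a, t) - step_value(curve_b, t)) for t in grid)


# Hypothetical estimates: treated sample vs. untreated population.
sample_curve = [(1, 0.8), (2, 0.5), (3, 0.4)]
population_curve = [(1, 0.9), (2, 0.7), (3, 0.45)]
distance = ks_distance(sample_curve, population_curve)
print(round(distance, 4))
```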

Background

The 1999 paper with the same title presumed that both the sample and population data consisted of cases and death counts, without lifetime data. It uses a likelihood ratio test. But life tests generate lifetime data, because sample subjects are tracked by name or unique identifier. Lifetimes give more precise survival function estimates than case and death counts; e.g., the Kaplan-Meier nonparametric maximum likelihood estimator for censored, grouped life data vs. nonparametric maximum likelihood estimator for case and death counts [https://sites.google.com/site/fieldreliability/random-tandem-queues-and-reliability-estimation-without-life-data/].

The references by Grover and by Fleming and Harrington deal with censored life tests. Grover’s paper and my 1997 presentation assumed equal size treated and untreated samples of life data. What if you had case and endpoint event count data from a huge untreated population and life data from a treated sample much smaller than the population?

This problem falls in the realm of “neutrosophic” statistics [Smarandache], because the population case cohort and endpoint event counts could have come from a variety of lives with the same periodic event counts. Table 1 shows grouped life data and event counts from two cohorts started in two periods. Table 2 shows alternative life data that result in the same event counts in the bottom row. These alternatives don’t have the same probabilities, assuming the population survival function estimate from population case and event counts.

Table 1. Grouped life data and case and endpoint event counts. The period 1 cohort has 2 deaths in period 1 and 3 in period 2. The period 2 cohort has 2 deaths in period 2. The bottom row contains the sums of endpoint event counts by period. More than one period cohort (cross-section) of population cases is needed to reduce length-bias without life data [Chan].

 

Table 2. Alternative grouped life data that could have resulted in the same event counts as in the table 1 bottom row. Each pair of columns shaded yellow shows alternative grouped life data that give the same column sums as in table 1.
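To make the ambiguity concrete, the sketch below enumerates every grouped life data set that is consistent with given period event counts. It uses the event counts described for table 1 (2 events in period 1, 5 in period 2); the cohort sizes of 10 cases each are a hypothetical assumption, and the function names are illustrative.

```python
def compositions(total, caps):
    """Yield tuples of non-negative integers that sum to total, part i <= caps[i]."""
    if not caps:
        if total == 0:
            yield ()
        return
    for first in range(min(total, caps[0]) + 1):
        for rest in compositions(total - first, caps[1:]):
            yield (first,) + rest


def alternatives(cohort_sizes, event_counts):
    """Yield every allocation of period event counts to cohorts.

    Each alternative is a list of columns; column j holds the events in
    period j from cohorts 0..j (cohort i starts in period i).
    """
    def extend(j, remaining, columns):
        if j == len(event_counts):
            yield list(columns)
            return
        for col in compositions(event_counts[j], remaining[: j + 1]):
            left = [r - c for r, c in zip(remaining, col)] + remaining[j + 1:]
            yield from extend(j + 1, left, columns + [col])

    yield from extend(0, list(cohort_sizes), [])


# Bottom row described for table 1; cohort sizes are hypothetical.
all_tables = list(alternatives(cohort_sizes=[10, 10], event_counts=[2, 5]))
for table in all_tables:
    print(table)
```

With these assumed cohort sizes there are six alternatives; the one with 3 period-2 events from the period 1 cohort and 2 from the period 2 cohort is the life data described for table 1.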

Problem statement

From “To the Man with a Hammer,…” [George 1997]

“I compute nonparametric, age-specific reliability estimators from ships and failures data [4], without life data. Although they are maximum likelihood estimators, they are not Kaplan-Meier (K-M) estimators because failures are grouped by calendar time intervals regardless of ages-at-failures. Fortunately they are population, not sample, estimators, so their only uncertainty is due to censoring.”

“The modified (for censored data) Kolmogorov-Smirnov (K-S) test applies to K-M estimators (from life data), not ships and returns estimators. What is the asymptotic distribution of the maximum difference between two reliability functions estimated from grouped ships and returns data [George 1996]? Is the modification in [Gnedenko] still appropriate? Is only power affected, not P[type I error]? I conjecture the modified K-S test still has the same asymptotic distribution, but the numbers of observed failures should be replaced by the numbers of time intervals containing failures. Reference [5] derives the asymptotic distribution of the K-S test statistic, in English. Reference [6] contains a robust program for the K-S test statistic, but not for the modification [3]. Reference [7] describes log-rank statistic alternatives to the K-S test, which may be more powerful than K-S tests when reliability estimates cross.”

Muhammad Aslam [2020] proposed one- and two-sample “neutrosophic” Kolmogorov-Smirnov tests (NK-S) where observations are contained in intervals, not known “crisply”. The test is based on an interval containing the K-S difference statistic instead of its exact value. Aslam does not specify how to deal with functions of interval observations: enumeration, interval arithmetic, simulation, or something else?

Solution

Simulate population life data with the same column sums or event counts as in the population data. Compute the K-M estimator from the simulated population life data and its K-S distance from the sample K-M survival function estimate. If the sample K-M estimator K-S distance is less than some percentile of the simulated |population-sample| K-S distance, do not reject the null hypothesis. Naturally, I call this an SNK-S test.
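The procedure above can be sketched in Python as follows. Several things here are assumptions, not the method actually used for the figures: each period's events are allocated uniformly at random among individuals still at risk, survivors are censored at the end of follow-up, the simulated distances are measured against a population survival estimate supplied by the user, and all counts and curves are hypothetical.

```python
import math
import random


def simulate_life_table(cohort_sizes, event_counts, rng):
    """Randomly assign each period's events to individuals still at risk.

    Returns deaths[i][a]: events from cohort i at age index a = period - cohort.
    Period (column) sums always equal event_counts.
    """
    n = len(event_counts)
    at_risk = list(cohort_sizes)
    deaths = [[0] * (n - i) for i in range(n)]
    for j, events in enumerate(event_counts):
        # One entry per individual at risk in period j, labelled by cohort.
        pool = [i for i in range(j + 1) for _ in range(at_risk[i])]
        for i in rng.sample(pool, events):
            deaths[i][j - i] += 1
            at_risk[i] -= 1
    return deaths


def km_from_grouped(cohort_sizes, deaths):
    """Kaplan-Meier estimate by age index from grouped cohort life data."""
    n = len(cohort_sizes)
    surv, curve = 1.0, []
    for a in range(n):
        cohorts = range(n - a)  # cohorts followed long enough to reach age a
        d = sum(deaths[i][a] for i in cohorts)
        r = sum(cohort_sizes[i] - sum(deaths[i][:a]) for i in cohorts)
        if r > 0:
            surv *= 1.0 - d / r
        curve.append(surv)
    return curve


def ks(curve_a, curve_b):
    """Maximum absolute difference between two curves on the same age grid."""
    return max(abs(a - b) for a, b in zip(curve_a, curve_b))


def snks_threshold(cohort_sizes, event_counts, population_curve,
                   n_sims=20, level=0.95, seed=1):
    """Empirical percentile of simulated K-S distances, plus the sorted distances."""
    rng = random.Random(seed)
    distances = sorted(
        ks(km_from_grouped(cohort_sizes,
                           simulate_life_table(cohort_sizes, event_counts, rng)),
           population_curve)
        for _ in range(n_sims)
    )
    k = min(n_sims - 1, math.ceil(level * n_sims) - 1)
    return distances[k], distances


# Hypothetical population counts and survival estimates.
cohort_sizes = [50, 60, 70]              # new cases per period
event_counts = [5, 12, 20]               # endpoint events per period (bottom row)
population_curve = [0.90, 0.80, 0.72]    # estimate from counts (assumed given)
sample_curve = [0.92, 0.85, 0.80]        # K-M estimate from treated sample

threshold, distances = snks_threshold(cohort_sizes, event_counts, population_curve)
observed = ks(population_curve, sample_curve)
print("observed:", round(observed, 4), "threshold:", round(threshold, 4))
print("reject Ho" if observed > threshold else "do not reject Ho")
```

Building the pool of individuals explicitly is fine for small counts like these; for population-scale counts a multivariate hypergeometric draw per period would do the same job without the large list.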

I simulated life data from the population data in table 3 and computed the K-S distance between population and sample for each of 20 simulations. Figure 1 shows that a lognormal distribution fits pretty well, especially near the upper end. The simulated mean of ln(K-S distance) was -4.23 and the standard deviation was 5.2. The 95th percentile was 0.032. If the K-S distance between a population nonparametric maximum likelihood estimator, from case and death counts, and the sample K-M estimator is less than 0.032, do not reject the null hypothesis at the 5% significance level.

However, the sets of simulated life data are not equally likely, assuming the population survival function estimate from population case and event counts. So I weighted each simulated K-S distance by a normalized Kullback-Leibler divergence of its simulated K-M estimator from the population survival function. Figure 2 shows the weighted alternative to figure 1, for the same simulated K-S distances. The simulated mean of ln(weighted K-S distance) was 0.00115 and the standard deviation was 0.00112. The 95th percentile was 0.00304. If the weighted |population-sample| K-S distance between a population nonparametric maximum likelihood estimator, from cases and deaths, and the sample K-M estimator is less than 0.00304, do not reject the null hypothesis at the 5% significance level.
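One way to read the weighting described above is sketched below: convert each survival curve to interval probabilities, compute the K-L divergence of each simulated curve from the population curve, and multiply each K-S distance by its divergence divided by the sum of divergences. The direction of the divergence, the tail-mass convention, and all numbers are assumptions made for illustration.

```python
import math


def pmf_from_survival(curve):
    """Interval probabilities implied by S(1), S(2), ..., plus the tail mass."""
    probs, prev = [], 1.0
    for s in curve:
        probs.append(prev - s)
        prev = s
    probs.append(prev)  # probability of surviving beyond the last age
    return probs


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence sum(p * ln(p / q)); terms with p = 0 are zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def weighted_distances(distances, simulated_curves, population_curve):
    """Multiply each K-S distance by (its K-L divergence) / (sum of K-L divergences)."""
    q = pmf_from_survival(population_curve)
    kl = [kl_divergence(pmf_from_survival(c), q) for c in simulated_curves]
    total = sum(kl)
    if total == 0:
        raise ValueError("all simulated curves equal the population curve")
    return [d * k / total for d, k in zip(distances, kl)]


# Hypothetical population curve, simulated K-M curves, and K-S distances.
population_curve = [0.90, 0.80, 0.70]
simulated_curves = [[0.88, 0.80, 0.72], [0.92, 0.79, 0.70]]
ks_distances = [0.02, 0.02]
weighted = weighted_distances(ks_distances, simulated_curves, population_curve)
print([round(w, 5) for w in weighted])
```

Because the weights sum to one, the weighted distances are much smaller than the raw distances, which matches the change of horizontal scale between figure 1 and figure 2.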

Table 3. Life data for simulation to give same bottom row

Figure 1. Simulated K-S distances from table 3 data. Distance is maximum absolute difference between nonparametric maximum likelihood estimator from bottom row and the Kaplan-Meier estimator from simulated event counts.

Figure 2. Simulated K-S distances from table 3 data, weighted by K-L divergence from population survival function. Horizontal axis differs from figure 1, because K-S distances are multiplied by the ratio (K-L divergence)/Σ(K-L divergences).

Afterthoughts: Multiple inference and COVID-19 vaccine

"Recognize that any frequentist statistical test has a random chance of indicating significance when it is not really present. Running multiple tests on the same data set at the same stage of an analysis increases the chance of obtaining at least one invalid result. Selecting the one "significant" result from a multiplicity of parallel tests poses a grave risk of an incorrect conclusion. Failure to disclose the full extent of tests and their results in such a case would be highly misleading." Professionalism Guideline 8, Ethical Guidelines for Statistical Practice, American Statistical Association, 1997 [https://web.ma.utexas.edu/users/mks/statmistakes/multipleinference.html]/.

1. Simulate the K-S distance for all simulated population K-M estimates and do not reject the null hypothesis if all the K-S distances are small. If somebody gives me some treatment life data and I can find corresponding population case and death (event) counts, I will do both tests.

2. Do the likelihood ratio test too [George 1999], using the column sums from the sample life data and the population counts.

3. Do log-rank and Gehan-Wilcoxon tests too? [Ed Gehan suggested that to me, 1976]

4. Do the weighted-difference in cumulative failure rate functions proposed by Fleming and Harrington [1980]? This deals with crossing failure rate functions. Their test statistic has known asymptotic properties.

5. Does the 95% Pfizer COVID-19 vaccine trial efficacy apply to the population? Vaccine efficacy = (Cases(unvacc.)/TTT(unvacc.)-Cases(vacc.)/TTT(vacc.))/(Cases(unvacc.)/TTT(unvacc.)); TTT() stands for total time on test. For the vaccinated subjects, that is time since vaccination; for the unvaccinated, it is total time since February or March 2020, when COVID-19 started, or a comparable time since June, when vaccination trials started.
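The total-time-on-test version of efficacy in item 5 replaces sample sizes with exposure time. A minimal sketch follows; the case counts are the ones quoted earlier, but the person-years of exposure are hypothetical placeholders, not trial data.

```python
def incidence_rate(cases, total_time_on_test):
    """Cases per unit of exposure time (for example, per person-year)."""
    return cases / total_time_on_test


def efficacy_ttt(cases_unvacc, ttt_unvacc, cases_vacc, ttt_vacc):
    """(rate_unvacc - rate_vacc) / rate_unvacc, with rates = Cases / TTT."""
    rate_unvacc = incidence_rate(cases_unvacc, ttt_unvacc)
    rate_vacc = incidence_rate(cases_vacc, ttt_vacc)
    return (rate_unvacc - rate_vacc) / rate_unvacc


# Case counts from the text; person-years below are HYPOTHETICAL.
ve = efficacy_ttt(cases_unvacc=162, ttt_unvacc=2220.0,
                  cases_vacc=8, ttt_vacc=2200.0)
print(f"{ve:.2%}")
```

If the unvaccinated comparison is the population rather than the placebo arm, the unvaccinated exposure time starts much earlier than the vaccinated exposure time, so the choice of TTT() matters as much as the case counts.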

References

Aslam, Muhammad, “Introducing Kolmogorov−Smirnov Tests under Uncertainty: An Application to Radioactive Data,” ACS Omega 2020, 5, 914−917, http://pubs.acs.org/journal/acsodf

Amelia J. Averitt, Chunhua Weng, Patrick Ryan, and Adler Perotte, “Translating evidence into practice: eligibility criteria fail to eliminate clinically significant differences between real-world and study populations,” npj Digital Medicine (2020) 3:67 ; https://doi.org/10.1038/s41746-020-0277-8

Belin, Lisa, Yann De Rycke, and Phillippe Broët, “A two-stage design for phase II trials with time-to-event endpoint using restricted follow-up,” Contemporary Clinical Trials Communications, Volume 8, December 2017, Pages 127-134, https://doi.org/10.1016/j.conctc.2017.09.010

Chan, Kwun Chuen Gary. (2013) “Survival analysis without survival data: connecting length-biased and case-control data,” Biometrika 100 (3): 764-770

Dean, N., Gsell, P.S., Brookmeyer, R., Crawford, F., Donnelly, C., Ellenberg, S., Fleming, T., Halloran, M. E., Horby, P., Jaki, T., Krause, P., Longini, I., Mulangu, S., Muyembe-Tamfum, J.J., Nason, M., Smith, P., Wang, R., Henao-Restrepo, A., and De Gruttola, V. (2020).  “Creating a Framework for Conducting Randomized Clinical Trials During Disease Outbreaks.” The New England Journal of Medicine, 382, 1366-1369

FDA, “Submitting Documents Using Real-World Data and Real-World Evidence to FDA for Drugs and Biologics Guidance for Industry,” May 2019

Fleming, Thomas R. and David P. Harrington, “A Class Of Hypothesis Tests For One and Two Sample Censored Survival Data,” Technical Report Series, No. 9, August 1980

[7] Fleming, T. R. and D. P. Harrington, Counting Processes and Survival Analysis, Wiley-Interscience, New York, 1991

Gehan, E. A. (1965). “A generalized Wilcoxon test for comparing arbitrarily singly-censored samples.” Biometrika 52, 203-223

[4] George, L. L., and A. C. Agrawal, “Estimation of a hidden service distribution of an M/G/∞ system,” Naval Research Logistics, 20: 549–555. doi: 10.1002/nav.3800200314, https://sites.google.com/site/fieldreliability/home/m-g-infinity-service-distribution

George, L. L., “Ergodic Theory, Nyquist Samples, and Field Reliability,” Triad Systems Corp., March 1996 (Ergodeny.doc)

George, L. L., “Product Reliability Comparison with Censored Data,” or “To the Man With a Hammer, Everything Looks Like a Nail,” ASQ Reliability Review, Vol. 17, No. 1, March 1997 (KSCentst.doc)

George, L. L.,  “Compare Population and Customer Reliability,” Quality and Productivity Research Conference, ASQ and UC Berkeley, Santa Rosa, CA May 1998 (QPRC98.doc)

George, L. L., “Why Kill Controls? R. A. Fisher says so,” 1999 (ClinTril.doc)

Gnedenko, B. V., Yu. K. Belyayev, and A. D. Solovyev, Mathematical Methods of Reliability Theory, Academic Press, New York, pp. 274-276, 1969

Grover, N. B., “Two-sample Kolmogorov-Smirnov test for truncated data,” https://doi.org/10.1016/0010-468X(77)90039-3

Joyner, Michael, et al., “Effect of Convalescent Plasma on Mortality among Hospitalized Patients with COVID-19: Initial Three Month Experience,” MedRxiv preprint, Aug. 2020, https://doi.org/10.1101/2020.08.12.20169359

Koziol, James A.  and  David P. Byar, “Percentage Points of the Asymptotic Distributions of One and Two Sample K-S Statistics for Truncated or Censored Data,” Technometrics, Vol. 17, No. 4, pp. 507-510, 1975, doi = 10.1080/00401706.1975.10489380, https://www.tandfonline.com/doi/abs/10.1080/00401706.1975.10489380

LeBlanc, Michael and Catherine Tangen, “Choosing Phase II Endpoints and Designs: Evaluating the possibilities,” Clin. Cancer Res. 2012 Apr 15; 18(8): 2130–2132. Published online 2012 Mar 8. doi: 10.1158/1078-0432.CCR-12-0454

[5] Nikiforov, A. M.  “Algorithm AS288, Exact Smirnov Two-sample Tests for Arbitrary Distributions,” Appl. Statist, v. 43, No. 1, pp265-284, 1994

[6] ibid, “Subroutine GSMIRN,” statlib@lib.stat.cmu.edu

Smarandache, Florentin, Introduction to Neutrosophic Statistics, Sitech & Education Publishing, Columbus, Ohio, 2014