Selected Publications
ALL TYPES
ORDERED EARLIEST TO LATEST
Density ratio model for multivariate outcomes
Scott Marchese, Guoqing Diao
Journal Paper | Journal of Multivariate Analysis, Volume 154, February 2017, Pages 249-261
Abstract
The Density Ratio Model is a semi-parametric regression model which allows analysis of data from any exponential family without making a parametric distribution assumption. For univariate outcomes several authors have shown desirable properties of this model, including robustness to mis-specification and efficiency of the estimators within a suitable class. In this paper we consider analysis of multivariate outcomes with this model, where each marginal distribution is from an exponential family. We show that the model successfully analyzes data from mixed outcome types (continuous, integer, binary), providing valid tests of the joint effects of covariates. Furthermore, for continuous outcomes we provide a bootstrap technique which correctly estimates the underlying marginal regression parameters and provides appropriate coverage probabilities without specifying the covariance structure. The methods are demonstrated via simulation studies and analysis of healthcare data.
Joint regression analysis of mixed-type outcome data via efficient scores
Scott Marchese, Guoqing Diao
Journal Paper | Computational Statistics & Data Analysis, Volume 125, September 2018, Pages 156-170
Abstract
Joint analysis of multivariate outcomes composed of mixed data types (continuous, count, binary, survival, etc.) induces special complexity in model specification and analysis. When the scientific question of interest involves a joint effect of covariate(s) of interest on the set of outcome variables, specifying a full probability model may be infeasible, undesirably complex, or computationally intractable. A flexible method to estimate and conduct inference on such joint effects is presented which accounts for correlation among the outcomes without needing to explicitly specify their joint distribution. Simulation studies and an analysis of health care data illustrate the approach and its operating characteristics vis-à-vis other methods.
Joint factor and regression analyses of multivariate ordinal data - Application to psychiatric assessments
Guoqing Diao, Srikanth Gottipati, Peter Zhang
Manuscript | October 2018, Pages 1-26
No External Link
Semiparametric frailty models for zero-inflated event count data in the presence of informative dropout
Guoqing Diao, Donglin Zeng, Kuolung Hu, Joseph G. Ibrahim
Journal Paper | Biometrics, in press
Abstract
Recurrent events data are commonly encountered in medical studies. In many applications, only the number of events during the follow‐up period, rather than the recurrent event times, is available. Two important challenges arise in such studies: (a) a substantial portion of subjects may not experience the event, and (b) we may not observe the event count for the entire study period due to informative dropout. To address the first challenge, we assume that the underlying population consists of two subpopulations: a subpopulation nonsusceptible to the event of interest and a subpopulation susceptible to the event of interest. In the susceptible subpopulation, the event count is assumed to follow a Poisson distribution given the follow‐up time and the subject‐specific characteristics. We then introduce a frailty to account for informative dropout. The proposed semiparametric frailty models consist of three submodels: (a) a logistic regression model for the probability that a subject belongs to the nonsusceptible subpopulation; (b) a nonhomogeneous Poisson process model with an unspecified baseline rate function; and (c) a Cox model for the informative dropout time. We develop likelihood‐based estimation and inference procedures. The maximum likelihood estimators are shown to be consistent. Additionally, the proposed estimators of the finite‐dimensional parameters are asymptotically normal and the covariance matrix attains the semiparametric efficiency bound. Simulation studies demonstrate that the proposed methodologies perform well in practical situations. We apply the proposed methods to a clinical trial on patients with myelodysplastic syndromes.
Biomarker threshold adaptive designs for survival endpoints
Guoqing Diao, Jun Dong, Donglin Zeng, Chunlei Ke, Alan Rong, Joseph G. Ibrahim
Journal Paper | Journal of Biopharmaceutical Statistics, February 2018 (published online), Pages 1-17
Abstract
Due to the importance of precision medicine, it is essential to identify the right patients for the right treatment. Biomarkers, which have been commonly used in clinical research as well as in clinical practice, can facilitate selection of patients with a good response to the treatment. In this paper, we describe a biomarker threshold adaptive design with survival endpoints. In the first stage, we determine subgroups for one or more biomarkers such that patients in these subgroups benefit the most from the new treatment. The analysis in this stage can be based on historical or pilot studies. In the second stage, we sample subjects from the subgroups determined in the first stage and randomly allocate them to the treatment or control group. Extensive simulation studies are conducted to examine the performance of the proposed design. Application to a real data example is provided for implementation of the first-stage algorithms.
Quantification of muscle tissue properties by modeling the statistics of ultrasound image intensities using a mixture of Gamma distributions in children with and without cerebral palsy
Siddhartha Sikdar, Guoqing Diao, Diego Turo, Christopher J. Stanley, Abhinav Sharma, Amy Chambliss, Loretta Laughrey, April Aralar, Diane L. Damiano
Journal Paper | Journal of Ultrasound in Medicine, Volume 37, Issue 9, September 2018, Pages 2157-2169
Abstract
Objectives
To investigate whether quantitative ultrasound (US) imaging, based on the envelope statistics of the backscattered US signal, can describe muscle properties in typically developing children and those with cerebral palsy (CP).
Methods
Radiofrequency US data were acquired from the rectus femoris muscle of children with CP (n = 22) and an age‐matched cohort without CP (n = 14) at rest and during maximal voluntary isometric contraction. A mixture of gamma distributions was used to model the histogram of the echo intensities within a region of interest in the muscle.
Results
Muscle in CP had a heterogeneous echo texture that was significantly different from that in healthy controls (P < .001), with larger deviations from Rayleigh scattering. A mixture of 2 gamma distributions showed an excellent fit to the US intensity, and the shape and rate parameters were significantly different between CP and control groups (P < .05). The rate parameters for both the single gamma distribution and the mixture of gamma distributions were significantly higher for contracted muscles compared to resting muscles, but there was no significant interaction between these factors (CP and muscle contraction) for a mixed‐model analysis of variance.
Conclusions
Ultrasound tissue characterization indicates a more disorganized architecture and increased echogenicity in muscles in CP, consistent with previously documented increases in fibrous infiltration and connective tissue changes in this population. Our results indicate that quantitative US can be used to objectively differentiate muscle architecture and tissue properties.
Modeling event count data in the presence of informative dropout with application to bleeding and transfusion events in myelodysplastic syndrome
Guoqing Diao, Donglin Zeng, Kuolung Hu, Joseph G. Ibrahim
Journal Paper | Statistics in Medicine, September 2017, Volume 36, Issue 22, Pages 3475-3494
Abstract
In many biomedical studies, it is often of interest to model event count data over the study period. For some patients, we may not follow them up for the entire study period owing to informative dropout. The dropout time can potentially provide valuable insight into the rate of the events. We propose a joint semiparametric model for event count data and informative dropout time that allows for correlation through a Gamma frailty. We develop efficient likelihood‐based estimation and inference procedures. The proposed nonparametric maximum likelihood estimators are shown to be consistent and asymptotically normal. Furthermore, the asymptotic covariances of the finite‐dimensional parameter estimates attain the semiparametric efficiency bound. Extensive simulation studies demonstrate that the proposed methods perform well in practice. We illustrate the proposed methods through an application to a clinical trial for bleeding and transfusion events in myelodysplastic syndrome.
A class of semiparametric cure models with current status data
Guoqing Diao, Ao Yuan
Journal Paper | Lifetime Data Analysis, January 2019, Volume 25, Issue 1, Pages 25-51
Abstract
Current status data occur in many biomedical studies where we only know whether the event of interest occurs before or after a particular time point. In practice, some subjects may never experience the event of interest, i.e., a certain fraction of the population is cured or is not susceptible to the event of interest. We consider a class of semiparametric transformation cure models for current status data with a survival fraction. This class includes both the proportional hazards and the proportional odds cure models as two special cases. We develop efficient likelihood-based estimation and inference procedures. We show that the maximum likelihood estimators for the regression coefficients are consistent, asymptotically normal, and asymptotically efficient. Simulation studies demonstrate that the proposed methods perform well in finite samples. For illustration, we provide an application of the models to a study on the calcification of hydrogel intraocular lenses.
Analysis of Secondary Phenotype Data under Case-Control Designs
Guoqing Diao, Donglin Zeng, Dan-Yu Lin
Book Chapter in Handbook of Statistical Methods for Case-Control Studies | CRC Press | June 27, 2018
Chapter 28: Analysis of Secondary Phenotype Data under Case-Control Designs
Although the primary objective of case-control studies is to assess the effects of genetic variants between cases and controls, secondary phenotypes are often collected in such studies without much extra cost. For example, in the Diabetes Genetics Initiative (DGI) study, there were 1,464 patients with type 2 diabetes and 1,467 controls from Finland and Sweden, and a variety of secondary phenotype traits were available for these patients, including anthropometric measures, glucose tolerance and insulin secretion, lipids and apolipoproteins, and blood pressure. These secondary phenotypes are typically the exposures/risk factors of interest for the main outcome. In the Wellcome Trust Case Control Consortium (WTCCC), a case-control study consisting of 1,924 U.K. type-2 diabetes patients and 2,938 U.K. population controls, body mass index (BMI) and adult height were also measured as secondary traits. With the availability of secondary phenotype information, it is cost-effective to study the association between genetic variants and these additional traits without the need to conduct new studies. Indeed, the DGI study identified an association of a particular single nucleotide polymorphism (SNP) in an intron of the glucokinase regulatory protein gene with serum triglycerides in both the case and control groups.
Controlling false discovery proportion in identification of drug‐related adverse events from multiple system organ classes
Xianming Tan, Guanghan F. Liu, Donglin Zeng, William Wang, Guoqing Diao, Joseph F. Heyse, Joseph G. Ibrahim
Journal Paper | Statistics in Medicine, Volume 38, September 2019, Pages 4378-4389
Abstract
Analyzing safety data from clinical trials to detect safety signals worth further examination involves testing multiple hypotheses, one for each observed adverse event (AE) type. These hypotheses have a hierarchical structure due to the classification of the AEs into system organ classes, and the AEs are also likely correlated. Many approaches have been proposed to identify safety signals under the multiple testing framework while aiming to control the false discovery rate (FDR). FDR control concerns the expectation of the false discovery proportion (FDP); in practice, control of the actual random variable FDP can be more relevant and has recently drawn much attention. In this paper, we propose a two‐stage procedure for safety signal detection with direct control of the FDP, combining a permutation‐based approach for screening groups of AEs with a permutation‐based construction of simultaneous upper bounds for the FDP. Our simulation studies show that the new approach controls the FDP. We demonstrate the approach using data sets derived from a drug clinical trial.
Sparsity analysis of a sonomyographic muscle-computer interface
Nima Akhlaghi, Ananya Dhawan, Amir Khan, Biswarup Mukherjee, Guoqing Diao, Cecile Truong, Siddhartha Sikdar
Journal Paper | IEEE Transactions on Biomedical Engineering, in press
Abstract
Objective: Sonomyography has been shown to be a promising method for decoding volitional motor intent from analysis of ultrasound images of the forearm musculature. The objectives of this paper are to determine the optimal location for ultrasound transducer placement on the anterior forearm for imaging maximum muscle deformations during different hand motions and to investigate the effect of using a sparse set of ultrasound scanlines for motion classification for ultrasound-based muscle-computer interfaces (MCIs). Methods: The optimal placement of the ultrasound transducer along the forearm is identified using freehand 3D reconstructions of the muscle thickness during rest and motion completion. From the ultrasound images acquired from the optimally placed transducer, we determine classification accuracy (CA) with equally spaced scanlines across the cross-sectional field-of-view (FOV). Furthermore, we investigated the unique contribution of each scanline to class discrimination using the Fisher criterion (FC) and mutual information (MI) with respect to motion discriminability. Results: Experiments with 5 able-bodied subjects show that the maximum muscle deformation occurred between 40-50% of the forearm length for multiple degrees-of-freedom. The average classification accuracy was 94 ± 6% with the entire 128-scanline image and 94 ± 5% with 4 equally spaced scanlines. However, no significant improvement in classification accuracy was observed with optimal scanline selection using FC and MI. Conclusion: For an optimally placed transducer, a small subset of ultrasound scanlines can be used instead of a full imaging array without sacrificing performance in terms of classification accuracy for multiple degrees-of-freedom.
Significance: The selection of a small subset of transducer elements can enable reduced computation and simplified instrumentation and power consumption of wearable sonomyographic MCIs, particularly for rehabilitation and gesture recognition applications.
Semiparametric regression analysis for composite endpoints subject to componentwise censoring
Guoqing Diao, Donglin Zeng, Chunlei Ke, Haijun Ma, Qi Jiang, Joseph G. Ibrahim
Journal Paper | Biometrika, Volume 105, Issue 2, June 2018, Pages 403-418
Abstract
Composite endpoints with censored data are commonly used as study outcomes in clinical trials. For example, progression-free survival is a widely used composite endpoint, with disease progression and death as the two components. Progression-free survival time is often defined as the time from randomization to the earlier occurrence of disease progression or death from any cause. The censoring times of the two components could be different for patients not experiencing the endpoint event. Conventional approaches, such as taking the minimum of the censoring times of the two components as the censoring time for progression-free survival time, may suffer from efficiency loss and could produce biased estimates of the treatment effect. We propose a new likelihood-based approach that decomposes the endpoints and models both the progression-free survival time and the time from disease progression to death. The censoring times for different components are distinguished. The approach makes full use of available information and provides a direct and improved estimate of the treatment effect on progression-free survival time. Simulations demonstrate that the proposed method outperforms several other approaches and is robust against various model misspecifications. An application to a prostate cancer clinical trial is provided.
Robust big data analytics via divergences
Lei Li, Anand N. Vidyashankar, Guoqing Diao, Ejaz Ahmed
Journal Paper | March 2019, Volume 21, Issue 4, 348 (40 pages)
Abstract
Big data and streaming data are encountered in a variety of contemporary applications in business and industry. In such cases, it is common to use random projections to reduce the dimension of the data, yielding compressed data. These data, however, possess various anomalies such as heterogeneity, outliers, and round-off errors which are hard to detect due to volume and processing challenges. This paper describes a new robust and efficient methodology, using Hellinger distance, to analyze the compressed data. Using large sample methods and numerical experiments, it is demonstrated that routine use of the robust estimation procedure is feasible. The role of double limits in understanding the efficiency and robustness is brought out, which is of independent interest.
Efficient methods for signal detection from correlated adverse events in clinical trials
Guoqing Diao, Guanghan F. Liu, Donglin Zeng, William Wang, Xianming Tan, Joseph F. Heyse, Joseph G. Ibrahim
Journal Paper | Biometrics, September 2019, Volume 75, Issue 3, Pages 1000-1008
Abstract
It is an important and yet challenging task to identify true signals from the many adverse events that may be reported during the course of a clinical trial. One unique feature of drug safety data from clinical trials, unlike data from post‐marketing spontaneous reporting, is that many types of adverse events are reported by only very few patients, leading to rare events. Due to the limited study size, the p‐values for testing whether the rate is higher in the treatment group are, across all types of adverse events, in general not uniformly distributed under the null hypothesis that there is no difference between the treatment group and the placebo group. A consequence is that typically fewer than 100α percent of the hypotheses are rejected under the null at the nominal significance level of α. The other challenge is multiplicity control. Adverse events from the same body system may be correlated, and there may also be correlations between adverse events from different body systems. To tackle these challenging issues, we develop Monte‐Carlo‐based methods for signal identification from patient‐reported adverse events in clinical trials. The proposed methodologies account for the rare events and arbitrary correlation structures among adverse events within and/or between body systems. Extensive simulation studies demonstrate that the proposed method can accurately control the family‐wise error rate and is more powerful than existing methods under many practical situations. Application to two real examples is provided.
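The resampling idea behind the two signal-detection papers above can be illustrated with a generic single-step max-statistic permutation adjustment. This is a Westfall-Young-style sketch, not the authors' exact Monte Carlo procedure; the trial sizes, AE incidence rate, and number of permutations are arbitrary illustrative choices on synthetic data:

```python
# Generic single-step max-statistic permutation adjustment for many
# correlated adverse-event (AE) comparisons. Illustrative sketch only:
# all sizes, rates, and B below are made-up example values.
import numpy as np

rng = np.random.default_rng(0)
n_trt, n_ctl, n_ae = 100, 100, 20

# Synthetic trial: rows = subjects, columns = AE types (1 = AE reported).
ae = rng.binomial(1, 0.05, size=(n_trt + n_ctl, n_ae))
arm = np.array([1] * n_trt + [0] * n_ctl)   # 1 = treatment, 0 = placebo

def rate_diff(data, labels):
    """Per-AE difference in incidence, treatment minus placebo."""
    return data[labels == 1].mean(axis=0) - data[labels == 0].mean(axis=0)

obs = rate_diff(ae, arm)

# Permute arm labels as whole subject rows; this preserves the correlation
# among AE columns, which is what makes the adjustment valid under
# arbitrary dependence within and between body systems.
B = 2000
max_null = np.array([rate_diff(ae, rng.permutation(arm)).max()
                     for _ in range(B)])

# Single-step adjusted p-value per AE: how often the null max statistic
# reaches the observed statistic for that AE.
p_adj = (max_null[:, None] >= obs[None, :]).mean(axis=0)
```

Because the adjustment compares every AE against the permutation distribution of the maximum statistic, it controls the family-wise error rate without any explicit model of the correlation structure; the hierarchical grouping by system organ class and the rare-event corrections developed in the papers are refinements beyond this basic sketch.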