Poster Session I.
Time and venue: May 15th, 2:30 p.m. – 3:30 p.m., DDS Atrium.
Title: ForLion: A New Algorithm for D-optimal Designs under General Parametric Statistical Models with Mixed Factors
Presenter: Abhyuday Mandal
Abstract: We address the problem of designing an experiment with both discrete and continuous factors under fairly general parametric statistical models. We propose a new algorithm, named ForLion, to search for optimal designs under the D-criterion. The algorithm performs an exhaustive search in a design space with mixed factors while maintaining high efficiency and reducing the number of distinct experimental settings. Its optimality is guaranteed by the general equivalence theorem. We demonstrate its superiority over state-of-the-art design algorithms using real-life experiments under multinomial logistic models (MLM) and generalized linear models (GLM). Our simulation studies show that the ForLion algorithm can reduce the number of experimental settings by 25% or improve the relative efficiency of the designs by 17.5% on average. Our algorithm can help experimenters reduce the time cost, the usage of experimental devices, and thus the total cost of their experiments while preserving high design efficiency. (This is joint work with Yifei Huang, Keren Li, and Jie Yang.)
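To make the criterion concrete, here is a minimal Python sketch (not the ForLion implementation) of evaluating the D-criterion for a candidate design under a toy logistic model with assumed parameter values, and checking the general equivalence theorem's sensitivity condition d(x) <= p:

    # Minimal sketch: D-criterion and equivalence-theorem check for a
    # logistic model with one continuous factor. Illustrative only.
    import numpy as np

    beta = np.array([0.5, -1.0])          # assumed local parameter values

    def info_matrix(points, weights):
        """Fisher information for logistic regression with f(x) = (1, x)."""
        M = np.zeros((2, 2))
        for x, w in zip(points, weights):
            f = np.array([1.0, x])
            eta = f @ beta
            v = np.exp(eta) / (1 + np.exp(eta)) ** 2   # GLM weight
            M += w * v * np.outer(f, f)
        return M

    points, weights = np.array([-2.0, 1.0]), np.array([0.5, 0.5])  # candidate
    M = info_matrix(points, weights)
    print("log det M =", np.log(np.linalg.det(M)))

    # Equivalence theorem: the design is D-optimal iff the sensitivity
    # d(x) = v(x) f(x)' M^{-1} f(x) <= p (= 2 here) over the design region.
    Minv = np.linalg.inv(M)
    for x in np.linspace(-5, 5, 11):
        f = np.array([1.0, x])
        eta = f @ beta
        v = np.exp(eta) / (1 + np.exp(eta)) ** 2
        print(f"x={x:+.1f}  d(x)={v * f @ Minv @ f:.3f}")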
Title: Comparison of Algorithms for Exact Optimal Designs in the gBLUP Model
Presenter: Alexandra Stadler
Abstract: In contemporary breeding programs, genomic best linear unbiased prediction (gBLUP) models are employed to drive decisions on artificial selection. Experiments are performed to obtain responses on the units in the breeding program, and because the size of the experiment is restricted, an efficient experimental design must be found. The poster states the design problem for the gBLUP model and compares classical exchange-type algorithms for exact optimal designs with the TrainSel R package and algorithm of Akdemir et al. (2021). Particular emphasis is placed on evaluating the computational runtime of the algorithms along with their respective efficiencies over different sample sizes. The algorithms are compared under the D-criterion and the CDMin-criterion. Reference: Akdemir, D., Rio, S., & Isidro-Sánchez, J. (2021). TrainSel: An R Package for Selection of Training Populations. Frontiers in Genetics, 12. DOI: 10.3389/fgene.2021.655287
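For readers unfamiliar with exchange-type algorithms, here is a minimal sketch of the classical idea on a generic linear-model stand-in (hypothetical candidate set; not the gBLUP model or the TrainSel package):

    # Minimal sketch of a classical exchange-type search for an exact
    # D-optimal n-point design over a finite candidate set.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))          # candidate information vectors
    n = 20                                 # experiment size

    def logdet(rows):
        Xi = X[list(rows)]
        sign, ld = np.linalg.slogdet(Xi.T @ Xi)
        return ld if sign > 0 else -np.inf

    design = set(rng.choice(len(X), n, replace=False))
    improved = True
    while improved:                        # repeat until no swap helps
        improved = False
        base = logdet(design)
        for i in list(design):
            for j in range(len(X)):
                if j in design:
                    continue
                trial = (design - {i}) | {j}
                if logdet(trial) > base + 1e-10:
                    design, improved = trial, True
                    break
            if improved:
                break
    print("final log det:", logdet(design))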
Title: ESPs: A New Cost Efficient Sampler for Expensive Posterior Distributions
Presenter: Benedetta Bruni
Abstract: Bayesian inverse problems for modeling complex physical systems require evaluations of a forward simulation model, which can be prohibitively expensive in terms of CPU hours. It is therefore important to design 'cost-efficient' samplers that achieve a satisfactory representation of the desired posterior under a fixed computational budget. Most current sampling algorithms (e.g., Hamiltonian Monte Carlo methods) are 'sample-efficient', meaning they provide a good representation of the posterior given limited samples, but are highly cost-inefficient, as they require at least one evaluation of the forward model per sample. We present a new sampler, cost-Efficient Stein Points (ESPs). ESPs extend the recent Stein points of Chen et al. (2018, ICML), which achieve sample-efficiency by sequential minimization of the kernel Stein discrepancy with respect to the posterior of interest. The key novelty of ESPs is the use of carefully constructed Gaussian process surrogate models of the kernel Stein discrepancy, enabling cost-efficient sequential minimization via Bayesian optimization based on Expected Improvement. We demonstrate the cost-efficiency of ESPs in comparison to state-of-the-art posterior sampling algorithms via a suite of numerical experiments and a calibration application.
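The Expected Improvement acquisition named above has a closed form under a Gaussian process surrogate; a minimal sketch for minimization (illustrative posterior values, not the authors' ESP code):

    # Minimal sketch of the Expected Improvement acquisition used for
    # cost-efficient sequential minimization.
    import numpy as np
    from scipy.stats import norm

    def expected_improvement(mu, sigma, best):
        """EI for minimization, given GP posterior mean mu and sd sigma."""
        sigma = np.maximum(sigma, 1e-12)
        z = (best - mu) / sigma
        return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    # Toy use: pick the next evaluation among candidate points.
    mu = np.array([0.3, 0.1, 0.5])       # surrogate posterior means
    sd = np.array([0.05, 0.2, 0.4])      # surrogate posterior sds
    print("next point:", np.argmax(expected_improvement(mu, sd, best=0.2)))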
Title: Subdata Selection for Principal Component Analysis
Presenter: Bruce Phillips
Abstract: Principal component analysis (PCA) is a powerful statistical tool for data dimensionality reduction. It works by computing a projection of the data onto the axes along which it varies most. Finding these axes requires computing the eigen-decomposition of the sample covariance matrix, or the singular value decomposition of the data matrix, which may be too computationally expensive in the large-data setting. In this poster, we introduce a novel method for selecting an informative subset of the sample data that estimates the principal components well while being small enough to be usable in practice. Through exhaustive simulations, we demonstrate that our method achieves lower estimation error than uniform subsampling.
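As a point of reference for the uniform-subsampling baseline, here is a minimal sketch that measures the subspace error between full-data and subsample principal components (synthetic data; the poster's selection method is not reproduced):

    # Minimal sketch: compare principal components estimated from a
    # uniform subsample against full-data PCA.
    import numpy as np

    rng = np.random.default_rng(1)
    n, p, k = 100_000, 10, 2
    X = rng.normal(size=(n, p)) @ np.diag(np.linspace(3, 0.5, p))

    def top_pcs(A, k):
        A = A - A.mean(axis=0)
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        return Vt[:k].T

    V_full = top_pcs(X, k)
    sub = rng.choice(n, 1_000, replace=False)    # uniform subsample
    V_sub = top_pcs(X[sub], k)

    # Subspace estimation error via the projection (Frobenius) distance.
    err = np.linalg.norm(V_full @ V_full.T - V_sub @ V_sub.T)
    print("projection distance:", err)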
Title: Highly Variable Drugs in Bioequivalence Trials in Mexican Population: An Exploratory Analysis and Proposals for Study Designs
Presenter: Carlos Alejandro Díaz-Tufinio
Abstract: Bioequivalence trials allow the testing of generic formulations, speeding the development process of new formulations. Given the characteristics of these trials, such as the fact that they can be conducted in healthy volunteers under controlled conditions, the design of these randomized clinical trials (RCTs) is critical to their conclusions, especially for highly variable drugs (HVDs). In this project, we summarized the pharmacological information from 365 bioequivalence clinical trials, encompassing around 180 different active substances (parent drug and/or its metabolites), focusing on those identified as highly variable in their pharmacokinetic (PK) parameters in the Mexican population, and summarizing and analyzing their characteristics. This study provides a follow-up to a full compendium of PK within-subject and between-subject variability in the Mexican population, published in late 2023. The meta-analysis provides scientific evidence to support the future design of controlled comparative bioavailability and bioequivalence trials by suggesting more adequate experimental designs and sample sizes, which is especially important for drugs identified as HVDs. Because it is based on a randomized controlled clinical trial setting, this work offers reliable experimental evidence to back up adequate bioequivalence trial design and the calculation of appropriate sample sizes, based on the statistical requirements of further studies. Finally, from the bedside perspective, these data, together with relevant clinically annotated genetic variations and drug pharmacological information, could support clinicians' decision-making to improve medication use in real-world settings.
Title: The Uniform Placement of Alter-nodes on a Spherical Surface (U-PASS) for Ego-Centric Networks and its Link to Minimum Energy Designs
Presenter: Chao-Hui Huang
Abstract: An ego-centric network consists of a particular node (the ego) that has relationships to all neighboring nodes (alters) in the network. Such a network is an important tool for studying the network structure of the ego's alters, and good visualization is essential for presenting it. This work introduces an efficient method, the Uniform Placement of Alters on a Spherical Surface (U-PASS), for representing an ego-centric network so that all alters are scattered uniformly on the surface of the unit sphere. Unlike simpler notions of uniformity that merely maximize Euclidean distances among nodes, U-PASS is a three-stage method that spreads the alters while accounting for existing edges among alters, avoiding overlap of node clusters, and incorporating node attribute information. Particle swarm optimization is employed to improve efficiency in node allocation. To guarantee uniformity, we show the connection between U-PASS and the minimum energy design on a two-dimensional flat plane with a specific gradient. Our simulation study shows the good performance of U-PASS in terms of several distance statistics when compared to four state-of-the-art methods based on self-organizing maps and force-driven approaches.
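One such distance statistic can be computed directly; a minimal sketch of the minimum pairwise geodesic distance for points on the unit sphere (random placement as a toy stand-in for U-PASS output):

    # Minimal sketch: minimum pairwise great-circle distance among
    # points on the unit sphere, a simple uniformity statistic.
    import numpy as np

    rng = np.random.default_rng(7)
    P = rng.normal(size=(30, 3))
    P /= np.linalg.norm(P, axis=1, keepdims=True)   # project to sphere

    cosines = np.clip(P @ P.T, -1.0, 1.0)
    np.fill_diagonal(cosines, -1.0)                 # exclude self-pairs
    min_geodesic = np.arccos(cosines.max())         # smallest angle
    print("minimum pairwise geodesic distance:", min_geodesic)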
Title: Nested Strong Orthogonal Arrays
Presenter: Chunwei Zheng
Abstract: Nested space-filling designs are popular for conducting multiple computer experiments with different levels of accuracy. Strong orthogonal arrays (SOAs) are a special type of space-filling design with attractive low-dimensional stratifications. Combining these two kinds of designs, we propose a new type of design called nested strong orthogonal arrays. A nested strong orthogonal array with two layers allows different strengths for the large SOA and the small SOA nested within it. Our designs can accommodate more columns or require more economical run sizes, and some of them possess better stratifications in two dimensions. The construction methods for this type of design are based on regular second order saturated (SOS) designs and nonregular designs. In addition, projective geometry plays a part in the construction.
Title: Subdata Selection for High-dimensional Big Data with Categorical Responses
Presenter: David Collins
Abstract: Classical statistical methods are no longer useful for analyzing big data due to computational limitations caused by the sheer size of the data. Subdata selection methods have, therefore, emerged to select informative subdata efficiently. Our interest is in fitting a generalized linear model to a high-dimensional dataset. While a few subdata selection methods can solve this problem, the state-of-the-art deterministic Information-Based Optimal Subdata Selection (IBOSS) approach only works when the sample size is at least twice the number of features, which is restrictive for high-dimensional datasets. For linear regression problems under the assumption of effect sparsity, Singh and Stufken (2023) proposed a novel approach called CLASS that first performs variable selection and then selects an IBOSS sample based only on the selected variables. We extend the CLASS algorithm to the generalized linear model framework. In addition to the algorithmic extension, we theoretically establish model selection consistency in the linear regression and generalized linear model frameworks. We also show that the IBOSS sample on selected features enjoys nice statistical properties regarding the variance of the estimated model parameters. Using extensive simulations, we illustrate that our method has superior screening performance and lower estimation error for active coefficients, and is computationally cheap.
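A minimal sketch of the two-stage screen-then-subselect idea: L1-penalized screening followed by an IBOSS-style rule that keeps extreme values of each selected feature (hypothetical data; not the authors' CLASS code):

    # Minimal sketch of screening followed by IBOSS-style subselection.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    n, p = 20_000, 100
    X = rng.normal(size=(n, p))
    beta = np.zeros(p); beta[:5] = 1.0               # sparse truth
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta)))

    # Stage 1: variable screening on a small pilot sample.
    pilot = rng.choice(n, 2_000, replace=False)
    screen = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    screen.fit(X[pilot], y[pilot])
    active = np.flatnonzero(screen.coef_.ravel())
    print("selected features:", active)

    # Stage 2: IBOSS-style subdata on the selected features only.
    r = 200                                          # per-feature budget
    idx = set()
    for j in active:
        order = np.argsort(X[:, j])
        idx |= set(order[:r]) | set(order[-r:])      # smallest and largest
    print("subdata size:", len(idx))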
Title: Complete Active Learning for Emulation and Optimization
Presenter: Difan Song
Abstract: Gaussian process (GP) models are widely used in active learning for emulation and optimization of black-box functions. Existing GP-based active learning procedures start with an initial design and then add points using acquisition functions. If the initial design is too small, the response surface may be under-explored and the algorithm may terminate prematurely at a local optimum. On the other hand, if the initial design is too large, we may waste valuable resources and miss the interesting regions of the response surface. This article proposes a new active learning procedure that completely avoids the initial design. This is achieved by using a new correlation function and a new GP model, which automatically embeds a projection-based space-filling criterion into the acquisition functions. Through theory and simulations, we show that the proposed procedure, which we call COMPlete ACTive (COMPACT) learning, outperforms existing active learning procedures.
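To illustrate how a variance-driven acquisition can build a design from scratch, here is a minimal sketch with a generic stationary kernel (a standard acquisition, not COMPACT's correlation function or criterion):

    # Minimal sketch: GP active learning by maximum posterior variance,
    # starting from an empty design; the result is space-filling.
    import numpy as np

    cand = np.linspace(0, 1, 101)[:, None]           # candidate points

    def k(a, b, ell=0.15):
        return np.exp(-(a - b.T) ** 2 / (2 * ell ** 2))

    X = np.zeros((0, 1))
    for step in range(6):
        if len(X) == 0:
            var = np.ones(len(cand))   # flat prior variance: all points tie
        else:
            Kinv = np.linalg.inv(k(X, X) + 1e-9 * np.eye(len(X)))
            kc = k(cand, X)
            var = 1.0 - np.sum(kc @ Kinv * kc, axis=1)  # posterior variance
        X = np.vstack([X, cand[np.argmax(var)]])
    print("sequentially chosen points:", np.sort(X.ravel()))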
Title: Sample Size Planning for Conditional Counterfactual Mean Estimation with a K-armed Randomized Experiment
Presenter: Gabriel Ruiz
Abstract: We cover how to determine a sufficiently large sample size for a K-armed randomized experiment in order to estimate conditional counterfactual expectations in data-driven subgroups. The subgroups can be output by any feature-space partitioning algorithm, including those defined by binning users with similar predictive scores or by a learned policy tree. After carefully specifying the inference target, a minimum confidence level, and a maximum margin of error, the key is to turn the original goal into a simultaneous inference problem, where the sample size recommended to offset the increased possibility of estimation error is directly related to the number of inferences to be conducted. Given a fixed sample-size budget, our result allows us to invert the question into one about the feasible number of treatment arms or partition complexity (e.g., the number of decision tree leaves). Using policy trees to learn subgroups, we evaluate our nominal guarantees on a large publicly available randomized experiment test dataset. Preprint: https://arxiv.org/abs/2403.04039
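A minimal worked example of the sample-size logic, using a Bonferroni-style correction as a stand-in for the paper's simultaneous-inference adjustment (illustrative numbers):

    # Minimal sketch: required n per (arm, subgroup) cell grows with the
    # number m of simultaneous inferences via the adjusted critical value.
    import math
    from scipy.stats import norm

    sigma, E, alpha = 1.0, 0.1, 0.05    # sd bound, margin of error, level
    K, G = 3, 8                         # treatment arms, subgroups
    m = K * G                           # simultaneous inferences
    z = norm.ppf(1 - alpha / (2 * m))   # Bonferroni-adjusted quantile
    n_cell = math.ceil((z * sigma / E) ** 2)
    print(f"z = {z:.3f}, n per cell = {n_cell}, total = {K * G * n_cell}")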
Title: Unweighted Estimation Based on Optimal Sample under Measurement Constraints
Presenter: Jing Wang
Abstract: Big data bring new challenges to data storage and processing, especially when computational resources are limited. Researchers have developed many subsampling methods for various models, such as linear, logistic, and generalized linear models (GLMs); see Ma et al. (2015), Wang et al. (2018), and Ai et al. (2021). Most algorithms developed for GLMs rely on all responses of the full data, which limits the application scope of subsampling when responses are difficult to acquire. To handle this problem, Zhang et al. (2021) proposed a response-free optimal sampling scheme. However, they use a reweighted estimator that assigns smaller weights to more informative data points, so their approach is not efficient. We introduce an unweighted estimator to improve estimation efficiency and investigate the theoretical properties of both estimators. Asymptotic normality is established using martingale techniques without conditioning on the pilot estimate, an approach that has been less investigated in the existing subsampling literature. Both theoretical analysis and numerical experiments show that our estimator is more efficient and has better performance without increasing computational complexity.
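A minimal sketch contrasting the reweighted and unweighted fits on the same informative subsample (toy logistic model; the bias correction that validates the unweighted estimator is the poster's contribution and is not reproduced):

    # Minimal sketch: inverse-probability-weighted vs. unweighted fits
    # on a subsample drawn with informative probabilities.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(3)
    n = 50_000
    X = rng.normal(size=(n, 2))
    y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([1.0, -0.5])))))

    pi = np.abs(X[:, 0]); pi /= pi.sum()    # toy informative probabilities
    sub = rng.choice(n, 2_000, replace=False, p=pi)

    weighted = LogisticRegression().fit(X[sub], y[sub],
                                        sample_weight=1.0 / pi[sub])
    unweighted = LogisticRegression().fit(X[sub], y[sub])
    print("weighted coef:  ", weighted.coef_.ravel())
    print("unweighted coef:", unweighted.coef_.ravel())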
Title: Estimation and Variable Selection of Conditional Main Effects for Generalized Linear Models
Presenter: Kexin Xie
Abstract: In the evolving landscape of statistical analysis, the introduction of conditional main effects (CMEs) by Wu (2015) laid the groundwork for discerning the conditional influence of one variable at a specified level of another. In this work, we extend the application of CMEs from linear regression to generalized linear models. The proposed method leverages the foundational principles of CME coupling and reduction to refine variable selection. Simulation studies show that our methodology improves selection accuracy, and a case study in public health demonstrates the merits of the proposed method.
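A minimal sketch of how CMEs are built from two ±1-coded factors, showing the coupling identity that CME-based selection exploits (coding follows Wu (2015); the selection machinery is not shown):

    # Minimal sketch: the CME A|B+ equals A when B is at its + level
    # and 0 otherwise; A|B- is defined analogously.
    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.choice([-1, 1], size=8)
    B = rng.choice([-1, 1], size=8)

    cme_A_given_Bplus  = A * (B == 1)    # A|B+
    cme_A_given_Bminus = A * (B == -1)   # A|B-

    # Coupling: (A|B+) + (A|B-) = A and (A|B+) - (A|B-) = A*B, linking
    # CMEs to the main effect and the interaction.
    print("sum recovers A:       ", np.all(cme_A_given_Bplus + cme_A_given_Bminus == A))
    print("difference recovers AB:", np.all(cme_A_given_Bplus - cme_A_given_Bminus == A * B))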
Title: Mixed Effects Model via Distribution-in-distribution-out Regression
Presenter: Mengfan Fu
Abstract: The mixed effects model has been the key technique for jointly considering fixed effects and random effects in the analysis of an experiment. It requires one to first create a design of experiment by generating design matrices for both fixed effects and random effects. Motivated by the imperfect controllers in many manufacturing and healthcare systems, rather than considering deterministic design matrices, we propose to create such design matrices as probability distributions. These distributional design matrices allow one to account for the randomness of the controllers that realize the design, leading to distributional responses. To analyze the deterministic relationship between distributional design matrices and distributional responses, we employ a distribution-in-distribution-out (DIDO) regression model, analyzing the relationship on the 2-Wasserstein space. We also demonstrate that the existing mixed effects model is a special case of the DIDO regression model when reduced to Euclidean space. Simulation studies were conducted to validate the proposed methodology.
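Since DIDO regression operates on the 2-Wasserstein space, here is a minimal sketch of the univariate 2-Wasserstein distance via quantile functions (illustrative samples only):

    # Minimal sketch: for univariate distributions, the 2-Wasserstein
    # distance is the L2 distance between quantile functions.
    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.normal(0.0, 1.0, size=1_000)
    y = rng.normal(0.5, 1.5, size=1_000)

    q = np.linspace(0.005, 0.995, 199)
    W2 = np.sqrt(np.mean((np.quantile(x, q) - np.quantile(y, q)) ** 2))
    print("approx 2-Wasserstein distance:", W2)
    # For N(0,1) vs N(0.5,1.5): W2^2 = 0.5^2 + (1.5-1)^2, so W2 is about 0.707.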
Title: Simultaneous Inferences for Multiple Utility in Time Choice Pairs under Copula Based Models
Presenter: Norou Diawara
Abstract: Discrete choice models (DCMs) are applied in many fields and in the statistical modelling of consumer behavior. The construction of DCMs takes many forms, such as Binary Logit, Binary Probit, Multinomial Logit, Conditional Logit, Multinomial Probit, Nested Logit, Generalized Extreme Value Models, Mixed Logit, and Exploded Logit. Choice behaviors and their utilities arise in the social sciences, health economics, transportation research, marketing, and health systems research, and they exhibit time-dependent behavior. We extend the DCMs with emphasis on time-dependent best-worst choice and discrimination between choice attributes, using a flexible distribution function for the time dependence: the copula method. Here we fit a bivariate best-worst copula distribution for consumer choice by including parameters for customer feeling and the state of uncertainty. We use a conditional logit model to calculate initial utilities, and expected utilities over time are obtained using a backward recursive method based on Markov decision processes. Transition probabilities, derived using a copula method called the CO-CUB model, are used to predict the utilities in time (UiTs). Through covariates estimated from Flynn (2007), we illustrate the behavior of the UiTs and their confidence/credible intervals. The properties of the transition probabilities are assessed in a bootstrap study; under the copula and bootstrap approach, the transition probabilities follow a Bessel sequence under sufficient conditions.
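A minimal sketch of the backward recursion for expected utilities over time, with a hypothetical transition matrix standing in for the copula-derived (CO-CUB) probabilities:

    # Minimal sketch: backward recursion for expected utility over time
    # in a Markov decision setting (hypothetical utilities and transitions).
    import numpy as np

    u = np.array([1.0, 0.4, 0.1])         # per-period utility of 3 choices
    P = np.array([[0.7, 0.2, 0.1],        # hypothetical transition probs
                  [0.3, 0.5, 0.2],
                  [0.2, 0.3, 0.5]])
    T, discount = 5, 0.9

    U = u.copy()                          # utilities at the horizon t = T
    for t in range(T - 1, 0, -1):         # step backward in time
        U = u + discount * P @ U
    print("expected utilities at t=1:", U)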
Title: Optimal Designs for Order-of-Addition Two-Level Factorial Experiments
Presenter: Qiang Zhao
Abstract: A new type of experiment, called the order-of-addition factorial experiment, has recently received considerable attention in medical science and bioengineering. These experiments aim to simultaneously optimize the order of addition and the dose levels of drug components. In the experimental design literature, the idea of dual-orthogonal arrays (DOAs) was recently introduced for such experiments. However, constructing flexible DOAs is a challenging task. In this paper, we propose a novel theory-guided search method that efficiently identifies DOAs of any size (if they exist). We also provide an algebraic construction that directly yields certain DOAs. Moreover, to address the potential issue that DOAs ignore interaction effects, we propose a new type of optimal design under the expanded compound model, named the strong DOA (SDOA), and provide two algebraic constructions for it. We establish theoretical results on the optimality of both DOAs and SDOAs. Simulation studies demonstrate the superiority of our proposed designs.
Title: Strong Orthogonal Arrays of Strength Two Plus with Better Two-dimensional Projection Properties
Presenter: Sashini Silva
Abstract: Strong orthogonal arrays of strength two plus provide a class of useful designs for computer experiments. In this project, we formulate two criteria for selecting those arrays with better two-dimensional projection properties from the class of strong orthogonal arrays of strength two plus. A complete search is carried out for designs of 16 runs. Some results for designs of 32 runs are also obtained.
Title: Efficient Subdata Selection for Generalization Error Minimization in Kernel Regression
Presenter: Sheng-Zhan Hua
Abstract: This research aims to minimize generalization error in predictive modeling via kernel regression through subdata selection. By establishing an equivalence between linear models and kernel regression, we develop a subdata selection method that combines the simplicity of linear-model approaches with the adaptability of kernel methods. As demonstrated through numerical examples, the proposed method significantly reduces generalization error across various datasets. It proves particularly beneficial for massive datasets or low-quality data prone to yielding unstable predictions.
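A minimal sketch of kernel ridge regression fitted on a subsample, the predictor whose generalization error the method targets (uniform subsample as a baseline; the proposed selection rule is not shown):

    # Minimal sketch: kernel ridge regression on a uniform subsample,
    # with generalization error against a known test function.
    import numpy as np

    rng = np.random.default_rng(6)
    X = rng.uniform(-3, 3, size=(2_000, 1))
    y = np.sin(X).ravel() + 0.1 * rng.normal(size=2_000)

    def rbf(A, B, ell=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * ell ** 2))

    def krr_predict(Xtr, ytr, Xte, lam=1e-2):
        alpha = np.linalg.solve(rbf(Xtr, Xtr) + lam * np.eye(len(Xtr)), ytr)
        return rbf(Xte, Xtr) @ alpha

    Xte = np.linspace(-3, 3, 200)[:, None]
    sub = rng.choice(2_000, 200, replace=False)      # uniform baseline
    mse = np.mean((krr_predict(X[sub], y[sub], Xte) - np.sin(Xte).ravel()) ** 2)
    print("subsample KRR generalization MSE:", mse)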
Title: Modeling and Designs for Constrained Order-of-addition Experiments
Presenter: Xueru Zhang
Abstract: The order-of-addition (OofA) experiment aims to investigate the optimal sequence for adding various components, with the response contingent upon their order of addition. Many OofA experiments in the pharmaceutical, technological, and traditional industries are constrained by prerequisites, where certain components must precede others; such experiments are called constrained OofA experiments. While existing research on OofA experiments is extensive, it has primarily focused on experiments without specific constraints on the order of addition. This paper addresses constrained OofA experiments, proposing a partial pairwise ordering (PPWO) model and optimal designs to identify the optimal order in scenarios where certain components require precedence. As the number of components increases, executing all possible orders under the inherent constraints becomes financially impractical. To reduce the number of runs, systematic construction methods are provided to generate optimal designs under various criteria, such as A-optimality and D-optimality. Simulation studies demonstrate the effectiveness of the proposed model and designs in achieving the optimal order in constrained OofA experiments.
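A minimal sketch of pairwise-ordering (PWO) covariates restricted to feasible orders under a precedence constraint (hypothetical constraint "component 0 precedes component 2"; the PPWO model itself is not reproduced):

    # Minimal sketch: PWO model matrix over the feasible orders of a
    # constrained order-of-addition experiment with m = 4 components.
    from itertools import combinations, permutations
    import numpy as np

    m = 4
    pairs = list(combinations(range(m), 2))   # 6 PWO columns for m = 4

    def pwo_row(order):
        pos = {c: i for i, c in enumerate(order)}
        return [1 if pos[a] < pos[b] else -1 for a, b in pairs]

    feasible = [o for o in permutations(range(m))
                if o.index(0) < o.index(2)]   # constraint: 0 precedes 2
    Z = np.array([pwo_row(o) for o in feasible])
    print(f"{len(feasible)} feasible orders, model matrix shape {Z.shape}")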