Reinforcement learning (RL), which is frequently modeled as sequential learning and decision making in the face of uncertainty, has garnered growing interest in recent years due to its remarkable success in practice. In contemporary RL applications, it is increasingly common to encounter environments with prohibitively large state and action spaces, which imposes stringent requirements on the sample efficiency of the RL algorithms in use. Despite this empirical success, however, the theoretical underpinnings of many popular RL algorithms remain highly inadequate even in the tabular setting.
In this talk, we present two vignettes regarding the sample efficiency of RL algorithms. The first vignette demonstrates that a perturbed model-based RL approach is minimax optimal under a generative model, without suffering from a sample size barrier that was present in all past work. In the second vignette, we pin down the sample complexity of Q-learning on Markovian samples, which substantially improves upon prior results by a factor at least as large as the dimension of the state-action space. These results cover two distinctive RL paradigms and might shed light on the efficacy of these algorithms in more complicated scenarios.
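To fix ideas, a minimal sketch of the tabular Q-learning update analyzed in the second vignette is given below; the environment interface, exploration scheme, and step-size schedule are illustrative assumptions rather than the exact setup studied in the talk.

```python
import numpy as np

def q_learning(env, num_states, num_actions, num_steps, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning along a single Markovian trajectory (illustrative sketch).

    States and actions are assumed to be integer-coded; `env` is assumed to
    expose reset() -> state and step(action) -> (next_state, reward).
    """
    q = np.zeros((num_states, num_actions))
    visits = np.zeros((num_states, num_actions))
    state = env.reset()
    for _ in range(num_steps):
        # epsilon-greedy behavior policy over the current Q estimate
        if np.random.rand() < epsilon:
            action = np.random.randint(num_actions)
        else:
            action = int(np.argmax(q[state]))
        next_state, reward = env.step(action)
        visits[state, action] += 1
        # one possible (rescaled linear) step-size schedule
        eta = 1.0 / (1.0 + (1.0 - gamma) * visits[state, action])
        td_target = reward + gamma * np.max(q[next_state])
        q[state, action] += eta * (td_target - q[state, action])
        state = next_state
    return q
```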
Reinforcement learning provides an attractive suite of online learning methods for personalizing interventions in digital health. However, after a reinforcement learning algorithm has been run in a clinical study, how do we assess whether personalization occurred? We might find users for whom it appears that the algorithm has indeed learned in which contexts the user is more responsive to a particular intervention. But could this have happened completely by chance? We discuss some first approaches to addressing these questions.
The Expected Improvement (EI) method, proposed by Jones et al. (1998), is a widely used Bayesian optimization method, which makes use of a fitted Gaussian process model for efficient black-box optimization. However, one key drawback of EI is that it is overly greedy in exploiting the fitted Gaussian process model for optimization, which results in suboptimal solutions even with large sample sizes. To address this, we propose a new hierarchical EI (HEI) framework, which makes use of a hierarchical Gaussian process model. HEI preserves a closed-form acquisition function, and corrects the over-greediness of EI by encouraging exploration of the optimization space. We then introduce hyperparameter estimation methods which allow HEI to mimic a fully Bayesian optimization procedure, while avoiding expensive Markov chain Monte Carlo sampling steps. We prove the global convergence of HEI over a broad function space, and establish near-minimax convergence rates under certain prior specifications. The improvement of HEI over existing methods is then demonstrated via numerical experiments and in applications to manufacturing optimization and hyperparameter tuning of deep learning models.
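For reference, the closed-form acquisition function referred to above is the standard EI criterion of Jones et al. (1998); with posterior mean mu_n(x), posterior standard deviation sigma_n(x), and current best observed value f_n^* (notation chosen here for illustration, minimization convention), it reads

```latex
\mathrm{EI}_n(x) \;=\; \bigl(f_n^* - \mu_n(x)\bigr)\,
  \Phi\!\left(\frac{f_n^* - \mu_n(x)}{\sigma_n(x)}\right)
  \;+\; \sigma_n(x)\,
  \phi\!\left(\frac{f_n^* - \mu_n(x)}{\sigma_n(x)}\right),
```

where Phi and phi denote the standard normal distribution and density functions. HEI retains this closed-form structure while replacing the plug-in Gaussian process posterior with a hierarchical one.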
I will discuss two attempts at making causal inferences about gun violence prevention policies. The first policy is removing guns from intimate partner violence abusers and the second policy is background checks for private gun sales. I will also briefly describe some open areas of research about gun violence prevention that statisticians might be able to contribute to.
Dylan Small is the Class of 1965 Wharton Professor of Statistics at the Wharton School of the University of Pennsylvania. His research interests are in causal inference, the design and analysis of observational studies, the design and analysis of experiments, and their applications to public health, medicine, and public policy. He is the founding editor of the journal Observational Studies and an Associate Editor for the Annals of Statistics, the Journal of the American Statistical Association, the Journal of Causal Inference, the Journal of Educational and Behavioral Statistics, and The American Statistician.
Science and engineering have benefited greatly from the ability of finite element methods (FEMs) to simulate nonlinear, time-dependent complex systems. The recent advent of extensive data collection from such complex systems now raises the question of how to systematically incorporate these data into finite element models, consistently updating the solution in the face of mathematical model misspecification with physical reality. This article describes general and widely applicable methodology for the coherent synthesis of data with FEM models, providing a data-driven probability distribution that captures all sources of uncertainty in the pairing of FEM with measurements.
For developing statistical and machine learning models, it is common to split the dataset into two parts: training and testing. The training part is used for fitting the model and the testing part for evaluating the performance of the fitted model. The most common strategy for splitting is to randomly sample a fraction of the dataset. In this talk, I will discuss an optimal method for doing this (joint work with my student Akhil Vakayil).
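As a point of reference, the random-splitting baseline mentioned above is commonly implemented along the following lines; the simulated data, the 80/20 ratio, and the scikit-learn call are illustrative choices and not part of the proposed optimal method.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                       # illustrative features
y = X @ rng.normal(size=5) + rng.normal(size=1000)   # illustrative response

# Randomly hold out 20% of the rows for testing; the optimal splitting method
# discussed in the talk would replace this uniform random draw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```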
We propose a method to assess the sensitivity of data analyses to the removal of a small fraction of the data set. Analyzing all possible data subsets of a certain size is computationally prohibitive, so we provide a finite-data metric to approximately compute the number (or fraction) of observations that, when dropped, have the greatest influence on a given result. We call our resulting metric the Approximate Maximum Influence Perturbation. Our approximation is automatically computable and works for common estimators --- including (but not limited to) OLS, IV, GMM, MLE, and variational Bayes. We provide explicit finite-sample error bounds on our approximation for linear and instrumental variables regressions. At minimal computational cost, our metric provides an exact finite-data lower bound on sensitivity for any estimator, so any non-robustness our metric finds is conclusive. We demonstrate that the Approximate Maximum Influence Perturbation is driven by the signal-to-noise ratio in the inference problem, is not reflected in standard errors, does not disappear asymptotically, and is not a product of misspecification. We focus on econometric analyses in our applications. Several empirical applications show that even 2-parameter linear regression analyses of randomized trials can be highly sensitive. While we find some applications are robust, in others the sign of a treatment effect can be changed by dropping less than 1% of the sample even when standard errors are small.
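A rough illustration of the idea, specialized to a single OLS coefficient and using a first-order influence-function approximation (ignoring leverage corrections), is sketched below; this is a simplified stand-in for the Approximate Maximum Influence Perturbation, not the authors' implementation.

```python
import numpy as np

def approx_max_influence(X, y, coef_index, drop_frac=0.01):
    """Approximate the largest increase in one OLS coefficient achievable by
    dropping a `drop_frac` fraction of observations (first-order sketch)."""
    n, _ = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Empirical influence of each observation on the chosen coefficient:
    # dropping observation i changes the coefficient by roughly -influence[i].
    influence = (X @ XtX_inv[:, coef_index]) * resid
    k = max(1, int(drop_frac * n))
    most_negative = np.argsort(influence)[:k]
    approx_change = -influence[most_negative].sum()
    return beta[coef_index], approx_change, most_negative
```

Applying the same logic with signs flipped approximates the largest achievable decrease, which is the kind of calculation needed to ask whether the sign of an estimated effect can be flipped by removing a small fraction of the sample.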
Frequentist uncertainty is well understood in terms of the estimation of a sampling distribution of a statistic, acknowledging that the observed finite sample is one of many that could have been observed. Bayesian uncertainty, on the other hand, appears to start with a prior distribution and makes no acknowledgement of the variability of the data. This talk aims to shed light on Bayesian uncertainty and shows how the posterior distribution can indeed be understood through variability in the data. We show how this interpretation of the Bayesian approach leads to practical implementations, and illustrations will be presented.
One fundamental goal of high-dimensional statistics is to detect and recover structure from noisy data. But even for simple settings (e.g. a planted low-rank matrix perturbed by noise), the computational complexity of estimation is sometimes poorly understood. A growing body of work studies low-degree polynomials as a proxy for computational complexity: it has been demonstrated in various settings that low-degree polynomials of the data can match the statistical performance of the best known polynomial-time algorithms for detection. But prior work has failed to address settings in which there is a "detection-recovery gap" and detection is qualitatively easier than recovery. In this talk, I'll describe a recent result in which we extend the method of low-degree polynomials to address recovery problems. As applications, we resolve (in the low-degree framework) open problems about the computational complexity of recovery for the planted submatrix and planted dense subgraph problems.
I will present two Bayesian causal discovery approaches. The first approach is motivated by single-cell RNA-seq data. We proposed a zero-inflated Poisson Bayesian network which explicitly accounts for the sparse count nature of single-cell RNA-seq data. The second approach is motivated by breast cancer bulk RNA-seq data. We developed a Bayesian network with latent trajectory embedding to account for the tumor heterogeneity. Both approaches are uniquely identifiable for purely observational, cross-sectional data — a key property that many Bayesian networks do not possess due to Markov equivalence. Efficient parallel-tempered Markov chain Monte Carlo algorithms are designed to explore the multi-modal network space. We illustrate our methods using real RNA-seq datasets.
Kannan, Lovász and Simonovits (KLS) conjectured in 1995 that the Cheeger isoperimetric coefficient of any log-concave density is achieved by half-spaces up to a universal constant factor. This conjecture also implies other important conjectures such as Bourgain's slicing conjecture (1986) and the thin-shell conjecture (2003). In this talk, first we briefly survey the origin and the main consequences of these conjectures. Then we present the development and the refinement of the main proof technique, namely Eldan's stochastic localization scheme, which results in the current best bounds of the Cheeger isoperimetric coefficient in the KLS conjecture.
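For reference, the quantity at stake can be written as follows (notation chosen here for illustration): for a probability measure mu on R^n with a log-concave density, its Cheeger isoperimetric coefficient is

```latex
\psi_\mu \;=\; \inf_{S \subseteq \mathbb{R}^n}
  \frac{\mu^{+}(\partial S)}{\min\{\mu(S),\, 1 - \mu(S)\}},
```

where mu^+(∂S) denotes the boundary measure of S; the KLS conjecture asserts that restricting the infimum to half-spaces changes psi_mu by at most a universal constant factor.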
The analysis of tensor data has become an active research topic in statistics and data science recently. Many high-order datasets arising from a wide range of modern applications, such as genomics, material science, and neuroimaging analysis, require modeling with high-dimensional tensors. In addition, tensor methods provide unique perspectives and solutions to many high-dimensional problems where the observations are not necessarily tensors. High-dimensional tensor problems generally possess distinct characteristics that pose unprecedented challenges to the statistical community. There is a clear need to develop novel methods, algorithms, and theory to analyze high-dimensional tensor data.
In this talk, we discuss some recent advances in high-dimensional tensor data analysis through several fundamental topics and their applications in microscopy imaging and neuroimaging. We will also illustrate how we develop new statistically optimal methods, computationally efficient algorithms, and fundamental theories that exploit information from high-dimensional tensor data based on the modern theory of computation, non-convex optimization, applied linear algebra, and high-dimensional statistics.
Recently, the largest exome sequencing study of autism spectrum disorder (ASD) to date implicated 102 genes in risk. This risk gene set serves as a springboard for additional explorations into the etiological pathways of ASD, which can guide the hunt for therapeutics. Quantification of gene expression using single-cell RNA-sequencing of brain tissues can be a critical step in such investigations. We describe statistical challenges encountered in analyzing developing brain cells, including new methods for transfer learning and for hierarchical reconstruction via reconciliation of multi-resolution cluster trees.
As datasets continue to grow in size, in many settings the focus of data collection has shifted away from testing pre-specified hypotheses, and towards hypothesis generation. Researchers are often interested in performing an exploratory data analysis in order to generate hypotheses, and then testing those hypotheses on the same data; I will refer to this as 'double dipping'. Unfortunately, double dipping can lead to highly inflated Type I error rates. In this talk, I will consider the special case of hierarchical clustering. First, I will show that sample-splitting does not solve the 'double dipping' problem for clustering. Then, I will propose a test for a difference in means between estimated clusters that accounts for the cluster estimation process, using a selective inference framework. I will also show an application of this approach to single-cell RNA-sequencing data. This is joint work with Lucy Gao (University of Waterloo) and Jacob Bien (University of Southern California).
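To make the double-dipping issue concrete, the following toy simulation (illustrative only, not taken from the talk) clusters pure noise with Ward hierarchical clustering and then naively t-tests for a difference in means between the two estimated clusters; the rejection rate far exceeds the nominal level even though no true clusters exist.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
pvals = []
for _ in range(200):
    x = rng.normal(size=(100, 2))  # no true clusters in the data
    labels = fcluster(linkage(x, method="ward"), t=2, criterion="maxclust")
    # Naive test for a difference in means between the estimated clusters,
    # ignoring the fact that the clusters were estimated from the same data.
    p = ttest_ind(x[labels == 1, 0], x[labels == 2, 0]).pvalue
    pvals.append(p)

print(np.mean(np.array(pvals) < 0.05))  # far above the nominal 0.05 level
```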
The Lyman-α forest – a dense series of hydrogen absorptions seen in the spectra of distant quasars – provides a unique observational probe of the early Universe. The density of spectroscopically measured quasars across the sky has recently risen to a level that has enabled secure measurements of large-scale structure in the three-dimensional distribution of intergalactic gas using the inhomogeneous hydrogen absorption patterns imprinted in the densely sampled quasar sightlines. In principle, these modern Lyman-α forest observations can be used to statistically reconstruct three-dimensional density maps of the intergalactic medium over the massive cosmological volumes illuminated by current spectroscopic quasar surveys. However, until now, such maps have been impossible to produce without the development of scalable and statistically rigorous spatial modeling techniques. Using a sample of approximately 160,000 quasar sightlines measured across 25 percent of the sky by the SDSS-III Baryon Oscillation Spectroscopic Survey, here we present a 154 Gpc^3 large-scale structure map of the redshift 1.98≤z≤3.15 intergalactic medium — the largest volume large-scale structure map of the Universe to date — accompanied by rigorous quantification of the statistical uncertainty in the reconstruction.
Opportunities to use “real world data,” data generated as a by-product of digital transactions, have exploded over the past decade. Such data sources facilitate research in a naturalistic setting and with greater speed than is possible for research that relies on primary data collection. However, using data sources that were not collected for research purposes has a price and naïve use of such data without considering the complex data generating mechanisms they arise from can lead to biased inference. In this talk, I will use my research on electronic health records (EHR)-based phenotyping to motivate a discussion of the role of statistics in transforming real world data into knowledge. EHR-based phenotyping is hampered by complex missing data patterns and heterogeneity across patients and healthcare systems, features that have been largely ignored by existing phenotyping methods. As a result, not only are EHR-derived phenotypes expected to be imperfect, but they often feature exposure-dependent differential misclassification, which can bias results towards or away from the null. I will review novel and existing approaches to EHR-based phenotyping, highlighting the impact of missing data on phenotype estimation. Finally, I will discuss approaches to minimize bias when incorporating error-prone phenotypes into subsequent analyses. The overall goal of this presentation is to use the example of phenotyping to illustrate the unique contribution of statistics to the process of generating evidence from real world data.
A central goal of single cell genomics is to understand how cells interact and influence each other, and how tissues grow and respond to specific interventions. In my talk, I will give three examples of how we can use exploratory data analysis and statistical methods to begin to quantify relationships between cells. First, using pathology images and paired bulk RNA-seq data, I show how canonical correlation analysis models can be used to find image morphology that covaries with gene expression, and we use these results to identify image QTLs. Second, I describe a method for dimension reduction that allows us to augment dissociated single cell RNA-seq data with spatial information and, conversely, expand often sparse spatial transcriptomic data to all 20,000 genes in the human genome. Third, I show how Hawkes processes can be used to quantify spatial signaling between groups of heterogeneous cells across time and space, and illustrate the results through changes to spatial signaling in response to drugs inhibiting signaling and with respect to distance from a wound. Using these single cell data and models, we begin to quantify how specific cellular neighbors influence each other, and to predict how tissues might respond to interventions.
In this talk we will discuss statistical challenges and opportunities with joint analysis of electronic health records and genomic data through "Genome and Phenome-Wide Association Studies (GWAS and PheWAS)". We posit a modeling framework that helps us to understand the effect of both selection bias and outcome misclassification in assessing genetic associations across the medical phenome. We will propose various inferential strategies that handle both sources of bias to yield improved inference. We will use data from the UK Biobank and the Michigan Genomics Initiative, a longitudinal biorepository at Michigan Medicine launched in 2012, to illustrate the analytic framework. The examples illustrate that understanding sampling design and selection bias matters for big data, and is at the heart of doing good science with data. This is joint work with Lauren Beesley at the University of Michigan.
Multivariate Hawkes processes are commonly used to model streaming networked event data in applications including neuroscience, social networks, seismic data, crime data, and epidemiology. Much progress in estimating such models has been made in statistics and machine learning; however, it remains a challenge to perform uncertainty quantification and extract reliable inference from complex datasets, especially under general interaction patterns. This is essential for statistical tasks such as causal inference (where the underlying directed graph implies Granger causality) and change detection. Aiming towards this, we develop statistical inference tools for finding sequential confidence intervals and detecting changes, drawing ideas from concentration inequalities for continuous-time martingales and from optimization. We compare our method to the previously derived asymptotic Hawkes process confidence interval and demonstrate our method's strengths in an application to neuronal connectivity reconstruction.
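For reference, a common parameterization of the conditional intensity of a d-dimensional Hawkes process (notation chosen here for illustration, with events t_k^j on node j) is

```latex
\lambda_i(t) \;=\; \mu_i \;+\; \sum_{j=1}^{d} \;\sum_{k:\, t_k^{j} < t}
  \alpha_{ij}\, \varphi\bigl(t - t_k^{j}\bigr),
```

where mu_i is the baseline rate of node i, alpha_{ij} >= 0 quantifies how strongly events on node j excite future events on node i (the directed graph whose edges imply Granger causality), and varphi is a triggering kernel such as varphi(s) = beta e^{-beta s}.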
Psychologists developed Multiple Factor Analysis to decompose multivariate data into a small number of interpretable factors without any a priori knowledge about those factors [Thurstone, 1935]. In this form of factor analysis, the Varimax "factor rotation" is a key step to make the factors interpretable [Kaiser, 1958]. Charles Spearman and many others objected to factor rotations because the factors seem to be rotationally invariant [Thurstone, 1947, Anderson and Rubin, 1956]. These objections are still reported in all contemporary multivariate statistics textbooks. This is an enigma, because this vintage form of factor analysis has survived and is widely popular because, empirically, the factor rotation often makes the factors easier to interpret.
In a recent paper, Muzhe Zeng and I overturn a great deal of this controversy. Just as sparsity helps to find a solution in p>n regression, we show that sparsity resolves the rotational invariance of factor analysis. In fact, this was predicted by Thurstone in 1947. Moreover, we show that Principal Components Analysis (PCA) with the Varimax rotation provides a unified spectral estimation strategy for a broad class of "semi-parametric factor models," including Stochastic Blockmodels, topic models (LDA), and nonnegative matrix factorization (https://arxiv.org/abs/2004.05387).
This talk has three parts. Part I will return to the origin of the factor analysis misunderstanding. Part II will clarify what was misunderstood. Part III will create a new and unified understanding. This new understanding has clear implications for practice. With a sparse eigensolver, PCA with Varimax is both fast and stable. Combined with Thurstone’s straightforward sparsity diagnostics, this vintage approach is suitable for a wide array of modern applications. For example, in my applied work on social networks, PCA with Varimax easily scales (on my laptop) to graphs with millions of people.
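As a minimal sketch (on simulated data, with a textbook implementation of the Kaiser rotation rather than the authors' software), PCA with Varimax amounts to rotating the leading scaled singular vectors of the centered data matrix:

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-8):
    """Classic varimax rotation of a loadings matrix (Kaiser, 1958)."""
    p, k = loadings.shape
    rotation = np.eye(k)
    var = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated**3
                          - (gamma / p) * rotated @ np.diag(np.sum(rotated**2, axis=0)))
        )
        rotation = u @ vt
        new_var = s.sum()
        if new_var < var * (1 + tol):
            break
        var = new_var
    return loadings @ rotation

# PCA with Varimax on purely illustrative simulated data: take the top-k
# scaled left singular vectors of the centered data, then rotate them.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(500, 50)).astype(float)
X -= X.mean(axis=0)
u, s, vt = np.linalg.svd(X, full_matrices=False)
k = 5
factors = varimax(u[:, :k] * s[:k])
```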
Stein's method is an analytical tool developed in probability theory to control the distance between probability distributions. A central by-product of this method is the construction of classes of functions whose expectations are zero under some distribution of interest. In this talk, we will demonstrate that this somewhat abstract analytical tool can be leveraged to construct practical algorithms for solving problems in computational statistics and machine learning. Examples will include the construction of deterministic approximations of Bayesian posterior distributions, of control variates for Markov chain Monte Carlo methods, and of estimators for intractable likelihood problems.
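A canonical instance of such a zero-mean class (stated here with notation chosen for illustration) is given by the Langevin-Stein identity: for a smooth density p on R^d and suitably regular test functions f,

```latex
\mathbb{E}_{x \sim p}\bigl[\, f(x)\,\nabla \log p(x) + \nabla f(x) \,\bigr] \;=\; 0.
```

Roughly speaking, evaluating the left-hand side under an approximating distribution, over a rich class of test functions f, yields a computable discrepancy from p that requires p only up to its normalizing constant, which is what makes constructions of this kind practical for posterior approximation, control variates, and intractable likelihoods.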
Bias in the estimation of causal effects is a function of distributional imbalance of covariates between treatment groups. Weighting strategies such as inverse propensity score methods attempt to mitigate bias by either modeling the treatment assignment mechanism or balancing specified covariate moments. In practice, these approaches can be quite sensitive to modeling decisions. This talk instead introduces a novel weighting method based on the energy distance that is explicitly designed to balance weighted covariate distributions, thus targeting the source of bias. The method has several advantages compared with existing weighting techniques. First, our weighting strategy provides a model-free approach for causal comparisons and can be flexibly utilized in a wide variety of downstream causal analyses, such as the estimation of average treatment effects, individualized treatment rules, and more. Second, our approach is based on a genuine measure of distributional balance, providing a means of precisely assessing the covariate balance induced by a given set of weights. Finally, the method is computationally feasible and provides strong theoretical guarantees under weak conditions. The effectiveness of our approach is demonstrated in the analysis of two real-world applications, the first a study of the safety of right heart catheterization and the second a study of the effectiveness of a transitional care intervention at a large Midwestern academic medical center.
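For reference, the energy distance between distributions F and G, with independent draws X, X' ~ F and Y, Y' ~ G (notation chosen here for illustration), is

```latex
\mathcal{E}(F, G) \;=\; 2\,\mathbb{E}\|X - Y\| \;-\; \mathbb{E}\|X - X'\| \;-\; \mathbb{E}\|Y - Y'\|,
```

which is nonnegative and equals zero if and only if F = G. Loosely speaking, the weights described above are chosen so that a weighted sample analogue of this quantity, computed between the treatment and control covariate distributions, is small.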
Near real-time monitoring of outbreak transmission dynamics and evaluation of public health interventions are critical to interrupting the spread of the novel coronavirus (SARS-CoV-2) and mitigating morbidity and mortality caused by coronavirus disease (COVID-19). Avoiding or delaying the implementation of blunt transmission mitigation policies, such as stay-at-home orders and school closures, is only sustainable if policy-makers base decisions of whether to relax or intensify mitigation policies on careful monitoring of regional and local transmission dynamics. Formulating a regional mechanistic model of SARS-CoV-2 transmission dynamics and frequently estimating parameters of this model using streaming surveillance data offers one way to accomplish data-driven decision making. For example, to detect an increase in new SARS-CoV-2 infections due to relaxation of previously implemented mitigation measures, one can monitor estimates of the basic and/or effective reproductive number. In addition, frequently updated estimates of SARS-CoV-2 transmission model parameters enable the forecasting of regional critical care demand (e.g., hospital and intensive care unit beds).
However, parameter estimation can be imprecise, and sometimes even impossible, because surveillance data are noisy and not informative about all aspects of the mechanistic model, even for reasonably parsimonious epidemic models.
To overcome this obstacle, at least partially, we propose a Bayesian modeling framework that integrates multiple surveillance data streams. Our model uses both COVID-19 incidence and mortality time series to estimate our model parameters. Importantly, our data generating model for incidence data takes into account changes in the total number of tests performed. As a result, in our model both increases/decreases in testing and increases/decreases in the actual number of infections affect observed case count changes. We apply our Bayesian data integration method to COVID-19 surveillance data collected in Orange County, California. Our results suggest that the California Department of Public Health stay-at-home order, issued on March 19, 2020, lowered the SARS-CoV-2 effective reproductive number in Orange County below 1.0, which means that the order was successful in suppressing SARS-CoV-2 infections. However, subsequent "re-opening" steps took place when thousands of infectious individuals remained in Orange County, so the effective reproductive number increased to approximately 1.0 by mid-June and above 1.0 by mid-July.
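For reference, in a simple SIR-type compartmental model (a simplification of the transmission model used in the talk) the effective reproductive number monitored above takes the form

```latex
R_e(t) \;=\; R_0\, \frac{S(t)}{N},
```

where R_0 is the basic reproductive number, S(t) is the number of susceptible individuals at time t, and N is the population size; sustained transmission declines once R_e(t) falls below 1.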
We investigate the merits of replication, and provide methods that search for optimal designs (including replicates), in the context of noisy computer simulation experiments. We first show that replication offers the potential to be beneficial from both design and computational perspectives, in the context of Gaussian process surrogate modeling. We then develop a lookahead-based sequential design scheme that can determine if a new run should be at an existing input location (i.e., replicate) or at a new one (explore). When paired with a newly developed heteroskedastic Gaussian process model, our dynamic design scheme facilitates learning of signal and noise relationships which can vary throughout the input space. We show that it does so efficiently, on both computational and statistical grounds. In addition to illustrative synthetic examples, we demonstrate performance on two challenging real-data simulation experiments, from inventory management and epidemiology.