Statistics Seminar

  Department of Mathematics, University of Houston

Since Fall 2023, Dr. Yabo Niu has maintained the statistics seminar page: link


Past seminars:

Spring 2023 Schedule

Speaker: Dr. Raymond Wong (Department of Statistics, TAMU)  website 

Location: PGH 646A

Title: Balancing Weights for Offline Reinforcement Learning

Abstract: Offline policy evaluation (OPE) is considered a fundamental and challenging problem in reinforcement learning (RL). In this talk, I will focus on the value estimation of a target policy based on pre-collected data generated from a possibly different policy, under the framework of infinite-horizon Markov decision processes. I will discuss a novel estimator with approximately projected state-action balancing weights for the policy value estimation. These weights are motivated by the marginal importance sampling method in RL and the covariate balancing idea in causal inference. Corresponding asymptotic convergence will be presented. Our results scale with both the number of trajectories and the number of decision points at each trajectory. As such, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, I will introduce a necessary and sufficient condition for establishing the well-posedness of the Bellman operator in the off-policy setting, which characterizes the difficulty of OPE.

Speaker: Dr. Sebastian Lerch (Karlsruhe Institute of Technology) website

Location: Zoom (email me for the Zoom link)

Title: Deep learning models for post-processing ensemble weather forecasts

Abstract: Ensemble weather predictions require statistical postprocessing of systematic errors to obtain reliable and accurate probabilistic forecasts. Traditionally, this is accomplished with distributional regression models in which the parameters of a predictive distribution are estimated from a training period. In this talk, I will discuss various aspects of designing, estimating and evaluating distributional regression models utilizing recent advances from machine learning. In particular, modern deep learning methods offer various advantages over standard parametric distributional regression models. For example, distributional regression models based on neural networks make it possible to flexibly model nonlinear relations between arbitrary predictor variables and forecast distribution parameters that are automatically learned in a data-driven way rather than requiring prespecified link functions. The methodological developments will be illustrated by a comprehensive review and systematic comparison of eight statistical and machine learning methods for probabilistic wind gust forecasting via ensemble postprocessing. The methods are systematically compared using 6 years of data from a high-resolution, convection-permitting ensemble prediction system that was run operationally at the German weather service, and hourly observations at 175 surface weather stations in Germany. While all postprocessing methods yield calibrated forecasts and are able to correct the systematic errors of the raw ensemble predictions, incorporating information from additional meteorological predictor variables beyond wind gusts leads to significant improvements in forecast skill. In particular, we propose a flexible framework of locally adaptive neural networks with different probabilistic forecast types as output, which not only significantly outperform all benchmark postprocessing methods but also learn physically consistent relations associated with the diurnal cycle, especially the evening transition of the planetary boundary layer.

Speaker: Dr. Xianyang Zhang (Department of Statistics, TAMU)  website

Location: PGH 646A

Title: Change-point Detection: Computation and Statistical Inference

Abstract: Change-point analysis is concerned with detecting and locating structural breaks in the underlying model of a data sequence. It finds an abundance of applications in a wide variety of fields, for example, bioinformatics, finance, and engineering. This talk provides an overview of two different change-point detection frameworks in the literature. The first approach is based on minimizing a cost function over possible numbers and locations of change points. Such an approach requires finding the cost value repeatedly over different segments of the data set, which can be time-consuming. To tackle this issue, we introduce a new method based on sequential gradient descent to find the cost value accurately and efficiently. The core idea is to update the cost value using the information from previous steps without re-optimizing the objective function. Numerical studies show that the new approach can be orders of magnitude faster than the Pruned Exact Linear Time method without sacrificing estimation accuracy. The second approach combines two-sample hypothesis testing with segmentation techniques. A particular challenge within this framework is dealing with the high dimensionality of the data and the nonparametric nature of the structural breaks. We develop a new methodology to detect structural breaks in the distributions of a sequence of high-dimensional observations. We show that the new approach is more efficient than the existing methods.
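
The cost-function framework mentioned in this abstract can be illustrated with a minimal sketch. It is not the speaker's sequential gradient descent method, nor the Pruned Exact Linear Time method; it assumes a single change point in the mean and a squared-error segment cost, and the function names are hypothetical. It scans every candidate location and keeps the one with the smallest total cost.

```python
from typing import List, Tuple


def segment_cost(prefix: List[float], prefix_sq: List[float], start: int, end: int) -> float:
    """Squared-error cost of data[start:end]: sum of squared deviations from the segment mean."""
    n = end - start
    if n <= 0:
        return 0.0
    total = prefix[end] - prefix[start]
    total_sq = prefix_sq[end] - prefix_sq[start]
    return total_sq - total * total / n


def best_single_change_point(data: List[float]) -> Tuple[int, float]:
    """Return (tau, cost), where splitting into data[:tau] and data[tau:] minimizes the total cost.

    Assumption: exactly one change point, so no penalty on the number of changes is needed.
    """
    if len(data) < 2:
        raise ValueError("need at least two observations")
    # Prefix sums let each segment cost be evaluated in constant time.
    prefix = [0.0]
    prefix_sq = [0.0]
    for x in data:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)
    n = len(data)
    best_tau, best_cost = 1, float("inf")
    for tau in range(1, n):
        cost = segment_cost(prefix, prefix_sq, 0, tau) + segment_cost(prefix, prefix_sq, tau, n)
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    return best_tau, best_cost


if __name__ == "__main__":
    series = [0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0]
    print(best_single_change_point(series))
```

With several change points, the same cost has to be evaluated over many more segments, which is the computational burden the abstract refers to.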

Speaker: Dr. Yang Ni (Department of Statistics, TAMU)  website 

Location: PGH 646A

Title: Causal Graphical Models for Discovering Gene Regulations

Abstract: I will present several causal graphical models for discovering gene regulations from observational genomic data in an exploratory fashion. Our methods are specifically tailored to common features of genomic data including high levels of noise, high skewness, zero-inflation, sample heterogeneity, feedback loops, and the presence of unmeasured confounders. Our theories show that causal structure is identifiable under all the presented causal graphical models with purely observational data. I will provide intuition as to why causality is identifiable under different scenarios and demonstrate the practical utility using multiple real datasets with known causal structure.

Speaker: Dr. Xi Luo (School of Public Health, The University of Texas Health Science Center at Houston)

Location: PGH 646A

Title: TBA

Abstract: TBA

Fall 2022

Speaker: Dr. Hao Zhang (Department of Statistics, Purdue University)  website 

Location: PGH 646A

Title: Gaussian Process for Spatial Big Data


Abstract: Gaussian processes are fundamental to spatial statistics and machine learning. One main objective in the study of Gaussian processes is prediction based on a partial realization. The best linear unbiased prediction, or kriging, is now applied in several disciplines including engineering, ecology, environmental and climate studies, and the application has extended to new areas such as computer simulation and uncertainty quantification. Although kriging only requires the first two moments and no distributional assumptions, Gaussian process theory provides useful tools to study the properties of kriging. Some fascinating theory for kriging has been developed that provides insight into kriging results. In recent years, a focus area in Gaussian processes has been computational methods that can accommodate big data. Most of the methods are based on some kind of approximation to the underlying process. I will review some recent theoretical results to show that low-rank models may be our only choice to overcome some inherent issues in working with a large spatial matrix. However, existing low-rank models are known to have undesirable properties that limit the performance of the approximation. New low-rank models could overcome these limitations.


Bio: Hao Zhang is Professor of Statistics at Purdue University. He is a Fellow of the American Statistical Association and an Elected Member of the International Statistical Institute. He has served on the editorial boards of the Journal of the American Statistical Association, Statistica Sinica, Environmetrics, and Statistics & Probability Letters. His research interests are primarily in spatial and spatio-temporal statistics. His work includes both theoretical investigation into asymptotic properties of inferential methods for spatial data and development of algorithms for the analysis of big spatial data. He collaborates with researchers in ecology, environmental sciences, climatology, and natural resources.



Speaker: Dr. Huiyan Sang (Department of Statistics, Texas A&M University)  website 

Location: PGH 646A

Title: Bayesian Graphical Decision Tree Boosting for Nonparametric Machine Learning Regression  

Abstract: Ensemble decision tree methods such as XGBoost, Random Forest, and Bayesian additive regression trees have gained great popularity as flexible nonparametric function estimation and modeling tools. Most existing ensemble decision tree models rely on decision tree weak learners with axis-parallel univariate split rules to partition the Euclidean feature space into rectangular regions. In practice, however, many regression problems involve features with multivariate structures (e.g., spatial locations) possibly lying in a manifold, where rectangular partitions may fail to respect irregular intrinsic geometry and boundary constraints of the structured feature space. In this paper, we develop a new class of Bayesian additive multivariate decision tree models, which combine univariate split rules for handling possibly high-dimensional features without known multivariate structures and novel multivariate split rules for features with multivariate structures in each weak learner. The proposed multivariate split rules are built upon stochastic predictive spanning tree bipartition models on reference knots, which are capable of achieving highly flexible nonlinear decision boundaries on manifold feature spaces while enabling efficient dimension reduction computations. We demonstrate the superior performance of the proposed method using simulation data and a Sacramento housing price data set.

Seminar recording

Speaker: Dr. Victor De Oliveira (Department of Management Science and Statistics, UT San Antonio)

Location: PGH 646A


Abstract: The Matérn family of covariance functions is currently the most commonly used for the analysis of geostatistical data due to its ability to describe different smoothness behaviors. Yet, in many applications the smoothness parameter is set at an arbitrary value. This practice is due partly to computational challenges faced when attempting to estimate all covariance parameters and partly to unqualified claims in the literature stating that geostatistical data have little or no information about the smoothness parameter. This work critically investigates this claim and shows it is not true in general.

Specifically, it is shown that the information the data have about the correlation parameters varies substantially depending on the true model and sampling design and, in particular, the information about the smoothness parameter can be large, in some cases larger than the information about the range parameter. In light of these findings, we suggest reassessing the aforementioned practice and instead basing inferences on data-based estimates of both range and smoothness parameters, especially for strongly dependent non-smooth processes observed on irregular sampling designs. A data set of daily rainfall totals is used to motivate the discussion and gauge this common practice.
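
To make the roles of the range and smoothness parameters concrete, here is a minimal sketch of the Matérn correlation function for the three smoothness values that have simple closed forms (0.5, 1.5 and 2.5). The parameterization, with distance scaled as sqrt(2 * smoothness) * distance / range, is one common convention and is an assumption here; the abstract does not say which parameterization the speaker uses, and the function name is hypothetical.

```python
import math


def matern_correlation(distance: float, range_param: float, smoothness: float) -> float:
    """Matérn correlation at a given distance for smoothness 0.5, 1.5 or 2.5.

    Assumed parameterization: the scaled distance is sqrt(2 * smoothness) * distance / range_param.
    """
    if distance < 0 or range_param <= 0:
        raise ValueError("distance must be non-negative and range_param positive")
    scaled = distance / range_param
    if smoothness == 0.5:
        # Exponential correlation: the process is continuous but not differentiable.
        return math.exp(-scaled)
    if smoothness == 1.5:
        a = math.sqrt(3.0) * scaled
        return (1.0 + a) * math.exp(-a)
    if smoothness == 2.5:
        a = math.sqrt(5.0) * scaled
        return (1.0 + a + a * a / 3.0) * math.exp(-a)
    raise ValueError("only smoothness values 0.5, 1.5 and 2.5 are handled in this sketch")


if __name__ == "__main__":
    for nu in (0.5, 1.5, 2.5):
        print(nu, [round(matern_correlation(d, 1.0, nu), 4) for d in (0.0, 0.5, 1.0, 2.0)])
```

Fixing the smoothness in advance amounts to choosing one of these branches before looking at the data, which is the practice the abstract questions.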

Seminar recording

Speaker: Dr. Guanyu Hu (Department of Statistics, University of Missouri)

Location: PGH 646A

Title: What could statistics offer for sports analytics?

Abstract: Sports analytics are applications of data science to decision-making in all aspects of sports. As important as player/team performance evaluations are, there is also a wide range of sports analytics problems beyond this category. In this talk, I will describe how statistics can help sports industry practitioners learn more from their data. I will mainly focus on two interesting statistical problems: heterogeneity and causality. The first topic of today’s talk is heterogeneity learning of NBA players. A Bayesian nonparametric matrix clustering approach is proposed to analyze the latent heterogeneity structure in the shot selection data collected from professional basketball players in the National Basketball Association (NBA). The proposed method adopts a mixture of finite mixtures framework and fully utilizes the spatial information via a mixture of matrix normal distributions representation. In the second part of today’s talk, I will discuss a novel causal inference approach to study the causal effect of home field advantage in the English Premier League. A hierarchical causal model is developed to show that both league-level and team-level causal effects are identifiable and can be conveniently estimated.