Speaker: Dr. Hao Zhang (Department of Statistics, Purdue University)
Location: PGH 646A
Title: Gaussian Process for Spatial Big Data
Abstract: Gaussian processes are fundamental to spatial statistics and machine learning. One main objective in the study of Gaussian processes is prediction based on a partial realization. The best linear unbiased prediction, or kriging, is now applied in several disciplines including engineering, ecology, and environmental and climate studies, and its application has extended to new areas such as computer simulation and uncertainty quantification. Although kriging requires only the first two moments and no distributional assumptions, Gaussian process theory provides useful tools for studying the properties of kriging. Some fascinating theory for kriging has been developed that provides insight into kriging results. In recent years, a focus area in Gaussian process research has been computational methods that can accommodate big data. Most of these methods are based on some kind of approximation to the underlying process. I will review some recent theoretical results to show that low-rank models may be our only choice to overcome some inherent issues in working with a large spatial matrix. However, existing low-rank models are known to have undesirable properties that limit the performance of the approximation. New low-rank models could overcome these limitations.
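As a rough illustration of the kriging predictor mentioned in the abstract, the following minimal sketch computes a zero-mean simple kriging prediction on a line. It is not taken from the talk: the exponential covariance, the helper names (`exp_cov`, `solve`, `simple_krige`), and the toy data are all assumptions chosen for illustration. It uses only a mean (assumed zero) and a covariance function, in line with the point that kriging needs only the first two moments.

```python
import math

def exp_cov(d, sigma2=1.0, rho=1.0):
    """Exponential covariance; an illustrative choice, not from the talk."""
    return sigma2 * math.exp(-abs(d) / rho)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def simple_krige(sites, values, s0, cov=exp_cov):
    """Zero-mean simple kriging at s0: the weights w solve K w = k0."""
    K = [[cov(a - b) for b in sites] for a in sites]   # covariances among data
    k0 = [cov(a - s0) for a in sites]                  # covariances data vs. target
    w = solve(K, k0)
    pred = sum(wi * zi for wi, zi in zip(w, values))
    var = cov(0.0) - sum(wi * ki for wi, ki in zip(w, k0))  # kriging variance
    return pred, var

# Hypothetical toy data: three observation sites on a line.
sites = [0.0, 1.0, 2.5]
values = [0.3, -0.1, 0.8]
pred_new, var_new = simple_krige(sites, values, 1.7)   # unobserved site
pred_obs, var_obs = simple_krige(sites, values, 1.0)   # observed site
```

At an observed site the predictor reproduces the datum and the kriging variance drops to zero (up to rounding). The dense solve costs on the order of n^3 operations for n sites, which is the kind of bottleneck that motivates approximations for big spatial data.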
Bio: Hao Zhang is Professor of Statistics at Purdue University. He is a Fellow of the American Statistical Association and an Elected Member of the International Statistical Institute. He has served on the editorial boards of the Journal of the American Statistical Association, Statistica Sinica, Environmetrics, and Statistics & Probability Letters. His research interests are primarily in spatial and spatio-temporal statistics. His work includes both theoretical investigation into the asymptotic properties of inferential methods for spatial data and the development of algorithms for the analysis of big spatial data. He collaborates with researchers in ecology, environmental sciences, climatology, and natural resources.
Speaker: Dr. Huiyan Sang (Department of Statistics, Texas A&M University)
Location: PGH 646A
Title: Bayesian Graphical Decision Tree Boosting for Nonparametric Machine Learning Regression
Abstract: Ensemble decision tree methods such as XGBoost, Random Forest, and Bayesian additive regression trees have gained great popularity as flexible tools for nonparametric function estimation and modeling. Most existing ensemble decision tree models rely on decision tree weak learners with axis-parallel univariate split rules to partition the Euclidean feature space into rectangular regions. In practice, however, many regression problems involve features with multivariate structures (e.g., spatial locations) possibly lying on a manifold, where rectangular partitions may fail to respect the irregular intrinsic geometry and boundary constraints of the structured feature space. In this paper, we develop a new class of Bayesian additive multivariate decision tree models, which combine, in each weak learner, univariate split rules for handling possibly high-dimensional features without known multivariate structures and novel multivariate split rules for features with multivariate structures. The proposed multivariate split rules are built upon stochastic predictive spanning tree bipartition models on reference knots, which are capable of achieving highly flexible nonlinear decision boundaries on manifold feature spaces while enabling efficient dimension reduction computations. We demonstrate the superior performance of the proposed method using simulation data and a Sacramento housing price data set.
Speaker: Dr. Victor De Oliveira (Department of Management Science and Statistics, UT San Antonio)
Location: PGH 646A
Title: On Information About Covariance Parameters in Gaussian Matérn Random Fields
Abstract: The Matérn family of covariance functions is currently the most commonly used for the analysis of geostatistical data due to its ability to describe different smoothness behaviors. Yet, in many applications the smoothness parameter is set at an arbitrary value. This practice is due partly to computational challenges faced when attempting to estimate all covariance parameters and partly to unqualified claims in the literature stating that geostatistical data have little or no information about the smoothness parameter. This work critically investigates this claim and shows it is not true in general. Specifically, it is shown that the information the data have about the correlation parameters varies substantially depending on the true model and sampling design and, in particular, the information about the smoothness parameter can be large, in some cases larger than the information about the range parameter. In light of these findings, we suggest reassessing the aforementioned practice and instead basing inferences on data-based estimates of both the range and smoothness parameters, especially for strongly dependent non-smooth processes observed on irregular sampling designs. A data set of daily rainfall totals is used to motivate the discussion and gauge this common practice.
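To make the role of the smoothness parameter concrete, here is a small sketch of the Matérn covariance for the three half-integer smoothness values (0.5, 1.5, 2.5) at which it has a simple closed form. It is an illustration only, not code from the talk: the function name `matern`, its arguments, and the parameterization (variance `sigma2`, range `rho`, with the square-root-of-2-nu scaling) are assumptions, and other parameterizations are in common use. General smoothness values require a modified Bessel function and are not covered here.

```python
import math

def matern(d, nu, sigma2=1.0, rho=1.0):
    """Matern covariance at distance d for nu in {0.5, 1.5, 2.5}.

    Assumed parameterization: sigma2 is the variance and rho the range;
    other conventions rescale the range differently.
    """
    r = abs(d) / rho
    if nu == 0.5:                      # exponential covariance (roughest case)
        return sigma2 * math.exp(-r)
    if nu == 1.5:
        a = math.sqrt(3.0) * r
        return sigma2 * (1.0 + a) * math.exp(-a)
    if nu == 2.5:
        a = math.sqrt(5.0) * r
        return sigma2 * (1.0 + a + a * a / 3.0) * math.exp(-a)
    raise ValueError("closed form implemented only for nu in {0.5, 1.5, 2.5}")

# Correlation at a short distance for each smoothness value, same range.
short_range = {nu: matern(0.1, nu) for nu in (0.5, 1.5, 2.5)}
```

With the range held fixed, larger smoothness values keep the correlation closer to 1 at short distances. That difference in behavior near the origin is what closely spaced observations can pick up, which may help explain why information about the smoothness parameter depends on the sampling design.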
Speaker: Dr. Guanyu Hu (Department of Statistics, University of Missouri)
Location: PGH 646A
Title: What could statistics offer for sports analytics?
Abstract: Sports analytics is the application of data science to decision-making in all aspects of sports. As important as player and team performance evaluations are, there is also a wide range of sports analytics problems beyond this category. In this talk, I will describe how statistics can help sports industry practitioners learn more from their data. I will mainly focus on two interesting statistical problems: heterogeneity and causality. The first topic of today’s talk is heterogeneity learning of NBA players. A Bayesian nonparametric matrix clustering approach is proposed to analyze the latent heterogeneity structure in the shot selection data collected from professional basketball players in the National Basketball Association (NBA). The proposed method adopts a mixture of finite mixtures framework and fully utilizes the spatial information via a mixture of matrix normal distributions representation. In the second part of today’s talk, I will discuss a novel causal inference approach to study the causal effect of home field advantage in the English Premier League. A hierarchical causal model is developed to show that both league-level and team-level causal effects are identifiable and can be conveniently estimated.