Coral bleaching is a major concern for marine ecosystems; more than half of the world's coral reefs have either bleached or died over the past three decades. Increasing sea surface temperatures, along with various spatiotemporal environmental factors, are considered the primary reasons behind coral bleaching. The statistical and machine learning communities have focused on multiple aspects of the environment in detail. However, the literature on various stochastic modeling approaches for assessing coral bleaching is extremely scarce. Data-driven strategies are crucial for effective reef management, and this review article provides an overview of existing statistical and machine learning methods for assessing coral bleaching. Statistical frameworks, including simple regression models, generalized linear models, generalized additive models, Bayesian regression models, spatiotemporal models, and resilience indicators, such as Fisher’s Information and Variance Index, are commonly used to explore how different environmental stressors influence coral bleaching. On the other hand, machine learning methods, including random forests, decision trees, support vector machines, and spatial operators, are more popular for detecting nonlinear relationships, analyzing high-dimensional data, and allowing integration of heterogeneous data from diverse sources. In addition to summarizing these models, we also discuss potential data-driven future research directions, with a focus on constructing statistical and machine learning models in specific contexts related to coral bleaching.
Spatial generalized linear mixed-effects methods are popularly used to model spatially indexed univariate responses. However, with modern technology, it is common to observe vector-valued mixed-type responses, e.g., a combination of binary, count, or continuous types, at each location. Methods that allow joint modeling of such mixed-type multivariate spatial responses are rare. Using latent multivariate Gaussian processes (GPs), we present a class of Bayesian spatial methods that can be employed for any combination of exponential family responses. Since multivariate GP-based methods can suffer from computational bottlenecks when the number of spatial locations is high, we further employ a computationally efficient Vecchia approximation for fast posterior inference and prediction. Key theoretical properties of the proposed model, such as identifiability and the structure of the induced covariance, are established. Our approach employs a Markov chain Monte Carlo-based inference method that utilizes elliptical slice sampling in a blocked Metropolis-within-Gibbs sampling framework. We illustrate the efficacy of the proposed method through simulation studies and a real-data application on joint modeling of wildfire counts and burnt areas across the United States.
The generalized exponential distribution is a well-known probability model in lifetime data analysis and several other research areas, including precipitation modeling. Despite having broad applications for independently and identically distributed observations, its uses as a generalized linear model for non-identically distributed data are limited. This paper introduces a semiparametric Bayesian generalized exponential (GE) regression model. Our proposed approach involves modeling the GE rate parameter within a generalized additive model framework. An important feature is the integration of a principled distance-based prior for the GE shape parameter; this allows the model to shrink to an exponential regression model that retains the advantages of the exponential family. We draw inferences using the Markov chain Monte Carlo algorithm. Extensive simulations demonstrate that the proposed model outperforms simpler alternatives. The Western Ghats mountain range holds critical importance in regulating monsoon rainfall across Southern India, profoundly impacting regional agriculture. Here, we analyze daily wet-day rainfall data for the monsoon months between 1901--2022 for the Northern, Middle, and Southern Western Ghats regions. Applying the proposed model to analyze the rainfall data over 122 years provides insights into model parameters, short-term temporal patterns, and the impact of climate change. We observe a significant decreasing trend in wet-day rainfall for the Southern Western Ghats region.
Assessing the availability of rainfall water plays a crucial role in rainfed agriculture. Given the substantial proportion of agricultural practices in India being rainfed and considering the potential trends in rainfall amounts across years due to climate change, we build a statistical model for analyzing monsoon total rainfall data for 34 meteorological subdivisions of mainland India available for 1951-2014. Here, we model the marginal distributions using a gamma regression model and the dependence through a Gaussian conditional autoregressive (CAR) copula model. Due to the natural variation of the monsoon total rainfall received across various dry through wet regions of the country, we allow the parameters of the marginal distributions to be spatially varying, under a latent Gaussian model framework. The neighborhood structure of the regions determines the dependence structure of both the likelihood and the prior layers, where we explore both CAR and intrinsic CAR structures for the priors. The proposed methodology also effectively imputes the missing data. We use the Markov chain Monte Carlo algorithms to draw Bayesian inferences. In simulation studies, the proposed model outperforms several competitors that do not allow a dependence structure at the data or prior layers. Implementing the proposed method for the Indian areal rainfall dataset, we draw inferences about the model parameters and discuss the potential effect of climate change on rainfall across India. While the assessment of the impact of climate change on rainfall motivates our study, the proposed methodology can be easily adapted to other contexts dealing with non-Gaussian non-stationary areal datasets where data from single or multiple temporal covariates are also available, and it is appropriate to assume their coefficients to be spatially varying.
Analyzing the impact of climate change on rainfall is crucial from the perspectives of agriculture, water resource management, and disaster management. Motivated by the key features present in a gridded rainfall dataset published by the India Meteorological Department (IMD) for the years 1951--2022 at a spatial resolution of 0.25X0.25 degrees across India, we propose a high-dimensional Bayesian skew-Gaussian conditional autoregressive model to analyze spatially-varying temporal patterns in Indian monsoon rainfall. We model the mean component in a semiparametric regression framework with spatially varying coefficients and assume the skewness component to be spatially varying as well. We briefly discuss some theoretical properties of the proposed model. We draw inferences using Gibbs sampling by exploiting the hierarchical definition of a skew-Gaussian distribution as a random location-mixture of a Gaussian distribution. We study the posterior mean and the corresponding T-statistics of the overall change in rainfall amounts during the observational period at each grid cell. Our results indicate a significant negative trend in various parts of the Gangetic River basin, where agriculture is predominantly dependent on the availability of rainwater.
Soil hydraulic properties (SHPs) are fundamental to the accuracy of land-surface, hydrological and agro-hydrological models. Most modelling studies assume static SHPs due to limited information on how climate and anthropogenic factors drive their temporal fluctuations. This study aims to address this gap by investigating the temporal variation of SHPs in a cropping season (rice and wheat) under different fertiliser and irrigation management practices, identifying the key factors controlling this variability, and understanding the resulting impacts on hydrological processes. Experiments were conducted at an agricultural plot at IIT Kanpur, India, located within the Ganga Basin under no-tillage conditions. The temporal variability was analysed in soil organic carbon (OC) content, bulk density (rho_b), saturated hydraulic conductivity (K_sat) and soil water retention curve (SWRC) for the period 2022–2023. The highest variability was observed in K_sat. The maximum change in value for the study duration was observed to be 160% at 25 cm depth. The saturated water content, theta_sat remained consistent across both crops. Trend analysis revealed a significant negative temporal trend in soil OC for both cropping seasons, irrespective of the fertiliser treatment. Plant root depth, irrigation and evapotranspiration significantly influenced SHP variability across all crops and fertiliser treatments. The temporal evolution of SHPs suggests dynamic shifts in infiltration rates, water retention and water redistribution throughout the cropping season. Further, simplified relationships were developed to predict the SHPs at any given time for the two crop seasons, with an R-square up to 0.98. These findings enhance our understanding of the dynamic nature of SHPs and their controlling factors.
This research presents the contribution of Team Yahabe to the EVA (2023) Conference Data Challenge. We tackle the four problems posed by the organizers by revisiting the current and existing literature on conditional univariate and multivariate extremes. We highlight overarching themes linking the four tasks, ranging from model validation at extremely high quantile levels to building customized estimation strategies that leverage model assumptions.
This paper addresses the challenge of obtaining precise demographic information at a fine-grained spatial level, which is necessary for planning localized public services, such as water distribution networks, or understanding local human impacts on the ecosystem. While population sizes are commonly available for large administrative areas, such as wards in India, practical applications often demand knowledge of population density at smaller spatial scales. We explore the integration of alternative data sources, specifically satellite-derived products, including land cover, land use, street density, building heights, vegetation coverage, and drainage density. Using a case study focused on Bangalore City, India, with a ward-level population dataset for 198 wards and satellite-derived sources covering 786,702 pixels at a 30 m × 30 m resolution, we propose a semiparametric Bayesian spatial regression model for obtaining pixel-level population estimates. Given the high dimensionality of the problem, exact Bayesian inference is deemed impractical. We discuss an approximate Bayesian inference scheme based on the recently proposed max-and-smooth approach, a combination of the Laplace approximation and Markov chain Monte Carlo. A simulation study validates the reasonable performance of our inferential approach. Mapping pixel-level estimates to the ward level demonstrates the effectiveness of our method in capturing the spatial distribution of population sizes. While our case study focuses on a demographic application, the methodology developed here can readily be applied to count-type spatial datasets from various scientific disciplines where high-resolution alternative data sources are available.
The dependence in the tails of the joint distribution of two random variables is generally assessed using the chi-measure, which is the limiting conditional probability of one variable being extremely high given that the other variable is also extremely high. This work is motivated by the structural changes in chi-measure between the daily rate of return (RoR) of the two Indian airlines, IndiGo and SpiceJet, during the COVID-19 pandemic. We model the daily maximum and minimum RoR vectors (potentially transformed) using the bivariate Husler-Reiss (BHR) distribution. To estimate the changepoint in the chi-measure of the BHR distribution, we explore two changepoint detection procedures based on the Likelihood Ratio Test (LRT) and Modified Information Criterion (MIC). We obtain critical values and power curves for the LRT and MIC test statistics across a range of chi-measure values, from low to high. We also numerically explore the consistency of the estimators of the change point based on the LRT and MIC. In our data application, for RoR maxima and minima, the most prominent changepoints detected by LRT and MIC are close to the announcement of the first phases of lockdown and unlock, respectively, which are realistic; thus, our study would be beneficial for portfolio optimization in the case of future pandemic situations.
Observations of groundwater pollutants, such as arsenic or Perfluorooctane Sulfonate (PFOS), are often left censored. These measurements have a significant impact on the health and lifestyle of the population. Left censoring of these spatially correlated observations is usually addressed by applying Gaussian processes (GPs), which have theoretical advantages. However, this comes with a challenging computational complexity of O(n^3), impractical for large datasets. Additionally, a sizable proportion of the left-censored data creates further bottlenecks since the likelihood computation now involves an intractable high-dimensional integral of the multivariate Gaussian density. In this article, we address these two problems simultaneously by approximating the GP with a Gaussian Markov random field (GMRF) approach that leverages an explicit connection between a GP with a Matern correlation function and a GMRF using stochastic partial differential equations (SPDEs). We introduce a GMRF-based measurement error into the model, which alleviates the likelihood computation for the censored data, drastically improving the computational speed while maintaining admirable accuracy. Our approach demonstrates robustness and substantial computational scalability compared to state-of-the-art methods for censored spatial responses across various simulation settings. Finally, the fit of this fully Bayesian model to the concentration of PFOS in groundwater, available at 24,959 sites across California, where 46.62% of the responses are censored, produces a prediction surface and uncertainty quantification in real-time, thereby substantiating the applicability and scalability of the proposed method. Code for implementation is made available via GitHub.
Increasingly large and complex spatial datasets pose massive inferential challenges due to high computational and storage costs. Our study is motivated by the KAUST Competition on Large Spatial Datasets 2023, which tasked participants with estimating spatial covariance-related parameters and predicting values at testing sites, along with uncertainty estimates. We compared various statistical and deep learning approaches through cross-validation and ultimately selected the Vecchia approximation technique for model fitting. To overcome the constraints in the R package GpGp, which lacked support for fitting zero-mean Gaussian processes and direct uncertainty estimation — two features necessary for the competition —we developed additional R functions. Besides, we implemented certain subsampling-based approximations and parametric smoothing for skewed sampling distributions of the estimators. Our team secured first place in two out of four sub-competitions and second place in the other two, validating the effectiveness of our proposed strategies. Moreover, we extended our evaluation to a large real spatial satellite-derived dataset on total precipitable water, where we compared the predictive performances of different models using multiple diagnostics.
Statistical modeling of a nonstationary spatial extremal dependence structure is challenging. Max-stable processes are common choices for modeling spatially indexed block maxima, where the assumption of stationarity is typically made to facilitate inference. However, this assumption is often unrealistic for data observed over a large or complex domain. We propose a computationally efficient method for estimating extremal dependence using a globally nonstationary, but locally stationary, max-stable process by exploiting nonstationary kernel convolutions. We divide the spatial domain into a fine grid of subregions, assign each of them its own set of dependence parameters, and use LASSO (L1) or ridge (L2) penalties to obtain spatially smooth parameter estimates. We then develop a novel data-driven algorithm to merge homogeneous neighboring subregions. The algorithm facilitates model parsimony and interpretability. To make our model suitable for high-dimensional data, we exploit a pairwise likelihood to draw inferences and discuss computational and statistical efficiency. An extensive simulation study demonstrates the superior performance of our proposed model and the subregion-merging algorithm over the approaches that either do not model nonstationarity or do not update the domain partition. We apply our proposed method to model monthly maximum temperatures at over 1,400 sites in Nepal and the surrounding Himalayan and sub-Himalayan regions. We again observe significant improvements in model fit compared to a stationary process and a nonstationary process without subregion merging. Furthermore, we demonstrate that the estimated merged partition is interpretable from a geographic perspective and leads to better model diagnostics by adequately reducing the number of subregion-specific parameters.
Wildfires pose a significant threat to both the ecosystem and the economy. The related risk assessment is typically conducted based on fire danger indexes; the McArthur Forest Fire Danger Index (FFDI) is widely used in Australia, for example, to assess the impact of weather on fire behavior. Studying the spatial tail dependence structure of high-resolution FFDI data is crucial for estimating the chances of wildfires. Multivariate Pareto distributions have been widely used for modeling the simultaneous occurrence of extreme events above a high threshold. However, the existing likelihood-based inference approaches are computationally prohibitive in high dimensions, due to the need to censor observations in the bulk of the distribution. Here we construct models for spatial FFDI extremes by exploiting the sparse conditional independence structure of Husler–Reiss-type generalized Pareto processes defined on trees. Such models enable a simplified likelihood function that can be computed more efficiently. While a single tree-based model cannot capture complex spatial dependence flexibly, our modeling framework is based on a mixture of tree-based multivariate Pareto distributions, which allow for randomly generated tree structures. The resulting model is flexible and allows a nonstationary spatial dependence structure. We fit the model to summer (November, December, January, and February) FFDI data across different non-overlapping spatial clusters within Mainland Australia and 14 decadal windows between 1999-2022 to study local spatiotemporal variability. The results show that our proposed method fits the margins and spatial tail dependence structure reasonably well.
Statistical modeling of rainfall data is an active research area in agro-meteorology. The most common models fitted to such datasets are exponential, gamma, log-normal, and Weibull distributions. As an alternative to some of these models, the generalized exponential (GE) distribution was proposed by Gupta and Kundu (2001, Exponentiated Exponential Family: An Alternative to Gamma and Weibull Distributions, Biometrical Journal). Rainfall (specifically for short periods) datasets often include outliers, and thus, a proper robust parameter estimation procedure is necessary. Here, we use the popular minimum density power divergence estimation (MDPDE) procedure developed by Basu et al. (1998, Robust and Efficient Estimation by Minimising a Density Power Divergence, Biometrika) for estimating the GE parameters. We derive the analytical expressions for the estimating equations and asymptotic distributions. We analytically compare MDPDE with maximum likelihood estimation in terms of robustness, through an influence function analysis. Besides, we study the asymptotic relative efficiency of MDPDE analytically for different parameter settings. We apply the proposed technique to some simulated datasets and two rainfall datasets from Texas, United States. The results indicate superior performance of MDPDE compared to the other existing estimation techniques in most of the scenarios.
Various natural phenomena exhibit spatial extremal dependence at short spatial distances. However, existing models proposed in the spatial extremes literature often assume that extremal dependence persists across the entire domain. This is a strong limitation when modeling extremes over large geographical domains, and yet it has been mostly overlooked in the literature. We develop a more realistic Bayesian framework here, based on a novel Gaussian scale mixture model. The Gaussian process component is defined by a stochastic partial differential equation, yielding a sparse precision matrix, and the random scale component is modeled as a low-rank Pareto-tailed or Weibull-tailed spatial process, determined by compactly supported basis functions. We demonstrate that our proposed model is approximately tail-stationary and can capture a wide range of extremal dependence structures. Its inherently sparse structure allows for fast Bayesian computations in high spatial dimensions, based on a customized Markov chain Monte Carlo algorithm that prioritizes calibration in the tail. We fit our model to analyze heavy rainfall data from the monsoon season in Bangladesh. Our study demonstrates that our model outperforms natural competitors and accurately captures precipitation extremes. We finally use the fitted model to draw inferences on long-term return levels for marginal precipitation and spatial aggregates.
Statistical modeling of monthly, seasonal, or annual rainfall data is an important research area in meteorology. These models play a crucial role in rainfed agriculture, where a precise assessment of future rainfall availability is necessary. The rainfall amount during a rainy month or a whole rainy season can take any positive value and some simple (one or two-parameter) probability models supported over the positive real line that are generally used for rainfall modeling are exponential, gamma, Weibull, lognormal, Pearson Type-V/VI, log-logistic, etc., where the unknown model parameters are routinely estimated using the maximum likelihood estimator (MLE). However, the presence of outliers or extreme observations is a common issue in rainfall data, and the MLEs being highly sensitive to them often leads to spurious inference. Here, we discuss a robust parameter estimation approach based on the minimum density power divergence estimator (MDPDE). We fit the above four parametric models to the detrended, areally weighted, monthly rainfall data from the 36 meteorological subdivisions of India for the years 1951–2014 and compare the fits based on MLE and the proposed optimum MDPDE. The superior performance of MDPDE is showcased in several cases. For all month-subdivision combinations, we discuss the best-fit models and median rainfall amounts.
Motivated by the Extreme Value Analysis 2021 (EVA 2021) data challenge, we propose a method based on statistics and machine learning for the spatial prediction of extreme wildfire frequencies and sizes. This method is specifically designed to handle large datasets, including those with missing observations. Our approach relies on a four-stage, bivariate, sparse spatial model for high-dimensional zero-inflated data, which we develop using stochastic partial differential equations (SPDEs), allowing for sparse precision matrices in the latent processes. In Stage 1, the observations are separated into zero/nonzero categories and modeled using a two-layered hierarchical Bayesian sparse spatial model to estimate the probabilities of these two categories. In Stage 2, we first obtain empirical estimates of the spatially varying mean and variance profiles across the spatial locations for the positive observations and then smooth those estimates using fixed-rank kriging. This approximate Bayesian inference method is employed to mitigate the high computational burden associated with modeling large spatial datasets using spatially varying coefficients. In Stage 3, we further model the standardized log-transformed positive observations from the second stage using a sparse bivariate spatial Gaussian process. The Gaussian distribution assumption for wildfire counts developed in the third stage is computationally effective but erroneous. Thus, in Stage 4, the predicted exceedance probabilities are post-processed using Random Forests. We draw posterior inference for Stages 1 and 3 using Markov chain Monte Carlo (MCMC) sampling. We then create a cross-validation scheme for the artificially generated gaps and compare the EVA 2021 prediction scores of the proposed model to those obtained using some competitors.