Research
Funding
My research threads have been funded by the National Institutes of Health (R01 grant, 1st percentile score), the National Science Foundation, and the Office of Naval Research, with me as PI/co-PI.
Research Codes
Code for many of my papers is available on my GitHub page (Raj Guhaniyogi GitHub).
Spatial and Spatial-temporal Modeling
Recent advances in Geographical Information Systems (GIS) and other measurement technologies, together with falling storage costs, have steadily increased the size and complexity of spatially geocoded data. While classical geospatial models based on Gaussian processes are ideal for capturing complex local variation in spatial data, they often scale poorly to large datasets because of their massive computation and storage requirements. The issue is further exacerbated for Bayesian estimation of Gaussian process models using Markov chain Monte Carlo (MCMC). I was introduced to this problem early in my PhD through first-hand collaboration with environmental and soil scientists on the spatial and space-time analysis of massive remote sensing data. I have since been developing computationally efficient low-rank Gaussian processes, their multiscale variants, and distributed inference with Gaussian processes to offer general solutions to this problem.
Guhaniyogi, R., Finley, A.O., Banerjee, S. and Gelfand, A.E. (2011). Adaptive Gaussian predictive process models for large spatial datasets. Environmetrics, 22, 997-1007.
Guhaniyogi, R., Finley, A.O., Banerjee, S. and Kobe, R. (2013). Modeling Low-rank Spatially-Varying Cross-Covariances using Predictive Process with Application to Soil Nutrient Data. Journal of Agricultural, Biological and Environmental Statistics, 18, 274-298.
Guhaniyogi, R. (2017). Multivariate Bias Adjusted Tapered Predictive Process Models. Spatial Statistics, 21, 42-65.
Guhaniyogi, R. and Banerjee, S. (2018). Meta-Kriging: Scalable Bayesian Modeling and Inference for Massive Spatial Datasets. Technometrics, 60(4), 430-444.
Guhaniyogi, R. and Banerjee, S. (2019). Multivariate Spatial Meta Kriging. Statistics and Probability Letters, 144, 3-8.
Heaton, M.J., Datta, A., Finley, A., Furrer, R., Guhaniyogi, R., Gerber, F., Gramacy, R. B., Hammerling, D., Katzfuss, M., Lindgren, F., Nychka, D. W., Sun, F. and Mangion, A. Z. (in alphabetical order) (2019). Methods for Analyzing Large Spatial Data: A Review and Comparison. Journal of Agricultural, Biological and Environmental Statistics, 24, 398-425.
Guhaniyogi, R. and Sanso, B. (2019). Large Multiscale Spatial Modeling using Tree Shrinkage Priors. Statistica Sinica, 30, 2023-2050.
Guhaniyogi, R., Baracaldo, L. and Banerjee, S. (2023+) Bayesian Data Sketching for Spatial Regression Models. Revision Requested, Journal of Machine Learning Research, Available at
Distributed Bayesian Inference for Stochastic Process Modeling of Massive Structured Data
The last decade has seen increasing interest in developing practically efficient and theoretically optimal high-dimensional and nonparametric Bayesian methods that model the complex dependencies in massive structured datasets (e.g., spatial data, temporal data, higher-order functional data) and precisely characterize uncertainty in the underlying stochastic processes. Most of this effort, however, has been devoted to novel modeling approaches, and the literature has largely operated within a centralized data processing framework in which all data are stored and analyzed on a single processor. Many practical considerations instead require data to be analyzed in a decentralized manner, without any communication between the analyses of different subsets. This scenario arises mainly in my collaboration with environmental scientists on efficient storage and computation for massive data, but I have also encountered it in collaborations with multiple groups at national laboratories that collect data on similar characteristics but cannot share data outside their organization. In such cases, it is crucial to adopt a strategy of "federated" or "distributed" learning with complex stochastic process models for structured data. Most ML strategies on this topic are suboptimal for uncertainty quantification and lack theoretical guarantees. This thread addresses that gap by offering a general distributed Bayesian algorithm for stochastic process models with dependent data.
Guhaniyogi, R. and Banerjee, S. (2018). Meta-Kriging: Scalable Bayesian Modeling and Inference for Massive Spatial Datasets. Technometrics, 60(4), 430-444.
Guhaniyogi, R. and Banerjee, S. (2019). Multivariate Spatial Meta Kriging. Statistics and Probability Letters, 144, 3-8.
Heaton, M.J., Datta, A., Finley, A., Furrer, R., Guhaniyogi, R., Gerber, F., Gramacy, R. B., Hammerling, D., Katzfuss, M., Lindgren, F., Nychka, D. W., Sun, F. and Mangion, A. Z. (in alphabetical order) (2019). Methods for Analyzing Large Spatial Data: A Review and Comparison. Journal of Agricultural, Biological and Environmental Statistics, 24, 398-425.
Baracaldo, L. and Guhaniyogi, R. (2021). Spatial Meta Kriging for Distributed Inference with Multivariate Spatial Generalized Linear Models for Binary Response. Journal of the Indian Statistical Association (Special Issue on Spatio-temporal Statistics), 59(2), 1-14.
Guhaniyogi, R., Li, C., Savitsky, T. and Srivastava, S. (2022). Distributed Bayesian Varying Coefficient Modeling Using a Gaussian Process Prior. Journal of Machine Learning Research, 23(84), 1-59.
Guhaniyogi, R., Li, C., Savitsky, T. and Srivastava, S. (2023). Distributed Bayesian Inference in Massive Spatial Data. Statistical Science, 38(2), 262-284.
Andros, J., Guhaniyogi, R., Francom, D. C., Pasqualini, D. (2023+). Use of Data Sketching for Robust Distributed Bayesian Inference with Large Functional Data. Under Review, Available at
Guhaniyogi, R. (2024). Strategies for Distributed Bayesian Inference with Independent and Correlated Data. Accepted, Wiley StatsRef: Statistical References Online, Available at
Bayesian Regression with Non-Euclidean Objects
Collaboration with neuroscientists sparked my interest in developing regression methods with non-Euclidean objects, such as tensors or graphs. Consequently, my group developed the first Bayesian regression frameworks with tensor-valued predictors, as well as with tensor-valued responses. Of late, neuroimaging data from multiple imaging modalities (e.g., fMRI, DTI, PET) have opened the possibility of integrating information from different sources to study neurodegenerative disorders such as Alzheimer's. To this end, my group has been actively developing regression methods for diverse objects with different but connected topologies. This is a ripe yet largely unexplored area of research that is likely to see many exciting developments.
Guhaniyogi, R., Qamar, S. and Dunson, D.B. (2017). Bayesian Tensor Regression. Journal of Machine Learning Research, 18, 1-31.
Guhaniyogi, R. (2017). Convergence Rate of Bayesian Supervised Tensor Modeling with Multiway Shrinkage Priors. Journal of Multivariate Analysis, 160, 157-168.
Guhaniyogi, R. and Rodriguez, A. (2020). Joint Modeling of Longitudinal Relational Data and Exogenous Variables. Bayesian Analysis, 15 (2), 477-503.
Guhaniyogi, R. (2020). Bayesian Methods for Tensor Regressions. Wiley StatsRef: Statistical References Online, https://doi.org/10.1002/9781118445112.stat08272
Guha, S. and Guhaniyogi, R. (2020). Bayesian Generalized Sparse Symmetric Tensor-on-Vector Regression. Technometrics, 63(2), 160-170.
Guhaniyogi, R. (2020). High Dimensional Bayesian Regularization in Regressions Involving Symmetric Tensors. Proceedings of 18th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, doi: 10.1007/978-3-030-50153-2_26
Spencer, D., Guhaniyogi, R. and Prado, R. (2020). Bayesian Mixed Effect Sparse Tensor Response Regression Model with Joint Estimation of Activation and Connectivity. Psychometrika, 85(4), 845-869.
Guhaniyogi, R. and Spencer, D. (2021). Bayesian Tensor Response Regression with an Application to Brain Activation Studies. Bayesian Analysis, 16(4): 1221-1249.
Spencer, D., Guhaniyogi, R., Shinohara, R. and Prado, R. (2023+). Bayesian Scalar-on-Tensor Regression using The Tucker Decomposition for Sparse Spatial Modeling Finds Promising Results Analyzing Neuroimaging Data. Second Revision Invited, Biostatistics, Available at
Guha, S. and Guhaniyogi, R. (2024). Covariate-Dependent Clustering of Undirected Networks with Brain-Imaging Data. In Press, Technometrics, https://doi.org/10.1080/00401706.2024.2321930.
Gutierrez, R., Scheffler, A., Guhaniyogi, R., Dickinson, A., DiStefano, C. and Jeste, S. (2023+). A Covariance Based Clustering for Tensor Objects. Under Review, Available at
Guhaniyogi, R. and Guha, S. (2023+). Convergence Rate for Predictive Densities of Bayesian Generalized Linear Model with Scalar Response and Symmetric Tensor Predictor. Under Review, Available at
Gutierrez, R., Scheffler, A., Guhaniyogi, R., Tempini, M.G., Mandelli, M.L. and Battistella, G. (2023+). Multi-object Data Integration in the Study of Primary Progressive Aphasia. Revision Requested, Annals of Applied Statistics, Available at
Gutierrez, R., Guhaniyogi, R., Scheffler, A. (2023+). Regression with Structured Features at Multiple Scales to the Study of General Cognition in Children. Under Review, Available at
Lei, B., Guhaniyogi, R., Chandra, K., Scheffler, A., Mallick, B.K., ADNI (2023+). InVA: Integrative Variational Autoencoder for Harmonization of Multi-Modal Neuroimaging Data. Under Review, Available at
Bayesian Data Sketching with Random Matrices
Three important aspects of modern statistical learning in the era of complex, high-dimensional data are accuracy, scale, and privacy in inference. Modern data are increasingly complex and high dimensional, involving a large number of variables, large sample sizes, and complex relationships among variables. Developing practically efficient (in terms of storage and analysis) and theoretically “optimal” Bayesian high-dimensional parametric or nonparametric regression methods that draw accurate inference with valid uncertainty from such datasets is a very ripe area of research. Privacy of data samples is often an important consideration as well, especially when a large amount of confidential data is handled within an organization. As a general solution to these problems, we propose approaches based on data compression using a small number of random linear transformations. Our approach either compresses the large number of records for each variable, in which case it preserves feature interpretation for inference, or compresses the covariate vector of each sample to a lower dimension, in which case the focus is only on prediction of the response. In either case, data compression facilitates storage-efficient, scalable, and accurate Bayesian inference and prediction with high-dimensional data under sufficiently rich parametric and nonparametric regression models.
Guhaniyogi, R. and Dunson, D.B. (2015). Bayesian Compressed Regression. Journal of the American Statistical Association, Theory & Methods, 110, 1500-1514.
Guhaniyogi, R. and Dunson, D.B. (2016). Compressed Gaussian Process for Manifold Regression. Journal of Machine Learning Research, 17, 1-26.
Guhaniyogi, R., Baracaldo, L. and Banerjee, S. (2023+) Bayesian Data Sketching for Spatial Regression Models. Revision Requested, Journal of Machine Learning Research, Available at
Guhaniyogi, R. and Scheffler, A. (2023+) Sketching in Bayesian High Dimensional Regression With Big Data Using Gaussian Scale Mixture Priors. Under Minor Revision, Journal of Machine Learning Research, Available at
Gailliot, S., Guhaniyogi, R., Peng, R. (2023+) Data Sketching and Stacking: A Confluence of Two Strategies for Predictive Inference in Gaussian Process Regressions with High-Dimensional Features. Under Review, Available at
Andros, J., Guhaniyogi, R., Francom, D., Pasqualini, D. (2023+) Robust Distributed Learning of Functional Data from Simulators through Data Sketching. Under Review, Available at
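The two compression modes described in this section can be illustrated with a minimal sketch. It is not the implementation from the papers listed above: it assumes Gaussian random matrices with a variance scaling chosen for illustration, and it swaps in a conjugate normal prior with fixed prior and noise variances so that the posterior mean has a closed form. The function names (`posterior_mean`, `sketch_records`, `compress_features`) and all numeric settings are hypothetical.

```python
import numpy as np


def posterior_mean(X, y, prior_var=10.0, noise_var=1.0):
    """Posterior mean of beta for y ~ N(X beta, noise_var I), beta ~ N(0, prior_var I).

    Assumption: fixed variances, used here only to keep the example closed form.
    """
    p = X.shape[1]
    precision = X.T @ X / noise_var + np.eye(p) / prior_var
    return np.linalg.solve(precision, X.T @ y / noise_var)


def sketch_records(X, y, m, rng):
    """Compress n records to m << n rows with one random matrix Phi (m x n).

    The columns of X still correspond to the original features, so the
    coefficients fitted on the sketched data keep their interpretation.
    """
    n = X.shape[0]
    Phi = rng.normal(0.0, 1.0 / np.sqrt(n), size=(m, n))  # assumed scaling
    return Phi @ X, Phi @ y


def compress_features(X, m, rng):
    """Compress each p-dimensional covariate vector to m << p dimensions.

    The fitted coefficients live in the compressed space, so this mode
    targets prediction rather than feature-level inference.
    """
    p = X.shape[1]
    Psi = rng.normal(0.0, 1.0 / np.sqrt(p), size=(m, p))  # assumed scaling
    return X @ Psi.T, Psi


rng = np.random.default_rng(0)

# Mode 1: many records, few features. Sketch the rows, then fit as usual.
n, p = 2000, 5
beta_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(scale=0.5, size=n)
X_s, y_s = sketch_records(X, y, m=200, rng=rng)
beta_sketch = posterior_mean(X_s, y_s)  # one coefficient per original feature

# Mode 2: many features, few records. Compress the columns, then predict.
n2, p2, m2 = 300, 1000, 20
X_wide = rng.normal(size=(n2, p2))
y_wide = X_wide[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=n2)
X_c, Psi = compress_features(X_wide, m2, rng)
gamma = posterior_mean(X_c, y_wide)  # coefficients in the compressed space
x_new = rng.normal(size=(1, p2))
y_pred = (x_new @ Psi.T) @ gamma  # a new point must be compressed with the same Psi
```

In the first mode only the sketched pair `X_s`, `y_s` is needed for fitting, which is what makes the storage and privacy argument plausible; in the second mode the same `Psi` must be stored and reused at prediction time.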
Bayesian High-Dimensional Regressions
In the current surge of high-throughput data exploration, we frequently encounter intricate outcomes presented as high-dimensional arrays, where the number of model parameters (p) far exceeds the sample size (n), even for the simplest parametric models. Making inferences in this regime requires identifying and leveraging lower-dimensional structures inherent in the data generation process. However, Bayesian estimation of the data generation process and subsequent inference is computationally prohibitive and inferentially inaccurate when p far exceeds n. One of my long-term research threads develops novel approaches to predictive inference in such scenarios that bypass computationally inefficient MCMC. I have also offered computationally efficient strategies for inference in Bayesian high-dimensional regressions when both n and p are large.
Guhaniyogi, R. and Dunson, D.B. (2015). Bayesian Compressed Regression. Journal of the American Statistical Association, Theory & Methods, 110, 1500-1514.
Guhaniyogi, R. and Dunson, D.B. (2016). Compressed Gaussian Process for Manifold Regression. Journal of Machine Learning Research, 17, 1-26.
Gutierrez, R. and Guhaniyogi, R. (2022). Bayesian Dynamic Feature Partitioning in High Dimensional Regression for Big Data. Technometrics, 64(2): 224-240.
Guhaniyogi, R. and Scheffler, A. (2023+). Sketching in Bayesian High Dimensional Regression With Big Data Using Gaussian Scale Mixture Priors. Under Minor Revision, Journal of Machine Learning Research, Available at.
Gailliot, S., Guhaniyogi, R., Peng, R. (2023+). Data Sketching and Stacking: A Confluence of Two Strategies for Predictive Inference in Gaussian Process Regressions with High-Dimensional Features. Under Review, Available at
Online Approximate Bayesian Learning with Guarantees on Uncertainty
Over the past decade, there has been growing interest in online Bayesian learning for complex datasets obtained sequentially over time. This interest is spurred by the high-dimensional sequential data routinely obtained in stock markets and from satellites. While there is a growing literature on online Bayesian methods and algorithms, these approaches are often not conducive to desirable inference for high-dimensional sequential data. Moreover, they often do not offer theoretical guarantees in this scenario. To address this gap, a long-term research thread of mine develops online Bayesian learning algorithms that offer accurate inference even with high-dimensional sequential data, along with asymptotic guarantees.
Guhaniyogi, R., Qamar, S. and Dunson, D.B. (2018). Bayesian Conditional Density Filtering. Journal of Computational and Graphical Statistics, 27(3), 653-672.
Gutierrez, R. and Guhaniyogi, R. (2022). Bayesian Dynamic Feature Partitioning in High Dimensional Regression for Big Data. Technometrics, 64(2): 224-240.
Projects in Public Health
I have been involved in various public health projects in collaboration with epidemiologists, particularly in the area of detecting boundaries between counties with significant differences in health outcomes on either side.
Guhaniyogi, R. (2017). Bayesian Nonparametric Areal Wombling for Small Scale Maps with an Application to Urinary Bladder Cancer Data from Connecticut. Statistics in Medicine, 36, 4007-4027.
Belani, H.K., Sekar, P., Guhaniyogi, R., Abraham, A., Bohjanen, P.R. and Bohjanen, K. (2014). Human papillomavirus vaccine acceptance among young men in Bangalore, India. International Journal of Dermatology, 53, 486-491.