Research

Funding 

My research threads have been funded by the National Institutes of Health (an R01 grant with a 1st-percentile score), the National Science Foundation, and the Office of Naval Research, with me as PI or co-PI.

Research Codes

Research code for many of my papers is available on my GitHub page (Raj Guhaniyogi GitHub).

Spatial and Spatio-temporal Modeling

With recent advances in Geographical Information Systems (GIS) and other measurement technologies, together with falling storage costs, the size and complexity of spatially geocoded data are growing over time. While classical geostatistical models based on Gaussian processes are well suited to capturing complex local variation in spatial data, their computational and storage demands often prevent them from scaling to large datasets. The issue is further exacerbated in Bayesian estimation of Gaussian process models using Markov chain Monte Carlo (MCMC). I was introduced to this problem during my early PhD years through first-hand collaboration with environmental and soil scientists on the spatial and spatio-temporal analysis of massive remote sensing data. I have been developing computationally efficient low-rank Gaussian processes, their multiscale variants, and distributed inference with Gaussian processes to offer general solutions to this problem.
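
As a concrete illustration, below is a minimal Python/numpy sketch of a low-rank (predictive-process style) Gaussian process approximation; the squared-exponential covariance, knot locations, and hyperparameter values are illustrative assumptions, not the exact specification used in my papers.

# Minimal sketch of a low-rank (predictive-process) Gaussian process.
# Knot locations and hyperparameters below are illustrative only.
import numpy as np

def sq_exp_cov(A, B, sigma2=1.0, phi=0.2):
    """Squared-exponential covariance between location sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma2 * np.exp(-0.5 * d2 / phi**2)

rng = np.random.default_rng(0)
n, m = 5000, 50                      # n observations, m << n knots
locs = rng.uniform(0, 1, (n, 2))     # observed spatial locations
knots = rng.uniform(0, 1, (m, 2))    # knot locations for the low-rank basis

C_kk = sq_exp_cov(knots, knots)      # m x m covariance at knots
C_nk = sq_exp_cov(locs, knots)       # n x m cross-covariance
# Low-rank approximation: C is replaced by C_nk C_kk^{-1} C_nk^T, so the
# dense linear algebra costs O(n m^2) rather than O(n^3).
basis = C_nk @ np.linalg.inv(C_kk + 1e-8 * np.eye(m))
w_knots = rng.multivariate_normal(np.zeros(m), C_kk)  # GP draw at knots
w_tilde = basis @ w_knots            # predictive-process realization at locs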


Distributed Bayesian Inference for Stochastic Process Modeling of Massive Structured Data

The last decade has seen increasing interest in developing practically efficient and theoretically optimal high-dimensional and nonparametric Bayesian methods for modeling the complex dependencies in massive structured datasets (e.g., spatial data, temporal data, higher-order functional data), with precise characterization of uncertainty about the underlying stochastic processes. Most of this effort, however, has been devoted to novel modeling approaches, and the literature has largely operated within a centralized framework in which all data are stored and analyzed on a single processor. In practice, data must often be analyzed in a decentralized manner, with no communication between the analyses of different subsets. This scenario arises primarily in my collaboration with environmental scientists on the efficient storage and computation of massive data, but I have also encountered it in collaborations with groups at national laboratories that collect data on similar characteristics yet cannot share them outside their organizations. In such cases, it is crucial to adopt a strategy of "federated" or "distributed" learning with complex stochastic process models for structured data. Most machine learning strategies on this topic are suboptimal for uncertainty quantification and lack theoretical guarantees. This thread comprehensively addresses the gap by offering a general distributed Bayesian algorithm for stochastic process models with dependent data.
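
To fix ideas, here is a minimal sketch of divide-and-conquer Bayesian inference in the spirit of consensus Monte Carlo, illustrated on a toy Gaussian mean model; the shard count, powered prior, and precision-weighted combination rule are illustrative choices, not the exact algorithm developed in this thread.

# Minimal sketch of distributed (divide-and-conquer) Bayesian inference.
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.0, 100_000)     # full data, known unit variance
shards = np.array_split(y, 10)        # analyzed on separate machines

draws = []
for shard in shards:
    n_j = shard.size
    # Conjugate subset posterior for the mean under the prior N(0, 10^2)
    # raised to the power 1/J, so the full prior is not reused J times.
    prior_prec = (1 / 10.0**2) / len(shards)
    post_prec = prior_prec + n_j      # likelihood precision = n_j
    post_mean = shard.sum() / post_prec
    draws.append(rng.normal(post_mean, post_prec**-0.5, 2000))

# Consensus combination: precision-weighted average of subset draws.
precs = np.array([1 / np.var(d) for d in draws])
combined = (np.vstack(draws) * precs[:, None]).sum(0) / precs.sum()
print(combined.mean(), combined.std())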


Bayesian Regression with Non-Euclidean Objects

Collaboration with neuroscientists sparked my interest in developing regression methods for non-Euclidean objects, such as tensors or graphs. Consequently, my group developed the first Bayesian regression frameworks with tensor-valued predictors, as well as with tensor-valued responses. Of late, neuroimaging data from multiple imaging modalities (e.g., fMRI, DTI, PET) has opened the possibility of integrating information from different sources to study neurodegenerative disorders such as Alzheimer's disease. To this end, my group has been actively developing regression methods for diverse objects with different but connected topologies. This is a ripe yet largely unexplored area of research that will see many exciting developments.
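
The schematic sketch below shows the core idea behind tensor-predictor regression with a low-rank (CP/PARAFAC) coefficient, using a matrix (2-way tensor) predictor; the rank, dimensions, and fixed margins are illustrative assumptions, and the Bayesian inference over the margins is omitted.

# Schematic sketch of regression with a matrix-valued predictor and a
# rank-R CP coefficient: R(p1 + p2) free parameters instead of p1 * p2.
import numpy as np

rng = np.random.default_rng(2)
p1, p2, R = 20, 20, 2
beta1 = rng.normal(size=(p1, R))     # CP margins along mode 1
beta2 = rng.normal(size=(p2, R))     # CP margins along mode 2
B = sum(np.outer(beta1[:, r], beta2[:, r]) for r in range(R))

X = rng.normal(size=(p1, p2))        # e.g., one subject's imaging matrix
y_mean = np.tensordot(X, B)          # inner product <X, B>: linear predictor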

Bayesian Data Sketching with Random Matrices

Accuracy, scale, and privacy are three important aspects of modern statistical learning in the era of complex, high-dimensional data. Modern datasets increasingly involve a large number of variables, large sample sizes, and complex relationships among variables. Developing practically efficient (in terms of storage and analysis) and theoretically "optimal" Bayesian high-dimensional parametric or nonparametric regression methods that draw accurate inference with valid uncertainties from such data is a very ripe area of research. Privacy of data samples is often an additional consideration, especially when a large amount of confidential data is handled within an organization. As a general solution to these problems, we propose approaches based on compressing the data with a small number of random linear transformations. Our approach either reduces the large number of records for each variable, in which case it maintains feature interpretation for full inference, or reduces the dimension of the covariate vector for each sample, in which case the focus is on prediction of the response alone. In either case, data compression enables storage-efficient, scalable, and accurate Bayesian inference and prediction for high-dimensional data under sufficiently rich parametric and nonparametric regression models.
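
A minimal sketch of the first variant (compressing records) appears below: the n records are reduced to m << n random linear combinations before a Bayesian linear model is fit; the Gaussian sketching matrix and conjugate normal model are illustrative assumptions rather than the exact construction in my papers.

# Minimal sketch of Bayesian data sketching: compress n records to m.
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 10_000, 50, 500
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# Random compression matrix with (approximately) orthonormal rows, so
# the compressed noise Phi @ eps is approximately N(0, sigma2 I_m).
Phi = rng.normal(size=(m, n)) / np.sqrt(n)
Xc, yc = Phi @ X, Phi @ y            # raw records never leave storage

# Conjugate posterior for beta under a N(0, tau2 I) prior, computed
# entirely from the m compressed records.
tau2, sigma2 = 10.0, 1.0
post_prec = Xc.T @ Xc / sigma2 + np.eye(p) / tau2
post_mean = np.linalg.solve(post_prec, Xc.T @ yc / sigma2)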


Bayesian High-Dimensional Regressions

In the current surge of high-throughput data exploration, we frequently encounter intricate outcomes presented as high-dimensional arrays, where the number of model parameters (p) massively surpasses the sample size (n), even for the simplest parametric models. Drawing inference when p far exceeds n forces us to identify and leverage lower-dimensional structure in the data-generating process. However, Bayesian estimation of that process and subsequent inference are computationally prohibitive, and can be inferentially inaccurate, when p far exceeds n. One of my long-term research threads develops novel approaches for predictive inference in such scenarios that bypass computationally inefficient MCMC. I have also offered computationally efficient strategies for Bayesian high-dimensional regression when both n and p are large.
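
One well-known device of this kind, sketched below, is the exact n-by-n sampler for Gaussian posteriors under Gaussian scale-mixture priors (Bhattacharya, Chakraborty and Mallick, 2016, Biometrika), which draws a p-dimensional coefficient at O(n^2 p) cost instead of O(p^3); the fixed prior scales and unit noise variance here are illustrative simplifications, and this is not necessarily the exact strategy in my papers.

# Minimal sketch of exact posterior sampling when p >> n.
import numpy as np

def fast_posterior_draw(Phi, y, d, rng):
    """Draw beta ~ N(A^{-1} Phi' y, A^{-1}), A = Phi'Phi + diag(1/d),
    assuming unit noise variance, at O(n^2 p) cost instead of O(p^3)."""
    n, p = Phi.shape
    u = rng.normal(size=p) * np.sqrt(d)     # u ~ N(0, D)
    delta = rng.normal(size=n)              # delta ~ N(0, I_n)
    v = Phi @ u + delta
    M = Phi * d @ Phi.T + np.eye(n)         # n x n system, never p x p
    w = np.linalg.solve(M, y - v)
    return u + d * (Phi.T @ w)

rng = np.random.default_rng(4)
n, p = 200, 20_000                          # p massively surpasses n
Phi = rng.normal(size=(n, p))
y = Phi[:, :5].sum(1) + rng.normal(size=n)  # 5 truly active predictors
beta_draw = fast_posterior_draw(Phi, y, d=np.full(p, 0.1), rng=rng)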


Online Approximate Bayesian Learning with Guarantees on Uncertainty

Over the past decade, there has been growing interest in online Bayesian learning for complex datasets that arrive sequentially over time, spurred by the high-dimensional sequential data routinely obtained in stock markets and from satellites. While the literature on online Bayesian methods and algorithms is growing, these methods are often ill-suited to delivering reliable inference for high-dimensional sequential data, and they frequently lack theoretical guarantees in this setting. To address this gap, a long-term research thread of mine develops online Bayesian learning algorithms that offer accurate inference even for high-dimensional sequential data, with asymptotic guarantees on the resulting uncertainty.
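
The flavor of online Bayesian updating is captured by the minimal sketch below, which uses recursive conjugate updates for a linear model with known noise variance so that no batch MCMC is rerun as data arrive; the streaming setup and conjugate model are illustrative assumptions, far simpler than the algorithms developed in this thread.

# Minimal sketch of online Bayesian learning via recursive conjugate updates.
import numpy as np

rng = np.random.default_rng(5)
p = 10
beta_true = rng.normal(size=p)

# Prior N(mean, Prec^{-1}); the posterior stays Gaussian at every step.
prec = np.eye(p) / 10.0          # prior precision
shift = np.zeros(p)              # running value of Prec @ mean

for t in range(1000):            # data arriving sequentially over time
    X_t = rng.normal(size=(20, p))                 # one mini-batch
    y_t = X_t @ beta_true + rng.normal(size=20)
    prec += X_t.T @ X_t                            # precision update
    shift += X_t.T @ y_t
    post_mean = np.linalg.solve(prec, shift)       # current posterior mean

print(np.max(np.abs(post_mean - beta_true)))       # shrinks as data accrue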

Projects in Public Health

I have been involved in various public health projects in collaboration with epidemiologists, particularly on detecting boundaries between counties whose health outcomes differ significantly.