See below for a detailed schedule and talk abstracts!
Morning Session:
8.30-9am - coffee, meet & greet
9-9.40am - Causal Message Passing for Experiments with Unknown and General Network Interference (Sadegh Shirani Faradonbeh, Stanford)
Randomized experiments are a powerful methodology for data-driven evaluation of decisions or interventions. Yet their validity may be undermined by network interference, which occurs when the treatment of one unit affects not only its own outcome but also the outcomes of connected units, biasing traditional treatment effect estimates. Our study introduces a new framework to accommodate complex and unknown network interference, moving beyond the specialized models in the existing literature. Our framework, termed causal message-passing, is grounded in high-dimensional approximate message passing methodology. It is tailored to multi-period experiments and is particularly effective in settings with many units and prevalent network interference. The framework models causal effects as a dynamic process in which a treated unit’s impact propagates through the network via neighboring units until equilibrium is reached. This approach allows us to approximate the dynamics of potential outcomes over time, enabling the extraction of valuable information before treatment effects reach equilibrium. Utilizing causal message-passing, we introduce a practical algorithm to estimate the total treatment effect, defined as the impact observed when all units are treated compared to the scenario where no unit receives treatment. We demonstrate the effectiveness of this approach across five numerical scenarios, each characterized by a distinct interference structure.
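For intuition, here is a minimal toy simulation (Python) of the propagation dynamics the abstract describes: each period, a unit's outcome responds to its own treatment plus spillover from its neighbors, the system settles toward an equilibrium, and the total treatment effect is read off as the all-treated versus none-treated contrast. The network, weights, and noise level are invented for illustration; this is not the speaker's causal message-passing algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy propagation dynamics: outcomes respond to own treatment plus a
# weighted average of neighbors' outcomes, so a treated unit's effect
# spreads through the network until the system settles near equilibrium.
n, T = 500, 30
A = (rng.random((n, n)) < 5 / n).astype(float)            # random network
np.fill_diagonal(A, 0)
W = 0.4 * A / np.maximum(A.sum(1, keepdims=True), 1)      # contractive weights

def simulate(treat):
    y = np.zeros(n)
    means = []
    for _ in range(T):
        y = treat + W @ y + 0.1 * rng.standard_normal(n)  # direct + spillover
        means.append(y.mean())
    return np.array(means)

# Total treatment effect: everyone treated vs. no one treated.
gap = simulate(np.ones(n)) - simulate(np.zeros(n))
print("TTE at equilibrium:", gap[-1])     # roughly 1/(1 - 0.4) for this toy
print("gap after 5 periods:", gap[4])     # informative before equilibrium
```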
9.40-10.20am - Linear Cost Vecchia Approximation of Multivariate Normal Probabilities (Jian Cao, University of Houston)
Multivariate normal (MVN) probabilities arise in myriad applications, but they are analytically intractable and need to be evaluated via Monte Carlo-based numerical integration. For the state-of-the-art minimax exponential tilting (MET) method, we show that the complexity of each of its components can be greatly reduced through an integrand parameterization that utilizes the sparse inverse Cholesky factor produced by the Vecchia approximation, whose approximation error is often negligible relative to the Monte Carlo error. Based on this idea, we derive algorithms that can estimate MVN probabilities and sample from truncated MVN distributions in linear time (and that are easily parallelizable), at the same convergence or acceptance rate as MET, whose complexity is cubic in the dimension of the MVN probability. We showcase the advantages of our methods relative to existing approaches using several simulated examples. We also analyze a groundwater-contamination dataset with over twenty thousand censored measurements to demonstrate the scalability of our method for partially censored Gaussian-process models.
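A rough sketch of the Vecchia ingredient behind the linear cost: condition each point on at most m previously ordered nearest neighbors, producing a sparse inverse Cholesky factor U so that the inverse covariance is approximately U times U-transpose. The exponential kernel, the natural ordering, and the dense storage below are simplifications for readability, and the MET integration step itself is omitted.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)

# Vecchia sketch: build a sparse inverse Cholesky factor U such that
# inv(Sigma) ~= U @ U.T, conditioning each point on at most m previously
# ordered nearest neighbors. Each column costs O(m^3), giving linear
# total cost in n (per-point KD-trees and dense storage are for clarity).
n, m = 300, 10
X = rng.random((n, 2))                        # locations, in the given order

def kern(a, b):
    """Exponential covariance with range 0.3 (an arbitrary choice)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-d / 0.3)

U = np.zeros((n, n))
U[0, 0] = kern(X[:1], X[:1])[0, 0] ** -0.5
for i in range(1, n):
    k = min(m, i)
    _, idx = cKDTree(X[:i]).query(X[i], k=k)  # nearest earlier neighbors
    c = np.atleast_1d(idx)
    Kcc = kern(X[c], X[c])
    Kci = kern(X[c], X[i:i + 1])[:, 0]
    b = np.linalg.solve(Kcc, Kci)             # regression on the neighbors
    d = kern(X[i:i + 1], X[i:i + 1])[0, 0] - Kci @ b
    U[i, i] = d ** -0.5                       # conditional sd on the diagonal
    U[c, i] = -b * U[i, i]                    # sparse column: m nonzeros

# Sanity check against the dense precision matrix.
P = np.linalg.inv(kern(X, X) + 1e-8 * np.eye(n))
print("relative precision error:", np.linalg.norm(U @ U.T - P) / np.linalg.norm(P))
```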
10.20-10.40am - break
10.40-11.20am - Spatial Point Process Intensity Estimation on Complex Domains (Huiyan Sang, Texas A&M)
The increasing availability of geocoded spatial data with precise location information has sparked significant interest in spatial modeling and the analysis of point processes. This research focuses on intensity estimation for large spatial point patterns in 2-D complex domains and linear networks, addressing issues such as "leakage" and computational inefficiencies present in many existing spatial point process models. We present two models and inference algorithms for estimating spatially varying intensity functions and examining the nonlinear relationship between intensity and explanatory variables in complex domains. Numerical studies are provided to demonstrate the performance of the proposed methods.
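The "leakage" issue is easy to demonstrate: a naive kernel intensity estimator smooths across parts of the domain where no events can occur. The toy below (invented data, Gaussian kernel on the unit square with a forbidden strip) returns a clearly positive estimate where the true intensity is zero, which is exactly the failure mode the proposed models avoid.

```python
import numpy as np

rng = np.random.default_rng(2)

# "Leakage" in one picture: a naive Gaussian-kernel intensity estimate
# smooths across a gap in the domain where no events can occur, so the
# estimated intensity there sits well above its true value of zero.
pts = rng.random((400, 2))
pts = pts[(pts[:, 0] < 0.4) | (pts[:, 0] > 0.6)]   # carve out a forbidden strip

def naive_intensity(x, bw=0.08):
    d2 = ((pts - x) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * bw**2)).sum() / (2 * np.pi * bw**2)

print("estimate inside the empty strip:", naive_intensity(np.array([0.5, 0.5])))
print("estimate in the occupied region:", naive_intensity(np.array([0.2, 0.5])))
```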
11.20am-12pm - Drift versus Shift: Decoupling Trends and Changepoint Analysis (Toryn Schafer, Texas A&M)
We introduce a new approach for decoupling trends (drift) and changepoints (shifts) in time series. Our locally adaptive, model-based approach robustly decouples the two by combining Bayesian trend filtering with machine-learning-based regularization. An over-parameterized Bayesian dynamic linear model (DLM) is first applied to characterize drift. Then a weighted penalized likelihood estimator is paired with the estimated DLM posterior distribution to identify shifts. We show how Bayesian DLMs specified with so-called shrinkage priors can provide smooth estimates of underlying trends in the presence of complex noise components; however, their inability to shrink exactly to zero inhibits direct changepoint detection. In contrast, penalized likelihood methods are highly effective at locating changepoints but require data with simple patterns in both signal and noise. The proposed decoupling approach combines the strengths of both, that is, the flexibility of Bayesian DLMs with the hard-thresholding property of penalized likelihood estimators, to provide changepoint analysis in complex, modern settings. The proposed framework is robust to outliers and can identify a variety of changes, including changes in mean and slope. It is also easily extended to the analysis of parameter shifts in time-varying parameter models such as dynamic regressions. We illustrate the flexibility of our approach and contrast its performance and robustness with several alternative methods across a wide range of simulations and application examples.
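A stripped-down caricature of the two-stage decoupling idea (not the authors' DLM-plus-penalized-likelihood estimator): hard thresholding on differences locates the abrupt shift, and a smoothness-penalized fit recovers the slow drift. All constants below are invented for the toy.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy series: slow drift plus one abrupt level shift at t = 180.
T = 300
t = np.arange(T)
y = 0.02 * t + 3.0 * (t >= 180) + 0.3 * rng.standard_normal(T)

# Shift detection: differencing turns the level shift into a spike, and a
# robust hard threshold (the penalized-likelihood ingredient) isolates it,
# while the smooth drift contributes only ~0.02 per step.
dy = np.diff(y)
sigma = np.median(np.abs(dy - np.median(dy))) / 0.6745   # robust noise scale
cps = np.where(np.abs(dy - np.median(dy)) > 5 * sigma)[0] + 1
print("detected shift location(s):", cps)

# Drift recovery: remove the detected shifts, then smooth the remainder
# (an HP-filter stand-in for the Bayesian DLM's trend estimate).
shifts = np.zeros(T)
for c in cps:
    shifts[c:] += y[c] - y[c - 1]
D = np.diff(np.eye(T), n=2, axis=0)                      # 2nd differences
drift_hat = np.linalg.solve(np.eye(T) + 50.0 * D.T @ D, y - shifts)
```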
12-1pm - lunch
Afternoon Session:
1-1.40pm - Mixture of Directed Graphical Models for Discrete Spatial Random Fields (Kate Calder, UT Austin)
Current approaches for modeling discrete-valued outcomes associated with spatially-dependent areal units incur computational and theoretical challenges, especially in the Bayesian setting when full posterior inference is desired. As an alternative, we propose a novel statistical modeling framework for this data setting, namely a mixture of directed graphical models (MDGM). The components of the mixture, directed graphical models, can be represented by directed acyclic graphs (DAGs) and are computationally quick to evaluate. The DAGs representing the mixture components are selected to correspond to an undirected graphical representation of an assumed spatial contiguity/dependence structure of the areal units, which underlies the specification of traditional modeling approaches for discrete spatial processes such as Markov random fields (MRFs). We introduce the concept of compatibility to show how an undirected graph can be used as a template for the structural dependencies between areal units to create sets of DAGs which, as a collection, preserve the structural dependencies represented in the template undirected graph. We then introduce three classes of compatible DAGs and corresponding algorithms for fitting MDGMs based on these classes. In addition, we compare MDGMs to MRFs and a popular Bayesian MRF model approximation used in high-dimensional settings in a series of simulations and an analysis of ecometrics data collected as part of the Adolescent Health and Development in Context Study. This presentation is based on joint work with Brandon Carter.
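To make the compatibility notion concrete, here is a small sketch: orienting the edges of an undirected contiguity graph along any total order of the vertices yields a DAG whose skeleton is exactly the template graph. The lattice, the random-ordering scheme, and the three-component collection below are illustrative inventions, not the paper's specific classes of compatible DAGs.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)

# "Compatibility" in miniature: take a 4x4 lattice with rook contiguity
# as the undirected template, then orient every edge along a random total
# order of the vertices. Each orientation is acyclic and keeps exactly
# the template's edges, so a collection of such DAGs preserves the
# assumed spatial dependence structure.
side = 4
nodes = [(i, j) for i in range(side) for j in range(side)]
edges = [(u, v) for u, v in itertools.combinations(nodes, 2)
         if abs(u[0] - v[0]) + abs(u[1] - v[1]) == 1]

def compatible_dag():
    perm = [tuple(int(x) for x in v) for v in rng.permutation(nodes)]
    rank = {v: r for r, v in enumerate(perm)}
    return [(u, v) if rank[u] < rank[v] else (v, u) for u, v in edges]

components = [compatible_dag() for _ in range(3)]   # three candidate DAGs
for dag in components:
    # Same skeleton as the template, and acyclic by construction.
    assert {frozenset(e) for e in dag} == {frozenset(e) for e in edges}
print("built", len(components), "acyclic orientations of the contiguity graph")
```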
1.40-2.20pm - Refining Stochastic Optimization for High-Dimensional Data with Novel Noise Models (Farzad Sabzikar, Iowa State)
In this talk, we present advanced stochastic optimization techniques for high-dimensional data. Traditional methods often struggle in non-convex landscapes, facing challenges in escaping local minima. We introduce novel modifications to gradient-based algorithms using heavy-tailed noise models to enhance exploration and convergence. Through perturbation strategies and analysis of correlated noise, we demonstrate improvements in the robustness and efficiency of optimization processes. Our findings offer new insights into optimizing high-dimensional data, with applications in machine learning and scientific computing.
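A minimal sketch of the escape phenomenon (not the speaker's algorithms): gradient descent on a double-well objective, perturbed by Gaussian versus heavy-tailed Student-t noise. All constants, including the degrees of freedom, step size, and noise scale, are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(5)

# Double-well objective: global minimum near x = -1, shallow local
# minimum near x = +1. Starting in the local well, Gaussian perturbations
# at this scale essentially never cross the barrier, while heavy-tailed
# (Student-t, df = 1.5) perturbations occasionally take large jumps that
# land in the global basin.
f = lambda x: (x**2 - 1) ** 2 + 0.3 * x
grad = lambda x: 4 * x * (x**2 - 1) + 0.3

def run(noise, steps=2000, lr=0.01, scale=0.03):
    x, best = 1.0, 1.0                      # start in the local minimum
    for _ in range(steps):
        x = np.clip(x - lr * grad(x) + scale * noise(), -3, 3)
        if f(x) < f(best):
            best = x
    return best

print("Gaussian noise ->", round(run(rng.standard_normal), 2))         # typically ~ +0.96
print("heavy-tailed   ->", round(run(lambda: rng.standard_t(1.5)), 2)) # typically ~ -1.04
```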
2.20-2.40pm - break
2.40-3.20pm - Dimension Reduction for Spatial Regression (Hossein Moradi Rekabdarkolaee, South Dakota State)
Natural sciences such as geology and forestry often utilize regression models for spatial data with many predictors and small to moderate sample sizes. In these settings, efficient estimation of the regression parameters is crucial for both model interpretation and prediction. We propose a dimension reduction approach for spatial regression that assumes certain linear combinations of the predictors are immaterial to the regression. The model and corresponding inference provide efficient estimation of regression parameters while accounting for spatial correlation in the data, with the model parameters estimated by maximum likelihood. The effectiveness of the proposed model is illustrated through simulation studies and the analysis of a geochemical data set, predicting rare earth element concentrations within an oil and gas reserve in Wyoming. Simulation results indicate that our proposed model offers a significant reduction in the mean squared error and variance of the regression coefficients. Furthermore, the method provided a 50% reduction in prediction variance for rare earth element concentrations in our data analysis.
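An oracle illustration of the efficiency claim (assuming, unlike the talk's likelihood-based method, that the material subspace and spatial covariance are known): restricting estimation to the relevant linear combinations after whitening for spatial correlation shrinks the coefficient error relative to full GLS. Every design choice below is invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(6)

# Spatial regression where only k = 2 of p = 12 predictor directions
# matter. Whiten by the (here, known) spatial covariance, then compare
# full GLS with GLS restricted to the (here, known) material subspace G.
n, p, k = 80, 12, 2
S = rng.random((n, 2)) * 10                          # spatial locations
d = np.linalg.norm(S[:, None] - S[None, :], axis=-1)
L = np.linalg.cholesky(np.exp(-d / 2.0) + 1e-6 * np.eye(n))

X = rng.standard_normal((n, p))
G = np.eye(p)[:, :k]                                 # oracle material subspace
beta = G @ np.array([2.0, -1.0])
y = X @ beta + L @ rng.standard_normal(n)            # spatially correlated errors

Xw, yw = np.linalg.solve(L, X), np.linalg.solve(L, y)   # GLS whitening
beta_full = np.linalg.lstsq(Xw, yw, rcond=None)[0]       # all p directions
alpha = np.linalg.lstsq(Xw @ G, yw, rcond=None)[0]       # only k directions
beta_red = G @ alpha

print("full GLS error    :", np.linalg.norm(beta_full - beta))
print("reduced GLS error :", np.linalg.norm(beta_red - beta))
```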
3.20-4pm - Guaranteed Estimation in Highly Noisy Environments with Strong Temporal Dependence (Reza Sadeghihafshejani, SMU)
Multidimensional stochastic differential equations are ubiquitous models for highly noisy dynamic environments. We study a fairly assumption-free setting, focusing on the design of input experiments and the analysis of estimators of the dynamical model. Our theoretical results establish finite-time guarantees for identifying the dynamics matrices from a single trajectory that is nonstationary and exhibits long-lasting temporal dependence. We present consistency rates, as the trajectory length grows, for properly randomized input signals. We also characterize the effects of different parameters on the estimation error, including the dimension, measures of stability, properties of the continuous-time noise process, and the eigen-structure of the data generation mechanism.
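A sketch of the setting (not the talk's estimator or its guarantees): simulate a single trajectory of a linear SDE with randomized inputs via Euler-Maruyama, then recover the dynamics matrix by least squares on the increments. The system matrices, horizon, and step size below are invented.

```python
import numpy as np

rng = np.random.default_rng(7)

# Linear SDE dX = (A X + B U) dt + dW observed along one trajectory,
# with randomized inputs U; regress increments on states and inputs to
# identify the dynamics matrices.
d, T, dt = 3, 20000, 0.01
A = np.array([[-0.5,  0.3,  0.0],
              [ 0.0, -0.4,  0.2],
              [ 0.1,  0.0, -0.6]])
B = np.eye(d)

X = np.zeros((T, d))
U = rng.standard_normal((T, d))                      # randomized input signal
for t in range(T - 1):
    dW = np.sqrt(dt) * rng.standard_normal(d)        # Brownian increment
    X[t + 1] = X[t] + (A @ X[t] + B @ U[t]) * dt + dW

# Least squares on the (single, temporally dependent) trajectory.
dX = (X[1:] - X[:-1]) / dt
Z = np.hstack([X[:-1], U[:-1]])
theta = np.linalg.lstsq(Z, dX, rcond=None)[0].T      # columns: [A_hat, B_hat]
A_hat = theta[:, :d]
print("estimation error ||A_hat - A||:", np.linalg.norm(A_hat - A))
```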