Green is our method.
1. Point process modeling across spatial accidental data in forensic footwear images, with Neil Spencer and Dipak Dey
Shoe print analysis is pivotal in forensic investigations, especially when shoe prints are recovered at crime scenes. These prints often display accidentals, such as cuts, scrapes, and wear patterns, which develop through use and vary with factors like brand, model, and size. While some accidentals are common across shoe prints, others are highly distinctive and rare. Distinguishing these uncommon patterns from the regular ones is essential for strengthening forensic evidence. For example, a pattern with a probability of occurrence of \(1/100\) is relatively common, while one with a probability of \(1/10^6\) is exceptionally rare and thus more significant as evidence. In this study, we develop a hierarchical Bayesian model with spatially varying coefficients to analyze shoe print data. By incorporating spatial information from a large collection of shoe prints, the model accounts for common patterns and assesses whether a suspect's shoe print exhibits features that are significantly distinctive relative to regular accidental patterns. The resulting probability of occurrence serves as statistical evidence to aid the jury in making informed decisions. Our proposed method outperforms state-of-the-art techniques, improving accuracy and reliability in forensic shoe print analysis. We also establish posterior consistency for the proposed model.
2. Bayesian Models for Joint Selection of Features and Auto-Regressive Lags: Theory and Applications in Environmental and Financial Forecasting, with Sujit Ghosh
This paper introduces Bayesian variable selection methods for linear regression models with autocorrelated errors, using spike-and-slab priors in a two-step MCMC procedure. The approach enables the simultaneous selection of predictors, including lagged variables, and lagged error terms, ensuring consistency and scalability. Applications to finance (S&P 500 prediction) and environmental science demonstrate its superior predictive performance and accuracy in autoregressive time series models.
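As a rough illustration of the kind of model described above, a regression with autoregressive errors and spike-and-slab priors on both the regression coefficients and the lag coefficients could be written as follows. This is only a generic sketch: the symbols \(K\), \(\phi_k\), \(\gamma_j\), \(\zeta_k\), \(\tau_\beta\), \(\tau_\phi\), \(\pi_\beta\), and \(\pi_\phi\) are assumptions introduced here, and the point-mass spike \(\delta_0\) is one common choice that may differ from the prior used in the paper.

```latex
\[
\begin{aligned}
y_t &= \mathbf{x}_t^\top \boldsymbol{\beta} + \varepsilon_t, \qquad
\varepsilon_t = \sum_{k=1}^{K} \phi_k \, \varepsilon_{t-k} + \eta_t, \qquad
\eta_t \sim \mathcal{N}(0, \sigma^2), \\
\beta_j \mid \gamma_j &\sim (1 - \gamma_j)\, \delta_0 + \gamma_j \, \mathcal{N}(0, \tau_\beta^2), \qquad
\gamma_j \sim \mathrm{Bernoulli}(\pi_\beta), \\
\phi_k \mid \zeta_k &\sim (1 - \zeta_k)\, \delta_0 + \zeta_k \, \mathcal{N}(0, \tau_\phi^2), \qquad
\zeta_k \sim \mathrm{Bernoulli}(\pi_\phi).
\end{aligned}
\]
```

Under this formulation, \(\gamma_j = 1\) means predictor \(j\) is selected and \(\zeta_k = 1\) means error lag \(k\) is selected, so posterior inclusion probabilities for both sets of indicators summarize the joint selection.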
3. Prediction Interval Estimation in Penalized Regression Models of Insurance Data, with Aditya Vikram Sett, Dipak K. Dey, and Yuwen Gu
This paper addresses prediction uncertainty quantification in Generalized Linear Models (GLMs), a critical area in scientific and business applications. It explores penalized regression for feature selection and de-biasing techniques for post-selection inference, tackling the difficulty of constructing valid confidence intervals in the presence of model selection bias. The study extends conformal prediction methods from linear models to GLMs and applies them, particularly for Tweedie regression, to insurance claim data, demonstrating their effectiveness in predictive accuracy and uncertainty estimation.
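To make the conformal prediction idea concrete, here is a minimal sketch of split conformal prediction intervals in Python. It is not the paper's procedure: the absolute-residual score and the plain least-squares predictor (`fit_least_squares`) are stand-ins assumed for illustration, whereas the paper works with penalized GLMs such as Tweedie regression. The function names are hypothetical.

```python
import numpy as np

def split_conformal_interval(predict, X_cal, y_cal, X_new, alpha=0.1):
    """Split conformal interval using absolute residuals as the score.

    `predict` is any fitted point predictor; the calibration data must
    not have been used to fit it.
    """
    scores = np.sort(np.abs(y_cal - predict(X_cal)))
    n_cal = len(scores)
    # Finite-sample corrected rank of the calibration quantile.
    k = int(np.ceil((n_cal + 1) * (1 - alpha)))
    q = scores[k - 1] if k <= n_cal else np.inf
    center = predict(X_new)
    return center - q, center + q

def fit_least_squares(X, y):
    """Stand-in predictor (assumption): ordinary least squares with intercept."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ coef
```

In use, the data are split three ways: one part to fit the predictor, one part to calibrate the interval width, and the remainder to predict. The coverage guarantee of roughly \(1 - \alpha\) holds marginally under exchangeability, regardless of how well the stand-in model fits.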
4. Interval Estimation of Coefficients in Penalized Regression Models of Insurance Data, with Zijian Huang, Dipak K. Dey, and Yuwen Gu
This study focuses on constructing confidence intervals for model parameters in Tweedie regression, which is commonly used for zero-inflated, semicontinuous insurance loss data. Post-selection inference is emphasized as a crucial step for credibly identifying important features and for addressing the bias of lasso estimates for large coefficients in GLMs. Traditional methods that ignore the selection step often lead to overly optimistic results, so bias correction techniques are needed for valid inference. Methodologies for post-selection confidence interval construction are discussed, with applications to insurance data providing practical insights.
5. Development of a Statistical Predictive Model for Daily Water Table Depth and Important Variables Selection for Inference, with Sushant Mehan and Amatya M. Devendra
Accurately predicting water table dynamics is crucial for managing groundwater resources, ecosystems, and human activities. This study employs an autoregressive model with sparse coefficients and lagged variables to estimate daily water table depth using hydroclimatic data from the Santee Experimental Forest (SC, 2006–2019) and D1 (NC, 1988–2008). Key predictors include soil/air temperature, precipitation, and radiation; the RMSE is 10.09 cm (dormant season) and 14.94 cm (daily testing). The model achieved high accuracy (R²: 0.93–0.96) and identified rainfall, solar radiation, and wind direction as critical factors, offering a valuable tool for hydrologic and ecological management.
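The general idea of an autoregressive model with sparse coefficients and lagged variables can be sketched as follows. This is an illustrative toy, not the study's model: it assumes a single covariate, mean-zero series with no intercept, and a lasso penalty fitted by coordinate descent. The helpers `make_lagged_design` and `lasso_cd` and the penalty level `lam` are hypothetical names introduced here.

```python
import numpy as np

def make_lagged_design(y, x, n_lags):
    """Build a design matrix from lags 1..n_lags of y and lags 0..n_lags of x."""
    n = len(y)
    y_lags = [y[n_lags - k : n - k] for k in range(1, n_lags + 1)]
    x_lags = [x[n_lags - k : n - k] for k in range(0, n_lags + 1)]
    design = np.column_stack(y_lags + x_lags)
    target = y[n_lags:]
    return design, target

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            resid += X[:, j] * beta[j]      # remove feature j's contribution
            rho = X[:, j] @ resid / n       # partial correlation with residual
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]      # add updated contribution back
    return beta
```

With `n_lags = 3`, columns 0 to 2 hold lags 1 to 3 of the response and columns 3 to 6 hold lags 0 to 3 of the covariate, so nonzero entries of the fitted coefficient vector indicate which lags were retained.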
6. Some clustering-based change-point detection methods applicable to high-dimension, low sample size data, with Trisha Dawn, Angshuman Roy, and Anil K. Ghosh. Journal of Statistical Planning and Inference, Volume 234, January 2025, 106212
Detecting change-points in high-dimensional data with limited sample sizes is a complex challenge. This study proposes clustering-based methods, leveraging \(k\)-means and suitable dissimilarity measures, to test for and estimate single change-points, with theoretical validation under high-dimensional settings. The approach is extended to detect multiple change-points, and its performance is evaluated through extensive simulations and real data analysis, demonstrating competitive results compared to existing state-of-the-art methods.
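The link between clustering and change-point estimation can be illustrated with a small sketch: cluster the observations into two groups, then pick the time index whose before/after split agrees best with the cluster labels. This is a simplified illustration under stated assumptions, not the paper's method: it uses plain squared Euclidean distance rather than the dissimilarity measures the paper proposes, it handles a single change-point only, and the functions `two_means` and `estimate_change_point` are hypothetical.

```python
import numpy as np

def two_means(X, n_iter=100):
    """Lloyd's algorithm with k = 2, initialised at the first and last rows."""
    centers = np.stack([X[0], X[-1]]).astype(float)
    labels = None
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for k in range(2):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def estimate_change_point(labels):
    """Return tau such that rows [0, tau) and [tau, n) best match the clusters."""
    n = len(labels)
    best_tau, best_agree = 1, -1
    for tau in range(1, n):
        step = (np.arange(n) >= tau).astype(int)
        # Cluster labels are arbitrary, so allow either labelling.
        agree = max((labels == step).sum(), (labels != step).sum())
        if agree > best_agree:
            best_tau, best_agree = tau, agree
    return best_tau
```

Initialising the two centers at the first and last observations exploits the time ordering: if a change occurred, those two rows come from different regimes.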