Green is our method.
1. Point Process Modeling of Spatial Accidental Data in Forensic Footwear Images, with Neil Spencer and Dipak Dey
Shoe print analysis is pivotal in forensic investigations, especially when prints are recovered at crime scenes. These prints often display accidentals, such as cuts, scrapes, and wear patterns, which develop through usage and vary with factors like brand, model, and size. While some accidentals are common across shoe prints, others are highly distinctive and rare. Distinguishing these uncommon patterns from regular ones is essential for strengthening forensic evidence: a pattern with an occurrence probability of 1/100 is relatively common, while one with a probability of \(1/10^6\) is exceptionally rare and therefore far more significant as evidence. In this study, we develop a hierarchical Bayesian model with spatially varying coefficients to analyze shoe print data. By incorporating spatial information from a large collection of shoe prints, the model removes common patterns and assesses whether a suspect's print exhibits features that are significantly more distinctive than regular accidental patterns. The resulting occurrence probability serves as statistical evidence to aid the jury in making informed decisions. Our proposed method outperforms state-of-the-art techniques in accuracy and reliability, and we also establish posterior consistency for the proposed model.
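As a toy illustration of the rarity logic above, the following sketch estimates an accidental's occurrence probability from its frequency in a reference database. The Beta-Binomial posterior and all counts here are hypothetical simplifications; the paper's hierarchical model additionally borrows strength across neighboring locations via spatially varying coefficients.

```python
# Toy rarity estimate for an accidental mark at one grid cell, given how often
# it appears across a reference database of prints. The Beta(a, b) prior and
# the counts below are hypothetical.
def rarity(appearances, total_prints, a=1.0, b=1.0):
    """Posterior mean occurrence probability under a Beta(a, b) prior."""
    return (appearances + a) / (total_prints + a + b)

common = rarity(appearances=120, total_prints=12000)  # seen often -> roughly 1/100
rare = rarity(appearances=0, total_prints=12000)      # never observed -> far smaller
print(f"{common:.4f} {rare:.2e}")
```

The point of the calculation is the contrast: a frequently observed mark contributes little evidential weight, while one never seen in the database yields a posterior occurrence probability orders of magnitude smaller.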
2. Linguistic Insight into Errant Learning Trajectories of Large Language Models, with Whitney Tabor and William Snyder
Large Language Models (LLMs) have achieved remarkable success in modeling natural language, internalizing many grammatical and semantic regularities that linguists identify as characteristic of human language. Nevertheless, they remain expensive to train and sometimes struggle to match specific aspects of human linguistic ability. In this project, we ask whether insights from language theory can reveal systematic patterns in where LLMs succeed or fail, and whether these insights can guide more effective training methods.
We study Meta’s OPT model trained on a BabyLM dataset (100M words), which is “developmentally more plausible” than state-of-the-art LLMs. We evaluate the model under controlled grammatical interventions using the BLiMP benchmark, which spans 67 syntactic categories, each defined by sentence pairs differing in a targeted grammatical rule violation. For example, the sentence “Who did Alan realize he liked?” is natural, whereas “Who did Alan realize who liked?” violates a grammatical constraint known as an Island Constraint, which restricts which parts of hierarchical sentence structures are available for question formation.
Using the BLiMP database, we track the model’s preference for grammatical over ungrammatical sentences across training iterations and grammatical categories. Our results show that in nearly one-third of BLiMP categories—including Island Constraints—OPT fails to consistently assign higher likelihoods to grammatical sentences, even after extensive training. Interestingly, when the model fails, it often establishes an early but clear (erroneous) separation of likelihoods at a stage when other structural behaviors are still developing. We are investigating this transition point as a potential locus where alternative training strategies could improve learning efficiency and model performance.
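The pairwise evaluation described above can be sketched with a toy scorer. In practice one sums the OPT model's token log-probabilities over each sentence; here a smoothed unigram model trained on a hypothetical mini-corpus stands in, so that the comparison logic is self-contained.

```python
import math
from collections import Counter

def sentence_logprob(sentence, unigram_counts, total):
    # Toy stand-in for an LLM: score a sentence by summed unigram
    # log-probabilities with add-one smoothing. Real BLiMP evaluation
    # would sum the model's token log-probs instead.
    vocab = len(unigram_counts) + 1
    return sum(math.log((unigram_counts[w] + 1) / (total + vocab))
               for w in sentence.lower().split())

# Tiny "training corpus" (hypothetical)
corpus = "who did alan realize he liked he liked the book".split()
counts, total = Counter(corpus), len(corpus)

good = "Who did Alan realize he liked?"
bad = "Who did Alan realize who liked?"
# BLiMP scores a pair as correct if the grammatical sentence is more likely
correct = (sentence_logprob(good.rstrip("?"), counts, total)
           > sentence_logprob(bad.rstrip("?"), counts, total))
print(correct)
```

Accuracy on a BLiMP category is then just the fraction of its minimal pairs for which `correct` is `True`; tracking that fraction across training checkpoints gives the learning trajectories discussed above.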
3. Bayesian Models for Joint Selection of Features and Auto-Regressive Lags: Theory and Applications in Environmental and Financial Forecasting, with Sujit Ghosh
This paper introduces Bayesian variable selection methods for linear regression models with autocorrelated errors, using spike-and-slab priors within a two-step MCMC procedure. The approach enables simultaneous selection of predictors (including lagged covariates) and lagged error terms, with guarantees of selection consistency and scalability. Applications to finance (S&P 500 prediction) and environmental science demonstrate its superior predictive performance and accuracy in autoregressive time series models.
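A minimal sketch of the spike-and-slab idea, assuming a continuous-spike (SSVS-style) prior, a known error variance, and a single lag of the response as a candidate predictor. The data, penalty scales, and sampler length are hypothetical, and the paper's actual two-step sampler also selects lagged error terms, which this toy omits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a response driven by one covariate and its own first lag
n = 300
x = rng.normal(size=(n, 3))            # three candidate predictors
y = np.empty(n)
y[0] = 0.0
for t in range(1, n):
    y[t] = 2.0 * x[t, 0] + 0.5 * y[t - 1] + rng.normal(scale=0.5)

# Design: predictors plus one lag of y (joint selection over both)
X = np.column_stack([x[1:], y[:-1]])
yt = y[1:]
p = X.shape[1]

# SSVS Gibbs sampler: continuous spike (tau0) vs slab (tau1) normal priors;
# sigma^2 is treated as known for brevity.
tau0, tau1, sigma2 = 0.01, 10.0, 0.25
gamma = np.ones(p, dtype=int)
XtX, Xty = X.T @ X, X.T @ yt
incl = np.zeros(p)
draws = 2000
for it in range(draws):
    D_inv = np.diag(1.0 / np.where(gamma == 1, tau1**2, tau0**2))
    cov = np.linalg.inv(XtX / sigma2 + D_inv)
    beta = rng.multivariate_normal(cov @ Xty / sigma2, cov)
    # Update inclusion indicators from the spike vs slab densities at beta_j
    log_slab = -0.5 * beta**2 / tau1**2 - np.log(tau1)
    log_spike = -0.5 * beta**2 / tau0**2 - np.log(tau0)
    prob = 1.0 / (1.0 + np.exp(log_spike - log_slab))
    gamma = (rng.uniform(size=p) < prob).astype(int)
    if it >= draws // 2:
        incl += gamma
incl /= draws // 2
print(np.round(incl, 2))   # high inclusion for x1 and the lag of y
```

The posterior inclusion frequencies concentrate on the truly active covariate and the lag term, which is the sense in which predictors and lags are selected jointly rather than in separate passes.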
4. Prediction Interval Estimation in Penalized Regression Models of Insurance Data, with Aditya Vikram Sett, Dipak K. Dey, and Yuwen Gu
This paper addresses prediction uncertainty quantification in Generalized Linear Models (GLMs), a critical need in scientific and business applications. It explores penalized regression for feature selection and de-biasing techniques for post-selection inference, tackling the challenge of constructing valid prediction intervals in the presence of model selection bias. By extending conformal prediction methods from linear models to GLMs, the study applies these techniques, with a focus on Tweedie regression, to insurance claim data, demonstrating their effectiveness in predictive accuracy and uncertainty estimation.
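The conformal idea can be sketched with split conformal prediction on simulated data; a plain least-squares fit stands in for the penalized Tweedie GLM, and the sample sizes and coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Split conformal prediction: fit on a proper training set, calibrate a
# residual quantile on a held-out set, and use it to form intervals.
n, p = 500, 5
X = rng.normal(size=(n, p))
beta = np.array([1.5, -2.0, 0.0, 0.0, 0.5])
y = X @ beta + rng.normal(scale=1.0, size=n)

# Split: proper training set, calibration set, test points
idx = rng.permutation(n)
tr, cal, te = idx[:200], idx[200:400], idx[400:]

bhat, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
resid = np.abs(y[cal] - X[cal] @ bhat)           # calibration residuals

alpha = 0.1                                      # target 90% coverage
k = int(np.ceil((len(cal) + 1) * (1 - alpha)))   # conformal quantile rank
q = np.sort(resid)[k - 1]

pred = X[te] @ bhat
covered = np.mean((y[te] >= pred - q) & (y[te] <= pred + q))
print(round(covered, 2))                         # close to the nominal 0.90
```

The attraction of the split conformal wrapper is that its finite-sample marginal coverage guarantee holds for whatever fitted model is plugged in, which is what makes the extension from linear models to penalized GLMs feasible.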
5. Interval Estimation of Coefficients in Penalized Regression Models of Insurance Data, with Zijian Huang, Dipak K. Dey, and Yuwen Gu
This study focuses on constructing confidence intervals for model parameters in Tweedie regression, commonly used for zero-inflated, semicontinuous insurance loss data. Post-selection inference is emphasized as a crucial step for credibly identifying important features and for addressing the bias of lasso estimates of large coefficients in GLMs. Traditional methods often yield overly optimistic intervals, necessitating bias-correction techniques for valid inference. Methodologies for post-selection confidence interval construction are discussed, with applications to insurance data providing practical insights.
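The lasso-bias issue can be seen in a stylized orthonormal design, where soft-thresholding shrinks a large coefficient by exactly the penalty level and a one-step correction removes the shrinkage. This is a deliberate simplification of the de-biased-lasso machinery needed for GLMs; the design, coefficient, and penalty below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

# Orthonormal design: X'X = n * I, so the lasso is exact soft-thresholding
# and the one-step de-biasing correction recovers the unshrunk estimate.
n, p = 200, 50
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X = Q * np.sqrt(n)
beta = np.zeros(p)
beta[0] = 3.0                          # one large signal coefficient
y = X @ beta + rng.normal(size=n)

z = X.T @ y / n                        # OLS estimate under orthonormality
lam = 0.2
b_lasso = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)   # biased down by lam
b_debias = b_lasso + X.T @ (y - X @ b_lasso) / n          # undoes the shrinkage

print(round(b_lasso[0], 2), round(b_debias[0], 2))
```

Centering an interval at the raw lasso estimate would miss the true value systematically; centering it at the de-biased estimate restores validity, which is the core of the bias-correction step discussed above.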
6. Development of a Statistical Predictive Model for Daily Water Table Depth and Important Variables Selection for Inference, with Sushant Mehan and Devendra M. Amatya
Accurately predicting water table dynamics is crucial for managing groundwater resources, ecosystems, and human activities. This study employs an autoregressive model with sparse coefficients and lagged variables to estimate daily water table depth, using hydroclimatic data from the Santee Experimental Forest (SC) and D1 (NC) sites for 2006–2019 and 1988–2008, respectively. Key predictors include soil/air temperature, precipitation, and radiation; the model attains RMSEs of 10.09 cm (dormant season) and 14.94 cm (daily testing) with high accuracy (R²: 0.93–0.96), and identifies rainfall, solar radiation, and wind direction as critical factors, offering a valuable tool for hydrologic and ecological management.
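The modeling recipe can be sketched by building a lagged design matrix and fitting a sparse model with a tiny coordinate-descent lasso. The covariates, coefficients, and penalty level here are hypothetical rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated daily series: water table depth driven by its own lag and rainfall;
# temperature is included as an inactive (noise) covariate.
n = 400
rain = rng.gamma(2.0, 1.0, size=n)
temp = rng.normal(15, 5, size=n)
depth = np.empty(n)
depth[0] = 50.0
for t in range(1, n):
    depth[t] = 0.8 * depth[t - 1] - 2.0 * rain[t] + 10.0 + rng.normal(scale=1.0)

# Features: today's covariates plus lag-1 water table depth, plus an intercept
X = np.column_stack([rain[1:], temp[1:], depth[:-1], np.ones(n - 1)])
y = depth[1:]

def lasso_cd(X, y, lam, iters=200):
    """Coordinate-descent lasso (the intercept column, assumed last, is unpenalized)."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X**2).sum(axis=0)
    for _ in range(iters):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]      # partial residual for feature j
            rho = X[:, j] @ r
            if j == p - 1:                      # intercept: no shrinkage
                b[j] = rho / col_ss[j]
            else:
                b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return b

b = lasso_cd(X, y, lam=50.0)
rmse = np.sqrt(np.mean((y - X @ b) ** 2))
print(np.round(b, 2), round(rmse, 2))  # temperature coefficient shrunk toward zero
```

The sparsity pattern of the fitted coefficients is what drives the variable-importance conclusions: active lags and covariates survive the penalty, while irrelevant ones are shrunk to (near) zero.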
7. Some clustering-based change-point detection methods applicable to high-dimension, low sample size data, with Trisha Dawn, Angshuman Roy, and Anil K. Ghosh. Journal of Statistical Planning and Inference, Volume 234, January 2025, 106212
Detecting change-points in high-dimensional data with limited sample sizes is a challenging problem. This study proposes clustering-based methods, leveraging k-means and suitable dissimilarity measures, to test for and estimate a single change-point, with theoretical validation in high-dimensional settings. The approach is extended to detect multiple change-points, and its performance is evaluated through extensive simulations and real data analyses, demonstrating competitive results against existing state-of-the-art methods.
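A minimal high-dimension, low-sample-size sketch of the single change-point estimation step: score each candidate split by the size-weighted squared distance between segment means, a 2-means-style criterion that simplifies the paper's dissimilarity measures. The dimensions and the simulated mean shift are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# HDLSS toy: d = 500 dimensions, n = 30 observations, mean shift after t = 15
d, n, tau = 500, 30, 15
X = rng.normal(size=(n, d))
X[tau:] += 0.5   # shift every coordinate after the true change-point

def changepoint(X):
    """Estimate a single change-point as the split maximizing between-segment separation."""
    n = len(X)
    best_t, best_score = None, -np.inf
    for t in range(2, n - 1):
        m1, m2 = X[:t].mean(axis=0), X[t:].mean(axis=0)
        # weight by segment sizes so extreme splits are not favored
        score = (t * (n - t) / n) * np.sum((m1 - m2) ** 2)
        if score > best_score:
            best_t, best_score = t, score
    return best_t

print(changepoint(X))   # close to the true change-point 15
```

Even with only 30 observations, the per-coordinate shifts accumulate over 500 dimensions, which is the HDLSS phenomenon the clustering-based tests exploit.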