Past Seminar Presentations


When non-conformity score functions are pretrained on an independent dataset, we propose a split-conformal–style algorithm that leverages drift detection to adaptively update calibration sets, which provably achieves minimax-optimal regret. When non-conformity scores are instead trained online, we develop a full-conformal–style algorithm that again incorporates drift detection to handle non-stationarity; this approach relies on stability—rather than permutation symmetry—of the model-fitting algorithm, which is often better suited to online learning under evolving environments. We establish non-asymptotic regret guarantees for our online full conformal algorithm, which match the minimax lower bound under appropriate restrictions on the prediction sets. Numerical experiments corroborate our theoretical findings.
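
As a point of reference for the split-conformal setting above, here is a minimal sketch of the standard split-conformal calibration step. It does not include the drift detection or the adaptive calibration-set updates described in the talk. The function names and the absolute-residual score are illustrative assumptions, not taken from the work itself.

```python
import math


def conformal_quantile(calibration_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    Returns infinity when the calibration set is too small for level alpha.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(calibration_scores)[k - 1]


def prediction_interval(point_prediction, calibration_scores, alpha):
    """Symmetric interval, assuming scores are |y - prediction| (an assumption)."""
    q = conformal_quantile(calibration_scores, alpha)
    return (point_prediction - q, point_prediction + q)
```

Under exchangeability of calibration and test points, an interval built this way covers the test response with probability at least 1 - alpha. The talk addresses what to do when that assumption fails because of drift.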


This talk develops methods for constructing statistically meaningful intervals for ranks. For a single ranking task, confidence intervals for ranks can be derived from pairwise tests that are adjusted to control the family-wise error rate. However, these intervals tend to be very conservative. We propose constructing rank intervals from families of pairwise tests that control the false discovery rate, and analyze the statistical properties and the efficiency gains of these new intervals.
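
The step from pairwise tests to rank intervals can be sketched as follows. This is a minimal illustration of the usual counting construction, assuming the significant pairwise decisions have already been computed by some multiplicity-adjusted procedure. The function name and input format are hypothetical, and the sketch does not reproduce the talk's analysis of the FDR-based intervals.

```python
def rank_intervals(n_items, significant_pairs):
    """Rank intervals from pairwise decisions.

    significant_pairs: iterable of (winner, loser) index pairs, meaning that
    `winner` was declared significantly better than `loser`.
    Rank 1 is best. Returns a list of (lower_rank, upper_rank) per item.
    """
    n_better = [0] * n_items  # items declared significantly better than i
    n_worse = [0] * n_items   # items declared significantly worse than i
    for winner, loser in set(significant_pairs):
        n_better[loser] += 1
        n_worse[winner] += 1
    # An item cannot rank ahead of those that beat it, nor behind those it beat.
    return [(1 + n_better[i], n_items - n_worse[i]) for i in range(n_items)]
```

For example, with four competitors and decisions `[(0, 2), (0, 3), (1, 3)]`, competitor 0 receives the interval (1, 2) and competitor 3 receives (3, 4). Fewer significant pairs lead to wider intervals, which is why conservative pairwise adjustments produce conservative rank intervals.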

When multiple ranking tasks are available for the same competitors, as in model leaderboards, we propose an aggregation framework based on prediction intervals. These intervals capture both within-task uncertainty and between-task variability, providing a unified way to quantify ranking uncertainty across tasks. We demonstrate the methods for measuring uncertainty when ranking model features by their importance, and when comparing model performances in public leaderboards.

The talk is based on joint work with Bitya Neuhof and Yoav Benjamini.



Under the first approach, we show quantitatively that high-dimensional conditional expectations under a random permutation prior admit a sharp mean-field approximation. Applied to the classical problem of distribution estimation, this analysis yields an estimator that achieves optimal instance-wise risk in a competitive framework and ultimately bests the classical Good-Turing estimator in both theory and practice.

Under the second approach, we formalize recent empirical evidence that transformers pretrained on synthetic data perform strongly on empirical Bayes tasks. Focusing on the Poisson model, we establish the existence of universal priors under which a pretrained estimator achieves near-optimal regret uniformly over arbitrary test distributions. Our analysis interprets the pretrained estimator as performing hierarchical Bayesian inference: adaptation to unknown test priors arises through posterior contraction, and length generalization (when the test sequence exceeds the training length) corresponds to inference under a fractional posterior. Numerical experiments with pretrained transformers support these theoretical predictions.














This is joint work with Aldo Solari, Lasse Fischer, Rianne de Heide, Aaditya Ramdas, and Jelle Goeman.













The RRT method offers several advantages. First, it utilizes the added randomization to obtain an exact pivot using the full dataset, while accounting for the data-dependent structure of the fitted tree. Second, with a small amount of randomization, the RRT method achieves predictive accuracy similar to a model trained on the entire dataset. At the same time, it provides significantly more powerful inference than data splitting methods, which rely only on a held-out portion of the data for inference. Third, unlike data splitting approaches, it yields intervals that adapt to the signal strength in the data. Throughout this talk, I will demonstrate how RRT transforms a purely predictive algorithm into a method capable of performing reliable and powerful inference in the fitted tree model.








Previously, Candès et al. (2023) introduced a novel method based on CP to generate valid and efficient lower predictive bounds on survival times. This paper considers a different problem: that of generating an upper predictive bound (in addition to a lower predictive bound). We propose a new method using CP that generates two-sided or one-sided prediction intervals for survival times. Specifically, the method provides both lower and upper predictive bounds for individuals deemed sufficiently similar to the non-censored population, while returning only a lower bound for others. The prediction intervals offer finite-sample coverage guarantees, requiring no distributional assumptions other than that the sampled data points are independent and identically distributed. The performance of the procedure is assessed using both synthetic and real-world datasets. Joint work with Chris Holmes (Dept. of Statistics, Oxford University).


In this setting, our emphasis is on obtaining FDP confidence bounds that both have non-asymptotic coverage and are asymptotically accurate in a specific sense, as the number m of tested hypotheses grows. Namely, we introduce and study the property (which we call m-consistency) that the confidence bound converges to or below the desired level α when applied to a specific reference α-level false discovery rate (FDR) controlling procedure.

With this perspective in mind, we derive new bounds that improve on existing ones, both theoretically and practically, and are suitable for situations where at least a moderate number of rejections is expected. In particular, the improvement is significant for knockoff p-values, which demonstrates the practical impact of the method. These improvements are illustrated with numerical experiments and real data examples.






























In this paper, we study the general problem of efficiently estimating target population risk under various dataset shift conditions, leveraging semiparametric efficiency theory. We consider a general class of dataset shift conditions, which includes three popular conditions (covariate, label and concept shift) as special cases. We allow for partially non-overlapping support between the source and target populations. We develop efficient and multiply robust estimators along with a straightforward specification test of these dataset shift conditions. We also derive efficiency bounds for two other dataset shift conditions, posterior drift and location-scale shift. Simulation studies support the efficiency gains due to leveraging plausible dataset shift conditions. This is joint work with Hongxiang David Qiu and Eric Tchetgen Tchetgen.



























Our theory and experiments suggest that conformal prediction with noisy labels and commonly used score functions conservatively covers the clean ground truth labels except in adversarial cases.



















While tests based on average coverage intervals do not control size in the usual frequentist sense, certain results on false discovery rate (FDR) control of multiple testing procedures continue to hold when applied to such tests. In particular, the Benjamini and Hochberg (1995) step-up procedure still controls FDR in the asymptotic regime with many weakly dependent $p$-values, and certain adjustments for dependent $p$-values such as the Benjamini and Yekutieli (2001) procedure continue to yield FDR control in finite samples.
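
For reference, the two procedures named above can be sketched in a few lines. This is a generic textbook version of the Benjamini-Hochberg step-up rule with the optional Benjamini-Yekutieli constant. It does not construct the tests based on average coverage intervals, and the function and parameter names are hypothetical.

```python
def step_up_rejections(p_values, alpha=0.05, arbitrary_dependence=False):
    """Return sorted indices rejected by the BH step-up procedure.

    With arbitrary_dependence=True, alpha is divided by the harmonic number
    H_m = 1 + 1/2 + ... + 1/m (the Benjamini-Yekutieli adjustment).
    """
    m = len(p_values)
    if m == 0:
        return []
    if arbitrary_dependence:
        alpha = alpha / sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    n_reject = 0
    # Step-up: find the largest rank k with p_(k) <= alpha * k / m.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            n_reject = rank
    return sorted(order[:n_reject])
```

The step-up character shows in inputs such as `[0.02, 0.03, 0.035, 0.9]` at level 0.05. The two smallest p-values miss their own thresholds, but the third meets its threshold, so all three are rejected.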


Then, I will present a simple, yet powerful, idea: using e-values as unnormalized weights in multiple testing. Most standard weighted multiple testing methods require the weights to deterministically add up to the number of hypotheses being tested (equivalently, the average weight is unity). But this normalization is not required when the weights are e-values obtained from independent data. This could result in a massive increase in power, especially if the non-null hypotheses have e-values much larger than one. More broadly, we study how to combine an e-value and a p-value, and design multiple testing procedures where both e-values and p-values are available for some hypotheses. A case study with RNA-seq and microarray data will demonstrate the practical power benefits.
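
One way to read the idea of e-values as unnormalized weights is to divide each p-value by its e-value and then run a step-up procedure on the ratios. The sketch below follows that reading. It assumes, as stated above, that the e-values come from data independent of the p-values, and its function name and handling of zero e-values are illustrative choices rather than details from the talk.

```python
def e_weighted_step_up(p_values, e_values, alpha=0.05):
    """Step-up procedure applied to p_i / e_i; weights need not average to one."""
    m = len(p_values)
    # An e-value of zero carries no evidence, so that hypothesis is never rejected.
    ratios = [
        p / e if e > 0 else float("inf")
        for p, e in zip(p_values, e_values)
    ]
    order = sorted(range(m), key=lambda i: ratios[i])
    n_reject = 0
    for rank, idx in enumerate(order, start=1):
        if ratios[idx] <= alpha * rank / m:
            n_reject = rank
    return sorted(order[:n_reject])
```

With p-values `[0.03, 0.03, 0.5, 0.9]` and all e-values equal to one, nothing is rejected at level 0.05. Changing the first e-value to 10 leads to rejecting the first hypothesis, which illustrates the power gain from a large e-value on a non-null.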

These are joint works with Ruodu Wang, Neil Xu and Nikos Ignatiadis.






This is joint work with Will Fithian and Lihua Lei.













This is joint work with Daniel Xiang and Will Fithian.


Our work also has implications for multiple testing in sequential settings, since it applies at stopping times to continuously-monitored confidence sequences and multi-armed bandit sampling.





















Joint work with Luella Fu, Alessio Saretto, and Wenguang Sun.



This talk is based upon joint work with Peter W. Macdonald and Daniel Kessler.






The FDA gave Accelerated Approval to Aduhelm™ (aducanumab) for Alzheimer's Disease (AD) on 8 June 2021, based on its reduction of beta-amyloid plaque (a surrogate biomarker endpoint). When clinical efficacy of a treatment for the overall population is not shown, genome-wide association studies (GWAS) are often used to discover SNPs that might predict efficacy in subgroups. In the process of working on GWAS with real data, we came to the realization that, if one causal SNP makes its zero-null hypothesis false, then all other zero-null hypotheses are statistically false as well. While the majority of no-association null hypotheses might well be true biologically, statistically they are false (if one is false) in GWAS. I will illustrate this with a causal SNP for the ApoE gene, which is involved in the clearance of beta-amyloid plaque in AD. We suggest our confidence interval CE4 approach instead.

Targeted therapies such as OPDIVO and TECENTRIQ naturally have patient subgroups, already defined by the extent to which the drug target is present or absent in them, that may derive differential efficacy. An additional danger of testing equality nulls in the presence of subgroups is that the illusory logical relationships among efficacy in subgroups and their mixtures created by exact equality nulls lead to too drastic a stepwise multiplicity reduction, resulting in inflated directional error rates, as I will explain. Instead, Partition Tests, which would be called Confident Direction methods in the language of Tukey, might be safer to use.










(a) The stationary points of the objective are automatically sparse (i.e., the objective performs selection) -- no explicit ℓ1 penalization is needed.

(b) All stationary points of the objective exclude noise variables with high probability.

(c) Guaranteed recovery of all signal variables without needing to reach the objective's global maxima or special stationary points.

The second and third properties mean that all our theoretical results apply in the practical case where one uses gradient ascent to maximize the metric learning objective. While not all metric learning objectives enjoy good statistical power, we design an objective based on ℓ1 kernels that does exhibit favorable power: it recovers (i) main effects with n ∼ log p samples, (ii) hierarchical interactions with n ∼ log p samples and (iii) order-s pure interactions with n ∼ p^{2(s−1)} log p samples.













































 



Standard textbook confidence intervals are only valid at fixed sample sizes, but scientific datasets are often collected sequentially and potentially stopped early, thus introducing a critical selection bias. A "confidence sequence" is a sequence of intervals, one for each sample size, that are uniformly valid over all sample sizes, and are thus valid at arbitrary data-dependent sample sizes. One can show that constructing fixed-sample confidence intervals at every time step guarantees false coverage rate control, while constructing confidence sequences at each time step guarantees post-hoc familywise error rate control. We show that at a price of about two (doubling of width), pointwise asymptotic confidence intervals can be extended to uniform nonparametric confidence sequences. The crucial role of some beautiful nonnegative supermartingales will be made transparent in enabling "safe anytime-valid inference".
This talk will mostly feature joint work with Steven R. Howard (Berkeley, Voleon), Jon McAuliffe (Berkeley, Voleon), Jas Sekhon (Berkeley, Bridgewater) and recently Larry Wasserman (CMU) and Sivaraman Balakrishnan (CMU). I will also cover interesting historical and contemporary contributions to this area.
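
The link between nonnegative supermartingales and confidence sequences can be stated through Ville's inequality. The following is a standard textbook formulation rather than the specific construction from the talk; the notation $M_t$, $C_t$ and $\theta^\ast$ is introduced here for illustration.

```latex
% Ville's inequality: (M_t)_{t \ge 0} is a nonnegative supermartingale with M_0 = 1.
\[
  \Pr\bigl(\exists\, t \ge 0 : M_t \ge 1/\alpha\bigr) \le \alpha .
\]
% If M_t(\theta^\ast) is such a process at the true parameter \theta^\ast, then
\[
  C_t = \{\theta : M_t(\theta) < 1/\alpha\}
  \quad\text{satisfies}\quad
  \Pr\bigl(\forall\, t \ge 0 : \theta^\ast \in C_t\bigr) \ge 1 - \alpha .
\]
```

Because the bound holds simultaneously over all times, coverage is preserved at any data-dependent stopping time.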


We introduce a method to rigorously draw causal inferences, that is, inferences immune to all possible confounding, from genetic data that include parents and offspring. Causal conclusions are possible with these data because the natural randomness in meiosis can be viewed as a high-dimensional randomized experiment. We make this observation actionable by developing a novel conditional independence test that identifies regions of the genome containing distinct causal variants. The proposed Digital Twin Test compares an observed offspring to carefully constructed synthetic offspring from the same parents to determine statistical significance, and it can leverage any black-box multivariate model and additional non-trio genetic data to increase power. Crucially, our inferences are based only on a well-established mathematical model of recombination and make no assumptions about the relationship between the genotypes and phenotypes.