This schedule is only provisional. The final schedule will be confirmed in the days before the workshop.
On the second day of the workshop, there will be sessions on cybersecurity and AI safety.
8.30 - 9.10
9.10 - 9.15
Francesco Sanna Passino, Patrick Rubin-Delanchy, Nick Heard
9.15 - 10.00
Scalable community detection in massive networks via predictive assignment
Massive network datasets are becoming increasingly common in scientific applications. Existing community detection methods encounter significant computational challenges for such massive networks for two reasons. First, the full network needs to be stored and analyzed on a single server, leading to high memory costs. Second, existing methods typically use matrix factorization or iterative optimization using the full network, resulting in high runtimes. We propose a strategy called predictive assignment to enable computationally efficient community detection while ensuring statistical accuracy. The core idea is to avoid large-scale matrix computations by breaking up the task into a smaller matrix computation plus a large number of vector computations that can be carried out in parallel. Under the proposed method, community detection is carried out on a small subgraph to estimate the relevant model parameters. Next, each remaining node is assigned to a community based on these estimates. We prove that predictive assignment achieves strong consistency under the stochastic block model and its degree-corrected version. We also demonstrate the empirical performance of predictive assignment on simulated networks and two large real-world datasets: DBLP (Digital Bibliography & Library Project), a computer science bibliographic database, and the Twitch Gamers Social Network. Joint work with Subhankar Bhadra and Srijan Sengupta.
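A minimal illustrative sketch of the predictive-assignment idea (not the speakers' implementation): spectral community detection is run on a small random subgraph, block probabilities are estimated there, and every remaining node is assigned by a cheap Bernoulli log-likelihood vector computation. The 2-block SBM, subgraph size, and sign-thresholding step are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a small 2-block SBM: within-probability 0.4, between 0.05.
n, K = 400, 2
z = rng.integers(0, K, size=n)                        # true communities
B = np.array([[0.40, 0.05], [0.05, 0.40]])
P = B[z][:, z]
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1); A = A + A.T                        # symmetric, no self-loops

# Step 1: community detection on a small random subgraph only.
m = 100
sub = rng.choice(n, size=m, replace=False)
rest = np.setdiff1d(np.arange(n), sub)
A_sub = A[np.ix_(sub, sub)]
vals, vecs = np.linalg.eigh(A_sub)
z_sub = (vecs[:, -2] > 0).astype(int)                 # sign of 2nd eigenvector

# Estimate the block-probability matrix from the subgraph.
B_hat = np.zeros((K, K))
for a in range(K):
    for b in range(K):
        B_hat[a, b] = A_sub[np.ix_(z_sub == a, z_sub == b)].mean()

# Step 2: assign each remaining node by the Bernoulli log-likelihood of
# its edges into the subgraph -- cheap, parallelizable vector computations.
z_hat = np.empty(n, dtype=int)
z_hat[sub] = z_sub
eps = 1e-6
logB, log1mB = np.log(B_hat + eps), np.log(1 - B_hat + eps)
counts = np.column_stack([A[np.ix_(rest, sub[z_sub == b])].sum(1) for b in range(K)])
sizes = np.array([(z_sub == b).sum() for b in range(K)])
ll = counts @ logB.T + (sizes - counts) @ log1mB.T    # (n - m, K)
z_hat[rest] = ll.argmax(1)

accuracy = max(np.mean(z_hat == z), np.mean(z_hat == 1 - z))
```

Only the m x m subgraph is ever factorized; each remaining node touches just its own edge counts into the subgraph.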
10.00 - 10.45
Performance Guarantees for LLMs via Data Kernel Perspective Space
Performance guarantees for LLMs are essential for any application in which quantifiable confidence in LLM performance is non-negotiable. The Data Kernel Perspective Space (DKPS) provides a framework to generate empirically validated theoretical predictions, allowing guaranteed performance quantification for these transformative models. I will present the probability, statistics and linear algebra underlying DKPS, together with illustrative examples. The mathematical framework is analogous to that introduced for network time series in Avanti Athreya, Zachary Lubberts, Youngser Park, CEP, “Euclidean mirrors and dynamics in network time series”, Journal of the American Statistical Association (JASA), 2025. Spirited discussion will ensue...
10.45 - 11.00
11.00 - 11.45
Statistically and Computationally Optimal Estimation and Inference of Common Subspaces
Given multiple data matrices, many problems in statistics and data science rely on estimating a common subspace that captures certain structure shared by all the data matrices. In this talk, we investigate the statistical and computational limits for the common subspace model in which one observes a collection of symmetric low-rank matrices perturbed by noise, where each low-rank matrix shares the same common subspace. Our main results identify several regimes of the signal-to-noise ratio (SNR) such that estimation and inference are statistically or computationally optimal, and we refer to these regimes as weak SNR, moderate SNR, strong estimation SNR, and strong inference SNR. First, we propose an estimator based on projected gradient descent initialized via spectral sum of squares and show that it achieves the optimal sin(Θ) error rate under strong inference SNR, further attaining the asymptotically optimal constant under Gaussian noise. These results are complemented by both statistical and computational lower bounds identifying the weak and moderate SNR regimes. Next, we turn to statistical inference for the sin(Θ) distance itself, and we show that our estimator has an asymptotically Gaussian distribution under strong inference SNR. Based on this limiting result we propose confidence intervals and show that they are minimax optimal in all regimes, though the resulting confidence intervals require knowledge of the SNR. However, we then show that SNR-adaptive confidence intervals are information-theoretically impossible below the strong inference SNR. Consequently, our results unveil a novel phenomenon: despite the SNR being "above" the computational limit for estimation, adaptive statistical inference may still be information-theoretically impossible.
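A hedged sketch of the common subspace setting and one natural reading of the "spectral sum of squares" initialization: given symmetric matrices A_i sharing a rank-r subspace U, eigendecompose the sum of the squared matrices (squaring makes all signal eigenvalues positive, so the shared subspace dominates). The noise model and signal strengths below are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, m, sigma = 200, 3, 10, 0.5

# Common orthonormal subspace shared by all low-rank signal matrices.
U, _ = np.linalg.qr(rng.standard_normal((n, r)))

mats = []
for _ in range(m):
    lam = rng.uniform(5, 10, size=r) * np.sqrt(n)   # per-matrix signal strengths
    S = U @ np.diag(lam) @ U.T                      # rank-r signal, shared U
    E = rng.standard_normal((n, n)) * sigma
    mats.append(S + (E + E.T) / np.sqrt(2))         # symmetric Gaussian noise

# Spectral sum of squares: eigendecompose sum_i A_i^2.
SSQ = sum(A @ A for A in mats)
vals, vecs = np.linalg.eigh(SSQ)
U_hat = vecs[:, -r:]                                # top-r eigenvectors

# sin(Theta) distance between estimated and true subspaces.
s = np.linalg.svd(U.T @ U_hat, compute_uv=False)
sin_theta = float(np.sqrt(max(0.0, 1.0 - s.min() ** 2)))
```

In the paper this spectral step only initializes a projected gradient descent refinement; the sketch stops at the initializer.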
11.45 - 12.30
Compressed Bayesian Tensor Regression
To address the common problem of high dimensionality in tensor regressions, we introduce a generalized tensor random projection method that embeds high-dimensional tensor-valued covariates into low-dimensional subspaces with minimal loss of information about the responses. The method is flexible, allowing for tensor-wise, mode-wise, or combined random projections as special cases. A Bayesian inference framework is provided featuring the use of a hierarchical prior distribution and a low-rank representation of the parameter. Strong theoretical support is provided for the concentration properties of the random projection and posterior consistency of the Bayesian inference. An efficient Gibbs sampler is developed to perform inference on the compressed data. To mitigate the sensitivity introduced by random projections, Bayesian model averaging is employed, with normalising constants estimated using reverse logistic regression. An extensive simulation study is conducted to examine the effects of different tuning parameters. Simulations indicate, and the real data application confirms, that compressed Bayesian tensor regression can achieve better out-of-sample prediction while significantly reducing computational cost compared to standard Bayesian tensor regression. Joint work with R. Craiu and Q. Wang.
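The abstract's mode-wise random projection and its concentration property can be sketched as follows: one Gaussian matrix per tensor mode, scaled so the projected Frobenius norm matches the original in expectation, with a Monte Carlo check of concentration. Dimensions, scaling, and the check itself are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(2)
dims, k = (10, 10, 10), (4, 4, 4)    # original and compressed mode sizes

def random_projection(rng):
    # One Gaussian matrix per mode, scaled so E||PX||_F^2 = ||X||_F^2.
    return [rng.standard_normal((k[m], dims[m])) / np.sqrt(k[m]) for m in range(3)]

def project(X, G):
    """Multiply a 3-way tensor by G[m] along each mode m."""
    Y = np.einsum('ia,abc->ibc', G[0], X)
    Y = np.einsum('jb,ibc->ijc', G[1], Y)
    return np.einsum('kc,ijc->ijk', G[2], Y)

# Monte Carlo check: the squared-norm ratio concentrates around 1,
# i.e. the projection loses little information about tensor magnitude.
X = rng.standard_normal(dims)
ratios = []
for _ in range(500):
    G = random_projection(rng)
    ratios.append(np.sum(project(X, G) ** 2) / np.sum(X ** 2))
mean_ratio = float(np.mean(ratios))
```

Tensor-wise (one big matrix on the vectorized tensor) or combined projections fit the same pattern; averaging over projection draws mirrors the Bayesian model averaging used to reduce projection sensitivity.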
12.30 - 13.45
13.45 - 14.30
Common Drivers in Sparsely Interacting Multiple Hawkes Processes
We study actors in a network who can cast instantaneous events, e.g., posts on a social media platform. These events are modelled as a multivariate Hawkes process allowing for mutual excitation; that is, events cast by one actor may increase the activity of other actors. Such influences can, in turn, be interpreted as edges of a network. Our primary target is estimation of this influence network. However, we believe that a second relevant driver of such events is globally available covariates, e.g., time of day, or information about the actors. We therefore study a model that also includes such common drivers. We derive rates of convergence for all of the model parameters when both the number of actors and the time horizon tend to infinity. To prevent an exploding network, sparsity is assumed, and enforced by a LASSO-type estimator. In order to allow for inference, we study a de-biasing procedure. We will also discuss numerical aspects, and in this context an extension to multiple Hawkes processes, that is, settings where several classes of interactions are observed that follow the same dynamics. To conduct inference on the network, we use a permutation-type test. This is joint work with Enno Mammen, Eckehard Olbrich, and Wolfgang Polonik.
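The mutual-excitation mechanism can be illustrated with a minimal two-actor Hawkes simulator using Ogata's thinning algorithm and exponential kernels (a standard construction, not the authors' model, which additionally includes common covariate drivers and a LASSO estimator). All rates below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)

# Two actors: actor 0 excites actor 1 but not vice versa.
mu = np.array([0.5, 0.2])              # baseline intensities
alpha = np.array([[0.0, 0.0],          # alpha[j, i]: jump in actor j's
                  [0.8, 0.0]])         # intensity when actor i fires
beta = 2.0                             # exponential decay rate
T = 200.0

def intensity(t, events):
    lam = mu.copy()
    for ti, i in events:
        lam += alpha[:, i] * np.exp(-beta * (t - ti))
    return lam

# Ogata thinning: between events the intensity only decays, so the
# current total intensity is a valid upper bound for the next candidate.
events, t = [], 0.0
while True:
    lam_bar = intensity(t, events).sum()
    t += rng.exponential(1.0 / lam_bar)
    if t >= T:
        break
    lam = intensity(t, events)
    if rng.random() < lam.sum() / lam_bar:     # accept candidate event
        i = rng.choice(2, p=lam / lam.sum())   # attribute it to an actor
        events.append((t, i))

counts = np.bincount([i for _, i in events], minlength=2)
```

Actor 1's event count exceeds its baseline rate times T because of the excitation edge 0 -> 1; recovering exactly such edges (here the nonzero entry of `alpha`) is the estimation target in the talk.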
14.30 - 15.15
Generalized Grade-of-Membership Estimation for High-dimensional Locally Dependent Data
This work focuses on mixed membership models for multivariate categorical data, widely used for analyzing survey responses and population genetics data. These grade-of-membership (GoM) models offer rich modeling power but present significant estimation challenges for high-dimensional polytomous data. Popular existing approaches, such as Bayesian MCMC inference, are not scalable and lack theoretical guarantees in high-dimensional settings. To address this, we first observe that data from this model can be reformulated as a three-way (quasi-)tensor, with many subjects responding to many items with varying numbers of categories. We introduce a novel and simple approach that flattens the three-way quasi-tensor into a "fat" matrix, and then performs a singular value decomposition of it to estimate parameters by exploiting the singular subspace geometry. Our fast spectral method can accommodate a broad range of data distributions with arbitrarily locally dependent noise, which we formalize as the generalized-GoM models. We establish finite-sample entrywise error bounds for the generalized-GoM model parameters. This is supported by a new sharp two-to-infinity singular subspace perturbation theory for locally dependent and flexibly distributed noise, a contribution of independent interest. Simulations and applications to data in political surveys, population genetics, and single-cell sequencing demonstrate our method's superior performance.
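The flatten-then-SVD step can be sketched on simulated GoM data: one-hot responses form a subjects x items x categories tensor, which is reshaped into a "fat" subjects x (items*categories) matrix whose leading singular subspace tracks the membership structure. Dirichlet parameters and the population comparison are illustrative choices; the paper's actual estimator additionally exploits the singular subspace geometry to recover parameters.

```python
import numpy as np

rng = np.random.default_rng(3)
n, J, C, K = 1000, 100, 3, 2    # subjects, items, categories, extreme profiles

# Membership proportions Pi (rows on the simplex) and, for each extreme
# profile k and item j, a category distribution Theta[k, j, :].
Pi = rng.dirichlet(np.full(K, 0.5), size=n)
Theta = rng.dirichlet(np.full(C, 0.3), size=(K, J))

# Subject i's response distribution for item j mixes the profiles.
Prob = np.einsum('ik,kjc->ijc', Pi, Theta)          # n x J x C

# Sample one category per (subject, item), one-hot encoded: the data
# form a three-way (quasi-)tensor of responses.
R = np.zeros((n, J, C))
u = rng.random((n, J))
for j in range(J):
    cum = Prob[:, j, :].cumsum(axis=1)
    cum[:, -1] = 1.0                                # guard against rounding
    cat = (u[:, [j]] > cum).sum(axis=1)
    R[np.arange(n), j, cat] = 1.0

# Flatten into a "fat" n x (J*C) matrix and take its rank-K SVD.
M = R.reshape(n, J * C)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
U_hat = U[:, :K]

# Compare against the singular subspace of the noiseless mean matrix.
U0, _, _ = np.linalg.svd(Prob.reshape(n, J * C), full_matrices=False)
sv = np.linalg.svd(U0[:, :K].T @ U_hat, compute_uv=False)
sin_theta = float(np.sqrt(max(0.0, 1.0 - sv.min() ** 2)))
```

The visible gap between the K-th and (K+1)-th singular values reflects the rank-K mean structure; the multinomial dependence within each item's C columns is a simple instance of the local dependence the generalized-GoM theory allows.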
15.15 - 15.30
15.30 - 16.15
Hierarchical clustering with dot products
We offer a new perspective on the well-established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides an estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing data dimension. Joint work with Annie Gray, Alex Modell and Patrick Rubin-Delanchy.
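The recommended variant is a one-line change to standard agglomerative clustering: merge the pair of clusters with the largest average pairwise dot product rather than the smallest distance. A naive O(n^3)-ish sketch on four points (two per direction) illustrates the merge rule; the toy data are, of course, not from the talk.

```python
import numpy as np

def dot_product_linkage(X):
    """Agglomerative clustering: repeatedly merge the pair of clusters with
    the largest average pairwise dot product, recording the merge tree."""
    clusters = {i: [i] for i in range(len(X))}
    G = X @ X.T                                   # Gram matrix of dot products
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best, best_pair = -np.inf, None
        for a in range(len(keys)):
            for b in range(a + 1, len(keys)):
                ia, ib = clusters[keys[a]], clusters[keys[b]]
                avg = G[np.ix_(ia, ib)].mean()    # average dot product
                if avg > best:
                    best, best_pair = avg, (keys[a], keys[b])
        a, b = best_pair
        merges.append((a, b, best))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges

# Two well-separated directions: points 0-1 and 2-3 share a direction,
# so they have large mutual dot products and merge first.
X = np.array([[5.0, 0.2], [5.0, -0.2], [0.2, 5.0], [-0.2, 5.0]])
merges = dot_product_linkage(X)
```

Merging by dot product rather than Euclidean distance is what lets the output tree track direction-based (and hence hierarchical) structure as the data dimension grows.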
16.15 - 17.00
Efficient Analysis of Latent Spaces in Heterogeneous Networks
In this talk, we introduce a unified framework for efficient estimation under latent space modeling of heterogeneous networks. We consider a class of latent space models that decompose latent vectors into shared and network-specific components across networks. We develop a novel procedure that first identifies the shared latent vectors and further refines estimates through efficient score equations to achieve statistical efficiency. Oracle error rates for estimating the shared and heterogeneous latent vectors are established simultaneously. The analysis framework offers remarkable flexibility, accommodating various types of edge weights under general distributions.
17.00 - 17.45
Community detection via curvature gaps
We consider clustering in stochastic blockmodel graphs from the perspective of Ollivier's Ricci Curvature, an extension of Ricci Curvature on manifolds to this discrete setting. The gap between the distributions of edge curvatures for within-cluster edges and between-cluster edges allows us to identify these two groups of edges by their curvature, guaranteeing effective clustering. This curvature gap is studied under multiple signal strength regimes, identifying the limiting distributions and exploring the limits of curvature-based clustering. These distributional limits for edge curvatures are the first of their kind in the literature, and show that curvature-based clustering can be an effective competitor to traditional clustering methods, even in low signal strength settings.
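Edge curvature here is Ollivier's: kappa(u, v) = 1 - W1(mu_u, mu_v), where mu_x is a lazy random-walk measure and W1 is the Wasserstein distance over shortest-path distances. A small self-contained sketch (alpha = 1/2 laziness, W1 via a linear program) shows the curvature gap on a toy graph: within-cluster edges are positively curved, a bridge edge negatively. This is the standard definition, not the talk's SBM analysis.

```python
import numpy as np
from scipy.optimize import linprog

def shortest_paths(A):
    """All-pairs shortest path lengths via repeated BFS (unweighted graph)."""
    n = len(A)
    D = np.full((n, n), np.inf)
    for s in range(n):
        D[s, s], frontier, d = 0, [s], 0
        while frontier:
            d += 1
            nxt = []
            for u in frontier:
                for v in np.flatnonzero(A[u]):
                    if D[s, v] == np.inf:
                        D[s, v] = d
                        nxt.append(v)
            frontier = nxt
    return D

def ollivier_ricci(A, u, v, alpha=0.5):
    """kappa(u, v) = 1 - W1(mu_u, mu_v) for lazy random-walk measures."""
    n = len(A)
    D = shortest_paths(A)
    def mu(x):
        m = np.zeros(n)
        nbrs = np.flatnonzero(A[x])
        m[x] = alpha
        m[nbrs] = (1 - alpha) / len(nbrs)
        return m
    mu_u, mu_v = mu(u), mu(v)
    su, sv = np.flatnonzero(mu_u), np.flatnonzero(mu_v)
    cost = D[np.ix_(su, sv)].ravel()
    # Transport plan T (|su| x |sv|): row sums mu_u, column sums mu_v.
    A_eq, b_eq = [], []
    for i in range(len(su)):
        row = np.zeros((len(su), len(sv))); row[i, :] = 1
        A_eq.append(row.ravel()); b_eq.append(mu_u[su[i]])
    for j in range(len(sv)):
        col = np.zeros((len(su), len(sv))); col[:, j] = 1
        A_eq.append(col.ravel()); b_eq.append(mu_v[sv[j]])
    res = linprog(cost, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return 1 - res.fun

# Two triangles joined by a single bridge edge: the within-triangle edge
# is positively curved, the bridge edge negatively curved.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
kappa_within = ollivier_ricci(A, 0, 1)
kappa_bridge = ollivier_ricci(A, 2, 3)
```

The gap between the two curvature values is exactly the kind of separation that, in the SBM setting, lets within-cluster and between-cluster edges be told apart by curvature alone.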
18.30 - 20.30
8.30 - 9.15
9.15 - 10.00
Security of text embeddings
Vector databases store collections of high-dimensional data generated from text corpora by embedding models. As these models are increasingly used in applications such as semantic search, translation, and retrieval-augmented generation (RAG), understanding whether private data can be recovered from them has become critical. It was once assumed that text encoding inherently obscured the original content; however, recent work has shown that vector databases may be vulnerable to inversion attacks, in which adversaries can partially reconstruct the original text from embeddings. At the same time, as embedding models evolve from being trained for specific tasks toward more general-purpose or universal embeddings, researchers have observed that many models appear to capture similar geometric structures. This observation has led some to propose the Platonic Representation Hypothesis: the idea that embeddings may be converging toward a universal representation of language. If true, this convergence would have profound implications for the security of vector databases. In this talk, I will introduce the emerging field of embedding security, present evidence for a universal geometry of language, and discuss its consequences for data privacy.
10.00 - 10.45
Data Science for Cyber Security and National Resilience
Cyber security offers rich opportunities for statistics and machine learning to deliver real-world impact, but there are also significant challenges. This talk will explore how we apply statistical analysis and ML/AI to diverse cyber data: from DMARC and DNS records, attack surface scans and host-based telemetry, to penetration test and threat intelligence reports. Our work spans modelling networks of organisations and supply chains, inferring inter-organisational links from DNS flows, and classifying domains and web content by maliciousness or sector. We will also review wider trends and challenges in cyber security, aiming to spark ideas for novel statistical and ML applications in this critical domain.
10.45 - 11.00
11.00 - 11.45
Statistical Methods for Modelling and Monitoring Enterprise Authentication Systems
Network authentication data provides a rich source of information about user behaviour within enterprise systems, yet presents significant challenges for operational anomaly detection. The scale and complexity of authentication logs, combined with temporal variation and the absence of comprehensive labelling, necessitate robust statistical approaches that can adapt to evolving network conditions. This talk explores statistical methods for modelling authentication activity within cyber security settings, with emphasis on the translation of statistical models into reliable detection systems that operate effectively in real-world enterprise environments. Key challenges include high dimensional data, managing continuously shifting data characteristics and maintaining manageable false positive rates in operational settings. Through examples drawn from authentication monitoring, we illustrate both methodological developments and practical considerations for deployment.
11.45 - 12.30
Graph Neural Networks to Learn Graph Representations
Graph Neural Networks (GNNs) have celebrated many academic and industrial successes in recent years, providing a rich ground for theoretical analysis and achieving state-of-the-art results in several learning tasks. In this talk, I will give an accessible introduction to the area of Graph Representation Learning with a focus on GNNs. I will then highlight two projects in which we progressed our understanding of informative graph representations obtained from GNNs. Firstly, we explore the question of how to optimally represent graphs by learning a parametrised matrix representation of graphs in GNNs (ICLR'21). Then, we will discuss more recent work (ICLR'25) in which we theoretically analysed how the inclusion of virtual nodes, i.e., newly introduced nodes that are connected to all existing nodes in an input graph, into GNNs affects the learned graph representations. I will conclude each of the two projects by briefly reviewing our ongoing follow-up work.
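The virtual-node construction is simple to sketch: append one node connected to every existing node, then run message passing. After a single round of mean aggregation the virtual node holds a graph-level summary, which it broadcasts back on the next round. The mean-aggregation update is an illustrative GNN layer, not the specific architectures analysed in the talk.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 6, 4
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T                 # random undirected graph
X = rng.standard_normal((n, d))                # node features

# Append a virtual node connected to every existing node.
A_v = np.zeros((n + 1, n + 1))
A_v[:n, :n] = A
A_v[:n, n] = A_v[n, :n] = 1
X_v = np.vstack([X, np.zeros(d)])              # virtual node starts at zero

def propagate(A, X):
    """One round of mean-aggregation message passing with self-loops."""
    A_hat = A + np.eye(len(A))
    deg = A_hat.sum(1, keepdims=True)
    return (A_hat @ X) / deg

H = propagate(A_v, X_v)
# After one round, H[-1] is the (slightly shrunk) mean of all node
# features: every node is now two hops from global information.
```

This is why virtual nodes change learned representations: they shortcut the graph's diameter to 2, mixing global structure into every node's embedding.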
12.30 - 13.45
13.45 - 14.30
Title TBD
Abstract TBD
14.30 - 15.15
Extreme value theory for singular subspace estimation in the matrix denoising model
In this talk, we study fine-grained singular subspace estimation in the matrix denoising model where a deterministic low-rank signal matrix is additively perturbed by a stochastic matrix of Gaussian noise. We establish that the maximum Euclidean row norm (i.e., the two-to-infinity norm) of the aligned difference between the leading sample and population singular vectors approaches the Gumbel distribution in the large-matrix limit, under suitable signal-to-noise conditions and after appropriate centering and scaling. We apply our novel asymptotic distributional theory to test hypotheses of low-rank signal structure encoded in the leading singular vectors and their corresponding principal subspace. We provide de-biased estimators for the corresponding nuisance signal singular values and show that our proposed plug-in test statistic has desirable properties. Notably, compared to using the Frobenius norm subspace distance, our test statistic based on the two-to-infinity norm has higher power to detect structured alternatives that differ from the null in only a few matrix entries or rows. Our main results are obtained by a novel synthesis of and technical analysis involving entrywise matrix perturbation analysis, extreme value theory, saddle point approximation methods, and random matrix theory. Our contributions complement the existing literature for matrix denoising focused on minimaxity, mean squared error analysis, unitarily invariant distances between subspaces, component-wise asymptotic distributional theory, and row-wise uniform error bounds. Numerical simulations illustrate our main results and demonstrate the robustness properties of our testing procedure to non-Gaussian noise distributions.
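The central quantity can be sketched numerically: in the matrix denoising model, align the leading sample singular vectors to the population ones via orthogonal Procrustes and take the maximum Euclidean row norm (the two-to-infinity norm) of the difference. The simulation parameters are illustrative; the talk's contribution is the Gumbel limit and the resulting tests, not this computation.

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, snr = 500, 2, 20.0

# Low-rank signal plus Gaussian noise: M = U0 diag(s0) V0^T + E.
U0, _ = np.linalg.qr(rng.standard_normal((n, r)))
V0, _ = np.linalg.qr(rng.standard_normal((n, r)))
s0 = snr * np.sqrt(n) * np.array([1.5, 1.0])
M = U0 @ np.diag(s0) @ V0.T + rng.standard_normal((n, n))

U, s, Vt = np.linalg.svd(M, full_matrices=False)
U_hat = U[:, :r]

# Align sample to population singular vectors via orthogonal Procrustes.
Wu, _, Wvt = np.linalg.svd(U0.T @ U_hat)
W = Wu @ Wvt
diff = U_hat - U0 @ W

two_to_inf = float(np.max(np.linalg.norm(diff, axis=1)))  # max row norm
frob = float(np.linalg.norm(diff))
```

Since the two-to-infinity norm is a maximum of (nearly independent) row errors, extreme value theory rather than a central limit theorem governs its fluctuations, and it reacts to alternatives concentrated in a few rows that the Frobenius norm averages away.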
15.15 - 15.30
15.30 - 16.15
Minority representation and fairness in network ranking
Considerations of bias, fairness and representation are a prerequisite of responsible modern statistics. In statistical network analysis, observed networks are often incomplete or systematically biased, which can lead to systematic underrepresentation of protected groups, and affect any downstream ranking or decision based on the observed network. In this talk, I introduce a formal measure of minority representation, defined as the proportion of minority nodes among the top-ranked individuals. We model systematic bias through group-dependent missing edge mechanisms and develop statistical methods to estimate and test for such bias. When bias is detected, we propose a re-ranking procedure based on an asymptotic approximation that improves group representation. Applying the framework to the high school contact network reveals systematic underreporting of cross-group contacts consistent with recall bias. These findings highlight the importance of modeling and correcting systematic bias in social networks with heterogeneous groups.
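The representation measure itself is simple to state in code: the proportion of minority nodes among the top-k of a ranking. A toy simulation with group-dependent score attenuation (a crude stand-in for the group-dependent missing-edge mechanism in the talk) shows how observed bias depresses it.

```python
import numpy as np

def minority_representation(scores, is_minority, k):
    """Proportion of minority nodes among the top-k ranked by score."""
    top = np.argsort(scores)[::-1][:k]
    return float(np.mean(is_minority[top]))

rng = np.random.default_rng(5)
n = 1000
is_minority = rng.random(n) < 0.2              # ~20% minority group
true_score = rng.standard_normal(n)            # scores independent of group

# Systematic bias: observed scores of minority nodes are attenuated,
# mimicking under-reported cross-group edges in an observed network.
observed = true_score - 1.0 * is_minority

rep_true = minority_representation(true_score, is_minority, k=100)
rep_obs = minority_representation(observed, is_minority, k=100)
```

With unbiased scores the top-100 minority share hovers near the population share of 20%; under the biased observation it drops sharply, which is the gap a re-ranking procedure would aim to close.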
16.15 - 17.00
Testing for repeated motifs and hierarchical structure in stochastic block models
The rise in complexity of network data in neuroscience, social networks, and protein-protein interaction networks has been accompanied by several efforts to model and understand these data at different scales. A key multiscale network modeling technique posits a hierarchical structure in the network, and by treating networks as multiple levels of subdivisions with shared statistical properties, smaller subgraph primitives can be efficiently discovered with manageable complexity. One such example of hierarchical modeling is the hierarchical stochastic block model, which seeks to model complex networks as being composed of community structures repeated across the network. Incorporating repeated structure allows for parameter tying across communities in the SBM, reducing the model complexity compared to the traditional block model. A framework is formulated for testing for repeated-motif hierarchical structure in the stochastic block model setting. A model is described that naturally expresses networks as a hierarchy of sub-networks with a set of motifs repeating across them, and the practical utility of the test is demonstrated through theoretical analysis, extensive simulation, and real data.
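The repeated-motif idea and the parameter tying it enables can be sketched as follows: a single 2x2 motif block matrix is tiled along the diagonal of a larger SBM, and estimation averages the empirical within-community blocks into one motif estimate. The sizes and probabilities are illustrative, and this sketch stops short of the hypothesis test developed in the talk.

```python
import numpy as np

rng = np.random.default_rng(6)

# One shared 2x2 motif repeated within each of 3 top-level communities,
# with a sparse background rate between communities. Parameter tying:
# the within-community block matrix is identical everywhere.
B_motif = np.array([[0.6, 0.2], [0.2, 0.4]])
p_between = 0.02
n_comm, n_sub = 3, 50                        # communities, nodes per sub-block

block = np.kron(np.eye(n_comm), B_motif)     # repeat the motif on the diagonal
block[block == 0] = p_between                # background between communities

labels = np.repeat(np.arange(2 * n_comm), n_sub)
P = block[labels][:, labels]                 # full edge-probability matrix
A = (rng.random(P.shape) < P).astype(int)
A = np.triu(A, 1); A = A + A.T               # simple undirected graph

# Tied estimation: average the empirical within-community blocks into a
# single motif estimate instead of fitting each community separately.
B_hat = np.zeros((2, 2))
for c in range(n_comm):
    idx = [np.flatnonzero(labels == 2 * c + t) for t in (0, 1)]
    for a in (0, 1):
        for b in (0, 1):
            blk = A[np.ix_(idx[a], idx[b])]
            m = blk[np.triu_indices(n_sub, 1)].mean() if a == b else blk.mean()
            B_hat[a, b] += m / n_comm
```

Tying shrinks the parameter count from one block matrix per community to a single shared motif; testing whether the data actually support that tying is the subject of the talk.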
17.00 - 17.15
Francesco Sanna Passino, Patrick Rubin-Delanchy, Nick Heard