The workshop will be held at Imperial College London, at 170 Queen's Gate. All talks will take place in the Council Room, while a light breakfast, coffee breaks and lunch will be served in the Drawing Room at the same venue.
The workshop will be in-person only. Talks will not be recorded or broadcast live.
On the second day of the workshop, some of the talks will be focused on applications in cybersecurity and AI safety.
8.30 - 9.10
9.10 - 9.15
Francesco Sanna Passino, Patrick Rubin-Delanchy, Nick Heard
9.15 - 10.00
Scalable community detection in massive networks via predictive assignment
Massive network datasets are becoming increasingly common in scientific applications. Existing community detection methods encounter significant computational challenges for such massive networks for two reasons. First, the full network needs to be stored and analyzed on a single server, leading to high memory costs. Second, existing methods typically use matrix factorization or iterative optimization on the full network, resulting in high runtimes. We propose a strategy called predictive assignment to enable computationally efficient community detection while ensuring statistical accuracy. The core idea is to avoid large-scale matrix computations by breaking up the task into a smaller matrix computation plus a large number of vector computations that can be carried out in parallel. Under the proposed method, community detection is carried out on a small subgraph to estimate the relevant model parameters. Next, each remaining node is assigned to a community based on these estimates. We prove that predictive assignment achieves strong consistency under the stochastic block model and its degree-corrected version. We also demonstrate the empirical performance of predictive assignment on simulated networks and two large real-world datasets: DBLP (Digital Bibliography & Library Project), a computer science bibliographical database, and the Twitch Gamers Social Network. Joint work with Subhankar Bhadra and Srijan Sengupta.
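The two-stage strategy in the abstract — estimate on a small subgraph, then assign each remaining node with a single cheap vector computation — can be sketched in a few lines. This is a toy illustration under a stochastic block model, not the authors' implementation; all function names and modelling choices here are ours.

```python
# Toy sketch of predictive assignment under a stochastic block model:
# estimate communities and block probabilities on a small subgraph,
# then assign each remaining node via one cheap vector computation.
# Function names and details are illustrative, not the authors' code.
import numpy as np

def sample_sbm(z, B, rng):
    """Sample a symmetric adjacency matrix from an SBM."""
    P = B[np.ix_(z, z)]
    A = (rng.random(P.shape) < P).astype(float)
    A = np.triu(A, 1)
    return A + A.T

def fit_subgraph(A_sub, k):
    """Spectral embedding of the subgraph, naive k-means, and
    plug-in estimates of the block connection probabilities."""
    vals, vecs = np.linalg.eigh(A_sub)
    X = vecs[:, np.argsort(-np.abs(vals))[:k]]
    # naive Lloyd iterations started from k evenly spaced rows
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(25):
        z = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (z == c).any():
                centers[c] = X[z == c].mean(0)
    B_hat = np.array([[A_sub[np.ix_(z == a, z == b)].mean()
                       for b in range(k)] for a in range(k)])
    return z, B_hat

def predictive_assign(edges_to_sub, z_sub, B_hat):
    """Assign one held-out node by the Bernoulli log-likelihood of
    its edges to the subsample: a single vector computation."""
    eps = 1e-9
    k = B_hat.shape[0]
    ll = np.empty(k)
    for c in range(k):
        p = np.clip(B_hat[c, z_sub], eps, 1 - eps)
        ll[c] = (edges_to_sub * np.log(p)
                 + (1 - edges_to_sub) * np.log(1 - p)).sum()
    return int(np.argmax(ll))
```

Since each held-out node only touches the estimated probability vectors, the assignment step parallelises trivially across the remaining nodes, which is the source of the computational savings described above.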
10.00 - 10.45
Performance Guarantees for LLMs via Data Kernel Perspective Space
Performance guarantees for LLMs are essential to myriad critical applications - indeed, to any application for which quantifiable confidence in LLM performance is non-negotiable. The Data Kernel Perspective Space (DKPS) provides a framework to generate empirically validated theoretical predictions, allowing guaranteed performance quantification for these transformative models. I will present the probability, statistics and linear algebra for DKPS, together with illustrative examples. The theory is analogous to that introduced for network time series in Avanti Athreya, Zachary Lubberts, Youngser Park, CEP, “Euclidean mirrors and dynamics in network time series”, Journal of the American Statistical Association (JASA), 2025. Spirited discussion will ensue...
10.45 - 11.00
11.00 - 11.45
Statistically and Computationally Optimal Estimation and Inference of Common Subspaces
Given multiple data matrices, many problems in statistics and data science rely on estimating a common subspace that captures certain structure shared by all the data matrices. In this talk, we investigate the statistical and computational limits for the common subspace model in which one observes a collection of symmetric low-rank matrices perturbed by noise, where each low-rank matrix shares the same common subspace. Our main results identify several regimes of the signal-to-noise ratio (SNR) such that estimation and inference are statistically or computationally optimal, and we refer to these regimes as weak SNR, moderate SNR, strong estimation SNR, and strong inference SNR. First, we propose an estimator based on projected gradient descent initialized via spectral sum of squares and show that it achieves the optimal sin(Θ) error rate under strong inference SNR, further attaining the asymptotically optimal constant under Gaussian noise. These results are complemented by both statistical and computational lower bounds identifying the weak and moderate SNR regimes. Next, we turn to statistical inference for the sin(Θ) distance itself, and we show that our estimator has an asymptotically Gaussian distribution under strong inference SNR. Based on this limiting result, we propose confidence intervals and show that they are minimax optimal in all regimes, though the resulting confidence intervals require knowledge of the SNR. However, we then show that SNR-adaptive confidence intervals are information-theoretically impossible below the strong inference SNR. Consequently, our results unveil a novel phenomenon: despite the SNR being "above" the computational limit for estimation, adaptive statistical inference may still be information-theoretically impossible.
11.45 - 12.30
Compressed Bayesian Tensor Regression
To address the common problem of high dimensionality in tensor regressions, we introduce a generalized tensor random projection method that embeds high-dimensional tensor-valued covariates into low-dimensional subspaces with minimal loss of information about the responses. The method is flexible, allowing for tensor-wise, mode-wise, or combined random projections as special cases. A Bayesian inference framework is provided featuring the use of a hierarchical prior distribution and a low-rank representation of the parameter. Strong theoretical support is provided for the concentration properties of the random projection and posterior consistency of the Bayesian inference. An efficient Gibbs sampler is developed to perform inference on the compressed data. To mitigate the sensitivity introduced by random projections, Bayesian model averaging is employed, with normalising constants estimated using reverse logistic regression. An extensive simulation study is conducted to examine the effects of different tuning parameters. Simulations indicate, and the real data application confirms, that compressed Bayesian tensor regression can achieve better out-of-sample prediction while significantly reducing computational cost compared to standard Bayesian tensor regression. Joint work with R. Craiu and Q. Wang.
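As one concrete instance of the projections mentioned in the abstract, a mode-wise random projection of a 3-way covariate tensor compresses each mode with an independent random matrix. The Gaussian maps below are a common generic choice and purely our illustration; the paper's generalized projection and Bayesian inference framework are not reproduced here.

```python
# Sketch of a mode-wise random projection of a 3-way tensor: each
# mode is compressed by an independent Gaussian matrix, scaled so the
# projection is near-isometric on average. Illustrative only.
import numpy as np

def mode_wise_project(X, rng, dims_out):
    """Compress each mode of a 3-way tensor with a random Gaussian map."""
    Gs = [rng.standard_normal((m, d)) / np.sqrt(m)
          for m, d in zip(dims_out, X.shape)]
    # contract every mode of X with its projection matrix
    return np.einsum('ai,bj,ck,ijk->abc', Gs[0], Gs[1], Gs[2], X)
```

Because the contraction is linear in X, the projected covariates inherit linear structure from the originals, which is what makes regression on the compressed data meaningful.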
12.30 - 13.45
13.45 - 14.30
Common Drivers in Sparsely Interacting Multiple Hawkes Processes
We study actors in a network who can cast instantaneous events, e.g., posts on a social media platform. These events are modelled as a multivariate Hawkes process allowing for mutual excitation, that is, events cast by one actor are allowed to increase the activity of other actors. Such influences can, in turn, be interpreted as edges of a network. Our primary target is estimation of this influence network. However, we believe that a second relevant driver of such events is globally available covariates, e.g., the time of day, or information about the actors. We therefore study a model that also includes such common drivers. We derive rates of convergence for all of the model parameters when both the number of actors and the time horizon tend to infinity. To prevent an exploding network, sparseness is assumed, and enforced by a LASSO-type estimator. In order to allow for inference, we study a de-biasing procedure. We will also discuss numerical aspects, including an extension to multiple Hawkes processes, that is, the case where several classes of interactions are observed that follow the same dynamics. To conduct inference on the network, we use a permutation-type test. This is joint work with Enno Mammen, Eckehard Olbrich, and Wolfgang Polonik.
14.30 - 15.15
Generalized Grade-of-Membership Estimation for High-dimensional Locally Dependent Data
This work focuses on mixed membership models for multivariate categorical data, widely used for analyzing survey responses and population genetics data. These grade of membership (GoM) models offer rich modeling power but present significant estimation challenges for high-dimensional polytomous data. Popular existing approaches, such as Bayesian MCMC inference, are not scalable and lack theoretical guarantees in high-dimensional settings. To address this, we first observe that data from this model can be reformulated as a three-way (quasi-)tensor, with many subjects responding to many items with varying numbers of categories. We introduce a novel and simple approach that flattens the three-way quasi-tensor into a "fat" matrix, and then performs a singular value decomposition of it to estimate parameters by exploiting the singular subspace geometry. Our fast spectral method can accommodate a broad range of data distributions with arbitrarily locally dependent noise, which we formalize as the generalized-GoM models. We establish finite-sample entrywise error bounds for the generalized-GoM model parameters. This is supported by a new sharp two-to-infinity singular subspace perturbation theory for locally dependent and flexibly distributed noise, a contribution of independent interest. Simulations and applications to data in political surveys, population genetics, and single-cell sequencing demonstrate our method's superior performance.
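The flattening step described in the abstract can be sketched directly: one-hot encode each categorical item, concatenate the encodings into a "fat" subjects-by-(total categories) matrix, and extract its leading singular subspace. A minimal illustration under our own naming; the geometry-based parameter estimation is not reproduced.

```python
# Flatten a subjects x items array of categorical responses (item j
# taking values in {0, ..., C_j - 1}) into a one-hot "fat" matrix,
# then take its top-k left singular subspace. Illustrative only.
import numpy as np

def flatten_to_fat_matrix(R, n_cats):
    """One-hot encode each item and concatenate the blocks columnwise."""
    n, J = R.shape
    blocks = []
    for j in range(J):
        block = np.zeros((n, n_cats[j]))
        block[np.arange(n), R[:, j]] = 1.0
        blocks.append(block)
    return np.hstack(blocks)

def top_singular_subspace(M, k):
    """Leading k left singular vectors of the fat matrix."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k]
```

Each row of the fat matrix sums to the number of items, and the spectral estimate requires only one SVD of this matrix, which is the source of the method's scalability.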
15.15 - 15.30
15.30 - 16.15
Hierarchical clustering with dot products
We offer a new perspective on the well-established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides an estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing data dimension. Joint work with Annie Gray, Alex Modell and Patrick Rubin-Delanchy.
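The recommended variant is easy to state in code: standard agglomerative clustering, except that at each step the pair of clusters with the largest average pairwise dot product is merged. By bilinearity, that average equals the dot product of the cluster means, which the deliberately unoptimised sketch below exploits; this is our illustration, not the authors' implementation.

```python
# Agglomerative clustering where clusters are merged by maximum
# average dot product (not minimum distance). The average pairwise
# dot product between two clusters equals the dot product of their
# mean vectors, by bilinearity. Unoptimised, illustrative sketch.
import numpy as np

def dot_product_agglomerate(X):
    """Return the merge sequence (a dendrogram) for the rows of X."""
    clusters = {i: [i] for i in range(len(X))}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best, best_pair = -np.inf, None
        for a in range(len(keys)):
            for b in range(a + 1, len(keys)):
                ia, ib = clusters[keys[a]], clusters[keys[b]]
                score = X[ia].mean(0) @ X[ib].mean(0)
                if score > best:
                    best, best_pair = score, (keys[a], keys[b])
        a, b = best_pair
        merges.append((clusters[a][:], clusters[b][:]))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges
```

On toy data with two tight directional groups, the first merges combine points within each group before the two groups are joined at the root.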
16.15 - 17.00
Efficient Analysis of Latent Spaces in Heterogeneous Networks
In this talk, we introduce a unified framework for efficient estimation under latent space modeling of heterogeneous networks. We consider a class of latent space models that decompose latent vectors into shared and network-specific components across networks. We develop a novel procedure that first identifies the shared latent vectors and further refines estimates through efficient score equations to achieve statistical efficiency. Oracle error rates for estimating the shared and heterogeneous latent vectors are established simultaneously. The analysis framework offers remarkable flexibility, accommodating various types of edge weights under general distributions.
17.00 - 17.45
Community detection via curvature gaps
We consider clustering in stochastic blockmodel graphs from the perspective of Ollivier's Ricci Curvature, an extension of Ricci Curvature on manifolds to this discrete setting. The gap between the distributions of edge curvatures for within-cluster edges and between-cluster edges allows us to identify these two groups of edges by their curvature, guaranteeing effective clustering. This curvature gap is studied under multiple signal strength regimes, identifying the limiting distributions and exploring the limits of curvature-based clustering. These distributional limits for edge curvatures are the first of their kind in the literature, and show that curvature-based clustering can be an effective competitor to traditional clustering methods, even in low signal strength settings.
18.30 - 20.30
8.30 - 9.15
9.15 - 10.00
Security of text embeddings
Vector databases store collections of high-dimensional data generated from text corpora by embedding models. As these models are increasingly used in applications such as semantic search, translation, and retrieval-augmented generation (RAG), understanding whether private data can be recovered from them has become critical. It was once assumed that text encoding inherently obscured the original content; however, recent work has shown that vector databases may be vulnerable to inversion attacks, in which adversaries can partially reconstruct the original text from embeddings. At the same time, as embedding models evolve from being trained for specific tasks toward more general-purpose or universal embeddings, researchers have observed that many models appear to capture similar geometric structures. This observation has led some to propose the Platonic Representation Hypothesis: the idea that embeddings may be converging toward a universal representation of language. If true, this convergence would have profound implications for the security of vector databases. In this talk, I will introduce the emerging field of embedding security, present evidence for a universal geometry of language, and discuss its consequences for data privacy.
10.00 - 10.45
Data Science for Cyber Security and National Resilience
Cyber security offers rich opportunities for statistics and machine learning to deliver real-world impact, but there are also significant challenges. This talk will explore how we apply statistical analysis and ML/AI to diverse cyber data, from DMARC and DNS records, attack surface scans and host-based telemetry, to penetration test and threat intelligence reports. Our work spans modelling networks of organisations and supply chains, inferring inter-organisational links from DNS flows, and classifying domains and web content by maliciousness or sector. We will also review wider trends and challenges in cyber security, aiming to spark ideas for novel statistical and ML applications in this critical domain.
10.45 - 11.00
11.00 - 11.45
Constructing Directed Graphs for Multivariate Point Processes: Applications in Cyber Security
Security Operations Centers (SOCs) face a critical operational challenge where analysts must triage high volumes of security alerts daily from noisy detection systems. These detectors are configured for broad coverage, resulting in high rates of benign alerts that obscure genuine threats and create severe alert fatigue. This work introduces an approach for characterising normal detector behaviour in complex security environments to enable identification of anomalous and potentially malicious patterns. We present a graph-based framework that models relationships between security detectors, capturing how they typically activate during benign operations. By establishing baseline interaction patterns, deviations such as unusual alert sequences or isolated activations of typically correlated detectors can be identified. We also consider practical deployment requirements including interpretable results, behavioural evidence for investigation, and modular integration with existing infrastructure. A case study from a production security environment demonstrates the practical utility of this approach in distinguishing signal from noise and prioritising analyst investigation efforts toward high-confidence threats.
11.45 - 12.30
Graph Neural Networks to Learn Graph Representations
Graph Neural Networks (GNNs) have celebrated many academic and industrial successes in recent years, providing rich ground for theoretical analysis and achieving state-of-the-art results in several learning tasks. In this talk, I will give an accessible introduction to the area of Graph Representation Learning with a focus on GNNs. I will then highlight two projects in which we advanced our understanding of the informative graph representations obtained from GNNs. Firstly, we explore the question of how to optimally represent graphs by learning a parametrised matrix representation of graphs in GNNs (ICLR'21). Then, we will discuss more recent work (ICLR'25) in which we theoretically analysed how the inclusion of virtual nodes, i.e., newly introduced nodes that are connected to all existing nodes in an input graph, affects the graph representations learned by GNNs. I will conclude each project by briefly reviewing our ongoing follow-up work.
12.30 - 13.45
13.45 - 14.30
Efficient concentration inequalities via Gaussian approximation and universality of data augmentation
Abstract TBD
14.30 - 15.15
Extreme value theory for singular subspace estimation in the matrix denoising model
In this talk, we study fine-grained singular subspace estimation in the matrix denoising model where a deterministic low-rank signal matrix is additively perturbed by a stochastic matrix of Gaussian noise. We establish that the maximum Euclidean row norm (i.e., the two-to-infinity norm) of the aligned difference between the leading sample and population singular vectors approaches the Gumbel distribution in the large-matrix limit, under suitable signal-to-noise conditions and after appropriate centering and scaling. We apply our novel asymptotic distributional theory to test hypotheses of low-rank signal structure encoded in the leading singular vectors and their corresponding principal subspace. We provide de-biased estimators for the corresponding nuisance signal singular values and show that our proposed plug-in test statistic has desirable properties. Notably, compared to using the Frobenius norm subspace distance, our test statistic based on the two-to-infinity norm has higher power to detect structured alternatives that differ from the null in only a few matrix entries or rows. Our main results are obtained by a novel synthesis of and technical analysis involving entrywise matrix perturbation analysis, extreme value theory, saddle point approximation methods, and random matrix theory. Our contributions complement the existing literature for matrix denoising focused on minimaxity, mean squared error analysis, unitarily invariant distances between subspaces, component-wise asymptotic distributional theory, and row-wise uniform error bounds. Numerical simulations illustrate our main results and demonstrate the robustness properties of our testing procedure to non-Gaussian noise distributions.
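The central quantity in the abstract, the two-to-infinity norm of the aligned difference between sample and population singular subspaces, can be computed as below, using the standard orthogonal Procrustes solution for the alignment. This is a simulation aid under our own naming, not the paper's test statistic.

```python
# Two-to-infinity distance between two orthonormal bases U_hat and U:
# the maximum Euclidean row norm of U_hat @ W - U, where W is the
# orthogonal Procrustes alignment obtained from the SVD of U_hat^T U.
import numpy as np

def two_to_infinity_subspace_distance(U_hat, U):
    """max row norm of the optimally aligned difference U_hat @ W - U."""
    A, _, Bt = np.linalg.svd(U_hat.T @ U)
    W = A @ Bt                      # orthogonal Procrustes solution
    D = U_hat @ W - U
    return float(np.max(np.linalg.norm(D, axis=1)))
```

Because the alignment absorbs any orthogonal rotation of the basis, the distance is zero whenever the two bases span the same subspace, and it isolates the worst-affected row, which is what gives the statistic its power against alternatives concentrated on a few rows.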
15.15 - 15.30
15.30 - 16.15
Minority representation and fairness in network ranking
Considerations of bias, fairness and representation are a prerequisite of responsible modern statistics. In statistical network analysis, observed networks are often incomplete or systematically biased, which can lead to systematic underrepresentation of protected groups and affect any downstream ranking or decision based on the observed network. In this talk, I introduce a formal measure of minority representation, defined as the proportion of minority nodes among the top-ranked individuals. We model systematic bias through group-dependent missing-edge mechanisms and develop statistical methods to estimate and test for such bias. When bias is detected, we propose a re-ranking procedure based on an asymptotic approximation that improves group representation. Applying the framework to a high school contact network reveals systematic underreporting of cross-group contacts consistent with recall bias. These findings highlight the importance of modeling and correcting systematic bias in social networks with heterogeneous groups.
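The representation measure defined in the abstract is simple to state in code: the proportion of minority-group members among the top-k ranked nodes. The helper below is a toy of our own; the bias model and re-ranking procedure are not reproduced here.

```python
# Minority representation: the fraction of minority-group members
# among the k nodes with the highest ranking scores.
import numpy as np

def minority_representation(scores, is_minority, k):
    """Proportion of minority members among the k top-scored nodes."""
    top_k = np.argsort(-np.asarray(scores, dtype=float))[:k]
    return float(np.mean(np.asarray(is_minority, dtype=float)[top_k]))
```

Comparing this quantity before and after a correction for missing edges gives a direct, interpretable summary of how bias in the observed network propagates into the ranking.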
16.15 - 17.00
Detection of Model-based Planted Pseudo-cliques in Random Dot Product Graphs by the Adjacency Spectral Embedding and the Graph Encoder Embedding
We explore the capability of both the Adjacency Spectral Embedding (ASE) and the Graph Encoder Embedding (GEE) for capturing an embedded pseudo-clique structure in the random dot product graph setting. In both theory and experiments, we demonstrate that, in the absence of additional clean (i.e., without the implanted pseudo-clique) network data, this pairing of model and methods can yield worse results than the best existing spectral clique detection methods. However, these methods can be used to asymptotically localize the pseudo-cliques if additional clean, independent network data is provided. This demonstrates at once the methods' potential ability/inability to capture modestly sized pseudo-cliques and the methods' robustness to the model contamination giving rise to the pseudo-clique structure. To further enrich our analysis, we also consider the Variational Graph Auto-Encoder (VGAE) model in our simulation and real data experiments.
17.00 - 17.15
Francesco Sanna Passino, Patrick Rubin-Delanchy, Nick Heard