Matthieu Jonckheere
Distance learning using Euclidean percolation: Following Fermat's principle
In unsupervised statistical learning tasks such as clustering, recommendation, or dimension reduction, a notion of distance or similarity between points is crucial but usually not directly available as an input. We propose a new density-based estimator for weighted geodesic distances that takes into account the underlying density of the data and is suitable for nonuniform data lying on a manifold of lower dimension than the ambient space. The consistency of the estimator is proven using tools from first-passage percolation. We then illustrate its properties and implementation and evaluate its performance for clustering tasks on toy models. Finally, we discuss the choice of the (unique) critical parameter involved.
(Joint work with P. Groisman, University of Buenos Aires, and F. Sapienza, UC Berkeley, and more recently with F. Pascal and M. El Baz, CentraleSupélec.)
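For reference, a common form of the sample Fermat distance underlying this approach is sketched below (the exact normalization and consistency scaling follow the speaker's papers); here Q = {q_1, ..., q_n} denotes the data set and \alpha \ge 1 the tuning parameter.

\[
  D_{Q,\alpha}(x,y) \;=\; \inf \Bigl\{ \sum_{k=0}^{K-1} \| z_{k+1} - z_k \|^{\alpha} \;:\; z_0 = x,\ z_K = y,\ z_1,\dots,z_{K-1} \in Q \Bigr\},
\]

where \| \cdot \| is the Euclidean norm. For \alpha = 1 this reduces to the Euclidean distance, while for \alpha > 1 long hops are penalized, so optimal paths pass through regions where sample points are dense; \alpha plays the role of the (unique) critical parameter mentioned in the abstract.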
David E Tyler
Robustness properties of regularized M-estimates of covariance: A statistician's view
M-estimates of covariance matrices have received much attention and development in the area of signal processing over the past 20 years or so. Within the area of robust statistics, however, they have been out of favor due to their low breakdown point in higher dimensions. In this talk, the breakdown problem for these M-estimates and its relevance to applications in signal processing is first discussed. Within robust statistics, the recognition of the low breakdown point of the M-estimates of covariance led to the development of a variety of high breakdown point alternatives, which, like the M-estimates and the sample covariance matrix, are affine equivariant. Unlike the M-estimates, though, these affine equivariant high breakdown point estimators tend to be computationally intensive and are typically computed via approximate or probabilistic algorithms. To circumvent these shortcomings, covariance estimators that have high breakdown points but sacrifice affine equivariance have been proposed, with the spatial sign covariance matrix (SSCM) being perhaps the most popular. The SSCM is an appealing candidate because of its simplicity: not only does it have a high breakdown point, it is also computationally feasible regardless of dimension. Motivated by an understanding of the breakdown issue for the M-estimates, it is then shown how the M-estimates can be regularized so that the resulting estimate has a high breakdown point. These regularized or penalized M-estimates of covariance remain computationally feasible in high dimensions. It turns out that the SSCM itself can be viewed as an extreme element of this class of regularized estimators. Unlike the SSCM, though, which is based on Euclidean distances, the more general class of regularized estimators can take into account the shape of the contours of the data cloud when down-weighting observations, and can also be tuned for other statistical properties while maintaining a high breakdown point. Finally, other robustness properties of these penalized M-estimators, such as their influence functions and consistency issues, are discussed.
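For concreteness, the SSCM referred to above is simply (a sketch, assuming a known or separately estimated center m)

\[
  \widehat{S}_{\mathrm{SSCM}} \;=\; \frac{1}{n} \sum_{i=1}^{n} \frac{(x_i - m)(x_i - m)^{\top}}{\| x_i - m \|^{2}},
\]

and one common form of a regularized Tyler-type M-estimate of scatter, given here only as a representative of the penalized class and not necessarily the exact penalization of the talk, is the fixed point \Sigma of

\[
  \Sigma \;=\; (1-\beta)\, \frac{p}{n} \sum_{i=1}^{n} \frac{x_i x_i^{\top}}{x_i^{\top} \Sigma^{-1} x_i} \;+\; \beta\, I_p, \qquad 0 < \beta \le 1,
\]

for centered data in dimension p. Roughly speaking, as \beta approaches its extreme value the weights x_i^{\top} \Sigma^{-1} x_i are driven by the Euclidean norms \|x_i\|^2 alone and the data-dependent term becomes proportional to the SSCM, which is one way to see the connection stated above.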
Jasin Machkour
The Terminating-Knockoff Filter: Fast High-Dimensional Variable Selection with False Discovery Rate Control
In many real-world applications, only a few variables among tens of thousands, or even a few million, are truly associated with the response. For example, in genome-wide association studies (GWAS), only a few common genetic variations called single nucleotide polymorphisms (SNPs) are associated with a phenotype of interest. In this application, a low false discovery rate (FDR) is desired for two main reasons. First, for every potentially discovered influential SNP, further extensive and expensive functional genomics studies and molecular biological laboratory experiments are needed to investigate whether the reported genetic associations are causal and to understand their functional implications. Second, the discoveries in a GWAS need to be reproducible in multiple studies conducted for other cohorts and/or populations before proceeding towards functional studies and laboratory experiments. This talk introduces the Terminating-Knockoff (T-Knock) filter, a fast variable selection method for high-dimensional data. The T-Knock filter provably controls a user-defined target FDR while maximizing the number of selected true positives. This is achieved by fusing the solutions of multiple efficiently terminated random experiments conducted on a combination of the original data and multiple sets of randomly generated knockoff variables. Numerical simulations show that the FDR is controlled at the target level while allowing for high power. We prove under mild conditions that the knockoffs for the T-Knock filter can easily be sampled from any univariate distribution, e.g., the univariate standard normal distribution. The combination of early terminated random experiments and a simple knockoff generation process drastically reduces the computation time. The computational complexity of the proposed method is multiple orders of magnitude lower than that of its competitors in sparse high-dimensional settings. Additionally, a massive speedup is achieved by executing the independent random experiments in parallel on multicore computers or high-performance clusters. The T-Knock filter outperforms state-of-the-art methods for FDR control on a simulated GWAS, while its computation time is more than two orders of magnitude lower than that of its strongest competitors.
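For reference, the false discovery rate controlled by the T-Knock filter is the standard quantity

\[
  \mathrm{FDR} \;=\; \mathrm{E}\!\left[ \frac{V}{\max(R, 1)} \right],
\]

where R is the total number of selected variables and V the number of selected null variables, i.e., variables with no true association with the response. The guarantee is that this expectation stays below the user-defined target level, while the fusion of the terminated random experiments is designed to keep the number of true positives as large as possible.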
Arnaud Breloy
Riemannian geometry in elliptical distributions
When dealing with an estimation problem, a Riemannian manifold naturally appears by endowing the parameter space with the Fisher information metric. Interestingly, this point of view allows leveraging many tools from differential geometry. After a brief introduction to these concepts, we will present some recent works on the geometry of elliptical distributions and their practical implications. The exposition will be divided into three main axes: Riemannian optimization, intrinsic Cramér-Rao bounds, and classification/clustering using Riemannian distances.
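As a concrete instance of the third axis, in the centered Gaussian case (the simplest member of the elliptical family; general elliptical models reweight the metric and add an extra trace term), the Fisher information metric on symmetric positive definite covariance matrices \Sigma is, up to scaling, the affine-invariant metric, whose Riemannian distance has a closed form:

\[
  \langle \xi, \eta \rangle_{\Sigma} \;=\; \mathrm{tr}\bigl( \Sigma^{-1} \xi\, \Sigma^{-1} \eta \bigr),
  \qquad
  d(\Sigma_1, \Sigma_2) \;=\; \bigl\| \log\bigl( \Sigma_1^{-1/2}\, \Sigma_2\, \Sigma_1^{-1/2} \bigr) \bigr\|_{F},
\]

where \log is the matrix logarithm and \|\cdot\|_F the Frobenius norm; distances of this kind are what the classification/clustering axis builds on.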