Research

Our group focuses on fundamental research in statistics and machine learning.

Recent publications

Cappello, L., Madrid Padilla, O. H., Palacios, J. A. (2023). Bayesian Change Point Detection with Spike-and-Slab Priors. Journal of Computational and Graphical Statistics, 1-13. arXiv:2106.10383
Carter JS, Rossell D, Smith JQ. Partial correlation graphical LASSO (2023). Scandinavian Journal of Statistics. Open access version
G. Mesters, P. Zwiernik, Non-independent component analysis (2024+). To appear in Annals of Statistics. arXiv:2206.13668
Jewson J, Rossell D. Loss function selection and the use of improper models. Journal of the Royal Statistical Society B 2022 84, 1640-1665. Online version
Rossell D. Concentration of posterior probabilities and normalized L0 criteria (2022). Bayesian Analysis, 17, 2, 565-591. Open access version
Cappello, L., Palacios, J. A., Adaptive Preferential Sampling in Phylodynamics. Journal of Computational and Graphical Statistics, 31(2): 541-552, 2022. Open access version
Cappello, L., Kim, J., Liu, S., Palacios, J. A., Statistical Challenges in Tracking the Evolution of SARS-CoV-2. Statistical Science, 37(2): 162-182, 2022. Open access version
Semken C, Rossell D. Specification analysis for technology use and teenager well-being. Statistical validity and a Bayesian proposal (2022). Journal of the Royal Statistical Society C
Avalos-Pacheco A., Rossell D., Savage R (2022). Heterogeneous large datasets integration using Bayesian factor regression. Bayesian Analysis. 17(1): 33-66. arXiv:1810.09894

Selected publications in theory/methodology (the last 10 years)

2023

F. Röttger, S. Engelke, P. Zwiernik, Total positivity in multivariate extremes. Annals of Statistics 2023, Vol. 51, No. 3, 962-1004.

2022

Jewson J, Rossell D. Loss function selection and the use of improper models. Journal of the Royal Statistical Society B 2022 84, 1640-1665. Online version
S. Lauritzen, P. Zwiernik. Locally associated graphical models and mixed convex exponential families. Annals of Statistics 2022, Vol. 50, No. 5, 962-1004.
G. Lugosi and S. Mendelson. Multivariate mean estimation with direction-dependent accuracy. Journal of the European Mathematical Society, 2022.
L. Addario-Berry, L. Devroye, G. Lugosi, and V. Velona. Broadcasting on random recursive trees. Annals of Applied Probability, 32(1):497-528, 2022.
Rossell D. Concentration of posterior probabilities and normalized L0 criteria (2022). Bayesian Analysis, 17, 2, 565-591. Open access version

2021

M. Greenacre. Compositional Data Analysis. Annual Review of Statistics and Its Application, to appear, 2021.
G. Lugosi, J. Truszkowski, V. Velona, and P. Zwiernik. Learning partial correlation graphs and graphical models by covariance queries. Journal of Machine Learning Research, 22(203):1-41, 2021.
Rossell D, Abril O, Bhattacharya A. Approximate Laplace approximations for scalable model selection (2021). Journal of the Royal Statistical Society B, 83, 4, 853-879. Online version (open access)
G. Lugosi, and S. Mendelson. Robust multivariate mean estimation: the optimality of trimmed mean. Annals of Statistics, 2021.
S. Lauritzen, C. Uhler and P. Zwiernik, Total positivity in exponential families with application to binary variables. Annals of Statistics, 2021, Vol. 49, No. 3, 1436-1459.
Rossell D, Rubio FJ. Additive Bayesian variable selection under censoring and misspecification (2021). Statistical Science, 38, 1, 13-29. Open access
Rossell D, Zwiernik P. Dependence in elliptical partial correlation graphs (2021). Electronic Journal of Statistics, 15, 2, 4236-4263. Open access version

2020

C. Bordenave, G. Lugosi, and N. Zhivotovskiy. Noise sensitivity of the top eigenvector of a Wigner matrix. Probability Theory and Related Fields, 2020.
G. Lugosi, and S. Mendelson. Risk minimization by median-of-means tournaments. Journal of the European Mathematical Society, 2020.
P.L. Bartlett, P.M. Long, G. Lugosi, and A. Tsigler. Benign overfitting in linear regression. PNAS, 117.48 (2020): 30063-30070.
A. Corral, F. Udina and E. Arcaute, Truncated lognormal distributions and scaling in the size of naturally defined population clusters. Physical Review E, 2020, 101, No. 4.

2019

G. Lugosi, and S. Mendelson, Near-optimal mean estimators with respect to general norms. Probability Theory and Related Fields, 2019.
J. Fúquene, M.F.J. Steel, and D. Rossell, On choosing mixture components via non-local priors. Journal of the Royal Statistical Society B, 2019, 81, 5, 809-837.
S. Lauritzen, C. Uhler, and P. Zwiernik, Maximum likelihood estimation in Gaussian models under total positivity. Annals of Statistics, 2019, Vol. 47, No. 4, 1835-1863.
Sub-Gaussian estimators of the mean of a random vector by G. Lugosi, and S. Mendelson. Annals of Statistics, 2019, Vol. 47, No. 2, pp 783-794.

2018

Variable selection in compositional data analysis using pairwise logratios. M. Greenacre. Mathematical Geosciences, 2018, 1-34. doi: 10.1007/s11004-018-9754-x
Tractable Bayesian variable selection: beyond normality by D. Rossell and F.J. Rubio. Journal of the American Statistical Association, 2018, pp 1-17.

2017

Nonlocal priors for high-dimensional estimation by D. Rossell and D. Telesca. Journal of the American Statistical Association, 2017, 112.517, pp 254-265.
Maximum likelihood estimation for linear Gaussian covariance models by P. Zwiernik, C. Uhler, and D. Richards. Journal of the Royal Statistical Society: Series B, 79(4), 2017, 1269–1292.
S. Fallat, S. Lauritzen, K. Sadeghi, C. Uhler, N. Wermuth, and P. Zwiernik, Total positivity in Markov structures. Annals of Statistics 2017, Vol. 45, No. 3, 1152-1184.
"Size" and "shape" in the measurement of multivariate proximity by M. Greenacre. Methods in Ecology and Evolution 2017, 8:1415-1424. doi: 10.1111/2041-210X.12776 with video abstract.

2016

Set estimation from reflected Brownian motion by A. Cholaquidis, R. Fraiman, G. Lugosi, and B. Pateiro-López. Journal of the Royal Statistical Society: Series B, 2016, 78:1057–1078.
Sub-Gaussian mean estimators by L. Devroye, M. Lerasle, G. Lugosi, and R. Imbuzeiro Oliveira. Annals of Statistics, 2016, 44:2695-2725.
Almost optimal sparsification of random geometric graphs by N. Broutin, L. Devroye, and G. Lugosi, Annals of Applied Probability, 2016, 26:5, 3078-3109.
Weighted Euclidean biplots by M. Greenacre and P. Groenen. Journal of Classification, 2016, 33:442-459.
On probability laws of solutions of differential systems driven by fractional Brownian motion by F. Baudoin, E. Nualart, C. Ouyang, and S. Tindel, Annals of Probability, 2016, 44, pp 2554-2590.
Exponential varieties by M. Michałek, B. Sturmfels, C. Uhler, and P. Zwiernik, Proceedings of the London Mathematical Society (3) 112 (2016), no. 1, 27–56.

2015

Empirical risk minimization for heavy-tailed losses by C. Brownlees, E. Joly and G. Lugosi, Annals of Statistics, 2015, 43(6), 2507-2536.

Selected publications in applications (the last 10 years)

Parikh, V., Ioannidis, ... Cappello, L., ..., Rivas, M., Ashley, E. (2022) Deconvoluting complex correlates of COVID19 severity with a multi-omic pandemic tracking strategy. Nature Communications, 13, 5107
L. Beauchemin, M. Slifker, D. Rossell, and J. Font-Burgada (2020). Characterizing MHC-I genotype predictive power for oncogenic mutation probability in cancer patients. Immunoinformatics, Methods and Protocols. Springer.
Graeve M, Greenacre M. (2020). The selection and analysis of fatty acid ratios: A new approach for the univariate and multivariate analysis of fatty acid trophic markers in marine pelagic organisms. Limnology and Oceanography: Methods, 18, 196-210. doi: 10.1002/lom3.10360 with video abstract
Greenacre M (2020). Amalgamations are valid in compositional data analysis, can be used in agglomerative clustering, and their logratios have an inverse transformation. Applied Computing and Geosciences, 5, doi: 10.1016/j.acags.2019.100017
Gavard R, Jones H, Palacio Lozano D, Thomas M, Rossell D, Spencer S, Barrow M (2020). KairosMS: A new solution for the processing of hyphenated ultrahigh resolution mass spectrometry data. Analytical Chemistry, 92(5):3775-86
Gavard R, Palacio Lozano D, Guzman A, Rossell D, Spencer S, Barrow M (2019). Rhapso: Automatic stitching of mass segments from Fourier transform ion cyclotron resonance mass spectra. Analytical Chemistry, 91:15130-37
Greenacre M (2019). Use of correspondence analysis in clustering a mixed-scale data set with missing data. Archives of Data Science, doi: 10.5445/KSP/1000085952/04
Korneliussen T, Greenacre M (2018). Information sources used by European tourists: a cross-cultural study. Journal of Travel Research, 57, 193-205.
Greenacre M (2017). Ordination with any dissimilarity measure: a weighted Euclidean solution. Ecology, 98:2293-2300.
Marty R, Kaabinejadian S, van de Haar J, Rossell D, Ideker T, Hildebrand W, Engin HB, Font-Burgada J, Carter H. (2017) MHC-I genotype restricts the oncogenic mutational landscape. Cell, 171, 1272-1283
Greenacre M (2016). Data reporting and visualization in ecology. Polar Biology, 39:2189-2205.
Font-Burgada J, Shalapour S, Ramaswamy S, Hsueh B, Rossell D, Umemura A, Taniguchi K, Nakagawa H, Valasek MA, Ye L, Kopp JL, Sander M, Carter H, Deisseroth K, Verma IM, Karin M. (2015) Hybrid Periportal Hepatocytes Regenerate the Injured Liver without Giving Rise to Cancer. Cell, 162(4):766-79.
Calon A, Lonardo E, Berenguer A, Espinet E, Hernando-Momblona X, Iglesias M, Sevillano M, Palomo-Ponce S, Tauriello DVF, Byrom D, Cortina C, Morral C, Barceló C, Tosi S, Riera A, Stephan-Otto Attolini C, Rossell D, Sancho E, Batlle E. (2015) Stromal gene expression defines poor prognosis subtypes in colorectal cancer. Nature Genetics, 47, 320-329. doi:10.1038/ng.3225

Books

D. Nualart and E. Nualart, Introduction to Malliavin Calculus, IMS Textbooks, Cambridge University Press, 2018.

This textbook offers a compact introductory course on Malliavin calculus, an active and powerful area of research. It covers recent applications, including density formulas, regularity of probability laws, central and non-central limit theorems for Gaussian functionals, convergence of densities and non-central limit theorems for the local time of Brownian motion. The book also includes a self-contained presentation of Brownian motion and stochastic calculus, as well as Lévy processes and stochastic calculus for jump processes. Accessible to non-experts, the book can be used by graduate students and researchers to develop their mastery of the core techniques necessary for further study.

M. Greenacre. Compositional Data Analysis in Practice. Chapman & Hall, 2018.

Compositional Data Analysis in Practice is a user-oriented practical guide to the analysis of data with the property of a constant sum, for example percentages adding up to 100%. Compositional data can give misleading results if regular statistical methods are applied, and are best analysed by first transforming them to logarithms of ratios. This book explains how this transformation affects the analysis, results and interpretation of this very special type of data. All aspects of compositional data analysis are considered: visualization, modelling, dimension-reduction, clustering and variable selection, with many examples in the fields of food science, archaeology, sociology and biochemistry, and a final chapter containing a complete case study using fatty acid compositions in ecology. The applicability of these methods extends to other fields such as linguistics, geochemistry, marketing, economics and finance.

P. Zwiernik. Semialgebraic Statistics and Latent Tree Models. Chapman & Hall, 2017.

Semialgebraic Statistics and Latent Tree Models explains how to analyze statistical models with hidden (latent) variables. It takes a systematic, geometric approach to studying the semialgebraic structure of latent tree models. The first part of the book gives a general introduction to key concepts in algebraic statistics, focusing on methods that are helpful in the study of models with hidden variables. The author uses tensor geometry as a natural language to deal with multivariate probability distributions, develops new combinatorial tools to study models with hidden data, and describes the semialgebraic structure of statistical models. The second part illustrates important examples of tree models with hidden variables. The book discusses the underlying models and related combinatorial concepts of phylogenetic trees as well as the local and global geometry of latent tree models. It also extends previous results to Gaussian latent tree models. This book shows you how both combinatorics and algebraic geometry enable a better understanding of latent tree models. It contains many results on the geometry of the models, including a detailed analysis of identifiability and the defining polynomial constraints.

S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Concentration inequalities for functions of independent random variables is an area of probability theory that has witnessed a great revolution in the last few decades, and has applications in a wide variety of areas such as machine learning, statistics, discrete mathematics, and high-dimensional geometry. Roughly speaking, if a function of many independent random variables does not depend too much on any of the variables then it is concentrated in the sense that with high probability, it is close to its expected value. This book offers a host of inequalities to illustrate this rich theory in an accessible way by covering the key developments and applications in the field. The authors describe the interplay between the probabilistic structure (independence) and a variety of tools ranging from functional inequalities to transportation arguments to information theory. Applications to the study of empirical processes, random projections, random matrix theory, and threshold phenomena are also presented.

N. Cesa-Bianchi, and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, 2006.

This book offers the first comprehensive treatment of the problem of predicting "individual sequences". Unlike standard statistical approaches to forecasting, prediction of individual sequences does not impose any probabilistic assumption on the data-generating mechanism. Yet, prediction algorithms can be constructed that work well for all possible sequences, in the sense that their performance is always nearly as good as the best forecasting strategy in a given reference class. The central theme is the model of "prediction using expert advice", a general framework within which many related problems can be cast and discussed. Repeated game playing, adaptive data compression, sequential investment in the stock market, sequential pattern analysis, and several other problems are viewed as instances of the experts' framework and analyzed from a common nonstochastic standpoint that often reveals new and intriguing connections. Old and new forecasting methods are described in a mathematically precise way with the purpose of characterizing their theoretical limitations and possibilities.

L. Devroye and G. Lugosi, Combinatorial Methods in Density Estimation. Springer, 2000.

Density estimation has evolved enormously since the days of bar plots and histograms, but researchers and users are still struggling with the problem of the selection of the bin widths. This book is the first to explore a new paradigm for the data-based or automatic selection of the free parameters of density estimates in general so that the expected error is within a given constant multiple of the best possible error. The paradigm can be used in nearly all density estimates and for most model selection problems, both parametric and nonparametric.

L. Devroye, L. Györfi, G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer, 1996.

A self-contained and coherent account of probabilistic techniques, covering: distance measures, kernel rules, nearest neighbour rules, Vapnik-Chervonenkis theory, parametric classification, and feature extraction. Each chapter concludes with problems and exercises to further the reader's understanding. Both research workers and graduate students will benefit from this wide-ranging and up-to-date account of a fast-moving field.

M. Greenacre. Correspondence Analysis in Practice. Chapman & Hall, 1993.

Correspondence analysis is a multivariate method for exploring cross-tabular data by converting such tables into graphical displays, called 'maps', and related numerical statistics. Since cross-tabulations are so often produced in the course of social science research, correspondence analysis is valuable in understanding the information contained in these tables. This book fills the gap in the literature between the theory and practice of this method. Various theoretical aspects are presented in a language accessible to both social scientists and statisticians and a wide variety of applications are given which demonstrate the versatility of the method to interpret tabular data in a unique graphical way. The first part of the book deals with basic concepts of correspondence analysis and related methods for analyzing cross-tabulations. It then looks at the multivariate case when there are several variables of interest, including the relationship to cluster analysis, factor analysis and reliability of measurement. Applications to longitudinal data: event history data, panel data and trend data are demonstrated. Finally, it examines further applications in the social sciences, including the analysis of textual data, lifestyle data and data on product descriptions in marketing research. Correspondence Analysis in the Social Sciences gives lecturers, researchers and students a detailed introduction to help them teach the method and apply it to their own research problems. Researchers in psychology, sociology, business, marketing and statistics will all find this book particularly useful.

Current editorial services

Christian Brownlees:

Annals of Financial Economics, Econometrics, Journal of Network Theory in Finance, Journal of Risk and Financial Management

Gábor Lugosi:

Annals of Applied Probability, Journal of Machine Learning Research, Probability Theory and Related Fields

Eulàlia Nualart:

Stochastic Processes and their Applications

David Rossell:

Bayesian Analysis

Piotr Zwiernik:

Journal of the Royal Statistical Society: Series B, Biometrika, Algebraic Statistics, Scandinavian Journal of Statistics

Research projects

"Predicción, Inferencia y Computación en Modelos Estructurados de Alta Dimensión"

Reference: PGC2018-101643-B
Financing entity: Ministerio de Economía y Competitividad (MINECO)
Dates: 2019-2022
Principal investigators: Gábor Lugosi, Omiros Papaspiliopoulos
Amount: € 141,812

"Algorithms and Learning for AI"

Financing entity: Google
Dates: 2018-2020
Principal investigator: Gábor Lugosi
Amount: USD 150,000

“High-dimensional problems in structured probabilistic models”

Financing entity: Fundación BBVA
Dates: 2018-2020
Principal investigator: Gábor Lugosi
Amount: € 100,000

“Estimación de redes latentes”

Reference: MTM2015-67304-P
Financing entity: Ministerio de Economía y Competitividad (MINECO)
Dates: 2016-2018
Principal investigators: Gábor Lugosi, Omiros Papaspiliopoulos
Amount: € 52,998