Statistical Inference from Multiscale Biological Data: Theory, Algorithms, Applications
Following the advent of high-throughput techniques in the life sciences, systems across multiple scales (from single molecules to organisms and up to entire populations) can now be probed quantitatively at high spatial and temporal resolution. Examples of the resulting data include:
Families of homologous proteins;
High-resolution maps of the microenvironment of a population of cells;
High-dimensional data on complex heterogeneous patient groups and their clinical progression;
Large-scale surveys of the epidemiological state of a population through combinations of public health, social, behavioural and genomic indicators.
Besides continuously improving our knowledge of the mechanisms that regulate biological processes, these data also foster the hope of uncovering previously unknown quantitative laws, levels of organisation or design principles. Statistical inference aims to use data to learn an underlying probability distribution, which can then serve as a generative model for prediction, classification or design purposes. Unfortunately, the effectiveness of standard statistical inference protocols, such as Maximum Likelihood (ML) or Maximum a Posteriori (MAP) inference, for analysing biological datasets is often hampered (i) by the high dimensionality and strong scale heterogeneity of the data, and (ii) by the fact that the data vastly under-sample the space of possible states of a biological system. SIMBAD aims to build better algorithms, and hence more powerful generative models, by achieving a deeper mathematical understanding of inference under these conditions. We shall apply our methods to open problems from four specific application areas at different spatial and temporal scales: protein-sequence spaces, cellular metabolism, digital contact tracing and medical data analytics.
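As a toy illustration of the under-sampling problem mentioned above (not a SIMBAD method), the following minimal sketch compares ML and MAP estimates of a categorical distribution when there are far fewer samples than states. The state space size, the sample values and the symmetric Dirichlet prior with parameter alpha are all assumptions chosen purely for illustration.

```python
import math
from collections import Counter


def ml_estimate(samples, n_states):
    """Maximum Likelihood estimate: empirical frequencies."""
    counts = Counter(samples)
    n = len(samples)
    return [counts.get(s, 0) / n for s in range(n_states)]


def map_estimate(samples, n_states, alpha=2.0):
    """MAP estimate under a symmetric Dirichlet(alpha) prior (assumed here).

    The posterior is Dirichlet(alpha + counts); for alpha > 1 its mode is
    (count_i + alpha - 1) / (N + K * (alpha - 1)).
    """
    if alpha <= 1.0:
        raise ValueError("this closed form needs alpha > 1")
    counts = Counter(samples)
    denom = len(samples) + n_states * (alpha - 1.0)
    return [(counts.get(s, 0) + alpha - 1.0) / denom for s in range(n_states)]


def log_likelihood(probs, samples):
    """Log-likelihood of samples; -inf if any sample has zero probability."""
    total = 0.0
    for s in samples:
        if probs[s] == 0.0:
            return float("-inf")
        total += math.log(probs[s])
    return total


# Hypothetical data: 20 possible states, only 5 training observations.
n_states = 20
train = [0, 1, 1, 3, 7]
held_out = [1, 2]  # state 2 never appears in the training data

p_ml = ml_estimate(train, n_states)
p_map = map_estimate(train, n_states)

print("ML  held-out log-likelihood:", log_likelihood(p_ml, held_out))
print("MAP held-out log-likelihood:", log_likelihood(p_map, held_out))
```

In this sketch the ML estimate assigns zero probability to every state that was not observed, so a single unseen state in the held-out data drives its log-likelihood to minus infinity, whereas the prior keeps the MAP estimate finite. Real biological datasets are far higher-dimensional, and the choice of prior then becomes a modelling question in its own right.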
Our goals in a nutshell
To devise innovative inference algorithms and model selection tools
To explore new mathematical routes to handling biological heterogeneity
To enable the efficient reconstruction of the fitness landscapes of biological sequences
To develop new models of protein and pathogen evolution
To achieve a better characterisation of the feasible space of cellular metabolic networks
To improve inference tools for epidemic risks using mobility and proximity data
To analyse and correct overfitting effects in epidemiological models
To establish robust survival analysis methods for heterogeneous populations with competing risks
To design efficient collaborative learning tools for rare diseases