Location: IHP, Yvonne Choquet-Bruhat lecture hall (second floor of the Perrin building)
14:00: Juhyun Park (ENSIIE and LaMME)
Title: A Geometric Functional Data Analysis for Multi-dimensional Curves
Abstract: Data in the form of curves or function values are increasingly common in modern scientific applications. Viewing such data as infinite-dimensional objects, functional data analysis offers a suite of statistical methods and tools to handle them. Traditional approaches have been developed around the idea of efficient function representation and dimension reduction. However, their extension to multi-dimensional curves poses new challenges, due to inherent geometric features that are difficult to capture with the classical approaches. We propose an alternative notion of mean that reflects the shape variation of the curves. Based on a geometric representation of the curves through the Frenet-Serret ordinary differential equations, we introduce a new definition of mean curvature and mean shape through the mean ordinary differential equation. This new formulation of the mean for multi-dimensional curves allows us to integrate the parameters for the shape features into a unified functional data modelling framework. In addition, it can be interpreted as a generalization of the elastic mean of the curves based on the square root velocity function representation. We formulate the estimation of the functional parameters in a regularized regression framework and develop an efficient algorithm. We demonstrate the proposed method with both simulated data and a real data example.
[This is joint work with Nicolas Brunel and Perrine Chassat.]
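For reference, the Frenet-Serret equations mentioned in the abstract, written in their standard textbook form for a unit-speed curve in R^3, together with the square root velocity function underlying the elastic mean (this is general background, not the speakers' specific mean formulation):

\frac{dT}{ds} = \kappa(s)\,N(s), \qquad
\frac{dN}{ds} = -\kappa(s)\,T(s) + \tau(s)\,B(s), \qquad
\frac{dB}{ds} = -\tau(s)\,N(s)

q(t) = \frac{\dot{x}(t)}{\sqrt{\lVert \dot{x}(t) \rVert}}

Here T, N, B are the tangent, normal and binormal vectors of the moving frame, and \kappa and \tau are the curvature and torsion; the talk's mean curvature and mean shape are defined through an averaged version of this ODE system.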
15:00: Arthur Leroy (INRAE, MIAPS and GABI)
Title: Multi-Mean Gaussian Processes: A probabilistic learning framework for multi-correlated functional data
Abstract: Modelling and forecasting functional data (time series, spatial measurements, ...), even with a probabilistic flavour, is a common and well-handled problem nowadays. However, suppose that one is collecting data from hundreds of individuals, each contributing thousands of biological measurements, all evolving continuously over time. Such a context, frequently arising in biological or medical studies, quickly leads to highly correlated datasets where dependencies come from different sources (for instance, temporal trends or individual similarities). Explicit modelling of overly large covariance matrices accounting for these underlying correlations is generally unreachable due to theoretical and computational limitations. Therefore, practitioners often need to restrict their analysis by working on subsets of data or making arguable assumptions (fixing time, studying genes or individuals independently, ...). To tackle these issues, we propose a framework for multi-task Gaussian processes, tailored to handle multiple functional data simultaneously. By sharing information between tasks through a mean process instead of an explicit covariance structure, this method leads to a learning and forecasting procedure with linear complexity in the number of tasks. The resulting predictions remain Gaussian distributions and thus offer an elegant probabilistic approach to dealing with correlated measurements. Group structures can also be exploited through clustering within the learning procedure to improve prediction performance. We will finally present an extended framework in which as many sources of correlation as desired can be considered while maintaining linear complexity scaling. Several applied examples are explored, drawn from fields such as epidemiology, biology, and sports science.
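To make the idea of sharing information through a mean process concrete, here is a minimal NumPy sketch of a hierarchical Gaussian process with a common mean process. It is written under simplifying assumptions (a common time grid, identical individual covariances, fixed hyper-parameters, no uncertainty propagation for the mean); it illustrates the general principle only and is not the speaker's implementation, and all function and variable names below are chosen for this example.

import numpy as np

def rbf(x1, x2, var=1.0, ell=1.0):
    # Squared-exponential kernel matrix between two sets of 1-D inputs.
    d = x1[:, None] - x2[None, :]
    return var * np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 30)            # common time grid (simplifying assumption)
n_indiv = 5

# Simulated individuals: shared smooth trend + individual wiggle + noise.
true_mean = np.sin(2 * np.pi * t)
Y = np.stack([true_mean
              + 0.3 * np.sin(2 * np.pi * (t + rng.uniform()))
              + 0.1 * rng.standard_normal(t.size)
              for _ in range(n_indiv)])

# Step 1: posterior of the shared mean process m on the grid.
# Model: m ~ GP(0, K_m) and y_i | m ~ N(m, S), with S identical across individuals.
# Then m | y_1..y_n is Gaussian with precision K_m^{-1} + n S^{-1}, and its mean solves
# (K_m^{-1} + n S^{-1}) m_hat = S^{-1} sum_i y_i   -- a cost linear in n.
K_m = rbf(t, t, var=1.0, ell=0.3) + 1e-6 * np.eye(t.size)  # prior covariance of the mean
K_i = rbf(t, t, var=0.2, ell=0.2)                          # individual (residual) covariance
S = K_i + 0.1 ** 2 * np.eye(t.size)                        # plus observation noise
S_inv = np.linalg.inv(S)
A = np.linalg.inv(K_m) + n_indiv * S_inv
m_hat = np.linalg.solve(A, S_inv @ Y.sum(axis=0))

# Step 2: forecast a new individual observed only on the first 10 time points,
# using the estimated mean process as the GP prior mean.
obs = np.arange(10)
y_new = true_mean + 0.1 * rng.standard_normal(t.size)
K_oo = S[np.ix_(obs, obs)]               # covariance among observed points (with noise)
K_po = K_i[:, obs]                       # cross-covariance, without the noise term
pred = m_hat + K_po @ np.linalg.solve(K_oo, y_new[obs] - m_hat[obs])
print(np.round(pred[:5], 2))             # predictive mean on the full grid

The key point is that the shared-mean posterior only requires summing per-individual terms, so the cost grows linearly with the number of individuals; the framework presented in the talk additionally propagates the uncertainty of the mean process and handles distinct grids, hyper-parameters and group structures, which this sketch omits.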
16:00: Toby Dylan Hocking (Université de Sherbrooke, Department of Computer Science)
Title: Finite Sample Complexity Analysis of Binary Segmentation
Abstract: Binary segmentation is the classic greedy algorithm which recursively splits a sequential data set by optimizing some loss or likelihood function. Binary segmentation is widely used for change-point detection in data sets measured over space or time, and as a sub-routine for decision tree learning. In theory, using a simple loss function like Gaussian or Poisson, its asymptotic time complexity should be extremely fast: for N data points and K segments, O(N K) in the worst case and O(N log K) in the best case. In practice, other implementations can be asymptotically slower and can sometimes return incorrect results. We present the R package binsegRcpp, which provides an efficient and correct C++ implementation of binary segmentation. We propose new methods for analyzing the best/worst case number of iterations of the algorithm, as well as empirical analyses which indicate that binsegRcpp has asymptotically optimal speed in practice.
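As an illustration of the algorithm being analyzed, here is a short Python sketch of binary segmentation with the Gaussian (square) loss, using cumulative sums so that each candidate split is scored in constant time. It is a plain reference implementation for intuition, not the binsegRcpp C++ code, and the function names are chosen for this example.

import heapq
import numpy as np

def gauss_cost(csum, csum2, lo, hi):
    # Square loss of fitting one mean to data[lo:hi] (half-open), via cumulative sums.
    n = hi - lo
    s = csum[hi] - csum[lo]
    s2 = csum2[hi] - csum2[lo]
    return s2 - s * s / n

def binary_segmentation(data, n_segments):
    # Greedy binary segmentation; returns the sorted change-point indices
    # (each index is the start of a new segment).
    data = np.asarray(data, dtype=float)
    csum = np.concatenate([[0.0], np.cumsum(data)])
    csum2 = np.concatenate([[0.0], np.cumsum(data ** 2)])

    def best_split(lo, hi):
        # Best single split of data[lo:hi]; returns (loss decrease, split index).
        base = gauss_cost(csum, csum2, lo, hi)
        best = (0.0, None)
        for mid in range(lo + 1, hi):
            dec = base - gauss_cost(csum, csum2, lo, mid) - gauss_cost(csum, csum2, mid, hi)
            if best[1] is None or dec > best[0]:
                best = (dec, mid)
        return best

    # Max-heap of candidate splits, keyed by the decrease in total loss.
    heap = []
    dec, mid = best_split(0, len(data))
    if mid is not None:
        heapq.heappush(heap, (-dec, 0, len(data), mid))
    changes = []
    while heap and len(changes) + 1 < n_segments:
        neg_dec, lo, hi, mid = heapq.heappop(heap)
        changes.append(mid)
        for a, b in ((lo, mid), (mid, hi)):
            d, m = best_split(a, b)
            if m is not None:
                heapq.heappush(heap, (-d, a, b, m))
    return sorted(changes)

# Example: three segments with different means; expected change-points near 50 and 100.
rng = np.random.default_rng(1)
x = np.concatenate([np.zeros(50), np.full(50, 5.0), np.full(50, 2.0)]) + 0.5 * rng.standard_normal(150)
print(binary_segmentation(x, 3))

This naive version rescans every candidate split of every segment, which corresponds to the O(N K) worst case mentioned in the abstract; the talk concerns finer best/worst case iteration counts and an implementation that achieves them.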