Session of 14 January 2013

Monday, 14 January 2013

Organizers: Cécile Durot and Ghislaine Gayraud

14h00 Ana Karina Fermin (Université Paris Ouest)

Title: Active Learning and Model Selection for the Regression Problem

Abstract: We consider a problem of model selection and active learning for regression. In a regression model, one wishes to estimate an unknown function from a training sample (t1, y1), ..., (tn, yn). In many settings, particularly industrial ones, measuring yi at the point ti is costly. We study how to choose the most informative points ti (according to a given decision criterion), without knowing the observed values yi.

We first propose a new technique for selecting an optimal subsample for a fixed model. We then propose strategies that combine active learning and model selection to choose an optimal sample and an optimal model simultaneously.

We study two approaches: a batch approach, which chooses a sampling strategy based on a first estimate, and a sequential approach, which proceeds iteratively (similar to the active-learning techniques for classification recently proposed by Beygelzimer, Dasgupta et al.). In both cases, using concentration inequalities, we obtain good theoretical guarantees for the selected estimator.
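The idea of picking informative design points without ever looking at the responses can be illustrated by a small sketch. This is not the speaker's method, only a hedged toy example: greedy D-optimal subsampling for a simple linear model y = a + b*t, where at each step we add the candidate point that most increases det(X^T X).

```python
# Illustrative sketch (hypothetical names and data): greedy D-optimal
# point selection for the model y = a + b*t. Only the design points t
# are used -- the responses y_i are never needed, which is the core
# idea behind selecting informative points before measuring them.

def info_det(points):
    """det(X^T X) for the design with feature rows (1, t)."""
    n = len(points)
    s = sum(points)
    s2 = sum(t * t for t in points)
    return n * s2 - s * s  # 2x2 determinant

def greedy_d_optimal(candidates, budget):
    """Greedily add the candidate point that most increases det(X^T X)."""
    chosen, remaining = [], list(candidates)
    for _ in range(budget):
        best = max(remaining, key=lambda t: info_det(chosen + [t]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# The selected points spread over the candidate grid, favouring the
# extremes first, as D-optimality suggests for a linear trend.
print(greedy_d_optimal([0.0, 0.25, 0.5, 0.75, 1.0], budget=3))
```

The greedy rule is only a heuristic stand-in for the decision criteria studied in the talk, but it shows how a sampling budget can be spent on points chosen purely from the design.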

15h00 Rik Lopuhaä (Delft Institute of Applied Mathematics, The Netherlands)

Title: Central limit theorem and influence function for the MCD estimators at general multivariate distributions

Abstract: The minimum covariance determinant (MCD) estimators of multivariate location and scatter are robust alternatives to the ordinary sample mean and sample covariance matrix. Nowadays they are used to determine robust Mahalanobis distances in a reweighting procedure, and serve as robust plug-ins in all sorts of multivariate statistical techniques that need a location and/or covariance estimate, such as principal component analysis, factor analysis, discriminant analysis and linear multivariate regression. For this reason, the distributional and robustness properties of the MCD estimators are essential for conducting inference and performing robust estimation in several statistical models. Butler, Davies and Jhun (1993) prove asymptotic normality only for the MCD location estimator, whereas the MCD covariance estimator is only shown to be consistent. Croux and Haesbroeck (1999) give the expression for the influence function of the MCD covariance functional and use it to compute limiting variances of the MCD covariance estimator. However, the expression is obtained under the assumption of existence, continuity and differentiability of the MCD functionals at perturbed distributions, which is not proven. Moreover, the computation of the limiting variances relies on the von Mises expansion of the estimator, which has not been established.

In this presentation we define the MCD functional by means of trimming functions belonging to a wide class of measurable functions. The class is very flexible and allows a uniform treatment at general probability measures, including empirical measures and perturbed measures. We prove existence of the MCD functional for any multivariate distribution P and provide a separating-ellipsoid property for the functional. Furthermore, we prove continuity of the functional, which also yields strong consistency of the MCD estimators. Finally, we derive an asymptotic expansion of the functional, from which we rigorously derive the influence function, and establish a central limit theorem for both MCD estimators. All results are obtained under very mild conditions on P, and essentially all conditions are automatically satisfied for distributions with a density. For distributions with a unimodal elliptically contoured density no extra condition is needed, and one recovers the results of Butler, Davies and Jhun (1993) and Croux and Haesbroeck (1999).
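The raw MCD construction the abstract builds on can be sketched in a few lines. This is only an illustration under toy assumptions, not the speaker's functional framework: among all h-point subsets of the sample, pick the one whose sample covariance matrix has minimal determinant; its mean and covariance are then the robust estimates.

```python
import itertools
import statistics

# Illustrative sketch (hypothetical data): raw MCD by exhaustive search
# over h-subsets of a tiny 2-d sample. Real implementations (e.g. the
# FastMCD algorithm) avoid the combinatorial search, but the objective
# -- minimise the determinant of the subset covariance -- is the same.

def cov_det(points):
    """Determinant of the 2x2 sample covariance matrix of 2-d points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    vx, vy = statistics.variance(xs), statistics.variance(ys)
    cxy = sum((x - mx) * (y - my) for x, y in points) / (len(points) - 1)
    return vx * vy - cxy * cxy

def mcd_subset(points, h):
    """Return the h-subset minimising the covariance determinant."""
    return min(itertools.combinations(points, h), key=cov_det)

data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (0.3, 0.1), (5.0, 5.0)]
best = mcd_subset(data, h=4)
print((5.0, 5.0) not in best)  # the gross outlier is left out
```

Because the optimal subset excludes the outlier, the resulting location and scatter estimates are not dragged towards it, which is exactly the robustness property that motivates studying the MCD functional's influence function.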

16h00 Camille Charbonnier (LITIS, Université de Rouen)

Title: Homogeneity Tests for High-dimensional Linear Regression

(joint work with Nicolas Verzelen and Fanny Villers)

Abstract: In this talk, we consider a two-sample linear regression model. The objective is to test whether the sample-specific models are the same. The difficulty of the task comes from the high-dimensional setting, where the number of covariates p is larger than the numbers of observations n_1 and n_2 in the two samples. To tackle this issue, we adapt the one-sample testing procedure described in [1] to the two-sample framework and provide corresponding theoretical controls on type-I error and power. We also investigate an adaptation of higher criticism to the two-sample testing problem. If powerful enough, this strategy would present a clear advantage in terms of computing time when facing high-dimensional datasets. We provide numerical experiments comparing the performance of those testing strategies under orthonormal and correlated random design settings.
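The homogeneity question itself is easy to state in low dimensions. As a hedged analogue of the two-sample problem (not the high-dimensional procedure of the talk, which handles p larger than n_1 and n_2), one can fit y = a + b*t separately in each sample and compare the slopes with a normal-approximation z test; all names and data below are illustrative.

```python
import math
import statistics

# Hypothetical low-dimensional sketch: test whether two samples share
# the same regression slope. This classical test breaks down when the
# number of covariates exceeds the sample sizes, which is the setting
# the talk's procedures are designed for.

def fit_slope(ts, ys):
    """OLS slope of y = a + b*t and the squared standard error of b."""
    n = len(ts)
    mt, my = statistics.mean(ts), statistics.mean(ys)
    sxx = sum((t - mt) ** 2 for t in ts)
    b = sum((t - mt) * (y - my) for t, y in zip(ts, ys)) / sxx
    a = my - b * mt
    rss = sum((y - a - b * t) ** 2 for t, y in zip(ts, ys))
    return b, rss / (n - 2) / sxx

def slope_z_stat(ts1, ys1, ts2, ys2):
    """z statistic for H0: both samples share the same slope."""
    b1, v1 = fit_slope(ts1, ys1)
    b2, v2 = fit_slope(ts2, ys2)
    return (b1 - b2) / math.sqrt(v1 + v2)

ts = [0.0, 1.0, 2.0, 3.0, 4.0]
ys1 = [0.1, 2.0, 3.9, 6.1, 8.0]   # slope close to 2
ys2 = [0.0, 0.9, 2.1, 2.9, 4.1]   # slope close to 1
print(abs(slope_z_stat(ts, ys1, ts, ys2)) > 3)  # clearly rejects H0
```

A large |z| rejects the hypothesis that the two sample-specific models coincide; the talk's contribution is to obtain comparable type-I error and power controls when p exceeds n_1 and n_2.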

[1] N. Verzelen and F. Villers. Goodness-of-fit tests for high-dimensional Gaussian linear models. The Annals of Statistics, 38(2):704-752, 2010.