Session of 16 September 2019

Session organized by Claire Lacour and Thanh Mai Pham Ngoc.

Location: IHP, room 201

14.00: Zacharie Naulet (Université Paris Sud)

Title: Optimal disclosure risk assessment

Abstract: Protection against disclosure is a legal and ethical obligation for agencies releasing microdata files for public use. Consider a microdata sample of size $n$ from a finite population of size $\bar{n}=n+\lambda n$, with $\lambda>0$, such that each record contains two disjoint types of information: identifying categorical information and sensitive information. Any decision about releasing data is supported by the estimation of measures of disclosure risk, which are functionals of the number of sample records with a unique combination of values of identifying variables. The most common measure is arguably the number $\tau_{1}$ of sample unique records that are population uniques. In this paper, we first study nonparametric estimation of $\tau_{1}$ under the Poisson abundance model for sample records. We introduce a class of linear estimators of $\tau_{1}$ that are simple, computationally efficient and scalable to massive datasets, and we give uniform theoretical guarantees for them. In particular, we show that they provably estimate $\tau_{1}$ all the way up to the sampling fraction $(\lambda+1)^{-1}\propto (\log n)^{-1}$, with vanishing normalized mean-square error (NMSE) for large $n$. We then establish a lower bound for the minimax NMSE for the estimation of $\tau_{1}$, which allows us to show that: i) $(\lambda+1)^{-1}\propto (\log n)^{-1}$ is the smallest possible sampling fraction; ii) the estimators' NMSE is near-optimal, in the sense of matching the minimax lower bound, for large $n$. This is the main result of our paper, and it provides a precise answer to an open question about the feasibility of nonparametric estimation of $\tau_{1}$ under the Poisson abundance model for a sampling fraction $(\lambda+1)^{-1}<1/2$.

Joint work with Federico Camerlenghi, Stefano Favaro and Francesca Panero.
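As an illustration of this setting, here is a minimal simulation sketch (my own toy, not the speakers' code: the exponential rates, the truncation order, and the specific estimator are assumptions). Under Poisson thinning with sampling fraction $p=(\lambda+1)^{-1}$, a Good-Toulmin-type expansion gives $\mathbb{E}[\tau_1]=\sum_{i\ge 1}(-\lambda)^{i-1}\, i\, \mathbb{E}[K_i]$, where $K_i$ is the number of cells with sample frequency $i$; truncating this series yields one simple linear estimator of the kind the abstract alludes to.

```python
# Illustrative sketch only (not from the paper): Poisson abundance model,
# Bernoulli sampling with fraction p = 1/(1+lambda), and a truncated
# Good-Toulmin-type linear estimator of tau_1.
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_cells=50_000, lam=1.0, mean_rate=0.5):
    # Cell j has population count Y_j ~ Poisson(mu_j); the rates mu_j are
    # drawn from an exponential distribution purely for illustration.
    mu = rng.exponential(mean_rate, size=n_cells)
    Y = rng.poisson(mu)
    # Each population record enters the sample independently with
    # probability p = 1/(1+lambda), so X_j | Y_j ~ Binomial(Y_j, p).
    p = 1.0 / (1.0 + lam)
    X = rng.binomial(Y, p)
    # tau_1 = number of sample uniques that are also population uniques.
    tau_1 = int(np.sum((X == 1) & (Y == 1)))
    return X, tau_1

def linear_estimator(X, lam, max_order=3):
    # hat(tau_1) = sum_{i>=1} (-lam)^(i-1) * i * K_i, truncated at max_order,
    # where K_i = number of cells with sample frequency i. The untruncated
    # series is unbiased under Poisson thinning; truncation tames variance.
    return sum((-lam) ** (i - 1) * i * np.sum(X == i)
               for i in range(1, max_order + 1))

X, tau_1 = simulate()
print("true tau_1:", tau_1, " estimate:", linear_estimator(X, lam=1.0))
```

The truncation level trades bias against variance: the series coefficients grow like $i\lambda^{i-1}$, so for small sampling fractions (large $\lambda$) the untruncated estimator has exploding variance, which is one way to see why the feasible range of sampling fractions is the delicate question.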

15.00: Claire Boyer (Sorbonne Université)

Title: On the structure of solutions of convex regularization

Abstract: We establish a general principle stating that regularizing an inverse problem with a convex function yields solutions that are convex combinations of a small number of atoms. These atoms are identified with the extreme points and elements of the extreme rays of the regularizer's level sets. As a side result, we characterize the minimizers of the total gradient variation, which had remained an open problem. This result can be viewed as being in the same vein as representer theorems in machine learning.
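To see the principle at work in the most familiar special case (a toy of my own, not material from the talk): the extreme points of the $\ell_1$ unit ball are the signed canonical basis vectors, so the result predicts that minimizers of an $\ell_1$-regularized least-squares problem with $m$ measurements are built from at most $m$ such atoms, i.e. are $m$-sparse. A bare-bones ISTA solver makes this visible:

```python
# Toy check of the extreme-point principle for the l1 regularizer
# (illustrative sketch, not code from the talk).
import numpy as np

rng = np.random.default_rng(1)
m, d = 5, 40                          # m measurements, d-dimensional signal
Phi = rng.standard_normal((m, d))     # measurement operator
y = rng.standard_normal(m)            # observations
alpha = 0.1                           # regularization weight

x = np.zeros(d)
step = 1.0 / np.linalg.norm(Phi, 2) ** 2     # 1/L for the smooth part
for _ in range(20_000):
    z = x - step * Phi.T @ (Phi @ x - y)     # gradient step
    x = np.sign(z) * np.maximum(np.abs(z) - step * alpha, 0.0)  # soft-threshold

support = np.flatnonzero(np.abs(x) > 1e-8)
# Generically the minimizer uses at most m atoms (signed basis vectors).
print("support size:", support.size, "with m =", m)
```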

16.00: Mohamed Hebiri (Université Paris-Est Marne-la-Vallée)

Title: Minimax semi-supervised confidence sets for multi-class classification

Abstract: Multiclass classification problems such as image annotation can involve a large number of classes. In this context, confusion between classes can occur, and single-label classification may fail. In this talk, I will present a general device for building a confidence set classifier instead of a single-label classifier. In our framework, the goal is to build the best confidence set classifier with a given expected size, and an attractive feature of our approach is its semi-supervised nature: the construction of the confidence set classifier takes advantage of unlabeled data. Our study of the minimax rates of convergence under a combination of margin and nonparametric assumptions reveals that no supervised method outperforms the semi-supervised estimator proposed in this work. To further highlight the fundamental difference between supervised and semi-supervised methods, we establish that the best achievable rate for any supervised method is $n^{-1/2}$, even if the margin assumption is extremely favourable. By contrast, with a sufficiently large unlabeled sample we are able to significantly improve this rate.
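As a rough sketch of how unlabeled data can enter such a construction (this is my reading of the abstract; the calibration rule below is an assumption, not the authors' procedure): given estimated class scores $\hat p_k(x)$, one can tune a common threshold $t$ on an unlabeled pool so that the average size of the predicted set $\{k:\hat p_k(x)\ge t\}$ matches the target expected size $\beta$, then apply that threshold at prediction time.

```python
# Hedged sketch: calibrating a confidence-set threshold on unlabeled data.
# The scores and the calibration rule are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(2)

def confidence_sets(scores_unlab, scores_test, beta):
    # scores_*: (n_points, n_classes) arrays of estimated class posteriors.
    # Pick the threshold whose average set size on the unlabeled pool is
    # closest to the target expected size beta.
    ts = np.unique(scores_unlab)
    sizes = (scores_unlab[None, :, :] >= ts[:, None, None]).sum(2).mean(1)
    t = ts[np.argmin(np.abs(sizes - beta))]
    return scores_test >= t           # boolean class membership per point

# Toy posterior scores for a 4-class problem (stand-ins for a fitted model).
unlabeled = rng.dirichlet(np.ones(4), size=1000)
test = rng.dirichlet(np.ones(4), size=5)
print(confidence_sets(unlabeled, test, beta=2.0).astype(int))
```

Note that this plug-in picture only shows where the unlabeled sample can enter; the content of the talk is the minimax analysis showing that such unlabeled data is what makes rates faster than $n^{-1/2}$ attainable at all.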