Active and batch Segmentation, Clustering, and Seriation: toward unified foundations in AI
The project ASCAI is funded by the ANR (Agence Nationale de la Recherche) and the DFG (Deutsche Forschungsgemeinschaft) through the PRCI program (AAPG 2021), and runs from Feb. 2022 to Jan. 2026.
Unsupervised learning is one of the most fundamental problems of machine learning and, more generally, of artificial intelligence. In a broad sense, it amounts to learning some unobserved latent structure over data. This structure may be of interest per se, or may serve as an important stepping stone in a complex data analysis pipeline - especially since large amounts of unlabeled data are far more common than costly labeled data. Arguably, one of the cornerstones of unsupervised learning is clustering, where the aim is to recover a partition of the data into homogeneous groups. Besides vanilla clustering, unsupervised learning encompasses a large variety of related problems, such as hierarchical clustering, where the group structure is more complex and reveals both the backbone and the fine-grained organization of the data; segmentation, where the shape of the clusters is constrained by side information; and ranking or seriation problems, where no actual cluster structure exists but there is some implicit ordering between the data. All these problems have found countless applications, and interest in these methods keeps growing with the amount of available unlabeled data. Consider, for instance, crowdsourcing, where individuals answer a subset of questions and where, depending on the context, one might want to cluster them according to their field of expertise, rank them according to their performance, or seriate them according to their affinities. Such problems are highly relevant to recommender systems - where individuals are users and questions are items - and to social network analysis.
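To make the crowdsourcing example concrete, here is a minimal synthetic sketch in Python (our own illustration, not an algorithm from the project) of the three latent-structure tasks on a binary answer matrix: clustering users via a one-dimensional spectral embedding, ranking them by empirical performance, and seriating them with a classical Fiedler-vector heuristic. All names and parameters below are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n_users, n_questions = 60, 200

# Planted structure: two expertise groups, each answering one half of the
# questions well (probability 0.8 of a correct answer) and the other half badly.
groups = rng.integers(0, 2, size=n_users)
first_half = (np.arange(n_questions) < n_questions // 2).astype(int)
skill = np.where(groups[:, None] == first_half, 0.8, 0.3)
answers = rng.binomial(1, skill).astype(float)   # 1 = correct, 0 = incorrect

# Clustering: split users into homogeneous groups via a 1D spectral embedding.
centered = answers - answers.mean(axis=0)
u, _, _ = np.linalg.svd(centered, full_matrices=False)
clusters = (u[:, 0] > 0).astype(int)             # sign of the top left singular vector

# Ranking: order users by their empirical performance.
ranking = np.argsort(-answers.mean(axis=1))

# Seriation: order users so that similar users end up adjacent, using the
# Fiedler vector of the affinity Laplacian (a classical spectral heuristic).
affinity = answers @ answers.T
laplacian = np.diag(affinity.sum(axis=1)) - affinity
_, eigvecs = np.linalg.eigh(laplacian)
seriation = np.argsort(eigvecs[:, 1])            # second-smallest eigenvector

print("cluster sizes:", np.bincount(clusters))
print("top-5 ranked users:", ranking[:5])
print("seriation order (first 10):", seriation[:10])

The same matrix thus supports all three readings: a partition (clustering), a total order by skill (ranking), and an order by affinity (seriation).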
The analysis of unsupervised learning procedures has a long history, rooted in both the computer science and mathematics communities. In response to recent bridges between these two communities, groundbreaking advances have been made in the theoretical foundations of vanilla clustering. We believe that these recent advances hold the key to a deep impact on the broader field of unsupervised learning, because of the pervasive nature of clustering. In this proposal, we first aim to propagate these advances in vanilla clustering to problems where the latent structure is either more complex or more constrained. We will consider problems of increasing latent structure complexity - starting from hierarchical clustering and heading toward ranking, seriation, and segmentation - and propose new algorithms that build on each other, focusing on the interfaces between these problems. As a result, we expect to provide new methods that are valid under weaker assumptions than are usually made - e.g., parametric assumptions - while also being able to adapt to the unknown intrinsic difficulty of the problem.
Moreover, many modern unsupervised learning applications are essentially of an online nature - and sometimes decisions have to be made sequentially on top of that. Consider, for instance, a recommender system that sequentially recommends items to users. In this context, where sequential, active recommendations are made, it is important to leverage the latent structure underlying the individuals. While both the field of unsupervised learning and that of sequential, active learning are thriving, research at their crossroads has mostly been conducted separately by each community, leading to procedures that leave room for improvement. A second aim of this proposal is therefore to bring together the fields of unsupervised learning and active learning, in order to propose new algorithms that are more efficient at sequentially leveraging the unknown latent structure. We will consider the same unsupervised learning problems as on the batch learning side of the proposal, and focus on developing algorithms that fully take advantage of recent advances in clustering and of our own future work in batch learning.
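As a toy illustration of this online setting (a minimal sketch under assumptions of our own, not a method from the project), the following Python snippet simulates clustering with bandit feedback: noisy entries of a user-item preference matrix are revealed one query at a time, and the budget is spent preferentially on users whose cluster assignment is still ambiguous. The separating score is hard-coded from the planted model purely for brevity; a genuine active clustering algorithm would have to learn it from data.

import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, budget = 20, 50, 4000

# Planted structure: two user groups with opposite preferences on odd/even items.
groups = rng.integers(0, 2, size=n_users)
means = np.where(groups[:, None] == (np.arange(n_items) % 2), 0.7, 0.3)

# One uniform pass over all (user, item) pairs initializes the estimates.
counts = np.ones((n_users, n_items))
sums = rng.binomial(1, means).astype(float)

def score(est):
    # Signed 1D score separating the two planted profiles (odd-item rate minus
    # even-item rate); hard-coded here for brevity, learned in a real algorithm.
    return est[:, 1::2].mean(axis=1) - est[:, 0::2].mean(axis=1)

for _ in range(budget - n_users * n_items):
    s = score(sums / counts)
    width = 1.0 / np.sqrt(counts.sum(axis=1))   # crude per-user uncertainty
    user = int(np.argmin(np.abs(s) - width))    # query the most ambiguous user
    item = int(rng.integers(n_items))
    sums[user, item] += rng.binomial(1, means[user, item])
    counts[user, item] += 1

clusters = (score(sums / counts) > 0).astype(int)
agreement = max(np.mean(clusters == groups), np.mean(clusters != groups))
print(f"clustering agrees with the planted groups on {agreement:.0%} of users")

Compared with spreading the budget uniformly, this adaptive allocation concentrates queries on hard-to-classify users - the kind of gain that motivates combining unsupervised and active learning.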
ASCAI is an ANR-DFG PRCI project that spans four research institutes (see below). The speakers are Nicolas Verzelen (French side) and Alexandra Carpentier (German side).
1st ASCAI (Kick-Off) Workshop in Montpellier: March 1-2, 2022, at INRAE Montpellier (Institut Agro campus, 2 place Pierre Viala); we will meet in Building 11, Room 204 (see the campus map). Here is the program.
2nd ASCAI Workshop in Munich: February 28 - March 2, 2023, at the Technical University of Munich, School of Computation, Information and Technology (Garching campus). Here is the program (including location information).
3rd ASCAI Workshop in Potsdam: February 20-21, 2024, at the University of Potsdam. More details here (including program and location information).
4th ASCAI (closing) Workshop in Orsay: June 18-20, 2024. More details here (including program and location information).
Nicolas Verzelen (speaker)
El Mehdi Saad
Alexandre Lecestre
Julie Josse
Etienne Roquain
Isabelle Sanchez
Emmanuel Pilliat
Publications 2025:
1. K. Abraham et al. "Frontiers to the learning of nonparametric hidden Markov models" Journal of Machine Learning Research 26.155, pp. 1–75.
2. E. Arias-Castro et al. "Theoretical Foundations of Ordinal Multidimensional Scaling, Including Internal Unfolding and External Unfolding" SIAM Journal on Mathematics of Data Science 7.3, pp. 1422–1440.
3. J. Capitao-Miniconi et al. "Deconvolution of repeated measurements corrupted by unknown noise" Electronic Journal of Statistics 19.2, pp. 4141–4173.
4. A. Carpentier "A simple and improved algorithm for noisy, convex, zeroth-order optimisation" Mathematical Statistics and Learning.
5. J. Chhor et al. "Local goodness-of-fit testing for Hölder-continuous densities: Minimax rates" Bernoulli 31.4, pp. 2747–2771.
6. M. Eppert et al. "Recovering imbalanced clusters via gradient-based projection pursuit" arXiv preprint arXiv:2502.
7. P. Esser et al. "Theoretical Foundations of Representation Learning using Unlabeled Data: Statistics and Optimization" arXiv preprint arXiv:2509.18997.
8. B. Even et al. "Computational barriers for permutation-based problems, and cumulants of weakly dependent random variables" arXiv preprint arXiv:2507.07946.
9. M. Graf et al. "Clustering Items through Bandit Feedback: Finding the Right Feature out of Many" arXiv preprint arXiv:2503.11209.
10. V. Thuot et al. "Clustering with bandit feedback: breaking down the computation/information gap" 36th International Conference on Algorithmic Learning Theory.
Publications 2024:
1. E. Aamari et al. "A theory of stratification learning" arXiv preprint arXiv:2405.20066.
2. C. Arnal et al. "Learning with Hidden Factorial Structure" arXiv preprint arXiv:2411.01375.
3. C. Berenfeld et al. "Seriation of Toeplitz and latent position matrices: optimal rates and computational trade-offs" Accepted in Bernoulli, arXiv preprint arXiv:2408.10004.
4. C. Berenfeld et al. "Estimating a density near an unknown manifold: a Bayesian nonparametric approach" The Annals of Statistics 52.5, pp. 2081–2111.
5. B. Even et al. "Computation-information gap in high-dimensional clustering" The Thirty Seventh Annual Conference on Learning Theory (COLT), PMLR, pp. 1646–1712.
6. M. Fleissner et al. "Decision trees for interpretable clusters in mixture models and deep representations" arXiv preprint arXiv:2411.01576.
7. M. Fleissner et al. "Explaining kernel clustering via decision trees" arXiv preprint arXiv:2402.09881.
8. I. Meah et al. "False discovery proportion envelopes with m-consistency" Journal of Machine Learning Research 25.270, pp. 1–52.
9. Y. Issartel et al. "Minimax optimal seriation in polynomial time" arXiv preprint arXiv:2405.08747.
10. E. Pilliat et al. "Optimal rates for ranking a permuted isotonic matrix in polynomial time" Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), SIAM, pp. 3236–3273.
11. E. M. Saad et al. "On weak regret analysis for dueling bandits" Advances in Neural Information Processing Systems 37, pp. 40752–40791.
12. S. Thépaut et al. "Optimal estimation of Schatten norms of a rectangular matrix" The Annals of Statistics 52.4, pp. 1334–1359.
Publications 2023:
1. E. Aamari et al. "Optimal reach estimation and metric learning" The Annals of Statistics 51.3, pp. 1086–1108.
2. J. Capitao-Miniconi et al. "Deconvolution of spherical data corrupted with unknown noise" Electronic Journal of Statistics 17.1, pp. 607–649.
3. P. Esser et al. "Improved representation learning through tensorized autoencoders" International Conference on Machine Learning, PMLR, pp. 8294–8307.
4. P. Esser et al. "Representation learning dynamics of self-supervised models" arXiv preprint arXiv:2309.02011.
5. E. Gassiat et al. "Clustering and Classification Risks in Non-parametric Hidden Markov and IID Models" arXiv preprint arXiv:2309.
6. T. Kocák et al. "Online learning with feedback graphs: The true shape of regret" International Conference on Machine Learning, PMLR, pp. 17260–17282.
7. S. Mukherjee et al. "Wasserstein Projection Pursuit of Non-Gaussian Signals" arXiv preprint arXiv:2302.12693.
8. M. Perrot-Dockès et al. "Post hoc false discovery proportion inference under a hidden Markov model" TEST 32, pp. 1365–1391.
9. E. Pilliat et al. "Optimal permutation estimation in crowdsourcing problems" The Annals of Statistics 51.3, pp. 935–961.
10. E. M. Saad et al. "Active ranking of experts based on their performances in many tasks" International Conference on Machine Learning, PMLR, pp. 29490–29513.
11. E. M. Saad et al. "Covariance-adaptive best arm identification" Advances in Neural Information Processing Systems 36, pp. 73287–73298.
Publications 2022:
1. K. Abraham et al. "Fundamental limits for learning hidden Markov model parameters" IEEE Transactions on Information Theory 69.3, pp. 1777–1794.
2. K. Abraham et al. "Multiple testing in nonparametric hidden Markov models: An empirical Bayes approach" Journal of Machine Learning Research 23.94, pp. 1–57.
3. E. Gassiat et al. "Deconvolution with unknown noise distribution is possible for multivariate signals" The Annals of Statistics 50.1, pp. 303–323.
4. S. Gaucher et al. "The price of unfairness in linear bandits with biased feedback" Advances in Neural Information Processing Systems 35, pp. 18363–18376.
5. O. Hacquard et al. "Topologically penalized regression on manifolds" Journal of Machine Learning Research 23.161, pp. 1–39.
6. A. Mandal et al. "A revenue function for comparison-based hierarchical clustering" arXiv preprint arXiv:2211.164.
Acknowledgements: The project ASCAI is funded by the ANR (Agence Nationale de la Recherche) and the DFG (Deutsche Forschungsgemeinschaft) through the PRCI program (AAPG 2021), and runs from Feb. 2022 to Jan. 2026.