Workshop on Recent Developments in
Geometric Clustering
Clustering, the task of grouping similar data points together, is a fundamental problem in machine learning and theoretical computer science, and a classic topic in computational geometry. Clustering algorithms are widely used across domains, including image segmentation, natural language processing, bioinformatics, and recommendation systems. In recent years, there have been significant developments in clustering algorithms that address new considerations such as fairness, explainability, and compatibility with large-scale systems.
Research on clustering problems is relevant and important to the computational geometry community. In this workshop, we aim to present a selection of recent algorithmic developments on these themes, with a focus on clustering of high-dimensional data. We hope this workshop sparks future research on advanced aspects of clustering for geometric data. The workshop will be held as part of CG Week 2023, which combines a number of events, most notably the International Symposium on Computational Geometry (SoCG). The workshop consists of invited talks and contributed talks. We plan to hold the event in person to increase interaction between the participants; however, we will run a virtual session for speakers who are unable to attend in person.
Schedule:
1:30 - 2:20. Vincent Cohen-Addad (Google Research) – Recent Developments in Euclidean Clustering: Sketching and Approximation Algorithms.
Abstract: Clustering points in Euclidean space is a classic problem that has emerged in a variety of communities, from statistics and data compression to operations research and computational geometry. A common formulation is the k-means objective which, given n input points, asks to identify k representative points so as to minimize the sum of the squared distances from each input point to its closest representative.
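For concreteness, the objective can be stated in standard notation (our formulation, not quoted from the abstract): given points $x_1, \dots, x_n \in \mathbb{R}^d$, find a set $C \subset \mathbb{R}^d$ with $|C| = k$ minimizing
\[
\mathrm{cost}(C) \;=\; \sum_{i=1}^{n} \min_{c \in C} \lVert x_i - c \rVert^2 .
\]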
From a computational standpoint, understanding the complexity of the k-means objective has been a challenge, with little progress made until recently, when new upper and lower bounds on the approximation guarantees achievable in polynomial time were proven. From a sketching perspective, namely for the problem of compressing the input data while preserving its k-means cost, a tremendous amount of work has gone into developing the best possible sketching techniques over the last 20 years. Over the last 3 years, important progress has been made on these sketching techniques, which are now nearly optimal.
I'll review the key ingredients that enabled these recent advances and present the most promising research directions.
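The sketching guarantee mentioned above is typically formalized via coresets; the following standard definition is assumed here rather than taken from the talk. A small weighted set $S$ is an $\varepsilon$-coreset for the input $X$ if
\[
(1-\varepsilon)\,\mathrm{cost}_X(C) \;\le\; \mathrm{cost}_S(C) \;\le\; (1+\varepsilon)\,\mathrm{cost}_X(C)
\quad \text{for every set } C \text{ of } k \text{ centers},
\]
so that clustering $S$ in place of $X$ changes the cost of every candidate solution by at most a $(1 \pm \varepsilon)$ factor.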
2:20 - 3:00. Samson Zhou (UC Berkeley & Rice U.) – Learning-Augmented Algorithms for k-means Clustering
Abstract: Although k-means clustering is a well-studied problem, there are strong theoretical lower bounds for worst-case inputs. To overcome this barrier, we consider a scenario where “advice” is provided to help perform clustering. We consider the k-means problem augmented with a predictor that, given any point, returns its cluster label in an approximately optimal clustering up to some error. We present an algorithm whose performance improves along with the accuracy of the predictor, even though naively following the accurate predictor can still lead to a high clustering cost. Thus if the predictor is sufficiently accurate, we can retrieve a close to optimal clustering with nearly optimal runtime, breaking known computational barriers for algorithms that do not have access to such advice.
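As a rough illustration of the setting (a minimal sketch under our own assumptions, not the algorithm from the talk), the snippet below implements the naive baseline of trusting the predictor's labels outright; the function names and the predictor interface are hypothetical:

```python
import numpy as np

def naive_predictor_clustering(points, predictor, k):
    """Naive baseline: trust the predictor's labels and take per-cluster means.

    `points` is an (n, d) array; `predictor` maps a point to a label in
    {0, ..., k-1}. If the predictor errs on even a few points, these means
    (and hence the k-means cost) can be far from optimal, which is why a
    learning-augmented algorithm must post-process the labels rather than
    follow them blindly.
    """
    labels = np.array([predictor(p) for p in points])
    centers = np.zeros((k, points.shape[1]))
    for j in range(k):
        members = points[labels == j]
        if len(members) > 0:
            centers[j] = members.mean(axis=0)
    return labels, centers

def kmeans_cost(points, centers):
    """Sum of squared distances from each point to its closest center."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())
```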
Based on joint work with Jon Ergun, Zhili Feng, Sandeep Silwal, and David P. Woodruff.
3:00 - 3:30. Break
3:30 - 4:20. Deeparnab Chakrabarty (Dartmouth College) – The Round-or-Cut Technique for Designing Approximation Algorithms for Clustering Problems
Abstract: Many clustering problems are NP-hard, and extensive research has therefore gone into designing approximation algorithms for them. Indeed, many techniques in approximation algorithms have been honed in the study of these fundamental problems. In this talk, I will discuss the “round-or-cut technique” (also called the “cutting plane technique” in integer programming parlance), which is probably not as well known as many other techniques in approximation algorithms. In the last five years, however, this technique has led to approximation algorithms for many clustering problems. I will present the general technique and illustrate it on two clustering problems: the first is clustering with the k-center objective in the presence of outliers; the second is an application to clustering in continuous metric spaces.
The talk is intended to be self-contained; no prior familiarity with rounding or cutting techniques will be assumed.
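At a high level, round-or-cut runs a cutting-plane loop: given a candidate fractional solution, either round it to an approximate integral solution, or exhibit a valid inequality the candidate violates and re-solve. The skeleton below is a hedged sketch of this loop; the callbacks solve_lp, try_round, and find_cut are hypothetical placeholders whose instantiations are entirely problem-specific:

```python
def round_or_cut(initial_constraints, solve_lp, try_round, find_cut,
                 max_iters=1000):
    """Generic round-or-cut (cutting-plane) loop.

    solve_lp(constraints) -> a fractional solution x, or None if infeasible.
    try_round(x)          -> an approximate integral solution, or None if
                             rounding fails on this particular x.
    find_cut(x)           -> a valid inequality violated by x; adding it
                             cuts x off before re-solving.
    """
    constraints = list(initial_constraints)
    for _ in range(max_iters):
        x = solve_lp(constraints)
        if x is None:
            # Relaxation infeasible: e.g., the guessed optimum value was
            # too small, so the caller should retry with a larger guess.
            return None
        solution = try_round(x)
        if solution is not None:
            return solution  # rounding succeeded
        constraints.append(find_cut(x))  # cut off x and try again
    raise RuntimeError("cutting-plane loop did not converge")
```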
4:20 - 5:00. Liren Shan (Northwestern U.) – Explainable k-Means: Don’t Be Greedy, Plant Bigger Trees!
Abstract: We provide a new bi-criteria Õ(log^2 k)-competitive algorithm for explainable k-means clustering. Explainable k-means was recently introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). Such a clustering is described by an easy-to-interpret threshold decision tree or diagram. The cost of an explainable k-means clustering equals the sum of the costs of its clusters, and the cost of each cluster equals the sum of squared distances from the points in the cluster to the center of that cluster. The best non-bi-criteria algorithm for explainable clustering is Õ(k)-competitive, and this bound is tight. Our randomized bi-criteria algorithm constructs a threshold decision tree that partitions the data set into (1+δ)k clusters (where δ is a parameter of the algorithm). The cost of this clustering is at most Õ(1/δ · log^2 k) times the cost of the optimal unconstrained k-means clustering. We show that this bound is almost optimal.
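To make the threshold-tree notion concrete, here is a toy Python sketch (our own illustration, not the randomized bi-criteria algorithm from the talk). It assumes k distinct reference centers, e.g., from an ordinary k-means run, and recursively cuts along the widest gap between center coordinates until each leaf holds exactly one center:

```python
import numpy as np

def threshold_tree_labels(points, centers):
    """Label each point by recursively applying axis-aligned threshold cuts
    until every leaf contains exactly one reference center.

    Toy greedy variant: at each node, cut along the coordinate where the
    remaining centers are most spread out, at the midpoint of the widest gap
    between consecutive center coordinates. Assumes distinct centers.
    """
    labels = np.empty(len(points), dtype=int)

    def recurse(point_idx, center_idx):
        if len(center_idx) == 1:
            labels[point_idx] = center_idx[0]  # leaf: one center remains
            return
        sub = centers[center_idx]
        dim = int(np.argmax(sub.max(axis=0) - sub.min(axis=0)))
        vals = np.sort(sub[:, dim])
        g = int(np.argmax(np.diff(vals)))         # widest gap between centers
        theta = (vals[g] + vals[g + 1]) / 2.0     # threshold inside that gap
        go_left = points[point_idx, dim] <= theta
        c_left = centers[center_idx, dim] <= theta
        recurse(point_idx[go_left], center_idx[c_left])
        recurse(point_idx[~go_left], center_idx[~c_left])

    recurse(np.arange(len(points)), np.arange(len(centers)))
    return labels
```

Because every cut is axis-aligned, each leaf of the resulting tree is a box, so membership in any cluster can be explained by at most k−1 threshold comparisons; the reference centers could come, for instance, from scikit-learn's KMeans (its cluster_centers_ attribute).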
The 7th Workshop on Geometry and Machine Learning, which has a broader theme, will be co-located with CG Week.