IDEAL Workshop on Clustering

Speakers

Vaggos Chatziafratis (UC Santa Cruz, Northwestern, MIT/Northeastern), Eden Chlamtáč (Ben-Gurion University, visiting TTIC), Vincent Cohen-Addad (Google Research), Sanjoy Dasgupta (UC San Diego), Jafar Jafarov (U Chicago), Shi Li (University at Buffalo), Lunjia Hu (Stanford), Liren Shan (Northwestern), Ola Svensson (EPFL), Ali Vakilian (TTIC)

Logistics

  • Dates: Friday, April 22 and Saturday, April 23
  • Location: Northwestern University
  • Room: Mudd Library 3514 on both Friday and Saturday
  • Streaming: Please use the Zoom link above

Schedule for Friday, April 22 — Remote Talks

All times are in the Central Time Zone (CDT; Chicago time).

  • 8:40-9:10: Breakfast
  • 9:10-9:15: Opening Remarks
  • 9:15: Ola Svensson, Nearly-Tight and Oblivious Algorithms for Explainable Clustering (video)
  • 10:15: Vincent Cohen-Addad, Recent Progress on Correlation Clustering: Theory and Practice (video)
  • 11:15: Lunch
  • 12:30: Sanjoy Dasgupta, Statistical consistency in clustering (video)
  • 1:30: Shi Li, Clustering with Outliers: Approximation and Distributed Algorithms (video)
  • 2:30: Coffee Break
  • 3:00: Lunjia Hu, Near-Optimal Explainable k-Means for All Dimensions (video)

Schedule for Saturday, April 23 — In-person Talks

All times are in the Central Time Zone (CDT; Chicago time).

  • 9:00: Breakfast
  • 9:15: Liren Shan, Explainable k-means. Don’t be greedy, plant bigger trees! (video)
  • 10:15: Vaggos Chatziafratis, Hierarchical Clustering: Upper Bounds, Lower Bounds and Some Open Questions (video)
  • 11:15: Lunch
  • 12:30: Jafar Jafarov, Correlation Clustering with Local and Global Objectives (video)
  • 1:30: Eden Chlamtáč, Approximating Fair Clustering with Cascaded Norm Objectives (video)
  • 2:30: Coffee Break
  • 2:45: Ali Vakilian, Individual Fairness for k-Clustering (video)

Titles and Abstracts

Title: Nearly-Tight and Oblivious Algorithms for Explainable Clustering
Speaker: Ola Svensson (EPFL)
Abstract: An important topic in current machine learning research is to explain and/or interpret how models actually make their decisions. Motivated by this, Moshkovitz, Dasgupta, Rashtchian, and Frost recently formalized the problem of explainable clustering. A k-clustering is said to be explainable if it is given by a decision tree where each internal node splits data points with a threshold cut in a single dimension (feature), and each of the k leaves corresponds to a cluster.

In this talk, we present an algorithm that outputs an explainable clustering that loses at most a factor of O(log^2 k) compared to an optimal (not necessarily explainable) clustering for the k-medians objective, and a factor of O(k log^2 k) for the k-means objective. This improves over the previous best upper bounds of O(k) and O(k^2), respectively, and nearly matches the previous Ω(log k) lower bound for k-medians and our new Ω(k) lower bound for k-means. Moreover, the algorithm is remarkably simple and, given an initial not necessarily explainable clustering, it is oblivious to the data points and runs in time O(dk log^2 k), independent of the number of data points n.

Joint work with Buddhima Gamlath, Xinrui Jia, and Adam Polak.
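
To make the explainability constraint concrete, here is a minimal sketch (our own illustration; the tree, features, and thresholds are made up, not taken from the talk) of how a threshold decision tree routes each data point to one of the k clusters:

```python
# Minimal sketch of a threshold decision tree: each internal node tests one
# coordinate (feature) against a threshold; each leaf is one of the k clusters.

def assign(point, node):
    """Recursively route a point through a threshold tree to a leaf/cluster."""
    if "cluster" in node:                     # leaf: return its cluster id
        return node["cluster"]
    if point[node["feature"]] <= node["threshold"]:
        return assign(point, node["left"])
    return assign(point, node["right"])

# Example tree with k = 3 leaves (features and thresholds are made up):
tree = {
    "feature": 0, "threshold": 2.5,
    "left": {"cluster": 0},
    "right": {
        "feature": 1, "threshold": 1.0,
        "left": {"cluster": 1},
        "right": {"cluster": 2},
    },
}

print(assign([1.0, 3.0], tree))   # -> 0
print(assign([4.0, 0.5], tree))   # -> 1
```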

Title: Recent Progress on Correlation Clustering: Theory and Practice
Speaker: Vincent Cohen-Addad (Google Research)
Abstract: Clustering problems naturally arise in unsupervised machine learning and data analysis applications. In this talk, we will focus on the famous correlation clustering objective: given a set of elements where each pair is labeled either + or -, the goal is to find a partition of the elements so as to minimize the number of + pairs that are in different parts plus the number of - pairs that are in the same part.
We will present a new perspective on the problem that led to the first approximation algorithms for the problem in several practically relevant settings (distributed, streaming, online, differentially private). We moreover show that the resulting algorithm yields good clusterings in practice. We will conclude by showing that the Sherali-Adams hierarchy can be used to obtain a 1.994-approximation algorithm for correlation clustering, bypassing the LP integrality gap of 2 and improving upon the 2.05 LP-rounding algorithm of Chawla, Makarychev, Schramm, and Yaroslavtsev (STOC 2015).
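
For concreteness, here is a small sketch (our own toy instance, not code from the talk) of the correlation clustering objective: count the disagreements of a partition with the +/- labels.

```python
# Correlation clustering objective: a + pair split across clusters or a
# - pair placed in the same cluster each counts as one disagreement.

def disagreements(pair_labels, cluster_of):
    """pair_labels: dict mapping a pair (u, v) to '+' or '-'.
    cluster_of: dict mapping each element to its cluster id."""
    cost = 0
    for (u, v), sign in pair_labels.items():
        same_cluster = cluster_of[u] == cluster_of[v]
        if sign == '+' and not same_cluster:   # similar pair split apart
            cost += 1
        elif sign == '-' and same_cluster:     # dissimilar pair kept together
            cost += 1
    return cost

# Toy instance: {a, b} similar; {a, c} and {b, c} dissimilar.
labels = {('a', 'b'): '+', ('a', 'c'): '-', ('b', 'c'): '-'}
print(disagreements(labels, {'a': 0, 'b': 0, 'c': 1}))  # -> 0
print(disagreements(labels, {'a': 0, 'b': 1, 'c': 1}))  # -> 2
```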

Title: Statistical consistency in clustering
Speaker: Sanjoy Dasgupta (UC San Diego)
Abstract: When n data points are drawn from a distribution, a clustering of those points would ideally converge to characteristic sets of the distribution, suitably defined, as n increases. We will look at two such results: for linkage-based hierarchical clustering and for an online version of the k-means algorithm.
Joint work with Kamalika Chaudhuri, Gaurav Mahajan, and Geelon So.

Title: Clustering with Outliers: Approximation and Distributed Algorithms
Speaker: Shi Li (University at Buffalo)
Abstract: Clustering is a fundamental problem in unsupervised learning and data analytics: given a set of data points in some metric space, the problem asks for a partition of the points into k clusters so as to minimize some objective, such as the k-center, k-median, and k-means objectives. A shortcoming of these formulations is that a few outliers can significantly influence the quality and structure of the clustering. To overcome this issue, clustering-with-outliers problems were introduced and have been studied extensively in previous work. In these problems, one is allowed to discard a specified number of points from the clustering.

In this talk, we present several algorithms for clustering with outliers in two settings. We give the first efficient true O(1)-approximation algorithm for the k-median and k-means with outliers problems. In the distributed setting, we present several bi-criteria O(1)-approximation algorithms with a communication complexity of O(km/ε), where m is the number of machines in the distributed system and ε controls by how much the outlier requirement may be violated.
Based on joint works with Ravishankar Krishnaswamy and Sai Sandeep, and with Xiangyu Guo.
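
As a quick illustration of the objective (our own sketch, not the algorithms from the talk), the "with outliers" cost discards the z points that are farthest from their nearest centers and charges only the remaining points:

```python
# Evaluate the k-means / k-median cost of a given set of centers when the
# z farthest points (the outliers) may be discarded.

def cost_with_outliers(points, centers, z, squared=True):
    """k-means cost (squared=True) or k-median cost (squared=False) after
    removing the z points farthest from their nearest centers."""
    def dist2(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

    nearest = [min(dist2(p, c) for c in centers) for p in points]
    if not squared:
        nearest = [d ** 0.5 for d in nearest]
    kept = sorted(nearest)[: max(len(points) - z, 0)]   # drop the z largest
    return sum(kept)

points = [(0, 0), (1, 0), (0, 1), (100, 100)]   # one obvious outlier
print(cost_with_outliers(points, centers=[(0, 0)], z=1))  # -> 2 (outlier ignored)
```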

Title: Near-Optimal Explainable k-Means for All Dimensions
Speaker: Lunjia Hu (Stanford)
Abstract: Explainable clustering is a recently introduced notion that aims to make clusterings interpretable by adding an explainability constraint. The constraint requires that the cluster boundaries are axis-parallel hyperplanes and that the clustering is obtained by applying a decision tree to the data. Since the introduction of the notion, multiple research groups have worked on determining the price of explainability, a key quantity characterizing the increase in the clustering cost due to the explainability constraint.

In this talk, I will give an overview of these recent results and focus on my joint work with Moses Charikar which shows near-optimal bounds on the price of explainability for k-means for all dimensions. I will also talk about the key ideas in our algorithms and analysis used to obtain this result.

Title: Explainable k-means. Don’t be greedy, plant bigger trees!
Speaker: Liren Shan (Northwestern University)
Abstract: We provide a new bi-criteria Õ(log^2 k)-competitive algorithm for explainable k-means clustering. Explainable k-means was recently introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). It is described by an easy-to-interpret and easy-to-understand (threshold) decision tree or diagram. The cost of an explainable k-means clustering equals the sum of the costs of its clusters; and the cost of each cluster equals the sum of squared distances from the points in the cluster to the center of that cluster. The best non-bi-criteria algorithm for explainable clustering is Õ(k)-competitive, and this bound is tight. Our randomized bi-criteria algorithm constructs a threshold decision tree that partitions the data set into (1+δ)k clusters (where δ is a parameter of the algorithm). The cost of this clustering is at most Õ(1/δ · log^2 k) times the cost of the optimal unconstrained k-means clustering. We show that this bound is almost optimal.
Joint work with Konstantin Makarychev.

Title: Hierarchical Clustering: Upper Bounds, Lower Bounds and Some Open Questions
Speaker: Vaggos Chatziafratis (UC Santa Cruz, Northwestern, MIT/Northeastern)
Abstract: Hierarchical Clustering is an important tool for unsupervised learning whose goal is to construct a hierarchical decomposition of a given dataset describing relationships at all levels of granularity simultaneously. Despite its long history, Hierarchical Clustering was underdeveloped from a theoretical perspective, partly because of a lack of suitable objectives and algorithms with guarantees.
In this talk, I want to tell you about recent progress in the area, with an emphasis on connections to approximation problems such as Sparsest Cut and Balanced Cut and on some hardness-of-approximation results; we will also highlight some interesting open problems in the area.

Title: Correlation Clustering with Local and Global Objectives
Speaker: Jafar Jafarov (U Chicago)
Abstract: In the Correlation Clustering problem, we are given a graph whose edges are labeled as “similar” or “dissimilar” by a noisy binary classifier, and the goal is to produce a clustering of the vertex set that agrees with the edge labels as much as possible. Correlation Clustering has mainly been studied under two models where the input graph is (i) complete and unweighted, and (ii) arbitrary and weighted.

In this talk, we introduce a new model of Correlation Clustering that better captures real-life instances. In this model, the input graph is complete with bounded edge weights. We examine the model under an L_p objective, which is a generalization of the standard Correlation Clustering objective, MinDisagree. We give an approximation algorithm and show an almost matching integrality gap. For this, we introduce a new metric space partitioning scheme which might be of independent interest.

Joint work with Sanchit Kalhan, Konstantin Makarychev, and Yury Makarychev.
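
Below is a small sketch of one standard way such an L_p objective is formalized in the local-objectives literature (this formalization is our assumption, not taken from the talk): minimize the l_p norm of the vector of per-vertex disagreements, so that p = 1 recovers (twice) MinDisagree and p = infinity bounds the disagreements at any single vertex.

```python
# l_p norm of the per-vertex disagreement vector: entry u counts the labeled
# pairs incident to u that the clustering gets wrong.

def lp_disagreement(pair_labels, cluster_of, p):
    """pair_labels: dict mapping a pair (u, v) to '+' or '-';
    cluster_of: dict mapping each vertex to its cluster id."""
    disagree = {u: 0 for u in cluster_of}
    for (u, v), sign in pair_labels.items():
        same = cluster_of[u] == cluster_of[v]
        if (sign == '+') != same:          # this pair is a disagreement
            disagree[u] += 1
            disagree[v] += 1
    vec = disagree.values()
    if p == float('inf'):
        return max(vec)
    return sum(d ** p for d in vec) ** (1 / p)
```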

Title: Cascaded Norms in Clustering
Speaker: Eden Chlamtáč (Ben-Gurion University, visiting TTIC)
Abstract: Clustering is one of the most fundamental tasks in various areas such as machine learning and optimization. In theoretical computer science, we are interested in the complexity of finding a “good” clustering, given a data set with some distance function, and a target number of centers to choose from among the input points. Our goal is to find a set of centers (of the required cardinality) which minimizes some cost function which aggregates the distances of all input points from their respective nearest centers. This includes well-studied notions such as k-Medians Clustering and k-Means Clustering.

More recently, there has been a focus on ‘fairness’ in clustering, in which we want not only to take the global cost into consideration but also to counteract structural bias against marginalized groups. To this end, one first aggregates the costs incurred within the given groups of interest, before aggregating the costs incurred by these groups.

We focus on a very general notion of fairness – the input consists of data points, a target number of centers, and a collection of groups represented by different weight functions. The objective we wish to minimize is the ell_q norm of the group costs, where each group cost is computed as the (weighted) ell_p norm of distances of points in the group to their respective nearest centers. We study this problem from the point of view of approximation algorithms, giving algorithms for all values of p and q that smoothly interpolate between optimal and near-optimal approximations for fundamental parameter settings of (p,q), such as (infinity, q), (p, infinity), and (p,p).

Based on joint work with Ali Vakilian and Yury Makarychev.
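
For concreteness, here is a minimal sketch (our own; the function and variable names are illustrative) of evaluating the cascaded (p, q)-norm objective for a fixed set of centers:

```python
# Cascaded-norm objective: each group's cost is the (weighted) l_p norm of its
# points' distances to their nearest centers; the overall objective is the
# l_q norm of the group costs.

def lp_norm(values, p):
    if p == float('inf'):
        return max(values)
    return sum(v ** p for v in values) ** (1 / p)

def cascaded_cost(points, groups, weights, centers, p, q):
    """points: list of coordinate tuples; groups: list of lists of point
    indices; weights[g]: dict mapping a point index in group g to its weight."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    nearest = [min(dist(pt, c) for c in centers) for pt in points]
    group_costs = [
        lp_norm([weights[g][i] * nearest[i] for i in grp], p)
        for g, grp in enumerate(groups)
    ]
    return lp_norm(group_costs, q)

pts = [(0.0,), (1.0,), (4.0,)]
grp = [[0, 1], [2]]                        # two groups of point indices
w = [{0: 1.0, 1: 1.0}, {2: 1.0}]
print(cascaded_cost(pts, grp, w, centers=[(0.0,)], p=2, q=float('inf')))  # -> 4.0
```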

Title: Individual Fairness for k-Clustering
Speaker: Ali Vakilian (TTIC)
Abstract: We present approximation algorithms for k-median and k-means (and more generally for any k-clustering with an Lp-norm objective) from the perspective of individual fairness. More precisely, for a point x in a point set P of size n, let r(x) be the minimum radius such that the ball of radius r(x) centered at x has at least n/k points from P. Intuitively, if a set of k random points is chosen from P as centers, every point x in P expects to have a center within radius r(x). An individually fair clustering provides such a guarantee for every point in P. This notion of fairness was introduced in [Jung et al., 2019], where they showed how to get an approximately feasible k-clustering with respect to this fairness condition. In this talk, we show how to get bicriteria approximation algorithms for fair k-clustering: the clustering cost of our solution is within a constant factor of the cost of an optimal fair k-clustering, and our solution approximately satisfies the fairness condition (also within a constant factor).
Based on joint works with Sepideh Mahabadi and Mustafa Yalcıner.
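
A short sketch (our own illustration) of computing the fairness radius r(x) for every point, i.e., the distance from x to its ceil(n/k)-th nearest point in P (counting x itself):

```python
import math

def fairness_radii(P, k, dist):
    """Return r(x) for every x in P: the smallest radius around x whose ball
    contains at least n/k points of P (including x itself)."""
    n = len(P)
    m = math.ceil(n / k)
    radii = {}
    for x in P:
        d = sorted(dist(x, y) for y in P)   # d[0] == 0 (x itself)
        radii[x] = d[m - 1]
    return radii

P = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
euclid = lambda a, b: math.dist(a, b)
print(fairness_radii(P, k=2, dist=euclid)[(0, 0)])  # radius capturing 3 of 6 points
```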

About the Series

The IDEAL workshop series brings in experts on topics related to the foundations of data science to present their perspective and research on a common theme. The intended audience is Chicago-area researchers with an interest in the foundations of data science.

This workshop is part of the Special Quarter on High-Dimensional Data Analysis. It is organized by Konstantin Makarychev (NU) and Yury Makarychev (TTIC).