Modeling & Exploiting Data Heterogeneity under Distribution Shifts


1:45-4:15 PM, December 11, 2023

Ernest N. Morial Convention Center, Hall B1, New Orleans, USA

Data heterogeneity is a key determinant of the performance of ML systems. Standard algorithms that optimize for average-case performance do not consider the presence of diversity within data. As a result, variations in data sources, data generation mechanisms, and sub-populations lead to unreliable decision-making, poor generalization, unfairness, and false scientific discoveries. Carefully modeling data heterogeneity is a necessary step in building reliable data-driven systems. Its rigorous study is a nascent field of research spanning several disciplines, including statistics, causal inference, machine learning, economics, and operations research. 

In this tutorial, we aim to foster interdisciplinary research by developing a unified view, based on a shared language, of the disparate intellectual threads pursued by different communities. Drawing upon several separate literatures, we establish a taxonomy of heterogeneity and present quantitative measures and learning algorithms that account for heterogeneous data. To spur empirical progress, we conclude by discussing validation protocols and benchmarking practices.

Outline


Speakers

Peng Cui 

Tsinghua University

Hongseok Namkoong 

Columbia University

Jiashuo Liu 

Tsinghua University

Tiffany (Tianhui) Cai 

Columbia University

Invited Panelists

Shalmali Joshi 

Columbia University

Aditi Raghunathan 

Carnegie Mellon University

Dominik Rothenhäusler 

Stanford University