Modeling & Exploiting Data Heterogeneity under Distribution Shifts
1:45–4:15 PM, December 11, 2023
Ernest N. Morial Convention Center, Hall B1, New Orleans, USA
Data heterogeneity is a key determinant of the performance of ML systems. Standard algorithms that optimize for average-case performance do not consider the presence of diversity within data. As a result, variations in data sources, data generation mechanisms, and sub-populations lead to unreliable decision-making, poor generalization, unfairness, and false scientific discoveries. Carefully modeling data heterogeneity is a necessary step in building reliable data-driven systems. Its rigorous study is a nascent field of research spanning several disciplines, including statistics, causal inference, machine learning, economics, and operations research.
In this tutorial, we bring together the disparate intellectual threads developed by different communities. We aim to foster interdisciplinary research by providing a unified view based on a shared language. Drawing upon several separate literatures, we establish a taxonomy of heterogeneity and present quantitative measures and learning algorithms that consider heterogeneous data. To spur empirical progress, we conclude by discussing validation protocols and benchmarking practices.
Outline
Tutorial Part
Introduction & Opening Remarks
Data Heterogeneity Problem
A Critical View on Existing Approaches
A New Philosophy: Data Heterogeneity is Application-Specific
Towards Heterogeneity-Aware Machine Learning
Future Directions & QA
Panel Part
Share Insights on Data Heterogeneity
Biomedical Informatics
Statistics & Causal Inference
Operations Research
Application
Discussion on the Overlaps between Different Fields
Speakers
Tsinghua University
Columbia University
Tsinghua University
Columbia University
Invited Panelists
Columbia University
Carnegie Mellon University
MIT
Stanford University