Modeling & Exploiting Data Heterogeneity under Distribution Shifts

1:45-4:15 PM, December 11, 2023

Ernest N. Morial Convention Center, Hall B1, New Orleans, USA

Data heterogeneity is a key determinant of the performance of ML systems. Standard algorithms that optimize for average-case performance do not consider the presence of diversity within data. As a result, variations in data sources, data generation mechanisms, and sub-populations lead to unreliable decision-making, poor generalization, unfairness, and false scientific discoveries. Carefully modeling data heterogeneity is a necessary step in building reliable data-driven systems. Its rigorous study is a nascent field of research spanning several disciplines, including statistics, causal inference, machine learning, economics, and operations research.

In this tutorial, we develop a unified view of the disparate intellectual threads that different communities have pursued, with the aim of fostering interdisciplinary research through a shared language. Drawing on these separate literatures, we establish a taxonomy of heterogeneity and present quantitative measures and learning algorithms that account for heterogeneous data. To spur empirical progress, we conclude by discussing validation protocols and benchmarking practices.

Outline

Tutorial Part:
- Introduction & Opening Remarks
- Data Heterogeneity Problem
- A Critical View on Existing Approaches
- A New Philosophy: Data Heterogeneity is Application-Specific
- Towards Heterogeneity-Aware Machine Learning
- Future Directions and Q&A

Panel Part:
- Share Insights on Data Heterogeneity
  - Biomedical Informatics
  - Statistics & Causal Inference
  - Operations Research
  - Applications
- Discussion on the Overlaps between Different Fields