RLSbench: Domain Adaptation Under Relaxed Label Shift

(To appear at) International Conference on Machine Learning (ICML), 2023

TL;DR: A large-scale study of domain adaptation methods in settings where both the label distribution and the class conditionals p(x|y) may shift. It highlights the brittleness of existing methods and proposes simple fixes that improve their performance.


Despite the emergence of principled methods for domain adaptation under label shift, the sensitivity of these methods to minor shifts in the class-conditional distributions remains precariously underexplored. Meanwhile, popular deep domain adaptation heuristics tend to falter when faced with shifts in label proportions. While several papers attempt to adapt these heuristics to accommodate shifts in label proportions, inconsistencies in evaluation criteria, datasets, and baselines make it hard to assess the state of the art. In this paper, we introduce RLSbench, a large-scale relaxed label shift benchmark consisting of >500 distribution shift pairs that draw on 14 datasets across vision, tabular, and language modalities and compose them with varying label proportions. First, we evaluate 13 popular domain adaptation methods, demonstrating more widespread failures under label proportion shifts than were previously known. Next, we develop an effective two-step meta-algorithm that is compatible with most deep domain adaptation heuristics: (i) pseudo-balance the data at each epoch; and (ii) adjust the final classifier with (an estimate of) the target label distribution. The meta-algorithm often improves existing domain adaptation heuristics by 2-10% accuracy points under extreme label proportion shifts and has little (i.e., <0.5%) effect when label proportions do not shift. We hope that these findings and the availability of RLSbench will encourage researchers to rigorously evaluate proposed methods in relaxed label shift settings.

Motivation and Setup

RLSbench: Relaxed Label Shift Benchmark

A standardized test bed of >500 distribution shift pairs with varying severity of shift in target class proportions across 14 multi-domain datasets. 
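As a rough illustration of how a target set with shifted class proportions might be simulated, the sketch below draws target label proportions from a symmetric Dirichlet distribution and resamples examples to match them. The Dirichlet choice, the `alpha` value, and the helper names `sample_label_proportions` and `resample_to_proportions` are assumptions made for this example, not a description of the benchmark's exact procedure. Smaller `alpha` values give more skewed proportions, i.e., a more severe shift.

```python
import random
from collections import defaultdict


def sample_label_proportions(num_classes, alpha, rng):
    """Draw class proportions from a symmetric Dirichlet(alpha).

    Built from independent Gamma(alpha, 1) draws normalized to sum to 1.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_classes)]
    total = sum(draws)
    return [d / total for d in draws]


def resample_to_proportions(labels, proportions, size, rng):
    """Return indices into `labels` whose class frequencies follow `proportions`.

    Assumes labels are integers in [0, len(proportions)); samples with replacement.
    """
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    classes = sorted(by_class)
    weights = [proportions[c] for c in classes]
    chosen_classes = rng.choices(classes, weights=weights, k=size)
    return [rng.choice(by_class[c]) for c in chosen_classes]


rng = random.Random(0)
labels = [i % 10 for i in range(1000)]  # toy balanced 10-class label array
target_props = sample_label_proportions(10, alpha=0.5, rng=rng)
target_idx = resample_to_proportions(labels, target_props, size=500, rng=rng)
```

Sampling with replacement keeps the sketch short; a real pipeline would more likely subsample without replacement so that no target example is duplicated.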

We evaluate a collection of 12 popular DA methods: (i) Domain invariant learning, e.g., DANN, CDANN, IW-CDANN; (ii) Self-training, e.g., PseudoLabel, FixMatch, NoisyStudent, SENTRY; (iii) Test-time adaptation, e.g., TENT, BN-adapt, CORAL.

RLSbench Datasets

Example of RLSbench settings on CIFAR10

Popular DA Methods Falter When Faced With Shifts in Target Label Proportions

Meta-Algorithm Summary

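We implement two simple, general-purpose corrections: (i) Re-Sampling (RS), which pseudo-balances the data at each epoch, and (ii) Re-Weighting (RW), which adjusts the final classifier with an estimate of the target label distribution.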
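The two corrections can be sketched in a few lines, under stated assumptions. Here `balanced_resample` stands in for RS by drawing examples so that every (pseudo-)label is equally likely, and `reweight_probs` stands in for RW by rescaling a classifier's predicted probabilities by the ratio of target to source class priors and renormalizing. Both function names are hypothetical, the target prior is assumed to be supplied by some label-distribution estimator, and the source prior defaults to uniform on the assumption that training used balanced batches.

```python
import random
from collections import defaultdict


def balanced_resample(labels, size, rng):
    """RS sketch: draw indices so that each (pseudo-)label is equally likely."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    classes = sorted(by_class)
    # Pick a class uniformly, then an example uniformly within that class.
    return [rng.choice(by_class[rng.choice(classes)]) for _ in range(size)]


def reweight_probs(probs, target_prior, source_prior=None):
    """RW sketch: rescale probabilities by target_prior / source_prior, then renormalize."""
    k = len(probs)
    if source_prior is None:
        source_prior = [1.0 / k] * k  # assumption: classifier trained on balanced data
    scaled = [p * t / s for p, t, s in zip(probs, target_prior, source_prior)]
    total = sum(scaled)
    return [v / total for v in scaled]


rng = random.Random(0)
pseudo_labels = [0] * 90 + [1] * 10  # heavily imbalanced toy pseudo-labels
idx = balanced_resample(pseudo_labels, size=1000, rng=rng)

# A prediction that favors class 0 flips once the estimated target prior is applied.
adjusted = reweight_probs([0.6, 0.4], target_prior=[0.1, 0.9])
```

In this toy case the raw prediction favors class 0, but after re-weighting with a target prior that puts 90% of the mass on class 1, the adjusted probabilities favor class 1.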

Performance of DA Methods Improves When Paired With Our Meta-Algorithm (RS+RW)

We show results for three algorithms on the vision modality. Results for other methods and modalities are in the paper.

To cite this paper, please use the following reference: 


Nick Erickson


James Sharpnack


Alex Smola


Siva Balakrishnan


Zack Lipton


For questions, please contact us at: sgarg2@andrew.cmu.edu