Evaluating Models Beyond the Textbook: Out-of-distribution and Without Labels

Course Description

Evaluating the performance of a trained model is a crucial step in both machine learning research and practice. The predominant protocol is to evaluate a model on a held-out test set that is (i) fully labeled and (ii) drawn from the same distribution as the training set. This paradigm has led to tremendous progress on a wide range of benchmarks such as ImageNet, MS COCO, and Pascal VOC.

As computer vision is increasingly deployed in challenging settings such as autonomous vehicles or healthcare, there are many scenarios where the standard evaluation protocol is either not applicable or offers only limited insight into the model of interest. For instance, relevant test data often has few or even no ground-truth labels when data annotation is expensive. Moreover, in-distribution accuracy may be only a weak predictor of performance on future data if a model is deployed on out-of-distribution data or exposed to adversarial attacks. Hence it is important to develop new evaluation schemes that assess model performance more comprehensively and extend the applicability of vision models to more real-world scenarios where annotated data is scarce.

Our tutorial will give a broad overview of machine learning evaluation with a focus on the two aforementioned issues: evaluation without labels and out-of-distribution evaluation. Motivated by the evident lack of robustness in image classification, researchers have proposed a myriad of evaluation settings over the past five years, including adversarial perturbations, distribution shifts arising in videos, dataset shifts, image corruptions, and geometric perturbations. We will survey the current landscape of out-of-distribution evaluation, highlighting differences and similarities between the various robustness notions. Moreover, we will discuss how to evaluate models when few or no labels are available on the test data, which is particularly important when a model's performance needs to be assessed on new out-of-distribution data that has not yet been labeled.

Topics to be covered

Tutorial Materials


Time: Monday, 6/20/2022; Central Daylight Time (CDT)

Location: CVPR 2022, New Orleans, Louisiana, USA

Tutorial Recording

Lecture 1: The robust optimization framework for evaluation [Slides]

Lecture 2: A data-centric view on robustness [Slides]

Lecture 3: Unsupervised Model Evaluation [Slides]


Dr. Liang Zheng (Australian National University)

Dr. Ludwig Schmidt (University of Washington)

Dr. Aditi Raghunathan (Carnegie Mellon University)

Dr. Weijian Deng (Australian National University)