Evaluating Models Beyond the Textbook: Out-of-distribution and Without Labels

Course Description

Evaluating the performance of a trained model is a crucial step in both machine learning research and practice. The predominant evaluation protocol is to test a model on a held-out set that is (i) fully labeled and (ii) drawn from the same distribution as the training set. This paradigm has driven tremendous progress on a wide range of benchmarks such as ImageNet, MS COCO, and Pascal VOC.

As computer vision is increasingly deployed in challenging settings such as autonomous vehicles or healthcare, there are many scenarios where the standard evaluation protocol is either not applicable or offers only limited insight into the model of interest. For instance, when annotation is expensive, relevant test data often come with few or even no ground-truth labels. Moreover, in-distribution accuracy may be only a weak predictor of performance on future data if a model is deployed on out-of-distribution data or exposed to adversarial attacks. Hence it is important to develop new evaluation schemes that characterize model performance more comprehensively and extend the applicability of vision models to real-world scenarios where annotated data is scarce.

Our tutorial will give a broad overview of machine learning evaluation with a focus on the two aforementioned issues: evaluation without labels and evaluation on out-of-distribution data. Motivated by the well-documented lack of robustness in image classification, researchers have proposed a myriad of evaluation settings over the past five years, such as adversarial perturbations, distribution shifts arising in videos, dataset shifts, image corruptions, and geometric perturbations. We will survey the current landscape of out-of-distribution evaluation, highlighting differences and similarities between the various robustness notions. Moreover, we will discuss how to evaluate models when few or no labels are available for the test data, which is particularly important when the performance of a model needs to be assessed on new out-of-distribution data for which no labels exist yet.

Topics to be covered

  • A brief review and history of classical model evaluation.

  • Representing and evaluating models on unlabeled in-distribution data (e.g., spectral norms of network layers; see the spectral-norm sketch after this list).

  • Evaluating models on unlabeled out-of-distribution data.

  • Dataset representations and dataset-dataset similarities (e.g., set representations and their extensions, Fréchet distance, and learned similarities; see the Fréchet-distance sketch after this list).

  • Robustness to natural distribution shifts / dataset shifts.

  • Relationships between different robustness notions (adversarial, image corruptions, dataset shift, etc.).

  • Robust optimization framework for evaluation (see the schematic formulation after this list).

  • Fine-grained evaluation on sub-populations.

  • Evaluation on instance-wise perturbations.
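
As a concrete illustration of the second bullet above, the following is a minimal sketch (our own example, assuming recent PyTorch and torchvision; resnet18 is only a placeholder model) of computing the spectral norm of every layer's weight matrix, a quantity that requires no labels and appears in norm-based measures of model complexity. Note that flattening a convolution kernel to a matrix is a common proxy for, not identical to, the true convolution operator norm.

    # Minimal sketch (illustrative, not code from the tutorial): spectral norms
    # of network layers as a label-free quantity describing a trained model.
    import torch
    import torch.nn as nn
    from torchvision.models import resnet18  # placeholder model for the example

    def layer_spectral_norms(model: nn.Module) -> dict:
        """Return the largest singular value of each Linear/Conv2d weight."""
        norms = {}
        for name, module in model.named_modules():
            if isinstance(module, (nn.Linear, nn.Conv2d)):
                # Conv kernels are flattened to (out_channels, -1); this is a
                # common proxy for, not equal to, the conv operator norm.
                weight = module.weight.detach().flatten(start_dim=1)
                norms[name] = torch.linalg.matrix_norm(weight, ord=2).item()
        return norms

    if __name__ == "__main__":
        for name, sigma in layer_spectral_norms(resnet18(weights=None)).items():
            print(f"{name}: {sigma:.3f}")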
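
For the dataset-dataset similarity bullet, the sketch below (again our own example, with feature extraction omitted) computes the Fréchet distance between two datasets, each summarized by the mean and covariance of its feature vectors, as in FID-style comparisons.

    # Minimal sketch (illustrative): Fréchet distance between two datasets
    # represented as sets of feature vectors, under a Gaussian summary
    # (mean + covariance). feats_a / feats_b are assumed (n, d) arrays.
    import numpy as np
    from scipy import linalg

    def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
        mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
        cov_a = np.cov(feats_a, rowvar=False)
        cov_b = np.cov(feats_b, rowvar=False)
        covmean = linalg.sqrtm(cov_a @ cov_b)  # matrix square root of the product
        if np.iscomplexobj(covmean):
            covmean = covmean.real             # drop tiny imaginary numerical noise
        diff = mu_a - mu_b
        return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        feats_a = rng.normal(size=(500, 64))           # e.g., features of dataset A
        feats_b = rng.normal(loc=1.0, size=(500, 64))  # e.g., features of dataset B
        print(f"Fréchet distance: {frechet_distance(feats_a, feats_b):.2f}")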
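
Finally, the robust optimization view of evaluation can be written schematically (our notation, not a formulation taken from any single paper below) as the worst-case risk of a fixed model f over a family of plausible test distributions:

    R_{\mathrm{rob}}(f) \;=\; \sup_{Q \in \mathcal{Q}} \; \mathbb{E}_{(x,y) \sim Q}\bigl[\ell(f(x), y)\bigr]

Different choices of the family \mathcal{Q} recover different robustness notions: distributions within a bounded divergence of the training distribution correspond to dataset shift, while distributions induced by small perturbations of test inputs correspond to adversarial or corruption robustness.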

Tutorial Materials

  1. A. Raghunathan, J. Steinhardt, and P. Liang. Certified Defenses against Adversarial Examples. In ICLR 2018 [Paper][Certificates of robustness for neural networks]

  2. B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do ImageNet Classifiers Generalize to ImageNet? In ICML 2019 [Paper][ImageNet V2 (new test set)]

  3. A. Raghunathan*, S. M. Xie*, F. Yang, J. Duchi, and P. Liang. Understanding and Mitigating the Tradeoff Between Robustness and Accuracy. In ICML 2020 [Paper][Tradeoff between adversarial robustness and standard accuracy]

  4. R. Taori, A. Dave, V. Shankar, N. Carlini, B. Recht, and L. Schmidt. Measuring Robustness to Natural Distribution Shifts in Image Classification. In NeurIPS 2020 [Paper][ImageNet Testbed]

  5. W. Deng, S. Gould, and L. Zheng. What Does Rotation Prediction Tell Us About Classifier Accuracy Under Varying Testing Environments? In ICML 2021 [Paper][Self-supervision for model evaluation]

  6. E. Liu*, B. Haghgoo*, A. Chen*, A. Raghunathan, P. W. Koh*, S. Sagawa*, P. Liang, and C. Finn. Just Train Twice: Improving Group Robustness without Training Group Information. In ICML 2021 [Paper][Group Robustness]

  7. V. Shankar, A. Dave, R. Roelofs, D. Ramanan, B. Recht, and L. Schmidt. Do Image Classifiers Generalize Across Time? In ICCV 2021 [Paper][Robustness of image classifiers to temporal perturbations]

  8. D. Guillory, V. Shankar, S. Ebrahimi, T. Darrell, and L. Schmidt. Predicting with Confidence on Unseen Distributions. In ICCV 2021 [Average confidence for accuracy prediction]

  9. X. Sun, Y. Hou, W. Deng, H. Li, and L. Zheng. Ranking Models in Unlabeled New Environments. In ICCV 2021 [Paper][Proxy test set for ranking models]

  10. W. Deng and L. Zheng. Are Labels Always Necessary for Classifier Accuracy Evaluation? In CVPR 2021 & TPAMI 2021 [Paper][Unsupervised model evaluation]


Time: Monday, June 20, 2022, Central Daylight Time (CDT)

Location: CVPR 2022, New Orleans, Louisiana, USA

Tutorial Recording

Lecture 1: The Robust Optimization Framework for Evaluation [Slides]

Lecture 2: A Data-Centric View on Robustness [Slides]

Lecture 3: Unsupervised Model Evaluation [Slides]


Speakers

Dr. Liang Zheng (Australian National University)

Dr. Ludwig Schmidt (University of Washington)

Dr. Aditi Raghunathan (Carnegie Mellon University)

Mr. Weijian Deng (Australian National University)