When Does Contrastive Visual Representation Learning Work?


Self-supervised learning (SSL) has made remarkable progress, but most of that progress has been measured by pretraining on ImageNet. Does it hold up in settings that are not ImageNet-like? Under what conditions do contrastive methods like SimCLR learn good representations?
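For concreteness, SimCLR's contrastive objective is the NT-Xent loss: two augmented views of the same image form a positive pair, and all other views in the batch act as negatives. Below is a minimal numpy sketch of that loss (the function name and shapes are illustrative, not the paper's training code):

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss as used by SimCLR (minimal sketch).

    z1, z2: (N, D) embeddings of two augmented views of the same N images.
    Row i of z1 and row i of z2 form a positive pair; everything else in
    the batch serves as a negative.
    """
    # L2-normalize so dot products are cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)            # (2N, D)
    sim = z @ z.T / temperature                     # (2N, 2N) logits
    np.fill_diagonal(sim, -np.inf)                  # exclude self-similarity
    n = z1.shape[0]
    # the positive for row i is row i + n (and vice versa)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # cross-entropy of each row's softmax against its positive
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return (logsumexp - sim[np.arange(2 * n), pos]).mean()
```

The loss is low when each image's two views embed close together and far from the rest of the batch, which is exactly the property the questions below probe: whether it transfers across dataset sizes, domains, corruptions, and task granularities.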

We investigate:

  1. Dataset size: "How many unlabeled images do I need for pretraining? And how many labeled images do I need for linear classifier training or fine-tuning?"

  2. Data domain: "How similar do my pretraining and downstream domains need to be?"

  3. Data quality: "What if my pretraining images are corrupted or distribution shifted relative to my downstream domain?"

  4. Task granularity: "Are SSL representations useful if my downstream task is fine-grained?"
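The "linear classifier training" in question 1 refers to the standard linear evaluation protocol: freeze the pretrained backbone and train only a softmax classifier on its features (fine-tuning instead updates the whole network). A toy sketch, with a stand-in random-projection encoder where a real SSL-pretrained backbone (e.g. a SimCLR ResNet-50) would go:

```python
import numpy as np

def pretrained_encoder(x):
    # Hypothetical stand-in for a FROZEN SSL-pretrained backbone;
    # in practice this would be e.g. a ResNet-50 trained with SimCLR.
    rng = np.random.default_rng(0)
    W = rng.normal(size=(x.shape[1], 32))
    return np.maximum(x @ W, 0.0)  # fixed projection + ReLU

def linear_probe(train_x, train_y, num_classes, lr=0.1, steps=500):
    """Linear evaluation: train only a softmax classifier on frozen features."""
    feats = pretrained_encoder(train_x)             # backbone is not updated
    W = np.zeros((feats.shape[1], num_classes))
    onehot = np.eye(num_classes)[train_y]
    for _ in range(steps):
        logits = feats @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        # gradient of mean cross-entropy w.r.t. the classifier weights
        W -= lr * feats.T @ (p - onehot) / len(train_y)
    return W
```

Because only the final linear layer is trained, probe accuracy directly measures how linearly separable the frozen representation makes the downstream classes, which is why it is the standard diagnostic for representation quality.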

We study patterns in SSL behavior across four large-scale datasets:


  • ImageNet: 1.2M images, object classification

  • iNaturalist 2021: 2.7M images, fine-grained classification

  • Places365: 1.8M images, scene classification

  • GeoLifeCLEF 2020: 1.1M images, land cover classification


Some of the key findings:

  • [Figure 2] If you want to get close to supervised performance, you still need lots of labeled data after SSL pretraining.

  • [Table 2] SSL on a larger, more diverse dataset can actually hurt performance.

  • [Table 2, Table 3] Combining representations from different domains does not lead to better generalization.

  • [Figure 4] SSL is much more sensitive to corrupted pretraining images than supervised learning.

  • [Figure 5] There is a surprisingly large gap between SSL and supervised performance for fine-grained classification.

  • Much more in the paper!