When Does Contrastive Visual Representation Learning Work?

Overview

Self-supervised learning (SSL) has made remarkable progress, but most of that progress comes from pretraining on ImageNet. Does it hold up in settings that are not ImageNet-like? Under what conditions do contrastive methods like SimCLR learn good representations?

We investigate:

  1. Dataset size: "How many unlabeled images do I need for pretraining? And how many labeled images do I need for linear classifier training or fine-tuning?" (See the sketch after this list.)

  2. Data domain: "How similar do my pretraining and downstream domains need to be?"

  3. Data quality: "What if my pretraining images are corrupted, or distribution-shifted relative to my downstream domain?"

  4. Task granularity: "Are SSL representations useful if my downstream task is fine-grained?"
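
To make the two downstream protocols in question 1 concrete, here is a minimal PyTorch sketch of linear evaluation (frozen encoder, train only a linear classifier) versus end-to-end fine-tuning. The encoder checkpoint, class count, and optimizer settings below are illustrative placeholders, not the exact setup used in the paper.

```python
import torch
import torch.nn as nn
import torchvision

# Placeholder backbone: a ResNet-50 whose weights would come from SSL
# pretraining (e.g. SimCLR); here we only build the architecture.
encoder = torchvision.models.resnet50(weights=None)
feat_dim = encoder.fc.in_features
encoder.fc = nn.Identity()  # expose the 2048-d features
# encoder.load_state_dict(torch.load("simclr_pretrained.pth"))  # hypothetical checkpoint

num_classes = 1000  # e.g. ImageNet; set per downstream task
classifier = nn.Linear(feat_dim, num_classes)

def downstream_parameters(mode: str):
    """Return the parameters to optimize under each evaluation protocol."""
    if mode == "linear":      # linear evaluation: the encoder stays frozen
        encoder.eval()
        for p in encoder.parameters():
            p.requires_grad = False
        return classifier.parameters()
    elif mode == "finetune":  # fine-tuning: update encoder and classifier
        encoder.train()
        for p in encoder.parameters():
            p.requires_grad = True
        return list(encoder.parameters()) + list(classifier.parameters())
    raise ValueError(mode)

optimizer = torch.optim.SGD(downstream_parameters("linear"), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def training_step(images, labels):
    # With a frozen encoder this is a linear probe; in "finetune" mode the
    # gradients also flow into the backbone.
    logits = classifier(encoder(images))
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```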

We study patterns in SSL behavior across four large-scale datasets:

  • ImageNet: 1.2M images (Object Classification)

  • iNaturalist 2021: 2.7M images (Fine-Grained Classification)

  • Places365: 1.8M images (Scene Classification)

  • GeoLifeCLEF 2020: 1.1M images (Land Cover Classification)
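
Question 1 above asks how pretraining behaves as the amount of unlabeled data changes. One simple way to run such experiments, sketched below, is to draw fixed-size random subsets of a pretraining dataset. The dataset path, transform, and subset sizes are hypothetical placeholders, not the paper's exact experimental grid.

```python
import torch
import torchvision
from torch.utils.data import Subset, DataLoader

# Hypothetical ImageFolder-style layout for the unlabeled pretraining images.
full_dataset = torchvision.datasets.ImageFolder(
    root="data/pretrain_images",  # placeholder path
    transform=torchvision.transforms.ToTensor(),
)

def random_subset(dataset, size: int, seed: int = 0):
    """Draw a reproducible random subset of `size` images for pretraining."""
    generator = torch.Generator().manual_seed(seed)
    indices = torch.randperm(len(dataset), generator=generator)[:size]
    return Subset(dataset, indices.tolist())

# Illustrative subset sizes for a pretraining-set-size sweep.
for size in [50_000, 250_000, 1_000_000]:
    subset = random_subset(full_dataset, min(size, len(full_dataset)))
    loader = DataLoader(subset, batch_size=256, shuffle=True, num_workers=8)
    # ...run SSL pretraining (e.g. SimCLR) on `loader`, then evaluate downstream.
```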

Highlights

  • [Figure 2] If you want to get close to supervised performance, then you still need lots of labeled data after SSL.

  • [Table 2] SSL on a larger, more diverse dataset can actually hurt performance.

  • [Table 2, Table 3] Combining representations from different domains does not lead to better generalization.

  • [Figure 4] SSL is much more sensitive to corrupted pretraining images than supervised learning.

  • [Figure 5] There is a surprisingly large gap between SSL and supervised performance for fine-grained classification.

  • Much more in the paper!