When Does Contrastive Visual Representation Learning Work?
Self-supervised learning (SSL) has made remarkable progress, but most of that progress has been measured by pretraining on ImageNet. Does it hold up in settings that are not ImageNet-like? Under what conditions do contrastive methods like SimCLR learn good representations? We organize the study around four questions:
Dataset size: "How many unlabeled images do I need for pretraining? And how many labeled images do I need for linear classifier training or fine-tuning?"
Data domain: "How similar do my pretraining and downstream domains need to be?"
Data quality: "What if my pretraining images are corrupted or distribution-shifted relative to my downstream domain?"
Task granularity: "Are SSL representations useful if my downstream task is fine-grained?"
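For readers less familiar with SimCLR, the following is a minimal sketch of the contrastive objective it is usually described with (NT-Xent, a temperature-scaled cross-entropy over cosine similarities). It is illustrative only and is not the code used in the paper: the function names, the pure-Python list representation, and the default temperature of 0.5 are assumptions made for this example.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes v is not all zeros).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """Sketch of the NT-Xent loss for a batch of paired embeddings.

    z_a[i] and z_b[i] are embeddings of two augmented views of image i.
    All names and defaults here are hypothetical, chosen for illustration.
    """
    n = len(z_a)
    z = [l2_normalize(v) for v in z_a + z_b]  # 2n unit vectors
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same image
        # Temperature-scaled cosine similarity to every embedding except itself.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(2 * n)
            if j != i
        }
        log_denominator = math.log(sum(math.exp(s) for s in logits.values()))
        # Cross-entropy with the positive pair as the correct "class".
        total += log_denominator - logits[pos]
    return total / (2 * n)

# Two toy images with 2-D embeddings.
views_a = [[1.0, 0.0], [0.0, 1.0]]
matched_b = [[1.0, 0.0], [0.0, 1.0]]     # each view agrees with its partner
mismatched_b = [[0.0, 1.0], [1.0, 0.0]]  # each view agrees with the wrong image

print(nt_xent_loss(views_a, matched_b))
print(nt_xent_loss(views_a, mismatched_b))
```

The loss is lower when the two views of the same image have similar embeddings and higher when a view is closer to a different image, which is the behavior contrastive pretraining optimizes for.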
We study patterns in SSL behavior across four large-scale datasets. Our main findings:
[Figure 2] If you want to get close to supervised performance, then you still need lots of labeled data after SSL.
[Table 2] SSL on a larger, more diverse dataset can actually hurt performance.
[Table 2, Table 3] Combining representations from different domains does not lead to better generalization.
[Figure 4] SSL is much more sensitive to corrupted pretraining images than supervised learning.
[Figure 5] There is a surprisingly large gap between SSL and supervised performance for fine-grained classification.
Much more in the paper!