We adopt the BEiTv2 pipeline and train a Vision Transformer (ViT) model from scratch on the target dataset in a self-supervised manner. The base BEiT model consists of:
12 transformer blocks,
12 attention heads per block,
a hidden dimension of 768.
The model uses a patch size of 16 × 16 and resizes input images to 224 × 224. The vocabulary size is set to 8K, and the visual tokenizer is adopted from the original BEiTv2 paper, where it was pretrained on ImageNet-1K to produce patch tokens with compact semantic information.
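For concreteness, these backbone hyperparameters can be collected in a small configuration object. This is a minimal sketch with field names of our own choosing, not the authors' code:

```python
from dataclasses import dataclass

@dataclass
class BEiTv2PretrainConfig:
    """Backbone hyperparameters from the text; field names are our own."""
    depth: int = 12          # transformer blocks
    num_heads: int = 12      # attention heads per block
    embed_dim: int = 768     # hidden dimension
    patch_size: int = 16     # 16 × 16 patches
    img_size: int = 224      # input resolution after resizing
    vocab_size: int = 8192   # 8K visual-token vocabulary
```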
Datasets: CIFAR-100, CIFAR-80N, Animal-10N, WebVision.
Training Duration: 200 epochs.
Optimizer: AdamW with a weight decay of 0.05.
Learning Rate Schedule: Cosine annealing with an initial learning rate of 1 × 10⁻³.
Warm-up Phase: First 10K iterations.
Regularization: Stochastic depth with a drop path rate of 0.1.
Stabilization: Layer-wise learning rate decay.
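The pretraining recipe above (AdamW, cosine annealing with linear warm-up, and layer-wise learning rate decay) can be combined as in the following sketch. The layer-wise decay factor, the total step count, and the timm-style parameter names are our assumptions; the paper specifies only the items listed:

```python
import math
import torch

def build_optimizer(model, base_lr=1e-3, weight_decay=0.05,
                    num_layers=12, lwd=0.65, warmup=10_000,
                    total_steps=200_000):
    """AdamW with layer-wise LR decay, linear warm-up, and cosine annealing.

    The decay factor (0.65), the total step count, and the timm-style
    parameter names ('blocks.<i>.', 'patch_embed', ...) are assumptions,
    not values from the paper.
    """
    groups = []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if name.startswith("blocks."):
            layer_id = int(name.split(".")[1]) + 1
        elif name.startswith(("patch_embed", "cls_token", "pos_embed")):
            layer_id = 0          # embeddings get the smallest LR
        else:
            layer_id = num_layers + 1   # head keeps the full base LR
        scale = lwd ** (num_layers + 1 - layer_id)
        groups.append({"params": [p], "lr": base_lr * scale,
                       "weight_decay": weight_decay})
    optimizer = torch.optim.AdamW(groups)

    def lr_lambda(step):
        if step < warmup:                 # linear warm-up, first 10K iterations
            return step / warmup
        t = (step - warmup) / max(1, total_steps - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * t))  # cosine annealing

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```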
To identify clean samples, we follow these steps:
Attach a randomly initialized linear layer to the pretrained backbone.
Conduct 15 epochs of linear probing.
Record model parameters every 5 epochs for subsequent gradient-based analysis.
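A minimal sketch of these three steps is given below, assuming the backbone returns pooled 768-dimensional features; the probe's optimizer and learning rate are illustrative, as the text does not specify them:

```python
import copy
import torch
import torch.nn as nn

def linear_probe(backbone, train_loader, num_classes, feat_dim=768,
                 epochs=15, save_every=5, lr=1e-4):
    """Attach a fresh linear head to the frozen backbone, probe for 15
    epochs, and snapshot the head every 5 epochs for gradient analysis."""
    backbone.eval()
    for p in backbone.parameters():            # freeze the pretrained encoder
        p.requires_grad_(False)

    probe = nn.Linear(feat_dim, num_classes)   # randomly initialized layer
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    checkpoints = []                           # snapshots for later analysis

    for epoch in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = backbone(images)       # pooled [B, feat_dim] features
            loss = nn.functional.cross_entropy(probe(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if (epoch + 1) % save_every == 0:      # record every 5 epochs
            checkpoints.append(copy.deepcopy(probe.state_dict()))
    return probe, checkpoints
```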
The reference set size is set to 10 samples per class by default; for a detailed analysis, see the ablation study in the paper.
The threshold δ_IF for reference-set augmentation is set to 0.8, applied to the IF score after normalization.
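Assuming min-max normalization of the IF scores (the text does not specify which normalization is used), the augmentation rule reduces to a simple threshold test:

```python
import torch

def augment_reference_set(if_scores, delta_if=0.8):
    """Select samples whose normalized IF score exceeds δ_IF = 0.8.

    `if_scores` is a 1-D tensor of per-sample influence-function scores.
    Min-max normalization to [0, 1] is our assumption for 'after
    normalization'; the paper may normalize differently.
    """
    s = if_scores - if_scores.min()
    s = s / (s.max() + 1e-12)                    # normalize to [0, 1]
    return torch.nonzero(s > delta_if).flatten() # indices added to the set
```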
Optimization: Adam optimizer with an initial learning rate of 1 × 10⁻⁴ and cosine decay.
Training Duration: 5 epochs.
Data Augmentation: RandAugment with parameters n=2, m=10.
Regularization: MixUp loss with an interpolation coefficient of α = 0.4.
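A sketch of the two regularizers, using torchvision's RandAugment for illustration and the standard MixUp formulation with λ ~ Beta(α, α); `model` is a placeholder for the classifier being fine-tuned:

```python
import numpy as np
import torch
import torch.nn.functional as F
from torchvision import transforms

# RandAugment with n = 2 ops at magnitude 10; torchvision's implementation
# is used here for illustration (the paper may use a different one).
augment = transforms.Compose([
    transforms.RandAugment(num_ops=2, magnitude=10),
    transforms.ToTensor(),
])

def mixup_loss(model, x, y, alpha=0.4):
    """MixUp with interpolation coefficient λ ~ Beta(0.4, 0.4): mix the
    inputs and take the convex combination of the two CE terms."""
    lam = float(np.random.beta(alpha, alpha))
    idx = torch.randperm(x.size(0))
    logits = model(lam * x + (1.0 - lam) * x[idx])
    return (lam * F.cross_entropy(logits, y)
            + (1.0 - lam) * F.cross_entropy(logits, y[idx]))
```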
Optimization: Adam optimizer with an initial learning rate of 1 × 10⁻⁴ and cosine decay.
Training Duration: 5 epochs per refinement iteration.
Number of Refinement Iterations: N = 2.
Data Augmentation: RandAugment with parameters n=2, m=10.
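The refinement stage reuses the fine-tuning recipe above inside an outer loop. The skeleton below fixes only the schedule (N = 2 iterations, 5 epochs each); `select_clean` and `fine_tune` are placeholders standing in for the paper's sample-selection and fine-tuning procedures:

```python
def refine(model, dataset, select_clean, fine_tune, n_iters=2, epochs=5):
    """Outer refinement loop: alternate clean-sample selection and
    re-training for N = 2 iterations of 5 epochs each."""
    clean_idx = select_clean(model, dataset)      # initial clean subset
    for _ in range(n_iters):
        model = fine_tune(model, dataset, clean_idx, epochs=epochs)
        clean_idx = select_clean(model, dataset)  # re-identify clean samples
    return model, clean_idx
```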