We adopt the BEiTv2 pipeline and train a Vision Transformer (ViT) model from scratch on the target dataset in a self-supervised manner. The base BEiT model consists of:
12 transformer blocks,
12 attention heads per block,
a hidden dimension of 768.
The model uses a patch size of 16 × 16 and resizes input images to 224 × 224. The vocabulary size is set to 8K, and the visual tokenizer is adopted from the original BEiTv2 paper, where it was pretrained on ImageNet-1K to produce patch tokens with compact semantic information.
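For concreteness, these backbone hyperparameters can be collected in a small configuration object. This is a minimal sketch with field names of our own choosing, not the authors' code:

```python
from dataclasses import dataclass

@dataclass
class BEiTv2PretrainConfig:
    """Backbone hyperparameters from the text; field names are our own."""
    depth: int = 12          # transformer blocks
    num_heads: int = 12      # attention heads per block
    embed_dim: int = 768     # hidden dimension
    patch_size: int = 16     # 16 × 16 patches
    img_size: int = 224      # input resolution after resizing
    vocab_size: int = 8192   # 8K visual-token vocabulary
```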
Datasets: CIFAR-100, CIFAR-80N, Animal-10N, WebVision.
Training Duration: 200 epochs.
Optimizer: AdamW with a weight decay of 0.05.
Learning Rate Schedule: Cosine annealing with an initial learning rate of 1 × 10⁻³.
Warm-up Phase: First 10K iterations.
Regularization: Stochastic depth with a drop path rate of 0.1.
Stabilization: Layer-wise learning rate decay.
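The pretraining recipe above (AdamW, cosine annealing with linear warm-up, and layer-wise learning rate decay) can be combined as in the following sketch. The layer-wise decay factor, the total step count, and the timm-style parameter names are our assumptions; the paper specifies only the items listed:

```python
import math
import torch

def build_optimizer(model, base_lr=1e-3, weight_decay=0.05,
                    num_layers=12, lwd=0.65, warmup=10_000,
                    total_steps=200_000):
    """AdamW with layer-wise LR decay, linear warm-up, and cosine annealing.

    The decay factor (0.65), the total step count, and the timm-style
    parameter names ('blocks.<i>.', 'patch_embed', ...) are assumptions,
    not values from the paper.
    """
    groups = []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if name.startswith("blocks."):
            layer_id = int(name.split(".")[1]) + 1
        elif name.startswith(("patch_embed", "cls_token", "pos_embed")):
            layer_id = 0          # embeddings get the smallest LR
        else:
            layer_id = num_layers + 1   # head keeps the full base LR
        scale = lwd ** (num_layers + 1 - layer_id)
        groups.append({"params": [p], "lr": base_lr * scale,
                       "weight_decay": weight_decay})
    optimizer = torch.optim.AdamW(groups)

    def lr_lambda(step):
        if step < warmup:                 # linear warm-up, first 10K iterations
            return step / warmup
        t = (step - warmup) / max(1, total_steps - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * t))  # cosine annealing

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```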
To identify clean samples, we follow these steps:
Attach a randomly initialized linear layer to the pretrained backbone.
Conduct 15 epochs of linear probing.
Record model parameters every 5 epochs for subsequent gradient-based analysis.
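A minimal sketch of these three steps is given below, assuming the backbone returns pooled 768-dimensional features; the probe's optimizer and learning rate are illustrative, as the text does not specify them:

```python
import copy
import torch
import torch.nn as nn

def linear_probe(backbone, train_loader, num_classes, feat_dim=768,
                 epochs=15, save_every=5, lr=1e-4):
    """Attach a fresh linear head to the frozen backbone, probe for 15
    epochs, and snapshot the head every 5 epochs for gradient analysis."""
    backbone.eval()
    for p in backbone.parameters():            # freeze the pretrained encoder
        p.requires_grad_(False)

    probe = nn.Linear(feat_dim, num_classes)   # randomly initialized layer
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    checkpoints = []                           # snapshots for later analysis

    for epoch in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = backbone(images)       # pooled [B, feat_dim] features
            loss = nn.functional.cross_entropy(probe(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if (epoch + 1) % save_every == 0:      # record every 5 epochs
            checkpoints.append(copy.deepcopy(probe.state_dict()))
    return probe, checkpoints
```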
The reference set size is set to 10 samples per class by default; for a detailed analysis, see the ablation study in the paper.
The threshold δ_IF for reference-set augmentation is set to 0.8, applied to the IF score after normalization.
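Assuming min-max normalization of the IF scores (the text does not specify which normalization is used), the augmentation rule reduces to a simple threshold test:

```python
import torch

def augment_reference_set(if_scores, delta_if=0.8):
    """Select samples whose normalized IF score exceeds δ_IF = 0.8.

    `if_scores` is a 1-D tensor of per-sample influence-function scores.
    Min-max normalization to [0, 1] is our assumption for 'after
    normalization'; the paper may normalize differently.
    """
    s = if_scores - if_scores.min()
    s = s / (s.max() + 1e-12)                    # normalize to [0, 1]
    return torch.nonzero(s > delta_if).flatten() # indices added to the set
```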
Optimization: Adam optimizer with an initial learning rate of 1 × 10⁻⁴ and cosine decay.
Training Duration: 5 epochs.
Data Augmentation: RandAugment with parameters n=2, m=10.
Regularization: MixUp loss with an interpolation coefficient of α = 0.4.
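A sketch of the two regularizers, using torchvision's RandAugment for illustration and the standard MixUp formulation with λ ~ Beta(α, α); `model` is a placeholder for the classifier being fine-tuned:

```python
import numpy as np
import torch
import torch.nn.functional as F
from torchvision import transforms

# RandAugment with n = 2 ops at magnitude 10; torchvision's implementation
# is used here for illustration (the paper may use a different one).
augment = transforms.Compose([
    transforms.RandAugment(num_ops=2, magnitude=10),
    transforms.ToTensor(),
])

def mixup_loss(model, x, y, alpha=0.4):
    """MixUp with interpolation coefficient λ ~ Beta(0.4, 0.4): mix the
    inputs and take the convex combination of the two CE terms."""
    lam = float(np.random.beta(alpha, alpha))
    idx = torch.randperm(x.size(0))
    logits = model(lam * x + (1.0 - lam) * x[idx])
    return (lam * F.cross_entropy(logits, y)
            + (1.0 - lam) * F.cross_entropy(logits, y[idx]))
```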
Optimization: Adam optimizer with an initial learning rate of 1 × 10⁻⁴ and cosine decay.
Training Duration: 5 epochs per refinement iteration.
Number of Refinement Iterations: N = 2.
Data Augmentation: RandAugment with parameters n=2, m=10.
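The refinement stage reuses the fine-tuning recipe above inside an outer loop. The skeleton below fixes only the schedule (N = 2 iterations, 5 epochs each); `select_clean` and `fine_tune` are placeholders standing in for the paper's sample-selection and fine-tuning procedures:

```python
def refine(model, dataset, select_clean, fine_tune, n_iters=2, epochs=5):
    """Outer refinement loop: alternate clean-sample selection and
    re-training for N = 2 iterations of 5 epochs each."""
    clean_idx = select_clean(model, dataset)      # initial clean subset
    for _ in range(n_iters):
        model = fine_tune(model, dataset, clean_idx, epochs=epochs)
        clean_idx = select_clean(model, dataset)  # re-identify clean samples
    return model, clean_idx
```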