We have taken six primary preprocessing steps:
(1) Resizing: all images were resized to a standard shape of (224, 224, 3).
(2) Deduplication: removing near-duplicate images taken at the same place and time.
(3) Balancing: oversampling the categories with fewer than average samples.
(4) Normalization: rescaling pixel intensity values to a standardized range to enhance contrast and ensure consistency across images.
(5) Splitting: the data were split into train, validation, and test sets. Special care was taken to prevent similar images from landing in different sets, which would cause data leakage.
(6) Augmentation: each image in the training set was rotated, translated, and cropped by random amounts to create synthetic training images; 20 augmented images were generated per original image.
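Steps (4)-(6) can be sketched as below. This is a minimal NumPy illustration, not our actual pipeline: the group-aware split assumes each image carries a capture-group label (same place and time), and the augmentation uses simple translations and flips as stand-ins for the random rotation/translation/crop described above.

```python
import random
import numpy as np

rng = np.random.default_rng(0)

def normalize(img):
    # Step (4): scale uint8 pixel values to the [0, 1] range.
    return img.astype(np.float32) / 255.0

def group_split(groups, val_frac=0.15, test_frac=0.15, seed=0):
    # Step (5): assign whole capture groups (images taken at the same
    # place and time) to a single split, so near-duplicates never
    # straddle train/val/test and leak information.
    uniq = sorted(set(groups))
    random.Random(seed).shuffle(uniq)
    n_test = int(len(uniq) * test_frac)
    n_val = int(len(uniq) * val_frac)
    test_g = set(uniq[:n_test])
    val_g = set(uniq[n_test:n_test + n_val])
    return ["test" if g in test_g else "val" if g in val_g else "train"
            for g in groups]

def augment(img, count=20, max_shift=16):
    # Step (6): generate `count` synthetic variants per image. Here:
    # random translation (via roll) plus a random horizontal flip.
    out = []
    for _ in range(count):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        a = np.roll(img, (dy, dx), axis=(0, 1))
        if rng.random() < 0.5:
            a = a[:, ::-1]
        out.append(a)
    return out
```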
Transfer learning helps improve model performance when data are limited. Inspired by previous research, we explored two primary families of models:
(1) Pretrained CNN models, including VGG16 and EfficientNet
(2) Pretrained Vision Transformer models including ViT by Google
We freeze the pretrained weights of the earlier feature-extraction layers and fine-tune only the final layers.
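The freeze-and-fine-tune setup can be sketched in Keras as follows. This is an illustrative sketch, not our exact architecture: the head layers and their sizes, and `num_classes = 10`, are assumptions, and `weights=None` is used here only to avoid downloading the ImageNet weights that would be loaded in practice (`weights="imagenet"`).

```python
import tensorflow as tf

num_classes = 10  # assumption: replace with the actual number of categories

# Pretrained VGG16 backbone without its classification head.
# In practice weights="imagenet" loads the pretrained weights we keep.
base = tf.keras.applications.VGG16(
    weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained feature-extraction layers

# New trainable head: only these layers are fine-tuned.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
```

Because the backbone is frozen, only the two Dense layers contribute trainable parameters, so each training step updates the head while the pretrained features stay fixed.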