Fine-grained image classification is a challenging problem in computer vision that involves distinguishing between very similar categories within a broader class. Unlike general image classification, where the goal is to categorize images into high-level categories (e.g., dogs vs. cats), fine-grained classification requires identifying subtle differences between subcategories (e.g., different species of birds).
The motivation behind tackling fine-grained image classification stems from its significant potential to enhance various real-world applications. In wildlife monitoring, accurately classifying species can aid in conservation efforts by providing precise data on biodiversity. In the medical field, fine-grained classification can improve diagnostic accuracy by distinguishing between similar-looking diseases. Despite its importance, fine-grained classification remains challenging due to the
high intra-class variability
low inter-class variance, and
absence of annotated features like bounding boxes, making it an intriguing problem for further research and development.
This project aims to evaluate and enhance two state-of-the-art deep learning models to classify fine-grained images without part-based annotations, using only raw images and labels of benchmark and evaluation datasets.
We opted for comprehensive image datasets rather than data covering only our target category, experimented with end-to-end models, and chose datasets without bounding-box or part annotations. Working with several datasets also allowed us to apply techniques such as transfer learning.
We adjusted existing models built on state-of-the-art convolutional neural networks (CNNs) such as ResNet-36 and ResNet-50, adding dropout to the first model and applying transfer learning to the second. These techniques allowed us to adapt the pre-trained models to the specific task of distinguishing fine-grained categories and to improve their performance.
We implemented, trained, and compared two state-of-the-art models, modifying them for efficiency and compatibility.
🔷 Model 1 — Song et al. (Feature Suppression & Diversification)
Backbone: ResNet-36
Modules:
FBSM: Highlights important image features and suppresses background noise (see the toy sketch after this list)
FDM: Diversifies part-specific representations
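For intuition, the toy PyTorch sketch below reweights a feature map with a spatial attention mask so that high-response (object) regions are boosted and low-response (background) regions are attenuated. This is only our own illustration of the boost-and-suppress idea, not the FBSM/FDM implementation from Song et al.; the function name, normalisation, and floor parameter are assumptions.

```python
import torch

def boost_and_suppress(feature_map: torch.Tensor, floor: float = 0.2) -> torch.Tensor:
    """Toy illustration: boost salient regions and damp background responses by
    reweighting a feature map with a channel-averaged spatial attention map.
    This is NOT the FBSM from Song et al., only the underlying intuition."""
    # feature_map: (B, C, H, W)
    attention = feature_map.mean(dim=1, keepdim=True)               # (B, 1, H, W)
    b = attention.size(0)
    flat = attention.view(b, -1)
    mins = flat.min(dim=1, keepdim=True).values
    maxs = flat.max(dim=1, keepdim=True).values
    attention = ((flat - mins) / (maxs - mins + 1e-6)).view_as(attention)  # min-max to [0, 1]
    # High-activation locations keep full weight, background is scaled down towards `floor`.
    weights = floor + (1.0 - floor) * attention
    return feature_map * weights
```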
Modifications - Model 1
We used the drop-out method, a regularization technique designed to prevent overfitting. It works by randomly "dropping out" (setting to zero) a fraction of the neurons during each training iteration. Because every iteration updates a different subset of neurons, the network learns more robust features that generalize better to unseen data.
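As an illustration, the minimal PyTorch sketch below places dropout in front of the final classifier of a ResNet backbone. We use torchvision's ResNet-50 here because a ResNet-36 is not available off the shelf; the dropout probability and class count are likewise illustrative assumptions, not the exact configuration used in Model 1.

```python
import torch.nn as nn
from torchvision import models

# Illustrative backbone; the dropout rate and number of classes are assumptions.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
num_features = backbone.fc.in_features

backbone.fc = nn.Sequential(
    nn.Dropout(p=0.5),             # randomly zero 50 % of activations during training
    nn.Linear(num_features, 200),  # e.g. 200 fine-grained classes as in CUB-200-2011
)

# Dropout is only active in training mode; backbone.eval() disables it for validation and testing.
```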
🔶 Model 2 — Zhang et al. (Attention-Based MMAL-Net)
Backbone: ResNet-50
Modules:
AOLM: Locates main object in image (without annotations)
APPM: Proposes object parts using sliding windows (see the toy sketch after this list)
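As a rough illustration of the sliding-window idea, the sketch below ranks fixed-size windows on an activation map by their mean response and keeps the strongest ones as part proposals. It is not MMAL-Net's APPM implementation; the window size, stride, scoring, and the absence of non-maximum suppression are simplifications on our part.

```python
import torch
import torch.nn.functional as F

def propose_part_windows(feature_map: torch.Tensor, win: int = 4, top_k: int = 3):
    """Toy sliding-window part proposal: score every win x win window by its mean
    activation and return the top-left corners of the top_k strongest windows.
    Illustrative only; MMAL-Net's APPM differs in detail."""
    # feature_map: (C, H, W) for a single image
    activation = feature_map.mean(dim=0, keepdim=True).unsqueeze(0)         # (1, 1, H, W)
    scores = F.avg_pool2d(activation, kernel_size=win, stride=1).squeeze()  # (H-win+1, W-win+1)
    top = torch.topk(scores.flatten(), k=top_k).indices
    w_out = scores.shape[1]
    # (y, x) coordinates in feature-map space; scale by the backbone stride for image coordinates.
    return [(int(idx) // w_out, int(idx) % w_out) for idx in top]
```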
Modifications - Model 2
We used transfer learning. Transfer learning is a machine learning technique where a pre-trained model developed for a task is reused as the starting point for a model on a second task. It involves leveraging a pre-trained model’s learned features and adapting them to a new, often related, task. This method is highly effective for improving model performance and efficiency, particularly when dealing with limited data.
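A minimal PyTorch sketch of this idea with a torchvision ResNet-50: an ImageNet-pretrained backbone is loaded, most of its layers are frozen, and the classifier is replaced to match the new fine-grained label set. The freezing strategy and class count are our own illustrative choices, not the exact MMAL-Net training recipe.

```python
import torch.nn as nn
from torchvision import models

# Start from a ResNet-50 pre-trained on ImageNet (assumption: ImageNet weights as the source task).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Freeze the pre-trained layers, then unfreeze only the last residual stage for fine-tuning.
for param in model.parameters():
    param.requires_grad = False
for param in model.layer4.parameters():
    param.requires_grad = True

# Replace the final classifier with one matching the new fine-grained label set.
model.fc = nn.Linear(model.fc.in_features, 200)  # e.g. 200 bird classes in CUB-200-2011
```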
We tracked model performance using Weights & Biases (WandB) and conducted comparative experiments; a sketch of the logging pattern is shown below.
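The sketch assumes a standard training loop. The project name, metric names, and the train_one_epoch/evaluate helpers are placeholders, not our actual training code.

```python
import wandb

# Placeholder run configuration; project and run names are illustrative.
wandb.init(project="fine-grained-classification", name="model1-dropout")

num_epochs = 20  # illustrative value
for epoch in range(num_epochs):
    # train_one_epoch and evaluate stand in for the usual training/validation loops.
    train_acc, train_loss = train_one_epoch(model, train_loader)
    val_acc, val_loss = evaluate(model, val_loader)

    # One logged point per epoch produces the accuracy/loss curves discussed below.
    wandb.log({
        "train/accuracy": train_acc,
        "train/loss": train_loss,
        "val/accuracy": val_acc,
        "val/loss": val_loss,
        "epoch": epoch,
    })

wandb.finish()
```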
Training Phase (Model 1):
We monitored Model 1 throughout its training steps; training was done on the competition challenge dataset. The graph in 4.1a shows the accuracy percentage (0 % to 100 %) on the y-axis and the training steps on the x-axis. Initially, the training accuracy starts at around 30 % and reaches a stable 95+ % after about 15 epochs. With the drop-out method, the accuracy did not increase as steadily as without drop-out, but it reached the same level earlier and a higher one overall, as can be seen in 4.1b. The high training accuracy suggests that the model could be overfitted.
The graph in 4.2a shows that the training loss steadily decreases towards zero, which is expected as the model learns.
The graph in 4.3a shows a rapid increase in validation accuracy over the training process. The early steps show noticeable fluctuations in 4.3b as the model fine-tunes its parameters and copes with the dropped neurons. Both graphs show a very high, close-to-perfect validation accuracy, which is an indicator that the model is overfitted. This can be confirmed in Figure 4.2.
The gap between training and validation accuracy, along with the plateauing validation accuracy, strongly suggests that the model is not generalizing well to unseen data. The model achieves a nearly perfect level of accuracy and minimal loss, which indicates that it has learned the training data very well. Fluctuations in validation accuracy suggest instability when generalizing to new, unseen data.
Testing Phase (Model 1)
After sending our model to the provided server to test its performance, we achieved a test accuracy of 71 %. This accuracy is not on par with current state-of-the-art models, but it is still good compared to the other teams. The "low" accuracy on the testing set combined with the high accuracy during training and validation suggests that Model 1 is overfitted. This issue should be tackled in future work.
Training Phase (Model 2):
When evaluating Model 2, we observe that, overall, it exhibits a lower accuracy on test data than Model 1, which we interpret as a sign of better generalization, i.e. a smaller gap between training and test performance. Decreasing loss values indicate improved performance on test data as training progresses. The graph in 4.5b, however, hints at how poorly Model 2 generalises to unseen data. Accuracy on the test data fluctuates but generally trends upwards, indicating that the model improves its performance with every training step. The test accuracy is lower than Model 1's, since Model 1 performs better on the training data. Similar to the local loss, the total average loss on the test dataset starts high and then gradually decreases towards a plateau, with some fluctuation, as the training steps increase.
Testing Phase (Model 2)
Because of downtime on the test server, we could not test the model's performance in its current state. We do not expect it to perform better, since results in the early stages only reached values around 66 % after epoch 14.
Comparison Between the Models
Model 1 shows a better learning curve: accuracy and loss during training do not change abruptly but increase and decrease steadily, respectively. While probably overfitted, it still achieved a good accuracy on the test set. Although we did not track the time per epoch systematically, this model showed a lower computational load, completing an epoch in roughly half the time of Model 2.
Model 2 shows a more erratic response when tested on the CUB-200-2011 dataset; one reason could be the small batch size. It also indicates that MMAL-Net has issues adapting to new images. This may originate in the transfer learning process, because the model was pre-trained on the FGVC-Aircraft dataset beforehand. On the other hand, the same model performed more smoothly during training on the competition challenge dataset.
All in all, we would prefer Model 1 because of its better results and performance benefits. It is also somewhat easier to understand and to integrate into other CNNs.
During the training phase, both models exhibited promising performance on benchmark datasets such as CUB-200-2011 and FGVC-Aircraft. Model 1 incorporated modules to enhance crucial features, suppress background noise, and diversify part-specific representations. Model 2 adopted a region-based approach, generating multi-scale proposal windows and employing attention modules to locate objects and identify informative parts without manual annotations. Both models demonstrated promising performance on benchmark datasets, but Model 1 would be recommended.
However, limitations were identified, including increased model complexity, overfitting and generalization issues, and computational overhead. Future work could focus on more efficient architectures and on exploring alternative loss functions or regularization methods to enhance generalization capabilities.
References
Jianwei Song and Ruoyu Yang. Feature Boosting, Suppression, and Diversification for Fine-Grained Visual Classification. 2021. arXiv: 2103.02782 [cs.CV].
Fan Zhang et al. Multi-branch and Multi-scale Attention Learning for Fine-Grained Visual Categorization. 2020. arXiv: 2003.09150 [cs.CV].