This page includes:
Reliable soft labels: soft labels generated by TrainRef for ambiguous samples.
Reliable confidence: more reliable model predictions obtained after fine-tuning the model on the soft labels produced by TrainRef.
Out-of-distribution (OOD) detection: examples detected as OOD samples by TrainRef.
We present examples of soft labels obtained from TrainRef (the Data Curation Module) for ambiguous training samples. These results are derived from CIFAR-100 under an 80% symmetric noise setting. The model is subsequently refined using these soft labels to improve calibration and reliability.
For simplicity, we display only the five classes with the highest probabilities. In each example, the training sample is shown on the left, the obtained soft label in the middle, and the four most influential supporters from the reference set on the right.
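To make the refinement step above concrete, the sketch below shows how fine-tuning on soft labels can be expressed as a cross-entropy loss against soft targets instead of one-hot labels. This is a minimal illustration, not TrainRef's exact implementation; the function name and the toy logit/target values are illustrative assumptions.

```python
import numpy as np

def soft_label_cross_entropy(logits, soft_targets):
    """Cross-entropy between model predictions and soft labels.

    Unlike one-hot cross-entropy, the loss rewards spreading probability
    mass across the plausible classes identified by the curation module,
    which discourages overconfident predictions on ambiguous samples.
    """
    # Numerically stable log-softmax over the class dimension.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Per-sample loss, averaged over the batch.
    return float(-(soft_targets * log_probs).sum(axis=1).mean())

# Toy example: an ambiguous sample whose soft label splits its mass
# mainly between two similar classes (e.g. willow tree and oak tree).
logits = np.array([[2.0, 1.5, -1.0, -1.0, -1.0]])
soft = np.array([[0.6, 0.3, 0.05, 0.03, 0.02]])
loss = soft_label_cross_entropy(logits, soft)
```

A model trained with this objective tends to produce flatter, better-calibrated distributions on ambiguous inputs than one trained on hard labels.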
In this example, the training sample is assigned the label willow tree. However, its shape is unclear, making it ambiguous and difficult to distinguish between a willow tree, an oak tree, and other similar trees.
The image is assigned the label baby, but the label is difficult to verify: the most similar supporters in the reference set suggest that it could also be a boy or a girl.
In this example, the training sample is assigned the label whale. However, due to the blurriness of the image, it is difficult to distinguish whether the object is indeed a whale or a dolphin.
In this example, the training sample is assigned the label worm. However, due to its curved shape and appearance, it is somewhat ambiguous and could be confused with other elongated organisms.
In this example, the training sample is assigned the label streetcar. However, because the image is blurry, passengers are present, and only part of the object is visible, the distinction between a streetcar and a bus is unclear.
The animal in this training example shows only its back, which leads to ambiguity. Our soft label assigns a high probability to the animal being a kangaroo, while also indicating a small probability of it being other, visually similar species.
We provide further comparisons between the confidence obtained from the model refined on soft labels and that of DISC, one of the traditional state-of-the-art denoising methods. Note that DISC also uses a Mixup loss for better generalization and robustness. For simplicity, we display only the five classes with the highest probabilities.
The test sample is shown on the left, and the confidences from TrainRef and DISC are given on the right for comparison.
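For readers unfamiliar with the Mixup loss mentioned above: Mixup trains on convex combinations of pairs of inputs and their label distributions. The sketch below is a generic illustration of the technique, not DISC's exact implementation; the function signature and toy data are assumptions.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Blend two samples and their label distributions with a Beta-sampled weight."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = float(rng.beta(alpha, alpha))  # mixing coefficient in [0, 1]
    x_mix = lam * x1 + (1 - lam) * x2
    y_mix = lam * y1 + (1 - lam) * y2
    return x_mix, y_mix, lam

# Toy usage: mix two 4-pixel "images" with one-hot labels.
rng = np.random.default_rng(0)
x_mix, y_mix, lam = mixup(np.zeros(4), np.array([1.0, 0.0]),
                          np.ones(4), np.array([0.0, 1.0]), rng=rng)
```

Because the mixed labels are themselves soft distributions, Mixup tends to smooth decision boundaries and improve robustness to label noise, which is why denoising methods such as DISC adopt it.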
The given test image is a rabbit. The highest-probability class from TrainRef aligns with the ground truth, and its soft prediction also reflects the rabbit's similarity to a kangaroo and an otter. The confidence from DISC, in contrast, yields a wrong prediction of raccoon, which is semantically distant. TrainRef demonstrates superior performance by (1) aligning with the correct class, (2) preserving meaningful relationships with similar categories, and (3) reducing overconfidence in incorrect predictions.
TrainRef correctly predicts train (0.60) while recognizing streetcar (0.22) as a reasonable alternative. Its remaining predictions, though minor, remain contextually relevant. DISC, despite correctly predicting train (0.77), assigns butterfly (0.17) as the second most probable class, along with unrelated labels (can, camel, apple). This suggests susceptibility to noise and weaker semantic alignment. TrainRef outperforms DISC by providing more accurate classification, better semantic consistency, and reduced overconfidence in irrelevant labels.
TrainRef correctly predicts ray (0.60) and identifies shark (0.34) as a reasonable alternative due to their similar shapes. Other minor predictions remain contextually relevant. DISC, however, misclassifies the image as kangaroo (0.40) and assigns high probabilities to unrelated classes like skyscraper (0.17) and bee (0.15). This highlights its weaker ability to capture meaningful semantic relationships. TrainRef outperforms DISC by providing more accurate classification, stronger semantic alignment, and reduced confusion with unrelated categories.
TrainRef provides a reasonable prediction, assigning the highest probability to woman (0.28), followed by table (0.20) due to background influence, and girl (0.11), which aligns with the ground truth. DISC, however, misclassifies the image as rabbit (0.41) and lion (0.33), which are semantically distant. Its other predictions (tractor, camel, bee) are also unrelated, highlighting its difficulty in correctly identifying human figures. TrainRef outperforms DISC by providing a more contextually relevant classification and avoiding extreme misclassifications with unrelated objects.
TrainRef assigns the highest probability to possum (0.79), which, while incorrect, is at least semantically similar to a small mammal. It also recognizes mouse (0.08) and hamster (0.07) as possible alternatives, demonstrating some understanding of the image. DISC, however, incorrectly predicts butterfly (0.31) as the most probable class, followed by possum (0.20) and shrew (0.11). Its prediction for mouse is low (0.10), and it even includes unrelated categories like aquarium fish (0.07). TrainRef performs better by prioritizing semantically similar categories and avoiding extreme misclassifications.
TrainRef correctly predicts castle (0.63) with high confidence while recognizing house (0.10) and streetcar (0.07) as visually similar alternatives. Other minor predictions remain reasonable. DISC also predicts castle (0.49) but assigns a significant probability to unrelated classes like caterpillar (0.14) and willow tree (0.11), indicating higher confusion with semantically distant objects. TrainRef outperforms DISC by providing a more accurate classification and stronger semantic alignment, reducing confusion with irrelevant objects.
TrainRef correctly predicts house (0.78) with high confidence, followed by castle (0.10) as a reasonable alternative. Other predictions remain relevant. DISC, however, fails to recognize the image, misclassifying it as beetle (0.31), lobster (0.27), and streetcar (0.25), none of which are semantically related. TrainRef outperforms DISC by providing accurate classification and avoiding extreme misclassifications with unrelated objects.
Here, we illustrate how OOD detection works in our framework. We chose WebVision as the demonstration dataset because it is crawled from the web using ImageNet class keywords, and such keyword-based retrieval inevitably introduces many samples that do not belong to the intended classes. Note that we focus on the first 50 classes, all of which belong to the animal category. 100 randomly sampled OOD examples are shown below.
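As an illustrative sketch only (not necessarily the criterion used by TrainRef), one common baseline for flagging OOD samples is to threshold the maximum softmax probability (MSP): samples on which the model is uniformly uncertain are treated as out-of-distribution. The function name, threshold, and toy logits below are assumptions.

```python
import numpy as np

def flag_ood_by_msp(logits, threshold=0.5):
    """Flag samples whose maximum softmax probability (MSP) falls below a threshold."""
    # Numerically stable softmax over the class dimension.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    msp = probs.max(axis=1)
    # True means the sample is flagged as OOD.
    return msp < threshold

# A confident in-distribution sample vs. a near-uniform, uncertain one.
logits = np.array([[5.0, 0.0, 0.0],   # peaked -> kept as in-distribution
                   [0.1, 0.0, 0.1]])  # flat   -> flagged as OOD
flags = flag_ood_by_msp(logits)
```

In practice the threshold would be tuned on held-in data, and stronger scores (energy, distance to the reference set) could replace MSP in the same interface.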