Suspicious examples identified by EIF -- Noisy data in training set

Noisy data in training set

We investigate noisy data in the training set by looking at cases where our relabelling algorithm suggests a label that differs from the sample's original label.

Based on the distribution of relabelling suggestions, we assign each training class to one of three categories:

  • Centralized Relabelling: training classes in which more than 10% of the samples are recommended to be relabelled, consistently to one other label.

  • Diversified Relabelling: training classes in which more than 10% of the samples are recommended to be relabelled, but to a variety of different labels.

  • Individual Relabelling: training classes in which fewer than 5% of the samples are recommended to be relabelled.


Centralized Relabelling suggests that classes should be merged, Diversified Relabelling suggests that a class should be split, and Individual Relabelling suggests that data cleansing is needed.
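
The categorisation above can be sketched as a small rule over the relabelling suggestions of one class. This is only an illustration under stated assumptions, not our actual implementation: the function name categorise_class, the dominance parameter (the share of changed samples that must agree on one label before the suggestions count as "uniform"), and its default of 0.8 are hypothetical. Classes whose relabelled share falls between 5% and 10% are not covered by the three categories, so the sketch reports them as "Uncategorised".

```python
from collections import Counter


def categorise_class(original_label, suggested_labels, dominance=0.8):
    """Categorise one training class from its relabelling suggestions.

    suggested_labels holds one suggested label per sample of the class.
    dominance is a hypothetical cut-off: the share of changed samples that
    must agree on a single new label to count as a uniform suggestion.
    """
    n = len(suggested_labels)
    if n == 0:
        return "Undefined"

    # Samples whose suggested label differs from the original one.
    changed = [s for s in suggested_labels if s != original_label]
    fraction = len(changed) / n

    if fraction < 0.05:
        return "Individual"      # few noisy samples: data cleansing
    if fraction > 0.10:
        top_count = Counter(changed).most_common(1)[0][1]
        if top_count / len(changed) >= dominance:
            return "Centralized"  # one dominant target: merge candidate
        return "Diversified"      # many targets: split candidate
    return "Uncategorised"        # between 5% and 10%: not defined above


if __name__ == "__main__":
    suggestions = ["cat"] * 80 + ["lynx"] * 20
    print(categorise_class("cat", suggestions))
```

With the toy labels in the usage lines, 20% of the "cat" samples are pointed at a single other label, so the class falls under Centralized Relabelling.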

Noisy data in training set -- Centralized relabelling examples

The first row shows examples from one class; the second row shows examples from another class.

According to our relabelling algorithm, the samples in the first row are recommended to be relabelled as the second class, and they make up more than 10% of all samples in their original class. This suggests that the two classes are highly similar and should be merged.

Noisy data in training set -- Diversified relabelling examples

In each image, the first row shows examples from one class and the second row shows examples from another class.

According to our relabelling algorithm, the samples in the first row are recommended to be relabelled as the second class. However, the relabelling suggestions for the first class are inconsistent, i.e. they point to different classes. We therefore believe that the first class has high intra-class variance and should be split into multiple classes.

Noisy data in training set -- Individual relabelling examples

The first row shows examples from one class; the second row shows examples from another class.

According to our relabelling algorithm, the samples in the first row are recommended to be relabelled as the second class, and they make up less than 5% of all samples in their original class. Data cleansing is needed to correct these noisy samples.