The tutorial will be delivered primarily through Jupyter Notebook examples, accompanied by explanations, pointers, and discussion.
Module 1: Setup
This module will cover the basic environment setup and provide examples showing how one can simulate noisy labels for controlled experiments.
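To make the simulation step concrete, the sketch below shows one common way to synthesize class-dependent label noise: each clean label is replaced by a sample drawn from the corresponding row of a noise transition matrix T, where T[i][j] = P(noisy = j | clean = i). The function name and the particular matrix values are illustrative, not the tutorial's exact code.

```python
import numpy as np

def synthesize_noisy_labels(clean_labels, transition_matrix, seed=0):
    """Replace each clean label y with a sample from row T[y] of the noise
    transition matrix, i.e. P(noisy = j | clean = y) = T[y][j]."""
    rng = np.random.default_rng(seed)
    n_classes = len(transition_matrix)
    return np.array([rng.choice(n_classes, p=transition_matrix[y])
                     for y in clean_labels])

# Example: 3 classes with symmetric 20% noise (labels stay correct w.p. 0.8).
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
clean = np.random.default_rng(0).integers(0, 3, size=1000)
noisy = synthesize_noisy_labels(clean, T, seed=1)
print("empirical noise rate:", np.mean(noisy != clean))
```

With the true T in hand, the later modules can check how well noise-rate estimators and robust losses recover or exploit it.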
Module 2: Learning the noise rate in the labels without knowing the ground truth
This module will go through examples showing how one can estimate the hidden noise transition matrix that governs the generation of noisy labels, without using ground-truth annotations. Knowledge of the noise rate plays a central role both in understanding data quality and in building robust training approaches. The approaches we will cover include a data-centric approach based on clusterability [paper 1, paper 2] and a learning-centric approach based on deep neural networks [paper]. See the following figure for an illustration of inferring the transition matrix.
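To build intuition for what is being estimated, here is a simplified anchor-point sketch in the spirit of the loss-correction literature: train a classifier on the noisy labels, then read row i of T off the predicted distribution of the sample the model is most confident belongs to class i. The clusterability-based estimator covered in this module works differently; this is only an illustrative baseline, and the function name is our own.

```python
import numpy as np

def estimate_transition_matrix(softmax_probs):
    """Anchor-point estimate of the noise transition matrix T.

    softmax_probs: (n_samples, n_classes) predicted probabilities from a
    classifier trained on the noisy labels. For each class i, pick the
    sample the model is most confident about ("anchor") and use its
    predicted distribution as row i of T, i.e. P(noisy = j | clean = i).
    """
    n_classes = softmax_probs.shape[1]
    T = np.zeros((n_classes, n_classes))
    for i in range(n_classes):
        anchor = np.argmax(softmax_probs[:, i])  # most confident sample for class i
        T[i] = softmax_probs[anchor]
    return T
```

The diagonal of the estimated T directly gives per-class noise rates, which is the quantity the data-quality discussion in this module revolves around.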
Module 3: Learning algorithms that handle noisy labels
This module will provide examples showing how one can implement learning algorithms specifically designed to handle noisy labels. We will introduce robust loss functions that can be plugged directly into the empirical risk minimization (ERM) framework. The specific loss functions include forward/backward loss correction [paper], loss reweighting [paper 1, paper 2], peer loss [paper 1, paper 2], bi-tempered loss [paper], and label smoothing [paper 1, paper 2]. We will also discuss other approaches, including Co-Teaching [paper] and DivideMix [paper], which employ two neural networks to supervise each other.
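As a flavor of how such losses slot into standard ERM training, below is a minimal sketch of forward loss correction, assuming the noise transition matrix T is known or has been estimated (e.g., as in Module 2). The other losses follow the same drop-in pattern with different formulas; this snippet is a simplified illustration rather than the tutorial's reference implementation.

```python
import torch
import torch.nn.functional as F

def forward_corrected_loss(logits, noisy_targets, T):
    """Forward loss correction (simplified sketch): push the model's
    clean-class probabilities through the transition matrix T so the loss
    is computed against the noisy labels the model actually observes."""
    clean_probs = F.softmax(logits, dim=1)   # P(clean = i | x)
    noisy_probs = clean_probs @ T            # P(noisy = j | x) = sum_i P(clean = i | x) * T[i][j]
    return F.nll_loss(torch.log(noisy_probs + 1e-12), noisy_targets)

# Example usage with random logits and a 20%-symmetric transition matrix.
T = torch.full((3, 3), 0.1)
T.fill_diagonal_(0.8)
logits = torch.randn(8, 3)
targets = torch.randint(0, 3, (8,))
print(forward_corrected_loss(logits, targets, T))
```

Because the correction only changes the loss, the rest of the training loop (model, optimizer, data loader) stays exactly as in a standard ERM pipeline.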
Module 4: Noisy label detection
This module will go through examples explaining how one can implement a detection algorithm to identify incorrectly labeled instances in a training dataset. The approaches we will cover include a learning-centric method (CORES) [paper] and a data-centric method (SimiFeat) [paper]. See the figure below for an illustration.
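To illustrate the detection idea in its simplest form, the sketch below flags a sample when its noisy label disagrees with the majority label among its k nearest neighbors in feature space. This is only a stripped-down neighborhood-vote detector to convey the intuition; CORES and SimiFeat use more refined scores, and the function name here is our own.

```python
import numpy as np

def knn_label_disagreement(features, noisy_labels, k=10):
    """Flag samples whose noisy label disagrees with the majority label of
    their k nearest neighbors in feature space (simplified illustration)."""
    # Pairwise Euclidean distances; fine for small datasets, use an ANN
    # library for large ones.
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)               # exclude the sample itself
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of k nearest neighbors
    flagged = []
    for i, nbrs in enumerate(neighbors):
        votes = np.bincount(noisy_labels[nbrs], minlength=noisy_labels.max() + 1)
        if votes.argmax() != noisy_labels[i]:
            flagged.append(i)
    return np.array(flagged)
```

In practice the features would come from a pretrained encoder (e.g., penultimate-layer embeddings), so that nearby points tend to share the same clean label.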
Module 5: Real-world datasets
This module will go through examples using real-world datasets that contain noisy human annotations. We will primarily focus on the CIFAR-N dataset (CIFAR-10N / CIFAR-100N), which provides human-annotated noisy labels for the CIFAR training images.
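As a starting point, the sketch below pairs the CIFAR-10N human annotations with the standard CIFAR-10 training images. It assumes the label file CIFAR-10_human.pt has been downloaded from the CIFAR-N release (https://github.com/UCSC-REAL/cifar-10-100n); the key names follow that file and may differ in other versions.

```python
import numpy as np
import torch
import torchvision

# Assumption: CIFAR-10_human.pt downloaded from the CIFAR-N repository.
human_labels = torch.load("CIFAR-10_human.pt")
# Other keys in that file include "worse_label", "random_label1/2/3",
# and "clean_label"; here we take the aggregated annotation.
noisy_labels = np.array(human_labels["aggre_label"])

# Standard CIFAR-10 training images and their clean labels.
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)
clean_labels = np.array(train_set.targets)

print("observed noise rate:", np.mean(noisy_labels != clean_labels))
```

With the noisy and clean labels side by side, the estimators from Module 2 and the detectors from Module 4 can be evaluated on realistic human noise rather than synthetic corruption.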