Class Imbalance
How class imbalance affects the learning process.
Things that we already know:
A Theoretical Analysis of the Learning Dynamics under Class Imbalance, Emanuele Francazi, Marco Baity-Jesi, Aurelien Lucchi, ICML 2023 [Conference paper] [arXiv link] [5-minute talk] [GitHub project page]
We present a theoretical analysis of how class imbalance affects (S)GD and its variants, proving convergence and identifying conditions for improved per-class performance, highlighting how imbalance affects GD and SGD differently.
Some of the questions we want to address:
Imbalance in multi-class problems:
In the binary case, the level of imbalance in a dataset can be quantified in a simple way; when multiple classes are present, extending the definition is nontrivial. Finding an appropriate measure would make it possible to set an unambiguous scale for quantifying imbalance, facilitating comparison between different works.
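To illustrate why the extension is nontrivial, here is a minimal sketch in Python. It compares two candidate measures on hypothetical three-class label counts: the majority-to-minority ratio (a natural carry-over from the binary case) and the normalized Shannon entropy of the class frequencies. Both measures, the function names, and the counts are assumptions chosen for illustration; neither is put forward here as the appropriate measure.

```python
import math
from collections import Counter


def imbalance_ratio(labels):
    """Largest class count divided by smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def normalized_entropy(labels):
    """Shannon entropy of the class frequencies divided by log(K).

    Equals 1.0 for a perfectly balanced dataset and decreases as the
    class distribution becomes more concentrated.
    """
    counts = Counter(labels)
    n = sum(counts.values())
    k = len(counts)
    if k < 2:
        return 0.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)


def make_labels(class_counts):
    """Build a flat label list from per-class counts (hypothetical data)."""
    return [cls for cls, c in enumerate(class_counts) for _ in range(c)]


# Two hypothetical three-class datasets with the same max/min ratio.
one_rare = make_labels([100, 100, 10])  # a single rare class
two_rare = make_labels([100, 10, 10])   # two rare classes

for name, labels in [("one_rare", one_rare), ("two_rare", two_rare)]:
    print(name, imbalance_ratio(labels), round(normalized_entropy(labels), 3))
```

Both datasets have a majority-to-minority ratio of 10, yet their normalized entropies differ, so the two candidate measures do not rank them in the same way. This is the kind of ambiguity a shared scale would need to resolve.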
An efficient solution to counter class imbalance in SGD:
Our analysis sheds light on how class imbalance impacts SGD dynamics. However, the solution proposed in the paper serves a didactic purpose only and is not efficient in practice. Once the effects induced by class imbalance are understood, the natural next step is to use this knowledge to formulate a practically usable solution to the problem.
Distinguish Class Imbalance from Data Scarcity:
Real datasets are often affected by class imbalance; however, in many cases this condition is combined with data scarcity for the minority class. These are two different sources of difficulty for the learning process: being able to distinguish the impact of class imbalance from that of data scarcity would facilitate, for example, the connection between theoretical results (unaffected by data scarcity) and empirical observations.
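One way to separate the two effects empirically is to subsample a large labelled pool so that only one factor changes at a time. The following is a minimal sketch in Python under stated assumptions: a binary problem, a hypothetical label pool, and a hypothetical helper named subsample. It is an illustrative experimental design, not a procedure taken from the paper.

```python
import random


def subsample(labels, n_minority, ratio, minority=1, seed=0):
    """Return indices with n_minority minority examples and
    round(ratio * n_minority) majority examples, drawn without replacement."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    n_majority = int(round(ratio * n_minority))
    if n_minority > len(minority_idx) or n_majority > len(majority_idx):
        raise ValueError("pool too small for the requested design")
    return rng.sample(minority_idx, n_minority) + rng.sample(majority_idx, n_majority)


# Hypothetical binary label pool: 10,000 majority and 1,000 minority examples.
pool = [0] * 10_000 + [1] * 1_000

# Vary the imbalance ratio while the minority size stays fixed
# (scarcity held constant).
fixed_scarcity = {r: subsample(pool, n_minority=100, ratio=r) for r in (1, 10, 50)}

# Vary the minority size while the imbalance ratio stays fixed
# (imbalance held constant).
fixed_imbalance = {n: subsample(pool, n_minority=n, ratio=10) for n in (10, 100, 1000)}
```

Training the same model on each subset and comparing per-class performance along the two axes gives a rough separation of the two sources of difficulty. Note that the total dataset size changes along both axes in this design, which is a third factor to keep in mind when interpreting the results.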