Class Imbalance

How class imbalance affects the learning process.

Things that we already know:

We present a theoretical analysis of how class imbalance affects (S)GD and its variants: we prove convergence, identify conditions for improved per-class performance, and highlight how imbalance affects GD and SGD differently.

Some of the questions we want to address:

In the binary case the level of imbalance in a dataset can be quantified in a simple way, but extending the definition to multiple classes is nontrivial. Finding an appropriate measure would make it possible to set an unambiguous scale for quantifying imbalance, facilitating comparison between different works.
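To illustrate why the multiclass extension is nontrivial, here is a minimal sketch comparing two candidate measures. Both are assumptions chosen for illustration, not measures proposed in the analysis: the max/min class-count ratio (the natural generalization of the binary ratio) and the Shannon entropy of the class frequencies normalized by log(K). The function names and the toy label lists are hypothetical.

```python
import math
from collections import Counter


def imbalance_ratio(labels):
    """Largest class count divided by smallest class count (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def normalized_entropy(labels):
    """Shannon entropy of class frequencies divided by log(K).

    Equals 1 for a perfectly balanced dataset and approaches 0 as one class dominates.
    """
    counts = Counter(labels)
    k = len(counts)
    if k < 2:
        return 0.0  # a single class carries no class-distribution entropy
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)


# Two hypothetical 3-class datasets with the same max/min ratio (90 / 5 = 18):
# in `a` one class dominates, in `b` only one class is rare.
a = [0] * 90 + [1] * 5 + [2] * 5
b = [0] * 90 + [1] * 85 + [2] * 5

for name, labels in (("a", a), ("b", b)):
    print(name, imbalance_ratio(labels), round(normalized_entropy(labels), 3))
```

The two datasets are indistinguishable under the max/min ratio but not under the entropy-based measure, which is one concrete way in which different candidate definitions can disagree once more than two classes are present.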

Our analysis sheds light on how class imbalance impacts SGD dynamics. However, the solution proposed in the paper is for didactic purposes only and is not efficient in practice. Once the effects induced by class imbalance are understood, the natural next step is to use this knowledge to formulate a practically usable solution to the problem.

Real datasets are often affected by class imbalance; however, in many cases this condition is combined with data scarcity for the minority class. These are two different sources of difficulty for the learning process: being able to distinguish the impact of class imbalance from that of data scarcity would facilitate, for example, the connection between theoretical results (unaffected by data scarcity) and empirical observations.
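One possible way to separate the two effects empirically is a controlled subsampling design: vary the imbalance ratio while holding the minority-class count fixed, then vary the minority-class count while holding the ratio fixed. The sketch below shows this for the binary case only. It is an assumption about how such an experiment could be set up, not a procedure from the analysis; `subsample_binary`, the pool sizes, and the grid values are all hypothetical.

```python
import random


def subsample_binary(majority, minority, n_minority, ratio, seed=0):
    """Draw a subset with n_minority minority samples and ratio * n_minority majority samples.

    `majority` and `minority` are lists of sample indices; `ratio` is an integer >= 1.
    """
    n_majority = ratio * n_minority
    if n_minority > len(minority) or n_majority > len(majority):
        raise ValueError("not enough samples for the requested design point")
    rng = random.Random(seed)  # local generator so each design point is reproducible
    return rng.sample(majority, n_majority), rng.sample(minority, n_minority)


# Hypothetical index pools standing in for a real dataset.
majority_pool = list(range(10_000))
minority_pool = list(range(10_000, 11_000))

# Same minority count, increasing imbalance: isolates the effect of the ratio.
fixed_scarcity = [subsample_binary(majority_pool, minority_pool, 50, r) for r in (1, 10, 100)]

# Same ratio, increasing minority count: isolates the effect of scarcity.
fixed_ratio = [subsample_binary(majority_pool, minority_pool, n, 10) for n in (10, 50, 500)]
```

Note that the first sweep also changes the total dataset size, so a complete design would need to account for that as a third factor.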