Does an untrained network start out neutral or biased?
In other words: when faced with a binary dataset, what fraction of examples does it initially assign to "class 0"?
One might imagine two possible outcomes:
The network splits examples evenly across the two classes.
The network systematically favors one class over the other.
Which of these scenarios actually happens?
If we pass a binary dataset through an untrained network, does it split 50/50, or does it lean toward one class?
The answer depends on the architecture.
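To make this concrete, here is a minimal probe, a sketch in PyTorch rather than the code used in the papers: build an untrained binary classifier, feed it zero-mean Gaussian inputs, and record the fraction of examples assigned to class 0 over a few random initializations, once with a tanh MLP and once with a ReLU MLP. The dimensions, width, depth, and helper name are illustrative choices, not taken from the papers.

```python
# Sketch: estimate the class-0 fraction of an *untrained* binary classifier.
# All hyperparameters (dim, width, depth) are illustrative, not from the papers.
import torch
import torch.nn as nn

def class0_fraction(activation, n_inputs=10_000, dim=100, width=512, depth=4, seed=0):
    torch.manual_seed(seed)
    layers, in_dim = [], dim
    for _ in range(depth):
        layers += [nn.Linear(in_dim, width), activation()]
        in_dim = width
    net = nn.Sequential(*layers, nn.Linear(in_dim, 2))  # two logits, left untrained
    x = torch.randn(n_inputs, dim)                       # zero-mean Gaussian "blob"
    with torch.no_grad():
        preds = net(x).argmax(dim=1)
    return (preds == 0).float().mean().item()

for act in (nn.Tanh, nn.ReLU):
    fracs = [round(class0_fraction(act, seed=s), 3) for s in range(5)]
    print(act.__name__, fracs)
# Per the papers, antisymmetric activations (e.g. tanh) tend to keep the split
# near 50/50, while elements such as ReLU, depth, and max-pooling can break the
# node-permutation symmetry and skew the initial guesses toward one class.
```

Repeating over seeds matters because which class is favored may flip from one initialization to the next; the bias shows up in the per-initialization split rather than in a fixed preferred class.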
Why should we care about predictions at initialization?
Because the initial predictive state shapes how the learning dynamics unfold later.
Want to know more?
Initial Guessing Bias: How Untrained Networks Favor Some Classes, Emanuele Francazi, Aurelien Lucchi, Marco Baity-Jesi, ICML 2024 [Conference paper] [arXiv link] [talk] [GitHub project page]
We prove that untrained neural networks can unevenly distribute their guesses among different classes. This is due to a node-permutation symmetry breaking, caused by architectural elements such as activations, depth and max-pooling.
Where You Place the Norm Matters: From Prejudiced to Neutral Initializations, Emanuele Francazi, Francesco Pinto, Aurelien Lucchi, Marco Baity-Jesi [arXiv link] [GitHub project page]
Normalization type and position shape prediction behavior at initialization; our theory shows that BatchNorm and LayerNorm differ fundamentally, and highlights the critical role of normalization placement within layers.
When the Left Foot Leads to the Right Path: Bridging Initial Prejudice and Trainability, Alberto Bassi, Carlo Albert, Aurelien Lucchi, Marco Baity-Jesi, Emanuele Francazi [arXiv link]
We prove a theoretical correspondence between the order/chaos phase transition and initial guessing bias (IGB) in the mean-field regime.
Some of the questions we want to address:
Interplay between data-induced and architecture-induced bias effects:
Investigate how IGB interacts with other sources of bias during training. A deeper understanding could reveal ways to harness IGB to counterbalance effects like class imbalance, potentially inspiring new architecture-level solutions to improve learning under imbalance.
The role of the dataset beyond class imbalance:
Our work shows how the design of a neural network can give rise to IGB. At the same time, our analysis shows that the input distribution also plays a key role in the phenomenon. A more in-depth study of the interplay between these two elements, besides deepening our understanding of IGB, would be an important first step toward incorporating models of real data into the analysis. Understanding the role of the dataset distribution could also inform pre-processing choices (e.g., how to standardize the dataset); a toy probe of the effect of standardization at initialization is sketched after this list.
Extension of the analysis to real data and broader architecture families:
Our analysis quantitatively describes IGB for multilayer perceptrons (MLPs) with a Gaussian blob as input. Since the phenomenon is empirically observed in broader settings, the natural next step is to extend the analysis to them.
IGB and regularization:
The condition driving the emergence of IGB suggests that some forms of network regularization might be effective in eliminating it. Understanding how regularization affects the phenomenon can then inform whether it should be included in the network design, depending on whether or not we want to eliminate IGB.
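As a complement to the dataset question above, one toy probe (again a PyTorch sketch with illustrative settings, not the papers' protocol) is to feed the same untrained ReLU MLP raw inputs with a non-zero mean and then their per-feature standardized version, and compare the class-0 fractions at initialization.

```python
# Sketch: how does input centering/scaling affect the initial class split?
# The shift of +3.0 and the network sizes are illustrative assumptions.
import torch
import torch.nn as nn

def class0_fraction(x, width=512, depth=4, seed=0):
    """Fraction of inputs an untrained ReLU MLP assigns to class 0."""
    torch.manual_seed(seed)
    layers, in_dim = [], x.shape[1]
    for _ in range(depth):
        layers += [nn.Linear(in_dim, width), nn.ReLU()]
        in_dim = width
    net = nn.Sequential(*layers, nn.Linear(in_dim, 2))  # left untrained
    with torch.no_grad():
        return (net(x).argmax(dim=1) == 0).float().mean().item()

x_raw = torch.randn(10_000, 100) + 3.0           # hypothetical inputs with non-zero mean
x_std = (x_raw - x_raw.mean(0)) / x_raw.std(0)   # per-feature standardization

print("raw inputs:         ", class0_fraction(x_raw))
print("standardized inputs:", class0_fraction(x_std))
```

How sensitive the initial split is to this kind of pre-processing is exactly the data/architecture interplay flagged in the dataset question above; the sketch only shows how one would measure it, not what the outcome should be.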