Phase Transition in Machine Learning

A learning machine such as a deep neural network contains many smaller models in its parameter space. A parameter that corresponds to a smaller model is a singularity of the larger model.

By using a partition of unity, the prior distribution on the parameter space can be represented as a finite sum of distributions, each of which has compact support.

As the sample size increases, the parameter set chosen by the posterior distribution moves from one singularity to another. This phenomenon is called a phase transition.

Mathematically, the phase transition is explained as follows. The local integral Zk over a local parameter set is proportional to the posterior probability of that set. The local free energy is defined by Fk = - log Zk. Hence the minimum local free energy corresponds to the maximum posterior probability, and the parameter set that minimizes the local free energy is automatically chosen by the posterior distribution.
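
The relation between the local integrals and the posterior probability can be written out explicitly. The following is a sketch under assumed notation that the text does not fix: \varphi(w) is the prior, \rho_k(w) is a partition of unity, \varphi_k(w) are the resulting compactly supported pieces, p(X_i | w) is the likelihood of the i-th sample, and Z_k, F_k correspond to Zk, Fk above.

```latex
\begin{align*}
\varphi(w) &= \sum_k \varphi_k(w), \qquad
  \varphi_k(w) = \rho_k(w)\,\varphi(w), \qquad \sum_k \rho_k(w) = 1, \\
Z &= \int \varphi(w) \prod_{i=1}^{n} p(X_i \mid w)\, dw = \sum_k Z_k, \qquad
  Z_k = \int \varphi_k(w) \prod_{i=1}^{n} p(X_i \mid w)\, dw, \\
\frac{Z_k}{Z} &= \frac{e^{-F_k}}{\sum_j e^{-F_j}}, \qquad F_k = -\log Z_k .
\end{align*}
```

Under these assumptions, the posterior weight of the k-th local parameter set is Z_k / Z, so the set with the smallest F_k carries the largest posterior weight.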

A singularity which corresponds to a small model has a smaller variance and a larger bias. A singularity which corresponds to a large model has a larger variance and a smaller bias. The parameter set chosen by the posterior distribution is determined by the balance between bias and variance.
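
The bias-variance balance can be illustrated numerically. The sketch below assumes the asymptotic form Fk(n) ≈ n * Lk + lambda_k * log n, which is commonly used in singular learning theory but is not stated in the text above; here Lk plays the role of the bias and lambda_k the role of the variance (complexity). The three local models and all their numerical values are hypothetical, chosen only to make the transitions visible.

```python
import math

# Hypothetical local models: (name, bias term L_k, complexity coefficient lambda_k).
# A smaller model has a larger bias and a smaller complexity coefficient.
LOCAL_MODELS = [
    ("small", 0.05, 0.5),
    ("medium", 0.01, 2.0),
    ("large", 0.0, 6.0),
]


def local_free_energy(n, bias, lam):
    """Assumed asymptotic local free energy: F_k(n) = n * L_k + lambda_k * log(n)."""
    return n * bias + lam * math.log(n)


def chosen_model(n):
    """Return the name of the local model with the smallest local free energy."""
    best = min(LOCAL_MODELS, key=lambda m: local_free_energy(n, m[1], m[2]))
    return best[0]


def phase_transitions(sample_sizes):
    """List (n, previous, current) for each n at which the chosen model changes."""
    transitions = []
    previous = None
    for n in sample_sizes:
        current = chosen_model(n)
        if previous is not None and current != previous:
            transitions.append((n, previous, current))
        previous = current
    return transitions


if __name__ == "__main__":
    for n, old, new in phase_transitions(range(2, 5001)):
        print(f"n = {n}: {old} -> {new}")
```

With these hypothetical values, the model with the smallest free energy changes from small to medium and then from medium to large as n grows, which mirrors the jumps between singularities described above.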

The balance between bias and variance in the free energy differs from that in the generalization error. The posterior distribution automatically minimizes the free energy, but it does not do so for the generalization error. This is not a paradox.

The learning curve defined by the generalization error has this form. This phenomenon is caused by the singularities of deep learning. The learning process in deep learning is a sequence of jumps from one singularity to another. Even when the sample size is very large, it does not reach a regular parameter.


Sumio Watanabe, Algebraic geometrical methods for hierarchical learning machines. Neural Networks, 14(8), 2001, pp.1049-1060