A learning machine such as a deep neural network contains many smaller models in its parameter space. The set of parameters that represents a smaller model contains singularities.
By using a partition of unity, the prior distribution defined on the parameter space can be represented as a finite sum of distributions, each of which has compact support.
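As a minimal sketch of this decomposition (the symbols φ, ρ_k, and K below are notation introduced here for illustration, not taken from the text), a partition of unity {ρ_k} subordinate to a finite cover of the parameter space gives

\[
\varphi(w) \;=\; \sum_{k=1}^{K} \rho_k(w)\,\varphi(w) \;=:\; \sum_{k=1}^{K} \varphi_k(w),
\qquad \sum_{k=1}^{K} \rho_k(w) \;=\; 1,
\]

where each \varphi_k has compact support, and the cover can be chosen so that each support is a neighborhood of one singularity.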
As the sample size increases, the parameter set chosen by the posterior distribution moves from a neighborhood of one singularity to a neighborhood of another. This phenomenon is called a phase transition.
Mathematically, the phase transition is explained as follows. The integral of the posterior distribution over the k-th local parameter set, Z_k, is equal to the posterior probability of that set. The local free energy is defined by F_k = -log Z_k. Hence the minimum local free energy corresponds to the maximum posterior probability, and the parameter set that minimizes the local free energy is automatically chosen by the posterior distribution.
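The following is a minimal numerical sketch of this selection rule, not an implementation from the text: it assumes a toy Gaussian model with mean w^2 (so the true parameter w = 0 is a singularity), a flat prior, and an arbitrary split of the parameter range into two local sets.

import numpy as np

rng = np.random.default_rng(0)
n = 100                               # sample size
x = rng.normal(size=n)                # data generated at the singular truth w = 0

def loglik(w, x):
    # log likelihood of the toy model N(w**2, 1), which is singular at w = 0
    return -0.5 * np.sum((x[:, None] - w[None, :] ** 2) ** 2, axis=0)

w_grid = np.linspace(-2.0, 2.0, 4001)
dw = w_grid[1] - w_grid[0]
log_post = loglik(w_grid, x)          # unnormalized log posterior (flat prior)
log_Z = log_post.max() + np.log(np.sum(np.exp(log_post - log_post.max())) * dw)
log_post = log_post - log_Z           # normalized log posterior on the grid

near = np.abs(w_grid) < 0.5           # local set around the singularity; the rest is its complement

def local_free_energy(mask):
    # Z_k = integral of the posterior over the local set, F_k = -log Z_k
    m = np.max(log_post[mask])
    return -(m + np.log(np.sum(np.exp(log_post[mask] - m)) * dw))

print(local_free_energy(near), local_free_energy(~near))
# the local set with the smaller F_k carries the larger posterior probability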
A singularity that corresponds to a small model has a smaller variance and a larger bias, whereas a singularity that corresponds to a large model has a larger variance and a smaller bias. The parameter set chosen by the posterior distribution is determined by the balance of bias and variance.
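Under the standard asymptotics of singular learning theory (stated here as background, with L_k and λ_k as assumed notation: L_k is the smallest average log loss attainable in the k-th neighborhood and λ_k is its local learning coefficient), the local free energy behaves roughly like

\[
F_k \;\approx\; n L_k \;+\; \lambda_k \log n
\]

up to a term that does not depend on k, so n L_k plays the role of the bias and λ_k log n plays the role of the variance; the posterior selects the k that minimizes this sum.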
The balance between bias and variance in the free energy is different from that in the generalization error. The posterior distribution automatically makes the free energy smallest; however, it does not make the generalization error smallest. This is not a paradox.
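For comparison, in the same asymptotic regime the generalization error contributed by the k-th neighborhood is roughly

\[
G_k \;\approx\; L_k \;+\; \frac{\lambda_k}{n},
\]

so the k that minimizes n L_k + λ_k log n need not be the k that minimizes L_k + λ_k / n; this is why minimizing the free energy does not imply minimizing the generalization error.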
The learning curve defined by the generalization error reflects these phase transitions. This phenomenon is caused by the singularities of deep learning: the learning process consists of multiple jumps from one singularity to another. Even if the sample size is very large, it does not reach any regular parameter.
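As a toy illustration of such a learning curve (the numerical values of L_k and λ_k below are invented for the example and correspond to no particular network), one can tabulate which singularity the posterior would select at each sample size and the resulting rough generalization error:

import numpy as np

L = np.array([0.30, 0.05, 0.00])      # bias of each singularity (invented values)
lam = np.array([0.5, 2.0, 8.0])       # local learning coefficients (invented values)

for n in [10, 100, 1_000, 10_000, 100_000]:
    F = n * L + lam * np.log(n)       # local free energies up to a common term
    k = int(np.argmin(F))             # neighborhood chosen by the posterior
    G = L[k] + lam[k] / n             # rough generalization error at that choice
    print(f"n={n:6d}  chosen singularity k={k}  generalization error ~ {G:.4f}")

# As n grows, the selected k jumps from the small model toward larger ones,
# so the learning curve decreases in steps rather than along a single smooth 1/n curve.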