Statistical Mechanics and Machine Learning

In this subpage, I would like to explain a formal equivalence between statistical mechanics and machine learning. This equivalence has been well known since the 20th century. If you are interested in this topic and would like to know more details, please see the following articles.


Levin, E., Tishby, N., and Solla, S. A., A statistical approach to learning and generalization in layered neural networks, Proceedings of the IEEE, vol. 78, pp. 1568-1574, 1990.

Sumio Watanabe, Review and prospect of algebraic research in equivalent framework between statistical mechanics and machine learning theory, arXiv:2406.10234.

The formal equivalence between statistical mechanics and machine learning theory has been well known since the 20th century. Although they have different natures and principles, they share the same mathematical structure; from the viewpoint of mathematics, they are equivalent.

This slide explains statistical mechanics.

(1) J = {Jij} is a set of random variables that determines the random interaction.

(2) w = {wi} is a set of observables, such as the spins of a spin system. A candidate model in statistical mechanics is specified by defining a random Hamiltonian H(w,J) of w.

(3) p(w|J) is the equilibrium state with inverse temperature beta. This is called the Boltzmann or canonical distribution.

(4) The partition function is defined as the normalizing constant of the equilibrium state.

(5) The average free energy is defined as the expectation value over all J.

This is an explanation of statistical mechanics. If the principle of equal weight is adopted, the prior distribution is set equal to 1 for each state. The validation of a model as natural science is performed by comparing theoretical and experimental free energies, or some values derived from the free energies.
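Written out explicitly, the definitions above can be sketched as follows. This is only a sketch in standard notation: the symbol \varphi(w) denotes the a priori weight of a state (constant under the principle of equal weight), the sum over w may be an integral, and the exact normalization of the free energy depends on the convention.

    % Equilibrium (Boltzmann / canonical) state at inverse temperature \beta,
    % where \varphi(w) is the a priori weight of a state:
    p(w \mid J) = \frac{\varphi(w)\, e^{-\beta H(w,J)}}{Z(J)}

    % Partition function (the normalizing constant):
    Z(J) = \sum_{w} \varphi(w)\, e^{-\beta H(w,J)}

    % Average free energy (the expectation over the random interaction J):
    F = -\frac{1}{\beta}\, \mathbb{E}_{J}\bigl[\log Z(J)\bigr]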

This slide explains machine learning theory. 

(1) X^n = (X_1, ..., X_n) is a set of random variables whose probability distribution is unknown.

(2) The variable w indicates the parameter of a candidate model p(x|w). The random Hamiltonian H(w, X^n), corresponding to H(w,J) with the data X^n playing the role of J, is the minus log likelihood function of p(x|w).

(3) The posterior distribution is defined by using the inverse temperature beta. In the Bayes method, beta = 1, whereas in the maximum likelihood method, beta = infinity.

(4) The marginal likelihood is equal to the normalizing constant of the posterior distribution.

(5) The average free energy is defined as the expectation over all X^n.

This is an explanation of machine learning theory. The prior distribution is a part of the model and is defined by the user. The validation of a model as information science is performed by comparing theoretical and experimental free energies, or some values derived from the free energies.
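In the same way, the quantities above can be sketched as follows. Again this is only a sketch: \varphi(w) denotes the prior distribution chosen by the user, the integral may be a sum, and whether the factor 1/\beta is included in the free energy depends on the convention.

    % Random Hamiltonian: the minus log likelihood of the sample X^n = (X_1, ..., X_n):
    H(w, X^n) = -\sum_{i=1}^{n} \log p(X_i \mid w)

    % Posterior distribution at inverse temperature \beta, with prior \varphi(w):
    p(w \mid X^n) = \frac{\varphi(w)\, e^{-\beta H(w, X^n)}}{Z(X^n)}

    % Marginal likelihood (the normalizing constant); the Bayes method uses \beta = 1:
    Z(X^n) = \int \varphi(w)\, e^{-\beta H(w, X^n)}\, dw

    % Average free energy (the expectation over the sample X^n):
    F = -\frac{1}{\beta}\, \mathbb{E}_{X^n}\bigl[\log Z(X^n)\bigr]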

This table shows the correspondence between statistical mechanics and machine learning theory. It should be emphasized that their structures are completely equivalent, although this equivalence is a formal one.
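For readers who do not have the slide at hand, the correspondence can be summarized as follows; this summary is reconstructed from the explanations above.

    Statistical mechanics                  | Machine learning theory
    ---------------------------------------+----------------------------------------------
    random interaction J = {Jij}           | sample X^n with unknown distribution
    observables w = {wi} (e.g. spins)      | parameter w of the model p(x|w)
    random Hamiltonian H(w,J)              | minus log likelihood H(w,X^n)
    equilibrium (Boltzmann) state p(w|J)   | posterior distribution p(w|X^n)
    principle of equal weight              | prior distribution chosen by the user
    partition function Z(J)                | marginal likelihood Z(X^n)
    average free energy (average over J)   | average free energy (average over X^n)
    inverse temperature beta               | beta = 1 (Bayes), beta = infinity (max. likelihood)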


In both fields, mathematical theory is used to predict the theoretical values of the free energies of given models. In both fields, the models are checked by comparing theoretical and experimental free energies, or some values derived from the free energies.
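As a purely numerical illustration of the free energy that appears in both definitions, the following Python sketch computes F = -log Z for a toy Bayesian model (a Bernoulli model with a uniform prior) both in closed form and by Monte Carlo sampling from the prior. The model, the function names, and the data below are assumptions made only for this illustration; they do not describe the actual validation procedure of either field.

    import math
    import random

    # Toy model: coin flips x_i in {0, 1} with p(x | w) = w^x (1 - w)^(1 - x)
    # and a uniform prior on w in [0, 1].  Here the free energy is F = -log Z,
    # where Z is the marginal likelihood (the normalizing constant).

    def exact_free_energy(x):
        """Exact F = -log Z: Z = integral_0^1 w^k (1 - w)^(n - k) dw = B(k + 1, n - k + 1)."""
        n, k = len(x), sum(x)
        log_z = math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)
        return -log_z

    def monte_carlo_free_energy(x, num_samples=100_000, seed=0):
        """Monte Carlo estimate of F = -log Z, averaging the likelihood over the prior."""
        rng = random.Random(seed)
        n, k = len(x), sum(x)
        def log_lik(w):
            # log likelihood of the whole sample at parameter w
            return k * math.log(w) + (n - k) * math.log(1.0 - w)
        logs = [log_lik(rng.uniform(1e-12, 1.0 - 1e-12)) for _ in range(num_samples)]
        # log-mean-exp for numerical stability
        m = max(logs)
        log_z = m + math.log(sum(math.exp(v - m) for v in logs) / num_samples)
        return -log_z

    if __name__ == "__main__":
        data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
        print("exact       F =", exact_free_energy(data))
        print("Monte Carlo F =", monte_carlo_free_energy(data))

The two printed values agree up to Monte Carlo error, which is the kind of agreement between a theoretical and a numerically estimated free energy that the comparison above refers to, here in the simplest possible setting.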

In this subpage, we have explained that statistical mechanics and machine learning theory have the same structure, which has been well known since the 20th century. In deep learning, the Hamiltonian function has singularities, and as a result an algebraic approach was introduced to study the free energy.

In the pioneering work of Professor Huzihiro Araki (荒木不二洋先生), it was shown that an algebraic approach is necessary for studying statistical mechanics and quantum field theory. Also in machine learning theory, an algebraic approach enables us to derive the free energy of a singular random Hamiltonian.