Statistical Mechanics and Machine Learning

In this subpage, I would like to explain a formal equivalence between statistical mechanics and machine learning. This equivalence has been well known since the 20th century. If you are interested in this topic and would like to know more details, please see the following articles.


Levin, E., Tishby, N., and Solla, S. A., A statistical approach to learning and generalization in layered neural networks, Proceedings of the IEEE, vol. 78, pp. 1568-1574, 1990.

Sumio Watanabe, Review and prospect of algebraic research in equivalent framework between statistical mechanics and machine learning theory, arXiv:2406.10234.

The formal equivalence between statistical mechanics and machine learning theory has been well known since the 20th century. Although they have different natures and principles, they share the same mathematical structure; from the viewpoint of mathematics, they are equivalent.

This slide explains statistical mechanics.

(1) J = {Jij} is a set of random variables that determines the random interaction.

(2) w = {wi} is a set of observables, such as the spins of a spin system. A candidate model in statistical mechanics is specified by defining a random Hamiltonian H(w,J) of w.

(3) p(w|J) is the equilibrium state with inverse temperature beta. This is called the Boltzmann or canonical distribution.

(4) The partition function is defined as the normalizing constant of the equilibrium state.

(5) The average free energy is defined as the expectation value over all J.

This is an explanation of statistical mechanics. If the principle of equal weight is adopted, the prior distribution is set equal to 1 for each state. The validation of a model as natural science is performed by comparing theoretical and experimental free energies, or some values derived from the free energies.
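Written out explicitly, the definitions above can be sketched as follows. This is only a sketch in standard notation: the symbol \varphi(w) denotes the a priori weight of a state (constant under the principle of equal weight), the sum over w may be an integral, and the exact normalization of the free energy depends on the convention.

    % Equilibrium (Boltzmann / canonical) state at inverse temperature \beta,
    % where \varphi(w) is the a priori weight of a state:
    p(w \mid J) = \frac{\varphi(w)\, e^{-\beta H(w,J)}}{Z(J)}

    % Partition function (the normalizing constant):
    Z(J) = \sum_{w} \varphi(w)\, e^{-\beta H(w,J)}

    % Average free energy (the expectation over the random interaction J):
    F = -\frac{1}{\beta}\, \mathbb{E}_{J}\bigl[\log Z(J)\bigr]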

This slide explains machine learning theory. 

(1) X^n = (X_1, ..., X_n) is a set of random variables whose probability distribution is unknown.

(2) The variable w indicates the parameter of a candidate model p(x|w). The random Hamiltonian H(w, X^n), corresponding to H(w,J) with the data X^n playing the role of J, is the minus log likelihood function of p(x|w).

(3) The posterior distribution is defined by using the inverse temperature beta. In the Bayes method, beta = 1, whereas in the maximum likelihood method, beta = infinity.

(4) The marginal likelihood is equal to the normalizing constant of the posterior distribution.

(5) The average free energy is defined as the expectation over all X^n.

This is an explanation of machine learning theory. The prior distribution is a part of the model and is defined by the user. The validation of a model as information science is performed by comparing theoretical and experimental free energies, or some values derived from the free energies.
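In the same way, the quantities above can be sketched as follows. Again this is only a sketch: \varphi(w) denotes the prior distribution chosen by the user, the integral may be a sum, and whether the factor 1/\beta is included in the free energy depends on the convention.

    % Random Hamiltonian: the minus log likelihood of the sample X^n = (X_1, ..., X_n):
    H(w, X^n) = -\sum_{i=1}^{n} \log p(X_i \mid w)

    % Posterior distribution at inverse temperature \beta, with prior \varphi(w):
    p(w \mid X^n) = \frac{\varphi(w)\, e^{-\beta H(w, X^n)}}{Z(X^n)}

    % Marginal likelihood (the normalizing constant); the Bayes method uses \beta = 1:
    Z(X^n) = \int \varphi(w)\, e^{-\beta H(w, X^n)}\, dw

    % Average free energy (the expectation over the sample X^n):
    F = -\frac{1}{\beta}\, \mathbb{E}_{X^n}\bigl[\log Z(X^n)\bigr]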

This table shows the correspondence between statistical mechanics and machine learning theory. It should be emphasized that their structures are completely equivalent, although this equivalence is a formal one.
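For readers who do not have the slide at hand, the correspondence can be summarized as follows; this summary is reconstructed from the explanations above.

    Statistical mechanics                  | Machine learning theory
    ---------------------------------------+----------------------------------------------
    random interaction J = {Jij}           | sample X^n with unknown distribution
    observables w = {wi} (e.g. spins)      | parameter w of the model p(x|w)
    random Hamiltonian H(w,J)              | minus log likelihood H(w,X^n)
    equilibrium (Boltzmann) state p(w|J)   | posterior distribution p(w|X^n)
    principle of equal weight              | prior distribution chosen by the user
    partition function Z(J)                | marginal likelihood Z(X^n)
    average free energy (average over J)   | average free energy (average over X^n)
    inverse temperature beta               | beta = 1 (Bayes), beta = infinity (max. likelihood)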


In both fields, mathematical theory is used to predict the theoretical values of the free energies of given models. In both fields, the models are checked by comparing theoretical and experimental free energies, or some values derived from the free energies.
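As a purely numerical illustration of the free energy that appears in both definitions, the following Python sketch computes F = -log Z for a toy Bayesian model (a Bernoulli model with a uniform prior) both in closed form and by Monte Carlo sampling from the prior. The model, the function names, and the data below are assumptions made only for this illustration; they do not describe the actual validation procedure of either field.

    import math
    import random

    # Toy model: coin flips x_i in {0, 1} with p(x | w) = w^x (1 - w)^(1 - x)
    # and a uniform prior on w in [0, 1].  Here the free energy is F = -log Z,
    # where Z is the marginal likelihood (the normalizing constant).

    def exact_free_energy(x):
        """Exact F = -log Z: Z = integral_0^1 w^k (1 - w)^(n - k) dw = B(k + 1, n - k + 1)."""
        n, k = len(x), sum(x)
        log_z = math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)
        return -log_z

    def monte_carlo_free_energy(x, num_samples=100_000, seed=0):
        """Monte Carlo estimate of F = -log Z, averaging the likelihood over the prior."""
        rng = random.Random(seed)
        n, k = len(x), sum(x)
        def log_lik(w):
            # log likelihood of the whole sample at parameter w
            return k * math.log(w) + (n - k) * math.log(1.0 - w)
        logs = [log_lik(rng.uniform(1e-12, 1.0 - 1e-12)) for _ in range(num_samples)]
        # log-mean-exp for numerical stability
        m = max(logs)
        log_z = m + math.log(sum(math.exp(v - m) for v in logs) / num_samples)
        return -log_z

    if __name__ == "__main__":
        data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
        print("exact       F =", exact_free_energy(data))
        print("Monte Carlo F =", monte_carlo_free_energy(data))

The two printed values agree up to Monte Carlo error, which is the kind of agreement between a theoretical and a numerically estimated free energy that the comparison above refers to, here in the simplest possible setting.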

In this subpage, we have explained that statistical mechanics and machine learning theory have the same structure, which has been well known since the 20th century. In deep learning, the Hamiltonian function has singularities, and as a result an algebraic approach was introduced to study the free energy.

In the pioneering work of Professor Huzihiro Araki (荒木不二洋先生), it was shown that an algebraic approach is necessary for studying statistical mechanics and quantum field theory. Also in machine learning theory, an algebraic approach enables us to derive the free energy of a singular random Hamiltonian.