(Note) The book published in 1986,
J. L. McClelland, D. E. Rumelhart, and G. E. Hinton, Parallel Distributed Processing, MIT Press,
describes hierarchical neural networks and backpropagation as a learning method. The question at the time was, when such a model learns data from the external world, how external information is represented in its internal parameters. In fact, the book describes the types of networks that are formed for various tasks. (Professors Rumelhart and McClelland are cognitive psychologists; Professor Hinton won the 2024 Nobel Prize in Physics for his research in this field.) Soon after the book's publication, experiments demonstrated that this model exhibited high performance on a variety of computer-engineering problems, sparking a second neuro-boom. After several twists and turns, modern artificial intelligence was realized. However, even 40 years later, there has been little progress in research on the original challenge of understanding the internal representations of neural networks. It is my hope that singular learning theory will be a first step toward tackling this difficult challenge.
(Note) An analytic set is the set formed by all the zeros of an analytic function. The set formed by all the zeros of a polynomial is called an algebraic set or algebraic variety.
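As a concrete illustration (the specific sets below are added here as examples and are not from the original note), consider
$$ V_1 = \{ (a,b) \in \mathbb{R}^2 : ab = 0 \}, \qquad V_2 = \{ x \in \mathbb{R} : \sin x = 0 \} = \pi \mathbb{Z}. $$
The set $V_1$ is algebraic (and hence also analytic), since $ab$ is a polynomial; it is the union of the two coordinate axes and is singular at the origin. The set $V_2$ is analytic but not algebraic, because a nonzero polynomial in one variable has only finitely many zeros, while $V_2$ is infinite without being all of $\mathbb{R}$.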
(Note) When calculating eigenvalues numerically, you may get something like 1.0 × 10^(-20), and it is unclear whether it should be regarded as 0. In learning theory, one guideline is whether the value becomes of order 1 when multiplied by the sample size n. If you are considering higher-order asymptotics, try multiplying by n^2.
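The following is a minimal numerical sketch of this rule of thumb (the toy model, the threshold, and all names in the code are illustrative assumptions, not from the text). The second coordinate nearly duplicates the first, so the population covariance has one eigenvalue of order 1 and one of order 1/n; multiplying each computed eigenvalue by n separates the two cleanly.

```python
import numpy as np

# Rule of thumb from the note above: an eigenvalue that becomes O(1)
# when multiplied by the sample size n lives at the 1/n scale and may
# be regarded as effectively zero; an eigenvalue with n * lam >> 1 is
# genuinely nonzero. (The toy model and the threshold are assumptions.)

rng = np.random.default_rng(0)
n = 100_000  # sample size

# The second coordinate duplicates the first up to a perturbation of
# size 1/sqrt(n), so the 2x2 covariance matrix has one eigenvalue of
# order 1 and one of order 1/n.
x0 = rng.normal(size=n)
x1 = x0 + rng.normal(size=n) / np.sqrt(n)
cov = np.cov(np.stack([x0, x1]))  # empirical 2x2 covariance

for lam in np.linalg.eigvalsh(cov):
    scaled = n * lam
    verdict = "effectively 0 (order 1/n)" if abs(scaled) < 10.0 else "nonzero"
    print(f"lam = {lam:.3e}   n*lam = {scaled:.3e}   -> {verdict}")
```

For the higher-order variant mentioned in the note, one would inspect n**2 * lam instead of n * lam.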
(Note) The terms learning theory and learning curve were originally coined in psychology to study the learning process in humans and animals. These terms have since been adapted to apply to artificial neural networks and continue to be used today. The term machine learning also has the same origin.
(Note) No knowledge of physics is required here, but the phase transitions explained here have a mathematically equivalent structure to the phenomena of "water turning into ice" and "iron becoming a magnet." (In thermal equilibrium, the phase is determined by minimizing the free energy. As the temperature, magnetic field, or other conditions change, the phase that minimizes the free energy can change, and this change is a phase transition.) Deep learning is realized on computers and is not a natural phenomenon, but its mathematical mechanism is the same as that of these natural phenomena. It is not yet clear whether biological neural circuits exhibit phenomena similar to the phase transitions explained here. When you come up with something new, what phenomenon occurs in the neural circuits of your brain?
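As a schematic formula (standard statistical mechanics, added here only for illustration; the symbols are not from the original note): if each candidate phase $p$ has energy $E_p$ and entropy $S_p$, its free energy at temperature $T$ is
$$ F_p(T) = E_p - T S_p, $$
and the phase realized in equilibrium is the minimizer $\arg\min_p F_p(T)$. A phase transition occurs at a temperature $T_c$ where two phases exchange the role of minimizer, i.e. $F_{p_1}(T_c) = F_{p_2}(T_c)$. In the learning-theory analogue, the sample size $n$ plays a role similar to the inverse temperature.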
(Note) It is an illusion to think that humans can express their values precisely by some particular function.