The problem studied in this paper is: in a statistical sense (i.e., with a small generalization error), what kinds of functions can be learned by a mean-field shallow neural network trained via the Wasserstein gradient flow?
We now elaborate on this statement in detail.
Training a mean-field shallow neural network via the Wasserstein gradient flow is equivalent to solving a negative Shannon entropy regularized empirical risk minimization problem, in which the manifold of probability measures serves as the statistical model.
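To fix ideas, a schematic form of this regularized objective is sketched below; the notation here (the feature map $h$, the loss $\ell$, the sample $(x_i, y_i)_{i=1}^n$, and the regularization strength $\lambda$) is only illustrative and is defined precisely in the sequel.
\[
\min_{\mu \in \mathcal{P}(\mathbb{R}^p)} \; F_\lambda(\mu)
:= \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(f_\mu(x_i), y_i\bigr)
+ \lambda \int \log\!\Bigl(\tfrac{d\mu}{d\theta}\Bigr)\, d\mu(\theta),
\qquad
f_\mu(x) = \int h(\theta, x)\, d\mu(\theta),
\]
where $f_\mu$ is the mean-field shallow network whose neurons are distributed according to $\mu$, and the second term is the negative Shannon entropy of $\mu$ (taken with respect to the Lebesgue measure).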
The advantage of training shallow neural networks via the Wasserstein gradient flow is that global convergence of the empirical loss is theoretically guaranteed (Chizat & Bach; Mei et al.; Rotskoff & Vanden-Eijnden; Nitanda et al.). This does not mean that the empirical loss necessarily reaches zero; rather, the guarantee that it converges to the global minimum is of fundamental importance from an optimization perspective. This conclusion, however, pertains only to optimization theory. From the standpoint of statistical theory, one must examine the generalization error of neural networks trained in this manner, which is precisely the focus of the present paper.
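Concretely, and still under the illustrative notation introduced above, the Wasserstein gradient flow of $F_\lambda$ admits the standard mean-field Langevin description (a sketch, not necessarily the exact formulation adopted later):
\[
d\theta_t = -\nabla_\theta \frac{\delta F_0}{\delta \mu}(\mu_t, \theta_t)\, dt + \sqrt{2\lambda}\, dW_t,
\qquad \mu_t = \mathrm{Law}(\theta_t),
\]
where $F_0$ denotes the unregularized empirical risk, $\delta F_0 / \delta \mu$ its first variation, and $W_t$ a standard Brownian motion; at finite width this corresponds to noisy gradient descent on the individual neurons. The convergence results cited above, in their entropy-regularized form, guarantee that $F_\lambda(\mu_t)$ approaches its global minimum as $t \to \infty$.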
Studying this generalization error is an important question: the purpose of training a neural network is, after all, to obtain an estimator with a small generalization error. Moreover, the well-known statistical–computational trade-off in mathematical statistics and optimization theory makes it particularly meaningful to investigate the generalization ability of neural networks trained via the Wasserstein gradient flow.
We first define the negative Shannon entropy regularized empirical risk minimization problem.