Deep Learning Foundation

Overview

While deep learning has led to tremendous breakthroughs in various applications, its theoretical foundation for validation, reliability, and interpretation is still in its infancy. Establishing mathematical and statistical principles is crucial for validating and improving deep learning so that it produces reliable results. We aim to develop a complete theoretical framework to analyze and guide the design of deep-learning-based regression and PDE solvers, making scientific machine learning more interpretable and reliable. This theoretical framework consists of approximation theory, optimization theory, and generalization theory, summarized in the figure below.


Deep Network Approximation

Our goal is to clarify the advantages of deep neural network (DNN) approximation and how to design networks accordingly. Important questions include whether DNNs overcome or lessen the curse of dimensionality, and what the optimal and explicit characterization of the approximation error is. In particular, given an arbitrary depth L and width N, what explicit formula characterizes the approximation error, so that one can tell whether a DNN is large enough to meet a prescribed accuracy? Existing theories cannot answer this question.
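One way to phrase this question precisely (the notation here is ours, introduced only for illustration): writing NN(N, L) for the class of ReLU networks of width N and depth L, we seek an explicit bound E(N, L, f) such that

    \inf_{\phi \in NN(N, L)} \| f - \phi \|_{L^\infty([0,1]^d)} \le E(N, L, f),

where E decreases explicitly in both N and L, so that one can read off how wide and deep a network must be to reach a prescribed accuracy.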


Mainly in joint work with Zuowei Shen and Shijun Zhang, we have explicitly quantified the approximation error of ReLU DNNs with arbitrary width N and depth L simultaneously, even achieving the optimal approximation rate, for continuous functions [pdf, pdf] and smooth functions [pdf] in the L^\infty and Sobolev norms. Unfortunately, our results show that ReLU DNNs cannot break the curse of dimensionality for these function classes. This is bad news, since overcoming the curse of dimensionality is the main theoretical justification for DNN-based methods for solving high-dimensional problems. We then discovered that designing effective activation functions is the key to the super approximation power of DNNs. We have shown that DNN approximation with advanced activation functions admits an error that is 1) exponentially small in the number of parameters and 2) free of the curse of dimensionality for general continuous functions [pdf]. This super approximation power can be achieved even with only three hidden layers [pdf]. In a more recent work, we showed that a DNN with a fixed size of order d^2 can approximate any continuous function on a hypercube within an arbitrarily small error. In the case of bandlimited functions, networks with ReLU activation functions can also achieve a dimension-independent approximation rate, as shown together with Hadrien Montanelli and Qiang Du in [pdf], in the same spirit as the Barron spaces studied by Andrew Barron, Weinan E, Jinchao Xu, and others.
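To give a flavor of such explicit characterizations (stated schematically here, with constants and logarithmic factors in the width and depth suppressed; see the cited papers for the precise statements): for a continuous function f on [0,1]^d with modulus of continuity \omega_f, ReLU networks of width O(N) and depth O(L) achieve

    \| f - \phi \|_{L^\infty([0,1]^d)} \le C_d \, \omega_f(N^{-2/d} L^{-2/d}),

with C_d a constant depending only on d, and this rate in N and L is optimal up to constants. For C^s functions the rate improves to N^{-2s/d} L^{-2s/d}, but the exponent still depends on d, which is precisely the curse of dimensionality discussed above.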

Deep Network Optimization & Generalization

It remains mysterious why standard training algorithms tend to find solutions that generalize well, despite the highly non-convex landscape of the loss function. Recent breakthroughs in optimization convergence are obtained under the over-parameterized setting, where the DNN width is extremely large. In existing generalization analyses, over-parametrization typically results in a poor generalization error bound, even though the test error observed in numerical practice is good. Our goal in this direction is to close the current gap between optimization and generalization analysis under a more realistic setting, and to extend the analysis from regression problems to DNN-based PDE solvers.


In [pdf], with Fusheng Liu and Qianxiao Li, we proposed a framework that connects optimization with generalization by analyzing the generalization error in terms of the length of the optimization trajectory of the gradient flow. We showed that gradient flow converges along a short path, with an explicit estimate of its length. This estimate induces a length-based generalization bound showing that short optimization paths after convergence indicate good generalization. The framework applies to broad settings, e.g., underdetermined L^p linear regression, kernel regression, and over-parameterized two-layer ReLU neural networks, without the gap mentioned above. The proposed flow-induced analysis is expected to reveal the implicit benefit of deep learning better than approaches that linearize deep learning or convexify its energy landscape. In [pdf], with Tao Luo, leveraging the theory of the neural tangent kernel (NTK), we showed that gradient descent can identify a global minimizer of the two-layer neural network least-squares optimization for solving second-order linear PDEs under the assumption of over-parametrization. In [pdf], we also analyzed the generalization error of least-squares optimization for second-order linear PDEs without the over-parametrization assumption, when the right-hand-side function of the PDE lies in a Barron-type space and the least squares is regularized with a Barron-type norm. The analysis in [pdf] has been extended to provide theoretical guarantees of deep learning for estimating the stationary density of an Ito diffusion [pdf] and for solving linear elliptic PDEs on unknown manifolds [pdf].
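To make the trajectory-length quantity above concrete, below is a minimal numerical sketch (our own illustration, not code from the cited papers): it trains a small two-layer ReLU network on a toy regression problem with plain gradient descent and accumulates the Euclidean length of the discrete optimization path in parameter space, a stand-in for the gradient-flow trajectory length that enters the length-based generalization bound. In the framework above, a short accumulated path after convergence is the quantity associated with good generalization; the sketch only measures it and makes no claim about the constants in the bound.

    # Minimal sketch (not the construction from the cited papers): train a small
    # two-layer ReLU network on a toy regression task with plain gradient descent
    # and record the Euclidean length of the optimization path in parameter space.
    # The discrete path length sum_t ||theta_{t+1} - theta_t|| serves as a
    # stand-in for the gradient-flow trajectory length.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy 1D regression data: y = sin(2*pi*x) + small noise.
    n, d, m = 64, 1, 128            # samples, input dimension, hidden width
    X = rng.uniform(-1.0, 1.0, size=(n, d))
    y = np.sin(2 * np.pi * X[:, 0]) + 0.05 * rng.standard_normal(n)

    # Two-layer ReLU network: f(x) = a^T relu(W x + b).
    W = rng.standard_normal((m, d)) / np.sqrt(d)
    b = np.zeros(m)
    a = rng.standard_normal(m) / np.sqrt(m)

    def forward(X, W, b, a):
        H = np.maximum(X @ W.T + b, 0.0)   # hidden activations, shape (n, m)
        return H, H @ a                    # predictions, shape (n,)

    def flatten(W, b, a):
        return np.concatenate([W.ravel(), b, a])

    lr, steps = 1e-2, 2000
    path_length = 0.0
    theta_prev = flatten(W, b, a)

    for t in range(steps):
        H, pred = forward(X, W, b, a)
        r = pred - y                       # residuals
        loss = 0.5 * np.mean(r ** 2)

        # Gradients of the mean-squared loss.
        grad_a = H.T @ r / n
        dH = np.outer(r, a) * (H > 0)      # backprop through the ReLU
        grad_W = dH.T @ X / n
        grad_b = dH.sum(axis=0) / n

        W -= lr * grad_W
        b -= lr * grad_b
        a -= lr * grad_a

        theta = flatten(W, b, a)
        path_length += np.linalg.norm(theta - theta_prev)  # accumulate path length
        theta_prev = theta

    print(f"final training loss: {loss:.4f}")
    print(f"optimization path length: {path_length:.4f}")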