Research Accomplishments

Stochastic optimization and generalization.

Stochastic optimization can be applied to both empirical loss minimization and expected loss minimization problems. When applied to the latter, we consider a setting in which, at each iteration, a fresh example is drawn from the probability distribution that defines the expected loss. In this case, the number of iterations of the optimization method corresponds to the number of examples used for training, so the optimization efficiency with respect to the number of iterations is precisely the learning efficiency. This highlights the close relationship between stochastic optimization methods and machine learning theory. For instance, generalization analyses of stochastic gradient descent based on this perspective were conducted in [Nitanda and Suzuki (AISTATS2019)], [Yashima et al. (AISTATS2021)], [Nitanda and Suzuki (ICLR2021)], and others. In [Nitanda and Suzuki (AISTATS2019)], we showed linear convergence of the expected classification error for stochastic gradient descent in a reproducing kernel Hilbert space (RKHS) under a strong low-noise condition on the labels, and in [Yashima et al. (AISTATS2021)] we extended this result to the setting of random Fourier features. [Nitanda and Suzuki (ICLR2021)] showed the generalization optimality of (averaged) stochastic gradient descent for over-parameterized two-layer neural networks in the neural tangent kernel regime, for the regression problem with the squared loss.
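
As a rough illustration of this perspective, the following Python sketch runs averaged stochastic gradient descent with random Fourier features on streaming data, where each iteration consumes one fresh example. The target function, step-size schedule, and feature dimension are illustrative assumptions, not the exact setting analyzed in the papers above.

```python
# Minimal sketch: averaged SGD with random Fourier features for streaming
# squared-loss regression (one fresh sample per iteration). The target
# function, step sizes, and dimensions are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, D, T = 5, 256, 5000          # input dim, number of features, iterations

# Random Fourier features approximating the Gaussian (RBF) kernel.
W = rng.normal(size=(D, d))                  # random frequencies
b = rng.uniform(0.0, 2.0 * np.pi, size=D)    # random phases

def phi(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

theta = np.zeros(D)       # current SGD iterate
theta_bar = np.zeros(D)   # Polyak-Ruppert average, used as the final predictor

for t in range(1, T + 1):
    x = rng.normal(size=d)                    # fresh example from the data
    y = np.sin(x[0]) + 0.1 * rng.normal()     # distribution defining the expected loss
    grad = (phi(x) @ theta - y) * phi(x)      # stochastic gradient of the squared loss
    theta -= 0.1 / np.sqrt(t) * grad          # decaying step size
    theta_bar += (theta - theta_bar) / t      # running average of the iterates

x_test = np.zeros(d)
print("prediction at 0:", phi(x_test) @ theta_bar, "target:", np.sin(0.0))
```

Here one pass of the loop corresponds to one training example, so the convergence rate in T is directly a generalization bound in the number of samples.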

Convergence theory of neural networks.

The hypothesis function is optimized through its parameters to fit the given dataset. Therefore, it is natural to investigate the dynamics in function space associated with the optimization. [Nitanda and Suzuki (2017)], [Nitanda and Suzuki (ICLR2021)], and [Nitanda et al. (NeurIPS2021)], which analyzed two-layer neural networks, are based on this idea. There are two theories that describe the dynamics of neural networks in function space: mean-field theory and neural tangent kernel theory; which theory applies depends on the learning setting of the neural network (e.g., initialization scale, output scale, and learning rate). The mean-field viewpoint of neural network dynamics was introduced in [Nitanda and Suzuki (2017)], which showed that the gradient descent dynamics of neural networks can be described as a Wasserstein gradient flow in the space of probability distributions. Subsequently, global convergence theories for mean-field neural networks were provided by several research groups.

However, neural networks in the mean-field regime (mean-field neural networks) are generally difficult to analyze, and additional conditions or regularizations are required to guarantee polynomial convergence complexity. [Nitanda et al. (NeurIPS2021)] proposed the particle dual averaging (PDA) method and showed that a two-layer mean-field neural network can be optimized in polynomial time under relative entropy (Kullback-Leibler divergence) regularization. This is the first study that provides a quantitative convergence guarantee for mean-field neural networks. This result was then refined by a subsequent study [Oko et al. (ICLR2022)], which proposed particle stochastic dual coordinate ascent (P-SDCA) with a faster convergence guarantee for empirical risk minimization problems. The PDA and P-SDCA methods extend the dual averaging method and stochastic dual coordinate ascent, originally developed for optimization over finite-dimensional spaces, to the space of probability distributions. In general, the proposed methods can minimize convex functionals with negative-entropy regularization, and hence they can also be regarded as nonlinear extensions of the Langevin algorithm, which minimizes linear functionals with negative-entropy regularization. The standard noisy gradient descent for mean-field neural networks can be described as mean-field Langevin dynamics, whose evolution of distributions follows a nonlinear Fokker–Planck equation; its convergence in both continuous- and discrete-time settings was shown in [Nitanda et al. (AISTATS2022)]. The key ingredient of the theory is the proximal Gibbs distribution, which makes the connection with convex optimization theory.
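
As a rough illustration, the following Python sketch runs noisy gradient descent on a particle approximation of a mean-field two-layer network, i.e., a time discretization of mean-field Langevin dynamics. The architecture, data, and regularization constants are illustrative assumptions and do not reproduce the exact setup of [Nitanda et al. (AISTATS2022)].

```python
# Minimal sketch of noisy gradient descent on a particle approximation of a
# mean-field two-layer network (a discretization of mean-field Langevin
# dynamics). Architecture, data, and constants are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, M = 200, 3, 500                      # samples, input dim, number of particles
X = rng.normal(size=(n, d))
y = np.tanh(X @ np.ones(d))                # toy regression target

eta, lam, lam2, T = 0.05, 1e-3, 1e-3, 500  # step size, entropy reg., L2 reg., steps
theta = rng.normal(size=(M, d))            # one particle = one hidden neuron

def forward(theta, X):
    # mean-field scaling: average (not sum) over particles
    return np.tanh(X @ theta.T).mean(axis=1)

for t in range(T):
    r = forward(theta, X) - y                   # residuals, shape (n,)
    s = 1.0 - np.tanh(X @ theta.T) ** 2         # sigma'(w_i . x_j), shape (n, M)
    # drift is the first-variation gradient of the regularized loss, so the
    # 1/M factor from the mean-field scaling does not appear here
    grad = (s * r[:, None]).T @ X / n + lam2 * theta
    # Langevin step: gradient descent plus Gaussian noise whose scale is
    # tied to the entropy-regularization strength lam
    theta += -eta * grad + np.sqrt(2.0 * eta * lam) * rng.normal(size=theta.shape)

print("final training loss:", 0.5 * np.mean((forward(theta, X) - y) ** 2))
```

As M grows, the empirical distribution of the particles approximates the law of the mean-field Langevin dynamics, whose marginal evolution follows the nonlinear Fokker–Planck equation mentioned above.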

Optimization in function spaces.

From the viewpoint of optimization in function space, the dynamics need not be associated with the optimization of parameters. For example, the gradient boosting method approximates an iteration of gradient descent in function space by adding a weak learner to the ensemble. Each layer of a residual network (ResNet) also has such a property. From this perspective, optimization methods for ResNets based on gradient boosting were proposed in [Nitanda and Suzuki (ICML2018)] and [Nitanda and Suzuki (AISTATS2020)], and an optimization method for generative adversarial networks (GANs) was proposed in [Nitanda and Suzuki (AISTATS2018)]. The idea behind these methods is also related to [Nitanda and Suzuki (2017)], which gave the mean-field theory of neural networks. In that paper, we described the dynamics associated with (stochastic) gradient descent as a Wasserstein gradient flow in the space of probability distributions and showed that these dynamics implicitly optimize a transport map between probability distributions, which can be regarded as a gradient boosting method for ResNets. In other words, gradient boosting of transport maps between probability distributions is also related to the gradient method for mean-field two-layer neural networks.
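
To make the function-space view concrete, the following Python sketch implements classical functional gradient boosting for the squared loss, where each stage fits a weak learner (here a decision stump) to the negative functional gradient, i.e., the residual. It is a simplified instance of this viewpoint, not the ResNet-specific procedure of the cited papers.

```python
# Minimal sketch of functional gradient descent via boosting: each stage fits
# a weak learner to the negative functional gradient of the squared loss (the
# residual) and adds it to the ensemble. Data and constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 300
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

def fit_stump(X, r):
    # weak learner: the best single-threshold split (decision stump) on x
    best = None
    for s in np.quantile(X[:, 0], np.linspace(0.05, 0.95, 19)):
        left = X[:, 0] <= s
        pred = np.where(left, r[left].mean(), r[~left].mean())
        err = np.mean((r - pred) ** 2)
        if best is None or err < best[0]:
            best = (err, s, r[left].mean(), r[~left].mean())
    _, s, cl, cr = best
    return lambda Z: np.where(Z[:, 0] <= s, cl, cr)

F = np.zeros(n)                 # current function values on the training set
ensemble, eta = [], 0.3
for t in range(100):
    residual = y - F            # negative functional gradient of the squared loss
    h = fit_stump(X, residual)  # approximate the gradient with a weak learner
    ensemble.append(h)
    F += eta * h(X)             # one functional gradient step in function space

print("training MSE:", np.mean((y - F) ** 2))
```

In the ResNet analogy, the update F + eta * h plays the role of a residual block x + h(x): each layer, like each boosting stage, approximates one step of gradient descent in function space.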

Scalable stochastic optimization.

Large-scale datasets are beneficial for improving generalization performance, but the cost of optimization generally grows with the data size. Therefore, it is important to study stochastic optimization methods that scale to large datasets. Stochastic optimization is scalable by nature, but there is room to develop better methods by exploiting the structure of the optimization problem. For example, in [Nitanda (NIPS2014)] and [Nitanda (AISTATS2016)], we developed accelerated variance-reduced methods that make use of the finite-sum structure of the empirical loss minimization problem. In [Nitanda et al. (ICDM2019)], we discussed the tradeoff among iteration complexity, total computational complexity, and mini-batch efficiency of stochastic optimization methods with mini-batch updates, and showed the optimality of a variant of the method of [Nitanda (NIPS2014)] with respect to this tradeoff. In [Nitanda and Suzuki (AISTATS2017)], we developed a stochastic optimization method for DC optimization problems, where the objective function is a difference of convex functions, and demonstrated its effectiveness in training Boltzmann machines.
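
As a rough illustration of variance reduction over a finite sum, the following Python sketch implements plain SVRG for a least-squares objective; the accelerated and proximal methods cited above build on related ideas, and all constants here are illustrative assumptions.

```python
# Minimal sketch of variance reduction (plain SVRG, without the acceleration
# or proximal components of the cited methods) for a finite-sum least-squares
# objective. Data and constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.01 * rng.normal(size=n)

def grad_i(w, i):                 # gradient of the i-th summand
    return (A[i] @ w - b[i]) * A[i]

def full_grad(w):                 # gradient of the full empirical loss
    return A.T @ (A @ w - b) / n

w, eta = np.zeros(d), 0.005
for epoch in range(30):
    w_snap = w.copy()             # snapshot point
    mu = full_grad(w_snap)        # full gradient at the snapshot
    for _ in range(n):            # inner loop of cheap stochastic updates
        i = rng.integers(n)
        # variance-reduced gradient: unbiased, with variance vanishing
        # as w approaches the snapshot (and both approach the optimum)
        g = grad_i(w, i) - grad_i(w_snap, i) + mu
        w -= eta * g

print("final objective:", 0.5 * np.mean((A @ w - b) ** 2))
```

Each epoch costs one full-gradient pass plus n cheap stochastic updates, which is the finite-sum structure that the accelerated variants and the mini-batch tradeoff analysis in [Nitanda et al. (ICDM2019)] exploit.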


References

[Nitanda et al. (AISTATS2022)] Atsushi Nitanda, Denny Wu, and Taiji Suzuki. Convex Analysis of the Mean Field Langevin Dynamics. The 25th International Conference on Artificial Intelligence and Statistics (AISTATS2022), Proceedings of Machine Learning Research, 151:9741--9757, 2022. [arXiv]

[Oko et al. (ICLR2022)] Kazusato Oko, Taiji Suzuki, Atsushi Nitanda, and Denny Wu. Particle Stochastic Dual Coordinate Ascent: Exponential convergent algorithm for mean field neural network optimization. The 10th International Conference on Learning Representations (ICLR2022), 2022. [openreview]

[Nitanda and Suzuki (ICLR2021)] Atsushi Nitanda and Taiji Suzuki. Optimal Rates for Averaged Stochastic Gradient Descent under Neural Tangent Kernel Regime. The 9th International Conference on Learning Representations (ICLR2021), 2021. (outstanding paper award) [arXiv], [openreview]

[Amari et al. (ICLR2021)] Shun-ichi Amari, Jimmy Ba, Roger Grosse, Xuechen Li, Atsushi Nitanda, Taiji Suzuki, Denny Wu, and Ji Xu. When Does Preconditioning Help or Hurt Generalization? The 9th International Conference on Learning Representations (ICLR2021), 2021. [arXiv], [openreview]

[Yashima et al. (AISTATS2021)] Shingo Yashima, Atsushi Nitanda, and Taiji Suzuki. Exponential Convergence Rates of Classification Errors on Learning with SGD and Random Features. The 24th International Conference on Artificial Intelligence and Statistics (AISTATS2021), Proceedings of Machine Learning Research, 130:1954--1962, 2021. [arXiv]

[Nitanda and Suzuki (AISTATS2020)] Atsushi Nitanda and Taiji Suzuki. Functional Gradient Boosting for Learning Residual-like Networks with Statistical Guarantees. The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS2020), Proceedings of Machine Learning Research, 108:2981--2991, 2020.

[Nitanda et al. (ICDM2019)] Atsushi Nitanda, Tomoya Murata, and Taiji Suzuki. Sharp Characterization of Optimal Minibatch Size for Stochastic Finite Sum Convex Optimization. 2019 IEEE International Conference on Data Mining (ICDM2019), pp. 488--497, 2019. (regular, best paper candidate for KAIS publication) [slide]

[Nitanda and Suzuki (AISTATS2019)] Atsushi Nitanda and Taiji Suzuki. Stochastic Gradient Descent with Exponential Convergence Rates of Expected Classification Errors. The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS2019), Proceedings of Machine Learning Research, 89:1417--1426, 2019. (oral presentation) [arXiv] [slide]

[Nitanda and Suzuki (ICML2018)] Atsushi Nitanda and Taiji Suzuki. Functional Gradient Boosting based on Residual Network Perception. The 35th International Conference on Machine Learning (ICML2018), Proceedings of Machine Learning Research, 80:3819--3828, 2018. [arXiv] [code] [slide]

[Nitanda and Suzuki (AISTATS2018)] Atsushi Nitanda and Taiji Suzuki. Gradient Layer: Enhancing the Convergence of Adversarial Training for Generative Models. The 21st International Conference on Artificial Intelligence and Statistics (AISTATS2018), Proceedings of Machine Learning Research, 84:1008--1016, 2018. [arXiv]

[Nitanda and Suzuki (AISTATS2017)] Atsushi Nitanda and Taiji Suzuki. Stochastic Difference of Convex Algorithm and its Application to Training Deep Boltzmann Machines. The 20th International Conference on Artificial Intelligence and Statistics (AISTATS2017), Proceedings of Machine Learning Research, 54:470--478, 2017.

[Nitanda (AISTATS2016)] Atsushi Nitanda. Accelerated Stochastic Gradient Descent for Minimizing Finite Sums. The 19th International Conference on Artificial Intelligence and Statistics (AISTATS2016), Proceedings of Machine Learning Research, 51:195--203, 2016. [arXiv]

[Nitanda (NIPS2014)] Atsushi Nitanda. Stochastic Proximal Gradient Descent with Acceleration Techniques. In Advances in Neural Information Processing Systems 27 (NIPS2014), pp. 1574--1582, 2014.

[Nitanda et al. (NeurIPS2021)] Atsushi Nitanda, Denny Wu, and Taiji Suzuki. Particle Dual Averaging: Optimization of Mean Field Neural Networks with Global Convergence Rate Analysis. In Advances in Neural Information Processing Systems 34 (NeurIPS2021), pp. ****--****, 2021. [arXiv], [openreview]

[Nitanda and Suzuki (2017)] Atsushi Nitanda and Taiji Suzuki. Stochastic Particle Gradient Descent for Infinite Ensembles. 2017. [arXiv]