Authors: Shiyun Xu, Zhiqi Bu
We propose the generalized Newton's method (GeN) -- a Hessian-informed approach that applies to any optimizer, such as SGD and Adam, and covers the Newton-Raphson method as a sub-case. Our method automatically and dynamically selects the learning rate that accelerates convergence, without the intensive tuning of learning rate schedulers. In practice, our method is easy to implement, since it only requires additional forward passes with almost zero computational overhead (in terms of training time and memory cost) once the overhead is amortized over many iterations. We present extensive experiments on language and vision tasks (e.g. GPT and ResNet) to showcase that GeN optimizers match the state-of-the-art performance, which was achieved with carefully tuned learning rate schedulers. Code to be released at \url{https://github.com/ShiyunXu/AutoGeN}.
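For readers who want the flavor of the method, below is a minimal PyTorch sketch of the idea described above: probe the loss along the optimizer's update direction with a couple of extra forward passes, fit a one-dimensional quadratic, and take its minimizer as the learning rate. The function names, the loss_fn(model, batch) signature, and the three-point probing scheme are illustrative assumptions, not the released AutoGeN API.

```python
import torch

@torch.no_grad()
def estimate_lr(model, loss_fn, batch, direction, eta0=1e-3):
    # Hypothetical sketch (not the released AutoGeN API): probe the loss at
    # eta = 0, eta0, 2*eta0 along the optimizer's update `direction`
    # (a list of tensors, one per trainable parameter) and fit a quadratic.
    params = [p for p in model.parameters() if p.requires_grad]

    def probe(scale):
        # temporarily take the step theta - scale * direction, record the loss, undo it
        for p, d in zip(params, direction):
            p.add_(d, alpha=-scale)
        loss = loss_fn(model, batch).item()
        for p, d in zip(params, direction):
            p.add_(d, alpha=scale)
        return loss

    l0, l1, l2 = probe(0.0), probe(eta0), probe(2 * eta0)
    # fit L(eta) ~ a*eta^2 + b*eta + c through the three probed losses
    a = (l2 - 2 * l1 + l0) / (2 * eta0 ** 2)
    b = (4 * l1 - 3 * l0 - l2) / (2 * eta0)
    if a <= 0:                      # no convex curvature along this direction
        return eta0                 # fall back to the probing step size
    return max(-b / (2 * a), 0.0)   # minimizer of the fitted quadratic
```

In training, such an estimate would be refreshed only every few iterations, which is how the probing cost amortizes to near zero.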
Authors: Zhiqi Bu, Shiyun Xu, Ian J. Barnett
Differential learning rate (DLR), a technique that applies different learning rates (instead of a single one) to different model parameters, has been widely used in deep learning and has achieved empirical success in its various forms. For example, parameter-efficient training (PET) applies zero learning rates to most parameters so as to significantly save the computational cost; adaptive optimizers such as Adam apply coordinate-wise learning rates to accelerate convergence.
At its core, DLR leverages the observation that different parameters can have different loss curvatures, which are hard to characterize in general. We propose the Hessian-informed differential learning rate (Hi-DLR), an efficient approach that adaptively captures the loss curvature of the parameters for any model and optimizer. Given a proper grouping of parameters, we empirically demonstrate that Hi-DLR can improve convergence by dynamically determining the learning rates during training. Furthermore, we can quantify the influence of different parameters and freeze the less-contributing ones, which leads to a new PET method that automatically adapts to various tasks and models.
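Under the same curvature-probing idea as the GeN sketch above, a hypothetical per-group version might look as follows; it ignores cross-group interactions (a block-diagonal assumption), and the grouping interface is illustrative, not the paper's implementation.

```python
import torch

@torch.no_grad()
def groupwise_lrs(model, loss_fn, batch, group_dirs, eta0=1e-3):
    # Hypothetical sketch of per-group learning rates: `group_dirs` maps a
    # group name to a list of (parameter, update) pairs; each group's loss
    # curvature is probed independently (a block-diagonal assumption).
    lrs = {}
    for name, pairs in group_dirs.items():
        def probe(scale):
            for p, d in pairs:
                p.add_(d, alpha=-scale)
            loss = loss_fn(model, batch).item()
            for p, d in pairs:
                p.add_(d, alpha=scale)
            return loss

        l0, l1, l2 = probe(0.0), probe(eta0), probe(2 * eta0)
        a = (l2 - 2 * l1 + l0) / (2 * eta0 ** 2)
        b = (4 * l1 - 3 * l0 - l2) / (2 * eta0)
        # a group whose fitted improvement is negligible could be frozen
        # (learning rate 0), mirroring the PET behavior described above
        lrs[name] = eta0 if a <= 0 else max(-b / (2 * a), 0.0)
    return lrs
```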
Authors: Dan Kushnir, Shiyun Xu
Past years have witnessed the fast and thorough development of active learning, a human-in-the-loop semi-supervised learning paradigm that helps reduce the burden of expensive data annotation. Diverse techniques have been proposed to improve the efficiency of label acquisition. However, the existing techniques are mostly intractable at scale on massive sets of unlabeled instances. In particular, the query time and model retraining time of large-scale image models are usually linear, or even quadratic, in the size of the unlabeled pool and its dimension. The main reason for this intractability is the need to scan the entire pool at least once per iteration in order to select the best samples for annotation. To alleviate this computational burden, we propose efficient Diffusion Graph Active Learning (DGAL). DGAL operates on a pre-computed variational autoencoder (VAE) latent space to restrict the pool to a much smaller candidate set. The restricted set is then passed to deep architectures, via an additional standard active learning criterion, to reduce the query time. DGAL demonstrates a query-time-versus-accuracy trade-off that is two or more orders of magnitude faster than state-of-the-art methods. Moreover, we demonstrate the important exploration-exploitation trade-off in DGAL that allows the restricted set to capture the most impactful samples for active learning at each iteration.
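A rough, hypothetical sketch of the pool-restriction step is below: build a k-NN graph in a precomputed VAE latent space, diffuse the known labels over the graph, and keep only the lowest-margin unlabeled points as the candidate set handed to the downstream active learning criterion. The hyperparameters and the margin-based scoring are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import kneighbors_graph

def diffusion_candidates(z, labeled_idx, y_labeled, n_classes, k=10,
                         alpha=0.9, n_iter=20, n_candidates=500):
    # Hypothetical sketch: restrict the pool via label diffusion on a k-NN
    # graph built over the precomputed latent codes z (n x d).
    W = kneighbors_graph(z, k, mode="connectivity", include_self=False)
    W = 0.5 * (W + W.T)                              # symmetric affinity graph
    d = np.asarray(W.sum(axis=1)).ravel()
    P = sp.diags(1.0 / np.maximum(d, 1e-12)) @ W     # random-walk transition matrix

    # seed label mass on the labeled nodes and diffuse it over the graph
    F = np.zeros((z.shape[0], n_classes))
    F[labeled_idx, y_labeled] = 1.0
    Y0 = F.copy()
    for _ in range(n_iter):
        F = alpha * (P @ F) + (1 - alpha) * Y0

    # low-margin (uncertain) unlabeled nodes form the restricted candidate set
    Fs = np.sort(F, axis=1)
    margins = Fs[:, -1] - Fs[:, -2]
    margins[labeled_idx] = np.inf
    return np.argsort(margins)[:n_candidates]
```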
Authors: Jialin Mao, Zhiqi Bu, Shiyun Xu
Large convolutional neural networks (CNNs) can be difficult to train in the differentially private (DP) regime, since the optimization algorithms require a computationally expensive operation known as per-sample gradient clipping. We propose an efficient and scalable implementation of this clipping on convolutional layers, termed mixed ghost clipping, that significantly eases private training in terms of both time and space complexity, without affecting the accuracy. The improvement in efficiency is rigorously studied through the first complexity analysis of mixed ghost clipping and existing DP training algorithms. Extensive experiments on vision classification tasks with large ResNet, VGG, and Vision Transformer models demonstrate that DP training with mixed ghost clipping adds only a small memory overhead and slowdown to standard non-private training. Specifically, when training VGG19 on CIFAR10, mixed ghost clipping is faster than the state-of-the-art Opacus library and supports a larger maximum batch size. To emphasize the significance of efficient DP training on convolutional layers, we achieve 96.7% accuracy on CIFAR10 and 83.0% on CIFAR100 under a strict privacy budget using BEiT, while the previous best results are 94.8% and 67.4%, respectively. We open-source a privacy engine (https://github.com/woodyx218/private_vision) that implements DP training of CNNs with a few lines of code.
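To illustrate why clipping can avoid materializing per-sample gradients at all, here is a minimal sketch of the underlying "ghost norm" identity for a linear layer; the convolutional case applies the same idea to the unfolded input, and the mixed method picks, layer by layer, whichever of the two clipping strategies is cheaper. The helper names below are illustrative, not the private_vision API.

```python
import torch

def ghost_grad_norms(activations, grad_outputs):
    # Sketch of the ghost-norm trick for a linear layer y = a @ W.T:
    # the per-sample gradient of W is g_i a_i^T, whose Frobenius norm is
    # ||g_i|| * ||a_i||, so the B x d_out x d_in per-sample gradients
    # never need to be materialized.
    return (activations.pow(2).sum(dim=1) * grad_outputs.pow(2).sum(dim=1)).sqrt()

def per_sample_clip_factors(grad_norms, max_norm=1.0):
    # Per-sample clipping factors min(1, C / ||grad_i||) used in DP-SGD.
    return (max_norm / (grad_norms + 1e-6)).clamp(max=1.0)
```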
Authors: Shiyun Xu, Zhiqi Bu, Pratik Chaudhari, Ian J. Barnett
Interpretable machine learning has demonstrated impressive performance while preserving explainability. In particular, neural additive models (NAM) bring interpretability to black-box deep learning and achieve state-of-the-art accuracy among the large family of generalized additive models. In order to empower NAM with feature selection and improve generalization, we propose sparse neural additive models (SNAM) that employ group sparsity regularization (e.g. Group LASSO), where each feature is learned by a sub-network whose trainable parameters are clustered as a group. Specifically, we show that SNAM with subgradient and proximal gradient descent provably converges to zero training loss as t→∞, and that the estimation error of SNAM vanishes asymptotically as n→∞. We also prove that SNAM, similar to LASSO, can achieve exact support recovery, i.e. perfect feature selection, with appropriate regularization. Moreover, we show that SNAM generalizes well and preserves `identifiability', recovering each feature's effect. We validate our theories via extensive experiments and further demonstrate the good accuracy and efficiency of SNAM.
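A minimal PyTorch sketch of the model class described above follows: one small sub-network per feature, with a Group LASSO penalty whose groups are exactly the sub-networks' parameters, so that an entire feature can be zeroed out. Architecture sizes and names are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SNAM(nn.Module):
    # Sketch of a sparse neural additive model: one sub-network per feature,
    # with a Group LASSO penalty grouping each sub-network's parameters.
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.subnets = nn.ModuleList([
            nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(n_features)
        ])
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):                      # x: (batch, n_features)
        contribs = [net(x[:, [j]]) for j, net in enumerate(self.subnets)]
        return torch.cat(contribs, dim=1).sum(dim=1) + self.bias

    def group_lasso_penalty(self):
        # one group per feature: l2 norm of that sub-network's parameters
        return sum(
            torch.cat([p.flatten() for p in net.parameters()]).norm(p=2)
            for net in self.subnets
        )
```

Training then minimizes the data loss plus a multiple of this penalty, via the subgradient or proximal gradient methods analyzed in the paper.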
Authors: Kan Chen, Zhiqi Bu, Shiyun Xu
Sparse Group LASSO (SGL) is a regularized model for high-dimensional linear regression problems with grouped covariates. SGL applies L1 and L2 penalties on the individual predictors and group predictors, respectively, to guarantee sparse effects both on the inter-group and within-group levels. In this paper, we apply the approximate message passing (AMP) algorithm to efficiently solve the SGL problem under Gaussian random designs. We further use the recently developed state evolution analysis of AMP to derive an asymptotically exact characterization of the SGL solution. This allows us to conduct multiple fine-grained statistical analyses of SGL, through which we investigate the effects of the group information and of α (the proportion of the L1 penalty). Through the lens of various performance measures, we show that SGL with small α benefits significantly from the group information and can outperform other SGL variants (including LASSO, which corresponds to α = 1) and regularized models that do not exploit the group information, in terms of the signal recovery rate, false discovery rate, and mean squared error.
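For reference, the SGL estimator in its standard textbook form is

\[
\hat{\beta} = \arg\min_{\beta}\; \frac{1}{2n}\,\lVert y - X\beta\rVert_2^2
+ \lambda\Big(\alpha\,\lVert\beta\rVert_1 + (1-\alpha)\sum_{g=1}^{G}\sqrt{p_g}\,\lVert\beta_g\rVert_2\Big),
\]

where α ∈ [0, 1] is the proportion of the L1 penalty, β_g is the coefficient sub-vector of group g, and p_g is the size of group g; the exact weighting used in the paper may differ slightly from this standard form.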
Authors: Shiyun Xu, Zhiqi Bu
We incorporate over-parameterized neural networks into semi-parametric models to bridge the gap between inference and prediction, especially in high-dimensional linear problems. We establish the theoretical foundations that make this possible and demonstrate them with numerical experiments. Furthermore, we propose a framework, DebiNet, in which we plug arbitrary feature selection methods into our semi-parametric neural network and illustrate that our framework debiases regularized estimators and performs well in terms of post-selection inference and generalization error. PyTorch code is available on GitHub.
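The plug-in pattern described above can be sketched as a two-stage procedure, shown below under simplifying assumptions: an arbitrary feature selector (here LassoCV) picks a support, and a semi-parametric model keeps the selected features linear while a small neural network absorbs the remaining signal. The actual DebiNet architecture, training schedule, and debiasing/inference steps follow the paper; this is only an illustration of the interface.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import LassoCV

def plugin_semiparametric_fit(X, y, hidden=64, epochs=200, lr=1e-2):
    # Hypothetical sketch, not the actual DebiNet implementation.
    support = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)      # stage 1: selection
    Xs = torch.tensor(X[:, support], dtype=torch.float32)        # selected features
    Xc = torch.tensor(np.delete(X, support, axis=1), dtype=torch.float32)
    yt = torch.tensor(y, dtype=torch.float32)

    beta = nn.Linear(Xs.shape[1], 1, bias=True)                  # linear part
    g = nn.Sequential(nn.Linear(Xc.shape[1], hidden), nn.ReLU(),
                      nn.Linear(hidden, 1))                      # nonparametric part
    opt = torch.optim.Adam(list(beta.parameters()) + list(g.parameters()), lr=lr)
    for _ in range(epochs):                                      # stage 2: joint fit
        opt.zero_grad()
        loss = ((beta(Xs) + g(Xc)).squeeze(1) - yt).pow(2).mean()
        loss.backward()
        opt.step()
    return support, beta, g
```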
Authors: Zhiqi Bu, Shiyun Xu, Kan Chen
We show from a dynamical systems perspective that, when training an over-parameterized neural network, the Heavy Ball (HB) method can converge to the global minimum of the mean squared error (MSE) at a linear rate (similar to gradient descent), whereas Nesterov accelerated gradient descent (NAG) only converges to the global minimum sublinearly. Our results rely on the neural tangent kernel (NTK) and an analysis of the limiting ordinary differential equations (ODEs) of the optimization algorithms. We show that optimizing the non-convex loss over the weights corresponds to optimizing a strongly convex loss over the prediction error. PyTorch code is available on GitHub.
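The contrast in rates can be read off from the standard limiting ODEs of the two methods (constant damping for HB, the vanishing 3/t damping of Su-Boyd-Candès for NAG); the paper's exact ODEs, derived through the NTK, may carry additional problem-specific terms:

\[
\text{HB:}\;\; \ddot{x}(t) + a\,\dot{x}(t) + \nabla f\big(x(t)\big) = 0,
\qquad
\text{NAG:}\;\; \ddot{x}(t) + \frac{3}{t}\,\dot{x}(t) + \nabla f\big(x(t)\big) = 0,
\]

where the constant friction a > 0 yields geometric (linear-rate) decay on strongly convex objectives, while the vanishing friction only yields a sublinear rate.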
Authors: Shiyun Xu, Ian Barnett
L1 regularization is a common approach to sparsifying and compressing a neural network. We apply multiple variants of the L1 penalty, including SLOPE, Group LASSO, and Sparse Group LASSO, to deep learning and investigate the sparsity-accuracy trade-offs on different network structures. In particular, we leverage AUC-ROC to analyze the effect of feature selection under the different regularized neural networks. A detailed review of traditional regularization, its application to neural networks, and variable importance methods is provided in the paper. PyTorch code will appear on GitHub.
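As a concrete example of one of these variants, here is a minimal sketch of the SLOPE (sorted-L1) penalty added to a training loss; the layer and the λ sequence in the commented usage are hypothetical.

```python
import torch

def slope_penalty(weight, lambdas):
    # SLOPE (sorted-L1) penalty: sort |w| in decreasing order and pair it
    # with a non-increasing lambda sequence, so larger weights are penalized more.
    absw, _ = weight.abs().flatten().sort(descending=True)
    return (lambdas * absw).sum()

# Illustrative use inside a training step (model/criterion are assumed to exist):
#   lam = torch.linspace(1e-3, 1e-4, steps=model.fc1.weight.numel())
#   loss = criterion(model(x), y) + slope_penalty(model.fc1.weight, lam)
```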
Authors: Shiyun Xu, Menglin Shao, Wenxuan Qiao, Pengjian Shang
We construct two AIC-based methods, either by replacing the variance with higher-order moments or by introducing Tsallis entropy. We develop a model that creates an AIC plane to account for the volatility behavior of time series. Experiments on stock price data from mainland China, Hong Kong, and the United States demonstrate that our model can effectively differentiate multi-scale volatility.
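A minimal, assumption-laden sketch of the ingredients is below: a histogram-based Tsallis entropy of a return series, and a generic Gaussian-style AIC in which the usual variance proxy can be swapped for a higher-order moment or an entropy term. The exact construction of the AIC plane follows the paper; this is only illustrative.

```python
import numpy as np

def tsallis_entropy(x, q=1.5, bins=30):
    # Tsallis entropy S_q = (1 - sum_i p_i^q) / (q - 1) of the empirical
    # (histogram) distribution of a series; q -> 1 recovers Shannon entropy.
    p, _ = np.histogram(x, bins=bins)
    p = p[p > 0] / p.sum()
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

def aic_like(volatility_proxy, n, k):
    # Generic Gaussian-style AIC, n*ln(proxy) + 2k; the paper's variants swap
    # the usual variance proxy for higher-order moments or a Tsallis-entropy term.
    return n * np.log(volatility_proxy) + 2 * k
```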