Sparse Neural Networks Training

Tutorial Organizers:

  1. Shiwei Liu (Eindhoven University of Technology)

  2. Ghada Sokar (Eindhoven University of Technology)

  3. Zahra Atashgahi (University of Twente)

  4. Decebal Constantin Mocanu (University of Twente, Eindhoven University of Technology)

  5. Elena Mocanu (University of Twente)

Abstract:

Motivated by the success of GPT-3, a trillion-parameter model race appears to be taking shape, drawing in more technology giants with significant investment. Alongside the increasingly strong results, the resources required to train and deploy these massive models are prohibitive. While sparse neural networks have been widely used to substantially reduce the computational demands of inference, researchers have recently begun to investigate techniques to train intrinsically sparse neural networks from scratch in order to accelerate training (sparse training). As a relatively new avenue, sparse training has received surging attention and is quickly evolving into a universal approach that has demonstrated strong results across a wide variety of architectures. This tutorial aims to give a comprehensive discussion of sparsity in neural network training. We first revisit the existing approaches for obtaining sparse neural networks from the perspective of the accuracy-efficiency trade-off. We then dig into the performance of sparse neural network training across different machine learning paradigms, including supervised learning, unsupervised learning, and reinforcement learning, considering both single-task and continual learning. Finally, we point out the current challenges of sparse neural network training at scale and promising future directions.


Detailed Program, Location:

Location: ECMLPKDD 2022, Grenoble, France

Date: Friday morning, 23 September 2022

Program: the outline is given in the Description below.

Description:

Scope of the Tutorial: The goal of this tutorial is to offer a comprehensive presentation of sparsity in neural network training, including the basic concepts of sparse neural networks, representative approaches in various fields, and challenges and future directions. Sparse neural network training is a rapidly expanding research topic in Machine Learning (ML), as modern neural networks are computationally expensive to train and exploit. For instance, GPT-3, a state-of-the-art model with 175 billion parameters, would require approximately 36 years to train with 8 V100 GPUs, or seven months with 512 V100 GPUs, assuming perfect data-parallel scaling. Therefore, investigating the use of sparsity to accelerate the training of neural networks is important. This tutorial focuses on presenting the fundamental principles of sparse neural network training, its state-of-the-art methods and algorithms, its synergy with the three main machine learning paradigms (i.e., supervised, unsupervised, and reinforcement learning), and its applicability to a wide range of theoretical and applied research and engineering fields. The tutorial also considers the potential of sparse training to reduce the high energy costs associated with currently widely used deep learning techniques, contributing toward greener Artificial Intelligence (AI). A broad overview of sparse training for both researchers and practitioners will be provided. The tutorial content is briefly described below.
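
The parallel-scaling arithmetic behind these two figures can be checked directly; under the stated assumption of perfect data-parallel scaling, the total GPU-time stays constant:

    \[
      36\,\text{years} \times \frac{8\,\text{GPUs}}{512\,\text{GPUs}}
      = \frac{288\,\text{GPU-years}}{512\,\text{GPUs}}
      \approx 0.56\,\text{years} \approx 6.8\,\text{months}.
    \]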

Preliminaries: From Dense-to-Sparse to Sparse-to-Sparse Training: Sparsity, referring to the proportion of weights, filters (neurons), etc., that are zero-valued in a neural network, is one of the most commonly used techniques to accelerate the training and inference of modern neural networks, since the memory and computation (i.e., additions and multiplications) associated with the zero-valued elements can be saved. We categorize sparsity-inducing techniques into dense-to-sparse training and sparse-to-sparse training, based on whether or not the sparse network is inherited from a dense network. We first introduce the basic concepts of dense-to-sparse and sparse-to-sparse training and compare the two categories from the perspective of the performance-efficiency trade-off. We then introduce in detail the first works on sparse-to-sparse training -- Complex Boltzmann Machines and Sparse Evolutionary Training (SET). SET directly trains sparse neural networks from scratch while dynamically adjusting the sparse connectivity during training to fit the data distribution with a fixed, small parameter budget. By jointly optimizing the model parameters and exploring the sparse connectivity throughout the entire training process, SET achieves stronger supervised learning results than naively training sparse neural networks with fixed sparse connectivity, and can even outperform dense neural network training.
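
To make the fixed-budget prune-and-regrow idea concrete, the following is a minimal NumPy sketch of one SET-style topology update on a single weight matrix. The rewiring fraction zeta follows the SET paper, but the purely magnitude-based removal criterion and all function and variable names are simplifications chosen for illustration, not the authors' reference implementation.

    import numpy as np

    def set_rewire(weights, mask, zeta=0.3, rng=None):
        """One SET-style topology update: prune the zeta fraction of weakest active
        connections, then regrow the same number at random inactive positions."""
        if rng is None:
            rng = np.random.default_rng(0)
        w, m = weights.ravel(), mask.ravel()               # views into the same memory
        active = np.flatnonzero(m)
        n_rewire = int(zeta * active.size)
        # 1) prune: drop the active connections with the smallest magnitude
        weakest = active[np.argsort(np.abs(w[active]))[:n_rewire]]
        m[weakest] = 0
        w[weakest] = 0.0
        # 2) regrow: activate the same number of random inactive connections,
        #    so the parameter budget stays fixed throughout training
        reborn = rng.choice(np.flatnonzero(m == 0), size=n_rewire, replace=False)
        m[reborn] = 1
        w[reborn] = rng.normal(0.0, 0.01, size=n_rewire)   # fresh small initialization
        return weights, mask

    # usage: apply after each training epoch; the number of non-zero weights is constant
    rng = np.random.default_rng(42)
    W = rng.normal(size=(256, 128))
    M = (rng.random(W.shape) < 0.1).astype(np.int8)        # roughly 90% sparse layer
    W *= M
    budget = int(M.sum())
    W, M = set_rewire(W, M, zeta=0.3, rng=rng)
    assert int(M.sum()) == budget                          # parameter budget preserved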

Sparse Training meets Supervised Learning (SL): Following the basic principles pioneered by SET, many advanced approaches have been proposed to strengthen the performance of sparse neural network training. We comprehensively revisit these approaches to training intrinsically sparse neural networks in supervised learning, highlight their technical novelties, and interpret them from the perspective of In-Time Over-Parameterization (ITOP). ITOP, defined as a thorough exploration of the parameter space during training, is one of the keys to understanding the success of sparse training and to matching or outperforming dense network training. Further on, we briefly discuss the use of sparse training in other important single-task learning areas, including recurrent neural networks, generative adversarial networks, and deep ensembles.
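
The exploration rate at the heart of ITOP can be monitored with a few lines of bookkeeping: simply OR each successive connectivity mask into a running "ever active" map. The sketch below is an illustration with names of our own choosing, not code from the ITOP paper.

    import numpy as np

    class ExplorationTracker:
        """Tracks the fraction of all possible connections activated at least once."""
        def __init__(self, layer_shape):
            self.ever_active = np.zeros(layer_shape, dtype=bool)

        def update(self, mask):
            # call after every prune-and-regrow step with the layer's current binary mask
            self.ever_active |= mask.astype(bool)

        def rate(self):
            # approaches 1 when training explores (almost) the whole parameter space
            # in time, even though the network is sparse at every individual step
            return float(self.ever_active.mean())

    # usage: feed the evolving mask after each topology update (e.g., from the SET sketch above)
    tracker = ExplorationTracker((256, 128))
    rng = np.random.default_rng(0)
    for _ in range(50):
        mask = rng.random((256, 128)) < 0.1     # stand-in for a rewired 90%-sparse mask
        tracker.update(mask)
    print(f"explored {tracker.rate():.0%} of all possible connections")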

Sparse Training meets Reinforcement Learning (RL): Current advances in Deep Reinforcement Learning (DRL) have been obtained mainly with dense neural networks and, occasionally, with dense-to-sparse training; the effectiveness of sparse training in reinforcement learning was demonstrated only in late 2021. We discuss the challenges of introducing sparse training into DRL, which stem from the non-stationarity of the data during training, and how these challenges can be addressed. In addition, we present the desirable properties obtained by bringing sparse training to the DRL paradigm, including improved sample efficiency, higher performance, and the ability to deploy DRL agents on low-resource devices.

Sparse Training meets Unsupervised Learning (UL): This part of the tutorial focuses on the sparse training of artificial neural networks in unsupervised settings. Since providing annotations is laborious in many real-world problems, e.g., healthcare, unsupervised learning is of great interest in such applications. We start with a brief overview of unsupervised learning in dense neural networks. We then explain how sparse training can be exploited to perform unsupervised learning more efficiently, in terms of computational resources during both training and inference, while maintaining accuracy comparable to that of dense equivalents. We mainly review the recent literature on unsupervised sparse training, including sparse training of Restricted Boltzmann Machines (RBMs), the pioneering work that introduced the idea of training sparse neural networks from scratch and showed performance competitive with dense training, as well as unsupervised feature selection using sparsely trained autoencoders. We also discuss these methods from the concrete perspectives of energy efficiency and wall-clock running time.
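
As a simplified illustration of how a sparsely trained autoencoder can drive unsupervised feature selection, the sketch below ranks each input neuron by the strength (summed absolute magnitude) of its surviving encoder connections and keeps the top k. This mirrors the neuron-strength idea of the sparse-autoencoder feature-selection work cited in the references, but it is not a faithful reimplementation of that method.

    import numpy as np

    def select_features(encoder_weights, encoder_mask, k):
        """encoder_weights, encoder_mask: (n_inputs, n_hidden) arrays of a sparse encoder."""
        strength = np.abs(encoder_weights * encoder_mask).sum(axis=1)  # per input feature
        return np.argsort(strength)[::-1][:k]                          # k strongest inputs

    # usage on a toy sparse encoder (e.g., 784 input pixels, 200 hidden units, ~95% sparse)
    rng = np.random.default_rng(0)
    W_enc = rng.normal(size=(784, 200))
    M_enc = rng.random(W_enc.shape) < 0.05
    selected = select_features(W_enc, M_enc, k=50)
    print(selected[:10])    # indices of the ten strongest input features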

Sparse Training: From a Single Task to Continual Learning (CL): Continual learning (CL) aims to build artificial intelligent agents that can learn multiple tasks sequentially, accumulating past knowledge and using it to help future learning. This ability is considered necessary for artificial general intelligence (AGI). One of the main challenges in this paradigm is catastrophic forgetting: the network forgets previously learned tasks when it learns a new one. In this tutorial, we discuss how sparse training helps to address this challenge using the SpaceNet method. SpaceNet dynamically trains sparse neural networks from scratch to produce sparse representations for each task. These sparse representations are crucial for reducing interference between tasks and, hence, forgetting. In addition, using a sparse subnetwork for each task enables building memory- and computation-efficient agents that can scale to a large number of tasks.
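
The following sketch is a generic illustration of why task-specific sparse subnetworks limit interference: during task t, only the connections reserved for task t, and not owned by any earlier task, receive gradient updates. It illustrates the general principle only and is not the SpaceNet algorithm itself, which in addition adaptively evolves and compacts the sparse topology during training.

    import numpy as np

    def masked_sgd_step(weights, grads, task_masks, task_id, lr=0.01):
        """task_masks: list of binary masks over the same weight matrix, one per task."""
        frozen = np.zeros(weights.shape, dtype=bool)
        for t in range(task_id):                     # connections owned by earlier tasks
            frozen |= task_masks[t].astype(bool)
        trainable = task_masks[task_id].astype(bool) & ~frozen
        weights -= lr * grads * trainable            # updates outside the subnetwork are zeroed
        return weights

    # usage: two tasks sharing one layer, each confined to its own sparse subnetwork
    rng = np.random.default_rng(0)
    W = np.zeros((64, 32))
    masks = [rng.random(W.shape) < 0.05 for _ in range(2)]
    W = masked_sgd_step(W, rng.normal(size=W.shape), masks, task_id=1)
    assert np.all(W[masks[0]] == 0)                  # task-0 connections left untouched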

Open challenges: The tutorial gradually introduces the theoretical details of sparse training (pruning, simultaneous training and pruning, one-shot pruning, static sparse training, and dynamic sparse training with random or gradient-based regrowth) and discusses the state of the art of sparse training in SL, UL, and RL. We then show that a smooth transition from single-task learning to continual learning is possible. In this part, we present the challenges and limitations of sparse training, while introducing a couple of new theoretical research directions that have the potential to alleviate these limitations and push deep learning scalability well beyond its current boundaries. We believe that the way to move forward is to study these challenges as a whole within the research community. Consequently, instead of a concluding phrase, we invite researchers to brainstorm with us, recalling an arguable quote from Aristotle's Metaphysics: "The whole is more than the sum of its parts".

Selected References:

  1. Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H. Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science, Nature Communications, 9(1), pp.1-12, 2018. (Preprint arXiv:1707.04780, July 2017).

  2. Shiwei Liu, Tim Van der Lee, Anil Yaman, Zahra Atashgahi, Davide Ferraro, Ghada Sokar, Mykola Pechenizkiy, and Decebal Constantin Mocanu. Topological insights into sparse neural networks. In ECMLPKDD, pp. 279-294. Springer, Cham, 2020.

  3. Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring: Training very sparse deep networks. In International Conference on Learning Representations (ICLR), 2018.

  4. Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning (ICML), pp. 4646-4655, PMLR, 2019.

  5. Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning (ICML), pp. 2943-2952. PMLR, 2020.

  6. Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance. arXiv:1907.04840, 2019.

  7. Shiwei Liu, Tianlong Chen, Xiaohan Chen, Zahra Atashgahi, Lu Yin, Huanyu Kou, Li Shen, Mykola Pechenizkiy, Zhangyang Wang, and Decebal Constantin Mocanu. Sparse training via boosting pruning plasticity with neuroregeneration. Advances in Neural Information Processing Systems (NeurIPS) 34, 2021.

  8. Siddhant Jayakumar, Razvan Pascanu, Jack Rae, Simon Osindero, and Erich Elsen. Top-kast: Top-k always sparse training. In Advances in Neural Information Processing Systems (NeurIPS) 33, pp.20744-20754, 2020.

  9. Shiwei Liu, Lu Yin, Decebal Constantin Mocanu, and Mykola Pechenizkiy. Do we actually need dense over-parameterization? in-time over-parameterization in sparse training. In International Conference on Machine Learning (ICML), pp. 6989-7000. PMLR, 2021.

  10. Ghada Sokar, Elena Mocanu, Decebal Constantin Mocanu, Mykola Pechenizkiy, and Peter Stone. Dynamic Sparse Training for Deep Reinforcement Learning. arXiv:2106.04217, 2021.

  11. Decebal Constantin Mocanu, Elena Mocanu, Phuong H. Nguyen, Madeleine Gibescu, and Antonio Liotta. A topological insight into restricted Boltzmann machines. Machine Learning (ECML-PKDD Journal Track), 104(2), pp. 243-270, 2016.

  12. Zahra Atashgahi, Ghada Sokar, Tim van der Lee, Elena Mocanu, Decebal Constantin Mocanu, Raymond Veldhuis, and Mykola Pechenizkiy. Quick and robust feature selection: the strength of energy-efficient sparse training for autoencoders. Machine Learning (ECML-PKDD 2022 Journal Track), pp. 1-38, 2022.

  13. Ghada Sokar, Decebal Constantin Mocanu, and Mykola Pechenizkiy. SpaceNet: Make free space for continual learning. Neurocomputing 439, pp. 1-11, 2021.

  14. Shiwei Liu, Decebal Constantin Mocanu, Yulong Pei, and Mykola Pechenizkiy. Selfish sparse RNN training. In International Conference on Machine Learning (ICML), pp. 6893-6904. PMLR, 2021.

  15. Shiwei Liu, Tianlong Chen, Zahra Atashgahi, Xiaohan Chen, Ghada Sokar, Elena Mocanu, Mykola Pechenizkiy, Zhangyang Wang, and Decebal Constantin Mocanu. Deep Ensembling with No Overhead for either Training or Testing: The All-Round Blessings of Dynamic Sparsity. In International Conference on Learning Representations (ICLR), 2022.

  16. Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, and Zhangyang Wang. Chasing sparsity in vision transformers: An end-to-end exploration. Advances in Neural Information Processing Systems (NeurIPS) 34, 2021.

  17. Decebal Constantin Mocanu, Elena Mocanu, Tiago Pinto, Selima Curci, Phuong H Nguyen, Madeleine Gibescu, Damian Ernst, Zita Vale. Sparse Training Theory for Scalable and Efficient Agents, 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2021.

  18. Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems (NIPS), pp. 598–605, 1990.

  19. Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems (NIPS), pp. 1135–1143, 2015.

  20. Christos Louizos, Max Welling, and Diederik P Kingma. Learning Sparse Neural Networks through $L_0$ Regularization. In International Conference on Learning Representations (ICLR), 2018.

  21. Junjie Liu, Zhe Xu, Runbin Shi, Ray C. C. Cheung, and Hayden K.H. So. Dynamic Sparse Training: Find Efficient Sparse Network From Scratch With Trainable Masked Layers. In International Conference on Learning Representations (ICLR), 2020.

  22. Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations (ICLR), 2019.

  23. Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking Winning Tickets Before Training by Preserving Gradient Flow. In International Conference on Learning Representations (ICLR), 2020.

  24. Shunshi Zhang and Bradly C. Stadie. One-Shot Pruning of Recurrent Neural Networks by Jacobian Spectrum Evaluation. In International Conference on Learning Representations (ICLR), 2020.

  25. Jeremy Kepner and Ryan Robinett. Radix-net: Structured sparse matrices for deep neural networks. In 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 268-274. IEEE, 2019.

  26. Jonathan Frankle and Michael Carbin. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In International Conference on Learning Representations (ICLR), 2019.

  27. Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks. Journal of Machine Learning Research, 22(241), pp. 1-124, 2021.

  28. Sara Hooker. The Hardware Lottery. Communications of the ACM, 64(12), pp 58–65, 2021.

  29. Haonan Yu, Sergey Edunov, Yuandong Tian, and Ari S Morcos. Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP. In International Conference on Learning Representations (ICLR), 2020.

  30. Marc Aurel Vischer, Robert Tjarko Lange, and Henning Sprekeler. On Lottery Tickets and Minimal Task Representations in Deep Reinforcement Learning. In International Conference on Learning Representations (ICLR), 2021.

Organizers Short Bios:

  1. Shiwei Liu is a postdoc at the Department of Mathematics and Computer Science, Eindhoven University of Technology (TU/e). His research interests cover sparsity in neural networks, sparse training, computer vision, and deep ensembles. During his Ph.D., he produced over 15 publications on sparse neural networks, including at top conferences such as ICLR, ICML, NeurIPS, and ECMLPKDD. He has served as a PC member of top-tier conferences, including ICLR, ICML, NeurIPS, AISTATS, ECMLPKDD, and CVPR, and as an area chair at ICIP 2022. He will join the Institute for Foundations of Machine Learning at UT Austin as a postdoc in September 2022. Email address: s.liu3@tue.nl.

  2. Ghada Sokar is a Ph.D. candidate at the Department of Mathematics and Computer Science, Eindhoven University of Technology (TU/e). Her research focuses on continual learning, sparse neural networks, and reinforcement learning. She has publications on continual learning and on sparsity for reinforcement learning, feature selection, and deep ensembles in journals and conferences such as Neurocomputing, Machine Learning, AAMAS, ICLR, and ECMLPKDD. She is a member of the diversity and inclusion committee of ContinualAI. She has served as a PC member of top-tier conferences, including ICLR, ICML, and AAMAS. Email address: g.a.z.n.sokar@tue.nl.

  3. Zahra Atashgahi is a Ph.D. candidate at the Department of Electrical Engineering, Mathematics and Computer Science (EEMCS), University of Twente, The Netherlands. Her research interests include cost-effective artificial neural networks, sparse neural networks, feature selection, ensemble learning, time series analysis, and healthcare. She has publications on sparse neural networks in journals and conferences, such as Machine Learning, ECMLPKDD, ICLR, and NeurIPS. She has served as a PC member of conferences and workshops, including ICML and ICBINB. Email address: z.atashgahi@utwente.nl.

  4. Decebal Constantin Mocanu is an Assistant Professor in artificial intelligence and machine learning at the University of Twente, the Netherlands, a Guest Assistant Professor at Eindhoven University of Technology (TU/e), and an alumni member of the TU/e Young Academy of Engineering. During his PhD (completed in June 2017 at TU/e) and afterwards, Decebal worked on the importance of connections and nodes in complex networks, generative replay, online learning, multitask learning, and static and adaptive sparse connectivity in neural networks. His short-term research interest is to conceive scalable deep artificial neural network models and their corresponding learning algorithms using principles from network science, evolutionary computing, optimization, and neuroscience.

  5. Elena Mocanu is an Assistant Professor in the Department of Computer Science at the University of Twente, the Netherlands. She received her PhD in machine learning from Eindhoven University of Technology in 2017. During her PhD, she visited the University of Texas at Austin, where she worked with Michael Webber and Peter Stone on machine learning, decision making, and autonomous systems by means of sparse neural networks. As a mathematician with a great passion for neural networks, her current research focuses on understanding neural networks and how their learning capabilities can be improved.