13:30 - 14:00
14:00 - 14:15
Olga Saukh
14:15 - 14:40
Training modern neural networks is time-consuming, expensive, and energy-intensive. As neural network architectures double in size every few months, it is difficult for researchers and businesses without immense budgets to keep up, especially as hardware improvements stagnate. In this talk, I will describe one approach for managing this challenge: changing the training algorithm itself. While many companies and researchers are focused on building hardware and systems to allow existing algorithms to run faster in a mathematically equivalent fashion, there is nothing sacred about this math. On the contrary, training neural networks is inherently approximate, relying on noisy data, convex optimizers in nonconvex regimes, and ad hoc tricks and hacks that seem to work well in practice for reasons that elude us.
I will discuss how we have put this approach into practice at MosaicML, including the dozens of algorithmic changes we have studied (which are freely available as open source), the science behind how these changes interact with each other (the composition problem), and how we evaluate whether these changes have been effective. I will also detail several surprises we have encountered and lessons we have learned along the way. In the months since we began this work in earnest, we have reduced the training times of standard computer vision models by 5-7x and standard language models by 2x on publicly available cloud instances, and we believe we are just scratching the surface.
14:40 - 15:05
In contrast to output-space ensembles of deep models, weight-space aggregation reduces the memory footprint and saves inference time and energy, critical resources on resource-constrained devices residing at the edge. To achieve the best ensemble performance, the models comprising it should be diverse, yet have no energy barrier along the linear interpolation between them. We conjecture that if permutation invariance of neural networks is taken into account, SGD solutions trained from different initializations will likely have no barrier in the linear interpolation between them. Although this is a bold conjecture, we show how extensive empirical attempts fall short of refuting it. Furthermore, we show when practical methods exist to construct efficient ensembles, and touch upon open research questions that arise in this space. Our work has implications for the lottery ticket hypothesis and resource-efficient distributed training.
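For readers unfamiliar with the barrier in question, the following is a minimal sketch (assuming PyTorch; `model_a`, `model_b`, `loss_fn`, and `val_loader` are placeholders the reader supplies, and `loss_fn` is assumed to average over the batch) of how one might measure the loss along the linear interpolation between two independently trained models. It is illustrative only, not the speakers' evaluation code.

```python
# Minimal sketch: loss along the linear weight interpolation between two models.
import copy
import torch

def interpolation_losses(model_a, model_b, loss_fn, val_loader, steps=11):
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    alphas = torch.linspace(0.0, 1.0, steps).tolist()
    losses = []
    for alpha in alphas:
        interp = copy.deepcopy(model_a)
        interp.load_state_dict({
            # interpolate floating-point tensors; copy integer buffers unchanged
            k: torch.lerp(sd_a[k], sd_b[k], alpha) if sd_a[k].is_floating_point() else sd_a[k]
            for k in sd_a
        })
        interp.eval()
        total, n = 0.0, 0
        with torch.no_grad():
            for x, y in val_loader:
                total += loss_fn(interp(x), y).item() * len(y)
                n += len(y)
        losses.append(total / n)
    # The "barrier" is the largest amount by which the interpolation path rises
    # above the straight line connecting the two endpoint losses.
    barrier = max(l - ((1 - a) * losses[0] + a * losses[-1]) for l, a in zip(losses, alphas))
    return losses, barrier
```

A barrier close to zero along the whole path is what the conjecture predicts once the permutation symmetry between the two solutions is accounted for.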
15:05 - 15:30
Mostafa Dehghani (recording)
Efficiency is a critical aspect of developing and deploying machine learning models. Inference time and latency directly affect the user experience, and some applications have hard requirements. In addition to inference costs, model training also has direct financial and environmental impacts. Although there are numerous well-established metrics (cost indicators) for measuring model efficiency, researchers and practitioners often assume that these metrics are correlated with each other and report only a few of them. In this talk, I will discuss some of the common cost indicators, their advantages and disadvantages, and how they can contradict each other.
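As a concrete illustration of how two common cost indicators can disagree, here is a small, hypothetical sketch (PyTorch assumed; the two toy models are illustrative, not from the talk) comparing parameter count against measured wall-clock latency. The deeper model has roughly half the parameters, yet its many sequential layers may give it comparable or even worse latency on some hardware.

```python
# Minimal sketch: parameter count vs. measured latency for two illustrative toy models.
import time
import torch
import torch.nn as nn

def param_count(model):
    return sum(p.numel() for p in model.parameters())

def avg_latency(model, x, warmup=10, iters=50):
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        return (time.perf_counter() - start) / iters

# A wide-and-shallow model vs. a deep-and-narrow one with roughly half the parameters.
wide = nn.Sequential(nn.Linear(1024, 8192), nn.ReLU(), nn.Linear(8192, 1024))
deep = nn.Sequential(*[m for _ in range(16)
                       for m in (nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 1024))])
x = torch.randn(64, 1024)
for name, model in [("wide", wide), ("deep", deep)]:
    print(name, f"{param_count(model) / 1e6:.1f}M params",
          f"{avg_latency(model, x) * 1e3:.2f} ms/batch")
```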
15:30 - 15:55
A central goal of artificial intelligence is to design algorithms that are both generalizable and interpretable. We combine brain-inspired neural computation principles and scalable deep learning architectures to design compact neural controllers for task-specific compartments of a full-stack autonomous vehicle control system. We show that a single algorithm with 19 control neurons, connecting 32 encapsulated input features to outputs by 253 synapses, learns to map high-dimensional inputs into steering commands. This system shows superior generalizability, interpretability, and robustness compared with orders-of-magnitude larger black-box learning systems. The obtained neural agents enable high-fidelity autonomy for task-specific parts of a complex autonomous system.
15:55 - 16:15
16:20 - 16:45
Over the past couple of years, there has been significant progress in both algorithmic and computational support for inducing and exploiting unstructured sparsity in deep neural networks. Yet some questions remain open, such as: What are the maximal sparsity levels we can induce in DNN weights? Can these sparsity levels be translated into actual speedups? And how would such speedups compare with alternative compression techniques, such as quantization and structured sparsity?
In this talk, Dan will give an overview of recent progress on answering these questions. Specifically, he will survey the maximal sparsity levels achievable using existing methods on emerging community benchmarks and the speedups they imply on sparsity-enabled inference (and training) engines, and he will provide some accuracy-speedup comparisons with alternative methods.
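As a rough illustration of how unstructured sparsity of the kind discussed here is typically induced, the following is a minimal sketch (PyTorch assumed) of one-shot global magnitude pruning to a target sparsity level. It is not the speaker's method; practical pipelines combine such pruning with fine-tuning or gradual schedules to recover accuracy.

```python
# Minimal sketch: one-shot global magnitude pruning to a target unstructured sparsity.
import torch
import torch.nn as nn

def magnitude_prune_(model, sparsity=0.9):
    """Zero the smallest-magnitude entries across all Linear/Conv2d weights so that
    roughly `sparsity` of them become exactly zero (modifies the model in place)."""
    weights = [m.weight for m in model.modules() if isinstance(m, (nn.Linear, nn.Conv2d))]
    magnitudes = torch.cat([w.detach().abs().flatten() for w in weights])
    k = int(sparsity * magnitudes.numel())
    if k == 0:
        return
    threshold = torch.kthvalue(magnitudes, k).values  # k-th smallest magnitude overall
    with torch.no_grad():
        for w in weights:
            w.mul_((w.abs() > threshold).to(w.dtype))
```

Whether a sparsity level chosen this way yields a real speedup then depends on the inference engine's support for unstructured-sparse kernels, which is exactly the gap the talk examines.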
16:45 - 17:10
Despite the great progress in the development of efficient deep neural networks, highly accurate models are still too expensive to process video frames in real-time, especially on low-power devices such as smartphones. Video tensors, despite being huge, are highly redundant. This talk explores several ideas to speed up deep neural networks by leveraging the inherent redundancies in the video. Instead of processing the redundant information over and over, we identify and process a minimal set of pixels, regions, and frames that bring in novel information about the video. We also explore how the temporal redundancies can be leveraged to further compress the model either by dynamically selecting the backbone or by knowledge distillation.
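As a toy illustration of exploiting temporal redundancy (not the speaker's actual method), one could skip frames that differ little from the last processed frame and reuse its prediction. In the sketch below, `expensive_model` and the difference threshold are hypothetical placeholders.

```python
# Minimal sketch: reuse predictions for frames that are nearly redundant.
import numpy as np

def process_video(frames, expensive_model, diff_threshold=0.05):
    """frames: iterable of HxWxC float arrays in [0, 1]."""
    outputs, last_frame, last_output = [], None, None
    for frame in frames:
        # Run the expensive model only when the frame brings in enough novel information.
        if last_frame is None or np.abs(frame - last_frame).mean() > diff_threshold:
            last_output = expensive_model(frame)
            last_frame = frame
        outputs.append(last_output)  # otherwise reuse the previous prediction
    return outputs
```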
17:10 - 17:30
Elias Frantar (recording)
The recent focus on the efficiency of deep neural networks (DNNs) has led to significant work on model compression approaches, of which weight pruning is one of the most popular. At the same time, there is rapidly growing computational support for efficiently executing the unstructured-sparse models obtained via pruning. Yet most existing pruning methods minimize just the number of remaining weights, i.e., the size of the model, rather than optimizing for inference time. We address this gap by introducing SPDY, a new compression method that automatically determines layer-wise sparsity targets achieving a desired inference speedup on a given system while minimizing accuracy loss. SPDY is composed of two new techniques: the first is an efficient dynamic programming algorithm for solving the speedup-constrained layer-wise compression problem, assuming a set of given layer-wise sensitivity scores; the second is a local search procedure for determining accurate layer-wise sensitivity scores. Experiments across popular vision and language models show that SPDY guarantees speedups while recovering higher accuracy relative to existing strategies, both in one-shot and gradual pruning scenarios, and is compatible with most existing pruning approaches. We also extend our approach to the recently proposed task of pruning with very little data, where we achieve the best known accuracy recovery when pruning to the GPU-supported 2:4 sparsity pattern.
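To make the speedup-constrained layer-wise problem concrete, here is a simplified, hypothetical sketch of a dynamic program in the spirit described above: given profiled per-layer runtimes and sensitivity scores for each candidate sparsity level, pick one level per layer so the total runtime fits a budget while the summed sensitivity is minimized. The input format and discretization are illustrative, not SPDY's actual interface.

```python
# Minimal sketch: budget-constrained layer-wise sparsity selection via dynamic programming.
import math

def budgeted_layerwise_selection(times, scores, time_budget, resolution=1000):
    """times[l][c] / scores[l][c]: profiled runtime and sensitivity score of layer l
    under candidate sparsity level c. Returns (total score, per-layer choices) for the
    lowest-score configuration whose total runtime fits the budget, or None."""
    scale = resolution / time_budget
    INF = math.inf
    # best[b]: (lowest total score, choices) among configurations using b time buckets
    best = [(INF, [])] * (resolution + 1)
    best[0] = (0.0, [])
    for layer_times, layer_scores in zip(times, scores):
        nxt = [(INF, [])] * (resolution + 1)
        for b, (cur_score, cur_choices) in enumerate(best):
            if cur_score == INF:
                continue
            for c, (t, s) in enumerate(zip(layer_times, layer_scores)):
                nb = b + math.ceil(t * scale)
                if nb <= resolution and cur_score + s < nxt[nb][0]:
                    nxt[nb] = (cur_score + s, cur_choices + [c])
        best = nxt
    feasible = [entry for entry in best if entry[0] < INF]
    return min(feasible) if feasible else None

# Example: 2 layers, 3 candidate sparsity levels each, runtime budget of 1.0 time units.
times  = [[0.8, 0.5, 0.3], [0.9, 0.6, 0.4]]
scores = [[0.0, 0.1, 0.4], [0.0, 0.2, 0.5]]
print(budgeted_layerwise_selection(times, scores, time_budget=1.0))
```

The second ingredient described in the abstract, the local search that produces accurate per-layer sensitivity scores, would supply the `scores` table consumed by a program like this.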
17:30 - 18:00
A high-level informal talk about some of the key challenges and opportunities Sara sees in academic research directions that aim to make progress on ML efficiency. She will also leave time for questions and discussion.
18:00 - 19:30