April 9, 2021

Communication Efficient Distributed Optimization

an NSF-TRIPODS Workshop

In recent years, the computational paradigm for large-scale machine learning and data analytics has shifted to a massively large distributed system composed of individual computational nodes (e.g., ranges from GPUs to low-end commodity hardware). For example, modern large-scale distributed systems such as Apache Spark and computational primitives such as MapReduce have gained significant traction, enabling engineers to execute production-scale tasks on terabytes of data. A typical distributed optimization problem for training machine learning models has several considerations. The first of these is the convergence rate of the algorithm: after a number of iterations, how close are we to the optimum solution? While the convergence rate is a measure of efficiency in conventional systems, in today's distributed computing framework, the aspects of communication are at odds with the convergence rate. Even though the ML algorithms are highly scalable, with high dimensionality of data and increasing number of participating devices (such as in a Federated learning setup), communication becomes a bottleneck to the efficiency and speed of the learning. In recent years various techniques have been developed to alleviate the problem of communication bottleneck, including but not limited to quantization and sparsification, local updates, one-shot learning. The goal of this workshop is to bring these ideas together.

The Speakers

Martin Jaggi

EPFL

Aryan Mokhtari

University of Texas at Austin

Virginia Smith

Carnegie Mellon University

Himanshu Tyagi

Indian Institute for Science

Schedule (April 9, 2021)

All times are in Pacific Time (GMT-7).

08:45 Opening Remarks

09:00 Himanshu Tyagi

Information-constrained optimization: Can Adaptive Processing of Gradients Help?

We derive lower bounds for convergence rate of stochastic optimization using noisy gradients, when only limited information about the noisy gradients is allowed. Examples of such limitations on information include local privacy constraints, communication constraints, and constraint on the number of coordinates of the gradient that can be computed. Existing lower bounds for this problem have focused only on convex functions and not allowed the information constraints to be obtained adaptively. In contrast, our lower bounds apply to adaptive procedures for obtaining information about the gradients and allow us to examine whether adaptivity helps. We show that for both convex and strongly convex functions and for all three information constraints mentioned above, adaptivity does not lead to improvement in convergence rate. Furthermore, we exhibit an interesting structured least squares problem where adaptivity does lead to significantly faster optimization.

This talk is based on joint work with Jayadev Acharya, Clément Canonne, and Prathamesh Mayekar.

Himanshu Tyagi: Workshop on Communication Efficient Distributed Optimization

10:00 Martin Jaggi

Communication-Efficient Distributed Deep Learning

We discuss gradient compression methods for communication-efficient distributed training, in centrally-coordinated as well as fully decentralized scenarios. In particular, we demonstrate that low-rank linear compressors applied on model differences can lead to fast encoding and decoding, as well as efficient aggregation between workers, while maintaining training and test accuracy. The key building blocks used are the linearity of the power iteration, applied to the fast-evolving matrix of gradients, paired with error-feedback. We present empirical results on reduced training times for many neural network architectures with our open-source code, as well as theoretical convergence rates of the methods, applicable also for heterogeneous data, and asymptotically independent of the network topology and compression ratio.

This talk is based on joint work with Sai Praneeth Karimireddy & Thijs Vogels.

Martin Jaggi: Workshop on Communication Efficient Distributed Optimization

11:00 Poster Session

13:00 Virginia Smith

Heterogeneity Meets Communication-Efficiency: Challenges and Opportunities

A defining trait of federated learning is the presence of heterogeneity, i.e., that data and systems characteristics may differ significantly across the network. In this talk I show that the challenge of heterogeneity pervades the machine learning process in federated settings, affecting issues such as optimization, modeling, and fairness. In terms of optimization, I discuss distributed optimization methods that offer robustness to systems and statistical heterogeneity. I then explore the role that heterogeneity plays in delivering models that are accurate and fair to all users/devices in the network. Finally, I consider scenarios where heterogeneity may in fact afford benefits to distributed learning, through recent work in one-shot federated clustering.

Virginia Smith: Workshop on Communication Efficient Distributed Optimization

14:00 Aryan Mokhtari

Towards Communication-Efficient Personalized Federated Learning via Representation Learning and Meta-Learning

In Federated Learning (FL), we aim to train models across multiple computing units (users), while users can only communicate with a shared central server without exchanging their data samples. This mechanism exploits all users' computational power and allows users to obtain a richer model as their models are trained over a larger set of data points. However, traditional FL schemes often develop a common output for all the users and do not lead to a personalized model for each user. This is indeed an important missing feature, especially given the heterogeneity of the underlying data distribution for various users. In this talk, we address this issue by leveraging tools from (i) representation learning and (ii) meta-learning. In the first part of the talk, we study the cases that users’ data share a common representation. In particular, we discuss a novel and communication-efficient federated learning framework for learning a shared data representation across clients and unique local heads for each client. Our result shows provable sample complexity benefits for FL in the presence of data heterogeneity, both for original users and new users entering after a representation has been learned. In the second part of the talk, we focus on the settings that either there is no common structure between the given training tasks or the structure is unknown. In such cases, our goal is to obtain a global model for all users that can be adapted to each user’s task with a few computationally cheap local operations. We show how this goal can be achieved by exploiting ideas from Model-Agnostic Meta-Learning (MAML).

Aryan Mokhtari: Workshop on Communication Efficient Distributed Optimization