Federated Learning One World Seminar

Archive of Talks: FLOW Season 2020

FLOW Talk #28

December 16, 2020 @ 1pm Coordinated Universal Time (UTC)

Ali Ramezani-Kebrya (University of Toronto and Vector Institute)

Adaptive Gradient Quantization for Data-Parallel SGD

host: Dan Alistarh

Abstract: Many communication-efficient variants of SGD use gradient quantization schemes. These schemes are often heuristic and fixed over the course of training. We empirically observe that the statistics of gradients of deep models change during training. Motivated by this observation, we introduce two adaptive quantization schemes, ALQ and AMQ. In both schemes, processors update their compression schemes in parallel by efficiently computing sufficient statistics of a parametric distribution. We improve the validation accuracy by almost 2% on CIFAR-10 and 1% on ImageNet in challenging low-cost communication setups. Our adaptive methods are also significantly more robust to the choice of hyperparameters.
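
For intuition, the sketch below shows an unbiased stochastic quantizer over an arbitrary set of levels, together with a toy rule that re-fits the levels to observed gradient statistics. It is not the paper's ALQ/AMQ (which fit a parametric distribution via sufficient statistics); the quantile-based adaptation here is only an illustrative stand-in.

```python
import numpy as np

def stochastic_quantize(v, levels, rng):
    """Unbiased stochastic quantization of v onto a sorted set of levels in [0, 1].

    Each normalized magnitude |v_i| / ||v|| is rounded to one of its two
    neighbouring levels with probabilities chosen so the expectation is exact.
    """
    levels = np.asarray(levels)
    norm = np.linalg.norm(v)
    if norm == 0:
        return np.zeros_like(v)
    r = np.abs(v) / norm                        # normalized magnitudes in [0, 1]
    idx = np.clip(np.searchsorted(levels, r, side="right") - 1, 0, len(levels) - 2)
    lo, hi = levels[idx], levels[idx + 1]
    p_up = (r - lo) / (hi - lo)                 # probability of rounding up
    q = np.where(rng.random(r.shape) < p_up, hi, lo)
    return norm * np.sign(v) * q

def fit_levels(recent_grads, num_levels):
    """Toy adaptation rule: place levels at empirical quantiles of recent
    normalized gradient magnitudes (a stand-in for the parametric
    sufficient-statistics updates used by ALQ/AMQ)."""
    mags = np.concatenate([np.abs(g) / np.linalg.norm(g) for g in recent_grads])
    inner = np.quantile(mags, np.linspace(0.1, 0.9, num_levels - 2))
    return np.unique(np.concatenate([[0.0], inner, [1.0]]))

rng = np.random.default_rng(0)
grads = [rng.standard_normal(1000) for _ in range(10)]
levels = fit_levels(grads, num_levels=8)
g_hat = stochastic_quantize(grads[0], levels, rng)
print("relative error:", np.linalg.norm(g_hat - grads[0]) / np.linalg.norm(grads[0]))
```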

Paper: Fartash Faghri, Iman Tabrizian, Ilia Markov, Dan Alistarh, Daniel Roy, Ali Ramezani-Kebrya. Adaptive Gradient Quantization for Data-Parallel SGD, arXiv:2010.12460, 2020

FLOW Talk #27

December 2, 2020 @ 4pm Coordinated Universal Time (UTC)

Honglin Yuan (Stanford)

Federated Composite Optimization

host: Peter Richtárik

Abstract: Federated Learning (FL) is a distributed learning paradigm which scales on-device learning collaboratively and privately. Standard FL algorithms such as Federated Averaging (FedAvg) are primarily geared towards smooth unconstrained settings. In this paper, we study the Federated Composite Optimization (FCO) problem, where the objective function in FL includes an additive (possibly) non-smooth component. Such optimization problems are fundamental to machine learning and arise naturally in the context of regularization (e.g., sparsity, low-rank, monotonicity, and constraints). To tackle this problem, we propose different primal/dual averaging approaches and study their communication and computation complexities. Of particular interest is Federated Dual Averaging (FedDualAvg), a federated variant of the dual averaging algorithm. FedDualAvg uses a novel double averaging procedure, which combines the gradient averaging step of standard dual averaging with an averaging of client updates akin to standard federated averaging. Our theoretical analysis and empirical experiments demonstrate that FedDualAvg outperforms baselines for FCO.

Paper:

FLOW Talk #26

November 25, 2020 @ 1pm Coordinated Universal Time (UTC)

On Biased Compression for Distributed Learning

host: Peter Richtárik

Abstract: In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact that biased compressors often show superior performance in practice when compared to the much more studied and understood unbiased compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their performance when applied to (stochastic) gradient descent and distributed (stochastic) gradient descent. We show for the first time that biased compressors can lead to linear convergence rates both in the single-node and distributed settings. Our distributed SGD method enjoys the ergodic rate $O(\delta \frac{L}{\mu} \exp(-K) + (C+D)/(K\mu))$, where $\delta$ is a compression parameter which grows when more compression is applied, $L$ and $\mu$ are the smoothness and strong convexity constants, $C$ captures stochastic gradient noise ($C=0$ if full gradients are computed on each node) and $D$ captures the variance of the gradients at the optimum ($D=0$ for over-parameterized models). Further, via a theoretical study of several synthetic and empirical distributions of communicated gradients, we shed light on why and by how much biased compressors outperform their unbiased variants. Finally, we propose a new, highly performing biased compressor---a combination of Top-k and natural dithering---which in our experiments outperforms all other compression techniques.
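
As a concrete example of a biased, contractive compressor, here is a minimal Top-k sketch. The paper's best-performing compressor composes Top-k with natural dithering; only the Top-k part is shown here.

```python
import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude coordinates of x and zero out the rest.
    Top-k is biased but contractive: ||C(x) - x||^2 <= (1 - k/d) ||x||^2,
    since the dropped coordinates carry at most a (1 - k/d) fraction of the energy."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

x = np.random.default_rng(0).standard_normal(8)
print(x)
print(top_k(x, 2))
```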

Paper: Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik and Mher Safaryan. On Biased Compression for Distributed Learning, arXiv:2002.12410, 2020

FLOW Talk #25

November 18, 2020 @ 1pm Coordinated Universal Time (UTC)

Jayadev Acharya (Cornell)

High-Dimensional Estimation under Information Constraints

host: Aurélien Bellet

Abstract: We consider distributed high-dimensional parameter estimation problems (e.g., Gaussians, product Bernoulli, discrete distributions) under local information constraints (e.g., communication, privacy). We will discuss a general "plug-and-play" framework to establish tight minimax lower bounds in these settings. A key focus of our study is the role of interactivity in distributed estimation, and our lower bounds hold for interactive protocols. Time permitting, we will discuss applications to convex optimization under privacy and communication constraints.

Papers:

FLOW Talk #24

November 11, 2020 @ 1pm Coordinated Universal Time (UTC)

Gregory Valiant (Stanford)

Statistical Challenges in the Federated Setting

host: Dan Alistarh

Abstract: In this talk, I'll describe several theoretical frameworks that distill different aspects of the statistical challenges posed in federated learning settings. For each of these frameworks we'll show some clean and surprising results. The first section of the talk will explore the extent to which having a large number of data sources can compensate for having relatively little data from each source---even in settings in which the data sources are heterogeneous. In the second portion of the talk, I'll discuss the problem of developing algorithms and estimators that are robust to a certain fraction of bad data sources. The final third of the talk will focus on the challenge of making accurate predictions despite data coming from non-uniform and non-stationary distributions. Here we introduce a framework for statistical estimation that leverages knowledge of the data collection process, and makes no distributional assumptions on the underlying data values.

Papers:

FLOW Talk #23

November 4, 2020 @ 1pm Coordinated Universal Time (UTC)

Nicholas D Lane (University of Cambridge)

Flower: A Friendly Federated Learning Research Framework

host: Samuel Horváth

Abstract: Federated Learning (FL) has emerged as a promising technique for edge devices to collaboratively learn a shared prediction model, while keeping their training data on the device, thereby decoupling the ability to do machine learning from the need to store the potentially privacy-sensitive user data in the cloud. However, despite the rapid progress made in FL in recent years, it remains far too difficult to evaluate FL algorithms under a full range of realistic system constraints (viz. compute, memory, energy, wired/wireless networking) and scale (thousands of federated devices and larger). As a consequence, our understanding of how these factors influence FL performance, and of how they should shape the future evolution of FL algorithms, remains very underdeveloped.

In this talk, I will describe how we have begun to address this situation by developing Flower -- an open-source framework (http://flower.dev) built to help bridge this gap in evaluation and design. Through Flower, it becomes relatively simple to measure the impact of common real-world FL situations, such as if participating devices have limited compute resources (e.g., an embedded device), or when network speeds are highly varied and unstable. I will highlight early empirical observations, made using Flower, as to what the implications are for existing algorithms under the types of heterogeneous large-scale FL systems we anticipate will increasingly appear. Finally, to showcase the potential and flexibility of Flower, I will show how it can even be used to make assessments of the carbon footprint of FL in various settings -- to the best of our knowledge, this is the first time FL has been studied from the perspective of its environmental impact.

Paper: Daniel J. Beutel, Taner Topal, Akhil Mathur, Xinchi Qiu, Titouan Parcollet, and Nicholas D. Lane. Flower: A Friendly Federated Learning Research Framework, arXiv:2007.14390, 2020

FLOW Talk #22

October 21, 2020 @ 1pm Coordinated Universal Time (UTC)

Borja Balle (DeepMind)

Privacy Amplification via Random Check-Ins

host: Aurélien Bellet

Abstract: Differentially Private Stochastic Gradient Descent (DP-SGD) forms a fundamental building block in many applications for learning over sensitive data. Two standard approaches, privacy amplification by subsampling, and privacy amplification by shuffling, permit adding lower noise in DP-SGD than via naïve schemes. A key assumption in both these approaches is that the elements in the data set can be uniformly sampled, or be uniformly permuted — constraints that may become prohibitive when the data is processed in a decentralized or distributed fashion. In this paper, we focus on conducting iterative methods like DP-SGD in the setting of federated learning (FL) wherein the data is distributed among many devices (clients). Our main contribution is the random check-in distributed protocol, which crucially relies only on randomized participation decisions made locally and independently by each client. It has privacy/accuracy trade-offs similar to privacy amplification by subsampling/shuffling. However, our method does not require server-initiated communication, or even knowledge of the population size. To our knowledge, this is the first privacy amplification tailored for a distributed learning framework, and it may have broader applicability beyond FL. Along the way, we improve the privacy guarantees of amplification by shuffling and show that, in practical regimes, this improvement allows for similar privacy and utility using data from an order of magnitude fewer users.
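
The client-side logic can be pictured with a small sketch (a loose reading of the protocol, not the paper's exact specification): each client independently decides whether to participate and, if so, picks a uniformly random check-in slot; the server accepts at most one check-in per slot.

```python
import random

def client_check_in(num_slots, participation_prob, rng):
    """Purely local, randomized participation decision: no server-initiated
    communication and no knowledge of the population size is needed."""
    if rng.random() < participation_prob:
        return rng.randrange(num_slots)     # chosen check-in slot
    return None                             # sit this round out

rng = random.Random(0)
checkins = {}
for client_id in range(1000):
    slot = client_check_in(num_slots=200, participation_prob=0.2, rng=rng)
    if slot is not None:
        checkins.setdefault(slot, []).append(client_id)

# The server accepts at most one check-in per slot and discards the surplus.
accepted = {slot: clients[0] for slot, clients in checkins.items()}
print(f"{len(accepted)} of 200 slots filled")
```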

Paper: Borja Balle, Peter Kairouz, H. Brendan McMahan, Om Thakkar and Abhradeep Thakurta. Privacy Amplification via Random Check-Ins, arXiv:2007.06605, 2020.

FLOW Talk #21

October 14, 2020 @ 1pm Coordinated Universal Time (UTC)

Yassine Laguel (Université Grenoble Alpes)

Device Heterogeneity in Federated Learning: A Superquantile Approach

host: Samuel Horváth

Abstract: We propose a federated learning framework to handle heterogeneous client devices which do not conform to the population data distribution. The approach hinges upon a parameterized superquantile-based objective, where the parameter ranges over levels of conformity. We present an optimization algorithm and establish its convergence to a stationary point. We show how to practically implement it using secure aggregation by interleaving iterations of the usual federated averaging method with device filtering. We conclude with numerical experiments on neural networks as well as linear models on tasks from computer vision and natural language processing.
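
The superquantile (also known as conditional value-at-risk) of the client losses is the quantity the objective is built around. Below is a minimal sketch of the discrete, equal-weight case, ignoring the fractional boundary atom that the exact definition splits.

```python
import numpy as np

def superquantile(losses, theta):
    """Mean of the worst (1 - theta) fraction of client losses.
    theta = 0 recovers the plain average; theta -> 1 approaches the max loss."""
    losses = np.sort(np.asarray(losses, dtype=float))
    k = max(1, int(np.ceil((1 - theta) * len(losses))))   # number of worst clients kept
    return losses[-k:].mean()

client_losses = [0.2, 0.4, 0.5, 0.9, 1.6, 2.3]
for theta in (0.0, 0.5, 0.8):
    print(theta, superquantile(client_losses, theta))
```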

Paper: Yassine Laguel, Krishna Pillutla, Jérôme Malick, Zaid Harchaoui. Device Heterogeneity in Federated Learning: A Superquantile Approach, arXiv:2002.11223, 2020

FLOW Talk #20

October 7, 2020 @ 1pm Coordinated Universal Time (UTC)

Eduard Gorbunov (Moscow Institute of Physics and Technology)

Linearly Converging Error Compensated SGD

host: Peter Richtárik

Abstract: In this paper, we propose a unified analysis of variants of distributed SGD with arbitrary compressions and delayed updates. Our framework is general enough to cover different variants of quantized SGD, Error-Compensated SGD (EC-SGD) and SGD with delayed updates (D-SGD). Via a single theorem, we derive the complexity results for all the methods that fit our framework. For the existing methods, this theorem gives the best-known complexity results. Moreover, using our general scheme, we develop new variants of SGD that combine variance reduction or arbitrary sampling with error feedback and quantization, and derive convergence rates for these methods beating the state-of-the-art results. To illustrate the strength of our framework, we develop 16 new methods that fit it. In particular, we propose the first method, called EC-SGD-DIANA, that is based on error feedback for a biased compression operator and quantization of gradient differences, and we prove convergence guarantees showing that EC-SGD-DIANA converges to the exact optimum asymptotically in expectation with a constant learning rate for both convex and strongly convex objectives when workers compute full gradients of their loss functions. Moreover, for the case when the loss function of a worker has the form of a finite sum, we modify the method and obtain a new one, called EC-LSVRG-DIANA, which is the first distributed stochastic method with error feedback and variance reduction that converges to the exact optimum asymptotically in expectation with a constant learning rate.
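
The error-feedback mechanism at the core of EC-SGD can be sketched as follows: a simplified illustration with full local gradients and a Top-k compressor on a toy quadratic, not one of the paper's 16 methods.

```python
import numpy as np

def top_k(x, k):
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def ec_sgd(grad_fns, x0, lr, k, steps):
    """Error-compensated SGD: each worker adds its accumulated error to the
    scaled gradient before compressing, and keeps whatever the compressor
    dropped for the next round."""
    x = x0.copy()
    errors = [np.zeros_like(x0) for _ in grad_fns]
    for _ in range(steps):
        msgs = []
        for i, grad in enumerate(grad_fns):
            p = lr * grad(x) + errors[i]     # gradient step plus error compensation
            c = top_k(p, k)
            errors[i] = p - c                # store what was not transmitted
            msgs.append(c)
        x = x - np.mean(msgs, axis=0)        # server applies the averaged update
    return x

# Toy problem: worker i holds f_i(x) = 0.5 ||x - b_i||^2, with optimum mean(b_i).
rng = np.random.default_rng(0)
targets = [rng.standard_normal(50) for _ in range(4)]
grad_fns = [lambda x, b=b: x - b for b in targets]
x_hat = ec_sgd(grad_fns, np.zeros(50), lr=0.5, k=10, steps=500)
print(np.linalg.norm(x_hat - np.mean(targets, axis=0)))
```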

Paper: Eduard Gorbunov, Dmitry Kovalev, Dmitry Makarenko, and Peter Richtárik. Linearly Converging Error Compensated SGD, NeurIPS 2020

FLOW Talk #19

September 30, 2020 @ 1pm Coordinated Universal Time (UTC)

Chulin Xie (Zhejiang University)

DBA: Distributed Backdoor Attacks Against Federated Learning

host: Samuel Horváth

Abstract: Backdoor attacks aim to manipulate a subset of training data by injecting adversarial triggers such that machine learning models trained on the tampered dataset will make arbitrary (targeted) incorrect predictions on test inputs embedded with the same trigger. While federated learning (FL) is capable of aggregating information provided by different parties for training a better model, its distributed learning methodology and inherently heterogeneous data distribution across parties may bring new vulnerabilities.

In addition to recent centralized backdoor attacks on FL where each party embeds the same global trigger during training, we propose the distributed backdoor attack (DBA) --- a novel threat assessment framework developed by fully exploiting the distributed nature of FL. DBA decomposes a global trigger pattern into separate local patterns and embeds them into the training sets of different adversarial parties, respectively. Compared to standard centralized backdoors, we show that DBA is substantially more persistent and stealthy against FL on diverse datasets such as finance and image data. Extensive experiments show that the attack success rate of DBA is significantly higher than that of centralized backdoors under different settings. Moreover, we find that distributed attacks are indeed more insidious, as DBA can evade two state-of-the-art robust FL algorithms against centralized backdoors. We also provide explanations for the effectiveness of DBA via feature visual interpretation and feature importance ranking. To further explore the properties of DBA, we test the attack performance by varying different trigger factors. Our proposed DBA and thorough evaluation results shed light on characterizing the robustness of FL.

Paper: Chulin Xie, Keli Huang, Pin-Yu Chen, Bo Li. DBA: Distributed Backdoor Attacks against Federated Learning, ICLR 2020

FLOW Talk #18

September 23, 2020 @ 4pm Coordinated Universal Time (UTC)

Wenting Zheng (Berkeley)

Sharing Without Showing: Building Systems for Secure Collaborative Learning

[slides]

host: Samuel Horváth

Abstract: In many domains such as finance and medicine, organizations have encountered obstacles in the acquisition of high-quality data because their target applications need sensitive data that resides across multiple parties. However, such data cannot be shared today due to data privacy concerns, policy regulation, and business competition.

How can we allow organizations to run learning tasks on their combined dataset, but without revealing their sensitive input to the other parties? A key approach to enabling this capability is to co-design systems with cryptography to build practical and functional systems that provide strong and provable security. In this talk, I will focus on our recent systems that enable secure collaborative learning, as well as our on-going open source effort that has been used by many industrial collaborators.

Paper: Wenting Zheng. Sharing without Showing: Building Systems for Secure Collaborative Learning, PhD Dissertation, University of California, Berkeley, 2020.

FLOW Talk #17

September 16, 2020 @ 1pm Coordinated Universal Time (UTC)

Peter Davies (IST Austria)

Distributed Variance Reduction with Optimal Communication

host: Dan Alistarh

Abstract: We consider the problem of distributed variance reduction: n machines each receive probabilistic estimates of an unknown true vector Δ, and must cooperate to find a common estimate of Δ with lower variance, while minimizing communication. Variance reduction is closely related to the well-studied problem of distributed mean estimation, and is a key procedure in instances of distributed optimization, such as data-parallel stochastic gradient descent. Previous work typically assumes an upper bound on the norm of the input vectors, and achieves an output variance bound in terms of this norm. However, in real applications, the input vectors can be concentrated around the true vector Δ, but Δ itself may have large norm. In this case, output variance bounds in terms of input norm perform poorly, and may even increase variance. In this paper, we show that output variance need not depend on input norm. We provide a method of quantization which allows variance reduction to be performed with solution quality dependent only on input variance, not on input norm, and show an analogous result for mean estimation. This method is effective over a wide range of communication regimes, from sublinear to superlinear in the dimension. We also provide lower bounds showing that in many cases the communication to output variance trade-off is asymptotically optimal. Further, we show experimentally that our method yields improvements for common optimization tasks, when compared to prior approaches to distributed mean estimation.

Paper: Peter Davies, Vijaykrishna Gurunathan, Niusha Moshrefi, Saleh Ashkboos, Dan Alistarh. Distributed Variance Reduction with Optimal Communication, arXiv:2002.09268, 2020.

FLOW Talk #16

September 9, 2020 @ 1pm Coordinated Universal Time (UTC)

Misha Khodak (Carnegie Mellon University)

ARUBA: Efficient, Adaptive, and Private Meta-Learning with Provable Guarantees

host: Samuel Horváth

Abstract: Meta-learning has recently re-emerged as an important direction for developing algorithms for multi-task learning, dynamic environments, and federated settings. We present a theoretical framework for designing and understanding practical meta-learning methods that integrates natural formalizations of task-similarity with the extensive literature on online convex optimization and sequential prediction algorithms. Our approach enables the task-similarity to be learned adaptively, provides sharper transfer-risk bounds in the setting of statistical learning-to-learn, and leads to straightforward derivations of average-case regret bounds for efficient algorithms in settings where the task-environment changes dynamically or the tasks share a certain geometric structure. We use our theory to modify several popular meta-learning algorithms and improve their training and meta-test-time performance on standard problems in few-shot and federated learning. Finally, we apply our framework to develop a principled approach for differentially private meta-learning, i.e. federated learning with personalization, that keeps data-points private to the aggregator without needing to go through the communication bottleneck of secure aggregation or the noisiness of local differential privacy.

Papers:

FLOW Talk #15

September 2, 2020 @ 1pm Coordinated Universal Time (UTC)

Federated Learning in Collaborative Clinical Research

host: Aurélien Bellet

Abstract:

Paper: Mathieu Andreux, Andre Manoel, Romuald Menuet, Charlie Saillard, Chloé Simpson. Federated Survival Analysis with Discrete-Time Cox Models, arXiv:2006.08997, 2020

FLOW Talk #14

August 26, 2020 @ 4pm Coordinated Universal Time (UTC)

Peter Kairouz (Google)

Federated Analytics

host: Aurélien Bellet

Abstract: Federated analytics (FA) allows data scientists to generate analytical insight from the combined information in distributed datasets without requiring data centralization. FA embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized database querying approaches. In this talk, I will (1) discuss how FA differs from more traditional distributed database querying and data collection paradigms, focusing on the main defining characteristics and challenges of the federated setting, (2) highlight the relationship between FA and federated learning (FL), (3) present a new federated analytics algorithm for discovering frequent items (heavy hitters) with differential privacy; and (4) discuss a set of interesting open questions in this space.

Papers:

FLOW Talk #13

August 19, 2020 @ 1pm Coordinated Universal Time (UTC)

Tian Li (Carnegie Mellon University)

Learning in Heterogeneous Networks: Optimization and Fairness

host: Samuel Horváth

Abstract: Heterogeneity (stemming from non-identically distributed data and systems variability) poses fundamental challenges for federated learning--a paradigm for privacy-preserving training in distributed networks of remote devices or organizations.

In this talk, we present principled federated optimization objectives and methods to handle heterogeneity, both theoretically and empirically. We first show how heterogeneity can negatively impact the convergence of federated optimization methods. To mitigate this issue, we describe FedProx, an optimization framework we developed for robustly learning over data from non-identical distributions while adhering to systems constraints. We then discuss the impact of heterogeneity on fairness, i.e., how naively minimizing an aggregate loss function may disproportionately advantage or disadvantage some of the devices in federated networks. We propose a novel optimization objective, as well as efficient solvers, that encourage fairer (more uniform) performance distributions across heterogeneous devices. Finally, we briefly mention extensions of this fair objective to other problems in machine learning, and conclude by discussing future work in these areas.
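
The essential ingredient of the FedProx local step is a proximal term that anchors each client's inexact local solve to the current global model. A minimal sketch on a toy problem (parameter names and the quadratic losses are illustrative):

```python
import numpy as np

def fedprox_local_update(grad_f, w_global, mu, lr, local_steps):
    """Approximately minimize h_i(w) = f_i(w) + (mu/2) ||w - w_global||^2.
    The proximal term keeps heterogeneous clients from drifting too far
    away from the current global model during their inexact local solves."""
    w = w_global.copy()
    for _ in range(local_steps):
        w -= lr * (grad_f(w) + mu * (w - w_global))   # gradient of the proximal objective
    return w

# Toy heterogeneous clients: f_i(w) = 0.5 ||w - b_i||^2 with different b_i.
rng = np.random.default_rng(0)
targets = [rng.standard_normal(20) * (i + 1) for i in range(5)]
w_global = np.zeros(20)
for _ in range(50):
    updates = [
        fedprox_local_update(lambda w, b=b: w - b, w_global, mu=0.1, lr=0.1, local_steps=10)
        for b in targets
    ]
    w_global = np.mean(updates, axis=0)               # FedAvg-style aggregation of local models
print(np.linalg.norm(w_global - np.mean(targets, axis=0)))
```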

Papers:

FLOW Talk #12

August 5, 2020 @ 1pm Coordinated Universal Time (UTC)

A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning

host: Peter Richtárik

Abstract: Modern large-scale machine learning applications require stochastic optimization algorithms to be implemented on distributed compute systems. A key bottleneck of such systems is the communication overhead for exchanging information across the workers, such as stochastic gradients. Among the many techniques proposed to remedy this issue, one of the most successful is the framework of compressed communication with error feedback (EF). EF remains the only known technique that can deal with the error induced by contractive compressors which are not unbiased, such as Top-K. In this paper, we propose a new alternative to EF for dealing with contractive compressors that is better both in theory and in practice. In particular, we propose a construction which can transform any contractive compressor into an induced unbiased compressor. Following this transformation, existing methods able to work with unbiased compressors can be applied. We show that our approach leads to vast improvements over EF, including reduced memory requirements, better communication complexity guarantees and fewer assumptions. We further extend our results to federated learning with partial participation following an arbitrary distribution over the nodes, and demonstrate the benefits thereof. We perform several numerical experiments which validate our theoretical findings.
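
A sketch of the induced-compressor idea as I read it: compress the vector with the contractive compressor, compress its leftover error with any unbiased compressor, and transmit both parts; the sum is then unbiased. Top-k and Rand-k below are merely example choices, not necessarily the ones used in the paper.

```python
import numpy as np

def top_k(x, k):
    """Contractive (biased) compressor: keep the k largest-magnitude coordinates."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def rand_k(x, k, rng):
    """Unbiased sparsifier: keep k random coordinates, rescaled by d/k."""
    out = np.zeros_like(x)
    idx = rng.choice(len(x), size=k, replace=False)
    out[idx] = x[idx] * (len(x) / k)
    return out

def induced_compressor(x, k1, k2, rng):
    """Apply the contractive compressor, then compress its *error* with an
    unbiased compressor and send both parts:
    E[C(x)] = C1(x) + E[C2(x - C1(x))] = C1(x) + (x - C1(x)) = x."""
    c1 = top_k(x, k1)
    return c1 + rand_k(x - c1, k2, rng)

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
avg = np.mean([induced_compressor(x, 10, 10, rng) for _ in range(20000)], axis=0)
print(np.linalg.norm(avg - x))  # shrinks toward 0 as samples grow: the compressor is unbiased
```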

Papers:

FLOW Talk #11

July 29, 2020 @ 1pm Coordinated Universal Time (UTC)

Krishna Pillutla (University of Washington)

Robust Aggregation for Federated Learning

host: Samuel Horváth

Abstract: We present a robust aggregation approach to make federated learning robust to settings when a fraction of the devices may be sending corrupted updates to the server. The proposed approach relies on a robust secure aggregation oracle based on the geometric median, which returns a robust aggregate using a constant number of calls to a regular non-robust secure average oracle. The robust aggregation oracle is privacy-preserving, similar to the secure average oracle it builds upon. We provide experimental results of the proposed approach with linear models and deep networks for two tasks in computer vision and natural language processing. The robust aggregation approach is agnostic to the level of corruption; it outperforms the classical aggregation approach in terms of robustness when the level of corruption is high, while being competitive in the regime of low corruption.
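
The key primitive can be sketched with the classical Weiszfeld iteration: the geometric median is approximated by repeatedly calling a plain weighted-average oracle, which is what makes it compatible with secure aggregation. This is a bare-bones illustration, not the paper's exact smoothed RFA algorithm.

```python
import numpy as np

def smoothed_weiszfeld(points, num_iters=100, eps=1e-6):
    """Approximate the geometric median of client updates by iterating
    weighted averages (Weiszfeld's algorithm with a small smoothing term).
    Each iteration only needs a weighted average, which is exactly what a
    secure, non-robust averaging oracle can provide."""
    points = np.asarray(points)
    z = points.mean(axis=0)                       # start from the plain average
    for _ in range(num_iters):
        w = 1.0 / np.maximum(np.linalg.norm(points - z, axis=1), eps)
        z = (w[:, None] * points).sum(axis=0) / w.sum()
    return z

rng = np.random.default_rng(0)
honest = rng.normal(loc=1.0, scale=0.1, size=(18, 5))    # honest updates near 1
corrupt = rng.normal(loc=50.0, scale=1.0, size=(2, 5))   # a few corrupted updates
updates = np.vstack([honest, corrupt])
print("mean:            ", np.round(updates.mean(axis=0), 2))
print("geometric median:", np.round(smoothed_weiszfeld(updates), 2))
```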

Paper: Krishna Pillutla, Sham M. Kakade, Zaid Harchaoui. Robust Aggregation for Federated Learning, arXiv:1912.13445, 2019.

FLOW Talk #10

July 22, 2020 @ 1pm Coordinated Universal Time (UTC)

Heterogeneous and Pluralistic Distributed Learning

[slides]

host: Mher Safaryan

Abstract:

Papers:

FLOW Talk #09

July 15, 2020 @ 1pm Coordinated Universal Time (UTC)

Sashank Reddi (Google)

Adaptive Federated Optimization

host: Samuel Horváth

Abstract: Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Due to the heterogeneity of the client datasets, standard federated optimization methods such as Federated Averaging (FedAvg) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including Adagrad, Adam, and Yogi, and analyze their convergence in the presence of heterogeneous data for general non-convex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. We also perform extensive experiments on these methods and show that the use of adaptive optimizers can significantly improve the performance of federated learning.
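
A schematic of the server-side idea: treat the averaged client update as a pseudo-gradient and feed it to an Adam-style server optimizer. This follows the general recipe described in the talk; the class name, hyperparameters, and toy setup below are illustrative, not the paper's exact FedAdam.

```python
import numpy as np

class AdaptiveServer:
    """Adam-style server update applied to the averaged client delta."""
    def __init__(self, dim, lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
        self.lr, self.beta1, self.beta2, self.tau = lr, beta1, beta2, tau
        self.m, self.v = np.zeros(dim), np.zeros(dim)

    def step(self, x, client_deltas):
        delta = np.mean(client_deltas, axis=0)          # pseudo-gradient
        self.m = self.beta1 * self.m + (1 - self.beta1) * delta
        self.v = self.beta2 * self.v + (1 - self.beta2) * delta ** 2
        return x + self.lr * self.m / (np.sqrt(self.v) + self.tau)

# Toy round loop: clients hold f_i(x) = 0.5 ||x - b_i||^2 and run local steps.
rng = np.random.default_rng(0)
targets = [rng.standard_normal(10) for _ in range(8)]
server, x = AdaptiveServer(dim=10), np.zeros(10)
for _ in range(200):
    deltas = []
    for b in targets:
        y = x.copy()
        for _ in range(5):
            y -= 0.1 * (y - b)              # local (full-gradient) steps
        deltas.append(y - x)                # client sends its model delta
    x = server.step(x, deltas)
print(np.round(x[:4], 2), np.round(np.mean(targets, axis=0)[:4], 2))
```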

Paper: Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, H. Brendan McMahan. Adaptive Federated Optimization, arXiv:2003.00295, 2020.

FLOW Talk #08

July 8, 2020 @ 1pm Coordinated Universal Time (UTC)

On the Outsized Importance of Learning Rates in Local Update Methods

[slides]

host: Aurélien Bellet

Abstract: In this work, we study a family of algorithms, which we refer to as local update methods, that generalize many federated learning and meta-learning algorithms. We prove that for quadratic objectives, local update methods perform stochastic gradient descent on a surrogate loss function which we exactly characterize. We show that the choice of client learning rate controls the condition number of that surrogate loss, as well as the distance between the minimizers of the surrogate and true loss functions. We use this theory to derive novel convergence rates for federated averaging that showcase this trade-off between the condition number of the surrogate loss and its alignment with the true loss function. We validate our results empirically, showing that in communication-limited settings, proper learning rate tuning is often sufficient to reach near-optimal behavior. We also present a practical method for automatic learning rate decay in local update methods that helps reduce the need for learning rate tuning, and highlight its empirical performance on a variety of tasks and datasets.

Paper: Zachary Charles, Jakub Konečný. On the Outsized Importance of Learning Rates in Local Update Methods, arXiv:2007.00878, 2020

FLOW Talk #07

July 1, 2020 @ 1pm Coordinated Universal Time (UTC)

Personalized Federated Learning with Theoretical Guarantees: A Model-Agnostic Meta-Learning Approach

host: Peter Richtárik

Abstract: In Federated Learning, we aim to train models across multiple computing units (users), while users can only communicate with a common central server, without exchanging their data samples. This mechanism exploits the computational power of all users and allows users to obtain a richer model as their models are trained over a larger set of data points. However, this scheme only develops a common output for all the users, and, therefore, it does not adapt the model to each user. This is an important missing feature, especially given the heterogeneity of the underlying data distribution for various users. In this paper, we study a personalized variant of federated learning in which our goal is to find an initial shared model that current or new users can easily adapt to their local dataset by performing one or a few steps of gradient descent with respect to their own data. This approach keeps all the benefits of the federated learning architecture, and, by structure, leads to a more personalized model for each user. We show this problem can be studied within the Model-Agnostic Meta-Learning (MAML) framework. Inspired by this connection, we study a personalized variant of the well-known Federated Averaging algorithm and evaluate its performance in terms of gradient norm for non-convex loss functions. Further, we characterize how this performance is affected by the closeness of underlying distributions of user data, measured in terms of distribution distances such as the total variation distance and the 1-Wasserstein metric.
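
The personalization step being optimized for can be sketched directly. This shows only the per-user adaptation, not the MAML-style meta-training of the shared initialization; the quadratic losses and the choice of initialization are toy stand-ins.

```python
import numpy as np

def personalize(w_shared, grad_fi, alpha, steps=1):
    """MAML-style personalization: starting from the shared initialization,
    each user takes one (or a few) gradient steps on its own local loss."""
    w = w_shared.copy()
    for _ in range(steps):
        w -= alpha * grad_fi(w)
    return w

# Toy users with quadratic losses f_i(w) = 0.5 ||w - b_i||^2.
rng = np.random.default_rng(0)
user_targets = [rng.standard_normal(5) for _ in range(3)]
w_shared = np.mean(user_targets, axis=0)   # stand-in for the learned initialization
for b in user_targets:
    w_i = personalize(w_shared, lambda w, b=b: w - b, alpha=0.5)
    print(np.linalg.norm(w_i - b), "<", np.linalg.norm(w_shared - b))
```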

Paper: Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Personalized Federated Learning: A Meta-Learning Approach, arXiv:2002.07948, 2020.

FLOW Talk #06

June 24, 2020 @ 1pm Coordinated Universal Time (UTC)

Hadrien Hendrikx (École Normale Supérieure & INRIA)

Statistical Preconditioning for Federated Learning

host: Samuel Horváth

Abstract: We consider the setting of distributed empirical risk minimization where multiple machines compute the gradients in parallel and a centralized server updates the model parameters. In order to reduce the number of communications required to reach a given accuracy, we propose a preconditioned accelerated gradient method where the preconditioning is done by solving a local optimization problem over a subsampled dataset at the server. The convergence rate of the method depends on the square root of the relative condition number between the global and local loss functions. This relative condition number depends on how close the function estimated at the server is to the global function. We estimate the relative condition number for linear prediction models by studying uniform concentration of the Hessians over a bounded domain, which allows us to derive improved convergence rates for existing preconditioned gradient methods and our accelerated method. Experiments on real-world datasets illustrate how statistical preconditioning can drastically reduce the iteration complexity of the algorithm. They also show the benefits of acceleration in the ill-conditioned regime. The focus of this talk will be on the relevance of relative optimization ideas for federated learning rather than on the acceleration aspect.

Paper: Hadrien Hendrikx, Lin Xiao, Sebastien Bubeck, Francis Bach, Laurent Massoulie. Statistically Preconditioned Accelerated Gradient Method for Distributed Optimization, arXiv:2002.10726, 2020.

FLOW Talk #05

June 17, 2020 @ 1pm Coordinated Universal Time (UTC)

Federated Learning of a Mixture of Global and Local Models: Local SGD and Optimal Algorithms

host: Virginia Smith

Abstract: We propose a new optimization formulation for training federated learning models. The standard formulation has the form of an empirical risk minimization problem constructed to find a single global model trained from the private data stored across all participating devices. In contrast, our formulation seeks an explicit trade-off between this traditional global model and the local models, which can be learned by each device from its own private data without any communication. Further, we develop several efficient variants of SGD for solving the new formulation and prove communication complexity guarantees. The simplest among our methods are similar but not identical to federated averaging / local SGD, thus shedding some light on the essence of the elusive method. We also show that incorporating both Nesterov's acceleration and variance reduction on top of local SGD results in an optimal algorithm in terms of communication complexity.
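
Written out, the mixture formulation takes the following form (my transcription of the idea: $\lambda = 0$ yields purely local models, while $\lambda \to \infty$ forces all $x_i$ to coincide and recovers the usual single global model):

```latex
\min_{x_1,\dots,x_n \in \mathbb{R}^d}
\frac{1}{n}\sum_{i=1}^{n} f_i(x_i)
+ \frac{\lambda}{2n}\sum_{i=1}^{n} \left\| x_i - \bar{x} \right\|^2,
\qquad \bar{x} := \frac{1}{n}\sum_{i=1}^{n} x_i .
```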

Paper: Filip Hanzely and Peter Richtárik. Federated Learning of a Mixture of Global and Local Models, arXiv:2002.05516, 2020.

FLOW Talk #04

June 10, 2020 @ 1pm Coordinated Universal Time (UTC)

Stochastic Controlled Averaging for Federated Learning

host: Peter Richtárik

Abstract: Federated Averaging (FedAvg) has emerged as the algorithm of choice for federated learning due to its simplicity and low communication cost. However, in spite of recent research efforts, its performance is not fully understood. We obtain tight convergence rates for FedAvg and prove that it suffers from 'client-drift' when the data is heterogeneous (non-iid), resulting in unstable and slow convergence.

As a solution, we propose a new algorithm (SCAFFOLD) which uses control variates (variance reduction) to correct for the 'client-drift' in its local updates. We prove that SCAFFOLD requires significantly fewer communication rounds and is not affected by data heterogeneity or client sampling. Further, we show that (for quadratics) SCAFFOLD can take advantage of similarity in the clients' data, yielding even faster convergence. The latter is the first result to quantify the usefulness of local steps in distributed optimization.
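
A simplified sketch of one SCAFFOLD round with full participation (the control-variate update follows the cheaper of the two options described in the paper, as I recall it; step sizes and the toy quadratics are illustrative):

```python
import numpy as np

def scaffold_round(x, c, c_locals, grad_fns, lr_local, lr_global, local_steps):
    """One SCAFFOLD round with full participation.
    Each client corrects its local gradient with the control variates (c - c_i),
    which counteracts the drift caused by heterogeneous local objectives."""
    deltas, new_c_locals = [], []
    for grad, c_i in zip(grad_fns, c_locals):
        y = x.copy()
        for _ in range(local_steps):
            y -= lr_local * (grad(y) - c_i + c)      # drift-corrected local step
        c_i_new = c_i - c + (x - y) / (local_steps * lr_local)
        deltas.append(y - x)
        new_c_locals.append(c_i_new)
    x = x + lr_global * np.mean(deltas, axis=0)
    c = c + np.mean([cn - co for cn, co in zip(new_c_locals, c_locals)], axis=0)
    return x, c, new_c_locals

# Heterogeneous quadratics: client i holds f_i(x) = 0.5 ||x - b_i||^2.
rng = np.random.default_rng(0)
targets = [rng.standard_normal(10) * (i + 1) for i in range(4)]
grad_fns = [lambda x, b=b: x - b for b in targets]
x, c = np.zeros(10), np.zeros(10)
c_locals = [np.zeros(10) for _ in targets]
for _ in range(200):
    x, c, c_locals = scaffold_round(x, c, c_locals, grad_fns, 0.1, 1.0, 10)
print(np.linalg.norm(x - np.mean(targets, axis=0)))   # distance to the global optimum
```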

Paper: Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh. SCAFFOLD: Stochastic Controlled Averaging for Federated Learning, arXiv:1910.06378, 2019.

FLOW Talk #03

May 27, 2020 @ 1pm Coordinated Universal Time (UTC)

Dimitris Papailiopoulos (University of Wisconsin-Madison)

Robustness in Federated Learning May be Impossible Without an All-knowing Central Authority

host: Peter Richtárik

Abstract: In this talk we aim to quantify the robustness of distributed learning against adversarial nodes. We show that there is a gap between robustness guarantees, depending on whether adversarial nodes have full control of the hardware, the training data, or both. Using ideas from robust statistics and coding theory, we establish robust and scalable training methods for centralized, parameter server systems. Perhaps unsurprisingly, we prove that robustness is impossible when a central authority does not own the training data, e.g., in federated learning systems. We then provide a set of whitebox and blackbox attacks that force federated models to exhibit poor performance on either the training, test, or corner-case data points. Our results and experiments cast doubt on the security presumed by federated learning system providers, and show that if you want robustness, you probably have to give up some of your data privacy.

Papers:

Lingjiao Chen, Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos. DRACO: Byzantine-resilient Distributed Training via Redundant Gradients, arXiv:1803.09877, 2018.

Shashank Rajput, Hongyi Wang, Zachary Charles and Dimitris Papailiopoulos. DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation, NeurIPS 2019.

FLOW Talk #02

May 20, 2020 @ 1pm Coordinated Universal Time (UTC)

Blake Woodworth (Toyota Technological Institute at Chicago)

Is Local SGD Better than Minibatch SGD?

host: Peter Richtárik

Abstract: We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method. Its theoretical foundations are currently lacking and we highlight how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD. (1) For quadratic objectives we prove that local SGD strictly dominates minibatch SGD and that accelerated local SGD is minimax optimal for quadratics; (2) For general convex objectives we provide the first guarantee that at least sometimes improves over minibatch SGD; (3) We show that indeed local SGD does not dominate minibatch SGD by presenting a lower bound on the performance of local SGD that is worse than the minibatch SGD guarantee.
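
The two baselines being compared can be set side by side on a toy problem with the same total budget of stochastic gradients per round (M machines, K gradients each). This is only an illustration of the experimental setup, not of the paper's bounds.

```python
import numpy as np

def stochastic_grad(x, b, sigma, rng):
    """Noisy gradient of f(x) = 0.5 ||x - b||^2."""
    return (x - b) + sigma * rng.standard_normal(x.shape)

def minibatch_sgd(b, sigma, M, K, R, lr, rng):
    # One step per round, averaging all M * K stochastic gradients at the same point.
    x = np.zeros_like(b)
    for _ in range(R):
        g = np.mean([stochastic_grad(x, b, sigma, rng) for _ in range(M * K)], axis=0)
        x -= lr * g
    return x

def local_sgd(b, sigma, M, K, R, lr, rng):
    # K sequential local steps on each of M machines per round, then average the iterates.
    x = np.zeros_like(b)
    for _ in range(R):
        local_models = []
        for _ in range(M):
            y = x.copy()
            for _ in range(K):
                y -= lr * stochastic_grad(y, b, sigma, rng)
            local_models.append(y)
        x = np.mean(local_models, axis=0)
    return x

rng = np.random.default_rng(0)
b = np.ones(20)
for name, algo in [("minibatch SGD", minibatch_sgd), ("local SGD", local_sgd)]:
    x = algo(b, sigma=1.0, M=10, K=10, R=20, lr=0.1, rng=rng)
    print(name, np.linalg.norm(x - b))
```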

Paper: Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, Nathan Srebro. Is Local SGD Better than Minibatch SGD?, arXiv:2002.07839, 2020.

FLOW Talk #01

May 13, 2020 @ 1pm Coordinated Universal Time (UTC)

Ahmed Khaled (Cairo University)

On the Convergence of Local SGD on Identical and Heterogeneous Data

host: Peter Richtárik

Abstract: Empirical risk minimization (ERM) is an important cornerstone of solving machine learning problems, and federated learning in particular poses the extra challenge of carrying out ERM in a communication-efficient manner on data distributed over many nodes. Local Stochastic Gradient Descent (Local SGD) is a variant of SGD that alternates local update steps with relatively few communication rounds. Local SGD is at the heart of the popular Federated Averaging algorithm and has recently been the subject of extensive theoretical investigation. We revisit the convergence of Local SGD and present new theory for the algorithm in two settings: under identical and heterogeneous data, with the latter being of more interest to federated learning applications. An important focus of the talk will be reviewing the assumptions typically used in obtaining convergence bounds for FL algorithms and their practical relevance.

Papers:

Ahmed Khaled, Konstantin Mishchenko and Peter Richtárik. Tighter Theory for Local SGD on Identical and Heterogeneous Data, The 23rd International Conference on Artificial Intelligence and Statistics, 2020 (AISTATS 2020).

Ahmed Khaled, Konstantin Mishchenko and Peter Richtárik. First Analysis of Local GD on Heterogeneous Data, NeurIPS 2019 Workshop on Federated Learning for Data Privacy and Confidentiality, 2019.