Federated Learning One World Seminar
Archive of Talks: FLOW Season 2021
FLOW Talk #62
December 22nd, 2021 @ 1pm Coordinated Universal Time (UTC)
A flexible framework for resource-aware compression in federated learning
host: Aurélien Bellet
Abstract: To alleviate the communication bottleneck, a number of recent works have developed federated learning methods that use compressed information. These federated compression methods attain guaranteed convergence toward optimal solutions. However, tuning their hyper-parameters to attain optimal practical performance requires expensive computational effort. In this talk we present a flexible framework for resource-aware compression that automatically strikes the optimal trade-off between communication efficiency and convergence performance. The central idea is to adjust the compression level to the information at each iteration, maximizing the improvement in the objective function per communicated bit. This framework allows the compression to be easily adapted from one technology to another by modeling how the communication cost depends on the compression level for the specific technology. Our experiments indicate that the automatic compression-tuning strategies significantly increase communication efficiency for several popular compression strategies.
Paper:
Sarit Khirirat, Sindri Magnússon, Arda Aytekin, and Mikael Johansson. A Flexible Framework for Communication-Efficient Machine Learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, No. 9, pp. 8101-8109), 2021
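To make the per-bit selection idea above concrete, here is a minimal Python sketch (an illustration only, not the authors' algorithm): a worker picks a Top-k sparsification level by maximizing an estimate of the objective improvement per communicated bit. The helper names sparsify and bits_for_level, and the simple bit-cost model, are assumptions made for this sketch.

import numpy as np

def sparsify(g, k):
    """Keep the k largest-magnitude entries of g (Top-k compression)."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def bits_for_level(k, d, float_bits=32):
    """Crude cost model: k values plus k indices (assumption; technology-dependent)."""
    return k * (float_bits + int(np.ceil(np.log2(d))))

def choose_compression(g, lr, d, levels):
    """Pick the Top-k level maximizing estimated objective decrease per bit.

    For a gradient step, the first-order decrease obtained from sending C(g)
    is approximately lr * <g, C(g)>, which we divide by the bit cost.
    """
    best_k, best_ratio = levels[0], -np.inf
    for k in levels:
        cg = sparsify(g, k)
        gain = lr * float(g @ cg)          # estimated progress from this message
        ratio = gain / bits_for_level(k, d)
        if ratio > best_ratio:
            best_k, best_ratio = k, ratio
    return best_k

rng = np.random.default_rng(0)
d = 1000
g = rng.normal(size=d)
print(choose_compression(g, lr=0.1, d=d, levels=[10, 50, 100, 500, d]))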
FLOW Talk #61
December 1st, 2021 @ 1pm Coordinated Universal Time (UTC)
Unified and Improved Analysis of Decentralized Algorithms for ML
host: Samuel Horváth
Abstract: We focus on decentralized optimization problems that arise in machine learning, that is when the training data is distributed over several workers or devices.
In the first part of the talk we will present a unified view of several decentralized and centralized distributed optimization algorithms. We unify algorithms such as Local SGD, decentralized SGD over a fixed graph, repeated pairwise gossip-based SGD, and others, in a single framework. We provide theoretical convergence rates that recover and improve the best-known previous results. Our convergence rates show that heterogeneous data (non-identical distributions on different workers) slows down the algorithms.
In the second part of the talk we will discuss an improved analysis of the Gradient Tracking algorithm—a decentralized optimization algorithm that is not affected by data heterogeneity.
Papers:
Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi and Sebastian U. Stich. A Unified Theory of Decentralized SGD with Changing Topology and Local Updates, ICML 2020
Anastasia Koloskova, Tao Lin, and Sebastian U. Stich. An Improved Analysis of Gradient Tracking for Decentralized Machine Learning, NeurIPS 2021
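As a concrete instance of the setting such unified analyses cover, here is a toy Python sketch of decentralized gradient descent over a fixed ring graph: each worker takes one local gradient step and then gossips with its neighbours via a doubly stochastic mixing matrix. The quadratic objectives, topology and step size are illustrative choices, not the framework from the papers.

import numpy as np

rng = np.random.default_rng(1)
n, d, steps, lr = 8, 5, 500, 0.01

# Heterogeneous local quadratic objectives f_i(x) = 0.5 * ||A_i x - b_i||^2
A = [rng.normal(size=(20, d)) for _ in range(n)]
b = [rng.normal(size=20) for _ in range(n)]

# Doubly stochastic mixing matrix for a ring graph (average with two neighbours)
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

X = np.zeros((n, d))                       # one model per worker
for t in range(steps):
    grads = np.stack([A[i].T @ (A[i] @ X[i] - b[i]) for i in range(n)])
    X = W @ (X - lr * grads)               # local gradient step, then gossip averaging

print("consensus distance:", np.linalg.norm(X - X.mean(axis=0)))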
FLOW Talk #60
November 24th, 2021 @ 1pm Coordinated Universal Time (UTC)
The Actual Risk of Poisoning Attacks on Federated Learning
host: Samuel Horváth
Abstract: Federated learning (FL) is increasingly adopted by various distributed platforms, in particular Google's Gboard and Apple's Siri use FL to train next word prediction models, and WeBank uses FL for credit risk predictions. A key feature that makes FL highly attractive in practice is that it allows training models in collaboration among mutually untrusted clients, e.g., Android users or competing banks. Unfortunately, this makes FL susceptible to a threat known as poisoning: a small fraction of (malicious) FL clients, who are either owned or controlled by an adversary, may act maliciously during the FL training process in order to corrupt the jointly trained global model.
In this talk, I will take a critical look at the existing literature on (mainly untargeted) poisoning attacks under practical production FL environments, by carefully characterizing the set of realistic threat models and adversarial capabilities. I will discuss some rather surprising findings: contrary to the established belief, we show that FL is highly robust in threat models that represent production systems. I will also discuss some promising directions towards defending against poisoning, in particular by using supermask learning.
Papers:
Virat Shejwalkar, Amir Houmansadr, Peter Kairouz and Daniel Ramage. Back to the Drawing Board: A Critical Evaluation of Poisoning Attacks on Federated Learning, arXiv preprint arXiv:2108.10241, 2021
Hamid Mozaffari, Virat Shejwalkar and Amir Houmansadr. FSL: Federated Supermask Learning, arXiv preprint arXiv:2110.04350, 2021
FLOW Talk #59
November 17th, 2021 @ 1pm Coordinated Universal Time (UTC)
Using Hypernetworks to Personalize Federated Learning
host: Samuel Horváth
Abstract: In this talk, we will describe our recent work on using hypernetworks to personalize federated learning. This includes personalization to clients with new distributions, generalization to novel unsupervised clients at test time and generalization to clients with heterogeneous architectures.
Paper:
Aviv Shamsian, Aviv Navon, Ethan Fetaya, and Gal Chechik. Personalized Federated Learning using Hypernetworks, arXiv preprint arXiv:2103.04628, 2021
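To illustrate the basic idea of hypernetwork-based personalization, here is a toy numpy sketch: a shared hypernetwork maps a learned per-client embedding to the weights of that client's personal model. Everything here (a linear hypernetwork, a linear personal model, the dimensions) is a simplification for illustration only and not the architecture used in the paper; in a full method it is the hypernetwork parameters and the embeddings that get trained, and the sketch only shows how personal weights are generated from them.

import numpy as np

rng = np.random.default_rng(2)
n_clients, emb_dim, d_in, d_out = 10, 4, 8, 3

# Per-client embeddings and a shared hypernetwork (here a single linear map)
client_emb = rng.normal(size=(n_clients, emb_dim))
H = rng.normal(size=(emb_dim, d_in * d_out)) * 0.1   # hypernetwork parameters

def personal_weights(c):
    """Hypernetwork output: weight matrix of client c's personal linear model."""
    return (client_emb[c] @ H).reshape(d_in, d_out)

def client_predict(c, x):
    return x @ personal_weights(c)

x = rng.normal(size=d_in)
print(client_predict(0, x))
print(client_predict(1, x))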
FLOW Talk #58
November 10th, 2021 @ 1pm Coordinated Universal Time (UTC)
Asynchronous Iterations in Optimization: New Sequence Results and Sharper Algorithmic Guarantees
[slides]
host: Mher Safaryan
Abstract: Machine learning problems are rapidly expanding in size and scope. Increasingly often, we face problems where data, computations, and decisions need to be distributed on multiple nodes. These nodes may be individual cores in a CPU, different processors in a multi-CPU platform, or devices in a federated learning system. Insisting that such multi-node systems operate synchronously limits their scalability, since the performance is then dictated by the slowest node, and the system becomes fragile to node failures. Hence, there is a strong current interest in developing asynchronous algorithms for optimal decision-making.
In this talk, we introduce novel convergence results for asynchronous iterations which appear in the analysis of parallel and distributed optimization algorithms. The results are simple to apply and give explicit estimates for how the degree of asynchrony impacts the convergence rates of the iterates. Our results shorten, streamline and strengthen existing convergence proofs for several asynchronous optimization methods, and allow us to establish convergence guarantees for popular algorithms that were thus far lacking a complete theoretical understanding. Specifically, we use our results to derive better iteration complexity bounds for proximal incremental aggregated gradient methods, to provide less conservative analyses of the speedup conditions for asynchronous block-coordinate implementations of Krasnoselskii-Mann iterations, and to quantify the convergence rates for totally asynchronous iterations under various assumptions on communication delays and update rates.
Paper:
Hamid Reza Feyzmahdavian and Mikael Johansson. Asynchronous Iterations in Optimization: New Sequence Results and Sharper Algorithmic Guarantees, arXiv preprint arXiv:2109.04522, 2021
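As a small illustration of the kind of iteration these results cover, the following Python sketch runs gradient descent on a least-squares problem where each update is computed at a stale iterate with a bounded random delay. The delay model, objective and step size are arbitrary choices made for the sketch, not conditions from the paper.

import numpy as np

rng = np.random.default_rng(3)
d, steps, lr, max_delay = 5, 1000, 0.02, 10

A = rng.normal(size=(50, d)) / 5
b = rng.normal(size=50)
grad = lambda x: A.T @ (A @ x - b)

history = [np.zeros(d)]                 # iterate history, so we can read stale copies
for t in range(steps):
    delay = rng.integers(0, min(max_delay, len(history)))
    stale = history[-1 - delay]         # gradient evaluated at a delayed iterate
    history.append(history[-1] - lr * grad(stale))

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print("distance to the least-squares solution:", np.linalg.norm(history[-1] - x_star))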
FLOW Talk #57
November 3rd, 2021 @ 1pm Coordinated Universal Time (UTC)
Secure Distributed Training at Scale
host: Samuel Horváth
Abstract: Some of the hardest problems in deep learning can be solved with the combined effort of many independent parties, as is the case for volunteer computing and federated learning. These setups rely on high numbers of peers to provide computational resources or train on decentralized datasets. Unfortunately, participants in such systems are not always reliable. Any single participant can jeopardize the entire training run by sending incorrect updates, whether deliberately or by mistake. Training in the presence of such peers requires specialized distributed training algorithms with Byzantine tolerance. These algorithms often sacrifice efficiency by introducing redundant communication or passing all updates through a trusted server. As a result, it can be infeasible to apply such algorithms to large-scale distributed deep learning, where models can have billions of parameters. In this work, we propose a novel protocol for secure (Byzantine-tolerant) decentralized training that emphasizes communication efficiency. We rigorously analyze this protocol: in particular, we provide theoretical bounds for its resistance against Byzantine and Sybil attacks and show that it has a marginal communication overhead. To demonstrate its practical effectiveness, we conduct large-scale experiments on image classification and language modeling in the presence of Byzantine attackers.
Paper:
Eduard Gorbunov, Alexander Borzunov, Michael Diskin, Max Ryabinin. Secure Distributed Training at Scale, arXiv preprint arXiv:2106.11257, 2021
FLOW Talk #56
October 27th, 2021 @ 1pm Coordinated Universal Time (UTC)
New Methods for More Effective Bidirectional Compression for FL
host: Dan Alistarh
Abstract: Federated learning is a powerful distributed learning scheme that allows numerous edge devices to collaboratively train a model without sharing their data. However, training is resource-intensive for edge devices, and limited network bandwidth is often the main bottleneck. State-of-the-art techniques tackle these challenges with advanced compression techniques or by delaying communication rounds according to predefined schedules. In this talk we will discuss two novel, complementary methods:
(i) We present a new scheme that adaptively skips communication (broadcast and client uploads) by detecting slow-varying updates. The scheme automatically adjusts the communication frequency independently for each worker and for the server. (ii) We present a progressive learning framework for FL. By applying progressive training in a federated learning setting, we benefit from highly reduced resource demands (the computation and communication load is drastically reduced when training shallow models). By growing the models to their original size, we can recover (and often surpass) the accuracies reached by standard training.
Papers:
Hui-Po Wang, Sebastian U. Stich, Yang He and Mario Fritz. ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training, arXiv preprint arXiv:2110.05323, 2021
Dmitrii Avdiukhin, Nikita Ivkin, Sebastian U Stich and Vladimir Braverman. Bi-directional Adaptive Communication for Heterogenous Distributed Learning, ICML Workshop on Federated Learning for User Privacy and Data Confidentiality, 2021
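For intuition about the first method, here is a minimal Python sketch of adaptive communication skipping: a worker only uploads its model when it has drifted sufficiently from the last value it communicated. The norm-based threshold rule and all parameters are assumptions made for this sketch, not the detection criterion from the paper.

import numpy as np

def maybe_upload(local_model, last_sent, threshold):
    """Send the model only if it changed enough since the last upload."""
    if np.linalg.norm(local_model - last_sent) > threshold:
        return local_model.copy(), True      # new reference; communication happened
    return last_sent, False                  # server keeps using the stale copy

rng = np.random.default_rng(4)
model = rng.normal(size=10)
last_sent = model.copy()
sent = 0
for step in range(100):
    model -= 0.01 * rng.normal(size=10)      # stand-in for a local SGD update
    last_sent, did_send = maybe_upload(model, last_sent, threshold=0.05)
    sent += did_send
print("uploads used:", sent, "out of 100 rounds")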
FLOW Talk #55
October 20th, 2021 @ 1pm Coordinated Universal Time (UTC)
A Stochastic Newton Algorithm for Distributed Convex Optimization
host: Samuel Horváth
Abstract: We propose and analyze a stochastic Newton algorithm for homogeneous distributed stochastic convex optimization, where each machine can calculate stochastic gradients of the same population objective, as well as stochastic Hessian-vector products (products of an independent unbiased estimator of the Hessian of the population objective with arbitrary vectors), with many such stochastic computations performed between rounds of communication. We show that our method can reduce the number, and frequency, of required communication rounds compared to existing methods without hurting performance, by proving convergence guarantees for quasi-self-concordant objectives (e.g., logistic regression), alongside empirical evidence.
Paper: Brian Bullins, Kumar Kshitij Patel, Ohad Shamir, Nathan Srebro, Blake Woodworth. A Stochastic Newton Algorithm for Distributed Convex Optimization. NeurIPS 2021.
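The key primitive the method relies on is a stochastic Hessian-vector product. As an illustration (not the authors' code), the Python sketch below computes a minibatch Hessian-vector product for the logistic loss in closed form, without ever forming the d x d Hessian.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_hvp(w, v, X, y, batch):
    """Hessian-vector product of the logistic loss on a minibatch, without
    forming the d x d Hessian: H v = (1/m) X^T diag(s) X v."""
    Xb, yb = X[batch], y[batch]
    s = sigmoid(yb * (Xb @ w))
    s = s * (1.0 - s)                        # per-example Hessian weights
    return Xb.T @ (s * (Xb @ v)) / len(batch)

rng = np.random.default_rng(5)
n, d = 1000, 20
X = rng.normal(size=(n, d))
y = np.sign(rng.normal(size=n))
w, v = rng.normal(size=d), rng.normal(size=d)
batch = rng.choice(n, size=64, replace=False)
print(stochastic_hvp(w, v, X, y, batch)[:5])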
FLOW Talk #54
October 13th, 2021 @ 1pm Coordinated Universal Time (UTC)
EF21 with Bells & Whistles: Practical Algorithmic Extensions of Modern Error Feedback
host: Samuel Horváth
Abstract: First proposed by Seide et al (2014) as a heuristic, error feedback (EF) is a very popular mechanism for enforcing convergence of distributed gradient-based optimization methods enhanced with communication compression strategies based on the application of contractive compression operators. However, existing theory of EF relies on very strong assumptions (e.g., bounded gradients), and provides pessimistic convergence rates (e.g., while the best known rate for EF in the smooth nonconvex regime, and when full gradients are compressed, is $O(1/T^{2/3})$, the rate of gradient descent in the same regime is $O(1/T)$). Recently, Richtárik et al (2021) proposed a new error feedback mechanism, EF21, based on the construction of a Markov compressor induced by a contractive compressor. EF21 removes the aforementioned theoretical deficiencies of EF and at the same time works better in practice. In this work we propose six practical extensions of EF21, all supported by strong convergence theory: partial participation, stochastic approximation, variance reduction, proximal setting, momentum and bidirectional compression. Several of these techniques were never analyzed in conjunction with EF before, and in cases where they were (e.g., bidirectional compression), our rates are vastly superior.
Papers:
Ilyas Fatkhullin, Igor Sokolov, Eduard Gorbunov, Zhize Li, and Peter Richtárik. EF21 with Bells & Whistles: Practical Algorithmic Extensions of Modern Error Feedback, arXiv:2110.03294, 2021.
Peter Richtárik, Igor Sokolov, and Ilyas Fatkhullin. EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback, NeurIPS (oral), 2021.
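All of these methods are built around a contractive compression operator. As a concrete example (illustrative, not taken from the papers' code), here is the Top-K compressor in Python together with a check of its contraction property ||C(x) - x||^2 <= (1 - K/d) ||x||^2.

import numpy as np

def top_k(x, k):
    """Contractive Top-K compressor: keep the k largest-magnitude coordinates.
    Satisfies ||C(x) - x||^2 <= (1 - k/d) ||x||^2 (here deterministically)."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

x = np.array([0.1, -3.0, 0.5, 2.0, -0.2])
print(top_k(x, 2))          # -> [ 0. -3.  0.  2. -0.]
assert np.sum((top_k(x, 2) - x) ** 2) <= (1 - 2 / x.size) * np.sum(x ** 2)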
FLOW Talk #53
October 6th, 2021 @ 1pm Coordinated Universal Time (UTC)
Tackling Computational Heterogeneity in Federated Learning
host: Dan Alistarh
Abstract: The future of machine learning lies in moving both data collection as well as model training to the edge. The emerging area of federated learning seeks to achieve this goal by orchestrating distributed model training using a large number of resource-constrained mobile devices that collect data from their environment. Due to limited communication capabilities as well as privacy concerns, the data collected by these devices cannot be sent to the cloud for centralized processing. Instead, the nodes perform local training updates and only send the resulting model to the cloud. A key aspect that sets federated learning apart from data-center-based distributed training is the inherent heterogeneity in data and local computation at the edge clients. In this talk, I will present our recent work on tackling computational heterogeneity in federated optimization, firstly in terms of heterogeneous local updates made by the edge clients, and secondly in terms of intermittent client availability.
FLOW Talk #52
September 29th, 2021 @ 1pm Coordinated Universal Time (UTC)
Compression Frameworks for Federated Learning
host: Peter Richtárik
Abstract: We develop a new approach to tackle communication constraints in a distributed learning problem with a central server. We propose and analyze a new algorithm that performs bidirectional compression and achieves the same convergence rate as algorithms using only uplink (from the local workers to the central server) compression. To obtain this improvement, we design MCM, an algorithm such that the downlink compression only impacts local models, while the global model is preserved. As a result, and contrary to previous works, the gradients on the local workers are computed on perturbed models. Consequently, convergence proofs are more challenging and require a precise control of this perturbation. To ensure this, MCM additionally combines model compression with a memory mechanism. This analysis opens new doors, e.g. incorporating worker-dependent randomized models and partial participation.
Paper:
Constantin Philippenko and Aymeric Dieuleveut. Preserved Central Model for Faster Bidirectional Compression in Distributed Settings, arXiv preprint arXiv:2102.12528, 2021
FLOW Talk #51
September 15th, 2021 @ 1pm Coordinated Universal Time (UTC)
Scalable Federated Machine Learning with FEDn
host: Samuel Horváth
Abstract: In this software talk we will detail FEDn - an open source framework for federated machine learning developed by Scaleout Systems and the Integrative Scalable Computing Laboratory at Uppsala University. In the design of FEDn, we see the individual model updates at clients as black boxes (machine learning framework agnostic), and we have paid particular attention to fundamental distributed computing aspects such as scalability, robustness and performance in a geographically distributed setting. The architecture is based on a tiered model where multiple servers, or combiners, are deployed to scale the network horizontally to be able to cater both to the needs of cross-silo cases with large model updates, and to cross-device cases where latency and geographical locality can be critical. We present an in-depth scalability analysis based on both large numbers of clients (cross-device) and large models (cross-silo), training models in real, heterogeneous distributed environments. FEDn is publicly available under the Apache2 license at https://github.com/scaleoutsystems/fedn.
Paper: Morgan Ekmefjord, Addi Ait-Mlouk, Sadi Alawadi, Mattias Åkesson, Desislava Stoyanova, Ola Spjuth, Salman Toor, and Andreas Hellander. Scalable federated machine learning with FEDn, arXiv:2012.14453, 2020
FLOW Talk #50
September 8th, 2021 @ 4pm Coordinated Universal Time (UTC)
Hiding Among the Clones: A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling
host: Aurélien Bellet
Abstract: Recent work of Erlingsson et al (2019) demonstrates that random shuffling amplifies differential privacy guarantees of locally randomized data. Such amplification implies substantially stronger privacy guarantees for systems in which data is contributed anonymously and has led to significant interest in the shuffle model of privacy.
We show that random shuffling of $n$ data records that are input to $\epsilon_0$-differentially private local randomizers results in an $(O((1-e^{-\epsilon_0})\sqrt{\frac{e^{\epsilon_0}\log(1/\delta)}{n}}), \delta)$-differentially private algorithm. This significantly improves over previous work and achieves the asymptotically optimal dependence in $\epsilon_0$. Our result is based on a new approach that is simpler than previous work and extends to approximate differential privacy with nearly the same guarantees. Importantly, our work also yields an algorithm for deriving tighter bounds on the resulting $\epsilon$ and $\delta$ as well as Renyi differential privacy guarantees. We show numerically that our algorithm gets to within a small constant factor of the optimal bound. As a direct corollary of our analysis we derive a simple and asymptotically optimal algorithm for discrete distribution estimation in the shuffle model of privacy. We also observe that our result implies the first asymptotically optimal privacy analysis of noisy stochastic gradient descent that applies to sampling without replacement.
Paper:
Vitaly Feldman, Audra McMillan, and Kunal Talwar. Hiding Among the Clones: A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling, arXiv:2012.12803, 2020.
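To get a feel for the amplification bound, the snippet below evaluates the asymptotic expression with the constant inside the O(.) set to 1; the numbers are for illustration only and are not the paper's exact bound.

import numpy as np

def amplified_eps(eps0, n, delta):
    """Asymptotic form of the shuffled epsilon, with the hidden constant set to 1."""
    return (1 - np.exp(-eps0)) * np.sqrt(np.exp(eps0) * np.log(1 / delta) / n)

for n in [1_000, 10_000, 100_000]:
    print(n, round(amplified_eps(eps0=1.0, n=n, delta=1e-6), 4))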
FLOW Talk #49
September 1st, 2021 @ 1pm Coordinated Universal Time (UTC)
Practical federated learning or how I learned to stop worrying and embrace deep learning
host: Samuel Horváth
Abstract: The theory of optimization for deep learning is still in its infancy. Tricks such as adaptivity, momentum, warmup etc. are critical for empirical success but their requirement is not well understood theoretically. This raises an important challenge addressed by this talk: if we do not understand why algorithms work in the centralized setting, how can we hope to design algorithms for the more challenging federated setting?
In Mime, instead of designing algorithms from scratch, we adapt arbitrary centralized algorithms to the federated setting in a principled manner. We try to ensure that each local update on the clients mimics that of the centralized algorithm run on iid data. Both theoretically and empirically, we demonstrate that Mime retains the convergence properties of the algorithm it is trying to mimic. In fact, when paired with momentum-based variance reduction, Mime obtains the first theoretical rates faster than any centralized algorithm.
Papers:
Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. Mime: Mimicking Centralized Stochastic Algorithms in Federated Learning, arXiv:2008.03606, 2020.
Sai Praneeth Karimireddy, Lie He, and Martin Jaggi. Learning from History for Byzantine Robust Optimization, arXiv:2012.10333, 2020
FLOW Talk #48
August 18th, 2021 @ 1pm Coordinated Universal Time (UTC)
QLSD: Quantised Langevin Stochastic Dynamics for Bayesian Federated Learning
host: Peter Richtárik
Abstract: Federated learning aims at conducting inference when data are decentralised and locally stored on several clients, under two main constraints: data ownership and communication overhead. In this paper, we address these issues under the Bayesian paradigm. To this end, we propose a novel Markov chain Monte Carlo algorithm coined QLSD built upon quantised versions of stochastic gradient Langevin dynamics. To improve performance in a big data regime, we introduce variance-reduced alternatives of our methodology referred to as QLSD* and QLSD++. We provide both non-asymptotic and asymptotic convergence guarantees for the proposed algorithms and illustrate their benefits on several federated learning benchmarks.
Paper:
Maxime Vono, Vincent Plassier, Alain Durmus, Aymeric Dieuleveut, and Eric Moulines. QLSD: Quantised Langevin stochastic dynamics for Bayesian federated learning, arXiv:2106.00797, 2021.
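The following toy Python sketch conveys the flavor of quantised Langevin dynamics: clients send stochastically rounded gradients of their local potentials and the server takes a Langevin step with injected Gaussian noise. The Gaussian target, the stochastic-rounding quantizer and all parameter values are illustrative assumptions, not the QLSD algorithm as specified in the paper.

import numpy as np

rng = np.random.default_rng(7)

def stochastic_round(v, step=0.1):
    """Unbiased stochastic rounding to a grid of width `step` (a simple
    stand-in for the quantizers used in practice)."""
    return step * np.floor(v / step + rng.uniform(size=v.shape))

# Each client's potential is 0.5*||x - mu_i||^2; the server targets their average,
# i.e. a Gaussian centred at the mean of the mus.
d, n_clients, gamma, iters = 2, 5, 0.05, 2000
mus = rng.normal(size=(n_clients, d))
x = np.zeros(d)
samples = []
for k in range(iters):
    grads = [stochastic_round(x - mu) for mu in mus]     # quantized local gradients
    x = x - gamma * np.mean(grads, axis=0) + np.sqrt(2 * gamma) * rng.normal(size=d)
    samples.append(x.copy())

print("posterior mean estimate:", np.mean(samples[500:], axis=0))
print("true mean:", mus.mean(axis=0))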
FLOW Talk #47
August 11th, 2021 @ 1pm Coordinated Universal Time (UTC)
FedPAGE: A Fast Local Stochastic Gradient Method for Communication-Efficient Federated Learning
host: Samuel Horváth
Abstract: Federated Averaging (FedAvg, also known as Local-SGD) [McMahan et al., 2017] is a classical federated learning algorithm in which clients run multiple local SGD steps before communicating their update to an orchestrating server.
We propose a new federated learning algorithm, FedPAGE, able to further reduce the communication complexity by utilizing the recent optimal PAGE method [Li et al., 2021] instead of plain SGD in FedAvg. We show that FedPAGE uses far fewer communication rounds than previous local methods for both federated convex and nonconvex optimization.
In the convex setting, the number of communication rounds of FedPAGE is $O(\frac{N^{3/4}}{S\epsilon})$, improving the best-known result $O(\frac{N}{S\epsilon})$ of SCAFFOLD [Karimireddy et al., 2020] by a factor of $N^{1/4}$, where $N$ is the total number of clients (usually very large in federated learning), $S$ is the number of clients participating in each communication round, and $\epsilon$ is the target error.
In the nonconvex setting, the number of communication rounds of FedPAGE is $O(\frac{\sqrt{N}+S}{S\epsilon^2})$, improving the best-known result $O(\frac{N^{2/3}}{S^{2/3}\epsilon^2})$ of SCAFFOLD [Karimireddy et al., 2020] by a factor of $N^{1/6}S^{1/3}$ if $S\leq \sqrt{N}$.
Note that in both settings, the communication cost for each round is the same for both FedPAGE and SCAFFOLD. As a result, FedPAGE achieves new state-of-the-art results in terms of communication complexity for both federated convex and nonconvex optimization.
Paper: Haoyu Zhao, Zhize Li and Peter Richtárik. FedPAGE: A Fast Local Stochastic Gradient Method for Communication-Efficient Federated Learning, preprint, 2021.
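As a quick sanity check of the claimed improvement factors, the snippet below plugs illustrative values of N, S and epsilon into the communication-round complexities quoted above. Absolute constants inside the O(.) are ignored, so the numbers are only indicative.

# Communication-round complexities from the abstract, constants ignored.
N, S, eps = 1_000_000, 100, 1e-3          # clients, sampled clients, target error

convex_fedpage  = N ** 0.75 / (S * eps)
convex_scaffold = N / (S * eps)
print("convex: SCAFFOLD/FedPAGE =", convex_scaffold / convex_fedpage,
      "(claimed factor N^{1/4} =", N ** 0.25, ")")

nonconvex_fedpage  = (N ** 0.5 + S) / (S * eps ** 2)
nonconvex_scaffold = N ** (2 / 3) / (S ** (2 / 3) * eps ** 2)
print("nonconvex: SCAFFOLD/FedPAGE =", nonconvex_scaffold / nonconvex_fedpage,
      "(claimed factor N^{1/6} S^{1/3}, up to the additive S term:",
      N ** (1 / 6) * S ** (1 / 3), ")")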
FLOW Talk #46
August 4th, 2021 @ 1pm Coordinated Universal Time (UTC)
FedNL: Making Newton-Type Methods Applicable to Federated Learning
host: Peter Richtárik
Abstract: Inspired by recent work of Islamov et al (2021), we propose a family of Federated Newton Learn (FedNL) methods, which we believe is a marked step in the direction of making second-order methods applicable to FL. In contrast to the aforementioned work, FedNL employs a different Hessian learning technique which i) enhances privacy as it does not require the training data to be revealed to the coordinating server, ii) makes it applicable beyond generalized linear models, and iii) provably works with general contractive compression operators for compressing the local Hessians, such as Top-K or Rank-R, which are vastly superior in practice. Notably, we do not need to rely on error feedback for our methods to work with contractive compressors. Moreover, we develop FedNL-PP, FedNL-CR and FedNL-LS, which are variants of FedNL that support partial participation, and globalization via cubic regularization and line search, respectively, and FedNL-BC, which is a variant that can further benefit from bidirectional compression of gradients and models, i.e., smart uplink gradient and smart downlink model compression. We prove local convergence rates that are independent of the condition number, the number of training data points, and compression variance. Our communication efficient Hessian learning technique provably learns the Hessian at the optimum. Finally, we perform a variety of numerical experiments that show that our FedNL methods have state-of-the-art communication complexity when compared to key baselines.
Papers:
Mher Safaryan, Rustem Islamov, Xun Qian and Peter Richtárik. FedNL: Making Newton-Type Methods Applicable to Federated Learning. arXiv:2106.02969, 2021.
Rustem Islamov, Xun Qian and Peter Richtárik. Distributed Second Order Methods with Fast Rates and Compressed Communication. ICML 2021. [This work was covered by a previous FLOW talk by Peter Richtárik, here is the video]
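To illustrate the compressed Hessian-learning idea, here is a toy single-process Python simulation: each client maintains a Hessian estimate that is updated with a Top-K-compressed difference to its current local Hessian, and the server takes a Newton-type step with the average estimate. This is a simplification for intuition only (no partial participation, cubic regularization, line search or bidirectional compression), not the FedNL algorithm as specified in the paper.

import numpy as np

rng = np.random.default_rng(8)
n_clients, m, d, steps, lam = 5, 200, 10, 30, 0.1

def top_k_mat(M, k):
    """Contractive Top-K compressor applied entrywise to a matrix."""
    flat = M.ravel()
    out = np.zeros_like(flat)
    idx = np.argsort(np.abs(flat))[-k:]
    out[idx] = flat[idx]
    return out.reshape(M.shape)

# Local regularized logistic-regression data
X = [rng.normal(size=(m, d)) for _ in range(n_clients)]
y = [np.sign(rng.normal(size=m)) for _ in range(n_clients)]
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def local_grad(i, w):
    z = y[i] * (X[i] @ w)
    return -(X[i].T @ (y[i] * sigmoid(-z))) / m + lam * w

def local_hess(i, w):
    s = sigmoid(y[i] * (X[i] @ w))
    s = s * (1 - s)
    return (X[i].T * s) @ X[i] / m + lam * np.eye(d)

w = np.zeros(d)
H = [np.eye(d) for _ in range(n_clients)]          # learned local Hessian estimates
for t in range(steps):
    g = np.mean([local_grad(i, w) for i in range(n_clients)], axis=0)
    H_avg = np.mean(H, axis=0)
    w = w - np.linalg.solve(H_avg, g)              # Newton-type step with learned Hessian
    for i in range(n_clients):                     # compressed Hessian learning
        H[i] = H[i] + top_k_mat(local_hess(i, w) - H[i], k=2 * d)

print("final gradient norm:", np.linalg.norm(
    np.mean([local_grad(i, w) for i in range(n_clients)], axis=0)))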
FLOW Talk #45
July 28th, 2021 @ 1pm Coordinated Universal Time (UTC)
EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback
host: Samuel Horváth
Abstract: Error feedback (EF), also known as error compensation, is an immensely popular convergence stabilization mechanism in the context of distributed training of supervised machine learning models enhanced by the use of contractive communication compression mechanisms, such as Top-k. First proposed by Seide et al (2014) as a heuristic, EF resisted any theoretical understanding until recently [Stich et al., 2018, Alistarh et al., 2018]. However, all existing analyses either i) apply to the single node setting only, ii) rely on very strong and often unreasonable assumptions, such as global boundedness of the gradients, or iterate-dependent assumptions that cannot be checked a-priori and may not hold in practice, or iii) circumvent these issues via the introduction of additional unbiased compressors, which increase the communication cost. In this work we fix all these deficiencies by proposing and analyzing a new EF mechanism, which we call EF21, which consistently and substantially outperforms EF in practice. Our theoretical analysis relies on standard assumptions only, works in the distributed heterogeneous data setting, and leads to better and more meaningful rates. In particular, we prove that EF21 enjoys a fast O(1/T) convergence rate for smooth nonconvex problems, beating the previous bound of O(1/T^{2/3}), which was shown under a bounded gradients assumption. We further improve this to a fast linear rate for PL functions, which is the first linear convergence result for an EF-type method not relying on unbiased compressors. Since EF has a large number of applications where it reigns supreme, we believe that our 2021 variant, EF21, can have a large impact on the practice of communication efficient distributed learning.
Paper: Peter Richtárik, Igor Sokolov, and Ilyas Fatkhullin. EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback, arXiv:2106.05203, 2021.
Update (Sept 2021): The above paper was accepted to NeurIPS 2021 as an oral paper (less than 1% acceptance rate).
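For intuition, here is a minimal single-node Python sketch of the EF21 mechanism: maintain a gradient estimate g, take steps along it, and update it by compressing only the difference between the fresh gradient and g, using Top-K as the contractive compressor. The quadratic objective, step size and compression level are illustrative choices, not the distributed method analyzed in the paper.

import numpy as np

def top_k(x, k):
    """Contractive Top-K compressor."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(9)
d, k, gamma, steps = 20, 5, 0.1, 1000
A = rng.normal(size=(100, d)) / 10
b = rng.normal(size=100)
grad = lambda x: A.T @ (A @ x - b)

x = np.zeros(d)
g = grad(x)                        # EF21 state: the compressed gradient estimate
for t in range(steps):
    x = x - gamma * g              # step along the current estimate
    g = g + top_k(grad(x) - g, k)  # EF21 update: compress only the difference
print("gradient norm:", np.linalg.norm(grad(x)))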
FLOW Talk #44
July 7th, 2021 @ 1pm Coordinated Universal Time (UTC)
Fair and Accurate Federated Learning under Heterogeneous Targets with Ordered Dropout
host: Samuel Horváth
Abstract: Federated Learning (FL) has been gaining significant traction across different ML tasks, ranging from vision to keyboard predictions. In large-scale deployments, client heterogeneity is a fact, and constitutes a primary problem for fairness, training performance and accuracy. Although significant efforts have been made to tackle statistical data heterogeneity, the diversity in the processing capabilities and network bandwidth of clients, termed system heterogeneity, has remained largely unexplored. Current solutions either disregard a large portion of available devices or set a uniform limit on the model’s capacity, restricted by the least capable participants. In this work, we introduce Ordered Dropout, a mechanism that achieves an ordered, nested representation of knowledge in Neural Networks and enables the extraction of lower-footprint submodels without the need for retraining. We further show that for linear maps our Ordered Dropout is equivalent to SVD. We employ this technique, along with a self-distillation methodology, in the realm of FL in a framework called FjORD. FjORD alleviates the problem of client system heterogeneity by tailoring the model width to the client’s capabilities. Extensive evaluation on both CNNs and RNNs across diverse modalities shows that FjORD consistently leads to significant performance gains over state-of-the-art baselines, while maintaining its nested structure.
Paper:
Samuel Horvath, Stefanos Laskaridis, Mario Almeida, Ilias Leontiadis, Stylianos Venieris, and Nicholas Lane. FjORD: Fair and Accurate Federated Learning under Heterogeneous Targets with Ordered Dropout, arXiv:2102.13451, 2021.
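The nesting idea can be illustrated in a few lines of Python: keeping only the first fraction p of hidden units of each layer yields a valid smaller network, and the submodels are nested by construction. This is a simplified one-layer illustration, not FjORD's implementation (which also covers convolutional channels, training with sampled p, and self-distillation).

import numpy as np

rng = np.random.default_rng(10)
d_in, hidden, d_out = 16, 32, 4
W1 = rng.normal(size=(d_in, hidden))
W2 = rng.normal(size=(hidden, d_out))

def forward(x, p=1.0):
    """Run the network using only the first ceil(p * hidden) units: weaker
    clients choose a smaller p, and the submodels are nested by construction."""
    h = int(np.ceil(p * hidden))
    z = np.maximum(x @ W1[:, :h], 0.0)          # ReLU on the kept units
    return z @ W2[:h, :]

x = rng.normal(size=d_in)
for p in (0.25, 0.5, 1.0):
    print(p, forward(x, p)[:2])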
FLOW Talk #43
June 30, 2021 @ 1pm Coordinated Universal Time (UTC)
On the efficiency and robustness of federated learning
host: Samuel Horváth
Abstract:
Federated learning (FL) emerges as a popular privacy preserving approach for decentralized learning. However, the current FL methods usually suffer from high communication costs and are vulnerable to adversarial attacks. In this presentation, I will talk about our recent findings on the efficiency and robustness of FL.
In the first part of my talk, I will introduce the Federated matched averaging (FedMA) algorithm. FedMA constructs the shared global model in a layer-wise manner by matching and averaging hidden elements (i.e., channels for convolution layers; hidden states for LSTM; neurons for fully connected layers) with similar feature extraction signatures. Our results indicate that FedMA not only outperforms popular FL algorithms on modern CNN and LSTM architectures trained on real world datasets, but also reduces the overall communication burden.
In the second part, I will discuss the robustness of FL. Our work demonstrates the vulnerability of FL to backdoor attacks. We theoretically establish that robustness to backdoors implies model robustness to adversarial examples, a major open problem. Furthermore, detecting the presence of a backdoor in an FL model is unlikely to be feasible assuming first-order oracles or polynomial-time algorithms. We couple our theoretical results with a new family of backdoor attacks, which we refer to as edge-case backdoors. An edge-case backdoor forces a model to misclassify seemingly easy inputs that are, however, unlikely to be part of the training or test data, i.e., they live on the tail of the input distribution. We explain how these edge-case backdoors can lead to unsavory failures and may have serious repercussions on fairness, and exhibit that with careful tuning at the side of the adversary, one can insert them across a range of machine learning tasks.
Papers:
Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos and Yasaman Khazaeni. Federated Learning with Matched Averaging, ICLR, 2020.
Hongyi Wang, Kartik Sreenivasan, Shashank Rajput, Harit Vishwakarma, Saurabh Agarwal, Jy-yong Sohn, Kangwook Lee and Dimitris Papailiopoulos. Attack of the Tails: Yes, You Really Can Backdoor Federated Learning, NeurIPS, 2020.
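To convey the matched-averaging idea, the sketch below matches the hidden units of one layer across two clients by solving an assignment problem on pairwise weight distances and then averages the matched pairs. FedMA's actual matching objective is Bayesian-nonparametric and proceeds layer-wise over many clients; this Hungarian-algorithm toy is only an illustration.

import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(11)
d_in, hidden = 8, 6
# Incoming weight vectors of one hidden layer on two clients; rows = neurons
W_a = rng.normal(size=(hidden, d_in))
perm = rng.permutation(hidden)
W_b = W_a[perm] + 0.05 * rng.normal(size=(hidden, d_in))   # same neurons, permuted plus noise

# Match neurons by minimizing total pairwise distance, then average matched pairs
cost = np.linalg.norm(W_a[:, None, :] - W_b[None, :, :], axis=2)
rows, cols = linear_sum_assignment(cost)
W_avg = 0.5 * (W_a[rows] + W_b[cols])

print("matching recovers the permutation:", bool(np.all(perm[cols] == rows)))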
FLOW Talk #42
May 12, 2021 @ 1pm Coordinated Universal Time (UTC)
Laurent Condat (KAUST)
From Local SGD to Local Fixed-Point Methods for Federated Learning
host: Peter Richtárik
Abstract: Most algorithms for solving optimization problems or finding saddle points of convex-concave functions are fixed-point algorithms. In this work we consider the generic problem of finding a fixed point of an average of operators, or an approximation thereof, in a distributed setting. Our work is motivated by the needs of federated learning. In this context, each local operator models the computations done locally on a mobile device. We investigate two strategies to achieve such a consensus: one based on a fixed number of local steps, and the other based on randomized computations. In both cases, the goal is to limit communication of the locally-computed variables, which is often the bottleneck in distributed frameworks. We perform convergence analysis of both methods and conduct a number of experiments highlighting the benefits of our approach.
Papers:
Grigory Malinovskiy, Dmitry Kovalev, Elnur Gasanov, Laurent Condat and Peter Richtárik. From Local SGD to Local Fixed-Point Methods for Federated Learning, ICML, 2020.
Sélim Chraibi, Ahmed Khaled, Dmitry Kovalev, Peter Richtárik, Adil Salim and Martin Takáč. Distributed fixed point methods with compressed iterates, arXiv:1912.09925, 2019.
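Here is a toy Python sketch of the first strategy described above: each worker applies its own operator (a local gradient step) a fixed number of times, and the results are then averaged. The quadratic operators and all parameters are illustrative choices; note also that with several local steps the iteration converges only to a neighbourhood of the exact fixed point of the averaged operator, which is part of what the analysis quantifies.

import numpy as np

rng = np.random.default_rng(12)
n, d, lr, local_steps, rounds = 5, 4, 0.1, 5, 200

# Each worker's operator is a gradient step on its own quadratic f_i
A = [rng.normal(size=(d, d)) for _ in range(n)]
A = [a.T @ a / d + np.eye(d) for a in A]           # make the operators well conditioned
b = [rng.normal(size=d) for _ in range(n)]
T = [lambda x, i=i: x - lr * (A[i] @ x - b[i]) for i in range(n)]

x = np.zeros(d)
for r in range(rounds):
    locals_ = []
    for i in range(n):
        z = x.copy()
        for _ in range(local_steps):               # several local operator applications
            z = T[i](z)
        locals_.append(z)
    x = np.mean(locals_, axis=0)                   # communication: average the results

x_star = np.linalg.solve(np.mean(A, axis=0), np.mean(b, axis=0))  # fixed point of the averaged operator
print("distance to the exact fixed point:", np.linalg.norm(x - x_star))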
FLOW Talk #41
April 21, 2021 @ 3pm Coordinated Universal Time (UTC)
Salman Avestimehr (University of Southern California)
Trustworthy and Scalable Federated Learning
host: Samuel Horváth
Abstract: Federated learning (FL) is a promising framework for enabling privacy preserving machine learning across many decentralized users. Its key idea is to leverage local training at each user without the need for centralizing/moving any device's dataset in order to protect users’ privacy. In this talk, I will highlight several exciting research challenges for making such a decentralized system trustworthy and scalable to a large number of resource-constrained users. In particular, I will discuss three directions: (1) resilient and secure model aggregation, which is a key component and performance bottleneck in FL; (2) FL of large models, via knowledge transfer, over resource-constrained users; and (3) FedML, our open-source research library and benchmarking ecosystem for FL research (fedml.ai).
Papers:
Jinhyun So, Basak Guler and A. Salman Avestimehr. Turbo-Aggregate: Breaking the Quadratic Aggregation Barrier in Secure Federated Learning, JSAIT, 2021.
Jinhyun So, Basak Guler and A. Salman Avestimehr. Byzantine-Resilient Secure Federated Learning, JSAC, 2020.
Chaoyang He, Murali Annavaram and A. Salman Avestimehr. Group Knowledge Transfer: Federated Learning of Large CNNs at the Edge, NeurIPS, 2020.
Chaoyang He, Murali Annavaram and A. Salman Avestimehr. Towards Non-I.I.D. and Invisible Data with FedNAS: Federated Deep Learning via Neural Architecture Search, CVPR-NAS, 2020.
Chaoyang He et al. FedML: A Research Library and Benchmark for Federated Machine Learning, NeurIPS-SpicyFL, 2020
FLOW Talk #40
April 14, 2021 @ 1pm Coordinated Universal Time (UTC)
Peter Richtárik (KAUST)
Beyond Local and Gradient Methods for Federated Learning
host: Samuel Horváth
Abstract: I will first say something unsavory about gradient and local methods, and will then spend the rest of the talk to describe where I think the FL community should be going.
Papers:
Rustem Islamov, Xun Qian and Peter Richtárik. Distributed second order methods with fast rates and compressed communication, arXiv:2102.0715, 2021.
Dmitry Kovalev, Konstantin Mishchenko and Peter Richtárik. Stochastic Newton and cubic Newton methods with simple local linear-quadratic rates, NeurIPS 2019 Workshop Beyond First Order Methods in ML, 2019.
FLOW Talk #39
April 7, 2021 @1pm Coordinated Universal Time (UTC)
Chaouki ben Issaid (University of Oulu)
Communication-Efficient Distributed Learning with Censored, Quantized, and Generalized Group ADMM
host: Aurélien Bellet
Abstract: In this talk, I will present a communication-efficient decentralized machine learning framework that solves a consensus optimization problem defined over a network of inter-connected workers. The proposed algorithm, Censored and Quantized Generalized GADMM (CQ-GGADMM), leverages the worker grouping and decentralized learning ideas of Group Alternating Direction Method of Multipliers (GADMM), and pushes the frontier in communication efficiency by extending its applicability to generalized network topologies, while incorporating link censoring for negligible updates after quantization. We theoretically prove that CQ-GGADMM achieves a linear convergence rate when the local objective functions are strongly convex, under some mild assumptions.
Papers:
Anis Elgabli, Jihong Park, Amrit S. Bedi, Mehdi Bennis and Vaneet Aggarwal. GADMM: Fast and Communication Efficient Framework for Distributed Machine Learning, Journal of Machine Learning Research, 2020.
Anis Elgabli, Jihong Park, Amrit S Bedi, Chaouki Ben Issaid, Mehdi Bennis and Vaneet Aggarwal. Q-GADMM: Quantized Group ADMM for Communication Efficient Decentralized Machine Learning, IEEE Transactions on Communications, 2021.
Chaouki Ben Issaid, Anis Elgabli, Jihong Park and Mehdi Bennis and Mérouane Debbah. Communication Efficient Distributed Learning with Censored, Quantized, and Generalized Group ADMM, arXiv:2009.06459, 2021.
FLOW Talk #38
March 31, 2021 @ 1pm Coordinated Universal Time (UTC)
CaPC Learning: Confidential and Private Collaborative Learning
host: Samuel Horváth
Abstract: Machine learning benefits from large training datasets, which may not always be possible to collect by any single entity, especially when using privacy-sensitive data. In many contexts, such as healthcare and finance, separate parties may wish to collaborate and learn from each other's data but are prevented from doing so due to privacy regulations. Some regulations prevent explicit sharing of data between parties by joining datasets in a central location (confidentiality). Others also limit implicit sharing of data, e.g., through model predictions (privacy). There is currently no method that enables machine learning in such a setting, where both confidentiality and privacy need to be preserved, to prevent both explicit and implicit sharing of data. Federated learning only provides confidentiality, not privacy, since shared gradients or parameters still contain private information. Differentially private learning assumes unreasonably large datasets. Furthermore, both of these learning paradigms produce a central model whose architecture was previously agreed upon by all parties rather than enabling collaborative learning where each party learns and improves their own local model. We introduce Confidential and Private Collaborative (CaPC) learning, the first method provably achieving both confidentiality and privacy in a collaborative setting. We leverage secure multi-party computation (MPC), homomorphic encryption (HE), and other techniques in combination with privately aggregated teacher models. We demonstrate how CaPC allows participants to collaborate without having to explicitly join their training sets or train a central model. Each party is able to improve the accuracy and fairness of their model, even in settings where each party has a model that performs well on their own dataset or when datasets are non-IID and model architectures are heterogeneous across parties.
Paper: Christopher A. Choquette-Choo, Natalie Dullerud, Adam Dziedzic, Yunxiang Zhang, Somesh Jha, Nicolas Papernot, and Xiao Wang. CaPC Learning: Confidential and Private Collaborative Learning, ICLR, 2021
FLOW Talk #37
March 24, 2021 @ 1pm Coordinated Universal Time (UTC)
Maruan Al-Shedivat (Carnegie Mellon University)
An Inferential Perspective on Federated Learning & Remarks On Data Efficiency of Meta-learning
host: Samuel Horváth
Abstract: In the first part of my talk, I'll present a new perspective on federated learning (FL), where we view it as a distributed inference problem. I'll draw connections between FL and some earlier work on distributed inference from Bayesian statistics literature. Then, I'll use this new perspective and present a new FL algorithm called federated posterior averaging (FedPA), which generalizes federated averaging (FedAvg), is computation- and communication-efficient, and works extremely well in cross-device FL.
In the second part, I'll briefly talk about meta-learning for personalized FL. Specifically, I'll discuss the sample efficiency of three popular meta-learning methods (MAML, Reptile, and ProtoNets). I'll present generalization bounds for these methods obtained using stability theory (Bousquet & Elisseeff, 2002) and discuss their practical implications.
Papers:
Maruan Al-Shedivat, Jennifer Gillenwater, Eric Xing, Afshin Rostamizadeh. Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms, ICLR, 2021
Maruan Al-Shedivat, Liam Li, Eric Xing, Ameet Talwalkar. On Data Efficiency of Meta-learning. AISTATS, 2021
FLOW Talk #36
Abstract: Deep learning models are increasingly popular in many machine learning applications where the training data may contain sensitive information. To provide formal and rigorous privacy guarantees, many learning systems now incorporate differential privacy in their model training. I will talk about our recent studies on the gradient distributions in private training trajectories, which often exhibit low-dimensional and symmetric structures. I will present how our results leverage these geometric structures to derive more accurate training methods and characterize their convergence behaviors.
Papers:
Yingxue Zhou, Zhiwei Steven Wu, Arindam Banerjee. Bypassing the Ambient Dimension: Private SGD with Gradient Subspace Identification, ICLR, 2021
Xiangyi Chen, Zhiwei Steven Wu, Mingyi Hong. Understanding Gradient Clipping in Private SGD: A Geometric Perspective, NeurIPS, 2020
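As background for the gradient-geometry results above, the sketch below shows the basic private aggregation step they study: per-example gradient clipping followed by Gaussian noise, as in DP-SGD. The function name and parameter values are illustrative; this is a generic private step, not the papers' proposed methods.

import numpy as np

def clipped_noisy_mean(per_example_grads, clip_norm, noise_mult, rng):
    """One DP-SGD-style aggregation: clip each example's gradient to clip_norm,
    average, and add Gaussian noise scaled to the clipping threshold."""
    clipped = []
    for g in per_example_grads:
        scale = min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
        clipped.append(g * scale)
    mean = np.mean(clipped, axis=0)
    noise = rng.normal(scale=noise_mult * clip_norm / len(clipped), size=mean.shape)
    return mean + noise

rng = np.random.default_rng(13)
grads = [rng.normal(size=10) * rng.uniform(0.1, 5.0) for _ in range(32)]
print(clipped_noisy_mean(grads, clip_norm=1.0, noise_mult=1.1, rng=rng))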
FLOW Talk #35
March 10, 2021 @ 1pm Coordinated Universal Time (UTC)
Eduard Gorbunov (Moscow Institute of Physics and Technology)
MARINA: Faster Non-Convex Distributed Learning with Compression
host: Peter Richtárik
Abstract: We develop and analyze MARINA: a new communication efficient method for non-convex distributed learning over heterogeneous datasets. MARINA employs a novel communication compression strategy based on the compression of gradient differences which is reminiscent of but different from the strategy employed in the DIANA method of Mishchenko et al (2019). Unlike virtually all competing distributed first-order methods, including DIANA, ours is based on a carefully designed biased gradient estimator, which is the key to its superior theoretical and practical performance. To the best of our knowledge, the communication complexity bounds we prove for MARINA are strictly superior to those of all previous first order methods. Further, we develop and analyze two variants of MARINA: VR-MARINA and PP-MARINA. The first method is designed for the case when the local loss functions owned by clients are either of a finite sum or of an expectation form, and the second method allows for partial participation of clients -- a feature important in federated learning. All our methods are superior to previous state-of-the-art methods in terms of the oracle/communication complexity. Finally, we provide convergence analysis of all methods for problems satisfying the Polyak-Lojasiewicz condition.
Paper: Eduard Gorbunov, Konstantin Burlachenko, Zhize Li, and Peter Richtárik. MARINA: Faster Non-Convex Distributed Learning with Compression, arXiv:2102.07845, 2021
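For intuition about the gradient-difference mechanism, here is a toy single-process Python simulation in the spirit of MARINA: with small probability the nodes synchronize exact gradients, and otherwise they send compressed gradient differences that are added to a shared estimate. The Rand-K compressor, quadratic objectives and parameters are illustrative choices, not the paper's experimental setup.

import numpy as np

rng = np.random.default_rng(14)
n, d, p, gamma, steps = 5, 20, 0.1, 0.05, 2000

A = [rng.normal(size=(40, d)) / 6 for _ in range(n)]
b = [rng.normal(size=40) for _ in range(n)]
grad = lambda i, x: A[i].T @ (A[i] @ x - b[i])

def rand_k(v, k):
    """Unbiased Rand-K compressor: keep k random coordinates, rescaled by d/k."""
    idx = rng.choice(v.size, size=k, replace=False)
    out = np.zeros_like(v)
    out[idx] = v[idx] * v.size / k
    return out

x = np.zeros(d)
g = np.mean([grad(i, x) for i in range(n)], axis=0)     # shared gradient estimate
for t in range(steps):
    x_new = x - gamma * g
    if rng.random() < p:                                 # occasionally sync exact gradients
        g = np.mean([grad(i, x_new) for i in range(n)], axis=0)
    else:                                                # usually send compressed differences
        g = g + np.mean([rand_k(grad(i, x_new) - grad(i, x), k=2) for i in range(n)], axis=0)
    x = x_new

print("gradient norm:", np.linalg.norm(np.mean([grad(i, x) for i in range(n)], axis=0)))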
FLOW Talk #34
March 3, 2021 @ 1pm Coordinated Universal Time (UTC)
Konstantin Mishchenko (KAUST)
Proximal and Federated Random Reshuffling
host: Peter Richtárik
Abstract: Random Reshuffling (RR), also known as Stochastic Gradient Descent (SGD) without replacement, is a popular and theoretically grounded method for finite-sum minimization. We propose two new algorithms: Proximal and Federated Random Reshuffling (ProxRR and FedRR). The first algorithm, ProxRR, solves composite convex finite-sum minimization problems in which the objective is the sum of a (potentially non-smooth) convex regularizer and an average of n smooth objectives. We obtain the second algorithm, FedRR, as a special case of ProxRR applied to a reformulation of distributed problems with either homogeneous or heterogeneous data. We study the algorithms' convergence properties with constant and decreasing stepsizes, and show that they have considerable advantages over Proximal and Local SGD. In particular, our methods have superior complexities and ProxRR evaluates the proximal operator once per epoch only. When the proximal operator is expensive to compute, this small difference makes ProxRR up to n times faster than algorithms that evaluate the proximal operator in every iteration. We give examples of practical optimization tasks where the proximal operator is difficult to compute and ProxRR has a clear advantage. Finally, we corroborate our results with experiments on real data sets.
Papers:
Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Proximal and Federated Random Reshuffling, arXiv:2102.06704, 2021
Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Random Reshuffling: Simple Analysis with Vast Improvements, NeurIPS 2020
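The sketch below illustrates the key algorithmic point of ProxRR on a toy l1-regularized least-squares problem: within an epoch the smooth components are visited in a random permutation, and the proximal operator of the regularizer is applied only once per epoch, with step size gamma * n. The data and parameter values are illustrative assumptions; only the algorithmic skeleton follows the description above.

import numpy as np

rng = np.random.default_rng(15)
n, d, gamma, epochs, lam = 50, 10, 0.01, 200, 0.05

A = rng.normal(size=(n, d))
b = A @ (rng.normal(size=d) * (rng.random(d) < 0.3)) + 0.01 * rng.normal(size=n)
grad_i = lambda i, x: (A[i] @ x - b[i]) * A[i]          # gradient of 0.5*(a_i^T x - b_i)^2

def prox_l1(x, t):
    """Proximal operator of t * lam * ||x||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0)

x = np.zeros(d)
for epoch in range(epochs):
    for i in rng.permutation(n):        # random reshuffling: pass through the data without replacement
        x = x - gamma * grad_i(i, x)
    x = prox_l1(x, gamma * n)           # ProxRR: a single prox per epoch, with step size gamma * n

print("solution (note the exact zeros produced by the prox):")
print(np.round(x, 3))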
FLOW Talk #33
February 24, 2021 @ 1pm Coordinated Universal Time (UTC)
Dan Alistarh (IST Austria)
Decentralized SGD with Asynchronous, Local and Quantized Updates
host: Peter Richtárik
Abstract: The ability to scale distributed optimization to large node counts has been one of the main enablers of recent progress in machine learning. To this end, several techniques have been explored, such as asynchronous, decentralized, or quantized communication--which significantly reduce the cost of synchronization, and the ability for nodes to perform several local model updates before communicating--which reduces the frequency of synchronization. In this paper, we show that these techniques, which have so far been considered independently, can be jointly leveraged to minimize distribution cost for training neural network models via stochastic gradient descent (SGD). We consider a setting with minimal coordination: we have a large number of nodes on a communication graph, each with a local subset of data, performing independent SGD updates onto their local models. After some number of local updates, each node chooses an interaction partner uniformly at random from its neighbors, and averages a possibly quantized version of its local model with the neighbor's model. Our first contribution is in proving that, even under such a relaxed setting, SGD can still be guaranteed to converge under standard assumptions. The proof is based on a new connection with parallel load-balancing processes, and improves existing techniques by jointly handling decentralization, asynchrony, quantization, and local updates, and by bounding their impact. On the practical side, we implement variants of our algorithm and deploy them onto distributed environments, and show that they can successfully converge and scale for large-scale image classification and translation tasks, matching or even slightly improving the accuracy of previous methods.
Paper: Giorgi Nadiradze, Amirmojtaba Sabour, Peter Davies, Ilia Markov, Shigang Li, and Dan Alistarh. Decentralized SGD with Asynchronous, Local and Quantized Updates, arXiv:1910.12308, 2019
FLOW Talk #32
February 17, 2021 @ 1pm Coordinated Universal Time (UTC)
Hamed Hassani (University of Pennsylvania)
Systems Heterogeneity in Federated Learning: Algorithms and Theoretical Limits
host: Dan Alistarh
Abstract: One of the important challenges in federated learning is systems or device heterogeneity. Broadly speaking, this challenge refers to clients having different computation speeds due to several factors such as variability in hardware (CPU, memory), power (battery level), and network connectivity. Such heterogeneity in clients' computation speeds has a negative effect on the scalability of federated learning algorithms and causes significant slow-down in their runtime due to the existence of stragglers. In this talk, we first provide a formal (mathematical) characterization of the problem of systems heterogeneity in federated learning from two different perspectives: the deadline-based model and the wall-clock model. For each of these models, we provide new algorithmic approaches as well as theoretical guarantees on the convergence rates. In particular, for the wall-clock model, we introduce FLANP which is a novel straggler-resilient federated learning method that incorporates statistical characteristics of the clients' data to adaptively select the clients in order to speed up the learning procedure. For the deadline-based model, we propose FedLin which carefully exploits past gradients and uses client-specific learning rates. FedLin matches centralized convergence rates despite arbitrary statistical and systems heterogeneity and, in particular, guarantees linear convergence to the global minimum (under strong convexity and smoothness)--a property that is absent in previous approaches to the deadline-based model such as FedNova.
Paper: Amirhossein Reisizadeh, Isidoros Tziotis, Hamed Hassani, Aryan Mokhtari, and Ramtin Pedarsani. Straggler-Resilient Federated Learning: Leveraging the Interplay Between Statistical Accuracy and System Heterogeneity, arXiv:2012.14453, 2020
FLOW Talk #31
January 27, 2021 @ 9:30am Coordinated Universal Time (UTC)
Richard Nock (Google Brain & Australian National University)
The Impact of Record Linkage on Learning from Feature Partitioned Data
host: Aurélien Bellet
Abstract: There has been recently a significant boost to machine learning with distributed data, in particular with the success of federated learning. A common and very challenging setting is that of vertical or feature-partitioned data, when multiple data providers hold different features about common entities. In general, training needs to be preceded by record linkage (RL), a step that finds the correspondence between the observations of the datasets. RL is prone to mistakes in the real world. Despite the importance of the problem, there has been so far no formal assessment of the way in which RL errors impact learning models. Prior work in the area either uses heuristics or assumes that the optimal RL is known in advance. I will present what is to our knowledge the first formal assessment of the problem for supervised learning. For wide sets of losses, we provide technical conditions under which the classifier learned after noisy RL converges (with the data size) to the best classifier that would be learned from mistake-free RL. This yields new insights on the way the pipeline RL + ML operates. (Work started in and done with the Confidential Computing team at Data61 / NICTA.)
Paper:
FLOW Talk #30
January 20, 2021 @ 1pm Coordinated Universal Time (UTC)
Adam Smith (Boston University)
When is Memorization of Irrelevant Training Data Necessary for High-Accuracy Learning?
host: Peter Richtárik
Abstract: Modern machine learning models are complex, and frequently encode surprising amounts of information about individual inputs. In extreme cases, complex models appear to memorize entire input examples, including seemingly irrelevant information (social security numbers from text, for example). In this paper, we aim to understand whether this sort of memorization is necessary for accurate learning. We describe natural prediction problems in which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples. This remains true even when the examples are high-dimensional and have entropy much higher than the sample size, and even when most of that information is ultimately irrelevant to the task at hand. Further, our results do not depend on the training algorithm or the class of models used for learning.
Our problems are simple and fairly natural variants of the next-symbol prediction and the cluster labeling tasks. These tasks can be seen as abstractions of image- and text-related prediction problems. To establish our results, we reduce from a family of one-way communication problems for which we prove new information complexity lower bounds.
Joint work with Gavin Brown, Mark Bun, Vitaly Feldman, and Kunal Talwar.
Paper: Gavin Brown, Mark Bun, Vitaly Feldman, Adam Smith, Kunal Talwar. When is Memorization of Irrelevant Training Data Necessary for High-Accuracy Learning?, arXiv preprint arXiv:2012.06421, 2020
FLOW Talk #29
January 13, 2021 @ 1pm Coordinated Universal Time (UTC)
Andrew Trask (Oxford/OpenMined)
Privacy-Preserving AI
host: Aurélien Bellet
Abstract: In this talk, Andrew will discuss the Federated Learning ecosystem at a high level, focusing on the key use-cases we can expect in the coming years as well as key technologies and research agendas which are bottlenecks to its widespread adoption. He will also demonstrate Duet, which can facilitate peer-to-peer federated learning amongst data scientists on a shared video call.