FLOW Talk #90
December 21, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract: The celebrated FedAvg algorithm of McMahan et al. (2017) is based on three components: client sampling (CS), data sampling (DS) and local training (LT). While the first two are reasonably well understood, the third component, whose role is to reduce the number of communication rounds needed to train the model, resisted all attempts at a satisfactory theoretical explanation. Malinovsky et al. (2022) identified four distinct generations of LT methods based on the quality of the provided theoretical communication complexity guarantees. Despite a lot of progress in this area, none of the existing works were able to show that it is theoretically better to employ multiple local gradient-type steps (i.e., to engage in LT) than to rely on a single local gradient-type step only in the important heterogeneous data regime. In a recent breakthrough embodied in their ProxSkip method and its theoretical analysis, Mishchenko et al. (2022) showed that LT indeed leads to provable communication acceleration for arbitrarily heterogeneous data, thus jump-starting the 5th generation of LT methods. However, while these latest generation LT methods are compatible with DS, none of them support CS. We resolve this open problem in the affirmative. In order to do so, we had to base our algorithmic development on new algorithmic and theoretical foundations.
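To make the three components concrete, here is a minimal, purely illustrative FedAvg-style loop (not the paper's method) in which client sampling, data sampling and local training each appear as an explicit step; the quadratic client losses and all constants are toy choices.

    import numpy as np

    rng = np.random.default_rng(0)
    n_clients, dim = 10, 5
    B = 2.0 + rng.normal(size=(n_clients, dim))   # client i's loss: f_i(x) = 0.5*||x - B_i||^2

    x = np.zeros(dim)                             # global model
    for r in range(200):
        sampled = rng.choice(n_clients, size=3, replace=False)   # client sampling (CS)
        returned = []
        for i in sampled:
            y = x.copy()
            for _ in range(10):                   # local training (LT): several local steps
                j = rng.integers(dim)             # data sampling (DS): a random "data point"
                y[j] -= 0.1 * (y[j] - B[i, j])    # stochastic gradient step on f_i
            returned.append(y)
        x = np.mean(returned, axis=0)             # server averages the returned local models
    print("distance to the average of the local optima:", np.linalg.norm(x - B.mean(axis=0)))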
Paper:
Michał Grudzień, Grigory Malinovsky and Peter Richtárik. Can 5th Generation Local Training Methods Support Client Sampling? Yes! arXiv preprint, 2022.
FLOW Talk #89
December 14, 2022 @ 5pm Coordinated Universal Time (UTC)
Abstract: The COVID-19 pandemic has emphasized the need for large-scale collaborations between the clinical and scientific communities to tackle global healthcare challenges. However, regulatory constraints around data sharing and patient privacy might hinder access to genuinely representative patient populations. Federated learning (FL) allows us to work around such restrictions while keeping patient privacy in mind. This talk will show how FL was used to predict clinical outcomes in patients with COVID-19 while allowing collaborators to retain governance over their data (Nature Medicine 2021). Furthermore, I will introduce several recent advances in FL, including quantifying potential data leakage, automated machine learning (AutoML), and personalization that can allow us to build more accurate and robust AI models for medical image analysis.
Papers:
Roth, Holger R., Yan Cheng, Yuhong Wen, Isaac Yang, Ziyue Xu, Yuan-Ting Hsieh, Kristopher Kersten et al. NVIDIA FLARE: Federated Learning from Simulation to Real-World. arXiv preprint arXiv:2210.13291, 2022.
Guo, Pengfei, Dong Yang, Ali Hatamizadeh, An Xu, Ziyue Xu, Wenqi Li, Can Zhao et al. Auto-FedRL: Federated Hyperparameter Optimization for Multi-institutional Medical Image Segmentation. arXiv preprint arXiv:2203.06338, 2022.
Xu, An, Wenqi Li, Pengfei Guo, Dong Yang, Holger R. Roth, Ali Hatamizadeh, Can Zhao, Daguang Xu, Heng Huang, and Ziyue Xu. Closing the Generalization Gap of Cross-silo Federated Medical Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
Hatamizadeh, Ali, Hongxu Yin, Pavlo Molchanov, Andriy Myronenko, Wenqi Li, Prerna Dogra, Andrew Feng et al. Do Gradient Inversion Attacks Make Federated Learning Unsafe?. arXiv preprint arXiv:2202.06924, 2022.
Dayan, Ittai, Holger R. Roth, Aoxiao Zhong, Ahmed Harouni, Amilcare Gentili, Anas Z. Abidin, Andrew Liu et al. Federated learning for predicting clinical outcomes in patients with COVID-19. Nature medicine, 2021.
Rieke, Nicola, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger R. Roth, Shadi Albarqouni, Spyridon Bakas et al. The future of digital health with federated learning. NPJ digital medicine, 2020.
FLOW Talk #88
December 7, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract: In this work, we study distributed optimization algorithms that reduce the high communication costs of synchronization by allowing clients to perform multiple local gradient steps in each communication round. Recently, Mishchenko et al. (2022) proposed a new type of local method, called ProxSkip, that enjoys an accelerated communication complexity without any data similarity condition. However, their method requires all clients to call local gradient oracles with the same frequency. Because of statistical heterogeneity, we argue that clients with well-conditioned local problems should compute their local gradients less frequently than clients with ill-conditioned local problems. Our first contribution is the extension of the original ProxSkip method to the setup where clients are allowed to perform a different number of local gradient steps in each communication round. We prove that our modified method, GradSkip, still converges linearly, has the same accelerated communication complexity, and the required frequency for local gradient computations is proportional to the local condition number. Next, we generalize our method by extending the randomness of probabilistic alternations to arbitrary unbiased compression operators and considering a generic proximable regularizer. This generalization, GradSkip+, recovers several related methods in the literature. Finally, we present an empirical study to confirm our theoretical claims.
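As a rough illustration of the scheduling idea only (not the actual GradSkip update rule, which couples these skips with ProxSkip-style control variates), the sketch below shows how per-client gradient-skipping probabilities tied to hypothetical local condition numbers translate into proportionally less local gradient work for well-conditioned clients.

    import numpy as np

    rng = np.random.default_rng(1)
    kappa = np.array([2.0, 10.0, 100.0])          # hypothetical local condition numbers
    q = kappa / kappa.max()                       # probability of actually computing a gradient

    local_iters, rounds = 20, 2000
    grad_calls = np.zeros(kappa.size)
    for _ in range(rounds):
        for i in range(kappa.size):
            # in each of the local_iters local iterations, client i computes a fresh
            # local gradient only with probability q[i]; otherwise it "skips"
            grad_calls[i] += rng.binomial(local_iters, q[i])

    print("average local gradient computations per round:", grad_calls / rounds)
    # -> roughly proportional to the local condition numbers, matching the abstract's point
    #    that well-conditioned clients can compute local gradients less frequently.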
Paper:
Artavazd Maranjyan, Mher Safaryan and Peter Richtárik. GradSkip: Communication-accelerated local gradient methods with better computational complexity. arXiv:2210.16402, 2022.
FLOW Talk #87
November 30, 2022 @ 5pm Coordinated Universal Time (UTC)
Abstract: With federated learning now a popular option for data parallelism in machine learning, several variants of split learning (itself a variant of federated learning) bring in another dimension of parallelism in training, by parallelizing across parts of a model shared between clients and a server. These variants include AdaSplit, PrivateMail, LocFedMix-SL, DP-CutMixSL, NoPeek-Infer, DISCO, AirMixML and ExpertMatcher. Split learning provides gains in resource efficiency, making it especially suitable for resource-constrained edge devices as clients. Earlier and updated variants of split learning have been ported into OpenMined’s popular PySyft framework and into FedML, integrated into the pipelines of startups such as TripleBlind and Acuratio, and considered in several large companies. This talk covers the landscape of these engineered variants, along with privacy frameworks for split learning such as posthoc privacy (formal privacy guarantees for neural network queries via estimation of the local Lipschitz constant), PrivateMail, and DP-CutMixSL.
Abstract: Federated data analytics is a framework for distributed data analysis where a server compiles noisy responses from a group of distributed low-bandwidth user devices to estimate aggregate statistics. Two major challenges in this framework are privacy, since user data is often sensitive, and compression, since the user devices have low network bandwidth. Prior work has addressed these challenges separately by combining standard compression algorithms with known privacy mechanisms. In this work, we take a holistic look at the problem and design a family of privacy-aware compression mechanisms that work for any given communication budget. We first propose a mechanism for transmitting a single real number that has optimal variance under certain conditions. We then show how to extend it to metric differential privacy for location privacy use-cases, as well as vectors, for application to federated learning. Our experiments illustrate that our mechanism can lead to better utility vs. compression trade-offs for the same privacy loss in a number of settings.
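As a hedged illustration of the general idea (not the paper's optimal mechanism), the sketch below transmits a single real number in [0, 1] under a bit budget by combining unbiased stochastic quantization with a simple randomized-response step that provides eps-local differential privacy; all names and constants are illustrative.

    import numpy as np

    rng = np.random.default_rng(2)

    def private_compress(x, eps, bits):
        """Quantize x in [0, 1] to 2**bits levels without bias, then pass the level
        through a randomized-response step that gives eps-local differential privacy."""
        K = 2 ** bits
        levels = np.linspace(0.0, 1.0, K)
        step = levels[1] - levels[0]
        j = int(np.clip(np.searchsorted(levels, x) - 1, 0, K - 2))
        idx = j + (rng.random() < (x - levels[j]) / step)   # unbiased stochastic rounding
        p_keep = (np.exp(eps) - 1) / (np.exp(eps) - 1 + K)
        if rng.random() >= p_keep:
            idx = rng.integers(K)                           # report a uniformly random level
        return idx, levels, p_keep

    def debias(idx, levels, p_keep):
        # E[levels[idx]] = p_keep * x + (1 - p_keep) * mean(levels), so invert that
        return (levels[idx] - (1 - p_keep) * levels.mean()) / p_keep

    xs = rng.random(20000)                                  # one value per simulated user
    estimates = [debias(*private_compress(x, eps=2.0, bits=3)) for x in xs]
    print("true mean:", xs.mean(), " private estimate of the mean:", np.mean(estimates))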
Paper:
Kamalika Chaudhuri, Chuan Guo and Mike Rabbat. Privacy-Aware Compression for Federated Data Analysis. UAI 2022.
FLOW Talk #85
November 16, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract: The starting point of this paper is the discovery of a novel and simple error-feedback mechanism, which we call EF21-P, for dealing with the error introduced by a contractive compressor. Unlike all prior works on error feedback, where compression and correction operate in the dual space of gradients, our mechanism operates in the primal space of models. While we believe that EF21-P may be of interest in many situations where it is often advantageous to perform model perturbation prior to the computation of the gradient (e.g., randomized smoothing and generalization), in this work we focus our attention on its use as a key building block in the design of communication-efficient distributed optimization methods supporting bidirectional compression. In particular, we employ EF21-P as the mechanism for compressing and subsequently error-correcting the model broadcast by the server to the workers. By combining EF21-P with suitable methods performing worker-to-server compression, we obtain novel methods supporting bidirectional compression and enjoying new state-of-the-art theoretical communication complexity for convex and nonconvex problems. For example, our bounds are the first that manage to decouple the variance/error coming from the workers-to-server and server-to-workers compression, transforming a multiplicative dependence to an additive one. In the convex regime, we obtain the first bounds that match the theoretical communication complexity of gradient descent. Even in this convex regime, our algorithms work with biased gradient estimators, which is non-standard and requires new proof techniques that may be of independent interest. Finally, our theoretical results are corroborated through suitable experiments.
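A minimal single-variable sketch of the primal error-feedback idea, assuming a toy quadratic loss and top-k as the contractive compressor: the workers' copy w is updated only through compressed corrections of x - w, while the gradient is evaluated at w rather than at the server's model x. The stepsize and problem are illustrative; in the paper the mechanism is used for server-to-worker broadcast in distributed methods.

    import numpy as np

    def top_k(v, k):
        """Contractive (biased) compressor: keep the k largest-magnitude entries."""
        out = np.zeros_like(v)
        idx = np.argsort(np.abs(v))[-k:]
        out[idx] = v[idx]
        return out

    rng = np.random.default_rng(3)
    d, k, steps = 50, 5, 1500
    b = rng.normal(size=d)                # minimize f(x) = 0.5*||x - b||^2

    x = np.zeros(d)                       # model kept by the server
    w = np.zeros(d)                       # model estimate shared with the workers
    for _ in range(steps):
        grad = w - b                      # gradient is computed at w, not at x
        x = x - 0.1 * grad                # server takes the (uncompressed) step
        w = w + top_k(x - w, k)           # only a compressed correction is broadcast

    print("||w - b|| =", np.linalg.norm(w - b), "  ||x - w|| =", np.linalg.norm(x - w))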
Paper:
Kaja Gruntkowska, Alexander Tyurin and Peter Richtárik. EF21-P and friends: Improved theoretical communication complexity for distributed optimization with bidirectional compression, arXiv:2209.15218, 2022.
FLOW Talk #84
November 9, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract: In distributed or federated optimization and learning, communication between the different computing units is often the bottleneck and gradient compression is widely used to reduce the number of bits sent within each communication round of iterative methods. There are two classes of compression operators and separate algorithms making use of them. In the case of unbiased random compressors with bounded variance (e.g., rand-k), the DIANA algorithm of Mishchenko et al. (2019), which implements a variance reduction technique for handling the variance introduced by compression, is the current state of the art. In the case of biased and contractive compressors (e.g., top-k), the EF21 algorithm of Richtárik et al. (2021), which instead implements an error-feedback mechanism, is the current state of the art. These two classes of compression schemes and algorithms are distinct, with different analyses and proof techniques. In this paper, we unify them into a single framework and propose a new algorithm, recovering DIANA and EF21 as particular cases. Our general approach works with a new, larger class of compressors, which has two parameters, the bias and the variance, and includes unbiased and biased compressors as particular cases. This allows us to inherit the best of both worlds: like EF21 and unlike DIANA, biased compressors, like top-k, whose good performance in practice is recognized, can be used. And like DIANA and unlike EF21, independent randomness at the compressors makes it possible to mitigate the effects of compression, with the convergence rate improving when the number of parallel workers is large. This is the first time that an algorithm with all these features is proposed. We prove its linear convergence under certain conditions. Our approach takes a step towards better understanding of two so-far distinct worlds of communication-efficient distributed learning.
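For readers unfamiliar with the two compressor classes named above, here is a small illustrative check (not the EF-BV algorithm itself) of the key difference: rand-k is unbiased with bounded variance, while top-k is biased but contractive.

    import numpy as np

    rng = np.random.default_rng(4)

    def rand_k(v, k):
        """Unbiased compressor with bounded variance: k random coordinates, rescaled by d/k."""
        out = np.zeros_like(v)
        idx = rng.choice(v.size, size=k, replace=False)
        out[idx] = v[idx] * v.size / k
        return out

    def top_k(v, k):
        """Biased, contractive compressor: the k largest-magnitude coordinates, unscaled."""
        out = np.zeros_like(v)
        idx = np.argsort(np.abs(v))[-k:]
        out[idx] = v[idx]
        return out

    v = rng.normal(size=100)
    samples = np.array([rand_k(v, 10) for _ in range(20000)])
    print("rand-k: ||E[C(v)] - v||      :", np.linalg.norm(samples.mean(axis=0) - v))   # small: unbiased
    print("rand-k: typical ||C(v) - v|| :", np.mean(np.linalg.norm(samples - v, axis=1)))  # large variance
    print("top-k : ||C(v) - v||         :", np.linalg.norm(top_k(v, 10) - v))           # biased but contractive
    # EF-BV treats both as members of one two-parameter family (a bias parameter and a
    # variance parameter) and analyzes a single algorithm that recovers DIANA and EF21.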
Paper:
Laurent Condat, Kai Yi and Peter Richtárik. EF-BV: A unified theory of error feedback and variance reduction mechanisms for biased and unbiased compression in distributed optimization, NeurIPS 2022.
FLOW Talk #83
October 26, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract: Most existing federated learning (FL) frameworks assume a mutualistic relationship between participants, i.e., all the participating clients behave honestly and benefit from one another. However, just like in nature, real-world applications of FL often have to encounter other forms of relationships such as commensalism (some participants benefit while others are not harmed), parasitism (a few participants benefit at the expense of others), and competition (all the participants compete to maximize their individual benefits). This is especially true in the cross-silo setting, which typically requires collaboration between competitors (e.g., financial and healthcare institutions) driven by ulterior business interests. To operate in non-mutualistic scenarios, FL algorithms must satisfy additional properties such as collaboration fairness and Byzantine robustness. In this talk, we first discuss a local credibility mutual evaluation mechanism that ensures collaborative fairness between participants by delivering individual models that are commensurate with their contributions. Additionally, we attempt to minimize privacy leakage using synthetic samples generated by a differentially private generative adversarial network for local credibility evaluation. Next, we introduce a distance-based outlier suppression mechanism to improve the robustness of FL algorithms against poisoning attacks by malicious clients. Finally, we highlight the need for new definitions of utility in a collaborative setting and novel methods for assessing them in a fair, transparent, and privacy-preserving manner.
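Purely as a generic illustration of distance-based outlier suppression (not the specific mechanism from the talk), the sketch below down-weights client updates that lie far from the coordinate-wise median before aggregation; the weighting rule and all constants are hypothetical.

    import numpy as np

    rng = np.random.default_rng(16)
    d, n_honest, n_malicious = 10, 18, 2
    honest = rng.normal(size=(n_honest, d))                  # benign client updates
    poisoned = 25.0 + rng.normal(size=(n_malicious, d))      # large poisoned updates
    all_updates = np.vstack([honest, poisoned])

    # down-weight updates far from the coordinate-wise median of all received updates
    center = np.median(all_updates, axis=0)
    dist = np.linalg.norm(all_updates - center, axis=1)
    weights = np.exp(-dist / np.median(dist))                # far-away updates get tiny weight
    weights /= weights.sum()
    robust_agg = weights @ all_updates

    print("plain mean, error vs honest mean      :", np.linalg.norm(all_updates.mean(axis=0) - honest.mean(axis=0)))
    print("suppressed aggregate, error vs honest :", np.linalg.norm(robust_agg - honest.mean(axis=0)))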
FLOW Talk #82
October 19, 2022 @ 5pm Coordinated Universal Time (UTC)
Abstract: We consider two federated learning algorithms for training partially personalized models, where the shared and personal parameters are updated either simultaneously or alternately on the devices. Both algorithms have been proposed in the literature, but their convergence properties are not fully understood, especially for the alternating variant. We provide convergence analyses of both algorithms in the general nonconvex setting with partial participation and delineate the regime where one dominates the other. Our experiments on real-world image, text, and speech datasets demonstrate that (a) partial personalization can obtain most of the benefits of full model personalization with a small fraction of personal parameters, and, (b) the alternating update algorithm outperforms the simultaneous update algorithm by a small but consistent margin.
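A toy sketch of the two update orders discussed above, with shared parameters u, per-client personal parameters v_i, and illustrative least-squares losses; the stepsizes, the number of local steps and the losses are not from the paper.

    import numpy as np

    rng = np.random.default_rng(5)
    n_clients, d_shared, d_pers = 5, 4, 2
    A = rng.normal(size=(n_clients, d_shared))    # toy per-client data
    B = rng.normal(size=(n_clients, d_pers))
    y = rng.normal(size=n_clients)

    def grads(u, v, i):
        """Gradient of the toy loss 0.5*(A_i @ u + B_i @ v_i - y_i)^2."""
        r = A[i] @ u + B[i] @ v - y[i]
        return r * A[i], r * B[i]

    def local_update(u, v, i, mode, lr=0.05, steps=10):
        for _ in range(steps):
            gu, gv = grads(u, v, i)
            if mode == "simultaneous":            # update u and v_i together
                u, v = u - lr * gu, v - lr * gv
            else:                                 # "alternating": v_i first, then u
                v = v - lr * gv
                u = u - lr * grads(u, v, i)[0]
        return u, v

    for mode in ["simultaneous", "alternating"]:
        u = np.zeros(d_shared)                    # shared parameters
        V = np.zeros((n_clients, d_pers))         # personal parameters, one row per client
        for _ in range(100):                      # communication rounds
            outs = [local_update(u, V[i], i, mode) for i in range(n_clients)]
            u = np.mean([o[0] for o in outs], axis=0)   # server averages only the shared part
            V = np.array([o[1] for o in outs])          # personal parts stay on the clients
        loss = np.mean([(A[i] @ u + B[i] @ V[i] - y[i]) ** 2 for i in range(n_clients)]) / 2
        print(mode, "final average loss:", round(float(loss), 6))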
Paper:
Krishna Pillutla, Kshitiz Malik, Abdel-Rahman Mohamed, Mike Rabbat, Maziar Sanjabi, and Lin Xiao. Federated Learning with Partial Model Personalization, ICML, 2022
FLOW Talk #81
October 12, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract: A central challenge in federated learning (FL) is to capture shared information across clients’ data distributions despite the fact that these distributions are heterogeneous. A promising avenue to overcome this challenge is through representation learning, as clients’ data often share a common representation, or set of important features, that enables quickly solving their task, even in heterogeneous settings. Unfortunately, the standard FL objective of finding a single model that minimizes the average loss across tasks does not necessitate learning a shared representation. To remedy this issue, we propose a novel algorithm, FedRep, that aims to learn a shared representation and personalized (client-specific) heads, or last layer(s) of the model. We show that FedRep recovers the ground-truth representation at a linear rate in the multi-task linear regression setting with an efficient number of samples per client, and demonstrate that FedRep achieves test accuracy improvements over a variety of personalized FL baselines on image classification and sentiment analysis benchmarks. Nevertheless, an interesting observation from our experiments is that FedAvg [McMahan et al. 2017] plus fine-tuning performs nearly as well as FedRep, suggesting that it too learns an expressive shared representation, even though its objective of finding a single model to minimize the average loss across tasks is not tied to representation learning. We explain this phenomenon by again analyzing the multi-task linear regression setting, and reach a surprising conclusion: FedAvg’s representation learning success is directly tied to the fact that it does not actually solve its objective. Our analysis reveals that FedAvg’s multiple local updates between communication rounds leverage client diversity to learn the shared representation, while simultaneously preventing the learned model from minimizing the average loss. Further, we prove that exactly minimizing the average loss via Distributed Gradient Descent (D-GD) cannot recover the shared representation. We support this analysis with experiments on neural networks showing that FedAvg learns representations that generalize drastically better after fine-tuning to new clients than does D-SGD, highlighting the importance of local updates to representation learning.
Paper:
Liam Collins, Hamed Hassani, Aryan Mokhtari, and Sanjay Shakkottai. Exploiting Shared Representations for Personalized Federated Learning, arXiv preprint arXiv:2102.07078, 2021.
Liam Collins, Hamed Hassani, Aryan Mokhtari, and Sanjay Shakkottai. FedAvg with Fine Tuning: Local Updates Lead to Representation Learning, arXiv preprint arXiv:2205.13692, 2022.
FLOW Talk #80
September 14, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract: We study the problem of distributed stochastic non-convex optimization with intermittent communication. We consider the full participation setting where M machines work in parallel over R communication rounds and the partial participation setting where m machines are sampled independently every round from some meta-distribution over machines. We propose and analyze a new algorithm that improves existing methods by requiring fewer and lighter variance reduction operations. We also present lower bounds, showing our algorithm is either optimal or almost optimal in most settings.
FLOW Talk #79
August 31, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract: We study the asynchronous stochastic gradient descent algorithm for distributed training over $n$ workers that might be heterogeneous. In this algorithm, workers compute stochastic gradients in parallel at their own pace and return them to the server without any synchronization. Existing convergence rates of this algorithm for non-convex smooth objectives depend on the maximum delay $\tau_{\max}$ and reach an $\epsilon$-stationary point after $O(\sigma^2\epsilon^{-2}+ \tau_{\max}\epsilon^{-1})$ iterations, where $\sigma$ is the variance of stochastic gradients. In this work (i) we obtain a tighter convergence rate of $O(\sigma^2\epsilon^{-2}+ \sqrt{\tau_{\max}\tau_{\mathrm{avg}}}\epsilon^{-1})$ without any change in the algorithm, where $\tau_{\mathrm{avg}}$ is the average delay, which can be significantly smaller than $\tau_{\max}$. We also provide (ii) a simple delay-adaptive learning rate scheme, under which asynchronous SGD achieves a convergence rate of $O(\sigma^2\epsilon^{-2}+ \tau_{\mathrm{avg}}\epsilon^{-1})$, and does not require any extra hyperparameter tuning nor extra communications. Our result allows us to show for the first time that asynchronous SGD is always faster than mini-batch SGD. In addition, (iii) we consider the case of heterogeneous functions motivated by federated learning applications and improve the convergence rate by proving a weaker dependence on the maximum delay compared to prior works.
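A schematic simulation of asynchronous SGD with a delay-adaptive stepsize, assuming a toy quadratic objective and an illustrative rule lr = base_lr / (1 + delay); the paper's exact delay-adaptive scheme may differ, and the timing model is invented for the simulation.

    import heapq
    import numpy as np

    rng = np.random.default_rng(6)
    d, n_workers, T = 10, 4, 2000
    b = rng.normal(size=d)                        # minimize f(x) = 0.5*||x - b||^2
    base_lr, sigma = 0.5, 0.1

    x = np.zeros(d)
    snapshots = {0: x.copy()}                     # model snapshots, keyed by server iteration
    # every worker starts a gradient at the initial model; events: (finish_time, worker, version)
    inflight = [(rng.exponential(w + 1.0), w, 0) for w in range(n_workers)]
    heapq.heapify(inflight)

    for t in range(1, T + 1):
        finish, worker, started_at = heapq.heappop(inflight)
        delay = (t - 1) - started_at              # staleness of the gradient being applied
        g = (snapshots[started_at] - b) + sigma * rng.normal(size=d)   # stale stochastic gradient
        x = x - base_lr / (1.0 + delay) * g       # delay-adaptive stepsize (illustrative rule)
        snapshots[t] = x.copy()
        # the worker immediately starts a new gradient computation at the current model
        heapq.heappush(inflight, (finish + rng.exponential(worker + 1.0), worker, t))

    print("||x - b|| after", T, "asynchronous updates:", np.linalg.norm(x - b))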
Paper:
Anastasia Koloskova, Sebastian Stich and Martin Jaggi, Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning, arXiv preprint arXiv:2206.08307, 2022
FLOW Talk #78
July 6, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract: Federated learning (FL) aims to minimize the communication cost of training a model over heterogeneous data distributed across many clients. A common approach is local update methods, where clients take multiple optimization steps over local data before communicating with the server (e.g., FedAvg). When clients’ data is homogeneous, local methods converge quickly with a small number of communication rounds. On the other hand, when clients are highly heterogeneous, global update methods, where clients take one optimization step per communication round (e.g., SGD), significantly outperform local methods. A fundamental question in FL is how to exploit the benefits of both local and global approaches to achieve near-optimal convergence rates for all heterogeneity regimes. To this end, we propose a federated chaining framework: we first run a local-update method up to a heterogeneity-induced error floor and then switch to a global-update method. Federated chaining is the first approach to achieve near-optimal convergence rates for strongly convex functions and for non-convex functions satisfying the PL condition. Empirical results support this theoretical gain over existing methods.
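A toy sketch of the chaining idea on heterogeneous quadratics: a local-update (FedAvg-style) phase makes fast initial progress but stalls at a heterogeneity-induced error floor, after which global gradient steps remove the remaining bias. The switch point, losses and stepsizes are illustrative, not the paper's criterion.

    import numpy as np

    rng = np.random.default_rng(7)
    n_clients, d = 8, 6
    a = rng.uniform(0.5, 2.0, size=n_clients)          # heterogeneous curvatures
    B = rng.normal(size=(n_clients, d)) * 3.0          # f_i(x) = 0.5 * a_i * ||x - B_i||^2
    x_star = (a[:, None] * B).sum(axis=0) / a.sum()    # minimizer of the average loss

    def fedavg_round(x, lr=0.2, local_steps=10):
        models = []
        for i in range(n_clients):
            y = x.copy()
            for _ in range(local_steps):
                y -= lr * a[i] * (y - B[i])            # multiple local gradient steps
            models.append(y)
        return np.mean(models, axis=0)

    def global_round(x, lr=0.5):
        g = np.mean([a[i] * (x - B[i]) for i in range(n_clients)], axis=0)
        return x - lr * g                              # one global gradient step per round

    x = np.zeros(d)
    for _ in range(50):                                # phase 1: cheap local-update progress,
        x = fedavg_round(x)                            # which stalls at a heterogeneity floor
    print("after local phase :", np.linalg.norm(x - x_star))
    for _ in range(50):                                # phase 2: global steps remove the bias
        x = global_round(x)
    print("after global phase:", np.linalg.norm(x - x_star))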
Paper:
Charlie Hou, Kiran K. Thekumparampil, Giulia Fanti, and Sewoong Oh. Reducing the Communication Cost of Federated Learning through Multistage Optimization, arXiv preprint arXiv:2108.06869, 2021.
FLOW Talk #77
June 29, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract: Some of the most impactful machine learning tasks involve data that is very large and/or comes from many different sources. For these tasks, parallel and distributed computing is necessary to make learning tractable. One way to speed up distributed learning—and make it feasible on edge devices—is decentralization. Decentralization removes the bottleneck and single-point-of-failure of a centralized parameter server by having learners communicate directly with each other over a sparse network topology, usually one that mirrors the physical connectivity of the devices. This talk will cover the basic principles of distributed and decentralized ML, present some of our results on designing optimal learning algorithms for decentralized learning, and describe how low-precision arithmetic can be used to further accelerate learning over low-bandwidth networks.
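A minimal sketch of decentralized SGD over a ring topology, assuming toy quadratic losses: each node takes a local stochastic gradient step and then gossip-averages with its two neighbours via a mixing matrix W, with no central server. (Low-precision communication is not modelled here.)

    import numpy as np

    rng = np.random.default_rng(8)
    n, d, steps = 8, 5, 300
    B = rng.normal(size=(n, d))                 # node i minimizes f_i(x) = 0.5*||x - B_i||^2
    x_star = B.mean(axis=0)

    # ring topology mixing matrix: each node averages with its left and right neighbours
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 0.5
        W[i, (i - 1) % n] = 0.25
        W[i, (i + 1) % n] = 0.25

    X = np.zeros((n, d))                        # one model per node, no central server
    lr = 0.1
    for _ in range(steps):
        grads = X - B + 0.05 * rng.normal(size=(n, d))   # local stochastic gradients
        X = W @ (X - lr * grads)                         # gradient step followed by gossip averaging
    print("max distance of any node to the global optimum:",
          np.linalg.norm(X - x_star, axis=1).max())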
Paper:
Yucheng Lu and Christopher De Sa. Optimal Complexity in Decentralized Training, ICML 2021 (Outstanding Paper Honorable Mention). http://proceedings.mlr.press/v139/lu21a.html
FLOW Talk #76
June 22, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract: Collaborative, secure ML paradigms have emerged as a compelling alternative for sensitive applications in the last few years. These paradigms eliminate the need to pool data centrally for ML training and thus ensure data sovereignty and alleviate the risks associated with the large-scale collection of sensitive data. Although they provide many privacy benefits, these systems amplify ML robustness issues by exposing the learning process to an active attacker that can be present throughout the entire training process. In this talk, I will give an overview of the security and robustness challenges of Collaborative Learning Systems and highlight why a definitive solution to robustness in these systems is challenging.
Paper:
Abstract: This talk discusses federated learning with privacy and system constraints. In the first part of the talk, we show that a new DP-FTRL algorithm can achieve good privacy-utility trade-offs without relying on sampling for privacy amplification, which helps us train production neural networks directly on user data with a formal DP guarantee. In the second part of the talk, we discuss how to model and tackle periodic distribution shifts observed in practical cross-device FL systems. Finally, we briefly discuss new insights on data heterogeneity.
Papers:
Peter Kairouz, Brendan McMahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, Zheng Xu. Practical and Private (Deep) Learning without Sampling or Shuffling. ICML 2021
Federated Learning with Formal Differential Privacy Guarantees. Google AI Blog, 2022
Chen Zhu, Zheng Xu, Mingqing Chen, Jakub Konečný, Andrew Hard, Tom Goldstein. Diurnal or Nocturnal? Federated Learning of Multi-branch Networks from Periodically Shifting Distributions. ICLR 2022
Abstract: In many real-world machine learning applications, data are provided by multiple providers, each of which maintains private records of different feature sets about common entities. Privacy-preserving federated learning for vertically partitioned data has shown promising results as a solution for emerging multi-party joint modeling applications, in which the data holders collaborate throughout the learning process rather than relying on a trusted third party to hold the data. This talk will discuss the data privacy challenges of vertical federated learning (VFL) compared to the horizontal setting. Instead of using traditional secure techniques (such as encryption, secure multi-party computation, and differential privacy), we utilize the idea of isolating local data as much as possible to achieve lightweight and lossless data privacy. We apply this idea to achieve fast and secure VFL.
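As a generic illustration of the vertically partitioned setting only (not the zeroth-order or bilevel algorithms from the papers below), the sketch trains a linear model where two parties hold disjoint feature blocks of the same samples and exchange only partial scores and residuals, never raw features.

    import numpy as np

    rng = np.random.default_rng(17)
    n, dA, dB = 200, 3, 2
    XA = rng.normal(size=(n, dA))           # party A's features for the shared samples
    XB = rng.normal(size=(n, dB))           # party B's features for the same samples
    w_true = rng.normal(size=dA + dB)
    y = np.hstack([XA, XB]) @ w_true + 0.01 * rng.normal(size=n)   # labels held by party A

    wA, wB = np.zeros(dA), np.zeros(dB)
    lr = 0.1
    for _ in range(300):
        score = XA @ wA + XB @ wB            # each party contributes a partial score
        residual = score - y                 # computed by the label holder, shared with B
        wA -= lr * XA.T @ residual / n
        wB -= lr * XB.T @ residual / n       # B never sees XA, A never sees XB

    print("parameter error:", np.linalg.norm(np.hstack([wA, wB]) - w_true))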
Paper:
Qingsong Zhang, Bin Gu, Cheng Deng, and Heng Huang. Desirable Companion for Vertical Federated Learning: New Zeroth-Order Gradient Based Algorithm. CIKM, 2021.
Qingsong Zhang, Bin Gu, Cheng Deng, and Heng Huang. Secure Bilevel Asynchronous Vertical Federated Learning with Backward Updating, AAAI, 2021.
Bin Gu, An Xu, Zhouyuan Huo, Cheng Deng, and Heng Huang. Privacy-Preserving Asynchronous Vertical Federated Learning Algorithms for Multi-Party Collaborative Learning. TNNLS. 2021.
FLOW Talk #73
May 25, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract: Federated Learning has great potential to be a key component in the quest for data privacy. However, a federated learning protocol between a group of users and a central server protects the privacy of these users from the server only if the server cannot recover user data from updates sent during training. So far, most investigations into such attacks have focused on an "honest-but-curious" threat model for the server. This talk will discuss recent investigations into "malicious server" threat models. We argue that these threat models are highly relevant from a user-centric perspective, and that federated learning is far more vulnerable to privacy attacks under them.
Paper:
Liam H Fowl, Jonas Geiping, Wojciech Czaja, Micah Goldblum, Tom Goldstein. Robbing the Fed: Directly Obtaining Private Data in Federated Learning with Modified Models. ICLR 2022
Liam Fowl, Jonas Geiping, Steven Reich, Yuxin Wen, Wojtek Czaja, Micah Goldblum, Tom Goldstein. Decepticons: Corrupted Transformers Breach Privacy in Federated Learning for Language Models. arXiv:2201.12675
Yuxin Wen, Jonas Geiping, Liam Fowl, Micah Goldblum, Tom Goldstein. Fishing for User Data in Large-Batch Federated Learning via Gradient Magnification. arXiv:2202.00580
FLOW Talk #72
May 11, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract: In distributed learning, local SGD (also known as federated averaging) and its simple baseline minibatch SGD are widely studied optimization methods. Most existing analyses of these methods assume independent and unbiased gradient estimates obtained via with-replacement sampling. In contrast, we study shuffling-based variants: minibatch and local Random Reshuffling, which draw stochastic gradients without replacement and are thus closer to practice. For smooth functions satisfying the Polyak-Łojasiewicz condition, we obtain convergence bounds (in the large epoch regime) which show that these shuffling-based variants converge faster than their with-replacement counterparts. Moreover, we prove matching lower bounds showing that our convergence analysis is tight. Finally, we propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
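A small experiment contrasting the two sampling schemes on a toy one-dimensional least-squares problem: with-replacement SGD versus Random Reshuffling, everything else held fixed; constants are illustrative.

    import numpy as np

    rng = np.random.default_rng(9)
    data = rng.normal(loc=3.0, size=1000)       # minimize f(x) = mean of 0.5*(x - data_i)^2

    def run_epochs(sampling, epochs=30, lr=0.01):
        x, n = 0.0, data.size
        for _ in range(epochs):
            if sampling == "with_replacement":
                order = rng.integers(n, size=n)            # i.i.d. indices
            else:                                          # "random_reshuffling"
                order = rng.permutation(n)                 # each point exactly once per epoch
            for i in order:
                x -= lr * (x - data[i])
        return x

    for sampling in ["with_replacement", "random_reshuffling"]:
        errs = [abs(run_epochs(sampling) - data.mean()) for _ in range(20)]
        print(sampling, "mean |error| over 20 runs:", np.mean(errs))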
Paper:
Chulhee Yun, Shashank Rajput, and Suvrit Sra. Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond, International Conference on Learning Representations (ICLR), 2022.
FLOW Talk #71
May 4, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract: We introduce ProxSkip, a surprisingly simple and provably efficient method for minimizing the sum of a smooth ($f$) and an expensive nonsmooth proximable ($\psi$) function. The canonical approach to solving such problems is via the proximal gradient descent (ProxGD) algorithm, which is based on the evaluation of the gradient of $f$ and the prox operator of $\psi$ in each iteration. In this work we are specifically interested in the regime in which the evaluation of prox is costly relative to the evaluation of the gradient, which is the case in many applications. ProxSkip allows for the expensive prox operator to be skipped in most iterations: while its iteration complexity is $\mathcal{O}(\kappa \log(1/\varepsilon))$, where $\kappa$ is the condition number of $f$, the number of prox evaluations is $\mathcal{O}(\sqrt{\kappa} \log(1/\varepsilon))$ only. Our main motivation comes from federated learning, where evaluation of the gradient operator corresponds to taking a local GD step independently on all devices, and evaluation of prox corresponds to (expensive) communication in the form of gradient averaging. In this context, ProxSkip offers an effective acceleration of communication complexity. Unlike other local gradient-type methods, such as FedAvg, Scaffold, S-Local-GD and FedLin, whose theoretical communication complexity is worse than, or at best matching, that of vanilla GD in the heterogeneous data regime, we obtain a provable and large improvement without any heterogeneity-bounding assumptions.
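Below is a hedged sketch of the prox-skipping structure in the federated interpretation described above (prox = consensus averaging, i.e., communication): each iteration takes a local gradient step shifted by a control variate, and the prox/communication is applied only with probability p. The update rule is paraphrased from the ProxSkip paper and the constants are illustrative; details should be checked against the paper.

    import numpy as np

    rng = np.random.default_rng(10)
    n, d = 10, 5
    B = rng.normal(size=(n, d))          # client i holds f_i(x) = 0.5*||x - B_i||^2
    x_star = B.mean(axis=0)              # minimizer of the consensus problem

    gamma, p, T = 0.5, 0.2, 400
    X = np.zeros((n, d))                 # one local model per client
    H = np.zeros((n, d))                 # control variates
    comms = 0
    for _ in range(T):
        grads = X - B
        X_hat = X - gamma * (grads - H)            # local gradient step, shifted by h_i
        if rng.random() < p:                       # apply the prox (communicate) only rarely
            comms += 1
            Z = X_hat - (gamma / p) * H
            X_new = np.tile(Z.mean(axis=0), (n, 1))    # prox of the consensus constraint = averaging
        else:
            X_new = X_hat                          # skip the prox: a purely local iteration
        H = H + (p / gamma) * (X_new - X_hat)      # control-variate update
        X = X_new

    print("communication rounds:", comms, " max client error:",
          np.linalg.norm(X - x_star, axis=1).max())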
Paper:
Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich and Peter Richtárik, ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication Acceleration! Finally!, arXiv preprint arXiv:2202.09357, 2022.
FLOW Talk #70
April 20, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract: There is a dearth of convergence results for differentially private federated learning (FL) with non-Lipschitz objective functions (i.e., when gradient norms are not bounded). The primary reason for this is that the clipping operation (i.e., projection onto an $\ell_2$ ball of a fixed radius called the clipping threshold) used to bound the sensitivity of the average update to each client's update introduces a bias that depends on the clipping threshold and on the number of local steps in FL, and analyzing this bias is not easy. For Lipschitz functions, the Lipschitz constant serves as a trivial clipping threshold with zero bias. However, Lipschitzness does not hold in many practical settings; moreover, verifying it and computing the Lipschitz constant is hard. Thus, the choice of the clipping threshold is non-trivial and requires a lot of tuning in practice.
In this work, we provide the first convergence result for private FL with clipping on smooth convex objectives for a general clipping threshold, without assuming Lipschitzness. We also look at a simpler alternative to clipping (for bounding sensitivity), namely normalization, where we use only a scaled version of the unit vector along the client updates, completely discarding the magnitude information. The resulting normalization-based private FL algorithm is theoretically shown to have better convergence than its clipping-based counterpart on smooth convex functions.
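A minimal sketch contrasting the two ways of bounding per-client sensitivity before noise is added, for one round of private averaging of client updates; the clipping threshold, noise level and DP calibration below are illustrative placeholders rather than a calibrated mechanism.

    import numpy as np

    rng = np.random.default_rng(11)
    n_clients, d = 100, 20
    updates = rng.normal(size=(n_clients, d)) * rng.uniform(0.1, 10.0, size=(n_clients, 1))

    def private_mean(updates, mode, C=1.0, noise_std=0.5):
        norms = np.linalg.norm(updates, axis=1, keepdims=True)
        if mode == "clip":
            bounded = updates * np.minimum(1.0, C / norms)   # project onto the l2 ball of radius C
        else:  # "normalize"
            bounded = C * updates / norms                    # keep only the (scaled) direction
        # each contribution now has l2 norm <= C, so Gaussian noise scaled with C/n can be
        # calibrated to give differential privacy (the exact calibration is omitted here)
        return bounded.mean(axis=0) + (noise_std * C / n_clients) * rng.normal(size=d)

    true_mean = updates.mean(axis=0)
    for mode in ["clip", "normalize"]:
        est = private_mean(updates, mode)
        print(mode, "-> distance to the non-private mean:", round(float(np.linalg.norm(est - true_mean)), 3))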
Paper:
Rudrajit Das, Abolfazl Hashemi, Sujay Sanghavi, Inderjit S. Dhillon. On the Convergence of Differentially Private Federated Learning on Non-Lipschitz Objectives, and with Normalized Client Updates. arXiv preprint arXiv:2106.07094, 2021.
FLOW Talk #69
April 13, 2022 @ 5pm Coordinated Universal Time (UTC)
Abstract: Motivated by differentially-private (DP) training of machine learning models and other applications, we investigate the problem of computing prefix sums in the online (streaming) setting with DP. This problem has previously been addressed by special-purpose tree aggregation schemes with hand-crafted estimators. We show that these previous schemes can all be viewed as specific instances of a broad class of matrix-factorization-based DP mechanisms, and that in fact much better mechanisms exist in this class.
In particular, we characterize optimal factorizations of linear queries under online constraints, deriving existence, uniqueness, and explicit expressions that allow us to efficiently compute optimal mechanisms, including for online prefix sums. These solutions improve over the existing state-of-the-art by a significant constant factor, and avoid some of the artifacts introduced by the use of the tree data structure.
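To make the factorization view concrete, the sketch below writes the prefix-sum matrix A (lower-triangular ones) as A = B C using the classical binary-tree aggregation scheme, and compares the resulting error proxy sens(C)^2 * ||B||_F^2 with the two trivial factorizations; the paper's optimal online factorizations (not implemented here) improve further on the tree mechanism.

    import numpy as np

    def dyadic_nodes(n):
        """All dyadic intervals (start, size) of a complete binary tree over n = 2**k leaves."""
        nodes, size = [], 1
        while size <= n:
            nodes += [(s, size) for s in range(0, n, size)]
            size *= 2
        return nodes

    def tree_factorization(n):
        """Express the prefix-sum matrix A = B @ C: C computes one partial sum per tree node,
        and B combines O(log n) nodes per prefix (tree aggregation as a matrix factorization)."""
        nodes = dyadic_nodes(n)
        C = np.zeros((len(nodes), n))
        for j, (s, size) in enumerate(nodes):
            C[j, s:s + size] = 1.0
        B = np.zeros((n, len(nodes)))
        for t in range(1, n + 1):                 # decompose the prefix [0, t) into dyadic pieces
            pos = 0
            while pos < t:
                size = 1
                while size * 2 <= t - pos and pos % (size * 2) == 0:
                    size *= 2
                B[t - 1, nodes.index((pos, size))] = 1.0
                pos += size
        return B, C

    n = 64
    A = np.tril(np.ones((n, n)))
    B, C = tree_factorization(n)
    assert np.allclose(B @ C, A)

    def mech_error(B, C):
        # mechanism: release B @ (C @ x + z) with Gaussian z whose scale is proportional
        # to the sensitivity of C (its max column norm); the expected squared error is
        # then proportional to sens(C)**2 * ||B||_F**2
        sens = np.linalg.norm(C, axis=0).max()
        return sens ** 2 * np.linalg.norm(B) ** 2

    print("output perturbation (B=I, C=A):", mech_error(np.eye(n), A))
    print("input perturbation  (B=A, C=I):", mech_error(A, np.eye(n)))
    print("tree factorization            :", mech_error(B, C))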
Paper:
Brendan McMahan, Keith Rush, and Abhradeep Guha Thakurta. Private Online Prefix Sums via Optimal Matrix Factorizations, arXiv preprint arXiv:2202.08312, 2022.
FLOW Talk #68
April 6, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract: We present a theoretical study of server-side optimization in federated learning. Our results are the first to show that the widely popular heuristic of scaling the client updates with an extra parameter is very useful in the context of Federated Averaging (FedAvg) with local passes over the client data. Each local pass is performed without replacement using Random Reshuffling, which is a key reason we can show improved complexities. In particular, we prove that whenever the local stepsizes are small, and the update direction is given by FedAvg in conjunction with Random Reshuffling over all clients, one can take a big leap in the obtained direction and improve rates for convex, strongly convex, and non-convex objectives. In the non-convex regime, for example, the rate of convergence improves from O(ε^{-3}) to O(ε^{-2}). This result is new even for Random Reshuffling performed on a single node. In contrast, if the local stepsizes are large, we prove that the noise of client sampling can be controlled by using a small server-side stepsize. To the best of our knowledge, this is the first time that local steps provably help to overcome the communication bottleneck. Together, our results on the advantage of large and small server-side stepsizes give a formal justification for the practice of adaptive server-side optimization in federated learning. Moreover, we consider a variant of our algorithm that supports partial client participation, which makes the method more practical.
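A toy sketch of the large-server-stepsize regime described above: each client performs one Random Reshuffling pass over its local data with a small local stepsize, and the server takes an extrapolated step of length server_lr along the averaged update; server_lr = 1.0 recovers plain FedAvg averaging. All constants and losses are illustrative.

    import numpy as np

    rng = np.random.default_rng(12)
    n_clients, m, d = 5, 20, 4
    data = 3.0 + rng.normal(size=(n_clients, m, d))     # f_i(x) = mean_j 0.5*||x - data[i, j]||^2
    x_star = data.reshape(-1, d).mean(axis=0)

    def local_epoch_rr(x, i, local_lr=0.01):
        """One local pass over client i's data without replacement (Random Reshuffling)."""
        y = x.copy()
        for j in rng.permutation(m):
            y -= local_lr * (y - data[i, j])
        return y

    def run(server_lr, rounds=10):
        x = np.zeros(d)
        for _ in range(rounds):
            avg_model = np.mean([local_epoch_rr(x, i) for i in range(n_clients)], axis=0)
            x = x + server_lr * (avg_model - x)         # server step along the averaged update
        return np.linalg.norm(x - x_star)

    for server_lr in [1.0, 2.0]:                        # server_lr = 1.0 is plain FedAvg averaging
        print("server stepsize", server_lr, "-> distance to optimum after 10 rounds:", run(server_lr))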
Paper:
Grigory Malinovsky, Konstantin Mishchenko, and Peter Richtárik. Server-Side Stepsizes and Sampling Without Replacement Provably Help in Federated Optimization, arXiv preprint arXiv:2201.11066, 2022.
FLOW Talk #67
March 30, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract: Computation, communication and privacy are three important challenges in federated learning (FL). In this talk, we will discuss two approaches that tackle these challenges. The first one is FedSparse, which jointly tackles communication and computational cost reduction at the edge via model pruning. It is developed through the lens of a graphical model perspective for FL, in which the server parameters act as a prior over the local client parameters. In such a framework, with Gaussian priors we can obtain the well-known FedProx algorithm (and thus FedAvg as a special case), whereas with a spike-and-slab prior we can obtain FedSparse. In the second part of the talk we will discuss private and communication-efficient FL. By realising that both compression and privacy limit the amount of information encoded in a user update, we develop DP-REC, a novel method that exploits this property. We show that by imposing differential privacy (DP) constraints we directly affect the number of bits we need to send with DP-REC; thus, in practice, we can achieve DP guarantees with extreme compression (e.g., 7 bits per tensor).
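To illustrate the graphical-model view mentioned above, here is a minimal sketch of the FedProx-style local objective that arises when a Gaussian prior centred at the server model is placed over the local parameters; the dataset, the regularization strengths lam, and the solver are toy choices (lam = 0 corresponds to an unregularized, FedAvg-style local solve).

    import numpy as np

    rng = np.random.default_rng(13)
    d = 4
    w_server = np.zeros(d)                                # prior mean = current server model
    A = rng.normal(size=(8, d)); y = rng.normal(size=8)   # toy local dataset

    def local_objective(w, lam):
        """FedProx-style local objective: local loss plus the proximal term implied by
        a Gaussian prior N(w_server, 1/lam) over the local parameters."""
        loss = 0.5 * np.mean((A @ w - y) ** 2)
        return loss + 0.5 * lam * np.sum((w - w_server) ** 2)

    def local_grad(w, lam):
        return A.T @ (A @ w - y) / len(y) + lam * (w - w_server)

    for lam in [0.0, 1.0, 10.0]:
        w = w_server.copy()
        for _ in range(2000):                             # plain gradient descent as the local solver
            w -= 0.05 * local_grad(w, lam)
        print("lam =", lam,
              " local objective:", round(float(local_objective(w, lam)), 4),
              " distance from server model:", round(float(np.linalg.norm(w - w_server)), 4))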
Papers:
Christos Louizos, Matthias Reisser, Joseph Soriaga, and Max Welling. An Expectation-Maximization Perspective on Federated Learning, arXiv preprint arXiv:2111.10192, 2021.
Aleksei Triastcyn, Matthias Reisser, and Christos Louizos. DP-REC: Private & Communication-Efficient Federated Learning, arXiv preprint arXiv:2111.05454, 2021.
FLOW Talk #66
February 16, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract: Federated Learning (FL) is an increasingly popular machine learning paradigm in which multiple nodes try to collaboratively learn under privacy, communication and multiple heterogeneity constraints. A persistent problem in federated learning is that it is not clear what the optimization objective should be: the standard average risk minimization of supervised learning is inadequate in handling several major constraints specific to federated learning, such as communication adaptivity and personalization control. We identify several key desiderata in frameworks for federated learning and introduce a new framework, FLIX, that takes into account the unique challenges brought by federated learning. FLIX has a standard finite-sum form, which enables practitioners to tap into the immense wealth of existing (potentially non-local) methods for distributed optimization. Through a smart initialization that does not require any communication, FLIX does not require the use of local steps but is still provably capable of performing dissimilarity regularization on par with local methods. We give several algorithms for solving the FLIX formulation efficiently under communication constraints. Finally, we corroborate our theoretical results with extensive experimentation.
Papers:
Elnur Gasanov, Ahmed Khaled, Samuel Horváth and Peter Richtárik. FLIX: A Simple and Communication-Efficient Alternative to Local Methods in Federated Learning, The 25th International Conference on Artificial Intelligence and Statistics, 2022 (AISTATS 2022).
FLOW Talk #65
February 9th, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract: We study the MARINA method of Gorbunov et al. (2021) -- the current state-of-the-art distributed non-convex optimization method in terms of theoretical communication complexity. Theoretical superiority of this method can be largely attributed to two sources: the use of a carefully engineered biased stochastic gradient estimator, which leads to a reduction in the number of communication rounds, and the reliance on independent stochastic communication compression operators, which leads to a reduction in the number of transmitted bits within each communication round. In this paper we i) extend the theory of MARINA to support a much wider class of potentially correlated compressors, extending the reach of the method beyond the classical independent compressors setting, ii) show that a new quantity, for which we coin the name Hessian variance, allows us to significantly refine the original analysis of MARINA without any additional assumptions, and iii) identify a special class of correlated compressors based on the idea of random permutations, for which we coin the term PermK, the use of which leads to O(√n) (resp. O(1+d/√n)) improvement in the theoretical communication complexity of MARINA in the low Hessian variance regime when d≥n (resp. d≤n), where n is the number of workers and d is the number of parameters describing the model we are learning. We corroborate our theoretical results with carefully engineered synthetic experiments with minimizing the average of nonconvex quadratics, and on autoencoder training with the MNIST dataset.
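A hedged sketch of a permutation-based correlated compressor in the d >= n regime (my reading of the PermK idea; details should be checked against the paper): a shared random permutation partitions the coordinates into disjoint blocks, each worker sends only its block scaled by n, and averaging the compressed vectors is exact when all workers hold the same vector.

    import numpy as np

    rng = np.random.default_rng(14)
    n, d = 4, 12                                  # d >= n, and d divisible by n for simplicity

    def permk_compress(vectors):
        """Correlated compression: one shared random permutation splits the d coordinates
        into n disjoint blocks; worker i transmits only its block, scaled by n."""
        perm = rng.permutation(d).reshape(n, d // n)
        out = np.zeros_like(vectors)
        for i in range(n):
            out[i, perm[i]] = n * vectors[i, perm[i]]
        return out

    # identical vectors on all workers: the average is recovered exactly with n-times fewer entries
    v = rng.normal(size=d)
    print("error, identical workers:",
          np.linalg.norm(permk_compress(np.tile(v, (n, 1))).mean(axis=0) - v))

    # heterogeneous vectors: the averaged estimate is unbiased, with error driven by heterogeneity
    V = rng.normal(size=(n, d))
    means = np.array([permk_compress(V).mean(axis=0) for _ in range(5000)])
    print("bias over many draws    :", np.linalg.norm(means.mean(axis=0) - V.mean(axis=0)))
    print("typical single-draw err :", np.mean(np.linalg.norm(means - V.mean(axis=0), axis=1)))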
Paper:
Rafał Szlendak, Alexander Tyurin, and Peter Richtárik. Permutation Compressors for Provably Faster Distributed Nonconvex Optimization, International Conference on Learning Representations (ICLR 2022), 2022.
FLOW Talk #64
January 26th, 2022 @ 4pm Coordinated Universal Time (UTC)
Abstract:
Cross-device Federated Learning (FL) is a distributed learning paradigm with several challenges that differentiate it from traditional distributed learning; variability in the system characteristics of each device and millions of clients coordinating with a central server are the primary ones. Most FL systems described in the literature are synchronous: they perform a synchronized aggregation of model updates from individual clients. Scaling synchronous FL is challenging, since increasing the number of clients training in parallel leads to diminishing returns in training speed, analogous to large-batch training. Moreover, stragglers hinder synchronous FL training.
In the first part of this talk, I will describe a buffered asynchronous aggregation method, FedBuff, that combines the best properties of synchronous and asynchronous FL while being compatible with privacy-preserving technologies such as Secure Aggregation and differential privacy.
In the second part of the talk, I will outline a production asynchronous FL system design. Our work tackles the aforementioned issues, sketches some of the system design challenges and their solutions, and touches upon principles that emerged from building a production FL system for millions of clients.
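A schematic simulation of buffered asynchronous aggregation (a FedBuff-style scheme, not the paper's exact algorithm): clients finish local work at arbitrary times on possibly stale models, and the server applies an averaged update only once K client updates have accumulated in a buffer. The timing model, local solver and constants are invented for the illustration.

    import heapq
    import numpy as np

    rng = np.random.default_rng(15)
    n_clients, d, K = 20, 6, 5                      # K = number of updates buffered before aggregating
    B = 5.0 + rng.normal(size=(n_clients, d))       # client c's loss: 0.5*||x - B_c||^2
    x_star = B.mean(axis=0)

    x = np.zeros(d)
    version, snapshots = 0, {0: np.zeros(d)}
    events = [(rng.exponential(1.0), c, 0) for c in range(n_clients)]   # (finish_time, client, model version)
    heapq.heapify(events)
    buffer, aggregations = [], 0
    start_dist = np.linalg.norm(x - x_star)

    for _ in range(3000):                                        # simulated client completions
        t, c, started_version = heapq.heappop(events)
        local = snapshots[started_version].copy()
        for _ in range(5):                                       # local gradient steps on a (stale) model
            local -= 0.1 * (local - B[c])
        buffer.append(local - snapshots[started_version])        # client sends its update (delta)
        if len(buffer) == K:                                     # aggregate only once the buffer is full
            x = x + 0.2 * np.mean(buffer, axis=0)                # server step; 0.2 is an illustrative stepsize
            version += 1
            snapshots[version] = x.copy()
            buffer, aggregations = [], aggregations + 1
        heapq.heappush(events, (t + rng.exponential(1.0), c, version))  # client restarts on the newest model

    print("aggregations:", aggregations, " distance to optimum:",
          start_dist, "->", np.linalg.norm(x - x_star))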
Papers:
Nguyen, John, Kshitiz Malik, Hongyuan Zhan, Ashkan Yousefpour, Michael Rabbat, Mani Malek Esmaeili, and Dzmitry Huba. Federated Learning with Buffered Asynchronous Aggregation, arXiv preprint arXiv:2106.06639 (2021).
Huba, Dzmitry, John Nguyen, Kshitiz Malik, Ruiyu Zhu, Mike Rabbat, Ashkan Yousefpour, Carole-Jean Wu et al. Papaya: Practical, Private, and Scalable Federated Learning, arXiv preprint arXiv:2111.04877 (2021).
FLOW Talk #63
January 12th, 2022 @ 1pm Coordinated Universal Time (UTC)
Abstract:
Federated learning has drawn significant attention from academia and industry, and is even being deployed in real systems. Unfortunately, recent works have shown that adversaries can execute attacks to retrieve sensitive information from the model parameters/gradients. Prominent examples of such attacks are data reconstruction and various types of inference attacks. In this work, we propose a practical, privacy-preserving federated learning framework which protects clients’ private information against known privacy-related attacks. This framework adopts greedy layer-wise training and always updates layers inside Trusted Execution Environments (TEEs) at both the server and the clients. We implemented this system with a mobile-like TEE (i.e., TrustZone) and a server-like TEE (i.e., Intel SGX) and empirically tested its performance. We show that it is possible to fully guarantee privacy and achieve ML model utility comparable to regular end-to-end federated learning, without significant communication and system overhead.
Papers:
PPFL: privacy-preserving federated learning with trusted execution environments, arXiv preprint arXiv:2104.14380 (2021).
DarkneTZ: Towards Model Privacy at the Edge using Trusted Execution Environments, arXiv preprint arXiv:2004.05703 (2020).
Layer-wise Characterization of Latent Information Leakage in Federated Learning, arXiv preprint arXiv:2010.08762 (2020).