Federated Learning One World Seminar
Archive of Talks: FLOW Season 2024
FLOW Talk #122
October 23, 2024 @ 5pm Coordinated Universal Time (UTC)
Federated Learning in the Age of Foundation Models
host: Samuel Horvath
[slides]
Abstract: In the ever-evolving landscape of artificial intelligence, handling and leveraging data effectively has been, and will continue to be, a critical challenge, especially in the age of foundation models. Recent developments in utilizing foundation models, e.g., large language models (LLMs), have opened new horizons in research. Although most such models are trained in a centralized fashion, access to the necessary data can be restricted by factors such as privacy, regulation, geopolitics, and the sheer effort of moving large datasets. Since the fundamentals of federated learning (FL) address the pivotal balance between data access and the collaborative enhancement of AI models, in this talk we explore how FL, with easy and scalable integration capabilities, can address these challenges. Enabled by practical frameworks like NVIDIA FLARE, we will discuss the particular challenges of, and solutions for, embedding FL in foundation model development and customization to enhance accuracy and robustness. Ultimately, this talk underscores the transformative potential of FL for foundation models, offering insights into its current achievements and future possibilities.
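Illustrative sketch: the kind of integration described above typically reduces to a federated-averaging loop over model (or adapter) weights. Below is a minimal, framework-agnostic Python toy of one such loop; it is not the NVIDIA FLARE API, and all names (local_finetune, federated_round, the toy quadratic objective) are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)

    def local_finetune(adapter, client_target, lr=0.1, steps=10):
        # Toy local objective: move the adapter toward this client's target matrix
        # (stands in for fine-tuning on the client's private data).
        for _ in range(steps):
            grad = adapter - client_target          # gradient of 0.5*||adapter - target||^2
            adapter = adapter - lr * grad
        return adapter

    def federated_round(global_adapter, client_targets):
        # One FedAvg-style round: each client fine-tunes locally, the server averages.
        updates = [local_finetune(global_adapter.copy(), t) for t in client_targets]
        return np.mean(updates, axis=0)

    global_adapter = np.zeros((4, 4))                              # stands in for LoRA-style adapter weights
    client_targets = [rng.normal(size=(4, 4)) for _ in range(3)]   # stands in for heterogeneous client data
    for _ in range(5):
        global_adapter = federated_round(global_adapter, client_targets)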
FLOW Talk #121
October 16, 2024 @ 1pm Coordinated Universal Time (UTC)
Federated Learning Can Find Friends That Are Advantageous and Help with Low-Resource Machine Translation
host: Samuel Horvath
[slides]
Abstract: In Federated Learning (FL), the distributed nature and heterogeneity of client data present both opportunities and challenges. While collaboration among clients can significantly enhance the learning process, not all collaborations are beneficial; some may even be detrimental. In this talk, I will discuss our novel algorithm that assigns adaptive aggregation weights to clients participating in FL training, identifying those with data distributions most conducive to a specific learning objective. As I will explain during the talk, the proposed aggregation method converges no worse than the method that aggregates only the updates received from clients with the same data distribution. Furthermore, empirical evaluations consistently reveal that collaborations guided by the proposed algorithm outperform traditional FL approaches. In the second part of my talk, I will explain how the proposed approach can be adapted to the training of LLMs for low-resource languages.
Papers:
N. Tupitsa, S. Horváth, M. Takáč, E. Gorbunov. Federated Learning Can Find Friends That Are Advantageous, arXiv preprint, 2024.
V. Moskvoretskii, N. Tupitsa, C. Biemann, S. Horváth, E. Gorbunov, I. Nikishina. Low-Resource Machine Translation through the Lens of Personalized Federated Learning, to appear in EMNLP 2024 (Findings).
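Illustrative sketch: a minimal Python toy of the adaptive-weighting idea described in the abstract above, in which clients whose updates align with a given client's own update receive larger aggregation weights. This is an editor's illustration, not the algorithm from the papers; the cosine-similarity/softmax rule and the name adaptive_aggregate are assumptions.

    import numpy as np

    def adaptive_aggregate(target_update, client_updates, temperature=1.0):
        # Weight each client's update by how well it aligns with the target client's update
        # (cosine similarity passed through a softmax), a crude proxy for "advantageous friends".
        sims = np.array([
            np.dot(target_update, u) / (np.linalg.norm(target_update) * np.linalg.norm(u) + 1e-12)
            for u in client_updates
        ])
        weights = np.exp(sims / temperature)
        weights /= weights.sum()
        return sum(w * u for w, u in zip(weights, client_updates)), weights

    rng = np.random.default_rng(1)
    target = rng.normal(size=10)
    updates = [target + 0.1 * rng.normal(size=10),    # clients with a similar data distribution
               target + 0.1 * rng.normal(size=10),
               -target + 0.1 * rng.normal(size=10)]   # a dissimilar client, which gets a small weight
    aggregated, weights = adaptive_aggregate(target, updates)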
FLOW Talk #120
Abstract: As established scaling laws indicate, the future performance improvements of LLMs depend on the amount of compute and data we can leverage. Where will we get the necessary compute and data to drive the continued advances in LLMs that the world has now grown to expect? I believe all roads lead to federated learning. Federated and decentralized approaches to machine learning will be how the strongest LLMs (and foundation models more generally) are trained in the relatively near future; and in time, we will see federated learning as one of the core enablers of the entire AI revolution. In this talk, I will describe why the future of AI will be federated, and present early solutions developed by Flower Labs and CaMLSys that address the underlying technical challenges the world will face as we shift from a centralized data-center mindset to decentralized alternatives.
FLOW Talk #119
July 17, 2024 @ 1pm Coordinated Universal Time (UTC)
Overcoming the Challenges of Batch Normalization in Federated Learning
host: Sebastian Stich
Abstract: Batch normalization has proven to be a very beneficial mechanism for accelerating the training and improving the accuracy of deep neural networks in centralized environments. Yet the scheme faces significant challenges in federated learning, especially under high data heterogeneity. Essentially, the main challenges arise from external covariate shifts and inconsistent statistics across clients. We introduce Federated BatchNorm (FBN), a novel scheme that restores the benefits of batch normalization in federated learning. FBN ensures that batch normalization during training is consistent with what would be achieved in a centralized execution, hence preserving the distribution of the data and providing running statistics that accurately approximate the global statistics. FBN thereby reduces the external covariate shift and matches the evaluation performance of the centralized setting. We also show that, with a slight increase in complexity, we can robustify FBN to mitigate erroneous statistics and potential adversarial attacks.
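Illustrative sketch: the core idea, that normalization during federated training should match what a centralized run would compute, can be pictured as clients sharing their local batch statistics so that everyone normalizes with the same pooled mean and variance. The Python toy below is an editor's illustration of that pooling step only, not the FBN algorithm; the function names are hypothetical.

    import numpy as np

    def global_batch_stats(local_batches):
        # Combine per-client batch statistics into the mean/variance a centralized
        # execution would have computed over the concatenated batch.
        sizes = np.array([b.shape[0] for b in local_batches])
        means = np.array([b.mean(axis=0) for b in local_batches])
        sq_means = np.array([(b ** 2).mean(axis=0) for b in local_batches])
        w = sizes / sizes.sum()
        global_mean = (w[:, None] * means).sum(axis=0)
        global_var = (w[:, None] * sq_means).sum(axis=0) - global_mean ** 2
        return global_mean, global_var

    def normalize(batch, mean, var, eps=1e-5):
        return (batch - mean) / np.sqrt(var + eps)

    rng = np.random.default_rng(2)
    client_batches = [rng.normal(loc=i, size=(32, 8)) for i in range(3)]  # heterogeneous client batches
    mu, var = global_batch_stats(client_batches)
    normalized = [normalize(b, mu, var) for b in client_batches]          # all clients use the same stats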
FLOW Talk #118
July 10, 2024 @ 1pm Coordinated Universal Time (UTC)
Fast Proximal-Point Methods for Federated Optimization
host: Sebastian Stich
Abstract: In developing efficient optimization algorithms, it is crucial to account for communication constraints, a significant challenge in modern federated learning settings. In this talk, I will first revisit DANE, a distributed proximal-point algorithm, and show that it can exploit second-order dissimilarity and achieve the desired communication reduction under such conditions. However, its local computation efficiency is sub-optimal. I will then introduce a novel distributed algorithm, S-DANE. This method adopts a more stabilized prox-center in the proximal step and matches DANE's communication complexity. Moreover, the accuracy requirement for solving its subproblem is weaker than that of DANE, leading to enhanced local computation efficiency. Finally, I will show how to accelerate S-DANE and demonstrate that the resulting algorithm achieves the best-known communication complexity among all existing methods for convex distributed optimization, with the same improved local computation efficiency as S-DANE.
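Illustrative sketch: in a DANE-style method, each client (approximately) solves a local proximal subproblem anchored at a reference point, and the server averages the results. The Python toy below shows one such proximal step in closed form for quadratic local losses; it is an editor's illustration under these simplifying assumptions, not DANE or S-DANE themselves (in particular, the gradient-correction term and the stabilized prox-center are omitted).

    import numpy as np

    def local_prox_step(A, b, x_ref, mu):
        # Exactly solves  min_x 0.5*||A x - b||^2 + (mu/2)*||x - x_ref||^2 ;
        # in S-DANE this subproblem only needs to be solved approximately.
        d = A.shape[1]
        return np.linalg.solve(A.T @ A + mu * np.eye(d), A.T @ b + mu * x_ref)

    rng = np.random.default_rng(3)
    clients = [(rng.normal(size=(20, 5)), rng.normal(size=20)) for _ in range(4)]  # local quadratic losses
    x = np.zeros(5)
    mu = 1.0
    for _ in range(10):
        # Here the prox-center is simply the current iterate; the real methods add
        # a gradient correction (DANE) or a more stable prox-center (S-DANE).
        local_solutions = [local_prox_step(A, b, x, mu) for A, b in clients]
        x = np.mean(local_solutions, axis=0)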
FLOW Talk #117
June 12, 2024 @ 5pm Coordinated Universal Time (UTC)
Revisiting the Convergence Theory for Local SGD and the Role of Personalization
host: Sebastian Stich
Abstract: Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms like mini-batch SGD in practice. Despite its success, theoretically proving the dominance of local SGD in scenarios with reasonable data heterogeneity has been challenging, creating a gap between theory and practice. In this talk, we will discuss new lower bounds for local SGD under existing first-order data heterogeneity assumptions, demonstrating that most of these assumptions are insufficient to prove the effectiveness of local update steps. Furthermore, under these same assumptions, we demonstrate the min-max optimality of accelerated mini-batch SGD, which completes our theoretical understanding of distributed optimization for several problem classes. Our results emphasize the need for better models of data heterogeneity to understand the effectiveness of local SGD in practice. To address this, we consider higher-order smoothness and heterogeneity assumptions, providing new upper bounds that imply the dominance of local SGD over mini-batch SGD when data heterogeneity is low. In particular, in the strongly convex setting, we study the fixed-point discrepancy of local SGD and how data heterogeneity assumptions control it. Finally, we present a simple personalized variant of local SGD that rectifies several convergence issues, including the fixed-point discrepancy, and offers a better computation-communication trade-off.
Paper:
The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication: Kumar Kshitij Patel, Margalit Glasgow, Ali Zindari, Lingxiao Wang, Sebastian U. Stich, Ziheng Cheng, Nirmit Joshi, Nathan Srebro, arXiv, 2024.
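Illustrative sketch: the two baselines compared in the talk differ only in where stochastic gradients are evaluated between communications. The Python toy below contrasts local SGD (K local steps per communication round) with mini-batch SGD (the same gradient budget, all evaluated at the shared iterate) on a toy quadratic problem; it is an editor's illustration, not code from the paper.

    import numpy as np

    rng = np.random.default_rng(4)
    targets = [rng.normal(size=5) for _ in range(4)]   # client i minimizes 0.5*||x - targets[i]||^2

    def stochastic_grad(x, target, noise=0.1):
        return (x - target) + noise * rng.normal(size=x.shape)

    def local_sgd(rounds, local_steps, lr=0.1):
        x = np.zeros(5)
        for _ in range(rounds):                        # one communication per round
            local_models = []
            for t in targets:
                y = x.copy()
                for _ in range(local_steps):           # K local steps between communications
                    y -= lr * stochastic_grad(y, t)
                local_models.append(y)
            x = np.mean(local_models, axis=0)
        return x

    def minibatch_sgd(rounds, batch_per_round, lr=0.1):
        x = np.zeros(5)
        for _ in range(rounds):                        # same communication budget,
            grads = [stochastic_grad(x, t) for t in targets for _ in range(batch_per_round)]
            x -= lr * np.mean(grads, axis=0)           # but all gradients taken at the shared iterate
        return x

    x_local = local_sgd(rounds=50, local_steps=10)
    x_minibatch = minibatch_sgd(rounds=50, batch_per_round=10)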
FLOW Talk #116
March 27, 2024 @ 1pm Coordinated Universal Time (UTC)
Byzantine Robustness and Partial Participation Can Be Achieved Simultaneously: Just Clip Gradient Differences
host: Samuel Horvath
Abstract: Distributed learning has emerged as a leading paradigm for training large machine learning models. However, in real-world scenarios, participants may be unreliable or malicious, posing a significant challenge to the integrity and accuracy of the trained models. Byzantine fault tolerance mechanisms have been proposed to address these issues, but they often assume full participation from all clients, which is not always practical due to the unavailability of some clients or communication constraints. In our work, we propose the first distributed method with client sampling and provable tolerance to Byzantine workers. The key idea behind the developed method is the use of gradient clipping to control stochastic gradient differences in recursive variance reduction. This allows us to bound the potential harm caused by Byzantine workers, even during iterations when all sampled clients are Byzantine. Furthermore, we incorporate communication compression into the method to enhance communication efficiency. Under quite general assumptions, we prove convergence rates for the proposed method that match the existing state-of-the-art (SOTA) theoretical results.
Paper:
Byzantine Robustness and Partial Participation Can Be Achieved Simultaneously: Just Clip Gradient Differences: Grigory Malinovsky, Peter Richtárik, Samuel Horváth, Eduard Gorbunov, arXiv, 2024.
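Illustrative sketch: the key mechanism described in the abstract is clipping the gradient difference inside a recursive variance-reduction update, so that any single (possibly Byzantine) client can shift its estimate by at most the clipping radius per iteration. The Python toy below is an editor's illustration of that single step, not the paper's full method; the names and the attack vector are hypothetical.

    import numpy as np

    def clip(v, tau):
        norm = np.linalg.norm(v)
        return v if norm <= tau else v * (tau / norm)

    def update_estimate(g_prev, grad_new, grad_old, tau):
        # Recursive variance-reduction step with a clipped gradient difference:
        # even a Byzantine client can move the estimate by at most tau per iteration.
        return g_prev + clip(grad_new - grad_old, tau)

    rng = np.random.default_rng(5)
    g = rng.normal(size=10)                     # current gradient estimate for one client
    honest_new = g + 0.1 * rng.normal(size=10)  # an honest client's new gradient is close to the old one
    byzantine_new = 1e6 * np.ones(10)           # an attacker sends an arbitrarily large gradient
    g_honest = update_estimate(g, honest_new, g, tau=1.0)
    g_attacked = update_estimate(g, byzantine_new, g, tau=1.0)   # harm bounded by tau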
FLOW Talk #115
March 13, 2024 @ 5pm Coordinated Universal Time (UTC)
Provably Personalized and Robust Federated Learning
host: Samuel Horvath
Abstract: Federated learning is a powerful distributed optimization framework in which multiple clients collaboratively train a global model without sharing their raw data. In this work, we tackle the personalized version of the federated learning problem. In particular, we ask: throughout the training process, can each client in a federated system identify a subset of similar clients and collaboratively train with just those clients? Answering this in the affirmative, we formalize the problem as a stochastic optimization problem and achieve optimal convergence rates for a large class of loss functions. We propose simple iterative algorithms which identify clusters of similar clients and train a personalized model per cluster, using local client gradients and flexible constraints on the clusters. The convergence rates of our algorithms asymptotically match those obtained if we knew the true underlying clustering of the clients, and they are provably robust in the Byzantine setting where some fraction of the clients are malicious.
Paper:
An Efficient Framework for Clustered Federated Learning: Avishek Ghosh, Jichan Chung, Dong Yin, Kannan Ramchandran, NeurIPS, 2020.
Provably Personalized and Robust Federated Learning: Mariel Werner, Lie He, Michael Jordan, Martin Jaggi, Sai Praneeth Karimireddy, arXiv, 2023.
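Illustrative sketch: a minimal Python toy of the clustering idea from the abstract above: each client is assigned to the cluster model that currently fits its data best, and each cluster model is then updated only with gradients from its assigned clients. This is an editor's illustration of the generic alternating scheme, not the algorithms from the papers.

    import numpy as np

    rng = np.random.default_rng(6)
    true_centers = [np.full(5, -2.0), np.full(5, 2.0)]
    client_targets = [c + 0.1 * rng.normal(size=5) for c in true_centers for _ in range(5)]

    def loss(x, target):
        return 0.5 * np.sum((x - target) ** 2)

    def grad(x, target):
        return x - target

    cluster_models = [rng.normal(size=5) for _ in range(2)]
    lr = 0.2
    for _ in range(20):
        # 1) Each client picks the cluster model with the lowest loss on its own data.
        assignment = [int(np.argmin([loss(m, t) for m in cluster_models])) for t in client_targets]
        # 2) Each cluster model is updated with the averaged gradient of its assigned clients.
        for k in range(len(cluster_models)):
            members = [t for t, a in zip(client_targets, assignment) if a == k]
            if members:
                g = np.mean([grad(cluster_models[k], t) for t in members], axis=0)
                cluster_models[k] -= lr * g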
FLOW Talk #114
Abstract: Data sharing and collaborative (machine) learning have use cases where individual "agents" (e.g., researchers, organizations) have limited capabilities or resources to conduct large-scale data collection and thus turn to each other for collaboration. A motivating example is in medicine (NEJM, Nature Journal), where the data are incredibly valuable and costly and are under stringent privacy regulations. Hence, we formally study the problem of data sharing and/or collaborative learning to analyze the desiderata (in light of these practical considerations) and propose principled solutions to achieve them. We identify several important desiderata, especially the fairness of the collaboration, which is motivated by the fact that each agent incurs a non-trivial (and sometimes significant) cost from procuring or otherwise collecting their data, so it is imperative that their effort is fairly recognized and rewarded in the form of specific incentives. In this talk, I will present some precise formalizations of fairness (e.g., via the Shapley value), how they are applied to incentivize data sharing/collaborative learning in specific learning contexts (e.g., federated learning), and some future directions.
Paper:
Gradient Driven Rewards to Guarantee Fairness in Collaborative Machine Learning: Xinyi Xu, Lingjuan Lyu, Xingjun Ma, Chenglin Miao, Chuan Sheng Foo, Bryan Kian Hsiang Low, NeurIPS, 2021.
On the Convergence of the Shapley Value in Parametric Bayesian Learning Games: Lucas Agussurja, Xinyi Xu, Bryan Kian Hsiang Low, ICML, 2022.
Fair yet Asymptotically Equal Collaborative Learning: Xiaoqiang Lin, Xinyi Xu, See-Kiong Ng, Chuan-Sheng Foo, Bryan Kian Hsiang Low, ICML, 2023.
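Illustrative sketch: for a handful of agents, the Shapley value mentioned in the abstract can be computed exactly by enumerating coalitions. The Python toy below does this with a made-up valuation function (square root of pooled data size) standing in for the value of a coalition's pooled data; it is an editor's illustration, not the incentive mechanisms from the papers.

    from itertools import combinations
    from math import factorial

    def shapley_values(agents, value):
        # Exact Shapley value by enumerating all coalitions (exponential, fine for few agents):
        # phi_i = sum over S not containing i of |S|!(n-|S|-1)!/n! * (v(S u {i}) - v(S)).
        n = len(agents)
        phi = {a: 0.0 for a in agents}
        for a in agents:
            others = [b for b in agents if b != a]
            for r in range(len(others) + 1):
                for S in combinations(others, r):
                    weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                    phi[a] += weight * (value(set(S) | {a}) - value(set(S)))
        return phi

    # Hypothetical valuation: diminishing returns in the total amount of data contributed.
    data_sizes = {"A": 100, "B": 50, "C": 10}
    value = lambda S: sum(data_sizes[a] for a in S) ** 0.5
    print(shapley_values(list(data_sizes), value))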
FLOW Talk #113
February 21, 2024 @ 5pm Coordinated Universal Time (UTC)
How to Make Federated Learning Work with Challenging Client Participation Patterns?
host: Samuel Horvath
Abstract: A main challenge in many practical scenarios of federated learning is that the clients are only intermittently available to participate in learning. In this talk, I will present our recent results on understanding and overcoming this challenge. I will first explain the importance of aggregation weight adaptation and introduce a new algorithm that improves federated averaging (FedAvg) by adaptively weighting the client updates. The adaptation is based on online estimates of the optimal weights, where the statistics of client participation are heterogeneous and unknown a priori. Then, for the case with very infrequently participating clients, I will present an "amplification" mechanism that is applied to the model updates. The talk will cover both theoretical and empirical findings on this topic and also discuss further insights.
Paper:
A Lightweight Method for Tackling Unknown Participation Statistics in Federated Averaging: Shiqiang Wang, Mingyue Ji, ICLR, 2024.
A Unified Analysis of Federated Learning with Arbitrary Client Participation: Shiqiang Wang, Mingyue Ji, NeurIPS, 2022.
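Illustrative sketch: a minimal Python toy of the aggregation-weight-adaptation idea from the abstract above: the server keeps online estimates of how often each client actually participates and weights received updates inversely, so that rarely participating clients are not underrepresented in the global model. This is an editor's illustration, not the algorithm from the ICLR 2024 paper; the simple inverse-frequency rule is an assumption.

    import numpy as np

    rng = np.random.default_rng(7)
    n_clients, dim = 5, 4
    targets = [rng.normal(size=dim) for _ in range(n_clients)]
    participation_prob = np.array([0.9, 0.9, 0.9, 0.2, 0.2])   # heterogeneous, unknown to the server

    x = np.zeros(dim)
    counts = np.ones(n_clients)        # online participation counts (start at 1 to avoid division by zero)
    lr = 0.1
    for rnd in range(1, 201):
        active = [i for i in range(n_clients) if rng.random() < participation_prob[i]]
        if not active:
            continue
        counts[active] += 1
        # Estimated participation frequency; rarely seen clients get larger aggregation weights.
        freq = counts / (rnd + 1)
        weights = 1.0 / freq[active]
        weights /= weights.sum()
        updates = [-(x - targets[i]) for i in active]   # client i's pseudo-gradient direction
        x = x + lr * sum(w * u for w, u in zip(weights, updates))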
FLOW Talk #112
February 7, 2024 @ 1pm Coordinated Universal Time (UTC)
Variance Reduction for Byzantine-Robust Distributed Optimization
host: Samuel Horvath
Abstract: Byzantine robustness has been gaining a lot of attention due to the growing interest in collaborative and federated learning. To address this challenge, various Byzantine-robust mechanisms have been proposed in recent years. In this talk, I will focus on the approaches based on variance reduction and, in particular, present our recent work, in which variance reduction allows us to significantly improve on prior convergence guarantees.
Paper:
Variance Reduction is an Antidote to Byzantines: Better Rates, Weaker Assumptions and Communication Compression as a Cherry on the Top: Eduard Gorbunov, Samuel Horváth, Peter Richtárik, Gauthier Gidel, ICLR, 2023.
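Illustrative sketch: a minimal Python toy of the general recipe behind this line of work: honest clients maintain recursive variance-reduced gradient estimates, and the server combines everything it receives with a robust aggregation rule (here a coordinate-wise median, chosen only for simplicity) instead of the plain mean. This is an editor's illustration, not the method from the paper.

    import numpy as np

    rng = np.random.default_rng(8)
    dim, n_honest, n_byz = 6, 8, 3
    targets = [rng.normal(size=dim) for _ in range(n_honest)]   # honest client i minimizes 0.5*||x - t_i||^2

    def grad_pair(x, x_prev, target, noise=0.5):
        # The same stochastic sample is evaluated at both points, which is what
        # makes the difference below have small variance once x is close to x_prev.
        z = noise * rng.normal(size=dim)
        return (x - target) + z, (x_prev - target) + z

    x_prev = rng.normal(size=dim)
    x = x_prev.copy()
    estimates = [x - t for t in targets]                        # honest clients' initial estimates
    lr = 0.2
    for _ in range(50):
        new_estimates = []
        for g, t in zip(estimates, targets):
            gx, gprev = grad_pair(x, x_prev, t)
            new_estimates.append(g + (gx - gprev))              # recursive variance-reduced estimate
        estimates = new_estimates
        byzantine = [1e3 * rng.normal(size=dim) for _ in range(n_byz)]   # arbitrary malicious vectors
        received = np.array(estimates + byzantine)
        robust_direction = np.median(received, axis=0)          # coordinate-wise median aggregation
        x_prev = x.copy()
        x = x - lr * robust_direction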