Federated Learning One World Seminar

Archive of Talks: FLOW Season 2023

FLOW Talk #111

November 15, 2023 @ 1pm Coordinated Universal Time (UTC) 

The Future of Consumer Edge-AI Computing

[slides]

host: Samuel Horvath

Abstract: Deep Learning has proliferated dramatically across consumer devices in less than a decade, but has been largely powered through the hardware acceleration within isolated devices. Nonetheless, clear signals exist that the next decade of consumer intelligence will require levels of resources, a mixing of modalities and a collaboration of devices that will demand a significant pivot beyond hardware alone. To accomplish this, we believe a new Edge-AI paradigm will be necessary for this transition to happen in a sustainable manner, without compromising user privacy or hurting quality of experience.

Paper:  

FLOW Talk #110

October 4, 2023 @ 1pm Coordinated Universal Time (UTC) 

Jongho Park (KAUST)

DualFL: A duality-based federated learning algorithm with communication acceleration in the general convex regime

[slides]

host: Sebastian Stich

Abstract: In this talk, we propose a novel training algorithm called DualFL (Dualized Federated Learning) for solving a distributed optimization problem in federated learning. Our approach is based on a specific dual formulation of the federated learning problem. DualFL achieves communication acceleration under various smoothness and strong convexity settings of the problem. Moreover, it theoretically guarantees the use of inexact local solvers, preserving its optimal communication complexity even with inexact local solutions. DualFL is the first federated learning algorithm that achieves communication acceleration even when the cost function is either nonsmooth or non-strongly convex.

This is a joint work with Jinchao Xu.
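
For orientation, the duality at play can be sketched with the standard consensus reformulation of the federated learning objective; the precise dual formulation used by DualFL may differ in its details, so the display below is only the generic template.

    \min_{x\in\mathbb{R}^d} \sum_{i=1}^n f_i(x)
    \quad\Longleftrightarrow\quad
    \min_{x_1,\dots,x_n} \sum_{i=1}^n f_i(x_i) \quad \text{s.t.}\quad x_1=\cdots=x_n,

whose Fenchel dual (obtained from the Lagrangian of the consensus constraint) is

    \max_{y_1,\dots,y_n} \; -\sum_{i=1}^n f_i^*(y_i) \quad \text{s.t.}\quad \sum_{i=1}^n y_i = 0,

where f_i^* denotes the convex conjugate of the local objective f_i.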

FLOW Talk #109

August 30, 2023 @ 5pm Coordinated Universal Time (UTC) 

Qinbin Li (UC Berkeley)

FedTree: A Federated Learning System For Trees

[slides]

host: Sebastian Stich

Abstract: While the quality of machine learning services largely relies on the volume of training data, data regulations such as the General Data Protection Regulation (GDPR) impose stringent requirements on data transfer. Federated learning has emerged as a popular approach for enabling collaborative machine learning without sharing raw data. To facilitate the rapid development of federated learning, efficient and user-friendly federated learning systems are essential. While many existing federated learning systems are designed for deep learning, tree-based federated learning systems have not been well explored. This paper presents a tree-based federated learning system under a histogram-sharing scheme, named FedTree, that supports both horizontal and vertical federated training of GBDTs (gradient boosted decision trees) with configurable privacy protection techniques. Our extensive experiments show that FedTree achieves accuracy competitive with centralized training while incurring much less computational cost than other generic federated learning systems.
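
As a rough illustration of the histogram-sharing idea in the horizontal setting (a simplified sketch with illustrative names and a standard XGBoost-style split gain; FedTree's actual implementation and its privacy protection mechanisms are omitted): each client bins its own data into per-feature gradient/hessian histograms, the server sums the histograms, and the split is chosen from the merged statistics rather than from raw data.

    import numpy as np

    def local_histograms(features, grads, hessians, bin_edges):
        """One client: per-feature histograms of gradient/hessian sums."""
        n_bins = len(bin_edges) - 1
        g_hist = np.zeros((features.shape[1], n_bins))
        h_hist = np.zeros((features.shape[1], n_bins))
        for j in range(features.shape[1]):
            bins = np.clip(np.digitize(features[:, j], bin_edges) - 1, 0, n_bins - 1)
            np.add.at(g_hist[j], bins, grads)
            np.add.at(h_hist[j], bins, hessians)
        return g_hist, h_hist

    def best_split(g_hist, h_hist, lam=1.0):
        """Server: pick the split with the largest gain from the merged histograms."""
        def score(g, h):
            return g * g / (h + lam)
        best = (None, None, -np.inf)
        for j in range(g_hist.shape[0]):
            g_tot, h_tot = g_hist[j].sum(), h_hist[j].sum()
            g_left, h_left = np.cumsum(g_hist[j])[:-1], np.cumsum(h_hist[j])[:-1]
            gains = score(g_left, h_left) + score(g_tot - g_left, h_tot - h_left) - score(g_tot, h_tot)
            b = int(np.argmax(gains))
            if gains[b] > best[2]:
                best = (j, b, gains[b])
        return best  # (feature index, split bin, gain)

    # Server side: g_hist, h_hist = elementwise sums of the clients' local histograms,
    # then best_split(g_hist, h_hist) decides the next tree node.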

Paper:  

FLOW Talk #108

August 16, 2023 @ 1pm Coordinated Universal Time (UTC) 

Song Han (MIT)

On-Device Training under 256KB of Memory

[slides]

host: Dan Alistarh

Abstract: On-device training enables the model to adapt to new data collected from the sensors. Users can benefit from customized AI models without having to transfer the data to the cloud, preserving privacy. However, the training memory footprint is prohibitive for IoT devices. I’ll present “Tiny Transfer Learning” (NeurIPS’20) and “On-Device Learning under 256KB Memory” (NeurIPS’22) to address this issue. I’ll first analyze the memory bottleneck, showing that we should reduce the activations, not just the trainable parameters, for efficient on-device learning. I’ll then introduce Quantization-Aware Scaling (QAS) to calibrate the gradient scales and stabilize 8-bit quantized training, and “sparse update” to skip the gradient computation of less important layers and sub-tensors to save activation memory. The algorithm innovation is implemented by a lightweight training system, Tiny Training Engine, which prunes the backward computation graph to support sparse updates and offloads runtime auto-differentiation to compile time. Deployed on an STM32H746 microcontroller, our framework uses less than 1/1000 of the training memory of TensorFlow and PyTorch while matching the accuracy. Our study enables IoT devices to not only perform inference but also continuously adapt to new data for on-device lifelong learning.
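
A minimal PyTorch-flavored sketch of the sparse-update idea: freeze most weights and update only the biases plus the last layer, which cuts the activation memory needed for weight gradients. The layer selection here is hand-picked for illustration; in the talk the layers/sub-tensors to update are chosen more carefully, and Tiny Training Engine realizes the memory savings by pruning the backward graph at compile time rather than via autograd flags.

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, 10),
    )

    # Sparse update: gradients only for biases and the final classifier ("6." = last Linear).
    for name, p in model.named_parameters():
        p.requires_grad = name.endswith("bias") or name.startswith("6.")

    optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.01)

    x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()   # bias gradients need only output gradients, not the stored input activations
    optimizer.step()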

FLOW Talk #107

August 2, 2023 @ 5pm Coordinated Universal Time (UTC)

Boxin Wang (University of Illinois, Urbana-Champaign)

Can Public Large Language Models Help Private Cross-device Federated Learning?

[slides]

host: Virginia Smith

Abstract: We study (differentially) private federated learning (FL) of language models. The language models in cross-device FL are relatively small, which means they can be trained with meaningful formal user-level differential privacy (DP) guarantees when massive parallelism in training is enabled by the participation of a moderate number of users. Recently, public data has been used to improve privacy-utility trade-offs for both large and small language models. In this talk, we will cover our systematic study of using large-scale public data and LLMs to help differentially private training of on-device FL models, and further improve the privacy-utility trade-off through distillation techniques. Moreover, we propose a novel distribution matching algorithm with theoretical grounding to sample public data close to the private data distribution, which significantly improves the sample efficiency of (pre-)training on public data. The proposed method is efficient and effective for training private models by taking advantage of public data, especially for customized on-device architectures that do not have ready-to-use pre-trained models.
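
A heavily simplified sketch of the distribution-matching step: summarize the private data with a privatized embedding statistic, then rank public examples by similarity to that summary and keep the closest ones for (pre-)training. The encoder, the clipped-mean-plus-Gaussian-noise summary, and the cosine scoring below are illustrative assumptions, not the exact procedure from the talk.

    import numpy as np

    def private_summary(private_embs, clip=1.0, noise_mult=1.0, rng=np.random.default_rng(0)):
        """Clipped mean of private embeddings plus Gaussian noise (DP-style summary)."""
        norms = np.linalg.norm(private_embs, axis=1, keepdims=True)
        clipped = private_embs * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
        mean = clipped.mean(axis=0)
        return mean + rng.normal(0.0, noise_mult * clip / len(private_embs), size=mean.shape)

    def select_public_subset(public_embs, summary, k):
        """Rank public examples by cosine similarity to the private summary; keep the top-k."""
        p = public_embs / np.linalg.norm(public_embs, axis=1, keepdims=True)
        s = summary / np.linalg.norm(summary)
        return np.argsort(-(p @ s))[:k]

    # Usage: embed both corpora with any frozen encoder, compute the private summary once,
    # then pre-train (or distill) on the selected public subset before private FL fine-tuning.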

Paper:  

FLOW Talk #106

July 12, 2023 @ 1pm Coordinated Universal Time (UTC)

Kevin Kuo (Carnegie Mellon University)

On Noisy Evaluation in Federated Hyperparameter Tuning

host: Virginia Smith

Abstract: Hyperparameter tuning is critical to the success of federated learning applications. Unfortunately, appropriately selecting hyperparameters is challenging in federated networks, as issues of scale, privacy, and heterogeneity introduce noise in the tuning process and make it difficult to faithfully evaluate the performance of various hyperparameters. In this work we perform the first systematic study on the effect of noisy evaluation in federated hyperparameter tuning. We first identify and rigorously explore key sources of noise, including client subsampling, data and systems heterogeneity, and data privacy. Surprisingly, our results indicate that even small amounts of noise can have a significant impact on tuning methods—reducing the performance of state-of-the-art approaches to that of naive baselines. To address noisy evaluation in such scenarios, we propose a simple and effective approach that leverages public proxy data to boost evaluation signal. Our work establishes general challenges, baselines, and best practices for future work in federated hyperparameter tuning.

Paper:  

FLOW Talk #105

June 7, 2023 @ 1pm Coordinated Universal Time (UTC)

On the 5th Generation of Local Training Methods in Federated Learning

[slides]

host: Samuel Horváth

Abstract: I will outline the history of the theoretical development of the local training “trick” employed in virtually all successful federated learning algorithms. In particular, I will identify five distinct generations of methods and results: 1) heuristic, 2) homogeneous, 3) sublinear, 4) linear and 5) accelerated. The 5th generation, initiated by the ProxSkip algorithm of Mishchenko et al. (ICML 2022), finally led to the proof that local training, if carefully executed, leads to provable acceleration of communication complexity, without requiring any data homogeneity assumptions. Because these latest advances are very new, there are many opportunities to develop the 5th generation of local training methods further. I will give a brief overview of what we know now, and what problems still remain open.
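
A small numerical sketch of the ProxSkip/Scaffnew mechanism behind the 5th generation: clients take control-variate-corrected local gradient steps and synchronize (average) only with probability p, so communication happens on a small fraction of iterations. The quadratic local objectives and the step-size choices below are purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, gamma, p, T = 10, 5, 0.1, 0.2, 2000

    b = rng.normal(size=(n, d))   # client i: f_i(x) = 0.5 * ||x - b_i||^2
    x = np.zeros((n, d))          # local iterates
    h = np.zeros((n, d))          # control variates (sum to zero, and stay that way)

    for t in range(T):
        grad = x - b                           # local gradients of f_i at x_i
        x_hat = x - gamma * (grad - h)         # corrected local step
        if rng.random() < p:                   # communicate only with probability p
            x = np.tile(x_hat.mean(axis=0), (n, 1))
        else:
            x = x_hat
        h = h + (p / gamma) * (x - x_hat)      # control-variate update

    print("distance to optimum:", np.linalg.norm(x[0] - b.mean(axis=0)))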

Papers:  

FLOW Talk #104

May 31, 2023 @ 1pm Coordinated Universal Time (UTC)

Dan Alistarh (IST Austria)

Federated Averaging Made Asynchronous and Communication-Efficient

[slides]

host: Samuel Horváth

Abstract: In this work, we take steps towards addressing two of the main practical challenges when scaling federated optimization to large node counts: the need for tight synchronization between the central authority and individual computing nodes, and the large communication cost of transmissions between the central server and clients. Specifically, we present a new variant of the classic federated averaging (FedAvg) algorithm, which supports both asynchronous communication and communication compression. We provide a new analysis technique showing that, in spite of these system relaxations, our algorithm can provide convergence similar to FedAvg in some parameter regimes. Experimental results on the LEAF benchmark with setups of up to 300 nodes show that our algorithm ensures fast convergence for standard federated tasks, improving upon prior quantized and asynchronous approaches.
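
An illustrative skeleton of the two system relaxations mentioned in the abstract: clients ship compressed (here top-k) model deltas whenever they finish local work, and the server folds each one in immediately with a staleness-discounted weight. This is only a sketch of the moving parts under assumed names and rules, not the algorithm or the analysis from the talk.

    import numpy as np

    def top_k(delta, k):
        """Keep the k largest-magnitude coordinates of the update; zero the rest."""
        out = np.zeros_like(delta)
        idx = np.argpartition(np.abs(delta), -k)[-k:]
        out[idx] = delta[idx]
        return out

    class AsyncServer:
        def __init__(self, d, lr=1.0):
            self.x = np.zeros(d)     # global model
            self.version = 0         # bumped on every applied update
            self.lr = lr

        def broadcast(self):
            return self.x.copy(), self.version

        def receive(self, compressed_delta, client_version):
            staleness = self.version - client_version
            self.x += (self.lr / (1.0 + staleness)) * compressed_delta  # discount stale work
            self.version += 1

    # One simulated client interaction (run_local_sgd is a placeholder for local training):
    # x0, v = server.broadcast()
    # delta = run_local_sgd(x0) - x0
    # server.receive(top_k(delta, k=len(delta) // 100), v)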

Paper:  

FLOW Talk #103

May 24, 2023 @ 5pm Coordinated Universal Time (UTC)

Jianyu Wang (Apple)

On the Unreasonable Effectiveness of Federated Averaging with Heterogeneous Data

[slides]

host: Samuel Horváth

Abstract: Existing theory predicts that data heterogeneity will degrade the performance of the Federated Averaging (FedAvg) algorithm in federated learning. However, in practice, the simple FedAvg algorithm converges very well. In order to explain the seemingly unreasonable effectiveness of FedAvg that contradicts the previous theoretical predictions, this paper introduces the client consensus hypothesis: on some federated datasets, the average of client model updates starting from the optimum is very small and close to zero. We prove that under the client consensus hypothesis, data heterogeneity can have no negative impact on the convergence of FedAvg. Moreover, we show that the client consensus hypothesis holds on a simple quadratic problem and on many naturally heterogeneous datasets (such as FEMNIST and StackOverflow). Therefore, the hypothesis is realistic and can lead to a better understanding of the empirical success of FedAvg.
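
The client consensus hypothesis is easy to probe numerically: start every client at the global optimum, run a few local gradient steps, and compare the norm of the averaged update to the typical size of an individual client update. The heterogeneous quadratics below are a toy setup, not the experiments from the talk.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, local_steps, lr = 20, 10, 5, 0.1

    # f_i(x) = 0.5 * (x - b_i)^T A_i (x - b_i), heterogeneous across clients
    A = [np.diag(rng.uniform(0.5, 2.0, d)) for _ in range(n)]
    b = rng.normal(size=(n, d))
    x_star = np.linalg.solve(sum(A), sum(A[i] @ b[i] for i in range(n)))  # global optimum

    updates = []
    for i in range(n):
        x = x_star.copy()
        for _ in range(local_steps):
            x -= lr * A[i] @ (x - b[i])   # local gradient step
        updates.append(x - x_star)        # client update starting from the optimum

    print("norm of averaged update:", np.linalg.norm(np.mean(updates, axis=0)))
    print("mean norm of client updates:", np.mean([np.linalg.norm(u) for u in updates]))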

Papers:  

FLOW Talk #102

April 19, 2023 @ 1pm Coordinated Universal Time (UTC)

Samuel Maddock (University of Warwick)

CANIFE: Crafting Canaries for Empirical Privacy Measurement in Federated Learning

host: Aurélien Bellet

Abstract: Federated Learning (FL) is a setting for training machine learning models in distributed environments where the clients do not share their raw data but instead send model updates to a server. However, model updates can be subject to attacks and leak private information. Differential Privacy (DP) is a leading mitigation strategy which involves adding noise to clipped model updates, trading off performance for strong theoretical privacy guarantees. Previous work has shown that the threat model of DP is conservative and that the obtained guarantees may be vacuous or may overestimate information leakage in practice. In this paper, we aim to achieve a tighter measurement of the model exposure by considering a realistic threat model. We propose a novel method, CANIFE, that uses canaries, samples carefully crafted by a strong adversary, to evaluate the empirical privacy of a training round. We apply this attack to vision models trained on CIFAR-10 and CelebA and to language models trained on Sent140 and Shakespeare. In particular, in realistic FL scenarios, we demonstrate that the empirical per-round epsilon obtained with CANIFE is 4-5x lower than the theoretical bound.
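
In very rough outline, canary-based measurement compares a test statistic between rounds that do and do not contain the canary, and converts the separation of the two empirical distributions into a per-round privacy estimate. The sketch below uses a random canary and a simple projection statistic on a DP-FedAvg-style aggregate; CANIFE's adversarial canary crafting and its calibration of the empirical epsilon are considerably more involved.

    import numpy as np

    rng = np.random.default_rng(0)

    def dp_aggregate(updates, clip, sigma):
        """Clipped sum of client updates plus Gaussian noise (DP-FedAvg style)."""
        clipped = [u * min(1.0, clip / np.linalg.norm(u)) for u in updates]
        return np.sum(clipped, axis=0) + rng.normal(0.0, sigma * clip, size=updates[0].shape)

    def statistic(aggregate, canary):
        return float(aggregate @ canary)   # projection onto the (unit-norm) canary direction

    d, clip, sigma, trials = 1000, 1.0, 0.5, 200
    honest = [rng.normal(size=d) for _ in range(49)]
    canary = rng.normal(size=d); canary /= np.linalg.norm(canary)

    with_canary = [statistic(dp_aggregate(honest + [canary], clip, sigma), canary) for _ in range(trials)]
    without = [statistic(dp_aggregate(honest, clip, sigma), canary) for _ in range(trials)]
    # Comparing these two empirical distributions (e.g., via the best membership-inference
    # threshold) gives an empirical estimate of the round's privacy leakage.
    print(np.mean(with_canary), np.mean(without))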

Paper:  

FLOW Talk #101

April 12, 2023 @ 1pm Coordinated Universal Time (UTC)

FLECS: A Federated Learning Second-Order Framework via Compression and Sketching

host: Samuel Horváth

Abstract: Inspired by the recent work FedNL (Safaryan et al, FedNL: Making Newton-Type Methods Applicable to Federated Learning), we propose a new communication-efficient second-order framework for federated learning, namely FLECS. The proposed method reduces the high memory requirements of FedNL by using an L-SR1-type update for the Hessian approximation, which is stored on the central server. A low-dimensional 'sketch' of the Hessian is all that is needed by each device to generate an update, so that both the memory cost and the number of Hessian-vector products for the agent are low. Biased and unbiased compressions are utilized to make communication costs also low. Convergence guarantees for FLECS are provided in both the strongly convex and nonconvex cases, and local linear convergence is also established under strong convexity. Numerical experiments confirm the practical benefits of the new FLECS algorithm.

Paper:  

FLOW Talk #100

March 29, 2023 @ 4pm Coordinated Universal Time (UTC)

Amrita Roy Chowdhury (UC San Diego)

EIFFeL: Ensuring Integrity for Federated Learning

[slides]

host: Aurélien Bellet

Abstract: Federated learning (FL) enables clients to collaborate with a server to train a machine learning model. To ensure privacy, the server performs secure aggregation of model updates from the clients. Unfortunately, this prevents verification of the well-formedness (integrity) of the updates as the updates are masked. Consequently, malformed updates designed to poison the model can be injected without detection. In this talk, I will formalize the problem of ensuring both update privacy and integrity in FL and present a new system, EIFFeL, that enables secure aggregation of verified updates. EIFFeL is a general framework that can enforce arbitrary integrity checks and remove malformed updates from the aggregate, without violating privacy. Further, EIFFeL is practical for real-world usage. For instance, with 100 clients and 10% poisoning, EIFFeL can train an MNIST classification model to the same accuracy as that of a non-poisoned federated learner in just 2.4s per iteration.

Paper:  

FLOW Talk #99

March 22, 2023 @ 5pm Coordinated Universal Time (UTC)

Berivan Isik (Stanford)

Sparse Random Networks for Communication-Efficient Federated Learning

host: Peter Richtárik

Abstract: One main challenge in federated learning is the large communication cost of exchanging weight updates from clients to the server at each round. While prior work has made great progress in compressing the weight updates through gradient compression methods, we propose a radically different approach that does not update the weights at all. Instead, our method freezes the weights at their initial random values and learns how to sparsify the random network for the best performance. To this end, the clients collaborate in training a stochastic binary mask to find the optimal sparse random network within the original one. At the end of the training, the final model is a sparse network with random weights – or a sub-network inside the dense random network. We show improvements in accuracy, communication (less than 1 bit per parameter (bpp)), convergence speed, and final model size (less than 1 bpp) over relevant baselines on MNIST, EMNIST, CIFAR-10, and CIFAR-100 datasets, in the low bitrate regime.
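
A compact PyTorch sketch of the core primitive: the weights stay frozen at their random initialization and the trainable object is a per-weight score defining a stochastic binary mask, trained with a straight-through estimator. This is a single-layer illustration only; the federated protocol from the talk, in which clients communicate the masks rather than weights at well below the usual cost, is not shown.

    import torch
    import torch.nn as nn

    class MaskedLinear(nn.Module):
        def __init__(self, d_in, d_out):
            super().__init__()
            w = torch.randn(d_out, d_in) / d_in ** 0.5
            self.weight = nn.Parameter(w, requires_grad=False)    # frozen random weights
            self.scores = nn.Parameter(torch.zeros(d_out, d_in))  # trainable mask logits

        def forward(self, x):
            probs = torch.sigmoid(self.scores)
            hard = torch.bernoulli(probs.detach())      # sampled binary mask
            mask = hard + probs - probs.detach()        # straight-through estimator
            return nn.functional.linear(x, self.weight * mask)

    layer = MaskedLinear(784, 10)
    opt = torch.optim.SGD([layer.scores], lr=0.1)
    x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
    loss = nn.functional.cross_entropy(layer(x), y)
    loss.backward()   # gradients reach the scores only; the random weights never change
    opt.step()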

Paper:  

FLOW Talk #98

March 15, 2023 @ 5pm Coordinated Universal Time (UTC)

Where to Begin? On the Impact of Pre-Training and Initialization in Federated Learning

[slides]

host: Samuel Horváth

Abstract: An oft-cited challenge of federated learning (FL) is the presence of heterogeneity. The data at different clients may follow very different distributions, giving rise to data heterogeneity, and client devices may have very different capabilities (compute, memory, network bandwidth), giving rise to system heterogeneity. The predominant training paradigm is local-update methods such as Federated Averaging, and several modifications have been proposed to address these sources of heterogeneity. Empirical evaluations in these studies usually start federated training from a random initialization. However, in many practical applications of FL, the server may have access to some proxy data for the task that can be used to pre-train a model before starting federated training. We empirically study the impact of starting from a pre-trained model in FL. Unsurprisingly, starting from a pre-trained model reduces the training time required to reach a target error rate and enables the training of more accurate models than is possible when starting from a random initialization. Surprisingly, we also find that starting federated learning from a pre-trained initialization reduces the effect of both data and system heterogeneity. This study raises several questions for further work on understanding the role of heterogeneity and initialization in federated training.

Paper:  

FLOW Talk #97

March 8, 2023 @ 5pm Coordinated Universal Time (UTC)

Yaodong Yu (UC Berkeley)

TCT: Convexifying Federated Learning using Bootstrapped Neural Tangent Kernels

host: Sebastian Stich

Abstract: State-of-the-art federated learning methods can perform far worse than their centralized counterparts when clients have dissimilar data distributions. For neural networks, even when centralized SGD easily finds a solution that is simultaneously performant for all clients, current federated optimization methods fail to converge to a comparable solution. We show that this performance disparity can largely be attributed to optimization challenges presented by nonconvexity. Specifically, we find that the early layers of the network do learn useful features, but the final layers fail to make use of them. That is, federated optimization applied to this non-convex problem distorts the learning of the final layers. Leveraging this observation, we propose a Train-Convexify-Train (TCT) procedure to sidestep this issue: first, learn features using off-the-shelf methods (e.g., FedAvg); then, optimize a convexified problem obtained from the network's empirical neural tangent kernel approximation. Our technique yields accuracy improvements of up to +36% on FMNIST and +37% on CIFAR10 when clients have dissimilar data.

(Joint work with Alexander Wei, Sai Praneeth Karimireddy, Yi Ma, and Michael I. Jordan)
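
The "convexify" step rests on the standard empirical neural tangent kernel linearization: after the feature-learning stage with parameters θ₀, the network is replaced by its first-order Taylor expansion, which is linear (hence convex) in the parameters. Written out below (the paper's concrete construction of the eNTK features, e.g. any subsampling, is omitted):

    f(x;\theta) \;\approx\; f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top}(\theta - \theta_0),

so the second stage amounts to solving a convex problem over the features \nabla_\theta f(x;\theta_0), which federated optimizers handle well even under client heterogeneity.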

Paper:  

FLOW Talk #96

March 1, 2023 @ 5pm Coordinated Universal Time (UTC)

Federated Automatic Differentiation

[slides]

host: Peter Richtárik

Abstract: Federated learning (FL) is a general framework for learning across heterogeneous clients while preserving data privacy, under the orchestration of a central server. FL methods often compute gradients of loss functions purely locally (i.e., entirely at each client or entirely at the server), typically using automatic differentiation (AD) techniques. We propose a federated automatic differentiation (FAD) framework that 1) enables computing derivatives of functions involving client and server computation as well as communication between them and 2) operates in a manner compatible with existing federated technology. In other words, FAD computes derivatives across communication boundaries. We show, in analogy with traditional AD, that FAD may be implemented using various accumulation modes, which introduce distinct computation-communication trade-offs and systems requirements. Further, we show that a broad class of federated computations is closed under these various modes of FAD, implying in particular that if the original computation can be implemented using privacy-preserving primitives, its derivative may be computed using only these same primitives. We then show how FAD can be used to create algorithms that dynamically learn components of the algorithm itself. In particular, we show that FedAvg-style algorithms can exhibit significantly improved performance by using FAD to adjust the server optimization step automatically, or by using FAD to learn weighting schemes for computing weighted averages across clients.

Paper:  

FLOW Talk #95

February 22, 2023 @ 1pm Coordinated Universal Time (UTC)

Chandra Thapa (CSIRO Data61)

Combining federated learning and split learning, and a distributed machine learning framework with strict access control techniques for privacy and security

host: Samuel Horváth

Abstract: Federated learning (FL) and split learning (SL) provide default data privacy by following a model-to-data scenario; clients train and test machine learning models without sharing raw data. For faster model training in a resource-constrained environment with several clients, FL and SL need to be blended to leverage their advantages jointly. In this regard, we present splitfed learning (SFL). Moreover, we further discuss the comparative training performance of FL, SL and SFL under real-world device settings, e.g., Raspberry Pi. FL, SL and SFL are suitable for model development when data are highly sensitive, illegal to possess, or psychologically harmful; however, additional measures are required within the machine learning framework, including strict control, monitoring, and examination of all activities involved: communication, execution, and the release of algorithms, datasets, outputs, and results. Thus, we present a new multi-zoned framework called MaLFraDA. MaLFraDA has soft air gaps between its zones to isolate and control communication in and out of the framework.
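
A minimal PyTorch sketch of the split-learning round that SFL builds on: the client runs the model up to the cut layer and sends the activations ("smashed data"); the server finishes the forward pass, backpropagates, and returns the gradient at the cut; the client then completes its own backward pass. The tiny model and the label handling are illustrative; in splitfed learning the client-side parts of many clients are additionally aggregated FedAvg-style (not shown).

    import torch
    import torch.nn as nn

    client_net = nn.Sequential(nn.Linear(784, 256), nn.ReLU())   # client-side part (up to the cut)
    server_net = nn.Sequential(nn.Linear(256, 10))               # server-side part
    c_opt = torch.optim.SGD(client_net.parameters(), lr=0.1)
    s_opt = torch.optim.SGD(server_net.parameters(), lr=0.1)

    x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))

    # Client: forward to the cut layer; only the activations travel to the server.
    smashed = client_net(x)
    sent = smashed.detach().requires_grad_(True)

    # Server: finish the forward pass, backpropagate, return the gradient at the cut.
    loss = nn.functional.cross_entropy(server_net(sent), y)
    s_opt.zero_grad(); loss.backward(); s_opt.step()
    grad_at_cut = sent.grad

    # Client: complete the backward pass with the received gradient.
    c_opt.zero_grad(); smashed.backward(grad_at_cut); c_opt.step()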

Paper:  

FLOW Talk #94

February 15, 2023 @ 1pm Coordinated Universal Time (UTC)

Convergence of First-Order Algorithms for Meta-Learning with Moreau Envelopes

host: Samuel Horváth

Abstract: In this work, we consider the problem of minimizing the sum of Moreau envelopes of given functions, which has previously appeared in the context of meta-learning and personalized federated learning. In contrast to the existing theory that requires running subsolvers until a certain precision is reached, we only assume that a finite number of gradient steps is taken at each iteration. As a special case, our theory allows us to show the convergence of First-Order Model-Agnostic Meta-Learning (FO-MAML) to the vicinity of a solution of the Moreau objective. We also study a more general family of first-order algorithms that can be viewed as a generalization of FO-MAML. Our main theoretical contribution is an improvement upon the inexact SGD framework. In particular, our perturbed-iterate analysis allows for tighter guarantees that improve the dependency on the problem's conditioning. In contrast to the related work on meta-learning, ours does not require any assumptions on Hessian smoothness and can leverage the smoothness and convexity of the reformulation based on Moreau envelopes. Furthermore, to fill the gaps in the comparison of FO-MAML to Implicit MAML (iMAML), we show that the objective of iMAML is neither smooth nor convex, implying that it has no convergence guarantees based on the existing theory.
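
For reference, the objective in question and the standard Moreau-envelope facts it relies on (textbook definitions, stated with an illustrative smoothing parameter λ > 0):

    \min_{x\in\mathbb{R}^d} \; F(x) := \frac{1}{n}\sum_{i=1}^{n} M_{f_i}^{\lambda}(x),
    \qquad
    M_{f}^{\lambda}(x) := \min_{z}\Big\{ f(z) + \tfrac{1}{2\lambda}\,\|z - x\|^{2} \Big\},

and, for convex f,

    \nabla M_{f}^{\lambda}(x) = \tfrac{1}{\lambda}\big(x - \mathrm{prox}_{\lambda f}(x)\big),

so FO-MAML-style methods can be read as replacing the exact proximal point \mathrm{prox}_{\lambda f_i}(x) by a finite number of gradient steps on f_i started from x, which is precisely the inexactness the analysis accounts for.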

Paper:  

FLOW Talk #93

February 8, 2023 @ 1pm Coordinated Universal Time (UTC)

Maxime Vono (Criteo AI)

FedPop: A Bayesian Approach for Personalised Federated Learning

host: Aurélien Bellet

Abstract: Personalised federated learning (FL) aims at collaboratively learning a machine learning model tailored for each client. Albeit promising advances have been made in this direction, most existing approaches do not allow for uncertainty quantification, which is crucial in many applications. In addition, personalisation in the cross-device setting still involves important issues, especially for new clients or those having a small number of observations. This paper aims at filling these gaps. To this end, we propose a novel methodology coined FedPop by recasting personalised FL into the population modeling paradigm, where clients' models involve fixed common population parameters and random effects, aiming at explaining data heterogeneity. To derive convergence guarantees for our scheme, we introduce a new class of federated stochastic optimisation algorithms which relies on Markov chain Monte Carlo methods. Compared to existing personalised FL methods, the proposed methodology has important benefits: it is robust to client drift, practical for inference on new clients, and above all, enables uncertainty quantification under mild computational and memory overheads. We provide non-asymptotic convergence guarantees for the proposed algorithms and illustrate their performance on various personalised federated learning tasks.
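
Schematically, the population-model recasting can be pictured as a hierarchical model of the following form; this is an illustrative instance, and the exact likelihoods, priors, and parameterization used in FedPop may differ.

    \theta_i = \varphi + z_i, \qquad z_i \sim \mathcal{N}(0, \Sigma), \qquad D_i \sim p(\,\cdot \mid \theta_i), \quad i = 1,\dots,n,

where φ collects the fixed, shared population parameters, z_i is the client-specific random effect accounting for heterogeneity (with Σ an illustrative prior covariance), and federated MCMC over (φ, z_1, …, z_n) yields both personalized models and the uncertainty estimates mentioned in the abstract.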

Paper:  

FLOW Talk #92

February 1, 2023 @ 1pm Coordinated Universal Time (UTC)

Ashkan Panahi & Firooz Shahriari-Mehr (Chalmers University of Technology)

Decentralized Constrained Optimization, Double Averaging and Gradient Projection

host: Samuel Horváth

Abstract: We consider a generic decentralized constrained optimization problem over static, directed communication networks, where each agent has exclusive access to only one convex, differentiable, local objective term and one convex constraint set. For this setup, we propose a novel decentralized algorithm, called DAGP (Double Averaging and Gradient Projection), based on local gradients, projection onto local constraints, and local averaging. We achieve global optimality through a novel distributed tracking technique we call distributed null projection. Further, we show that DAGP can also be used to solve unconstrained problems with non-differentiable objective terms, by employing the so-called epigraph projection operators (EPOs). In this regard, we introduce a new fast algorithm for evaluating EPOs. We study the convergence of DAGP and establish O(1/√K) convergence in terms of feasibility, consensus, and optimality. For this reason, we forgo the difficulties of selecting Lyapunov functions by proposing a new methodology of convergence analysis for optimization problems, which we refer to as aggregate lower-bounding. To demonstrate the generality of this method, we also provide an alternative convergence proof for the gradient descent algorithm for smooth functions. Finally, we present numerical results demonstrating the effectiveness of our proposed method in both constrained and unconstrained problems.

Paper:  

FLOW Talk #91

January 18, 2023 @ 5pm Coordinated Universal Time (UTC)

Leveraging Spatial and Temporal Correlations in Distributed Learning

host: Samuel Horváth

Abstract: Distributed mean estimation is a central component of federated learning. In this talk, I will present work on the problem of estimating, at a central server, the mean of a set of vectors distributed across several nodes (one vector per node). When the vectors are high-dimensional, the communication cost of sending entire vectors may be prohibitive, and it may be imperative for the nodes to use sparsification techniques. While most existing work on sparsified mean estimation is agnostic to the characteristics of the data vectors, there may be spatial correlations (similarities in the vectors sent by different nodes) or temporal correlations (similarities in the data sent by a single node over different iterations of the algorithm) in the data vectors. We leverage these correlations by simply modifying the decoding method used by the server to estimate the mean. We provide an analysis of the resulting estimation error as well as experiments to show that our estimators consistently outperform more sophisticated and expensive sparsification methods.
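
A toy sketch of the decoding-side idea for temporal correlations: each node sends a rand-k sparsified vector, and instead of treating the missing coordinates as zero, the server fills them in with its previous-round estimate. The encoder, the fill-in rule, and the drift model below are illustrative; the estimators and error analysis in the talk are more refined.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, k, T = 100, 20, 10, 50

    def rand_k(v, k):
        """Report k random coordinates; scaling by d/k keeps the naive estimator unbiased."""
        idx = rng.choice(len(v), size=k, replace=False)
        scaled = np.zeros_like(v); scaled[idx] = v[idx] * (len(v) / k)
        return scaled, idx, v[idx]

    prev = np.zeros(d)                   # server memory: last round's mean estimate
    drift = 0.01 * rng.normal(size=d)    # slowly moving mean -> temporal correlation

    for t in range(T):
        vectors = [t * drift + 0.1 * rng.normal(size=d) for _ in range(n)]
        true_mean = np.mean(vectors, axis=0)

        naive, corrected = np.zeros(d), np.zeros(d)
        for v in vectors:
            scaled, idx, vals = rand_k(v, k)
            naive += scaled / n                         # missing coordinates treated as zero
            filled = prev.copy(); filled[idx] = vals    # missing coordinates <- last estimate
            corrected += filled / n

        prev = corrected

    print("naive error:", np.linalg.norm(naive - true_mean),
          " correlation-aware error:", np.linalg.norm(corrected - true_mean))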

Paper: