ALADDIN: Asymmetric Centralized Training for Distributed Deep Learning

Abstract

As the sizes of deep neural network (DNN) models and training data grow, training such models requires ever longer time and more massive computational resources. To speed up training, distributed training has been widely studied and adopted in recent years. In general, centralized training, one type of distributed training, suffers from a communication bottleneck between the parameter server (PS) and the workers. Decentralized training, on the other hand, avoids this bottleneck via peer-to-peer communication, but may suffer from increased parameter variance among workers, which in turn slows model convergence. Addressing this dilemma, we propose ALADDIN, a novel centralized training algorithm that employs "asymmetric" communication between the PS and workers to alleviate the PS bottleneck, together with novel update strategies for both local and global parameters to mitigate the increased-variance problem. Through a convergence analysis, we show that the convergence rate of ALADDIN on non-convex problems is comparable to those of existing state-of-the-art algorithms. Our empirical evaluation with ResNet-50 and VGG-16 demonstrates that (1) ALADDIN achieves significantly better training throughput, with up to 191% and 34% improvement over a synchronous algorithm and the state-of-the-art decentralized algorithm, respectively, (2) models trained by ALADDIN converge to accuracies comparable to those of the synchronous algorithm within the shortest time, and (3) the convergence of ALADDIN is the most robust under various heterogeneous environments.
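
To make the contrast concrete, below is a minimal, illustrative Python sketch, not the ALADDIN algorithm itself, that compares a synchronous PS round (the PS waits for all workers before updating) with a generic "asymmetric" round in which each worker pushes its local update and pulls the current global parameters independently, without a global barrier. All names and constants (toy_gradient, NUM_WORKERS, LR, the 0.5 merge weight) are hypothetical placeholders; the actual local and global update rules of ALADDIN are defined in the paper.

    import random

    NUM_WORKERS = 4
    LR = 0.1

    def toy_gradient(w, worker_id):
        # Stand-in for a local stochastic gradient computed on worker `worker_id`.
        return 2.0 * w + random.uniform(-0.1, 0.1)

    def synchronous_round(global_w):
        # PS waits for gradients from ALL workers, then applies one global update.
        grads = [toy_gradient(global_w, i) for i in range(NUM_WORKERS)]
        return global_w - LR * sum(grads) / NUM_WORKERS

    def asymmetric_round(global_w, local_ws):
        # Each worker takes a local step and pushes it to the PS as soon as it is
        # ready; the PS folds each push into the global parameters immediately,
        # and the worker pulls the fresh global value without waiting for others.
        for i in random.sample(range(NUM_WORKERS), NUM_WORKERS):  # arbitrary arrival order
            local_ws[i] -= LR * toy_gradient(local_ws[i], i)      # local update
            global_w = 0.5 * global_w + 0.5 * local_ws[i]         # PS merges the push
            local_ws[i] = global_w                                # worker pulls latest
        return global_w, local_ws

    if __name__ == "__main__":
        w_sync = w_async = 5.0
        locals_ = [w_async] * NUM_WORKERS
        for _ in range(20):
            w_sync = synchronous_round(w_sync)
            w_async, locals_ = asymmetric_round(w_async, locals_)
        print(f"synchronous PS: w = {w_sync:.4f}")
        print(f"asymmetric PS : w = {w_async:.4f}")

The key point of the sketch is structural: in the asymmetric scheme no worker blocks on the slowest peer, which is what relieves the PS bottleneck, while the PS-side merge and worker-side pull are the places where ALADDIN's actual global and local update strategies would plug in.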

The proofs of Lemmas 1 and 2 and Theorem 1 are provided in the accompanying Lemma1, Lemma2, and Theorem1 tabs.