(2020-2023) Coded Computing

Developing Efficient Coded Computing Schemes for Elastic Resources

Coded distributed computing [1]–[3], built upon algorithmic fault tolerance [4], is a recently emerging paradigm in which computation redundancy is employed to tackle the straggler effect, that is, the risk that one slow machine becomes a bottleneck for the whole system. As a toy example [1], to perform a matrix-vector multiplication Ax, a master machine first partitions the matrix A into two equal-sized submatrices A1 and A2 and then distributes A1, A2, and A1 + A2 to three worker machines, respectively. These workers also receive the vector x and perform the three multiplications A1x, A2x, and (A1 + A2)x in parallel. Clearly, Ax can be recovered by the master from the computation outcomes of any two of the three workers. Coded distributed computing has been shown to work not just for linear functions but also for polynomials and, more recently, for arbitrary differentiable non-linear computations.
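To make the toy example concrete, here is a minimal sketch in Python/NumPy (our own illustration, not code from [1]): the master encodes A into A1, A2, and A1 + A2, and can decode Ax from the results of any two of the three workers, so a single straggler causes no delay.

```python
# Minimal sketch of the toy example: one coded task tolerates one straggler.
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 10, size=(4, 3)).astype(float)
x = rng.integers(0, 10, size=3).astype(float)

# Master: split A into two equal-sized submatrices and add one coded copy.
A1, A2 = np.vsplit(A, 2)
tasks = {"w1": A1, "w2": A2, "w3": A1 + A2}

# Workers: each computes its share (in parallel in practice; sequential here).
results = {w: M @ x for w, M in tasks.items()}

# Master: suppose worker w2 straggles; decode from w1 and w3 alone.
y1, y3 = results["w1"], results["w3"]
y2 = y3 - y1                      # recover A2 x = (A1 + A2)x - A1 x
Ax = np.concatenate([y1, y2])

assert np.allclose(Ax, A @ x)
```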

The proposed research was motivated by a recent development in the cloud computing industry that offers customers large computing resources at a fraction of the cost of the normal on-demand service, albeit with low priority, in the sense that the inexpensive machines can be reclaimed on short notice (e.g., Amazon EC2 Spot and Microsoft Azure Batch). Most of the research in the coded distributed computing literature, however, assumes that the computing resources are fixed, i.e., the set of worker machines remains unchanged throughout the computation, and homogeneous, i.e., all machines have the same storage size and computation speed. The aim of this project is to develop novel coded computing schemes that work efficiently with elastic/flexible computing resources, thus broadening the applicability of the coded computing paradigm in practice. See [5] for our preliminary work on this topic.
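The link between coding and elasticity can be illustrated with a generic MDS-style scheme (a sketch of the general principle only, not the specific scheme of [5]): if n coded tasks are generated from k data blocks using an encoding matrix whose every k x k submatrix is invertible, then the results of any k available workers suffice, so machines may be preempted or added between rounds without restarting the computation. The Vandermonde encoding matrix below is one standard way to obtain this property over the reals.

```python
# Generic illustration: n coded tasks from k blocks; any k results decode.
import numpy as np

n, k = 5, 3                                   # 5 workers, any 3 suffice
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))               # 6 rows split into k blocks
x = rng.standard_normal(4)

blocks = np.stack(np.vsplit(A, k))            # shape (k, 2, 4)
# Real Vandermonde matrix: every k x k submatrix is invertible.
G = np.vander(np.arange(1, n + 1), k, increasing=True).astype(float)
coded = np.einsum("ij,jrc->irc", G, blocks)   # n coded row-blocks of A

# Workers compute their coded products; two of them get preempted.
results = {i: coded[i] @ x for i in range(n)}
alive = [0, 2, 4]                             # any k survivors will do

# Master inverts the surviving rows of G to recover the k block products.
Gk = G[alive]
Y = np.stack([results[i] for i in alive])     # shape (k, 2)
decoded = np.linalg.solve(Gk, Y)              # row j is blocks[j] @ x
Ax = decoded.reshape(-1)

assert np.allclose(Ax, A @ x)
```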


References

[1] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, 2018.

[2] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded MapReduce,” in Proceedings of the Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 964–971, 2015.

[3] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient coding: Avoiding stragglers in distributed learning,” in Proceedings of the International Conference on Machine Learning (ICML), pp. 3368–3376, 2017.

[4] K.-H. Huang and J. A. Abraham, “Algorithm-based fault tolerance for matrix operations,” IEEE Transactions on Computers, vol. C-33, no. 6, pp. 518–528, 1984.

[5] H. Dau, R. Gabrys, Y.-C. Huang, C. Feng, Q.-H. Luu, E. Alzahrani, and Z. Tari, “Optimizing the transition waste in coded elastic computing,” available online at https://arxiv.org/abs/1910.00796, 2019.


Why is this project suitable for a PhD student?

The problem of designing efficient redundancy for distributed computing has attracted a great deal of attention from the research community in the past few years. Different mathematical tools can be applied to push forward the theoretical side of this problem, while carefully developed algorithms can be deployed on real cloud computing platforms. This balanced mix of theory and practice makes the topic well suited to a PhD project.

Duration of the project: 3 years (2020-2023)

Contact: please email me at sonhoang.dau@rmit.edu.au with your CV, Bachelor’s/Master’s degree certificates, and transcripts.