ML scale-up is frequently underestimated. What does it really take to train an ML model, originally implemented for a single CPU/GPU, on multiple machines? A few pain points are: (1) many new lines of code must be written to convert the program to a distributed version (see the sketch below); (2) the code must be heavily tuned to reach satisfactory system and statistical performance, which adds an extra step to model development; (3) one must decide which, and how many, hardware resources to use to train and deploy the model; (4) from an organization's perspective, resource sharing among many users and jobs must be automated to satisfy user needs while maximizing resource utilization and minimizing cost.
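To make pain point (1) concrete, below is a minimal sketch, not part of the tutorial materials, of the extra code a single-device PyTorch training loop acquires when converted to multi-process data parallelism with DistributedDataParallel. The model, data, and hyperparameters are stand-ins; a real conversion would additionally need sharded data loading, checkpointing, and launch-script changes.

```python
# Minimal sketch of converting a single-device training loop to
# multi-process data parallelism with PyTorch DistributedDataParallel.
# The model and data are placeholders; "gloo" is used so the sketch
# runs on CPU-only machines.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def train(rank: int, world_size: int):
    # Process-group setup: boilerplate that single-device code never needs.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = nn.Linear(10, 1)   # stand-in for a real model
    model = DDP(model)         # wrap so gradients are all-reduced across ranks

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    x, y = torch.randn(8, 10), torch.randn(8, 1)  # stand-in data
    for _ in range(3):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()        # DDP synchronizes gradients during backward
        opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # number of worker processes
    torch.multiprocessing.spawn(train, args=(world_size,), nprocs=world_size)
```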
In this tutorial, we will present improved techniques to automate distributed ML infrastructure. The tutorial covers three areas critical to ML parallelization: (1) Cataloging and standardizing parallel ML building blocks; (2) Representations and software frameworks for ML parallelism; (3) Algorithms and systems to automate ML parallelization and the resource allocation of ML jobs on shared clusters. By exposing unique characteristics of ML programs, and by dissecting successful cases to reveal how they can be harnessed, we present opportunities for ML researchers and practitioners to further shape and grow the area of SysML.
The audience should be familiar with ML and DL basics. Knowledge of TensorFlow, PyTorch, and distributed ML techniques is also helpful but not required.
Interested in the CASL open source projects from the tutorial?
Visit casl-project.ai and petuum.com to learn more!
University of California, Berkeley and Petuum, Inc.
Hao Zhang is currently a postdoctoral scholar at the RISE Lab, University of California, Berkeley. His general research interest is in scalable machine learning. He completed his Ph.D. at CMU. His past work, including AutoDist, Poseidon, and Cavs, is now being commercialized at the Pittsburgh-based startup Petuum, Inc.
Carnegie Mellon University and Petuum, Inc.
Aurick Qiao is a Ph.D. candidate at CMU. His research interest is in resource management for distributed ML in shared-resource computing environments. Several of Aurick's projects, including Litz and Bosen, are part of the Petuum project. Aurick is also an Engineering Lead at Petuum, building scalable, easy-to-use ML systems that "just work."
Petuum, Inc.
Qirong Ho is Co-Founder and CTO at Petuum, Inc., a technology startup spun out of the Petuum distributed ML team at Carnegie Mellon University. He holds a Ph.D. from CMU, and his research interests are in distributed ML systems with a view towards correctness, performance guarantees, robustness, programmability, and usability.
Mohamed bin Zayed University of Artificial Intelligence, Carnegie Mellon University, and Petuum, Inc.
Eric Xing is the President of MBZUAI, a Professor at CMU, and the Founder, Chairman, and Chief Scientist of Petuum, Inc. His main research interests are the development of ML and statistical methodology, and large-scale computational systems and architectures. He is an AAAI Fellow and an IEEE Fellow.