Open Positions - Postdocs
Our multi-faculty lab at MBZUAI, the Center for Integrative Artificial Intelligence, is looking for postdocs in the areas of ML systems, ML for Healthcare, and ML for Computational Biology - http://ciai.site/vacancies/
The Composable, Automatic and Scalable Learning Workshop
We've just completed the 1st CASL Workshop on Building Ecosystems for AI at Scale, for which I am an organizer.
Slides and video recordings are available here: https://workshop.casl-project.ai/
Research Overview and Current Projects
I work on distributed software systems for machine learning at Big Data and Big Model scales, with a view toward performance guarantees, theoretical correctness, and practical needs such as robustness, programmability, and usability. These systems form part of the CASL (Composable, Automatic and Scalable Learning) open source project.
I received my PhD from Carnegie Mellon University in 2014. My advisor was Eric P. Xing.
Automatic Strategies for Distributed Training & Inference
CASL Project: Alpa
Large-scale deep models with tens to hundreds of billions of parameters (e.g., large language models) require systems for distributed training and even distributed inference.
The variety and complexity of these models motivates systems that treat training and inference as a strategy-composition problem that can be numerically optimized.
The resulting automatically generated training and inference strategies match, and sometimes beat, the best hand-tuned systems. This means that novel deep models can be quickly set up for distributed training and inference, even by novices.
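The strategy-composition idea can be illustrated with a toy sketch: enumerate combinations of data-, tensor-, and pipeline-parallel degrees that fit the device count, score each with a cost model, and pick the cheapest. This is only a minimal illustration, not the actual Alpa algorithm; all function names and cost-model constants here are invented for demonstration.

```python
from itertools import product

def candidate_strategies(n_devices):
    """Enumerate (data, tensor, pipeline) parallel degrees whose product
    exactly uses all devices -- a toy version of the strategy space."""
    degrees = [1, 2, 4, 8, 16, 32, 64]
    for d, t, p in product(degrees, repeat=3):
        if d * t * p == n_devices:
            yield (d, t, p)

def estimated_step_time(d, t, p, params=30e9, batch=1024):
    """Hypothetical cost model: compute shrinks with total parallelism,
    while tensor parallelism adds all-reduce traffic and pipeline
    parallelism adds bubble overhead. Constants are illustrative only."""
    compute = params * batch / (d * t * p) * 1e-12  # seconds of compute
    comm = 0.002 * (t - 1) + 0.001 * (d - 1)        # sync overheads
    bubble = compute * (p - 1) / (p + 7)            # pipeline bubble
    return compute + comm + bubble

# Numerically optimize over the composed strategy space for 64 devices.
best = min(candidate_strategies(64), key=lambda s: estimated_step_time(*s))
print("best (data, tensor, pipeline) degrees:", best)
```

A real system searches a far richer space (operator sharding, device meshes, stage assignment) with a measured cost model, but the shape of the problem, enumerate-then-optimize, is the same.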
CASL Project: AdaptDL
The scalability of ML and deep learning training jobs is highly sensitive to job progress (i.e. early vs late stage training), number of parallel devices, and learning algorithm hyperparameters.
By actively measuring and forecasting goodput, a measure of ML training progress that accounts for both system speed and statistical quality, a system can schedule and adjust the parallelism of a workload of ML jobs in an adaptive, real-time manner. This allows the entire workload to complete faster than simply running the jobs one at a time with maximum parallelism.
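The core trade-off behind goodput can be sketched in a few lines: throughput rises with more replicas, while per-example statistical progress falls as the effective batch size grows, so the product of the two peaks at an intermediate replica count. This is a hedged toy model in the spirit of Pollux; every function and constant below is invented for illustration, not taken from the AdaptDL codebase.

```python
def throughput(n_replicas, per_gpu=250.0, overhead=0.05):
    """Examples/sec: near-linear scaling, dampened by a synchronization
    overhead penalty that grows with the replica count."""
    return n_replicas * per_gpu / (1.0 + overhead * (n_replicas - 1))

def statistical_efficiency(n_replicas, base_batch=32, grad_noise=512.0):
    """Fraction of per-example training progress retained at larger
    effective batch sizes (gradient-noise-style argument; constants
    are made up)."""
    batch = n_replicas * base_batch
    return (grad_noise + base_batch) / (grad_noise + batch)

def goodput(n_replicas):
    """Goodput = system throughput x statistical efficiency."""
    return throughput(n_replicas) * statistical_efficiency(n_replicas)

# The scheduler's "right-sizing" decision: pick the replica count that
# maximizes goodput rather than raw throughput.
best_n = max(range(1, 65), key=goodput)
print("goodput-optimal replica count:", best_n)
```

Maximizing throughput alone would always allocate all 64 replicas; maximizing goodput picks a smaller allocation, freeing devices for other jobs in the workload.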
Cost- and Pipeline-Aware Hyperparameter Tuning
ML programs often require auxiliary code, such as preprocessing and post-processing stages. Hyperparameter optimization systems rarely account for the impact of such auxiliary code on (1) measures of ML goodness (e.g., validation loss or accuracy) and (2) the time cost of the ML program.
This project applies Bayesian optimization to perform hyperparameter tuning on ML programs consisting of multiple code stages, i.e., a pipeline. By strategically re-using (memoizing) the outputs of earlier code stages, our system can tune entire ML pipelines (as opposed to merely tuning the learning algorithm) at substantially lower time cost.
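The memoization idea can be shown with a minimal sketch: cache each expensive pipeline stage on the hyperparameters that affect it, so a tuner that only varies downstream (model) hyperparameters never re-runs the upstream stages. The stage names, hyperparameters, and scoring function below are all hypothetical stand-ins, not the project's actual API.

```python
import functools

@functools.lru_cache(maxsize=None)
def preprocess(ngram_range):
    """Expensive featurization stage, memoized on its own hyperparameter
    so the tuner pays its cost only once per distinct setting."""
    return tuple(range(ngram_range * 100))  # stand-in for real features

def train_and_score(features, learning_rate):
    """Cheap stand-in for model training and validation scoring."""
    return -abs(learning_rate - 0.01) - 1.0 / len(features)

def evaluate(config):
    feats = preprocess(config["ngram_range"])  # cache hit if seen before
    return train_and_score(feats, config["lr"])

# A tuner sweeping lr with ngram_range fixed triggers preprocessing once:
scores = [evaluate({"ngram_range": 2, "lr": lr}) for lr in (0.001, 0.01, 0.1)]
print(preprocess.cache_info().misses)  # 1 -> preprocessing ran only once
```

A pipeline-aware tuner generalizes this: it tracks which hyperparameters feed which stage, memoizes stage outputs accordingly, and lets the Bayesian optimizer account for the (now much smaller) incremental cost of each trial.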
Systems for Resource Scheduling and Job Right-Sizing
AdaptDL/Pollux, our scheduling system that makes deep learning on clusters faster and cheaper, won the Jay Lepreau Best Paper Award at OSDI '21! We've also open-sourced AdaptDL/Pollux on our CASL website!
Qiao, Aurick, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, and Eric P. Xing. "Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning." OSDI 2021
Systems for Distributed Training & Inference
Xie, Pengtao, Jin Kyu Kim, Qirong Ho, Yaoliang Yu, and Eric Xing. "Orpheus: Efficient distributed machine learning via system and algorithm co-design." SoCC 2018
Xu, Shizhen, Hao Zhang, Graham Neubig, Wei Dai, Jin Kyu Kim, Zhijie Deng, Qirong Ho, Guangwen Yang, and Eric P. Xing. "Cavs: An efficient runtime system for dynamic neural networks." USENIX ATC 2018
Qiao, Aurick, Abutalib Aghayev, Weiren Yu, Haoyang Chen, Qirong Ho, Garth A. Gibson, and Eric P. Xing. "Litz: Elastic framework for high-performance distributed machine learning." USENIX ATC 2018
Zhang, Hao, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jinliang Wei, Pengtao Xie, and Eric P. Xing. "Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters." USENIX ATC 2017
Kim, Jin Kyu, Qirong Ho, Seunghak Lee, Xun Zheng, Wei Dai, Garth A. Gibson, and Eric P. Xing. "Strads: A distributed framework for scheduled model parallel machine learning." EuroSys 2016
Kumar, Abhimanu, Alex Beutel, Qirong Ho, and Eric Xing. "Fugue: Slow-worker-agnostic distributed learning for big models on big data." AISTATS 2014
Consistency Models in ML: the Stale Synchronous Parallel (SSP) family
Wei, Jinliang, Wei Dai, Aurick Qiao, Qirong Ho, Henggang Cui, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, and Eric P. Xing. "Managed communication and consistency for fast data-parallel iterative analytics." SoCC 2015
Dai, Wei, Abhimanu Kumar, Jinliang Wei, Qirong Ho, Garth Gibson, and Eric Xing. "High-performance distributed ML at scale through parameter server consistency models." AAAI 2015
Ho, Qirong, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B. Gibbons, Garth A. Gibson, Greg Ganger, and Eric P. Xing. "More effective distributed ML via a stale synchronous parallel parameter server." NeurIPS 2013
2016 Overview of Strategies and Principles in Distributed ML
Xing, Eric P., Qirong Ho, Pengtao Xie, and Wei Dai. "Strategies and principles of distributed machine learning on big data." Engineering 2, no. 2 (2016): 179-195.
Scalable Models and Training Algorithms
Ho, Qirong, Junming Yin, and Eric P. Xing. "Latent space inference of internet-scale networks." Journal of Machine Learning Research 17, no. 1 (2016): 2756-2796.
Hu, Zhiting, Qirong Ho, Avinava Dubey, and Eric Xing. "Large-scale distributed dependent nonparametric trees." ICML 2015
Yuan, Jinhui, Fei Gao, Qirong Ho, Wei Dai, Jinliang Wei, Xun Zheng, Eric P. Xing, Tie-Yan Liu, and Wei-Ying Ma. "LightLDA: Big topic models on modest computer clusters." WWW 2015
Ho, Qirong, and Eric P. Xing. "Analyzing time-evolving networks using an evolving cluster mixed membership stochastic blockmodel." Handbook of Mixed Membership Models and Their Applications (Chapter 22), edited by E.M. Airoldi, D.M. Blei, E.A. Erosheva, and S.E. Fienberg, 2014.
Petuum industrializes AI, turning businesses into owners, builders and informed users
My startup, Petuum, creates the standardized building blocks for assembling AI affordably and sustainably.
We're humbled and thrilled to be part of the WEF Tech Pioneers 2018, the CB Insights AI 100 lists for 2017 and 2018, the Pittsburgh Technology Council's AI Innovator of the Year 2018, and the Timmy Awards 2018 Best Tech Startup finalists. Info and videos:
ML Systems and Petuum:
The CASL Ecosystem - [Video]