Qirong Ho

Assistant Professor at Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)

Co-founder, CTO at Petuum


Google Scholar

Open Positions - Postdocs

Our multi-faculty lab at MBZUAI, the Center for Integrative Artificial Intelligence, is looking for postdocs in the areas of ML systems, ML for Healthcare, and ML for Computational Biology - http://ciai.site/vacancies/

The Composable, Automatic and Scalable Learning Workshop

We've just completed the 1st CASL Workshop on Building Ecosystems for AI at Scale, for which I am an organizer.

Slides and video recordings are available here: https://workshop.casl-project.ai/

Research Overview and Current Projects

I work on distributed software systems for Machine Learning at Big Data and Big Model scales, with a view towards performance guarantees, theoretical correctness, and practical needs like robustness, programmability and usability. These systems form part of the CASL (Composable, Automatic and Scalable Learning) open source project.

I received my PhD from Carnegie Mellon University in 2014. My advisor was Eric P. Xing.

Automatic Strategies for Distributed Training & Inference

Resource Scheduling and Job Right-Sizing

Cost- and Pipeline-Aware

CASL Project: Alpa

Large-scale deep models with 10s to 100s of billions of parameters (e.g. prompt models) require systems for distributed training and even inference.

The variety and complexity of these models incentivize systems that treat training and inference as a strategy composition problem, which can be numerically optimized.

The resulting automatically generated training and inference strategies match (and sometimes outperform) the best hand-tuned systems. This means that novel deep models can be quickly set up for distributed training and inference, even by novices.
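To make the idea of strategy composition as an optimization problem concrete, here is a minimal Python sketch. It is not Alpa's algorithm or API: the cost model, its constants, and every function name are hypothetical, and the search is a plain brute-force enumeration over (data-parallel, model-parallel) degree pairs rather than a real optimizer.

```python
def candidate_strategies(num_devices):
    """All (data_parallel, model_parallel) degree pairs that use every device."""
    return [(d, num_devices // d)
            for d in range(1, num_devices + 1)
            if num_devices % d == 0]


def estimated_step_time(strategy, params_gb, batch_size, device_mem_gb=16.0):
    """Hypothetical cost model (arbitrary units): compute plus communication."""
    data_par, model_par = strategy
    shard_gb = params_gb / model_par
    if shard_gb > device_mem_gb:
        return float("inf")  # this model shard does not fit on one device
    # Assumed: work is split evenly across all devices.
    compute = batch_size / (data_par * model_par)
    # Assumed: gradient synchronization cost grows with the shard size.
    grad_sync = shard_gb * (data_par - 1) / data_par
    # Assumed: traffic between model shards grows with the per-replica batch.
    activation_comm = 0.05 * batch_size / data_par * (model_par - 1)
    return compute + grad_sync + activation_comm


def best_strategy(num_devices, params_gb, batch_size):
    """Pick the candidate with the lowest estimated step time."""
    return min(candidate_strategies(num_devices),
               key=lambda s: estimated_step_time(s, params_gb, batch_size))


if __name__ == "__main__":
    print(best_strategy(8, params_gb=40, batch_size=64))
    print(best_strategy(8, params_gb=1, batch_size=64))
```

Under this toy cost model, a model too large for one device is pushed towards model parallelism, while a small model ends up with pure data parallelism. The point is only that the choice falls out of minimizing a cost function rather than hand tuning.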

CASL Project: AdaptDL

The scalability of ML and deep learning training jobs is highly sensitive to job progress (i.e. early vs late stage training), number of parallel devices, and learning algorithm hyperparameters.

By actively measuring and forecasting goodput - a new measure of ML training progress that accounts for speed and quality - a system can schedule and adjust the parallelism of a workload of ML jobs in an adaptive, real-time manner. This allows the entire workload to complete faster than simply running the jobs one at a time with maximum parallelism.
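The intuition behind goodput-driven allocation can be sketched in a few lines of Python. This is a toy illustration, not the AdaptDL/Pollux scheduler: the throughput model, the job parameters, and the greedy allocation rule are all assumptions made for the example. Goodput is modelled here as throughput (speed) multiplied by a statistical efficiency factor (quality of progress per example).

```python
def goodput(job, gpus):
    """Toy goodput: throughput times statistical efficiency."""
    if gpus == 0:
        return 0.0
    # Hypothetical throughput model with diminishing returns as GPUs are added.
    throughput = job["base_throughput"] * gpus / (1.0 + job["overhead"] * (gpus - 1))
    return throughput * job["efficiency"]


def allocate(jobs, total_gpus):
    """Greedily give each GPU to the job with the largest marginal goodput gain."""
    alloc = {name: 0 for name in jobs}
    for _ in range(total_gpus):
        best = max(
            jobs,
            key=lambda n: goodput(jobs[n], alloc[n] + 1) - goodput(jobs[n], alloc[n]),
        )
        alloc[best] += 1
    return alloc


if __name__ == "__main__":
    # Hypothetical jobs: A scales poorly, B scales well but is less efficient per example.
    jobs = {
        "A": {"base_throughput": 100.0, "overhead": 0.5, "efficiency": 1.0},
        "B": {"base_throughput": 100.0, "overhead": 0.05, "efficiency": 0.5},
    }
    print(allocate(jobs, total_gpus=4))
```

In this example, splitting the four GPUs between the two jobs yields a higher combined goodput than giving all four to either job alone, which is the effect the paragraph above describes for a whole workload.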

CASL Projects: Tuun and PIRLib

ML programs often require auxiliary code such as preprocessing and post-processing stages. Hyperparameter optimization systems rarely account for the impact of such auxiliary code on (1) measures of ML goodness (e.g. validation loss or accuracy) and (2) the time cost of the ML program.

This project applies Bayesian Optimization to perform hyperparameter tuning on ML programs consisting of multiple code stages - i.e. a pipeline. By strategically re-using (memoizing) the outputs of earlier code stages, our system can tune entire ML pipelines (as opposed to merely tuning the learning algorithm) with substantially lower time cost.
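The benefit of memoizing earlier pipeline stages can be shown with a small Python sketch. It does not use the Tuun or PIRLib APIs, and it swaps Bayesian Optimization for a plain grid search to stay short. The two stages, their hyperparameters (`window`, `threshold`), and the data are hypothetical.

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=None)
def preprocess(window):
    """Hypothetical expensive stage: moving average over a fixed series."""
    data = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0]
    return tuple(
        sum(data[i:i + window]) / window
        for i in range(len(data) - window + 1)
    )


def train_and_score(features, threshold):
    """Hypothetical learning stage: mean squared distance to a threshold (a loss)."""
    return sum((x - threshold) ** 2 for x in features) / len(features)


def tune(windows, thresholds):
    """Grid search over both stages; preprocess outputs are reused via the cache."""
    best = None
    for window, threshold in product(windows, thresholds):
        loss = train_and_score(preprocess(window), threshold)
        if best is None or loss < best[0]:
            best = (loss, window, threshold)
    return best


if __name__ == "__main__":
    print(tune([1, 2, 3], [3.0, 4.0, 5.0]))
    # Cache statistics show how often the preprocessing stage was actually run.
    print(preprocess.cache_info())
```

With three preprocessing settings and three learning settings, the grid has nine configurations, but the preprocessing stage only runs three times; the other six evaluations reuse cached outputs. A pipeline-aware tuner can exploit the same reuse when deciding which configuration to try next.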

Selected Publications

Systems for Resource Scheduling and Job Right-Sizing

    • AdaptDL/Pollux, our scheduling system that manages deep learning on clusters to make it faster & cheaper, has won the Jay Lepreau Best Paper at OSDI '21! We've also open sourced AdaptDL/Pollux at our CASL website!

    • Qiao, Aurick, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, and Eric P. Xing. "Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning." OSDI 2021

Systems for Distributed Training & Inference

  • Xie, Pengtao, Jin Kyu Kim, Qirong Ho, Yaoliang Yu, and Eric Xing. "Orpheus: Efficient distributed machine learning via system and algorithm co-design." SoCC 2018

  • Xu, Shizhen, Hao Zhang, Graham Neubig, Wei Dai, Jin Kyu Kim, Zhijie Deng, Qirong Ho, Guangwen Yang, and Eric P. Xing. "Cavs: An efficient runtime system for dynamic neural networks." USENIX ATC 2018

  • Qiao, Aurick, Abutalib Aghayev, Weiren Yu, Haoyang Chen, Qirong Ho, Garth A. Gibson, and Eric P. Xing. "Litz: Elastic framework for high-performance distributed machine learning." USENIX ATC 2018

  • Zhang, Hao, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jinliang Wei, Pengtao Xie, and Eric P. Xing. "Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters." USENIX ATC 2017

  • Kim, Jin Kyu, Qirong Ho, Seunghak Lee, Xun Zheng, Wei Dai, Garth A. Gibson, and Eric P. Xing. "Strads: A distributed framework for scheduled model parallel machine learning." EuroSys 2016

  • Kumar, Abhimanu, Alex Beutel, Qirong Ho, and Eric Xing. "Fugue: Slow-worker-agnostic distributed learning for big models on big data." AISTATS 2014

Consistency Models in ML: the Stale Synchronous Parallel (SSP) family

    • Wei, Jinliang, Wei Dai, Aurick Qiao, Qirong Ho, Henggang Cui, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, and Eric P. Xing. "Managed communication and consistency for fast data-parallel iterative analytics." SoCC 2015

    • Dai, Wei, Abhimanu Kumar, Jinliang Wei, Qirong Ho, Garth Gibson, and Eric Xing. "High-performance distributed ML at scale through parameter server consistency models." AAAI 2015

    • Ho, Qirong, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B. Gibbons, Garth A. Gibson, Greg Ganger, and Eric P. Xing. "More effective distributed ML via a stale synchronous parallel parameter server." NeurIPS 2013

2016 Overview of Strategies and Principles in Distributed ML

    • Xing, Eric P., Qirong Ho, Pengtao Xie, and Wei Dai. "Strategies and principles of distributed machine learning on big data." Engineering 2, no. 2 (2016): 179-195.

Scalable Models and Training Algorithms

  • Ho, Qirong, Junming Yin, and Eric P. Xing. "Latent space inference of internet-scale networks." Journal of Machine Learning Research 17, no. 1 (2016): 2756-2796.

  • Hu, Zhiting, Qirong Ho, Avinava Dubey, and Eric Xing. "Large-scale distributed dependent nonparametric trees." ICML 2015

  • Yuan, Jinhui, Fei Gao, Qirong Ho, Wei Dai, Jinliang Wei, Xun Zheng, Eric P. Xing, Tie-Yan Liu, and Wei-Ying Ma. "LightLDA: Big topic models on modest computer clusters." WWW 2015

Book Chapters

  • Qirong Ho and Eric P. Xing. Analyzing Time-Evolving Networks using an Evolving Cluster Mixed Membership Stochastic Blockmodel. Handbook of Mixed Membership Models and Their Applications (Chap 22), edited by E.M. Airoldi, D.M. Blei, E.A. Erosheva, and S.E. Fienberg, 2014.

My Google Scholar

Petuum industrializes AI, turning businesses into owners, builders and informed users

My startup, Petuum, creates the standardized building blocks for assembling AI affordably and sustainably.

We're humbled and thrilled to be part of the WEF Tech Pioneers 2018, the CB Insights AI 100 list in 2017 and 2018, the Pittsburgh Technology Council AI Innovator of the Year 2018, and the Timmy Awards 2018 Best Tech Startup Finalists. Info and videos: