Qirong Ho

Assistant Professor at Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)

Co-founder, CTO at Petuum


Google Scholar

Open Positions - Postdocs

Our multi-faculty lab at MBZUAI, the Center for Integrative Artificial Intelligence, is looking for postdocs in the areas of ML systems, ML for Healthcare, and ML for Computational Biology - http://ciai.site/vacancies/

The 2nd Composable, Automatic and Scalable Learning Workshop

We've just completed the 2nd CASL Workshop on Applications of AI at Scale, for which I am an organizer.

Slides and video recordings are available here: https://workshop.casl-project.ai/

Research Overview and Current Projects

I work on distributed software systems for Machine Learning at Big Data and Big Model scales, with a view towards performance guarantees, theoretical correctness, and practical needs like robustness, programmability and usability. These systems form part of the CASL (Composable, Automatic and Scalable Learning) open source project.

I received my PhD from Carnegie Mellon University in 2014. My advisor was Eric P. Xing.

Automatic Strategies for Distributed Training & Inference

Resource Scheduling and Job Right-Sizing

Cost- and Pipeline-Aware

CASL Project: Alpa

Large-scale deep models with tens to hundreds of billions of parameters (e.g. prompt models) require distributed systems for training, and even for inference.

The variety and complexity of these models incentivize systems that treat training and inference as a strategy composition problem, which can be numerically optimized.

The resulting automatically-generated training and inference strategies are equivalent to (or sometimes better than) those of the best hand-tuned systems. This means that novel deep models can be quickly set up for distributed training and inference, even by novices.
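As a toy illustration of treating parallelization as an optimization problem, the sketch below enumerates ways to split a fixed number of devices between data parallelism and model parallelism and picks the split with the lowest estimated step time. This is not Alpa's algorithm or API: the cost model, its constants, and the names `step_cost` and `best_strategy` are all hypothetical.

```python
def step_cost(dp, mp, compute=100.0, grad_sync=4.0, activation_comm=1.5):
    """Hypothetical per-step time for dp-way data and mp-way model parallelism."""
    # Compute is divided evenly across all dp * mp devices.
    compute_time = compute / (dp * mp)
    # Assumed gradient all-reduce cost: grows with dp, shrinks as the model is sharded.
    sync_time = grad_sync * (dp - 1) / dp / mp
    # Assumed activation-exchange cost: grows with the number of model shards.
    comm_time = activation_comm * (mp - 1)
    return compute_time + sync_time + comm_time


def best_strategy(num_devices):
    """Return the (dp, mp) split of num_devices with the lowest estimated cost."""
    candidates = [
        (dp, num_devices // dp)
        for dp in range(1, num_devices + 1)
        if num_devices % dp == 0
    ]
    return min(candidates, key=lambda s: step_cost(*s))


if __name__ == "__main__":
    print(best_strategy(8))
```

Under these made-up constants, a mixed split wins over pure data or pure model parallelism; a real system searches a far larger strategy space with a measured cost model.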

CASL Project: AdaptDL

The scalability of ML and deep learning training jobs is highly sensitive to job progress (i.e. early vs late stage training), number of parallel devices, and learning algorithm hyperparameters.

By actively measuring and forecasting goodput - a new measure of ML training progress that accounts for speed and quality - a system can schedule and adjust the parallelism of a workload of ML jobs in an adaptive, real-time manner. This allows the entire workload to complete faster than simply running the jobs one-at-a-time with maximum parallelism.
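The idea can be sketched with a toy model, assuming goodput is modeled as throughput multiplied by a statistical-efficiency factor that shrinks as parallelism grows. The functional forms, the parameter names (`base_throughput`, `comm_overhead`, `efficiency_decay`) and the greedy allocator are illustrative assumptions, not AdaptDL's implementation.

```python
def goodput(gpus, base_throughput, comm_overhead, efficiency_decay):
    """Toy goodput: system throughput times an assumed statistical efficiency."""
    if gpus == 0:
        return 0.0
    # Throughput scales sub-linearly because of communication overhead.
    throughput = base_throughput * gpus / (1.0 + comm_overhead * (gpus - 1))
    # More parallelism (larger effective batch) is assumed to make each example less useful.
    efficiency = 1.0 / (1.0 + efficiency_decay * (gpus - 1))
    return throughput * efficiency


def allocate(jobs, total_gpus):
    """Greedily hand out GPUs one at a time to the job with the largest goodput gain."""
    alloc = {name: 0 for name in jobs}
    for _ in range(total_gpus):
        def gain(name):
            params = jobs[name]
            return goodput(alloc[name] + 1, **params) - goodput(alloc[name], **params)
        best = max(jobs, key=gain)
        alloc[best] += 1
    return alloc


if __name__ == "__main__":
    jobs = {
        "early_stage": dict(base_throughput=100, comm_overhead=0.1, efficiency_decay=0.5),
        "late_stage": dict(base_throughput=100, comm_overhead=0.1, efficiency_decay=0.0),
    }
    print(allocate(jobs, total_gpus=4))
```

In this sketch the job whose efficiency does not decay with parallelism receives most of the GPUs, while the other job is right-sized to a single device instead of being run at maximum parallelism.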

CASL Projects: Tuun and PIRLib

ML programs often require auxiliary code such as preprocessing and post-processing stages. Hyperparameter optimization systems rarely account for the impact of such auxiliary code on (1) measures of ML goodness (e.g. validation loss or accuracy) and (2) the time cost of the ML program.

This project applies Bayesian Optimization to perform hyperparameter tuning on ML programs consisting of multiple code stages - i.e. a pipeline. By strategically re-using (memoizing) the outputs of earlier code stages, our system can tune entire ML pipelines (as opposed to merely tuning the learning algorithm) with substantially lower time cost.
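The memoization idea can be illustrated with a minimal two-stage pipeline. For brevity, the sketch swaps Bayesian Optimization for a plain grid search, and the stage functions, the toy loss and the hyperparameter names (`window`, `lr`) are hypothetical; it is not the Tuun or PIRLib API. The point is only that an earlier stage's output is computed once per distinct setting of that stage's hyperparameters and then re-used.

```python
from itertools import product

calls = {"preprocess": 0, "train": 0}  # counts how often each stage actually runs
_cache = {}  # memoized preprocessing outputs, keyed by the stage's hyperparameter


def preprocess(window):
    """Stage 1 (assumed expensive): moving averages over a toy dataset."""
    calls["preprocess"] += 1
    data = list(range(20))
    return tuple(
        sum(data[i:i + window]) / window
        for i in range(0, len(data) - window + 1)
    )


def train(features, lr):
    """Stage 2: returns a toy 'validation loss' (lower is better)."""
    calls["train"] += 1
    return abs(lr - 0.1) + 1.0 / len(features)


def run_pipeline(window, lr):
    # Re-use the stage-1 output whenever this window was already processed.
    if window not in _cache:
        _cache[window] = preprocess(window)
    return train(_cache[window], lr)


def tune(windows, lrs):
    """Grid search over the whole pipeline, standing in for Bayesian Optimization."""
    return min(product(windows, lrs), key=lambda cfg: run_pipeline(*cfg))


if __name__ == "__main__":
    print(tune([2, 4], [0.01, 0.1, 1.0]), calls)
```

With two window sizes and three learning rates, the pipeline is evaluated six times but preprocessing runs only twice, which is where the time savings would come from when the early stages dominate the cost.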

Selected Publications

Systems for Resource Scheduling and Job Right-Sizing

Systems for Distributed Training & Inference

Consistency Models in ML: the Stale Synchronous Parallel (SSP) family

2016 Overview of Strategies and Principles in Distributed ML

Scalable Models and Training Algorithms

Book Chapters

My Google Scholar

Petuum industrializes AI, turning businesses into owners, builders and informed users

My startup, Petuum, creates the standardized building blocks for assembling AI affordably and sustainably.

We're humbled and thrilled to be part of the WEF Tech Pioneers 2018, the CB Insights AI 100 list in 2017 and 2018, the Pittsburgh Technology Council AI Innovator of the Year 2018, and the Timmy Awards 2018 Best Tech Startup Finalists. Info and videos: