Haibin works on LLM and AGI infrastructure at ByteDance, focusing on optimizing training performance for LLMs and multimodal models at scale (more than 10,000 GPUs). Prior to the LLM era, Haibin worked on collective communication libraries (ByteCCL) and GPU-based recommendation model systems for Douyin and TikTok, on a team led by Yibo Zhu. Before joining ByteDance, he was at Amazon Web Services, working on ML framework core (Apache MXNet) and large-scale NLP model training on a team led by Mu Li and Alex Smola. He finished his M.S. in Computer Science in the Carnegie Mellon University Database Group, advised by Andy Pavlo. Haibin obtained his Bachelor's degree in Computer Science jointly from the University of Hong Kong and Shanghai Jiao Tong University.
Recently, we have also been working on veScale, a PyTorch-native auto-parallelism framework (collaborations are welcome!).
Software (Python, C++, CUDA)
veGiantModel, a library for LLM training with 3-D parallelism | maintainer, 2020
GluonNLP, a toolkit for natural language processing | maintainer, 2018
Horovod, a distributed training library | committer & TSC alumnus, 2018
BytePS on ps-lite, a distributed training library for deep learning | maintainer, 2017
Apache MXNet, a deep learning framework | maintainer & PMC, 2016
Peloton, an in-memory database management system | committer, 2015
Papers
Large-scale distributed training systems & HPC
FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion | Preprint, 2024
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs | NSDI, 2024
Hi-Speed DNN Training with Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies | EuroSys, 2023
dPRO: A Generic Performance Diagnosis and Optimization Toolkit for Expediting Distributed DNN Training | MLSys, 2022
Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes | Preprint, 2020
Is Network the Bottleneck for Distributed Training? | SIGCOMM (Network Meets AI & ML), 2020
ML frameworks and toolkits
Towards PyTorch-Native Auto-Parallel Framework | MLSys (Young Professionals Symposium), 2024
LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization | PPoPP (poster), 2024
CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs | EuroSys, 2024
GluonCV and GluonNLP: Deep Learning in Computer Vision and Natural Language Processing | JMLR, 2019
Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs | ICLR (Representation Learning on Graphs and Manifolds), 2019
Just-in-Time Dynamic-Batching | NeurIPS (Systems for Machine Learning), 2018
Deep learning
LEMON: Lossless Model Expansion | ICLR, 2024
ResNeSt: Split-Attention Networks | CVPR (Efficient Deep Learning for CV), 2022
Temporal-Contextual Recommendation in Real-Time | KDD (Best Paper Award), 2020
Database systems
Self-Driving Database Management Systems | CIDR, 2017
Distributed optimization algorithms
SAPipe: Staleness-Aware Pipeline for Data Parallel DNN Training | NeurIPS, 2022
CSER: Communication-efficient SGD with Error Reset | NeurIPS, 2020
Local AdaAlter: Communication-Efficient Stochastic Gradient Descent with Adaptive Learning Rates | NeurIPS (Optimization for ML), 2020
Dynamic Mini-batch SGD for Elastic Distributed Training: Learning in the Limbo of Resources | Preprint, 2019
Presentations
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs | Systems@Scale 2024
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs | Invited talk @Databricks 2024
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs | NSDI 2024
BytePS and ByteCCL for distributed training | Invited talk @Meta 2022
Accelerating recommendation model training using ByteCCL and UCX | UCF 2021
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet | NVIDIA GTC 2020
Amazon SageMaker and Apache MXNet: Tips & Tricks | AWS re:Invent 2019
Build State-of-the-art NLP Models with Amazon SageMaker and GluonNLP | AWS re:Invent 2019
Sparse Tensor for Large-scale Recommendation Systems and Natural Language Processing | Apache MXNet Summit 2018
Tutorials
Dive into Deep Learning for Natural Language Processing | EMNLP 2019
Everything You Need to Know to Reproduce SOTA Deep Learning Models: Hands-on Tutorial | ICCV 2019
From Shallow to Deep Language Representations: Pre-training, Fine-tuning, and Beyond | KDD 2019
Dive into Deep Learning for Natural Language Processing | JSALT 2019
Deep Learning and Natural Language Processing with Apache MXNet Gluon | KDD 2018
Blogs & Press
ByteDance Open-Sources the Large Model Training Framework veGiantModel | 2021
BERT Inference on G4 Instances using Apache MXNet and GluonNLP: 1 Million Requests for 20 Cents | 2020
Amazon Scientists Help SK Telecom Create Korean-based Natural Language Processor | 2020
GluonNLP 0.6: Closing the Gap in Reproducible Research with BERT | 2019
Introducing Dynamic Training for Deep Learning with Amazon EC2 | 2018
Apache MXNet Release Adds Support for New NVIDIA Volta GPUs and Sparse Tensor | 2017
Patents
A Generic Analysis and Optimization System for Accelerating Distributed DNN Training
A Gradient-Staleness-Aware Pipeline for Data-Parallel Deep Neural Network Training
A Cross-Model, Cross-Device Performance Predictor for Tensor Programs
Awards & Services
KDD Best Paper Award (Applied Data Science Track) | 2020
Soong Ching Ling Scholarships | 2011 - 2015
Dean's Honors List | 2012 - 2015
HKUEAA Scholarships (Top 0.1%) | 2014
Reviewer for AISTATS 2021, VLDB 2023