Haibin works on LLM and AGI infrastructure at ByteDance, focusing on optimizing training performance for LLMs and multimodal models at scale (more than 10,000 GPUs). Prior to the LLM era, Haibin worked on collective communication libraries (ByteCCL) and GPU-based recommendation model systems for Douyin and TikTok, on a team led by Yibo Zhu. Before joining ByteDance, he was at Amazon Web Services, working on ML framework core (Apache MXNet) and large-scale NLP model training on a team led by Mu Li and Alex Smola. He finished his M.S. in Computer Science in the Carnegie Mellon University Database Group, advised by Andy Pavlo. Haibin obtained his Bachelor's degree in Computer Science jointly from the University of Hong Kong and Shanghai Jiao Tong University.
Recently, we have also been working on veScale, a PyTorch-native auto-parallelism framework (collaborations are welcome!).
Software (Python, C++, CUDA)
veGiantModel, a library for LLM training with 3-D parallelism | maintainer, 2020
GluonNLP, a toolkit for natural language processing | maintainer, 2018
Horovod, a distributed training library | committer & TSC alumnus, 2018
BytePS on ps-lite, a distributed training library for deep learning | maintainer, 2017
Apache MXNet, a deep learning framework | maintainer & PMC, 2016
Peloton, an in-memory database management system | committer, 2015
Papers
Large-scale distributed training systems & HPC
FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion | Preprint, 2024
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs | NSDI, 2024
Hi-Speed DNN Training with Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies | EuroSys, 2023
dPRO: A Generic Performance Diagnosis and Optimization Toolkit for Expediting Distributed DNN Training | MLSys, 2022
Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes | Preprint, 2020
Is Network the Bottleneck for Distributed Training? | SIGCOMM (Network Meets AI & ML), 2020
ML frameworks and toolkits
Towards PyTorch-Native Auto-Parallel Framework | MLSys (Young Professionals Symposium), 2024
LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization | PPoPP (poster), 2024
CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs | EuroSys, 2024
GluonCV and GluonNLP: Deep Learning in Computer Vision and Natural Language Processing | JMLR, 2019
Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs | ICLR (Representation Learning on Graphs and Manifolds), 2019
Just-in-Time Dynamic-Batching | NeurIPS (Systems for Machine Learning), 2018
Deep learning
LEMON: Lossless Model Expansion | ICLR, 2024
ResNeSt: Split-Attention Networks | CVPR (Efficient Deep Learning for CV), 2022
Temporal-Contextual Recommendation in Real-Time | KDD (Best Paper Award), 2020
Database systems
Self-Driving Database Management Systems | CIDR, 2017
Distributed optimization algorithms
SAPipe: Staleness-Aware Pipeline for Data Parallel DNN Training | NeurIPS, 2022
CSER: Communication-efficient SGD with Error Reset | NeurIPS, 2020
Local AdaAlter: Communication-Efficient Stochastic Gradient Descent with Adaptive Learning Rates | NeurIPS (Optimization for ML), 2020
Dynamic Mini-batch SGD for Elastic Distributed Training: Learning in the Limbo of Resources | Preprint, 2019
Presentations
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs | Systems@Scale 2024
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs | Invited talk @Databricks 2024
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs | NSDI 2024
BytePS and ByteCCL for distributed training | Invited talk @Meta 2022
Accelerating recommendation model training using ByteCCL and UCX | UCF 2021
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet | NVIDIA GTC 2020
Amazon SageMaker and Apache MXNet: Tips & Tricks | AWS re:Invent 2019
Build State-of-the-art NLP Models with Amazon SageMaker and GluonNLP | AWS re:Invent 2019
Sparse Tensor for Large-scale Recommendation Systems and Natural Language Processing | Apache MXNet Summit 2018
Tutorials
Dive into Deep Learning for Natural Language Processing | EMNLP 2019
Everything You Need to Know to Reproduce SOTA Deep Learning Models: Hands-on Tutorial | ICCV 2019
From Shallow to Deep Language Representations: Pre-training, Fine-tuning, and Beyond | KDD 2019
Dive into Deep Learning for Natural Language Processing | JSALT 2019
Deep Learning and Natural Language Processing with Apache MXNet Gluon | KDD 2018
Blogs & Press
ByteDance Open-Sources the Large Model Training Framework veGiantModel | 2021
BERT Inference on G4 Instances using Apache MXNet and GluonNLP: 1 Million Requests for 20 Cents | 2020
Amazon Scientists Help SK Telecom Create Korean-based Natural Language Processor | 2020
GluonNLP 0.6: Closing the Gap in Reproducible Research with BERT | 2019
Introducing Dynamic Training for Deep Learning with Amazon EC2 | 2018
Apache MXNet Release Adds Support for New NVIDIA Volta GPUs and Sparse Tensor | 2017
Patents
A Generic Analysis and Optimization System for Accelerating Distributed DNN Training
A Gradient-Staleness-Aware Pipeline for Data-Parallel Deep Neural Network Training
A Cross-Model, Cross-Device Performance Predictor for Tensor Programs
Awards & Services
KDD Best Paper Award (Applied Data Science Track) | 2020
Soong Ching Ling Scholarships | 2011 - 2015
Dean's Honors List | 2012 - 2015
HKUEAA Scholarships (Top 0.1%) | 2014
Reviewer for AISTATS 2021, VLDB 2023