Large-scale Training

Our goal is to develop novel optimization algorithms for training machine learning models. Previously, we worked on algorithms and software packages for linear SVM (LIBLINEAR), kernel SVM (Divide-and-conquer SVM), graphical model learning (QUIC), and matrix factorization (LIBPMF). More recently, we have been working on the following topics:

  • Distributed training for deep neural networks:

Recently, many researchers and companies have favored synchronized training for deep neural networks, because it is easier to implement and often more stable. To run SGD with synchronized training, suppose a single machine can compute the average gradient over 512 samples; then 10 machines together can compute the average gradient over 5120 samples. This is equivalent to large-batch SGD. Increasing the batch size from 32 to 256 can indeed help convergence, but this no longer holds at batch sizes like 16384 or 65536: in that regime SGD converges more slowly and, even worse, tends to converge to bad solutions. There are two likely reasons: 1) each update only needs a good-enough gradient estimate, so once the gradient is accurate enough, adding more samples to the batch brings little benefit; and 2) a larger batch means smaller gradient noise, which potentially makes it harder to escape from sharp minima.
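
To make the synchronized, data-parallel picture concrete, here is a minimal sketch (illustrative only, not our actual training code) in which each worker computes a gradient on its own mini-batch and an all-reduce averages the gradients before every update, so N workers with a per-worker batch of 512 behave like single-machine SGD with a batch of 512*N. It assumes a PyTorch process group has already been initialized; all names are placeholders.

```python
# Minimal sketch of synchronized data-parallel SGD (illustrative only).
# Each of the N workers computes a gradient on its local mini-batch; an
# all-reduce then averages the gradients, so the update is equivalent to
# single-machine SGD with an N-times larger batch.
import torch.distributed as dist

def synchronized_sgd_step(model, loss_fn, batch, lr):
    world_size = dist.get_world_size()   # number of workers N
    inputs, targets = batch              # this worker's local mini-batch

    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()                      # local gradient over the local batch

    for p in model.parameters():
        if p.grad is None:
            continue
        # Sum gradients across workers, then divide: average over N * 512 samples.
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size
        # Plain SGD update; every worker applies the identical update.
        p.data -= lr * p.grad
```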

However, with better step size scheduling, it is still possible to scale up the batch size. See our success story in

ImageNet Training in Minutes (ICPP '18, Best paper award)
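
A key ingredient in making very large batches work is careful learning rate scheduling (the paper combines warmup with layer-wise adaptive rate scaling, LARS). The snippet below sketches only the generic warmup-plus-linear-scaling part of such a schedule; it is an illustration of the idea, not the exact recipe from the paper, and all constants are placeholder values.

```python
# Illustrative warmup + linear-scaling learning rate schedule for large-batch
# SGD (a generic sketch of the idea, not the schedule used in the paper).
def large_batch_lr(step, total_steps, base_lr=0.1, base_batch=256,
                   batch_size=16384, warmup_steps=500):
    # Linear scaling rule: scale the base learning rate with the batch size.
    peak_lr = base_lr * batch_size / base_batch
    if step < warmup_steps:
        # Gradual warmup: ramp up from a small rate to avoid early divergence.
        return peak_lr * (step + 1) / warmup_steps
    # Afterwards, decay toward zero (here: polynomial decay).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * (1.0 - progress) ** 2
```

With these illustrative numbers, a base rate of 0.1 at batch 256 scales to a peak rate of 6.4 at batch 16384, which is exactly why the warmup phase is needed early in training.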

If you don't have access to a good data center, or your distributed system has weak network connections, the centralized communication required by large-batch SGD becomes prohibitive. We can then apply decentralized algorithms to handle these cases:

Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent (NIPS '17, Oral Presentation)

(for more work on decentralized training, please refer to more recent work by Ji Liu and Wotao Yin!)
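
The core idea behind decentralized SGD is to replace the global all-reduce with parameter averaging over a sparse communication graph (for example, a ring), so each worker exchanges data with only a few neighbors per step. Below is a minimal single-process simulation of that gossip-style update pattern; it is an illustrative sketch, not the algorithm or implementation from the paper, and grad_fn is a placeholder for a per-worker stochastic gradient.

```python
# Minimal simulation of decentralized parallel SGD on a ring topology:
# each worker takes a local gradient step, then averages its parameters with
# its two ring neighbors instead of performing a global all-reduce.
import numpy as np

def decentralized_sgd(grad_fn, x0, n_workers=8, steps=100, lr=0.01):
    x = np.tile(x0, (n_workers, 1)).astype(float)   # one parameter copy per worker
    for _ in range(steps):
        # Local SGD step using each worker's own stochastic gradient.
        grads = np.stack([grad_fn(x[i], i) for i in range(n_workers)])
        x = x - lr * grads
        # Gossip averaging with the two ring neighbors (mixing weight 1/3 each).
        left = np.roll(x, 1, axis=0)
        right = np.roll(x, -1, axis=0)
        x = (x + left + right) / 3.0
    return x.mean(axis=0)   # the workers' copies drift toward a consensus
```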

Another work discussing the effect of batch size in training linear (convex) models:

Fast Variance Reduction Method with Stochastic Batch Size (ICML '18)

  • Previously, we also worked on fast training for other linear and kernelized models (a minimal sketch of the lock-free asynchronous update pattern that several of these methods build on follows the list):

(Multi-core) asynchronous training for kernel SVM: Asynchronous Parallel Greedy Coordinate Descent (NIPS '16)

(Distributed) synchronized training for kernel SVM: Communication-Efficient Distributed Block Minimization for Nonlinear Kernel Machines (KDD '17)

(Multi-core) asynchronous training for linear models: Fixing the Convergence Problems in Parallel Asynchronous Dual Coordinate Descent (ICDM '16)

(Multi-core) decentralized training: HogWild++: A New Mechanism for Decentralized Asynchronous Stochastic Gradient Descent (ICDM '16)
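
To make the "asynchronous" label above concrete, the sketch below shows the lock-free, Hogwild-style update pattern that this line of work builds on and whose convergence issues it studies: several threads read and update a shared parameter vector without any locking. This is a minimal illustration only, not the code from the papers above; grad_fn and n_samples are placeholders.

```python
# Minimal illustration of lock-free asynchronous SGD (Hogwild-style):
# several threads read and update a shared parameter vector without locking.
import numpy as np
import threading

def async_sgd(grad_fn, w, n_samples, n_threads=4, steps_per_thread=1000, lr=0.01):
    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(steps_per_thread):
            i = rng.integers(n_samples)   # pick a random training sample
            g = grad_fn(w, i)             # gradient at a possibly stale copy of w
            w -= lr * g                   # in-place, lock-free update
    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```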


Thanks to my students (Huan Zhang, Xuanqing Liu, Minhao Cheng) and collaborators (Yang You, Si Si, Inderjit Dhillon, James Demmel, Tao Zhang, Kurt Keutzer, Ji Liu, Xiangru Lian, Ce Zhang, Venkatesh Akella) for all of this interesting work.