Research

In data science and machine learning, computational tractability constrains the questions we can ask. For example, although Bayesian inference provides a natural paradigm for understanding the uncertainty of models given data, it is often avoided when model and data complexity are high. This is largely due to the preconception that Bayesian inference is harder to implement and less tractable than its frequentist counterpart. The lack of simple methods for obtaining efficient algorithms with convergence guarantees is a long-standing challenge in the field. My research has evolved to address these challenges in the realm of data science.

Sampling can be faster than optimization...

We studied locally non-convex objective functions, which appear in mixture models and multi-stable physical systems. We found that simple MCMC methods such as the Langevin algorithm can provide mean estimates and uncertainty quantification within a number of steps scaling linearly in the dimension d.

Finding the optimum of such objective functions, on the other hand, is in general NP-hard. This striking dichotomy reveals that in the presence of complex local structure, sampling can be a more computationally tractable task than optimization. [PNAS paper]
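As a minimal sketch of the sampler in question, here is the unadjusted Langevin algorithm applied to a two-mode Gaussian mixture, a simple locally non-convex target. The step size, number of steps, and the specific mixture are illustrative choices, not the settings used in the paper:

```python
import numpy as np

def langevin_sample(grad_log_p, z0, step=1e-2, n_steps=5000, rng=None):
    """Unadjusted Langevin algorithm:
       z <- z + step * grad log p(z) + sqrt(2*step) * Gaussian noise."""
    rng = np.random.default_rng(rng)
    z = np.array(z0, dtype=float)
    samples = np.empty((n_steps, z.size))
    for t in range(n_steps):
        z = z + step * grad_log_p(z) + np.sqrt(2 * step) * rng.standard_normal(z.size)
        samples[t] = z
    return samples

def grad_log_mixture(z):
    """Gradient of log p for the mixture 0.5*N(-2,1) + 0.5*N(2,1)."""
    w = 1.0 / (1.0 + np.exp(-4.0 * z))   # responsibility of the mode at +2
    return -(z - 2.0) * w - (z + 2.0) * (1.0 - w)

samples = langevin_sample(grad_log_mixture, np.array([0.0]), rng=0)
```

The chain never needs to locate the global optimum; it only needs to visit the modes in proportion to their mass, which is where the sampling-versus-optimization gap shows up.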

Locally Non-convex Objective Functions

...and can be accelerated

Framing MCMC algorithms as optimization over the space of probability measures has allowed us to bring the tools of optimization to bear on problems of sampling. It was previously known that the Langevin algorithm performs gradient descent on the Kullback–Leibler divergence in the space of measures. In the same vein, we found that the underdamped Langevin algorithm (which introduces an auxiliary momentum variable) performs accelerated gradient descent on the KL divergence. This algorithm improves the previous O(d) convergence rate to O(√d) for locally non-convex targets. [preprint]
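A minimal sketch of the underdamped dynamics, using a simple Euler discretization with an illustrative friction parameter gamma (the paper's discretization and parameter choices differ):

```python
import numpy as np

def underdamped_langevin(grad_U, x0, gamma=2.0, step=1e-2, n_steps=5000, rng=None):
    """Euler discretization of underdamped Langevin dynamics with momentum v:
       dx = v dt
       dv = -gamma * v dt - grad_U(x) dt + sqrt(2 * gamma) dW."""
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)          # auxiliary momentum variable
    xs = np.empty((n_steps, x.size))
    for t in range(n_steps):
        v = v - step * (gamma * v + grad_U(x)) \
              + np.sqrt(2 * gamma * step) * rng.standard_normal(x.size)
        x = x + step * v
        xs[t] = x
    return xs
```

The momentum variable plays the same role as in accelerated gradient descent: it lets the chain carry velocity across flat or mildly non-convex regions instead of diffusing step by step.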

We further discovered that by leveraging higher-order Langevin dynamics (with multiple auxiliary variables), even faster Langevin algorithms can be constructed. [preprint]

Accelerated Langevin dynamics


Unifying framework for gradient-based MCMC

We discovered a complete recipe for MCMC algorithms based on continuous Markov dynamics. Designing the entire algorithm reduces to choosing two matrices: a positive semi-definite matrix D(z), and a skew-symmetric matrix Q(z) whose role had not previously been fully explored. [NIPS paper] [Stats. & Comput. paper]

It can be shown that the skew-symmetric matrix Q(z) corresponds to irreversible dynamics and always improves convergence. I am happy to see many other researchers using our recipe to build new algorithms. It would be of great practical and theoretical interest to see how these new methods improve convergence in different settings. Of course, the answer to this question depends on how the continuous Markov dynamics are simulated and implemented in the algorithms.
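For intuition, here is a sketch of a single update in the simplest instance of the recipe, with constant matrices D and Q (for constant matrices the recipe's correction term vanishes). The target, step size, and choice Q = [[0, 1], [-1, 0]] are illustrative assumptions:

```python
import numpy as np

def recipe_step(z, grad_H, D, Q, step, rng):
    """One Euler step of the recipe's dynamics with constant D (PSD)
    and Q (skew-symmetric):
        z <- z - step * (D + Q) @ grad_H(z) + N(0, 2 * step * D)
    Q = 0 recovers the ordinary (reversible) Langevin update."""
    noise = rng.multivariate_normal(np.zeros(len(z)), 2 * step * D)
    return z - step * (D + Q) @ grad_H(z) + noise

# Illustrative run on a standard 2-D Gaussian target, H(z) = ||z||^2 / 2.
rng = np.random.default_rng(0)
D = np.eye(2)
Q = np.array([[0.0, 1.0], [-1.0, 0.0]])   # skew-symmetric: adds an irreversible swirl
z = np.zeros(2)
for _ in range(1000):
    z = recipe_step(z, lambda z: z, D, Q, 0.05, rng)
```

The skew-symmetric term contributes a divergence-free "swirl" around the level sets of H: it moves the chain without changing the stationary distribution, which is the mechanism behind the convergence gains from irreversibility.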

Complete Recipe for Continuous Dynamics MCMC

Irreversible MH and MALA

When the accuracy requirement is high, using the simulated Markov dynamics as a proposal distribution in the Metropolis-Hastings (MH) algorithm can provide fast convergence. But the MH algorithm has a reversibility property that eliminates the benefits of the skew-symmetric matrix. Hence we first generalized the MH algorithm to be irreversible.
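For reference, this is the standard reversible baseline that the irreversible samplers below generalize: MALA, i.e. a discretized Langevin proposal corrected by an MH accept/reject step (the irreversible acceptance rules themselves are in the papers and are not reproduced here):

```python
import numpy as np

def mala(log_p, grad_log_p, z0, step=0.5, n_steps=3000, rng=None):
    """Metropolis-adjusted Langevin algorithm: Langevin proposal + MH correction."""
    rng = np.random.default_rng(rng)
    z = np.array(z0, dtype=float)
    samples = np.empty((n_steps, z.size))
    for t in range(n_steps):
        mean_fwd = z + step * grad_log_p(z)
        prop = mean_fwd + np.sqrt(2 * step) * rng.standard_normal(z.size)
        mean_bwd = prop + step * grad_log_p(prop)
        # Gaussian proposal log-densities q(prop | z) and q(z | prop), variance 2*step
        log_q_fwd = -np.sum((prop - mean_fwd) ** 2) / (4 * step)
        log_q_bwd = -np.sum((z - mean_bwd) ** 2) / (4 * step)
        log_alpha = log_p(prop) - log_p(z) + log_q_bwd - log_q_fwd
        if np.log(rng.uniform()) < log_alpha:
            z = prop
        samples[t] = z
    return samples
```

The accept/reject step makes the chain exact with respect to the target, at the cost of enforcing detailed balance; removing that reversibility constraint while keeping exactness is precisely what the I-Jump construction addresses.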

Irreversible jump (I-Jump) sampler: With Emily Fox, Tianqi Chen, and Lei Wu, I proposed the I-Jump sampler, which is akin to MH (with the same ease of implementation) but allows for irreversible dynamics. We see benefits of irreversibility across a broad range of target distributions.

Irreversible MALA (I-MALA): We then incorporated the discretized general continuous dynamics (described above) as a proposal distribution in the irreversible jump sampler. The resulting I-MALA generalizes the Metropolis-adjusted Langevin algorithm (MALA) to be irreversible and to admit any continuous dynamics as a proposal. [Stats. & Comput. paper]

Simple Irreversible Jump Algorithm

Stochastic gradient MCMC and variance reduction

When datasets are large and the accuracy requirement is relatively low, we appeal to stochastic gradient MCMC methods. Through a new choice of the skew-symmetric matrix Q(z), we directly obtain a novel stochastic gradient Riemann HMC sampler that achieves state-of-the-art performance on streaming Wikipedia analysis. [NIPS paper] [poster]

When the accuracy demand is higher, variance reduction methods can help. We provide convergence guarantees to guide the choice among variance-reduced (SAGA), control-variate (CV), stochastic gradient (SG), and original Langevin dynamics (LD) algorithms, each suited to different scenarios according to the dimension d, dataset size N, and accuracy ε. [ICML paper]
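To illustrate the control-variate idea, here is a sketch of one Langevin step with a CV gradient estimate anchored at a reference point (e.g. a mode found by optimization). The function names `theta_star`, `grad_full_star`, and `grad_i` are hypothetical, and this is the standard CV construction rather than the exact algorithm analyzed in the paper:

```python
import numpy as np

def sgld_cv_step(theta, theta_star, grad_full_star, grad_i, data_idx, N, step, rng):
    """One control-variate stochastic gradient Langevin step.
    Unbiased gradient estimate anchored at reference point theta_star:
        g_hat = grad_U(theta_star)
              + (N / n) * sum_{i in minibatch} [grad_i(theta) - grad_i(theta_star)]
    followed by the usual Langevin update."""
    n = len(data_idx)
    corr = sum(grad_i(theta, i) - grad_i(theta_star, i) for i in data_idx)
    g_hat = grad_full_star + (N / n) * corr
    return theta - step * g_hat + np.sqrt(2 * step) * rng.standard_normal(theta.shape)
```

Near the reference point the correction terms nearly cancel, so the estimator's variance shrinks as the chain concentrates around the mode; that is the regime where CV beats plain stochastic gradients.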

We contributed an implementation of the stochastic gradient Hamiltonian Monte Carlo (SG-HMC) algorithm to Edward as a scalable and efficient inference tool.

Performance Comparison in Different Scenarios

Stochastic gradient MCMC for time series

A critical assumption in stochastic gradient MCMC methods is that the data are independent and identically distributed, which is often violated in reality. As a first step toward overcoming this restriction, we considered temporally correlated data.

For hidden Markov models, with Emily Fox and Nick Foti, I constructed new stochastic gradient procedures that use a small portion of the data in each update while controlling the error from breaking temporal dependencies. We observed significant benefits when analyzing an ion channel recording. [ICML paper] [poster]

Later we extended the method to more general state space models. Using the new inference method with switching dynamical systems models, we analyzed intracranial EEG recordings of canine seizures as well as historical city weather data, and observed sizable gains in efficiency. [SIMODS paper] [code]

Stochastic Gradient MCMC for Time Series

Studies on models from applications

I have also studied specific models arising from scientific and technological applications. For stochastic process models of bio-chemical and ecological systems, I took a non-equilibrium statistical mechanics point of view to derive a global, macroscopic picture of the systems. Using the fact that the dynamics preserve the invariant measures, I defined macroscopic quantities whose relationships provide a global characterization of the system in very low dimensions. [New J. Phys. paper] [Proc. Royal Soc. A paper] [poster]

The statistical mechanics approach also helped my collaborators and me understand the global Lyapunov stability of generic dynamical systems [paper]. We applied this approach to the control of a piecewise linear recurrent network model with limit cycle oscillations emerging from different bifurcations [preprint]. We also constructed Lyapunov functions for chaotic systems to reveal the structure of chaotic attractors [paper]; contrary to popular belief, these have a dynamical origin distinct from that of strange attractors.

On the technological side, I helped show how neural networks can use wider but sparser channels to gain representational power without increasing their computational cost [paper]. I also helped genome researchers infer RNA transcript counts from stochastic single-cell measurements of RNA sequence expression levels [Nature Methods paper].