Advances in Variational Inference

NIPS 2014 Workshop

13 December 2014 **♦** Level 5 **♦** Room 510a

Convention and Exhibition Center, Montreal, Canada

**Invited Talks**

The Expectation Maximization (EM) algorithm is a crucial tool to perform parameter estimation in latent data models. However, when processing large data sets, the EM algorithm becomes intractable since it requires the whole data set to be available at each iteration. In this talk, an online EM algorithm is proposed to perform maximum likelihood estimation in general parametric hidden Markov Models. This new algorithm updates the parameter estimate after a block of observations is processed (online). The convergence of this new algorithm is established, and the rate of convergence is studied, illustrating the impact of the block-size sequence.
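The block-wise update described above can be sketched on a toy model. The following is an illustrative stochastic-approximation version of EM for a two-component Gaussian mixture with known unit variances — not the hidden Markov model setting of the talk — using a hypothetical step-size sequence γ_t = 1/t; after each block, a running average of the expected sufficient statistics is updated and the parameters are re-estimated from it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two-component Gaussian mixture with unit variances.
true_means = np.array([-2.0, 2.0])
z = rng.choice(2, size=5000, p=[0.5, 0.5])
y = rng.normal(true_means[z], 1.0)

# Online (block) EM: update running sufficient statistics after each
# block of observations, then re-estimate the parameters.
means = np.array([-0.5, 0.5])   # initial guess
weights = np.array([0.5, 0.5])
s_resp = np.zeros(2)            # running average of E[1{z=j}]
s_y = np.zeros(2)               # running average of E[y * 1{z=j}]

block_size = 100
for t, start in enumerate(range(0, len(y), block_size), start=1):
    yb = y[start:start + block_size]
    # E-step on the block: responsibilities under the current parameters.
    logp = -0.5 * (yb[:, None] - means) ** 2 + np.log(weights)
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # Stochastic-approximation update of the sufficient statistics
    # (gamma_1 = 1, so the first block initializes the averages).
    gamma = 1.0 / t
    s_resp += gamma * (r.mean(axis=0) - s_resp)
    s_y += gamma * ((r * yb[:, None]).mean(axis=0) - s_y)
    # M-step: parameters as a closed-form function of the statistics.
    weights = s_resp / s_resp.sum()
    means = s_y / s_resp

print(np.round(np.sort(means), 1))
```

The key property, shared with the HMM algorithm of the talk, is that each parameter update touches only one block of data, so the full data set never needs to be held in memory at once.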

The talk also outlines a new estimation procedure for nonparametric hidden Markov models. It is assumed that the only information on the stationary hidden states (Xk) is given by the process (Yk), where Yk is a noisy observation of f(Xk). The talk will introduce a maximum pseudo-likelihood procedure to estimate the function f and the distribution of (X1,...,Xb) using blocks of observations of length b. The identifiability of the model is studied in the particular cases b = 1 and b = 2, and the consistency of the estimators of f and of the distribution of the hidden states is established as the number of observations grows to infinity.
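Under an additive-noise reading of the observation model (Yk = f(Xk) + εk, with εk of known density φ — an assumption not stated above), the block pseudo-likelihood estimator might be sketched as:

```latex
% Hypothetical notation: \pi is the joint law of (X_1,\dots,X_b),
% \varphi the density of the additive observation noise, n the sample size.
(\hat f_n, \hat\pi_n) \in \operatorname*{arg\,max}_{f,\,\pi}\;
\sum_{k=0}^{\lfloor n/b \rfloor - 1}
\log \int \prod_{i=1}^{b}
\varphi\!\left(Y_{kb+i} - f(x_i)\right)\, \pi(dx_1,\dots,dx_b)
```

That is, the full likelihood of the chain is replaced by a product of marginal likelihoods of non-overlapping blocks of length b, which is what makes identifiability hinge on the block length.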

**Durk Kingma**, Stochastic Backpropagation, Deep Variational Bayesian Inference and an Application to Semi-Supervised Learning

We explain how backpropagation can be applied to continuous stochastic variables, and how this allows us to perform efficient variational Bayesian inference by gradient ascent. This method also allows us to learn the parameters of many probabilistic models in which this was previously impractical. We illustrate the method by fulfilling the initial promise of the Helmholtz Machine, where an inference model is trained concurrently with a generative model, this time using correct gradients. We also illustrate the inference approach with an application to semi-supervised learning, and show how we beat the state of the art on benchmark tasks by a large margin. We conclude with visualisations of analogy-making with our deep inference procedure and generative models.
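The core device — rewriting a continuous stochastic variable as a deterministic transform of parameter-free noise so that gradients of an expectation flow through the transform — can be sketched in a few lines. The objective f and the Gaussian variational family below are illustrative choices, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)

# Reparameterization: write z ~ N(mu, sigma^2) as z = mu + sigma * eps
# with eps ~ N(0, 1), so d/dmu and d/dsigma of E[f(z)] become ordinary
# expectations of pathwise derivatives, estimable by Monte Carlo.
mu, sigma = 0.0, 1.0
# Toy objective (hypothetical): f(z) = (z - 3)^2, so f'(z) = 2 (z - 3).
eps = rng.standard_normal(100_000)
z = mu + sigma * eps

# d/dmu f(z) = f'(z) * dz/dmu = 2 (z - 3) * 1
grad_mu = (2.0 * (z - 3.0)).mean()
# d/dsigma f(z) = f'(z) * dz/dsigma = 2 (z - 3) * eps
grad_sigma = (2.0 * (z - 3.0) * eps).mean()

# Analytic values for comparison: E[(z-3)^2] = (mu-3)^2 + sigma^2,
# so the gradients are 2(mu-3) = -6 and 2*sigma = 2.
print(round(grad_mu, 1), round(grad_sigma, 1))
```

In variational Bayesian inference, f(z) is replaced by the integrand of the variational lower bound, and (mu, sigma) by the parameters of the inference model, which is exactly what lets the inference and generative networks be trained jointly by gradient ascent.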

**David Knowles**, Easier Variational Inference for Non-conjugate Models

Bayesian probabilistic models are attractive for interpretable data analysis, allowing automated learning of hyperparameters and rigorous assessment of uncertainty. However, MCMC, the standard workhorse for inference, poses multiple challenges to the practitioner: assessing convergence and mixing remains a dark art, for latent variable models summarizing posterior samples is non-trivial, and even the simplest methods are trickier to implement than their point estimation analogues. Variational inference significantly alleviates the first two: convergence can be assessed using the lower bound, and a parametric form for the posterior is naturally obtained. I will discuss two contributions that aim to address the final problem by enabling easier implementation of variational methods for general models: firstly, non-conjugate variational message passing, now part of the Infer.NET software package, and secondly, a version of stochastic variational inference for exponential-family approximating distributions which leverages connections to linear regression. I will show applications to models for finding gene-by-environment interactions using RNA-seq data and to a convolutional NMF for histology images.

**Matt Hoffman**, Why Variational Inference Gives Bad Parameter Estimates

Variational inference often produces qualitatively and quantitatively worse parameter estimates than other inference methods such as Markov chain Monte Carlo (MCMC). Two explanations are possible: either the global optimum of the variational inference objective is a poor estimator, or the objective is riddled with poor local optima that make it very difficult to find near-optimal solutions. In this talk I will present mathematical and empirical arguments for the latter explanation, and suggest some ways around the local optima problem.

Applications of statistical machine learning increasingly involve datasets with rich hierarchical, temporal, spatial, or relational structure. Bayesian nonparametric models offer the promise of effective learning from big datasets, but standard inference algorithms often fail in subtle and hard-to-diagnose ways. We explore this issue via variants of a popular and general model family, the hierarchical Dirichlet process. We propose a framework for "memoized" online optimization of variational learning objectives, which achieves computational scalability by processing local batches of data, while simultaneously adapting the global model structure in a coherent fashion. Using this approach, we build improved models of text, image, and social network data.
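The "memoized" bookkeeping described above can be sketched generically: per-batch summary statistics are cached, and the global summary is their exact sum, so revisiting a batch swaps its stale contribution out and its fresh contribution in — no step-size schedule is needed, unlike stochastic variational inference. The statistics below (count and sum for a running mean) are toy stand-ins for the expected sufficient statistics of a real model:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: 10 batches of 50 observations from N(4, 1).
data = rng.normal(4.0, 1.0, size=(10, 50))

cached = np.zeros((10, 2))   # per-batch (count, sum) statistics
global_stat = np.zeros(2)    # exact running total over all batches

for epoch in range(2):
    for b in range(10):
        # Remove this batch's stale contribution from the global summary...
        global_stat -= cached[b]
        # ...recompute its local statistics (trivial here; in a real model
        # this is a local variational update given the global parameters)...
        cached[b] = np.array([data[b].size, data[b].sum()])
        # ...and add the fresh contribution back, exactly.
        global_stat += cached[b]
        mean_estimate = global_stat[1] / global_stat[0]

print(round(mean_estimate, 2))
```

Because the global objective is always evaluated from an exact sum of cached statistics, structural moves such as adding or removing components can be scored coherently against the full-data objective while still processing one batch at a time.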

A challenge faced by variational inference methods is computational scaling to massive datasets. However, an equally important challenge is scaling to massive probabilistic models that can have a very large or even infinite number of parameters, such as Bayesian nonparametric models. Here, we will address the second challenge and discuss the variational inducing-variable approximation that has been developed in the Gaussian process community. We will first review this approximation, discussing some new theoretical results, and then extend this variational framework to learning determinantal point processes.
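For Gaussian process regression, the inducing-variable approximation admits a collapsed variational lower bound on the log marginal likelihood: a Gaussian log-density under the low-rank covariance Qnn = Knm Kmm^{-1} Kmn plus a trace penalty. A minimal sketch, with an illustrative RBF kernel, toy data, and assumed inducing locations:

```python
import numpy as np

rng = np.random.default_rng(3)

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel matrix between 1-D inputs a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

# Toy 1-D regression data (illustrative).
x = np.sort(rng.uniform(-3, 3, 60))
y = np.sin(x) + 0.1 * rng.standard_normal(60)
noise = 0.1 ** 2

def inducing_bound(z):
    """Collapsed variational lower bound on log p(y) for inducing inputs z:
    log N(y | 0, Qnn + noise*I) - tr(Knn - Qnn) / (2 * noise)."""
    kmm = rbf(z, z) + 1e-8 * np.eye(len(z))   # jitter for stability
    knm = rbf(x, z)
    qnn = knm @ np.linalg.solve(kmm, knm.T)
    cov = qnn + noise * np.eye(len(x))
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    trace_pen = (np.trace(rbf(x, x)) - np.trace(qnn)) / (2 * noise)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + quad) - trace_pen

few = inducing_bound(np.linspace(-3, 3, 3))
many = inducing_bound(np.linspace(-3, 3, 15))
print(many > few)   # enlarging the inducing set tightens the bound here
```

The cost is dominated by the m × m solve rather than an n × n one, which is what makes the approximation attractive at scale; the same variational treatment of inducing variables is what the talk extends to determinantal point processes.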