Contributed Papers

All papers will be represented by posters at the workshop, and we will have short spotlight talks for 5 papers that were rated as especially strong.


Paper Title: Tensor Analyzers [pdf]
Author Names: Yichuan Tang*, University of Toronto; Ruslan Salakhutdinov, University of Toronto; Geoffrey Hinton, University of Toronto
Abstract: Factor Analysis is a statistical method that seeks to explain linear variations in data by using unobserved latent variables. Due to its additive nature, it is not suitable for modeling data that is generated by multiple groups of latent factors which interact multiplicatively. In this paper, we introduce Tensor Analyzers which are a multilinear generalization of Factor Analyzers. We describe a fairly efficient way of sampling from the posterior distribution over factor values and we demonstrate that these samples can be used in the EM algorithm for learning interesting mixture models of natural image patches and of images containing a variety of simple shapes that vary in size and color. Tensor Analyzers can also accurately recognize a face under significant pose and illumination variations when given only one previous image of that face. We also show that mixtures of Tensor Analyzers outperform mixtures of Factor Analyzers at modeling natural image patches and artificial data produced using multiplicative interactions.
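For readers unfamiliar with the additive model the abstract contrasts against, the following is a minimal sketch of standard Factor Analysis only, not of the Tensor Analyzer itself. It assumes the textbook form x = W z + mu + eps with z ~ N(0, I) and diagonal noise; the function names and parameters are illustrative and do not come from the paper.

```python
import numpy as np

def sample_factor_analysis(W, mu, psi, n, rng):
    """Draw n samples from x = W z + mu + eps.

    z ~ N(0, I_k) are the latent factors and eps ~ N(0, diag(psi))
    is independent per-dimension noise (textbook Factor Analysis).
    """
    d, k = W.shape
    z = rng.standard_normal((n, k))                   # latent factors
    eps = rng.standard_normal((n, d)) * np.sqrt(psi)  # diagonal Gaussian noise
    return z @ W.T + mu + eps

def model_covariance(W, psi):
    """Covariance implied by the model: W W^T + diag(psi)."""
    return W @ W.T + np.diag(psi)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 2))   # hypothetical 4-D data, 2 factors
psi = np.full(4, 0.1)
X = sample_factor_analysis(W, np.zeros(4), psi, 200_000, rng)
```

Because each factor contributes a term that is simply added to the others, the data covariance has the form W W^T + diag(psi); interactions in which one group of factors rescales the effect of another fall outside this family, which is the limitation the abstract refers to.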

Paper Title: Temporal Autoencoding Restricted Boltzmann Machine [arxiv]
Author Names: Chris Hausler, Freie Universität Berlin; Alex Susemihl*, Berlin Institute of Technology
Abstract: Much work has been done refining and characterizing the kinds of receptive fields learned by deep learning algorithms. A lot of this work has focused on the development of Gabor-like filters learned when enforcing sparsity constraints on a natural image dataset. Little work, however, has investigated how these filters might extend to the temporal domain, namely through training on natural movies. Here we investigate exactly this problem in established temporal deep learning algorithms as well as a new learning paradigm suggested here, the Temporal Autoencoding Restricted Boltzmann Machine (TARBM).

Paper Title: Deep Gaussian processes [arxiv]
Author Names: Andreas Damianou*, University of Sheffield; Neil Lawrence, University of Sheffield
Abstract: In this paper we introduce deep Gaussian process (GP) models. Deep GPs are a deep belief network based on Gaussian process mappings. Data is modeled as the output of a multivariate GP. The inputs to that Gaussian process are then governed by another GP. A single layer model is equivalent to a standard GP or the GP latent variable model (GPLVM). We perform inference in the model by approximate variational marginalization. This results in a strict lower bound on the marginal likelihood of the model which we use for model selection (number of layers and nodes per layer). Deep belief networks are typically applied to relatively large data sets using stochastic gradient descent for optimization. Our fully Bayesian treatment allows for the application of deep models even when data is scarce. Model selection by our variational bound shows that a five layer hierarchy is justified even when modelling a digit data set containing only 150 examples.
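The layered construction described in the abstract (the inputs of one GP are themselves the outputs of another GP) can be illustrated by drawing a sample from the prior of such a stack. This is only a sketch under assumptions not stated in the abstract: an RBF kernel, zero mean, and a small diagonal jitter for numerical stability. It does not implement the variational inference used in the paper, and all function names are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def sample_gp_layer(inputs, out_dim, rng, jitter=1e-6):
    """Draw out_dim independent zero-mean GP functions evaluated at inputs."""
    n = inputs.shape[0]
    K = rbf_kernel(inputs, inputs) + jitter * np.eye(n)  # jitter keeps K positive definite
    L = np.linalg.cholesky(K)
    return L @ rng.standard_normal((n, out_dim))

def sample_deep_gp_prior(x, layer_dims, rng):
    """Feed x through a stack of GP layers; each layer's output is the next layer's input."""
    h = x
    for out_dim in layer_dims:
        h = sample_gp_layer(h, out_dim, rng)
    return h

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 30)[:, None]
y = sample_deep_gp_prior(x, layer_dims=[2, 1], rng=rng)  # two-layer hierarchy
```

With `layer_dims=[1]` the sketch reduces to a draw from an ordinary GP prior, which mirrors the abstract's remark that a single-layer model is a standard GP.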

Paper Title: Modeling Laminar Recordings from Visual Cortex with Semi-Restricted Boltzmann Machines [pdf]
Author Names: Urs Koster*, UC Berkeley; Jascha Sohl-Dickstein, UC Berkeley; Bruno Olshausen, UC Berkeley
Abstract: The proliferation of high density recording techniques presents us with new challenges for characterizing the statistics of neural activity over populations of many neurons. The Ising model, which is the maximum entropy model for pairwise correlations, has been used to model the instantaneous state of a population of neurons.  This model suffers from two major limitations: 1) Estimation for large models becomes computationally intractable, and 2) it cannot capture higher-order dependencies.  We propose applying a more general maximum entropy model, the semi-restricted Boltzmann machine (sRBM), which extends the Ising model to capture higher order dependencies using hidden units. Estimation of large models is made practical using minimum probability flow, a recently developed parameter estimation method for energy-based models. The partition functions of the models are estimated using annealed importance sampling, which allows for comparing models in terms of likelihood.  Applied to 32-channel polytrode data recorded from cat visual cortex, these higher order models significantly outperform Ising models. In addition, extending the model to spatiotemporal sequences of states allows us to predict spiking based on network history. Our results highlight the importance of modeling higher order interactions across space and time to characterize activity in cortical networks.

Paper Title: A Two-stage Pretraining Algorithm for Deep Boltzmann Machines [pdf]
Author Names: Kyunghyun Cho*, Aalto University; Tapani Raiko, Aalto University; Alexander Ilin, Aalto University; Juha Karhunen, Aalto University
Abstract: A deep Boltzmann machine (DBM) is a recently introduced Markov random field model that has multiple layers of hidden units. It has been shown empirically that it is difficult to train a DBM with approximate maximum-likelihood learning using the stochastic gradient, unlike its simpler special case, the restricted Boltzmann machine (RBM). In this paper, we propose a novel pretraining algorithm that consists of two stages: obtaining approximate posterior distributions over hidden units from a simpler model, and maximizing the variational lower bound given the fixed hidden posterior distributions. We show empirically that the proposed method overcomes the difficulty in training DBMs from randomly initialized parameters and results in a better, or comparable, generative model when compared to the conventional pretraining algorithm.

Paper Title: Linear-Nonlinear-Poisson Neurons Can Do Inference On Deep Boltzmann Machines [pdf]
Author Names: Louis Shao*, The Ohio State University
Abstract: One conjecture in both deep learning and the classical connectionist viewpoint is that the biological brain implements certain kinds of deep networks as its back-end. However, to our knowledge, a detailed correspondence has not yet been set up, which is important if we want to build a bridge between neuroscience and machine learning. Recent research has emphasized the biological plausibility of the Linear-Nonlinear-Poisson (LNP) neuron model. We show that with neurally plausible choices of parameters, the whole neural network is capable of representing any Boltzmann machine and performing a semi-stochastic Bayesian inference algorithm lying between Gibbs sampling and variational inference.

Paper Title: Crosslingual Distributed Representations of Words [pdf]
Author Names: Alexandre Klementiev*, Saarland University; Ivan Titov, Saarland University; Binod Bhattarai, Saarland University
Abstract: Distributed representations of words have proven extremely useful in numerous natural language processing tasks.  Their appeal is that they can help alleviate data sparsity problems common to supervised learning.  Methods for inducing these representations require only unlabeled language data, which are plentiful for many natural languages.  In this work, we induce distributed representations for a pair of languages jointly.  We treat it as a multitask learning problem where each task corresponds to a single word, and task relatedness is derived from co-occurrence statistics in bilingual parallel data.  These representations can be used for a number of crosslingual learning tasks, where a learner can be trained on annotations present in one language and applied to test data in another.  We show that our representations are informative by using them for crosslingual document classification, where classifiers trained on these representations substantially outperform strong baselines when applied to a new language.

Paper Title: Deep Attribute Networks [arxiv]
Author Names: Junyoung Chung*, KAIST; Donghoon Lee, KAIST; Youngjoo Seo, KAIST; Chang D. Yoo, KAIST
Abstract: Obtaining compact and discriminative features is one of the major challenges in many real-world image classification tasks such as face verification and object recognition. One possible approach is to represent the input image on the basis of high-level features that carry semantic meaning that humans can understand. In this paper, a model coined deep attribute network (DAN) is proposed to address this issue. For an input image, the model outputs the attributes of the input image without performing any classification. The efficacy of the proposed model is evaluated on unconstrained face verification and real-world object recognition tasks using the LFW and the a-PASCAL datasets. We demonstrate the potential of deep learning for attribute-based classification by showing results comparable with the existing state of the art. Once properly trained, the DAN is fast and does away with calculating low-level features, which may be unreliable and computationally expensive.

Paper Title: Accelerating sparse restricted Boltzmann machine training using non-Gaussianity measures [pdf]
Author Names: Sander Dieleman*, Ghent University; Benjamin Schrauwen
Abstract: In recent years, sparse restricted Boltzmann machines have gained popularity as unsupervised feature extractors. Starting from the observation that their training process is biphasic, we investigate how it can be accelerated: by determining when it can be stopped based on the non-Gaussianity of the distribution of the model parameters, and by increasing the learning rate when the learnt filters have locked on to their preferred configurations. We evaluated our approach on the CIFAR-10, NORB and GTZAN datasets.

Paper Title: Not all signals are created equal: Dynamic Objective Auto-Encoder for Multivariate Data [pdf]
Author Names: Martin Längkvist*, Örebro University; Amy Loutfi, Örebro University
Abstract: There is a representational capacity limit in a neural network defined by the number of hidden units. For multimodal time-series data, there could exist signals with varying complexity and redundancy. One way of getting a higher representational capacity for such input data is to increase the number of units in the hidden layer. We propose a step towards dynamically changing the number of units in the visible layer so that there is less focus on signals that are difficult to reconstruct and more focus on signals that are easier to reconstruct, with the goal of improving classification accuracy and better understanding the data itself. A comparison with state-of-the-art architectures shows that our model achieves a slightly better classification accuracy on the task of classifying various styles of human motion.

Paper Title: When Does a Mixture of Products Contain a Product of Mixtures? [arxiv]
Author Names: Guido Montufar Cuartas*, Pennsylvania State University; Jason Morton, Pennsylvania State University
Abstract: We prove results on the relative representational power of mixtures of products and products of mixtures; more precisely, restricted Boltzmann machines. In particular we find that an exponentially larger mixture model, requiring an exponentially larger number of parameters, is required to represent the distributions that can be represented by the restricted Boltzmann machine. This formally confirms a common intuition. Tools of independent interest are mode-based polyhedral approximations sensitive enough to compare even full-dimensional models, and characterizations of possible mode and support sets of both model classes. The title question is intimately related to questions in coding theory and the theory of hyperplane arrangements.

Paper Title: Kernels and Submodels of Deep Belief Networks [arxiv]
Author Names: Guido Montufar Cuartas*, Pennsylvania State University; Jason Morton, Pennsylvania State University
Abstract: We describe mixtures of products represented by layered networks from the perspective of linear stochastic maps, or kernel transitions of probability distributions. This gives a unified picture of distributed representations arising from Deep Belief Networks (DBN) and other networks without lateral interactions. We describe combinatorial and geometric properties of the set of kernels and concatenations of kernels realizable by DBNs as the parameters vary. We present explicit classes of probability distributions that can be learned by DBNs depending on the number of hidden layers and units that they contain. We use these submodels to bound the maximal and the expected Kullback-Leibler approximation errors of DBNs from above.

Paper Title: Online Representation Search and Its Interactions with Unsupervised Learning [pdf]
Author Names: Ashique Mahmood*, University of Alberta; Richard Sutton, University of Alberta
Abstract: We consider the problem of finding good hidden units, or features, for use in multilayer neural networks. Solution methods that generate candidate features, evaluate them, and retain the most useful ones (such as cascade correlation and NEAT), we call representation search methods. In this paper, we explore novel representation search methods in an online setting and compare them with two simple unsupervised learning algorithms that also scale online. We demonstrate that the unsupervised learning methods are effective only during the initial learning period. However, when combined with search strategies, they are able to improve the representation with more data and perform better than either search or unsupervised learning alone. We conclude that search has enabling effects on unsupervised learning in continual learning tasks.

Paper Title: Learning global properties of scene images from hierarchical representations [pdf]
Author Names: Wooyoung Lee*, Carnegie Mellon University; Michael Lewicki, Case Western Reserve University
Abstract: Scene images with similar spatial layout properties often display characteristic statistical regularities on a global scale. In order to develop an efficient code for these global properties that reflects their inherent regularities, we train a hierarchical probabilistic model to infer conditional correlational information from scene images. Fitting a model to a scene database yields a compact representation of global information that encodes salient visual structures with low dimensional latent variables. Using perceptual ratings and scene similarities based on spatial layouts of scene images, we demonstrate that the model representation is more consistent with perceptual similarities of scene images than the metrics based on the state-of-the-art visual features. 

Paper Title: Theano: new features and speed improvements [arxiv]
Author Names: Pascal Lamblin*, Université de Montréal; Frédéric Bastien, Université de Montréal; Razvan Pascanu, Universite de Montreal; James Bergstra, Harvard University; Ian Goodfellow, Université de Montréal; Arnaud Bergeron, Université de Montréal; Nicolas Bouchard, Université de Montréal; David Warde-Farley, Université de Montréal; Yoshua Bengio, University of Montreal
Abstract: Theano is a linear algebra compiler that optimizes a user's symbolically-specified mathematical computations to produce efficient low-level implementations.  In this paper, we present new features and efficiency improvements to Theano, and benchmarks demonstrating Theano's performance relative to Torch7, a recently introduced machine learning library, and to RNNLM, a C++ library targeted at recurrent neural networks.

Paper Title: Deep Target Algorithms for Deep Learning [pdf]
Author Names: Pierre Baldi*, UCI; Peter Sadowski, UCI
Abstract: There are many algorithms for training shallow architectures, such as perceptrons, SVMs, and shallow neural networks. Backpropagation (gradient descent) works well for a few layers but breaks down beyond a certain depth due to the well-known problem of vanishing or exploding gradients, and similar observations can be made for other shallow training algorithms. Here we introduce a novel class of algorithms for training deep architectures. This class reduces the difficult problem of training a deep architecture to the easier problem of training many shallow architectures by providing suitable targets for each hidden layer without backpropagating gradients, hence the name of deep target algorithms. This approach is very general, in that it works with both differentiable and non-differentiable functions, and can be shown to be convergent under reasonable assumptions. It is demonstrated here by training a four-layer autoencoder of non-differentiable threshold gates and a 21-layer neural network on the MNIST handwritten digit dataset.

Paper Title: Knowledge Matters: Importance of Prior Information for Optimization [pdf]
Author Names: Caglar Gulcehre*, University of Montreal; Yoshua Bengio, University of Montreal
Abstract: We explore the effect of introducing prior information into the intermediate level of neural networks for a learning task on which all the state-of-the-art machine learning algorithms tested failed to learn. We motivate our work from the hypothesis that humans learn such intermediate concepts from other individuals via a form of supervision or guidance using a curriculum. The experiments we have conducted provide positive evidence in favor of this hypothesis. In our experiments, a two-tiered MLP architecture is trained on a dataset of 64x64 binary input images, each containing three sprites. The final task is to decide whether all the sprites are the same or one of them is different. Sprites are pentomino tetris shapes, and they are placed in an image at different locations using scaling and rotation transformations. The first level of the two-tiered MLP is pre-trained with intermediate-level targets being the presence of sprites at each location, while the second level takes the output of the first level as input and predicts the final task target binary event. The two-tiered MLP architecture, with a few tens of thousands of examples, was able to learn the task perfectly, whereas all other algorithms (including unsupervised pre-training, but also traditional algorithms like SVMs, decision trees and boosting) perform no better than chance. We hypothesize that the optimization difficulty involved when the intermediate pre-training is not performed is due to the composition of two highly non-linear tasks. Our findings are also consistent with hypotheses on cultural learning inspired by the observations of optimization problems with deep learning, presumably because of effective local minima.

Paper Title: Understanding the exploding gradient problem [pdf]
Author Names: Razvan Pascanu*, Universite de Montreal; Tomas Mikolov, Brno University of Technology; Yoshua Bengio, University of Montreal
Abstract: The process of training Recurrent Neural Networks suffers from several issues, making this otherwise elegant model hard to use in practice. In this paper we focus on one such issue, namely the exploding gradient. Besides a careful and insightful description of the problem, we propose a simple yet efficient solution, which by altering the direction of the gradient avoids taking large steps while still following a descent direction.

Paper Title: Joint Training of Partially-Directed Deep Boltzmann Machines [pdf]
Author Names: Ian Goodfellow*, Universite de Montreal; Aaron Courville, Universite de Montreal; Yoshua Bengio, Universite de Montreal
Abstract: We introduce a deep probabilistic model which we call the partially directed deep Boltzmann machine (PD-DBM). The PD-DBM is a model of real-valued data based on the deep Boltzmann machine (DBM) and the spike-and-slab sparse coding (S3C) model. We offer a hypothesis for why DBMs may not be trained successfully without greedy layerwise training, and motivate the PD-DBM as a modified DBM that can be trained jointly.

Paper Title: Robust Subspace Clustering [pdf]
Author Names: Mahdi Soltanolkotabi*, Stanford; Ehsan Elhamifar; Emmanuel Candes, Stanford
Abstract: Subspace clustering is the problem of finding a multi-subspace representation that best fits a collection of points taken from a high-dimensional space. In this paper, we show that robust subspace clustering is possible using a tractable algorithm, which is a natural extension of Sparse Subspace Clustering (SSC). We prove that our methodology can learn the underlying subspaces under minimal requirements on the orientation of the subspaces, and on the number of samples needed per subspace. Stated differently, this work shows that it is possible to denoise a full-rank matrix if the columns lie close to a union of lower dimensional subspaces. We also provide synthetic as well as real data experiments demonstrating the effectiveness of our approach.

Paper Title: Regularized Auto-Encoders Estimate Local Statistics [pdf]
Author Names: Guillaume Alain, Universite de Montreal; Yoshua Bengio*, University of Montreal; Salah Rifai, University of Montreal
Abstract: What do auto-encoders learn about the underlying data generating distribution? Recent work suggests that some auto-encoder variants do a good job of capturing the local manifold structure of the unknown data generating density. This paper clarifies these previous intuitive observations by showing that minimizing a particular form of regularized reconstruction error yields a reconstruction function that locally characterizes the shape of the data generating density. More precisely, we show that the auto-encoder captures the local mean and local covariance (the latter being related to the tangent plane of a manifold near which density concentrates) as well as the first and second derivatives of the density, thereby connecting to previous work linking denoising auto-encoders and score matching for a particular form of energy function. Instead, the theorems provided here are completely generic and do not depend on the parametrization of the auto-encoder: they show what the auto-encoder would tend to if given enough capacity and examples. These results are for a training criterion that is locally equivalent to the denoising auto-encoder training criterion, and involves a contractive penalty, but applied on the whole reconstruction function rather than just on the encoder. One can consider the proposed training criterion as a convenient alternative to maximum likelihood, i.e., without partition function, similarly to score matching. Finally, we make the connection to existing sampling algorithms for such auto-encoders, based on an MCMC walking near the high-density manifold.

Paper Title: Attribute Based Object Identification [pdf]
Author Names: Yuyin Sun, University of Washington; Liefeng Bo*, Intel Science and Technology C; Dieter Fox
Abstract: Over the last years, the robotics community has made substantial progress in detection and 3D pose estimation of known and unknown objects. However, the question of how to identify objects based on language descriptions has not been investigated in detail. While the computer vision community recently started to investigate the use of attributes for object recognition, these approaches do not consider the task settings typically observed in robotics, where a combination of appearance attributes and object names might be used to identify specific objects in a scene. In this paper, we introduce an approach for identifying objects based on appearance and name attributes. To learn rich RGB-D features needed for attribute classification, we extend recently introduced sparse coding techniques so as to automatically learn attribute specific color and depth features. We use Mechanical Turk to collect a large data set of attribute descriptions of objects in the RGB-D object dataset. Our experiments show that learned attribute classifiers outperform previous instance based techniques for object identification. We also demonstrate that attribute specific features provide significantly better generalization to previously unseen attribute values, thereby enabling more rapid learning of new attribute values.

Paper Title: Learning High-Level Concepts by Training A Deep Network on Eye Fixations [pdf] [poster]
Author Names: Chengyao Shen*, National University of Singapo; Mingli Song, Zhejiang University; Qi Zhao, National University of Singapore
Abstract: Visual attention is the ability to select visual stimuli that are most behaviorally relevant among the many others. It allows us to allocate our limited processing resources to the most informative part of the visual scene. In this paper, we learn general high-level concepts with the aid of selective attention in a principled unsupervised framework, where a three-layer deep network is built and greedy layer-wise training is applied to learn mid- and high-level features from salient regions of images. The network is demonstrated to be able to successfully learn meaningful high-level concepts such as faces and texts in the third layer and mid-level features like junctions, textures, and parallelism in the second layer. Unlike pre-trained object detectors that have recently been included in saliency models to predict semantic objects, the higher-level features we learned are general base features that are not restricted to one or a few object categories. A saliency model built upon the learned features demonstrates its competitive predictive power in natural scenes compared with existing methods.

Paper Title: Multipath Sparse Coding Using Hierarchical Matching Pursuit [pdf]
Author Names: Liefeng Bo*, Intel Science and Technology C; Xiaofeng Ren; Dieter Fox
Abstract: Complex real-world signals, such as images, contain discriminative structures that differ in many aspects including scale, invariance, and data channel. While progress in deep learning shows the importance of learning features through multiple layers, it is equally important to learn features through multiple paths. We propose Multipath Hierarchical Matching Pursuit (M-HMP), a novel feature learning architecture that combines a collection of hierarchical sparse features for image classification to capture multiple aspects of discriminative structures. Our building blocks are KSVD and batch orthogonal matching pursuit (OMP), and we apply them recursively at varying layers and scales. The result is a highly discriminative image representation that leads to large improvements to the state-of-the-art on many standard benchmarks, e.g. Caltech-101, Caltech-256, MIT-Scenes and Caltech-UCSD Bird-200.
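As background for the OMP building block named in the abstract, here is a minimal sketch of plain orthogonal matching pursuit for a single signal. It is not the batch variant used in the paper, and it assumes a fixed dictionary with unit-norm columns rather than one learned by KSVD; the function name and arguments are illustrative.

```python
import numpy as np

def omp(D, x, k, tol=1e-10):
    """Greedy sparse coding of x over dictionary D with at most k nonzeros.

    D is assumed to have unit-norm columns (atoms). At every step the atom
    most correlated with the residual is added, then all selected
    coefficients are refit jointly by least squares.
    """
    x = np.asarray(x, dtype=float)
    coef = np.zeros(D.shape[1])
    support = []
    residual = x.copy()
    for _ in range(k):
        if np.linalg.norm(residual) <= tol:
            break  # signal already explained
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break  # no new atom improves the fit
        support.append(j)
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
        coef[:] = 0.0
        coef[support] = sol
    return coef

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))  # orthonormal toy dictionary
true_code = np.zeros(8)
true_code[[1, 5]] = [2.0, -3.0]
code = omp(Q, Q @ true_code, k=2)
```

The joint least-squares refit after each selection is what distinguishes OMP from plain matching pursuit: it keeps the residual orthogonal to every atom chosen so far.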

Paper Title: Jointly Learning and Selecting Features via Conditional Point-wise Mixture RBMs [pdf]
Author Names: Kihyuk Sohn*, University of Michigan; Guanyu Zhou; Honglak Lee, University of Michigan
Abstract: Feature selection is an important technique for finding relevant features from high-dimensional data. However, the performance of feature selection methods is often limited by the raw feature representation. On the other hand, unsupervised feature learning has recently emerged as a promising tool for extracting useful features from data. Although supervised information can be exploited in the process of supervised fine-tuning (preceded by unsupervised pre-training), the training becomes challenging when the unlabeled data contain significant amounts of irrelevant information. To address these issues, we propose a new generative model, the conditional point-wise mixture restricted Boltzmann machine, which attempts to perform feature grouping while learning the features. Our model represents each input coordinate as a mixture model when conditioned on the hidden units, where each group of hidden units can generate the corresponding mixture component. Furthermore, we present an extension of our method that combines bottom-up feature learning and top-down feature selection in a coherent way, which can effectively handle irrelevant input patterns by focusing on relevant signals and thus learn more informative features. Our experiments show that our model is effective in learning separate groups of hidden units (e.g., that correspond to informative signals vs. irrelevant patterns) from complex, noisy data.