Part V: Input embeddings, from shallow to deep

Speaker: Andrea Vedaldi

This part will review fundamental ideas in image representation. The goal of a representation is to transform the data into a format that facilitates learning. In most cases, this amounts to embedding the data in a suitable vector space. Starting from the desiderata for such embeddings, we will first look at hand-crafted constructions such as the histogram of oriented gradients (HOG), the bag of visual words (BoVW), the spatial pyramid, VLAD, and the Fisher vector. We will emphasise how these representations embed the data in a normed vector space whose distance captures a useful notion of image similarity. Then we will consider the opposite approach: starting from a useful definition of similarity, as captured by a non-linear kernel, we will show how a corresponding embedding can be derived. This construction leads to approximate kernel maps, efficient embeddings that preserve the target similarity while mapping the data to compact vectors. We will link this construction to the Nyström approximation, an infinite-dimensional analogue of PCA, and show that, for additive homogeneous kernels, the approximation can be computed very efficiently, either in closed form or numerically from example data. We will also show how discriminative metric learning can further improve a given data embedding, in some cases compressing the data further as well. We will illustrate these concepts with an example in face verification.
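
To make the idea of an approximate kernel map concrete, the sketch below (not taken from the tutorial materials) uses scikit-learn's AdditiveChi2Sampler, one available implementation of a closed-form feature map for the additive chi-squared kernel in the spirit of this construction, and checks that inner products of the embedded vectors approximate the exact kernel values.

```python
# A minimal sketch, assuming scikit-learn is available; the toy histograms are
# L1-normalised, as is common for BoVW-style representations.
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler

rng = np.random.default_rng(0)
X = rng.random((4, 128))                 # 4 toy histograms with 128 bins
X /= X.sum(axis=1, keepdims=True)        # L1 normalisation

# Exact additive chi-squared kernel: k(x, y) = sum_i 2 x_i y_i / (x_i + y_i)
def chi2_kernel(A, B):
    num = 2.0 * A[:, None, :] * B[None, :, :]
    den = A[:, None, :] + B[None, :, :] + 1e-12
    return (num / den).sum(axis=-1)

K_exact = chi2_kernel(X, X)

# Approximate finite-dimensional feature map: inner products of the embedded
# vectors approximate the kernel, so fast linear methods can be used downstream.
mapper = AdditiveChi2Sampler(sample_steps=3)
Phi = mapper.fit_transform(X)            # shape: (4, 128 * (2*3 - 1))
K_approx = Phi @ Phi.T

print("max |K_exact - K_approx| =", np.abs(K_exact - K_approx).max())
```

With such a map, a linear classifier trained on the embedded vectors behaves approximately like a non-linear classifier using the chi-squared kernel, at a fraction of the training and test cost.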

In the final part, we will compare these standard embeddings to the current generation of convolutional neural networks (CNNs). While training such networks requires significant amounts of data, inspired by DeCAF and similar approaches, we will experiment with pre-training a CNN on a large-scale dataset such as ImageNet and then using it as a general-purpose representation. We will compare different architectures and study several properties of the resulting representations.
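
The sketch below illustrates the DeCAF-style idea in present-day terms: take a CNN pre-trained on ImageNet and reuse its penultimate activations as a general-purpose image representation. torchvision and ResNet-18 are stand-ins chosen here for convenience and are not part of the tutorial itself, which compares several architectures.

```python
# A minimal sketch, assuming torchvision >= 0.13 and Pillow are installed.
import torch
import torchvision.models as models
from PIL import Image

weights = models.ResNet18_Weights.IMAGENET1K_V1
model = models.resnet18(weights=weights)
model.eval()

# Drop the final classification layer so the network outputs pooled features.
feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])

# The resize/crop/normalisation pipeline the pre-trained weights expect.
preprocess = weights.transforms()

def embed(image_path: str) -> torch.Tensor:
    """Map an image to a 512-dimensional feature vector."""
    img = Image.open(image_path).convert("RGB")
    x = preprocess(img).unsqueeze(0)     # add batch dimension
    with torch.no_grad():
        f = feature_extractor(x)         # shape: (1, 512, 1, 1)
    return torch.flatten(f, 1)[0]        # shape: (512,)
```

The resulting vectors can be compared with a cosine or Euclidean distance, or fed to a linear classifier, exactly as one would use a hand-crafted embedding such as a Fisher vector.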