Part IV: Large-scale visual recognition with deep learning

Speaker: Marc'Aurelio Ranzato

In this part, we will introduce deep learning, an emergent field of machine learning that aims at automatically learning feature hierarchies and which has shown promise in several large-scale computer vision applications. The key insight is that complex sensory inputs, such as images and videos, are better represented as a hierarchy of increasingly abstract and invariant features, and that such features can be learned in a data-driven manner. For instance, an image can be represented at the lowest layer of the hierarchy as a set of simple edges at certain orientations and positions. These edges can then be automatically composed into object parts of increasing complexity. Learning takes place at each layer of the hierarchy, leveraging large amounts of data and reducing the time-consuming and sub-optimal feature engineering step of many traditional computer vision systems. There are several ways to learn such features (in supervised, unsupervised and semi-supervised settings, depending on the amount of labeled data), and there are several models that can be used (probabilistic graphical models with hierarchies of latent variables, and different kinds of convolutional neural networks).

In this talk, I will focus on the most successful of these models, the convolutional neural network. I will explain how it works and provide several practical tips; a minimal sketch of such a network is given below. At the end of my slide deck, you can find several pointers to toolboxes and advanced reading material on this topic.
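To make the idea of a learned feature hierarchy concrete, here is a minimal sketch of a convolutional network in PyTorch (the talk does not prescribe a particular toolbox, so the library choice and all layer sizes here are illustrative assumptions): stacked convolution and pooling stages learn edge-like filters in the early layer and more abstract, part-like features in the later layer, and a linear classifier sits on top.

```python
# Minimal sketch of a convolutional network (assumes PyTorch; layer sizes are illustrative).
import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, padding=2),   # layer 1: learns oriented-edge-like filters
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling adds local translation invariance
            nn.Conv2d(16, 32, kernel_size=5, padding=2),  # layer 2: composes edges into part-like features
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes 32x32 RGB inputs

    def forward(self, x):
        h = self.features(x)
        return self.classifier(h.flatten(start_dim=1))

# Example: a batch of four 32x32 RGB images yields four vectors of class scores.
scores = SmallConvNet()(torch.randn(4, 3, 32, 32))
print(scores.shape)  # torch.Size([4, 10])
```

All layers, including the convolutional filters, are trained jointly from labeled data, which is what distinguishes this approach from hand-engineered feature pipelines.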