Convolutional Neural Networks: Reading Guide

Deep learning is behind state-of-the-art results on many problems across different fields. In computer vision, Convolutional Neural Networks (CNNs), one family of deep learning models, are applied to a range of image and video challenges. In this article, I point out what to read to get on the right track and start applying CNNs to your own vision problems.

For Arabic speakers, this video should be helpful.


  • Deep learning is a branch of machine learning, so you should be familiar with that field.
    • Probably the easiest start in machine learning is Andrew Ng's Coursera course.
      • You need to cover at least up to week 5 (Neural Networks: Learning).
      • However, it is well worth finishing the whole series.
  • You need to know neural networks (NN), specifically the backpropagation algorithm.
    • You can learn it from the course by Geoffrey Hinton himself!
    • It is a good idea to write/read code for the backpropagation algorithm.
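As a starting point for that exercise, here is a minimal sketch of backpropagation for a network with one hidden layer, written in plain Python. It is not taken from any of the courses above. The sigmoid activations, the squared-error loss, and all function names (`forward`, `backprop`, `sgd_step`, `init_params`) are my own assumptions, chosen to keep the example short.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(n_in, n_hidden, seed=0):
    """Random weights, zero biases. Layout: [W1, b1, W2, b2]."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b2 = 0.0
    return [W1, b1, W2, b2]


def forward(params, x):
    """Return the hidden activations h and the scalar output y."""
    W1, b1, W2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y


def loss(params, x, t):
    """Squared error for one training example (x, t)."""
    _, y = forward(params, x)
    return 0.5 * (y - t) ** 2


def backprop(params, x, t):
    """Gradients of the loss with respect to W1, b1, W2, b2."""
    W1, b1, W2, b2 = params
    h, y = forward(params, x)
    # Output layer: dL/dz2 = (y - t) * sigmoid'(z2), with sigmoid' = y * (1 - y).
    d2 = (y - t) * y * (1.0 - y)
    gW2 = [d2 * hi for hi in h]
    gb2 = d2
    # Hidden layer: push the output error back through W2 (chain rule).
    d1 = [d2 * w * hi * (1.0 - hi) for w, hi in zip(W2, h)]
    gW1 = [[dj * xi for xi in x] for dj in d1]
    gb1 = list(d1)
    return gW1, gb1, gW2, gb2


def sgd_step(params, grads, lr):
    """One gradient-descent update; returns new parameters."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    W1 = [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W1, gW1)]
    b1 = [b - lr * g for b, g in zip(b1, gb1)]
    W2 = [w - lr * g for w, g in zip(W2, gW2)]
    b2 = b2 - lr * gb2
    return [W1, b1, W2, b2]
```

A useful habit when writing your own version is to compare each analytic gradient against a finite-difference estimate of the loss before trusting it for training.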

Convolutional neural networks (CNN)

Basic Reading


Running CNN

  • The wiki lists the popular frameworks.
  • I am using Caffe, which is written in C++ and provides several interfaces.
    • Networks are defined in a text file, not in code.
    • It ships with many examples, including popular networks (AlexNet, GoogLeNet, ...).
    • The documentation is not perfect. Workaround: read these examples and the issues on GitHub.

CNN in Images

Several important papers apply CNNs to images. Below are a few of them, grouped by problem.

Image Classification

  • Krizhevsky et al 2012: ImageNet Classification with Deep Convolutional Neural Networks

Object Recognition

  • Girshick et al 2014: Rich feature hierarchies for accurate object detection and semantic segmentation

CNN in Videos

The real challenge in videos is also accounting for the temporal dimension of the data. One naive way is to ignore it, at the cost of losing the motion information. To avoid that, several methods have been proposed to make use of both spatial and temporal data. The approach of Tomas Pfister et al is the easiest of them: instead of feeding 1 image of 3 channels, feed k images as one image of 3k channels.
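To make the frame-stacking idea concrete, here is a small sketch in plain Python. It only illustrates the channel concatenation described above and is not the code from the paper. The channels-first layout (C, H, W) and the helper names `stack_frames` and `sliding_windows` are assumptions of mine.

```python
def stack_frames(frames):
    """Concatenate k frames of shape (C, H, W) into one input of shape (k*C, H, W).

    Each frame is a list of channels, and each channel is a list of rows.
    """
    if not frames:
        raise ValueError("need at least one frame")
    height = len(frames[0][0])
    width = len(frames[0][0][0])
    stacked = []
    for frame in frames:
        for channel in frame:
            if len(channel) != height or any(len(row) != width for row in channel):
                raise ValueError("all frames must share the same height and width")
            stacked.append(channel)
    return stacked


def sliding_windows(video, k):
    """Yield one stacked input for every window of k consecutive frames."""
    for start in range(len(video) - k + 1):
        yield stack_frames(video[start:start + k])
```

With RGB frames and k = 3, each window becomes a single 9-channel input, so only the first convolutional layer of an image network has to change to accept it.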

Video Classification (Action Recognition)

  • Ji et al 2010: 3D Convolutional Neural Networks for Human Action Recognition
  • Simonyan and Zisserman 2014: Two-Stream Convolutional Networks for Action Recognition in Videos
  • Karpathy et al 2014: Large-scale Video Classification with Convolutional Neural Networks

Pose Estimation

  • Pfister et al 2014: Deep Convolutional Neural Networks for Efficient Pose Estimation in Gesture Videos

Useful Papers

  • Zeiler and Fergus 2014: Visualizing and Understanding Convolutional Networks

Other Materials

  • LeCun et al 2010: Convolutional Networks and Applications in Vision


  • Caffe: Deep learning is the reason behind the recent push in performance on several problems. Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind.

Last update: Feb 10, 2015.