Deep learning is behind state-of-the-art results on several problems across different fields. In computer vision, Convolutional Neural Networks (CNNs), one family of deep learning models, are applied to a range of image and video challenges. In this article, I will point out what to read to get on the right track and start applying CNNs to your vision problems.
For Arabic speakers, this video should be helpful.
- Deep learning is a branch of machine learning, so you should be familiar with that field.
- The easiest start in machine learning is probably Andrew Ng's Coursera course.
- You need to cover at least up to week 5 (Neural Networks: Learning).
- However, it is worth finishing the whole series.
- You need to know neural networks (NN), specifically the backpropagation algorithm.
- You can learn it through the course by Geoffrey Hinton himself!
- It is a good idea to write/read code for the backpropagation algorithm.
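As a starting point for writing backpropagation code yourself, here is a minimal sketch in plain Python. It assumes a tiny network with one hidden layer, sigmoid activations, a single output unit, and a squared-error loss; the function names (`forward`, `loss`, `backward`, `sgd_step`) are illustrative choices and are not taken from any of the courses mentioned above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """Forward pass: returns the hidden activations and the scalar output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y


def loss(x, t, W1, b1, W2, b2):
    """Squared error between the network output and the target t."""
    _, y = forward(x, W1, b1, W2, b2)
    return 0.5 * (y - t) ** 2


def backward(x, t, W1, b1, W2, b2):
    """Backpropagation: gradient of the loss w.r.t. every parameter."""
    h, y = forward(x, W1, b1, W2, b2)
    # Error signal at the output unit: dL/dz for the output pre-activation.
    delta_out = (y - t) * y * (1.0 - y)
    gW2 = [delta_out * hi for hi in h]
    gb2 = delta_out
    # Push the error back through W2 and the hidden sigmoid units.
    delta_hid = [delta_out * w * hi * (1.0 - hi) for w, hi in zip(W2, h)]
    gW1 = [[d * xi for xi in x] for d in delta_hid]
    gb1 = list(delta_hid)
    return gW1, gb1, gW2, gb2


def sgd_step(x, t, W1, b1, W2, b2, lr=0.1):
    """One gradient-descent update on a single training example."""
    gW1, gb1, gW2, gb2 = backward(x, t, W1, b1, W2, b2)
    W1 = [[w - lr * g for w, g in zip(w_row, g_row)]
          for w_row, g_row in zip(W1, gW1)]
    b1 = [b - lr * g for b, g in zip(b1, gb1)]
    W2 = [w - lr * g for w, g in zip(W2, gW2)]
    b2 = b2 - lr * gb2
    return W1, b1, W2, b2
```

A common way to check code like this is to compare each gradient from `backward` with a finite-difference estimate computed from `loss`; the two should agree closely if the derivation is right.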
Convolutional neural networks (CNN)
- Yann LeCun and Yoshua Bengio introduced the CNN. You should read their work.
- The Deep Learning book (Yoshua Bengio et al.) has a nice introduction that highlights some interesting points.
- Stanford offers a course on CNNs, with some assignments.
- Wikipedia lists the popular deep learning frameworks.
- I am using Caffe, which is written in C++ and has some interfaces (Python and MATLAB).
- The network is defined in a text file, not in code.
- They provide many examples, including popular ones (AlexNet, GoogLeNet, ...).
- The documentation is not perfect. Workaround: read these examples and the issues on GitHub.
Several important papers apply CNNs to images. Below are a few of them, per problem.
- Krizhevsky et al. 2012: ImageNet Classification with Deep Convolutional Neural Networks
- Girshick et al. 2014: Rich feature hierarchies for accurate object detection and semantic segmentation
The real challenge in videos is to also account for the temporal dimension of the data. One naive way is to ignore it, at the cost of losing the motion information. To avoid that, several methods have been proposed that use both spatial and temporal data. The approach of Tomas Pfister et al. is the easiest of them: instead of feeding one image with 3 channels, feed k images as one image with 3k channels.
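The channel-stacking idea can be sketched in a few lines of plain Python. This is only an illustration of the "k images as one image of 3k channels" trick, not the actual preprocessing used by Pfister et al. It assumes a channel-first layout (each frame is a list of channel planes, each plane a list of rows), which is how Caffe orders its inputs; the names `stack_frames` and `sliding_clips` are hypothetical.

```python
def stack_frames(frames):
    """Stack k frames (each a list of C channel planes, channel-first)
    into a single input with k * C channels."""
    if not frames:
        raise ValueError("need at least one frame")
    height = len(frames[0][0])
    width = len(frames[0][0][0])
    stacked = []
    for frame in frames:
        for plane in frame:
            # Every plane must have the same spatial size to be stackable.
            if len(plane) != height or any(len(row) != width for row in plane):
                raise ValueError("all frames must share the same height and width")
            stacked.append(plane)
    return stacked


def sliding_clips(video, k):
    """Yield one stacked k-frame input for each temporal window of the video."""
    for start in range(len(video) - k + 1):
        yield stack_frames(video[start:start + k])
```

With 3-channel RGB frames and k = 3, each yielded input has 9 channels, so the first convolutional layer of the network has to be configured to accept 9 input channels instead of 3.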
Video Classification (Action Recognition)
- Ji et al. 2010: 3D Convolutional Neural Networks for Human Action Recognition
- Simonyan and Zisserman 2014: Two-Stream Convolutional Networks for Action Recognition in Videos
- Karpathy et al. 2014: Large-scale Video Classification with Convolutional Neural Networks
- Pfister et al. 2014: Deep Convolutional Neural Networks for Efficient Pose Estimation in Gesture Videos
- Zeiler and Fergus 2014: Visualizing and Understanding Convolutional Networks
- LeCun et al. 2010: Convolutional Networks and Applications in Vision
- Caffe: Deep learning is the reason behind the push for higher performance on several problems. Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind.
Last update: Feb 10, 2015.