Last update: 6 Oct, 2016.
In this article, I describe some basics that should ease the way for a junior vision researcher working on this problem. As I have only ~2 years of experience with it, some notes may not be entirely correct. If you find any mistakes or want to suggest additions, contact me.
Let's solve the following challenging problem: given an image that contains exactly one of a human, a leopard, or a dog, can you recognize the class/category/type of the image out of these 3 choices? NOTE: The image is cropped almost tightly around the object (e.g. the leopard in the image).
The above problem is known as the Image Classification Problem, where we try to classify the image as one of a set of specific types. In general, the classification target could go beyond objects (e.g. is this a sea scene?). Let's say we have successfully written a function Classify(tight_image, object_type) that returns a real-valued number indicating confidence in the type. E.g. Classify(tight_image, person) = 0.8 means we are 80% sure the image shows a human. Sometimes an evaluation asks for your single best guess for the class (top-1 error measure). Others ask for your best 5 guesses (top-5 error measure).
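To make the top-1 / top-k idea concrete, here is a minimal Python sketch. The confidence values, the dictionary layout, and the helper names `top_k` and `top_k_error` are all made up for illustration; they simply stand in for whatever a real Classify function would return.

```python
def top_k(scores, k=1):
    """Return the k class names with the highest confidence, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def top_k_error(all_scores, true_labels, k=1):
    """Fraction of images whose true label is NOT among the top-k guesses."""
    misses = sum(
        1
        for scores, label in zip(all_scores, true_labels)
        if label not in top_k(scores, k)
    )
    return misses / len(true_labels)


# Hypothetical confidences for two images (one dict per image).
all_scores = [
    {"person": 0.8, "leopard": 0.15, "dog": 0.05},
    {"person": 0.5, "leopard": 0.1, "dog": 0.4},
]
true_labels = ["person", "dog"]

print(top_k(all_scores[0], k=1))
print(top_k_error(all_scores, true_labels, k=1))
print(top_k_error(all_scores, true_labels, k=2))
```

With only 3 classes, a top-2 measure plays the role that top-5 plays on datasets with many classes: the second image is wrong under top-1 but counted as correct once the second guess is allowed.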
Src: http://www.cs.cmu.edu/~aarti/Class/10701_Spring14/slides/DeepLearning.pdf
When we have an image and want to identify an object somewhere inside it, we call the problem the Object Recognition Problem. An object is anything we are interested in (e.g. person, car, chair, sofa...). The hard part here is that we have to find a box/window inside the image that matches the target type, NOT use the whole image as it is. Below are several images with bounding boxes around the found objects.
Src: http://raweb.inria.fr/rapportsactivite/RA2008/willow/10.png
How could we use the solution to the 1st problem to solve the 2nd? Here is one idea: let's develop a function Detect(image) that proposes some promising boxes inside the image where an object probably exists. Then, for each box, evaluate it using the Classify function and pick the box with the highest score as our best candidate. The function Detect solves the Object Detection Problem: guessing boxes that are likely to contain the object. Using a recent algorithm, below is a set of guessed boxes (in blue) for the target objects. The correct boxes are in green.
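The propose-then-score idea can be sketched in a few lines of Python. Everything here is a toy stand-in: `detect` returns a fixed list of boxes, `classify` uses mean brightness as a fake confidence, and the `(x, y, w, h)` box layout is my own assumption. A real system would plug in an actual proposal method and a trained classifier.

```python
def crop(image, box):
    """Cut the (x, y, w, h) box out of an image stored as a list of rows."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]


def detect(image):
    """Toy proposal step: a fixed list of candidate boxes (x, y, w, h)."""
    return [(0, 0, 2, 2), (2, 2, 2, 2), (1, 1, 3, 3)]


def classify(tight_image, object_type):
    """Toy classifier: mean pixel value used as a fake confidence score."""
    pixels = [p for row in tight_image for p in row]
    return sum(pixels) / len(pixels)


def recognize(image, object_type):
    """Score every proposed box and return (best_score, best_box)."""
    scored = [(classify(crop(image, box), object_type), box)
              for box in detect(image)]
    return max(scored, key=lambda pair: pair[0])


# 4x4 toy image whose "object" is the bright 2x2 patch at the bottom right.
image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
best_score, best_box = recognize(image, "person")
print(best_score, best_box)
```

The point of the structure is that `recognize` knows nothing about how boxes are proposed or scored, so either piece can be swapped independently.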
Src: https://pdollar.files.wordpress.com/2013/12/selectivesearch.jpg
So far we have:
Further notes:
Both problems are classical, old ones in the literature. However, performance is not yet satisfying. As a result, books contain a great many approaches for solving them. For research purposes, to avoid reading many things that you may end up not using, rely on recent literature (e.g. the last 3 years) from top-tier conferences (e.g. CVPR, ICCV, ECCV) and top journals (e.g. IJCV). In parallel, at a slower rate, build up your view of the older solutions through papers, books, or wikis. Another way to get an overview of a research problem is to find recent surveys. Of course, if you can ask a professor working on the problem, that is great.
The best approach in terms of coverage would be brute force. We need promising rectangles, so treat all of them as promising: our output is ALL the boxes of a given image, and we don't filter them.
Then in an NxN image, we have on the order of N^4 windows. If N is 500, then N^4 = 62,500,000,000. That is why no one will ever implement that.
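The count above is an order-of-growth estimate. As a quick sanity check, the exact number of axis-aligned windows can be computed: a window is fixed by a (left, right) pair of columns and a (top, bottom) pair of rows, and there are N(N+1)/2 such pairs per axis. The sketch below (function names are mine) compares the closed form against literal enumeration on a tiny image.

```python
def count_windows(n):
    """Exact number of axis-aligned windows in an n x n image."""
    per_axis = n * (n + 1) // 2  # ways to pick start <= end along one axis
    return per_axis ** 2


def count_windows_brute(n):
    """Same count by enumerating every (left, right, top, bottom) combination."""
    return sum(
        1
        for left in range(n)
        for right in range(left, n)
        for top in range(n)
        for bottom in range(top, n)
    )


print(count_windows(6), count_windows_brute(6))  # both 441
print(count_windows(500))                        # 15687562500
print(500 ** 4)                                  # 62500000000
```

So the exact count is roughly N^4 / 4: smaller than N^4 by a constant factor, but still in the tens of billions for a 500x500 image.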
Some approaches try to filter these windows. For example, if we are detecting cars, we know the possible ratios between height and width, so we generate only windows with these ratios. Also, don't try every pixel; try every x pixels, since nearby pixels generate visually similar boxes.
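Both filters (fixed aspect ratios and a pixel step) fit naturally into a small generator. This is only a sketch: the parameter names, the `(x, y, w, h)` box layout, and the example heights and ratios are assumptions of mine, not values from any particular detector.

```python
def sliding_windows(img_w, img_h, heights, aspect_ratios, stride):
    """Yield (x, y, w, h) boxes for each height and width/height ratio,
    moving `stride` pixels at a time instead of visiting every pixel."""
    for h in heights:
        for ratio in aspect_ratios:
            w = int(round(h * ratio))
            if w < 1 or w > img_w or h > img_h:
                continue  # this shape does not fit inside the image
            for y in range(0, img_h - h + 1, stride):
                for x in range(0, img_w - w + 1, stride):
                    yield (x, y, w, h)


# Hypothetical setup: wide boxes (as for cars) on a 500x500 image.
boxes = list(sliding_windows(500, 500, heights=[100, 200],
                             aspect_ratios=[2.0, 2.5], stride=8))
print(len(boxes))

# A tiny case that is easy to check by hand.
tiny = list(sliding_windows(4, 4, heights=[2], aspect_ratios=[1.0], stride=2))
print(tiny)
```

Larger strides and fewer shapes shrink the candidate set quickly, at the cost of possibly stepping over the best-fitting box.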
[1] proposed a very nice branch-and-bound technique to solve the "Exhaustive Search" problem efficiently. They coupled recognition with detection, in a specific setup, to identify the best windows in order. The algorithm typically runs in O(N^2), but its worst case is O(N^4). The worst case happens when the object of interest doesn't exist in the image. Later, [2] proposed an update to the algorithm (IESS) that brings the worst case down to O(N^3). The algorithm is fast enough for real-time use. Source code for both algorithms was published. A newer third-party implementation of IESS is available on the web too.
Typically, to achieve state-of-the-art results, you won't rely on this approach, as better ones have since been published. However, in practice, you may in some cases use IESS for efficiency.
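To give a feel for the branch-and-bound idea, here is a simplified Python sketch in the spirit of [1]. It is not the authors' code. It assumes the quality of a box is just the sum of per-pixel scores inside it (a hypothetical `score_map`), keeps a set of boxes as four coordinate intervals, and bounds the best score in a set by the positive scores in the largest box of the set plus the negative scores in the smallest one. The most promising set is always split first.

```python
import heapq


def integral(grid):
    """Summed-area table with a zero border: S[y][x] = sum of grid[:y][:x]."""
    h, w = len(grid), len(grid[0])
    S = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            S[y + 1][x + 1] = grid[y][x] + S[y][x + 1] + S[y + 1][x] - S[y][x]
    return S


def box_sum(S, t, b, l, r):
    """Sum over rows t..b and columns l..r (inclusive); 0 for an empty box."""
    if t > b or l > r:
        return 0.0
    return S[b + 1][r + 1] - S[t][r + 1] - S[b + 1][l] + S[t][l]


def ess(score_map):
    """Return (best_score, (top, bottom, left, right)) by best-first search."""
    h, w = len(score_map), len(score_map[0])
    pos = integral([[max(v, 0.0) for v in row] for row in score_map])
    neg = integral([[min(v, 0.0) for v in row] for row in score_map])

    def bound(state):
        (t_lo, t_hi), (b_lo, b_hi), (l_lo, l_hi), (r_lo, r_hi) = state
        # Positives of the largest box + negatives of the smallest box.
        return (box_sum(pos, t_lo, b_hi, l_lo, r_hi)
                + box_sum(neg, t_hi, b_lo, l_hi, r_lo))

    # Intervals for top, bottom, left, right; initially every box is allowed.
    start = ((0, h - 1), (0, h - 1), (0, w - 1), (0, w - 1))
    heap = [(-bound(start), start)]
    while heap:
        neg_bound, state = heapq.heappop(heap)
        widths = [hi - lo for lo, hi in state]
        i = widths.index(max(widths))
        if widths[i] == 0:
            # A single box: its bound is its exact score, so it is optimal.
            return -neg_bound, tuple(lo for lo, _ in state)
        lo, hi = state[i]
        mid = (lo + hi) // 2
        for part in ((lo, mid), (mid + 1, hi)):  # split the widest interval
            child = state[:i] + (part,) + state[i + 1:]
            heapq.heappush(heap, (-bound(child), child))


# Toy score map: background -1, a 2x3 "object" region with score +2.
score_map = [[-1.0] * 6 for _ in range(5)]
for y in (1, 2):
    for x in (2, 3, 4):
        score_map[y][x] = 2.0

best_score, best_box = ess(score_map)
print(best_score, best_box)
```

In this sketch an empty box (top > bottom or left > right) counts as a valid answer with score 0, so a map with no positive evidence returns an empty box. That also hints at why the search is slow when the object is absent: with no clearly dominant region, the bounds stay close together and many sets must be split.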
[3] and similar approaches are now dominant. They rely on a preprocessing step to guess a small set of locations (e.g. 2k-3k) where an object is likely to be. Given such a small number of locations, much slower recognition approaches can be used. Edge Boxes is now state of the art (high recall, fewer boxes, faster).
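"High recall" for a proposal method means that most true objects are covered by at least one proposed box. One common way to measure this, assumed here, is intersection over union (IoU) with a threshold such as 0.5. The boxes below are invented, and the `(x1, y1, x2, y2)` layout is my own choice.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall(proposals, ground_truth, threshold=0.5):
    """Fraction of ground-truth boxes matched by at least one proposal."""
    hits = sum(
        1
        for gt in ground_truth
        if any(iou(p, gt) >= threshold for p in proposals)
    )
    return hits / len(ground_truth)


# Hypothetical boxes: the first object is covered, the second is missed.
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
proposals = [(1, 1, 10, 10), (50, 50, 60, 60)]

print(iou(proposals[0], ground_truth[0]))
print(recall(proposals, ground_truth))
```

A good proposal method keeps this recall high while producing as few boxes as possible, since every extra box costs a run of the slow recognizer.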
Since the boom of deep learning (2012), we have moved from "extracting hand-crafted features" to "learning features". Many past approaches are interesting but may be hard to understand. But right now, we learn features. To learn about deep learning, see.
Deep learning approaches: R-CNN [4], Fast R-CNN [7], Faster R-CNN [8], OverFeat [5].
Faster R-CNN is one of the best state-of-the-art approaches. It is very fast and accurate. The code is available, and you can tune it to specific datasets.
Before that, one of the popular approaches based on "hand-crafted features" was DPM [6]. Another older, popular feature that is still worth learning is BOVW (Bag of Visual Words), as used in [1].
Following the references of the above links, seeing who cites them, and reading the references below should reveal all the important papers in the literature.
[1] Beyond sliding windows: Object localization by efficient subwindow search. CVPR 2008.
[2] Efficient Algorithms for Subwindow Search in Object Detection and Localization. CVPR 2009.
[3] Selective Search for Object Recognition. IJCV 2013.
[4] Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.
[5] OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. ICLR 2014.
[6] Object Detection with Discriminatively Trained Part Based Models. PAMI 2010.
[7] Fast R-CNN. ICCV 2015.
[8] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.