Object Detection and Recognition Problems
Last update: 6 Oct, 2016.
In this article, I describe some basics that should smooth the road for a junior vision researcher working on this problem. As I have only ~2 years of experience with it, some notes may not be entirely correct. If you find any mistakes or want to suggest additions, contact me.
Let's solve the following challenging problem: given an image that contains exactly one of a human, a leopard, or a dog, can you recognize the class/category/type of the image out of these 3 choices? NOTE: The image is almost tightly cropped around the object (e.g. the leopard in the image).
The above problem is known as the Image Classification Problem, where we try to classify the image as one of a set of specific types. In general, the classification target could go beyond objects (e.g. is this a sea scene?). Let's say we have successfully written a function classify(tight_image, object_type) that returns a real-valued number indicating confidence in the type. E.g. classify(tight_image, person) = 0.8 means we are 80% sure the image shows a human. Sometimes only the single best guess for the class is needed (top-1 error measure). Others may ask for the best 5 guesses (top-5 error measure).
When we have an image and want to identify an object somewhere inside it, we call the problem the Object Recognition Problem. An object is anything we are interested in (e.g. person, car, chair, sofa...). The hard part here is that we have to find a box/window inside the image that matches the target type, NOT use the whole image as it is. Below are several images with bounding boxes around the found objects.
How could we use the solution to the 1st problem to solve the 2nd? Here is one idea: let's develop a function Detect(image) that proposes some promising boxes inside the image where an object probably exists. Then, for each box, evaluate it using the Classify function and pick the box with the highest score as our best candidate. The function Detect solves the Object Detection Problem: guessing boxes that are likely to contain the object. Using a recent algorithm, below is a set of guessed boxes (in blue) for the target objects, with the correct boxes in green.
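To make the top-1 and top-5 measures concrete, here is a minimal sketch. The score dictionaries are made-up stand-ins for the outputs of classify on three tight images, and the helper names top_k_classes and top_k_error are hypothetical. With only 3 classes, the example uses k=2 in place of k=5.

```python
def top_k_classes(scores, k):
    """Return the k class names with the highest confidence, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def top_k_error(all_scores, true_labels, k):
    """Fraction of images whose true label is missing from the top-k guesses."""
    misses = sum(
        1
        for scores, label in zip(all_scores, true_labels)
        if label not in top_k_classes(scores, k)
    )
    return misses / len(true_labels)


# Hypothetical classifier confidences for three tight images.
all_scores = [
    {"person": 0.8, "leopard": 0.1, "dog": 0.3},
    {"person": 0.2, "leopard": 0.5, "dog": 0.6},
    {"person": 0.1, "leopard": 0.7, "dog": 0.4},
]
true_labels = ["person", "leopard", "leopard"]

top1 = top_k_error(all_scores, true_labels, k=1)
top2 = top_k_error(all_scores, true_labels, k=2)
print(top1, top2)
```

The second image is a top-1 miss (the best guess is dog) but not a top-2 miss, which is why the error drops as k grows.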
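The idea of combining the two functions can be sketched as follows. This is only an illustration: the toy detect and classify functions below are placeholders I made up, boxes are assumed to be (x1, y1, x2, y2) tuples, and a real system would plug in an actual proposal method and a trained classifier.

```python
def recognize(image, target_type, detect, classify):
    """Score every proposed box and return (best_box, best_score).

    Returns (None, None) when the detector proposes no boxes.
    """
    best_box, best_score = None, None
    for box in detect(image):
        score = classify(image, box, target_type)
        if best_score is None or score > best_score:
            best_box, best_score = box, score
    return best_box, best_score


# Toy stand-ins (assumptions, not real models): fixed proposals and scores.
TOY_SCORES = {
    (0, 0, 50, 50): -0.3,
    (30, 40, 120, 160): 0.8,
    (10, 10, 20, 20): 0.1,
}


def toy_detect(image):
    return list(TOY_SCORES)


def toy_classify(image, box, object_type):
    return TOY_SCORES[box]


best_box, best_score = recognize(None, "person", toy_detect, toy_classify)
print(best_box, best_score)
```

Passing detect and classify as arguments keeps the sketch independent of any particular detector or classifier.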
So far we have:
- Object Detection Problem: Search the image for candidate positions/boxes that contain an object of interest. The output can be represented as a set of candidate boxes/rectangles/windows. Other shapes may be circles, contours, etc.
- Image Classification Problem: Given a whole image (or a box inside an image), evaluate/recognize this image/box against a specific type, such that the highest evaluation corresponds to the best guess for this image/box. E.g. if classify(box in an image, person) = 0.8 and classify(box in an image, chair) = -0.3, then this box probably shows a person.
- Object Recognition Problem: Given an image, locate specific objects inside it. Typically the above 2 problems must be solved in order to solve this one. We know in advance the types we need to detect. E.g. the Pascal contest defines 20 classes (Person, Chair, Dog, etc.).
- Sometimes researchers say Object Detection but mean Object Recognition.
- Sometimes the term Object Localization is used to mean Object Detection.
- In the ImageNet challenge:
  - Object Localization: Detect a specific number of the target object (e.g. detect 2 objects).
  - Object Detection: Detect all available objects.
  - Localization is easier than Detection, as you know at least how many objects are in the scene.
Both problems are classical, old ones in the literature. However, performance is not satisfying yet. As a result, there are very many approaches in books for solving them. For research purposes, to avoid reading many things that you may end up not using, rely on recent literature (e.g. the last 3 years) from top-tier conferences (e.g. CVPR, ICCV, ECCV) and top journals (e.g. IJCV). In parallel, at a slower rate, build your understanding of the older solutions through papers, books, or wikis. Another way to get an overview of a research problem is to find recent surveys. Of course, if you can ask a professor working on the problem, that is great.
Exhaustive Search [Not in use]
The most thorough approach would be brute force. We need promising rectangles, so we treat all of them as promising: our output is ALL the boxes of a given image, with no filtering.
Then, in an NxN image, we have on the order of N^4 boxes. If N is 500, that is 62,500,000,000 windows. That is why no one will ever implement it.
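As a quick check of this arithmetic, the sketch below counts the axis-aligned boxes exactly and compares the result with the N^4 figure. The function name count_windows is hypothetical, and the count assumes every box is defined by a pair of rows and a pair of columns, with single-pixel boxes allowed.

```python
def count_windows(n):
    """Exact number of axis-aligned boxes in an n x n pixel image."""
    # Per axis, pick a (start, end) pair with start <= end:
    # n * (n + 1) / 2 choices for rows, and the same for columns.
    per_axis = n * (n + 1) // 2
    return per_axis * per_axis


n = 500
exact = count_windows(n)
upper_bound = n ** 4
print(f"exact: {exact:,}  N^4: {upper_bound:,}")
```

The exact count is roughly a quarter of N^4, so it grows at the same rate and the conclusion is unchanged.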
Some approaches try to filter these windows. For example, if we are detecting cars, we know the possible ratios between height and width, so we generate only windows with these ratios. Also, don't try every pixel; try every x-th pixel, since nearby pixels generate visually similar boxes.
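A minimal sketch of these two filters (fixed aspect ratios and a pixel stride) is given below. The function name, the (x1, y1, x2, y2) box layout, and the specific heights, ratios, and stride are illustrative assumptions, not values from any particular detector. Here aspect_ratios means width divided by height.

```python
def sliding_windows(img_w, img_h, heights, aspect_ratios, stride):
    """Yield (x1, y1, x2, y2) boxes for the given heights, ratios and step."""
    for h in heights:
        for ratio in aspect_ratios:
            w = int(round(h * ratio))  # width derived from height
            if w < 1 or w > img_w or h > img_h:
                continue  # this box size does not fit in the image
            for y in range(0, img_h - h + 1, stride):
                for x in range(0, img_w - w + 1, stride):
                    yield (x, y, x + w, y + h)


# Assumed settings: wide boxes (e.g. for cars), step of 8 pixels.
boxes = list(
    sliding_windows(500, 500, heights=[100, 200],
                    aspect_ratios=[2.0, 2.5], stride=8)
)
print(len(boxes))
```

With these settings the candidate set shrinks from billions of windows to a few thousand, at the cost of missing objects whose size or shape falls outside the chosen heights and ratios.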
ESS Algorithm [Not in use]
The ESS paper (Beyond sliding windows: Object localization by efficient subwindow search, CVPR 2008) proposed a very nice branch-and-bound technique that solves the "Exhaustive Search" problem efficiently. The authors coupled recognition with detection, in a specific setup, to identify the best windows in order. The algorithm typically runs in O(N^2), but its worst case is O(N^4). The worst case happens when the object of interest doesn't exist in the image. Later, a follow-up paper (Efficient Algorithms for Subwindow Search in Object Detection and Localization, CVPR 2009) proposed an update to the algorithm (IESS) that brings the worst case down to O(N^3). The algorithm is really fast, fast enough for real-time use. Source code for both algorithms was published. A newer third-party implementation of IESS is available on the web too.
Typically, when trying to achieve state-of-the-art results, you won't rely on this approach, as better ones have since been published. However, in practice, you may still use IESS in some cases for efficiency.
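To give a feel for the branch-and-bound idea, here is a simplified sketch in the spirit of ESS, not the published implementation. It assumes the quality of a box is the sum of per-pixel weights inside it (the weights grid below is made up), and it bounds a whole set of boxes by the positive weights in the largest box of the set plus the negative weights in the smallest one.

```python
import heapq


def integral(grid):
    """Summed-area table with an extra zero row and column."""
    h, w = len(grid), len(grid[0])
    table = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            table[y + 1][x + 1] = (grid[y][x] + table[y][x + 1]
                                   + table[y + 1][x] - table[y][x])
    return table


def box_sum(table, top, bottom, left, right):
    """Sum over rows top..bottom, columns left..right (inclusive); 0 if empty."""
    if top > bottom or left > right:
        return 0.0
    return (table[bottom + 1][right + 1] - table[top][right + 1]
            - table[bottom + 1][left] + table[top][left])


def ess(weights):
    """Best-first search over sets of boxes; returns (score, (top, bottom, left, right))."""
    h, w = len(weights), len(weights[0])
    pos = integral([[max(v, 0.0) for v in row] for row in weights])
    neg = integral([[min(v, 0.0) for v in row] for row in weights])

    def bound(state):
        (t_lo, t_hi), (b_lo, b_hi), (l_lo, l_hi), (r_lo, r_hi) = state
        # Upper bound: positives of the largest box + negatives of the smallest.
        return (box_sum(pos, t_lo, b_hi, l_lo, r_hi)
                + box_sum(neg, t_hi, b_lo, l_hi, r_lo))

    # A state holds one interval per box side: top, bottom, left, right.
    start = ((0, h - 1), (0, h - 1), (0, w - 1), (0, w - 1))
    heap = [(-bound(start), start)]
    while heap:
        neg_score, state = heapq.heappop(heap)
        widths = [hi - lo for lo, hi in state]
        if max(widths) == 0:  # a single box: its bound is its exact score
            top, bottom, left, right = (lo for lo, _ in state)
            return -neg_score, (top, bottom, left, right)
        i = widths.index(max(widths))  # split the widest interval in half
        lo, hi = state[i]
        mid = (lo + hi) // 2
        for part in ((lo, mid), (mid + 1, hi)):
            child = state[:i] + (part,) + state[i + 1:]
            (t_lo, _), (_, b_hi), (l_lo, _), (_, r_hi) = child
            if t_lo > b_hi or l_lo > r_hi:
                continue  # the set contains no valid box
            heapq.heappush(heap, (-bound(child), child))


weights = [
    [-1, -1, -1, -1],
    [-1,  2,  3, -1],
    [-1,  1, -2, -1],
    [-1, -1, -1, -1],
]
score, box = ess(weights)
print(score, box)
```

Because the bound is exact for a single box and never underestimates any box in a set, the first single box popped from the queue is a global maximum. When all weights are negative (no object), the bounds are uninformative and the search degrades toward visiting many states, which matches the worst-case remark above.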
Selective Search [heavily used nowadays]
Selective Search (Selective Search for Object Recognition, IJCV 2013) and similar methods are the dominant approaches now. These approaches rely on preprocessing steps to guess a small set of locations (e.g. 2k-3k) where an object is likely to be. Given such a small number of locations, much slower recognition approaches can be used. Edge Boxes is now the state of the art (high recall, fewer boxes, faster).
State-of-the-Art Recognition Approaches
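"High recall" for a proposal method can be measured as the fraction of ground-truth boxes that are covered by at least one proposal. The sketch below assumes coverage means intersection over union (IoU) of at least 0.5, the overlap criterion used in Pascal VOC. The function names and the toy boxes are made up for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def proposal_recall(proposals, ground_truth, threshold=0.5):
    """Fraction of ground-truth boxes matched by at least one proposal."""
    hits = sum(
        1 for gt in ground_truth
        if any(iou(p, gt) >= threshold for p in proposals)
    )
    return hits / len(ground_truth)


# Toy data: the first object is covered, the second one is missed.
ground_truth = [(10, 10, 110, 110), (200, 200, 300, 300)]
proposals = [(20, 20, 120, 120), (0, 0, 50, 50), (250, 250, 400, 400)]

recall = proposal_recall(proposals, ground_truth)
print(recall)
```

A proposal method is attractive when it reaches high recall with few boxes, since every extra box costs one more run of the (slow) recognizer.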
Since the deep learning boom (2012), we have moved from extracting hand-crafted features to learning features. Many approaches from the past are interesting but may be hard to understand; right now, we learn features. To learn about deep learning, see.
Deep learning approaches: R-CNN, Fast R-CNN, Faster R-CNN, and OverFeat (see the reference list at the end).
Faster R-CNN is one of the best state-of-the-art approaches. It is very fast and accurate. The code is available, and you can fine-tune it on specific datasets.
Before that, one of the popular approaches based on hand-crafted features was DPM (Object Detection with Discriminatively Trained Part Based Models, PAMI 2009). Another old, popular representation that is still worth learning is BOVW (Bag of Visual Words) features, such as in .
Following the references in the above papers, seeing who cites them, and reading the references below should reveal all the important papers in the literature.
Machine Learning Background
- As I said, deep learning (e.g. CNNs) is now the most commonly used approach. To learn about deep learning, see.
- Roughly every 5 years, a different method becomes popular. Before CNNs, SVMs were very popular, and they are still worth learning.
- You may also see other methods such as decision trees and graphical models.
- K-means clustering is popular with BOVW.
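To illustrate how K-means and BOVW fit together, here is a small pure-Python sketch: K-means clusters local descriptors into a vocabulary of "visual words", and an image is then described by a histogram of its nearest words. The 2-D points are toy stand-ins for real local descriptors (which are much higher dimensional), and all function names are hypothetical.

```python
import random


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def kmeans(points, k, iterations=10, seed=0):
    """Plain K-means; returns k cluster centers (the visual vocabulary)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def bovw_histogram(descriptors, vocabulary):
    """Normalized count of descriptors assigned to each visual word."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest(d, vocabulary)] += 1
    total = sum(counts) or 1  # avoid dividing by zero for an empty image
    return [c / total for c in counts]


# Toy training descriptors forming two well-separated groups.
training = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
vocabulary = kmeans(training, k=2)

# Descriptors of one hypothetical image: three near one word, one near the other.
image_descriptors = [(0.2, 0.1), (0.5, 0.5), (0.0, 0.8), (10.3, 10.2)]
histogram = bovw_histogram(image_descriptors, vocabulary)
print(histogram)
```

The resulting fixed-length histogram is what a classifier such as an SVM would take as input, regardless of how many descriptors the image produced.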
Datasets
- Pascal Challenge: a long series of challenging datasets.
  - I think the latest is the 2012 edition.
  - The 2007 edition is popular due to the availability of ground truth for testing. Not sure about more recent ones.
  - There were some changes to the evaluation measures, I guess starting from 2008.
- ImageNet is the most challenging image dataset.
- A Seismic Shift in Object Detection
Beyond sliding windows: Object localization by efficient subwindow search. CVPR 2008.
Efficient Algorithms for Subwindow Search in Object Detection and Localization. CVPR 2009.
Selective Search for Object Recognition. IJCV 2013.
Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.
OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. ICLR 2014.
Object Detection with Discriminatively Trained Part Based Models. PAMI 2009.
Fast R-CNN. ICCV 2015.
Faster R-CNN. 2016.