This process involves several simple steps, beginning with converting the images to grayscale (if they are not already in that format) to reduce processing time and memory overhead. Once all images are in grayscale, we perform dimensionality reduction via downsampling, which compresses each image to a reasonable size (on the order of 20x20 pixels), and flatten the result into a 1-dimensional feature vector.
This is the simplest feature extraction technique we tried, meant primarily as a benchmark for the other approaches; however, it performed surprisingly well.
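As a concrete sketch, this pipeline is only a few lines of Python. The 20x20 target size matches our tests; the use of OpenCV and area interpolation here is an illustrative assumption, not necessarily the original implementation.

```python
import cv2
import numpy as np

def downsample_features(image, size=(20, 20)):
    """Grayscale conversion, downsampling, and flattening."""
    # Convert to grayscale only if the image still has color channels
    # (assumes OpenCV's BGR channel order).
    if image.ndim == 3:
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # INTER_AREA is a reasonable interpolation choice when shrinking.
    small = cv2.resize(image, size, interpolation=cv2.INTER_AREA)
    # Flatten to a 1-D feature vector (length 400 for a 20x20 image).
    return small.flatten().astype(np.float32)
```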
The Fourier transform allows images to be analyzed in the frequency domain. Since images consist of discrete digital pixels, the Discrete Fourier Transform (DFT) is the appropriate tool for analyzing their frequency spectra. We use the Fast Fourier Transform (FFT) algorithm to compute the DFT efficiently.
This feature extraction technique computes the N-point, 2-dimensional DFT of a grayscale image, where N is the size of the output. For our tests we used N=(20, 20), giving 20 frequency bins in each dimension of the image. This yields feature vectors of size 400 after we flatten the output of the FFT algorithm. The image on the left shows the result of applying the FFT algorithm to an image.
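A minimal sketch of this featurization using NumPy is shown below. Passing the shape argument `s` makes `fft2` an N-point transform (the input is cropped or zero-padded to that size); whether the original pipeline resized the image first or used the size argument directly is an assumption here, as is taking the magnitude to obtain a real-valued vector.

```python
import numpy as np

def fft_features(gray, n=(20, 20)):
    """N-point 2-D DFT of a grayscale image, flattened to a feature vector."""
    # fft2 crops or zero-pads the input to shape s before transforming,
    # yielding n[0] x n[1] frequency bins.
    spectrum = np.fft.fft2(gray, s=n)
    # The DFT output is complex; keep the magnitude so the features are real.
    return np.abs(spectrum).flatten()  # length 400 for n=(20, 20)
```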
Using N=(20,20) yielded mediocre results. In fact, this method performed worse than downsampling, which leads us to believe that either superfluous information was captured that should have been filtered out, or the small number of frequency bins does not adequately represent the image. The algorithm could likely be improved by introducing filtering in the frequency domain and using more frequency bins.
Just as the DFT is a projection of a signal onto the Fourier basis vectors, the Wavelet Transform is a projection of a signal onto a set of basis functions called wavelets. An advantage of the Discrete Wavelet Transform (DWT) over the DFT is that the DWT captures both the frequency content and the location of features in the input signal, allowing for better resolution in image processing. Additionally, the DWT is a faster operation than the FFT.
The DWT allows a choice of several wavelets to be used when performing the transform. We tried several, including Haar, Daubechies, and variations of the biorthogonal wavelet; we found that Haar performed best. The DWT is a recursive algorithm: it can be applied to an image several times in succession, and the number of applications is the level of the wavelet decomposition of the image. We chose the level that gave us a feature vector of similar size to the other feature extraction methods; for Haar, this was a level-3 decomposition.
At each level, the DWT produces four outputs: three detail coefficient matrices and one approximation coefficient matrix. The approximation coefficients are what we are interested in. We take the level-3 approximation coefficients (shown on the left), flatten them as before, and treat them as our feature vector. Notice that each level of the wavelet decomposition creates a coarser approximation of the image; we are relying on the wavelet transform to preserve the most important parts of the image.
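A sketch of this featurization using the PyWavelets library (an assumed implementation choice) follows. For a 200x200 input, the level-3 Haar approximation is 25x25, giving a 625-dimensional feature vector, roughly the size of the other methods' vectors.

```python
import pywt

def haar_dwt_features(gray, level=3):
    """Level-3 Haar decomposition; keep only the approximation coefficients."""
    # wavedec2 returns [cA_n, (cH_n, cV_n, cD_n), ..., (cH_1, cV_1, cD_1)]:
    # one approximation matrix plus three detail matrices per level.
    coeffs = pywt.wavedec2(gray, wavelet='haar', level=level)
    approx = coeffs[0]  # the level-3 approximation matrix
    return approx.flatten()
```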
We expected this method to be on par with the FFT method above; however, we were pleasantly surprised to find that the Haar wavelet achieved an accuracy comparable to SIFT and SURF.
A more robust, albeit more costly, method of featurizing an image involves a mixture of local feature detectors and unsupervised learning. Local feature detectors look for interesting locations within an image, called keypoints. Simple features like edges or corners can be detected using well-known algorithms such as the Canny edge detector or the Harris corner detector. The methods we examine go a step further: in addition to finding interesting points, they also provide a feature vector for each point describing why that point is important. This vector is called a descriptor.
This is not exactly what we are looking for: we need a way to represent each image as a single point, but these methods give us multiple points per image. Our solution is a bag-of-words model, with K-Means, an unsupervised learning algorithm, used to develop the bag of words. Bag of words is easiest to understand in the context of English sentences, where it is commonly used. The English language has a dictionary, the set of all possible words; for the purposes of this model, the definition of each word is irrelevant. When analyzing a sentence, the bag-of-words model records the multiplicity of each dictionary word, that is, how many times each word appears in the sentence of interest. In our image classification case, the keypoints detected by the local feature detectors play the role of words, and the dictionary is derived from these local features using K-Means, as we describe shortly. We simply count the multiplicity of each word (keypoint) in the sentence (image), and that count vector becomes our feature vector. However, it is not obvious how to decide whether two keypoints are the “same” so that we can count their multiplicity.
We use K-Means to do exactly that. K-Means is an unsupervised learning algorithm that clusters the given data into K clusters. We train it on a dataset to learn a dictionary: K-Means remembers how it clustered that dataset and can cluster new data the same way, so the K clusters become the K words in our dictionary. Each keypoint is assigned to one of these K clusters, allowing us to count cluster multiplicities in our bag-of-words model. But each keypoint is just a pixel location; how do we cluster those effectively? This is where the descriptors come in. K-Means does not cluster the keypoints directly, but rather clusters them based on their descriptors: the cluster to which a descriptor is assigned is the cluster to which the corresponding keypoint is assigned.
In summary, we use local feature detectors to find keypoints in an image along with their descriptors, which are vectors defining points in M-dimensional space, where M is the length of the descriptor vector. A set of these descriptors is used to learn how to cluster arbitrary keypoints into K clusters, so each keypoint is assigned to a cluster; this defines a dictionary of size K. We then use a bag-of-words model to create a feature vector for the image by counting the multiplicity of each cluster in that image. It should be noted that even though this method is slower than the previous methods, the delay when classifying images in real time is negligible.
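The sketch below shows the whole pipeline in Python with OpenCV and scikit-learn. SIFT stands in for any of the detectors discussed next; the dictionary size K = 100 and the `training_images` iterable are hypothetical placeholders.

```python
import numpy as np
import cv2
from sklearn.cluster import KMeans

K = 100  # dictionary size (hypothetical value for illustration)
sift = cv2.SIFT_create()

# 1. Collect descriptors from a set of training images (grayscale arrays).
all_descriptors = []
for img in training_images:  # assumed iterable of grayscale images
    _, desc = sift.detectAndCompute(img, None)
    if desc is not None:
        all_descriptors.append(desc)
all_descriptors = np.vstack(all_descriptors)

# 2. Learn the dictionary: each of the K clusters is one visual "word".
kmeans = KMeans(n_clusters=K, n_init=10).fit(all_descriptors)

# 3. Featurize an image: count how many of its descriptors land in each cluster.
def bow_features(img):
    _, desc = sift.detectAndCompute(img, None)
    if desc is None:
        return np.zeros(K, dtype=int)
    words = kmeans.predict(desc)
    return np.bincount(words, minlength=K)  # the bag-of-words histogram
```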
We explored three different ways of detecting local features: SIFT, SURF, and ORB. They are explained in more detail below.
The Scale-Invariant Feature Transform (SIFT) is the first local feature extraction method we explore. It was introduced by D. Lowe in 2004 as a method that added scale invariance to feature detection. Previous methods examined an image through fixed-size windows to detect features such as corners; this becomes troublesome when the scale of objects changes but the fixed-size window does not.
SIFT tries to solve this using the Laplacian of Gaussian, which acts as a blob detector at several different scales. Since the Laplacian of Gaussian is expensive to compute, SIFT approximates it with a Difference of Gaussians: the image is blurred with Gaussian kernels of varying sizes and adjacent blurred images are subtracted. SIFT searches through the resulting differences to find keypoints at different scales. Once a keypoint is found, SIFT records information about its orientation, scale, location, and neighborhood as a 128-dimensional feature vector; this is the descriptor for that keypoint.
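In OpenCV (version 4.4 and later, after the SIFT patent expired and the implementation moved into the main module), detection and description happen in one call; the file name below is a placeholder.

```python
import cv2

sift = cv2.SIFT_create()
gray = cv2.imread('example.jpg', cv2.IMREAD_GRAYSCALE)  # placeholder path
keypoints, descriptors = sift.detectAndCompute(gray, None)
# Each keypoint records location (kp.pt), scale (kp.size), and orientation
# (kp.angle); each row of descriptors is its 128-dimensional descriptor.
print(len(keypoints), descriptors.shape)  # N keypoints, shape (N, 128)
```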
On the left is an example of the types of keypoints SIFT finds in an image. Clearly, it is not just a simple edge or corner detector.
A major issue with SIFT is its speed. SURF (Speeded-Up Robust Features) was proposed by different authors two years later, in 2006, as an improved version of SIFT. To avoid the huge amount of processing necessary to calculate the Difference of Gaussians, SURF uses a box filter approximation of the Laplacian of Gaussian. While this is less accurate, it can be computed in parallel for different scales. SURF additionally uses wavelet responses at different scales to calculate the descriptor for each keypoint, which differs greatly from SIFT.
In theory, SURF is much faster than SIFT; however, for the small images we are using (200 x 200 pixels), the difference is not very noticeable. Additionally, the keypoints SURF generates are not as robust as SIFT's, which is why we expected this method not to perform as well as SIFT. We were surprised to find that it was our best performing method. On the left are the SURF keypoints for the same image shown for SIFT; clearly, SURF calculated many more features than SIFT. We believe this may have made SURF a more accurate feature extraction method despite its technically less robust features.
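For reference, SURF is exposed through OpenCV's contrib module and, being patented, requires a build with the non-free algorithms enabled; the Hessian threshold value below is illustrative.

```python
import cv2

# Requires opencv-contrib built with OPENCV_ENABLE_NONFREE.
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)  # threshold is illustrative
gray = cv2.imread('example.jpg', cv2.IMREAD_GRAYSCALE)  # placeholder path
keypoints, descriptors = surf.detectAndCompute(gray, None)
print(len(keypoints), descriptors.shape)  # SURF descriptors are 64-D by default
```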
ORB is the last local feature extraction method we explored. Unlike SIFT and SURF, which are patented algorithms, ORB was written by OpenCV contributors and is open source. It is meant as a fast and efficient alternative to SIFT and SURF, and in that respect it succeeds. ORB is often used in simultaneous localization and mapping algorithms (see ORB-SLAM) for robotics, where speed is especially important. To achieve these kinds of speeds, however, ORB sacrifices much of the robustness that SIFT and SURF provide. We noticed that SURF, which made the same trade-offs to a smaller degree, outperformed SIFT, so we wanted to see if that trend continued.
ORB itself relies on the FAST feature detector which, as its name implies, is a fast method of detecting corners. FAST examines a circle of pixels around a pixel of interest: if a sufficient number of consecutive pixels on that circle are all brighter (or all darker) than the center pixel by more than a threshold, the pixel of interest is considered a corner. This method is known as the segment test.
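FAST is available on its own in OpenCV; note that it only detects corners and produces no descriptors. The threshold value here is illustrative.

```python
import cv2

gray = cv2.imread('example.jpg', cv2.IMREAD_GRAYSCALE)  # placeholder path
# threshold: how much brighter or darker the circle pixels must be than
# the center pixel for the segment test to fire.
fast = cv2.FastFeatureDetector_create(threshold=20)
keypoints = fast.detect(gray, None)  # corners only; no descriptors
```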
To compute descriptors, ORB relies on a modified version of the BRIEF algorithm. BRIEF generates a binary descriptor, which makes it particularly memory efficient compared to other methods. ORB modifies BRIEF to address some of its deficiencies. The resulting descriptors may not be as robust as SIFT's or SURF's, but they excel when speed is a concern.
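ORB combines the two steps in one call; the sketch below also shows the compactness of its binary descriptors. The feature cap is illustrative.

```python
import cv2

orb = cv2.ORB_create(nfeatures=500)  # feature cap is illustrative
gray = cv2.imread('example.jpg', cv2.IMREAD_GRAYSCALE)  # placeholder path
keypoints, descriptors = orb.detectAndCompute(gray, None)
# Each descriptor is 256 bits packed into 32 uint8 values, far smaller
# than SIFT's 128 floats per keypoint.
print(descriptors.shape, descriptors.dtype)  # (N, 32) uint8
```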
We had high hopes for ORB, but were surprised to find that it was our worst performing algorithm. The image above gives a clue as to why: the keypoints generated by ORB are all corners or edges, as expected, and there are not many of them. These keypoints do not give the classifier enough information to accurately classify the image.