Hand Tool Recognition and Classification

Brian Cesar-Tondreau
Alfred Mayalu
Joshua Moser

Fall 2015 ECE 5554/4984 Computer Vision: Class Project
Virginia Tech



    The first step in learning to work in a shop or lab is becoming familiar with the tools. This raised the question of whether a wearable device could let a person look around the lab and automatically be told which tools they are looking at. As a proof of concept, we wanted to see whether we could write software that recognizes and classifies a given set of hand tools using feature descriptors such as SIFT and a machine learning classifier such as an SVM. We found that the best results came from using SIFT in conjunction with a border angle feature space. However, even with a model whose accuracy was above 70%, some tools were still classified incorrectly. The results suggest that accuracy, and therefore the performance of the algorithm, will increase if the right feature spaces are chosen to work in conjunction with one another.


Teaser Figure


    The motivation for this project is to explore how to recognize tools on a peg board in a lab using computer vision and machine learning. The scenario is a new member coming into the lab who does not know what each tool is. They would be able to pick a tool name, and the system would highlight where each tool of that type is located on the board. It could also display information on the correct way to hold the tool and the different tasks it can perform. For this project, we compared different configurations of three feature spaces to determine the best method for recognizing the hand tools: SIFT, tool border angles, and the black and white (binary) properties of filled area/perimeter, eccentricity and solidity.



1. The first step was to build a database for training and testing so we could build a model to describe each tool type out of the different feature spaces we were going to compare.

    We found most of our tool training images online through Google Images. The rest of the tool images were taken in the Mechatronics lab. All of the non-tool images were taken from the negative dataset that came with Dr. Andrea Vedaldi's code and examples [7].

2. Extract local features from both the training and test images independently.

    To extract the SIFT features from the images, the dense SIFT function vl_phow from the VLFeat library [6] was used. It was called inside another function created by Dr. Vedaldi. This function made sure that all of the images were a standard size by resizing them with MATLAB's imresize, and then called vl_phow to find the SIFT features.

    To extract the tool shape features, we wrote a program that uses MATLAB's Canny edge detection to find the edges of the tool. The gradient direction at each edge pixel was taken and binned every five degrees, creating a 72-bin histogram that tallies how often each range of angles occurred. This provided some information about the tool's shape.
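
    The border angle histogram can be sketched as follows. This is a minimal, dependency-free Python illustration, not the project's MATLAB code: the function name and the edge_threshold parameter are hypothetical, and a simple gradient magnitude threshold stands in for Canny edge detection.

```python
import math


def border_angle_histogram(gray, edge_threshold=0.25, bin_width_deg=5):
    """Histogram of gradient directions at edge pixels (72 bins for 5 degrees).

    gray is a 2-D list of float intensities. A gradient magnitude threshold
    is used here as a crude stand-in for Canny edge detection (assumption).
    """
    n_bins = 360 // bin_width_deg
    hist = [0] * n_bins
    rows, cols = len(gray), len(gray[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central differences approximate the image gradient.
            gx = (gray[r][c + 1] - gray[r][c - 1]) / 2.0
            gy = (gray[r + 1][c] - gray[r - 1][c]) / 2.0
            if math.hypot(gx, gy) < edge_threshold:
                continue  # weak gradient: not treated as an edge pixel
            angle = math.degrees(math.atan2(gy, gx)) % 360.0
            # The trailing modulo guards against angle rounding up to 360.
            hist[int(angle // bin_width_deg) % n_bins] += 1
    return hist
```

    Each image yields one fixed-length vector regardless of its size, which is what lets the angle features be clustered and compared across tools.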

    The binary (black and white) features of each tool in the image (filled area/perimeter, eccentricity and solidity) were found using MATLAB's region properties function, regionprops. These results were stored per image to be grouped and analyzed as another feature space.
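
    To make two of these binary features concrete, here is a rough Python sketch. It is an assumption-laden illustration rather than a reimplementation of MATLAB's region properties: the perimeter is simply the count of foreground pixels that touch the background, eccentricity comes from the second central moments of the blob, and solidity is left out because it needs a convex hull. The function name is hypothetical, and the mask is assumed to hold a single, already filled blob.

```python
import math


def binary_shape_features(mask):
    """Area/perimeter ratio and eccentricity of the single blob in mask.

    mask is a 2-D list of 0/1 values with at least one foreground pixel.
    """
    rows, cols = len(mask), len(mask[0])
    pixels = [(r, c) for r in range(rows) for c in range(cols) if mask[r][c]]
    area = len(pixels)

    def is_background(r, c):
        return r < 0 or r >= rows or c < 0 or c >= cols or not mask[r][c]

    # Crude perimeter estimate: foreground pixels with a 4-neighbour outside.
    perimeter = sum(
        1
        for r, c in pixels
        if any(is_background(r + dr, c + dc)
               for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))
    )

    mean_r = sum(r for r, _ in pixels) / area
    mean_c = sum(c for _, c in pixels) / area
    # Second central moments; 1/12 accounts for each pixel's unit extent.
    mu_rr = sum((r - mean_r) ** 2 for r, _ in pixels) / area + 1.0 / 12.0
    mu_cc = sum((c - mean_c) ** 2 for _, c in pixels) / area + 1.0 / 12.0
    mu_rc = sum((r - mean_r) * (c - mean_c) for r, c in pixels) / area

    # Eigenvalues of the 2x2 moment matrix give the squared axis scales.
    spread = math.sqrt(((mu_rr - mu_cc) / 2.0) ** 2 + mu_rc ** 2)
    major = (mu_rr + mu_cc) / 2.0 + spread
    minor = (mu_rr + mu_cc) / 2.0 - spread
    eccentricity = math.sqrt(max(0.0, 1.0 - minor / major))

    return {
        "area": area,
        "perimeter": perimeter,
        "area_over_perimeter": area / perimeter,
        "eccentricity": eccentricity,
    }
```

    A compact blob such as a square gives an eccentricity near 0, while a long thin blob approaches 1.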

3. We then created three independent bags of words, using k-means clustering, for the SIFT, angle, and black and white features respectively, and then described each image in terms of those words. We used three independent bags of words because the different features could exhibit different trends when it came to grouping.

    Once the bags of words were generated, we normalized the words and described each training and test image in terms of those words, using Euclidean distance to determine which word went with each descriptor in the image. A histogram of all the words present was then made for each image.
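
    The bag-of-words step can be sketched as below. This is an illustrative, standard-library Python version under stated assumptions, not the project's code: the function names are hypothetical, k-means is plain Lloyd's algorithm with random initial centers, and the histogram is normalized to sum to 1, which is one possible reading of the normalization described above.

```python
import math
import random


def nearest_word(descriptor, words):
    """Index of the visual word closest to descriptor (Euclidean distance)."""
    return min(range(len(words)), key=lambda k: math.dist(descriptor, words[k]))


def kmeans(descriptors, k, iterations=20, seed=0):
    """Cluster descriptors into k visual words with Lloyd's algorithm."""
    rng = random.Random(seed)
    words = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest_word(d, words)].append(d)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster went empty
                words[i] = [sum(col) / len(members) for col in zip(*members)]
    return words


def word_histogram(image_descriptors, words):
    """Describe one image as a normalized histogram over the visual words."""
    hist = [0.0] * len(words)
    for d in image_descriptors:
        hist[nearest_word(d, words)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

    In this sketch the vocabulary would be built once from the training descriptors of one feature space, for example words = kmeans(train_descriptors, 100), and then reused unchanged to describe both training and test images.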

4. The next step was to feed each training image, with its histogram of words and its label, into a linear SVM.

    We used VLFeat's SVM (vl_svm), which is a linear binary SVM. We trained a separate SVM for each of the eight tool types we would have in our test data. Each SVM model was saved so it could be used to predict what type of tool each test image contained.

5. After the models were created in the above step they were used to determine if a test image contained the type of tool a user specified.

    The weight and bias values from the SVM were used to determine how likely it is that the histogram of words for a given test image represents the tool the model was trained to find.
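
    For a linear SVM, this scoring reduces to a dot product plus the bias. The short Python sketch below shows the idea; the function names are hypothetical, and the weights and bias are assumed to come from an already trained binary model.

```python
def svm_score(histogram, weights, bias):
    """Linear SVM decision value: higher means more likely the target tool."""
    return sum(w * h for w, h in zip(weights, histogram)) + bias


def rank_by_score(histograms, weights, bias):
    """Score every image and return (scores, indices sorted best first)."""
    scores = [svm_score(h, weights, bias) for h in histograms]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return scores, order
```

    A positive score places the image on the tool's side of the decision boundary, and sorting by score gives the ranking that a precision and recall evaluation needs.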

6. After all of the test image histograms were classified using our previously generated model, the precision and recall curve was computed to determine how well each feature configuration and model performed in general.

    To compute precision and recall, the VLFeat function vl_pr was used. The results were displayed using another one of Dr. Vedaldi's functions, "display ranked images", which displayed the test images in descending order of their similarity scores against the trained model and plotted the precision and recall graph.
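
    The following Python sketch shows how a precision and recall curve and a single summary number can be derived from ranked scores. It is an illustration with hypothetical function names, and it sums precision over the recall steps, which is one common estimate of the area under the curve and may differ slightly from the value vl_pr reports. Labels are assumed to be +1 for the target tool and -1 otherwise, with at least one positive present.

```python
def precision_recall(labels, scores):
    """Precision and recall after each position of the score-sorted ranking."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_positives = sum(1 for label in labels if label > 0)
    precision, recall = [], []
    true_positives = 0
    for rank, i in enumerate(order, start=1):
        if labels[i] > 0:
            true_positives += 1
        precision.append(true_positives / rank)
        recall.append(true_positives / total_positives)
    return precision, recall


def average_precision(labels, scores):
    """Area under the precision-recall curve as a step-wise sum."""
    precision, recall = precision_recall(labels, scores)
    area, previous_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        area += p * (r - previous_recall)
        previous_recall = r
    return area
```

    A ranking that places every positive image ahead of every negative one scores 1.0, and each negative ranked above a positive lowers the value.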

7. To test the images taken from the webcam in the Mechatronics lab a function to segment out the tools into individual images had to be written.

    This function converted the RGB image to black and white. Edge detection was then performed on the image to find all of the tool edges. After finding the edges, dilation with a disc-shaped element was used to close the gaps in the edges so that they formed complete outlines around each tool. The MATLAB function imfill was then used to fill in the outlines so that they became solid blobs. Next, all of the small "noise" blobs around the individual tools were removed from the image using the MATLAB function bwareaopen. The MATLAB function regionprops was then used to find the bounding area of each blob, and a rectangle was fitted around each blob. This rectangle was used to cut the corresponding pixels from the RGB image and place them on a white background. This successfully segmented each tool from the others so that it could be run through the testing process described in step 6.
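
    The last stages of this segmentation (noise removal, bounding boxes and cropping onto a white background) can be sketched in plain Python as follows. This is a simplified illustration with hypothetical function names: it assumes the edge detection, dilation and hole filling have already produced a binary mask, it uses 4-connected blobs, and it works on a single-channel image for brevity.

```python
from collections import deque


def find_blobs(mask, min_area=3):
    """Bounding boxes (top, left, bottom, right) of 4-connected blobs.

    Blobs with fewer than min_area pixels are discarded as noise.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first flood fill collects one connected blob.
            queue, blob = deque([(r0, c0)]), []
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                blob.append((r, c))
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and mask[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(blob) >= min_area:
                blob_rows = [r for r, _ in blob]
                blob_cols = [c for _, c in blob]
                boxes.append((min(blob_rows), min(blob_cols),
                              max(blob_rows), max(blob_cols)))
    return boxes


def crop_on_white(image, box, white=255):
    """Copy the pixels inside box onto a white canvas of the image's size."""
    top, left, bottom, right = box
    canvas = [[white] * len(image[0]) for _ in range(len(image))]
    for r in range(top, bottom + 1):
        for c in range(left, right + 1):
            canvas[r][c] = image[r][c]
    return canvas
```

    Calling crop_on_white once per box returned by find_blobs gives one image per tool, each of which can then be classified on its own.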


Experiments and Results

    To do a comprehensive comparison between the three feature spaces and all of their combinations, our overall algorithm was run multiple times with different features enabled. When the SIFT feature was enabled, we used 100 centers for the k-means clustering. We found that this seemed to give us the best results without underfitting or overfitting the data. For the border angle feature space we used 41 centers, and for the binary feature space we used only 30 centers. As with the SIFT features, these numbers were settled on because they seemed to provide the best results. Each time the SVM was used, we chose a C value of 1E-11. We found that in general this provided the best results for all of the configurations.
    As mentioned before, the SVM that we used was binary. That means the model for each tool type we were looking for had to be trained separately. For hammers we had a total of 89 positive images. We used 52 to train the SVM and 37 to test it. For the needle nose pliers we had 103 positive images. We used 70 of them to train and the other 33 to test. In each case we used the same number of negative images as positive images. We tried using both tool and non-tool images as our negatives, but we found that non-tool negatives gave us better results because they added images without white backgrounds to the bag of words and the SVM.

    To evaluate our approach we used a precision and recall metric. We took the area under the curve as an accuracy measurement of how well the model was doing overall. Below are two examples of this, "SIFT & Border Angle Feature Results" and "Binary & Border Angle Feature Results". The first is one of the better models, where we used both the SIFT and border angle features. It returned some of the best results for the precision and recall curve, though it still had some trouble with hammers, as can be seen when we tested the model on our actual webcam images. The second set of results below uses the binary and border angle features. These results were among our worst; however, even our worst model did better than a naive approach that made random decisions. The results for the rest of the feature configurations can be seen in Appendix A at the bottom of the webpage.

    The accuracy of every configuration and of the random approach can be seen below in the table labeled Comprehensive Accuracy Table. This shows how well each of the feature space configurations performed overall.

    Some trends appeared while we were running the different configurations. In general, hammers performed worse than the needle nose pliers. Hammers especially had trouble with the border angle features: the models would often recognize saws as hammers. We believe this is because saws, like hammers, have many straight borders. This would make hammers and saws look very similar in the border angle feature space.

    Another trend we saw was that the binary features were a hindrance in most cases. After looking through the data, it appears this is because there were not many clear trends in these features separating the different tool types. As a result, accuracy decreases when this feature space is used.

    The results are what we expected to see. In preliminary testing we saw that all of the binary feature values were very similar. Because of this, we expected the binary properties to hinder the SVM models' ability to tell the difference between tools. We expected both the SIFT and the border angle features to help the overall accuracy, since we were seeing some variation in these feature spaces between the different tool types. It also makes sense that the needle nose pliers work well, since they are more distinct from most of the other tools. One thing that did surprise us was that the hammer recognition did not have a higher accuracy. We expected the head of the hammer to provide a somewhat unique shape for the algorithm to pick up on, but the results show that this was not the case. We think that even though the overall shape was different, the hammer shared a lot of texture with many of the other tools, so the SIFT features did not distinguish it well from them.

    Even when we had a good model, we were still missing some of the correct tools. We expected to see this since we do not have a very large data set. If we had more positive examples for each tool type, we would expect the results on our actual webcam test images to track each model's accuracy score more closely.


Qualitative results

Example of Webcam Input and Output

Example of Test Image Inputs

SIFT & Border Angle Feature Results

Binary & Border Angle Feature Results

Comprehensive Accuracy Table

Conclusion and Future Work

    The results from the comprehensive comparison of these three feature spaces show that using SIFT alone works to a certain extent, but that adding the feature space of the tool border gradient directions increases the accuracy. However, adding the binary feature space decreases the accuracy. Even with the increased accuracy we cannot detect the right tools every time, but the comparison did show that if we add enough of the right feature spaces we might be able to get close to detecting every tool correctly.

In future work, we would like to make this a real-time system instead of needing the 10 seconds it takes now. We would do this by streamlining our algorithm to use the GPU and by running our classification concurrently to speed up the response. We would also like to move our algorithm to a more portable programming language such as Python or Java and use image processing libraries such as VLFeat and OpenCV. In addition, we would like to implement a multi-class SVM instead of comparing our test images against several separately trained binary SVM models. A final thing we would like to do is enlarge our data sets. With these improvements, our overall accuracy should improve, as should the user interface of our implementation.


1: Bruno, Alessandro, Luca Greco, and Marco La Cascia. "Object Recognition and Modeling Using SIFT Features." Advanced Concepts for Intelligent Vision Systems Lecture Notes in Computer Science (2013): 250-61. Web.
2: Chen, Xiangrong, and A. L. Yuille. "A Time-Efficient Cascade for Real-Time Object Detection: With Applications for the Visually Impaired." 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Workshops (2005): n. pag. Web.
3: Coates, Adam, Paul Baumstarck, Quoc Le, and Andrew Y. Ng. "Scalable Learning for Object Detection with GPU Hardware." 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems (2009): n. pag. Web.
4: Hinterstoisser, Stefan, Stefan Holzer, Cedric Cagniart, Slobodan Ilic, Kurt Konolige, Nassir Navab, and Vincent Lepetit. "Multimodal Templates for Real-time Detection of Texture-less Objects in Heavily Cluttered Scenes." 2011 International Conference on Computer Vision (2011): n. pag. Web.
5: Siddharth Batra. "Multi-Class Object Recognition Using Shared SIFT Features." ResearchGate. N.p., n.d. Web. 20 Nov. 2015.
6: Vedaldi, A., and B. Fulkerson. "VLFeat: An Open and Portable Library of Computer Vision Algorithms." 2008.
7: Vedaldi, Andrea, and Andrew Zisserman. "Image Classification." Image Classification. N.p., 2011. Web. 21 Nov. 2015.


Appendix A

Comprehensive Feature Space Results:


© Brian Cesar-Tondreau, Alfred Mayalu and Joshua Moser