This Wiki Entry reviews the results of a month-long attachment at the Defence Science & Technology Agency (DSTA), detailing the process of progressing from basic Machine Learning theory to building, training and testing models. Recent breakthroughs in Machine Learning, be it AlphaGo, Deepfakes or sophisticated Object Detection models, largely use Deep Learning frameworks. The project aims to create an Object Detection model using Deep Learning to detect classes of our choice. The TensorFlow library and application programming interface (API) were used in conjunction with pre-trained models from the TensorFlow Detection Model Zoo on GitHub, and most of the work was done on Google Colaboratory. Different models have their own strengths and limitations, and finding the one best suited to your needs requires a certain degree of trial and error with model architectures and hyperparameters.
The project aims to create an Object Detection model with a Deep Learning framework that can detect objects of any class we choose, given sufficient training data. Resources used to create the model include GitHub repositories containing pre-trained models, TensorFlow libraries and application programming interfaces (APIs), LabelImg and Spyder.
The coding was done in Python on Google Colaboratory, utilising the free Graphics Processing Unit (GPU) to speed up training.
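As a small sanity check (not part of the project code itself), one can confirm that Colab has actually allocated a GPU before training:

```python
import tensorflow as tf

# Prints something like '/device:GPU:0' when a GPU is available, or '' otherwise
print(tf.test.gpu_device_name())
```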
Supplementary resources included YouTube videos and links to other online material that introduced concepts we had not yet learnt, as well as explanations from our mentors.
The videos gave us a rough understanding of Deep Learning before we dived into the theory behind such models and into how some architectures can be extended, for example from CNNs to Faster RCNNs, to achieve different goals.
To progress to Object Detection, we first started off with Image Classification. This would be beneficial in working towards our end goal, since Object Detection can be thought of as consisting of two steps, first locating an object and then recognising what the object is; in other words, it is an advanced form of Image Classification.
We built and trained a simple Image Classification model by following a step-by-step guide from an Artificial Intelligence (AI) camp last year. We then took a dataset (Google’s QuickDraw) and built an Image Classification model both with and without Transfer Learning, using a brief template our mentor provided us and by referencing other notebooks.
One of the challenges faced was splitting the entire dataset into portions for training, validating and testing our model. Initially, we were not aware of the pre-existing train_test_split function from sklearn that would handle the splitting for us, but with our mentor’s help we managed to split our dataset successfully.
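A minimal sketch of such a split with sklearn, using placeholder arrays and ratios rather than our exact values:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the QuickDraw arrays (X) and their labels (y)
X = np.random.rand(1000, 28, 28)
y = np.random.randint(0, 10, size=1000)

# First carve out a test set, then split the remainder into training and validation
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)
# 0.25 of the remaining 80% gives a 60/20/20 train/validation/test split
```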
Another challenge came up during Data Pre-processing. The QuickDraw “images” were numpy arrays in grayscale, but for Transfer Learning, the pre-trained VGG16 model we used expected inputs with 3 colour channels, because VGG16 was trained on the ImageNet database, which contains RGB images. This was fixed by converting each array to an image, resizing it as necessary and converting it to RGB before converting the image back to a numpy array.
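A rough sketch of that conversion using the Pillow library, assuming a target size of 64x64 (the actual size we used may differ):

```python
import numpy as np
from PIL import Image

def gray_array_to_rgb(gray_array, size=(64, 64)):
    """Convert a grayscale QuickDraw array into a resized 3-channel array for VGG16."""
    img = Image.fromarray(gray_array.astype(np.uint8))  # numpy array -> PIL image
    img = img.resize(size).convert('RGB')                # resize, then add 3 colour channels
    return np.array(img)                                 # back to a numpy array

example = gray_array_to_rgb(np.zeros((28, 28)))
print(example.shape)  # (64, 64, 3)
```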
Even though Image Classification had taken up the first two weeks, I think it is important that we tried our hand at this. Had we jumped straight to Object Detection, I think that I would have struggled more since the concepts of Data Pre-processing and Transfer Learning would have been new to me.
Finally, we moved on to Object Detection. Our mentor provided us with a template written for a different dataset, Pascal VOC, so we had to load in and split our own data. This involved uploading the data to the correct folder and specifying the correct directory, as the code would handle the actual splitting.
After creating our class labels, we then tried to start model training with a pre-built, pre-trained Common Objects in Context (COCO) model. However, as we found out from our mentor, the model we downloaded from GitHub was not configured for our dataset. Hence, we had to edit the configuration file (pipeline.config) to change the number of classes, the number of steps, the paths to our dataset and other model parameters where necessary.
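For illustration, the relevant parts of such a pipeline.config look roughly like the snippet below; the paths and values shown are placeholders rather than our actual configuration:

```
model {
  faster_rcnn {
    num_classes: 4        # our own classes instead of COCO's 90
    ...
  }
}
train_config {
  fine_tune_checkpoint: "pre-trained-model/model.ckpt"   # placeholder path
  num_steps: 200000                                      # placeholder value
  ...
}
train_input_reader {
  label_map_path: "annotations/label_map.pbtxt"          # placeholder path
  tf_record_input_reader {
    input_path: "annotations/train.record"               # placeholder path
  }
}
```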
Once the model finished training, which took one or two days, we exported it, saving the final version into a folder, and proceeded to Model Inference.
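Exporting with the TensorFlow Object Detection API is typically done with its export_inference_graph.py script, along these lines (the paths and the checkpoint number are placeholders):

```
!python export_inference_graph.py \
    --input_type image_tensor \
    --pipeline_config_path training/pipeline.config \
    --trained_checkpoint_prefix training/model.ckpt-110000 \
    --output_directory exported_model
```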
In Model Inference, our mentor provided us with code to visualise what our model was detecting, allowing us to compare the model’s detections against the ground truth (the images we annotated). When this was done, we either picked another model to train, typically a different type such as a Single Shot Detector (SSD) instead of a two-stage Faster RCNN, or changed some hyperparameters, e.g. the Learning Rate. This allowed us to compare different models and get a better understanding of the factors affecting detection accuracy and speed.
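Our mentor's visualisation code drew the predicted boxes onto the images; as a bare-bones sketch, inference on a TensorFlow 1.x frozen graph might look something like this (the paths and the confidence threshold are assumptions):

```python
import numpy as np
import tensorflow as tf
from PIL import Image

# Load the frozen graph exported earlier (tensor names follow the TF1 Object Detection API)
graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile('exported_model/frozen_inference_graph.pb', 'rb') as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

with tf.Session(graph=graph) as sess:
    image = np.expand_dims(np.array(Image.open('test_images/img_001.jpg')), axis=0)
    outputs = sess.run(
        {name: graph.get_tensor_by_name(name + ':0')
         for name in ['detection_boxes', 'detection_scores', 'detection_classes']},
        feed_dict={graph.get_tensor_by_name('image_tensor:0'): image})
    # Keep only detections above a confidence threshold before drawing boxes
    keep = outputs['detection_scores'][0] > 0.5
    print(outputs['detection_classes'][0][keep], outputs['detection_scores'][0][keep])
```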
For the Image Classification challenge, the model without Transfer Learning achieved a training accuracy of 0.7978, validation accuracy of 0.7962, training loss of 0.6768 and validation loss of 0.6966.
With Transfer Learning, the model reached a higher training accuracy faster, achieving a training accuracy of 0.9901, validation accuracy of 0.9293 and training loss of 0.0339, but the validation loss of 0.4077 was still quite high, which suggests some overfitting to the training set.
For Object Detection, I used Faster RCNN ResNet101 and SSD InceptionV2, both downloaded from GitHub.
For Faster RCNN ResNet101, these were the graphs of mAP against steps for each of the classes:
Towards the end, at about 110k steps, the mAP for all 4 classes went down. This was a possible point to stop model training, both to prevent overfitting on the training set and to save time.
These are some model predictions for images that were run through the Faster RCNN ResNet101 model:
(Ground Truth Left, Detections Right)
The Faster RCNN ResNet101 model was able to pick up a pair of glasses that I had missed during annotation. While it did wrongly classify the waist strap as a pen, I think the misclassification can be excused since the strap does bear some resemblance to a pen.
On that note, it is reminiscent of the well-known “Chihuahua or muffin?” example of how easily similar-looking objects can be confused.
Below are samples of the model prediction when some new images without any annotations were fed into the model:
Model wrongly classified an orange as glasses.
For SSD InceptionV2, these were the graphs of mAP against steps for each of the classes:
The graphs fluctuate a lot, showing that the model did not learn much.
These are some model predictions for images that were run through the SSD InceptionV2 model:
(Ground Truth Left, Detections Right)
The SSD InceptionV2 model, like me when I was annotating, missed the pair of glasses on the cup that the Faster RCNN ResNet101 model was able to pick up.
Below are samples of the model prediction when some new images without any annotations were fed into the model:
Model missed out the rightmost bird and generally has lower confidence percentages than the Faster RCNN ResNet101 model.
Model missed out bird in the centre.
Model missed out 4 birds whereas the Faster RCNN ResNet101 got all 6 birds in this image.
Conclusion:
Faster RCNN ResNet101 is generally more accurate and more confident in detecting and labelling objects. However, SSD InceptionV2 has the speed advantage, taking 4.47 seconds to run inference on 15 images whereas Faster RCNN ResNet101 took 65.2 seconds for the same 15 images. This is an important factor to take into consideration, especially for Real-Time Object Detection.
After both the Faster RCNN ResNet101 and the SSD InceptionV2 had finished training and inference, I re-trained the Faster RCNN ResNet101, decreasing the Learning Rate by a factor of 10.
For the re-trained Faster RCNN ResNet101 model, these were the graphs of mAP against steps for each of the classes:
For the re-trained Faster RCNN ResNet101 model, here are some model predictions for images that were run through it:
(Ground Truth Left, Detections Right)
Below are samples of the model prediction when the same new images without any annotations were fed into the model:
Model no longer detects orange as glasses
A decreased learning rate results in a longer training time but can produce a model with more optimal weights. However, when comparing this re-trained model with the first Faster RCNN ResNet101 model, the final mAP for Pen and Pencil appears to be lower than in the first version, although the model predictions for the images are similar.
Nonetheless, the performance of this re-trained model is still better than that of the first SSD InceptionV2 model.
After this, I re-trained the SSD InceptionV2, decreasing the Learning Rate by a factor of 10, decreasing the IOU threshold from ~0.6 to ~0.4 and increasing the input image size from 300x300 to 600x1024. As each 100 steps took 250-300 seconds, I monitored the graphs of mAP against steps while the model trained, and chose to stop training after 10,300+ steps since the graphs were levelling off.
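These changes were made in the SSD's pipeline.config, roughly along the lines below; the original values in the comments are approximate and the exact field layout depends on the model's configuration file:

```
model {
  ssd {
    image_resizer {
      keep_aspect_ratio_resizer {      # originally a fixed_shape_resizer at 300x300
        min_dimension: 600
        max_dimension: 1024
      }
    }
    post_processing {
      batch_non_max_suppression {
        iou_threshold: 0.4             # lowered from ~0.6
      }
    }
  }
}
train_config {
  optimizer {
    rms_prop_optimizer {
      learning_rate {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.0004   # a tenth of the original value
        }
      }
    }
  }
}
```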
For the re-trained SSD InceptionV2 model, these were the graphs of mAP against steps for each of the classes:
For the re-trained SSD InceptionV2 model, here are some model predictions for images that were run through the re-trained model:
(Ground Truth Left, Detections Right)
Model manages to detect the pair of glasses but the confidence is not as high as Faster RCNN ResNet101
Below are samples of the model prediction when the same new images without any annotations were fed into the model:
Model manages to detect rightmost bird
Model still misses the bird in the centre; possibly an issue with the scale difference
Model detects 5 out of 6 birds, which is not bad considering the detection speed and that the last bird only has its tail sticking out
Conclusion:
This re-trained SSD InceptionV2, with a decreased learning rate, a decreased IOU threshold and a larger input size, does seem better than the first SSD model. While not as accurate as the Faster RCNN ResNet101, it is faster, taking 18.1 seconds to run inference on 15 images compared to 65.2 seconds for the Faster RCNN ResNet101. However, this re-trained model does take longer to run inference than the first SSD InceptionV2 model, which took 4.47 seconds for the same 15 images.
1. Data pre-processing
In either Image Classification or Object Detection, data has to be pre-processed.
In the case of Image Classification, I learnt how to split data into training, validation and testing sets using the sklearn library, and how to convert grayscale images in the form of numpy arrays to RGB form.
In the case of Object Detection, I learnt how to annotate images using LabelImg and how to feed the images and their corresponding annotations in the form of .xml files into the model.
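Each annotation saved by LabelImg is a Pascal VOC-style .xml file; a trimmed example is shown below, with an illustrative class name and made-up coordinates:

```xml
<annotation>
  <folder>images</folder>
  <filename>img_001.jpg</filename>
  <size>
    <width>640</width>
    <height>480</height>
    <depth>3</depth>
  </size>
  <object>
    <name>glasses</name>
    <bndbox>
      <xmin>120</xmin>
      <ymin>80</ymin>
      <xmax>260</xmax>
      <ymax>150</ymax>
    </bndbox>
  </object>
</annotation>
```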
2. Training and evaluating the model
For Image Classification, I learnt the basics of Transfer Learning, using pre-trained models that are already decent at picking up edges and curves in images and adapting them to our dataset. Since picking up edges and curves is quite similar across images, the model can focus on tuning the weights and biases of the last few layers leading up to the output. This idea of Transfer Learning is also implemented in the Object Detection model.
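As a rough sketch of this idea in Keras, assuming a VGG16 base, a 64x64 input size and 10 QuickDraw classes (all of these are placeholder choices, not our exact setup):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Load VGG16 pre-trained on ImageNet, without its original classification head
base = VGG16(weights='imagenet', include_top=False, input_shape=(64, 64, 3))
base.trainable = False  # freeze the convolutional layers that already pick up edges and curves

# Only the new layers leading up to the output are trained on the QuickDraw data
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```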
For Object Detection, I learnt how to analyse the graph of mAP against steps to see when the model is no longer improving and stop the training. I also learnt how to do Model Inference, putting the trained model to a “real world” test by feeding in new images not in the dataset.
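On Colab, one common way to watch these mAP curves while the model trains is TensorBoard pointed at the training directory (the directory name here is a placeholder):

```
%load_ext tensorboard
%tensorboard --logdir training/
```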
3. Debugging techniques
When building a Deep Learning model, carelessness easily leads to bugs, and since no human is perfect, the program does not always run as expected or error-free. Seeing an error message can be very frustrating, especially if the bug persists despite multiple attempts to fix it.
However, the bug will not fix itself, so certain strategies have to be used when debugging, such as reading the error message and traceback carefully, checking that file paths and data shapes match what the code expects, and asking our mentors when we were stuck.
1. Seeing the model train successfully
Having the model successfully start training was an indication that all the code leading up to the actual training, such as handling and feeding in the data, was error-free. Seeing the loss and steps reported as the model trained, albeit slowly, was particularly rewarding, as was seeing some of the model predictions, since they were concrete proof of a working model.
2. How much libraries and APIs help in model building, training, testing and saving checkpoints
When I first started the project, I thought I would have to code most things from scratch. I knew of the TensorFlow library beforehand, but I did not know that it would package everything together, allowing me to start model training with a single line of code. I assumed that there would be separate functions or Python programs for each part of the Deep Learning model, such as the neural network itself, the loss calculation, the optimizer, the calculation of mAP and IOU and so on, which we would have to assemble ourselves. Having the library handle this was a relief, as we could focus more on analysing the results and tuning hyperparameters instead of implementation.
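For example, with the TensorFlow 1.x Object Detection API, the single command that kicks off training on Colab looks roughly like this (the paths are placeholders):

```
!python model_main.py \
    --pipeline_config_path=training/pipeline.config \
    --model_dir=training/ \
    --alsologtostderr
```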
I have learnt about Time Management. Work at DSTA starts and ends at the usual office hours, 0830-1800. Accounting for the travel time to and from DSTA, about an hour and 15 minutes each way, that leaves about 12 hours at home. Assuming a 7-hour sleep duration, I would have about 5 hours left. This has taught me to better manage my time both at home and during office hours, as I cannot possibly keep working at home without breaks. For instance, I do parts of the Daily Log and bits of the Wiki report while waiting for the model to finish training, so that my workload at home is not as heavy, giving me more time to rest.
On the other hand, work at DSTA has been enjoyable, made possible by friends, whether from TJC or other schools, and by kind mentors, and the WOW! programme seems to have passed by faster than expected.
From left: Mentors Juncheng and Jing Lun, Myself, Carrie