Executive summary/Abstract
I spent the month of January at the Defence Science and Technology Agency (DSTA) for the WOW program. Over this one month, I completed models for image classification and object detection using deep learning. Undoubtedly, I have gained a much better understanding of deep learning, from what convolutional neural networks (CNNs) are to coding them manually, even with my own data sets.
Background information of the projects / tasks which you were involved in:
We were given the task of creating models for image classification and object detection using Google Colab and TensorFlow (in Python). Image classification and object detection fall under deep learning, a type of machine learning whereby data is first fed into the model to train it, before the model is tested with data it has never seen before. If the model is still able to correctly identify the object, it is deemed to have 'learnt' what that object is.
Elaboration / record of the activities done
7 - 9 January 2020
Watching videos and reading up on resources on deep learning, image classification and object detection to gain an overall better understanding
10 - 15 January 2020
Coding an image classifier with and without transfer learning, using the Quickdraw data set
16 - 17 January 2020
Consolidation meeting + Finished annotations for object detection data set
20 - 23 January 2020
Coding and training the Faster RCNN ResNet101 COCO object detection model, exporting and visualizing the model
29 - 31 January 2020
Coding and training the SSD ResNet50 FPN COCO object detection model, exporting and visualizing the model
To start off our learning, we were provided with a Google Doc titled "Resources". Contained in this document was a long list of video links, Google Colab introduction links, data sets and models. Since my team and I were fairly new to deep learning as a whole, our mentor wanted us to look through all of those links and watch the videos, giving us a 'crash course' in deep learning. Some of those videos were hour-long lectures from the Massachusetts Institute of Technology and Stanford University, highlighting the immense complexity and depth of this topic. This summed up my first few days in DSTA as I familiarized myself with deep learning, image classification and object detection, reading through many lines of code and trying to understand them as well.
Armed with new knowledge and information (albeit still filled with doubts and questions), we progressed to the next stage of our internship - coding an image classifier using the Quickdraw data set. We would first code and train the model manually, followed by using transfer learning. Unfortunately, we only realized why this was posed as a challenge by our mentor when we tried to do the data processing - we encountered numerous difficulties and could not simply copy from other code. For example, the Quickdraw images were already NumPy arrays, so we did not have to convert them, unlike what our reference code assumed. Additionally, splitting the images with a train-validation split proved to be yet another big headache given our unfamiliarity with TensorFlow. Since the images were grayscale (1 channel), we also had to convert them into RGB (3 channels), something not accounted for in the reference code either. Ultimately, although we had reference code to guide us, most of it was not applicable to our case. This left us very confused and lost as we struggled to code the image classifier by ourselves using TensorFlow.
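To give a sense of the kind of preprocessing involved, below is a minimal sketch of how Quickdraw arrays could be normalized, converted to 3 channels and split. The file name, class index and split ratio are placeholders for illustration, not the exact ones we used.

```python
import numpy as np

# Hypothetical Quickdraw file: each .npy file holds N flattened 28x28 grayscale drawings.
drawings = np.load("full_numpy_bitmap_cat.npy")                      # shape: (N, 784)
images = drawings.reshape(-1, 28, 28, 1).astype("float32") / 255.0   # normalize to [0, 1]
labels = np.zeros(len(images), dtype="int32")                        # e.g. class index 0 for this class

# Convert grayscale (1 channel) to RGB (3 channels) by repeating the channel,
# since pre-trained models such as VGG16 expect 3-channel input.
images_rgb = np.repeat(images, 3, axis=-1)

# Simple train-validation split (80/20) after shuffling.
idx = np.random.permutation(len(images_rgb))
split = int(0.8 * len(idx))
train_x, val_x = images_rgb[idx[:split]], images_rgb[idx[split:]]
train_y, val_y = labels[idx[:split]], labels[idx[split:]]
```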
After reshaping the data set, normalizing it, splitting it and much more, we were finally done with the data pre-processing portion. Following that, we had to create a data pipeline, the model itself (the convolutional neural network), train the model, and finally, evaluate the model. Countless difficulties and bugs were encountered throughout the project and it was a constant cycle of debugging, facing bugs, debugging and facing bugs again. Finally, with much assistance from our mentor, we managed to complete the image classifier without transfer learning.
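Continuing from the arrays above, a rough sketch of the data pipeline and a simple CNN in TensorFlow might look like the following; the architecture and hyperparameters here are illustrative rather than the exact ones we used.

```python
import tensorflow as tf

# Build an input pipeline from the preprocessed arrays.
train_ds = (tf.data.Dataset.from_tensor_slices((train_x, train_y))
            .shuffle(10_000)
            .batch(64)
            .prefetch(tf.data.experimental.AUTOTUNE))
val_ds = tf.data.Dataset.from_tensor_slices((val_x, val_y)).batch(64)

# A small CNN: convolution + pooling blocks, then a fully connected classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(20, activation="softmax"),   # 20 Quickdraw classes
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(train_ds, validation_data=val_ds, epochs=5)
```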
Moving on, we focused on altering our code to include transfer learning this time, choosing to use the VGG16 pre-trained model. Transfer learning is a machine learning method whereby a pre-trained model (one previously trained on a different data set) is used as the starting point for a new data set or task. This is meant to increase the accuracy and performance of the model as a whole, minimizing loss, since the model has already learnt to recognize basic features such as the edges and corners of images. For example, if I have taught a child to draw birds before, it would be much easier to teach her how to draw flowers next, compared to a child who has never drawn anything before.
In theory, this should have been easier since we had already built a model and could reuse a large portion of our first code for this next project. In reality, it was not that easy at all. Given our inexperience, we were very unsure of how to import VGG16, how to adapt it to our code, how to freeze some of the layers first, and so on. Even after completing the first model and believing it had been successful, I found out (while doing up my second one) that the first code had several bugs that had escaped my attention. Going back and forth between two different projects to try and correct all the errors turned out to be yet another problem I had to face.
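As a hedged illustration, importing and freezing VGG16 in Keras could look roughly like the sketch below; the input size, head layers and class count are assumptions, and the 28x28 Quickdraw images would need to be resized to at least 32x32 for VGG16 to accept them.

```python
import tensorflow as tf

# Load VGG16 pre-trained on ImageNet, without its original classification head.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(32, 32, 3))
base.trainable = False   # freeze the pre-trained convolutional layers

# Add a small classification head on top of the frozen base.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(20, activation="softmax"),   # 20 Quickdraw classes
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```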
This is the model with the input layer, hidden layers (convolutional layers) and output layer.
After completing both models for image classification, our mentors gave us a quick lesson on deep learning and image classification, clearing up our remaining doubts and answering all of our questions. While it was still a little difficult to wrap my head around this, I felt a lot more confident having finished both models.
Having finished the two image classification models, I could move on to object detection next. For this task, we built our own data set using images from Google, annotating and labeling each object in every image by hand using LabelImg. I chose 5 different classes to focus on - monitor, light, document, whiteboard and chair. For each class, I had 1000 or more objects, except for 'whiteboard', for which I had around 450 objects (most images only had 1 or 2 whiteboards, so it was more difficult to get the quantity required). This brought to attention an interesting fact about deep learning and machine learning today - human labor is still very much required to form the fundamentals of deep learning and provide the valuable data needed by machines.
First, I used Faster RCNN ResNet101 COCO, a two-stage object detection model. After setting up TensorFlow, the COCO API and several other directories, the data and annotations had to be split into training and validation sets. Both sets then had to be converted to TFRecords, the format the TensorFlow object detection pipeline reads. Finally, the model was downloaded from the GitHub model zoo and TensorBoard was launched before training could begin. Every 100 steps took around 28 seconds on average, so the model was left to train overnight to reach the desired total number of steps, which took many hours. The graphs of the mean average precision (mAP) for each class could be viewed through TensorBoard, along with the total loss. After ending the training at checkpoint 118726, the results could be visualized using a separate piece of code.
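For illustration, a rough sketch of how one annotated image might be packed into a TFRecord is shown below. The feature keys follow the TensorFlow Object Detection API's usual convention, while the file paths, box coordinates and class IDs are placeholders.

```python
import tensorflow as tf

def make_example(encoded_jpg, height, width, xmins, xmaxs, ymins, ymaxs,
                 class_names, class_ids):
    """Pack one image and its (normalized) bounding boxes into a tf.train.Example."""
    feature = {
        "image/height": tf.train.Feature(int64_list=tf.train.Int64List(value=[height])),
        "image/width": tf.train.Feature(int64_list=tf.train.Int64List(value=[width])),
        "image/encoded": tf.train.Feature(bytes_list=tf.train.BytesList(value=[encoded_jpg])),
        "image/format": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"jpeg"])),
        "image/object/bbox/xmin": tf.train.Feature(float_list=tf.train.FloatList(value=xmins)),
        "image/object/bbox/xmax": tf.train.Feature(float_list=tf.train.FloatList(value=xmaxs)),
        "image/object/bbox/ymin": tf.train.Feature(float_list=tf.train.FloatList(value=ymins)),
        "image/object/bbox/ymax": tf.train.Feature(float_list=tf.train.FloatList(value=ymaxs)),
        "image/object/class/text": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[n.encode() for n in class_names])),
        "image/object/class/label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=class_ids)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Hypothetical usage: one image containing a single 'chair' box.
with tf.io.TFRecordWriter("train.record") as writer:
    encoded = tf.io.read_file("images/room_001.jpg").numpy()
    example = make_example(encoded, 480, 640, [0.1], [0.4], [0.2], [0.9], ["chair"], [5])
    writer.write(example.SerializeToString())
```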
Despite this seeming quite straightforward and simple, I still encountered a number of difficulties, especially in the data processing stage. One such example came from a recurrent error I could not resolve - the input was supposed to be 4-dimensional. I initially thought the error referred to a PNG image having 4 channels instead of 3 like a JPG. However, even after converting all my images to 3 channels, the error persisted. Thankfully, with the help of my mentor, we realized the error came down to a particular image that had 0 width and 0 height. While we were rather confused as to how this came to be, we simply deleted the picture and updated the data set, thereby resolving the issue.
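In hindsight, a small sanity-check script like the one below could have caught the zero-sized image (and any leftover 4-channel PNGs) before training; the directory path is a placeholder.

```python
import os
from PIL import Image

IMAGE_DIR = "dataset/images"   # placeholder path to the annotated images

for name in os.listdir(IMAGE_DIR):
    path = os.path.join(IMAGE_DIR, name)
    try:
        with Image.open(path) as img:
            width, height = img.size
            if width == 0 or height == 0:
                print(f"Zero-sized image, remove it: {name}")
            elif img.mode != "RGB":   # e.g. RGBA PNGs with 4 channels
                img.convert("RGB").save(os.path.splitext(path)[0] + ".jpg", "JPEG")
                print(f"Converted {name} to a 3-channel JPG")
    except OSError:
        print(f"Unreadable image: {name}")
```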
Next, having finished training the first model, I moved on to the second model, the SSD ResNet50 FPN COCO. This is a single-stage model, unlike the first model I used. It is faster to train, but has lower accuracy than a two-stage model. In a two-stage model, detection occurs in two stages: first, the model proposes a set of regions of interest; second, a classifier processes those region candidates. A one-stage model skips the region proposal stage and runs detection directly over a dense sampling of possible locations.
Unfortunately, after training this model for over 10000 steps, I realized there was an issue with it since the evaluation folder was empty, likely due to some technical problem I could not pinpoint. Initially believing the model had been successful, I had already moved on to altering the parameters for the same model, changing the learning rate from 0.04 to 0.1. Increasing the learning rate is meant to speed up training, but it can decrease accuracy as a trade-off. Since I could no longer use the checkpoints from my first attempt, I decided to delete them and focus on training the altered model instead. Since I was only changing the model configuration, I could reuse my previous code and simply download the SSD model as well.
The results from this SSD model (0.1 learning rate) were significantly poorer than those from my Faster RCNN model, despite taking a longer time. SSD models are usually faster but less accurate; in this case, however, the bottleneck appeared to lie elsewhere, with my CPU (not something I could really change). Instead of taking about 28 seconds for every 100 steps like my Faster RCNN, this SSD took around 176 seconds, unfortunately a drastic increase. Due to the time taken, I only trained it for around 17000 steps, especially seeing how the loss had plateaued. After exporting this model to visualize it, I went on to retrain the initial SSD model (0.04 learning rate).
Comparing the results between the first and second SSD model, there was no significant difference in the time taken to train the model or the total loss. The second SSD model with a lower learning rate did have a higher mAP though.
Elaboration / record of results / deliverables / impact of work done
Unfortunately, I accidentally deleted my image classification files from my Google Drive, thus losing my work. The results below are taken from the full reference code my mentor sent us after we completed this challenge.
For image classification, 20 different classes from the Quickdraw data set were selected, ranging from lollipop to fish. After processing this data and setting up the model, training was started. Using transfer learning is supposed to give a better result, although that was not so in my case. This could have been due to some earlier error I did not spot.
This graph shows the model accuracy on the training and validation sets. For image classification, the higher the accuracy, the better (unless the model is overfitting, in which case the validation loss will also increase).
This graph shows the model loss on the training and validation sets as well. Loss should be minimized, not maximized, to get a better model overall.
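Graphs like these two can be produced directly from the Keras training history; the sketch below assumes the history object returned by model.fit in the earlier training sketch.

```python
import matplotlib.pyplot as plt

# 'history' is the object returned by model.fit(...) in the training sketch above.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(history.history["accuracy"], label="train")
ax1.plot(history.history["val_accuracy"], label="validation")
ax1.set_title("Model accuracy"); ax1.set_xlabel("Epoch"); ax1.legend()

ax2.plot(history.history["loss"], label="train")
ax2.plot(history.history["val_loss"], label="validation")
ax2.set_title("Model loss"); ax2.set_xlabel("Epoch"); ax2.legend()

plt.show()
```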
The results can be better visualized here. The percentages and words in brackets refer to the model’s guess. For incorrect guesses, the words will be displayed in red. The percentage refers to how sure the model is when guessing the object in the picture. These images are all doodles by the public, taken from the open source Quickdraw data set.
For object detection, I used Faster RCNN ResNet101 COCO first, feeding the model images I had found online. Since I had a total of 5 classes, I tried to find images that would include as many of these classes as possible.
When training the Faster RCNN model, it can be seen that the model trains for 1700 steps before evaluating, with every 100 steps taking around 28 seconds.
After training the model for close to 120000 steps, all the classes except light were performing fairly well. Even though I had more objects for light than for whiteboard, whiteboard still performed much better than light. This could be due to the difficulty of the light class, given that lights come in many shapes, sizes and designs, and are scaled very differently in every picture. 50IOU refers to only accepting model predictions that overlap the ground truth boxes by at least 50%.
As can be seen from the loss graph, the model got worse as more steps were trained, and I should have stopped training much earlier. There is a change in the trend from 30000 steps onward, as my data set was altered slightly and I had to re-run the model; because of that, the model ended up performing much worse. There also seems to be overfitting.
After exporting the model, the detections were visualized through the pictures, comparing the groundtruth and what the model guessed. From there, it is evident that the model excels especially in detecting chairs, but fails to detect most of the lights.
Due to an earlier error, the first SSD model that I completed had a learning rate of 0.1 instead of the initial 0.04. Unlike my Faster RCNN model, this SSD model only trained for 300 steps before evaluating and took significantly longer for every 100 steps - an average of 176 seconds. I trained the model for over 17000 steps.
There is a drastic difference between the Faster RCNN model and this SSD model. Comparing the mAP for chair, the Faster RCNN model hit about 0.83 at its highest point, whereas the SSD model only hit 0.1, an extremely low value. The straight line in the graphs is due to my computer dying (running out of battery) overnight; when I re-ran the model, there was a sudden spike.
I decided to stop training the model due to the total loss plateauing after 6000 steps (the model is unlikely to still be learning, as the learning rate was too high) and a lack of time. Classification loss refers to how well the model classifies the objects (i.e. labelling a whiteboard as a document would raise it). Localization loss refers to where the model drew the box compared to the ground truth. Regularization loss penalizes overly large weights to help keep the model from overfitting. Total loss is a combination of all three.
When visualizing this SSD model, it is evident how much better the Faster RCNN model fared. In the first picture, the SSD model was not able to pick anything up at all. This could be due to the learning rate being too high and far fewer training steps being taken. Additionally, although the SSD ResNet50 FPN COCO is better at handling scale (i.e. the same object at different sizes), my data set had a lot of other variation between objects.
For my second SSD (learning rate 0.04), I trained the model for over 16000 steps. Much like my first SSD model, the model would train for 300 steps before evaluating. Although I decreased my learning rate from 0.1 to 0.04, there was only a slight increase in time taken.
Comparing the results between the first and second SSD models, it is evident that this second model fared much better, especially for the 'document' class. This would be due to the decrease in learning rate, increasing the accuracy of the model while minimizing the loss.
Surprisingly, the loss graphs for this second SSD model were very similar to the first model's, plateauing after 6000 steps as well. A possible reason could be that the learning rate needed to be lowered even further, since the model was no longer learning.
On the last day, my last checkpoint for this model disappeared and I was left with an earlier checkpoint of about 6000 steps only. As such, I chose not to visualize the images from that checkpoint as it would not have been an accurate representation of my model.
3 content knowledge / skills learnt
Firstly, I learned what a Convolutional Neural Network (CNN) is and what it is composed of. A CNN is a deep learning architecture comprising an input layer, hidden layers and an output layer, loosely mimicking the connectivity of neurons in the human brain. The input refers to the images in my data set, which the model learns from. These images vary widely, from RGB (3 channels) to grayscale (1 channel), PNG to JPG. Before training, the images have to be converted into numerical arrays (such as NumPy arrays) that the framework can process.
Hidden layers refer to the convolutional layers within the network. For models such as the Faster RCNN ResNet101 COCO that I used, the ResNet101 backbone is 101 layers deep. In this image, ReLU (Rectified Linear Unit) is used as the activation function; it replaces negative values with zero while passing positive values through unchanged.
Each convolution produces a feature map (the convolved feature). This is usually followed by pooling, whereby the feature map is 'compressed' while the important high-level features are retained, before being processed by the next convolutional layer and so on. This decreases the computing power required to process all the images, speeding up the computation while reducing overfitting. Max pooling takes the maximum value of the portion of the image covered, while average pooling takes the average value.
Padding is usually used to ensure that the entire image is processed, by adding additional pixels at the boundary of the data. The kernel (the yellow box in the picture below) moves from left to right before going on to the next row and so forth until the entire image has been traversed, with the stride dictating the 'distance' the kernel moves each time. The first layers are responsible for capturing low-level features such as edges.
After going through all the convolutional layers, the feature maps are flattened before going into the fully connected layer to produce an output (e.g. the model classifies the image).
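As a small illustration of how convolution, padding, stride, pooling and flattening fit together, the sketch below traces the tensor shapes through one such block; the filter counts and class count are arbitrary.

```python
import tensorflow as tf

x = tf.random.normal([1, 28, 28, 3])   # one RGB image: (batch, height, width, channels)

# 3x3 kernel, stride 1, 'same' padding keeps the spatial size at 28x28;
# ReLU zeroes out the negative values in the feature map.
conv = tf.keras.layers.Conv2D(filters=16, kernel_size=3, strides=1,
                              padding="same", activation="relu")
features = conv(x)                                              # (1, 28, 28, 16)

# Max pooling halves the spatial size, keeping the strongest response in each 2x2 window.
pooled = tf.keras.layers.MaxPooling2D(pool_size=2)(features)    # (1, 14, 14, 16)

# Flatten and pass through a fully connected layer to get class scores.
flat = tf.keras.layers.Flatten()(pooled)                        # (1, 14*14*16)
logits = tf.keras.layers.Dense(5)(flat)                         # 5 classes

print(features.shape, pooled.shape, logits.shape)
```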
Secondly, I learnt about object detection and how to measure a model's accuracy. The score threshold filters out detections the model is not confident about. Mean average precision (mAP) is calculated from precision and recall, and generally speaking, a higher mAP means a better model. I can view the detection boxes' performance by category, which measures how well the model fared based on the overlap between the model's guess and the ground truth. For most models it is difficult to achieve a high mAP here, since the overlap criterion increases gradually from 50% to 95% and the average is pulled down by the stricter overlap requirements (e.g. a detection box considered correct at 50% IoU may not be at 80% IoU, and the average over all thresholds is taken). However, there is also a mAP at 50IOU by category, which only requires a 50% overlap before the detection box is considered correct. Non-maximum suppression is done during post-processing to keep the bounding box the model is most confident in and discard the overlapping boxes around it.
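Since mAP ultimately rests on the overlap (IoU) between a predicted box and the ground truth box, a minimal sketch of that computation is shown below, assuming boxes in [xmin, ymin, xmax, ymax] format.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [xmin, ymin, xmax, ymax]."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Overlap area (zero if the boxes do not intersect).
    intersection = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# A detection counts as correct at the 50% IoU threshold if iou(...) >= 0.5.
print(iou([0, 0, 10, 10], [5, 5, 15, 15]))   # 0.1428...
```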
High precision means the model is very accurate and returns more relevant results than irrelevant ones, which is helpful when there is a need to reduce false positives. On the other hand, high recall means the model returned most of the relevant results, which is needed when the cost of false negatives is high. The F1 score combines the two, rewarding models with few false positives and few false negatives. That being said, this is usually a trade-off, as it is not really possible to maximize both precision and recall at the same time.
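The relationship between the three can be summarized in a short sketch using counts of true positives, false positives and false negatives; the numbers are made up purely for illustration.

```python
# Hypothetical counts from matching detections against ground truth at some IoU threshold.
true_positives = 80
false_positives = 20   # detections that matched no ground-truth object
false_negatives = 40   # ground-truth objects the model missed

precision = true_positives / (true_positives + false_positives)   # 0.80
recall = true_positives / (true_positives + false_negatives)      # ~0.67
f1 = 2 * precision * recall / (precision + recall)                # ~0.73

print(precision, recall, f1)
```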
High accuracy does not always mean that the model is a good one. For example, suppose a data set contains 99 cats and 1 dog, and the model predicts 'cat' for every image. It would attain 99% accuracy while still misclassifying the dog, so it is not truly a good model, especially if the point is to identify that 1 dog.
A way to increase accuracy would be to reduce the learning rate so that the loss can be minimized. If the learning rate is too high, the model will keep overshooting the minimum loss value and will not be able to learn properly. Conversely, if the learning rate is too low, the model will take an extremely long time to learn as the weights are only updated bit by bit. The optimum learning rate for a model is usually found through trial and error, as there is no single ideal learning rate.
There is also learning rate warm-up, whereby the model starts with a fraction of the initial learning rate and increases it gradually (e.g. with a warm-up fraction of 0.1 and an initial learning rate of 1, the model starts at 0.1 and raises the rate every few hundred steps until it reaches 1). In addition, there is learning rate decay, which is the opposite of warm-up - the model starts with the initial learning rate and slowly decreases it over time. Both can improve the model, although trial and error is definitely still required.
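A minimal sketch of how warm-up followed by decay could be expressed as a custom Keras learning rate schedule is shown below; the base rate, warm-up length and decay settings are illustrative, not the values from my configs.

```python
import tensorflow as tf

class WarmupThenDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linearly warm up to the base learning rate, then decay it exponentially."""

    def __init__(self, base_lr=0.04, warmup_steps=1000, decay_rate=0.95, decay_steps=2000):
        super().__init__()
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps
        self.decay_rate = decay_rate
        self.decay_steps = decay_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        # Ramp up linearly from 0 to base_lr over warmup_steps, then decay.
        warmup_lr = self.base_lr * (step / self.warmup_steps)
        decayed_lr = self.base_lr * self.decay_rate ** ((step - self.warmup_steps) / self.decay_steps)
        return tf.where(step < self.warmup_steps, warmup_lr, decayed_lr)

# The schedule can be passed directly to an optimizer in place of a fixed learning rate.
optimizer = tf.keras.optimizers.SGD(learning_rate=WarmupThenDecay(), momentum=0.9)
```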
Thirdly, I learnt the importance of data and how to handle it. Data forms the basis for deep learning. Not only does the data need to be of good quality, a substantial quantity (thousands or even millions of images) is required as well. Without quality and quantity, there is no way to train and achieve a good model suitable for use. In the case of object detection, it can be used in autonomous vehicles to detect objects in the surrounding environment. If the necessary data is not available, the vehicle may not be able to identify a person as an object that should be avoided, resulting in a severe accident.
Keeping this in mind, I had to be meticulous when labeling and annotating the images I took from Google to form my data set. Each object had to be boxed up completely and all the relevant objects in the image had to be identified. If I overlooked an object and the model managed to detect it, the model would be taught that it was wrong, since its detection did not match the ground truth (even though it was my fault for overlooking the object). Given that a large number of images and objects were required for the model to learn, I had to devote much of my time to annotating and labeling, a very labor-intensive task. I was surprised to learn that humans still play an integral part in deep learning, especially amidst the current debate on humans versus machines. If anything, it seems unlikely that humans will become obsolete or useless anytime soon when we are required to fulfill the most fundamental role of deep learning - generating the data required.
The importance of data was further highlighted when I had to deal with data processing for image classification and object detection. This was definitely a pain as I was bombarded with a myriad of errors when trying to code and run this section. From trying to do a train-val split to converting all the images to RGB with 3 channels instead of 4, every minor detail had to be accounted for and resolved. Even a small issue such as having a png image instead of a jpg image would trigger an error since there were 4 channels instead of 3. Other problems came down to Tensorflow itself such as the need to convert my data to TFRecords first as Tensorflow could only read the data in such a format. Given my inexperience with Tensorflow and deep learning as a whole, I struggled quite a lot with this and had to do much trial and error, learning from my mistakes along the way.
Before the model can be trained and evaluated, the fundamentals, the data, must be settled and readied accordingly. Only then can the model proceed successfully and generate results. From there, further fine-tuning can be done such as adjusting the batch size and learning rate. However, data still remains the most important feature of deep learning.
2 interesting aspects of your learning
Firstly, I was very surprised to learn that DSTA has a very flexible culture, providing us with much autonomy and freedom. A normal workday runs from 8.30am till 6pm. However, our mentors were comfortable with us arriving at 9am instead, with most staff arriving anytime between 8.30am and 9am. Additionally, there is no fixed lunch time for the staff, so we usually left for lunch at 11.30am and were back in the office past 1pm to continue our work. The key criterion for all staff was that the necessary deadlines for their work had to be met. Other than that, DSTA had a very friendly and comfortable environment that felt extremely welcoming, especially since I was new to the workplace.
This flexibility afforded us the opportunity to be more autonomous and manage our time as we liked, based on our workload. There was no rigid structure like I had expected, and our mentors left us alone most of the time. Whenever we needed help, we were free to approach them and ask, but otherwise we were left to our own devices. Unlike school, where we have a strict timetable to follow, we could manage our own time freely and allocate our work accordingly. This was a new responsibility as well, since it was up to us to complete our work in time without anyone urging us on. Personally, I liked this arrangement as I prefer having control over my own time and being able to complete work ahead of time where possible.
Secondly, since DSTA is split into many different departments covering specific aspects of Science and Technology, the staff stationed in one office would have rather similar interests and job scope. For example, I was at PC12, Enterprise Information Technology, so everyone in that department was skilled at coding and focused on the technological side of DSTA. As such, I felt that it would be easier to get along with colleagues if I were working there since we would have that in common. If I needed help, it would be far easier to ask those around me for advice. During my one month in DSTA, I also observed that the working environment was very positive with the staff there chatting freely every now and then and cracking jokes at each other. Such a moment of joy breaks the monotony of work and allows them to take a break when needed. While work may not seem like something to look forward to as a student, this experience made me realize that it is not quite ‘all work and no play’ as I had previously believed it to be.
1 takeaway for life
My biggest takeaway was improving my self-discipline and fostering a better work ethic over this one month. I am very prone to procrastination and pushing everything I need to do to the last minute, a bad habit that has persisted for many years. That being said, spending one month in DSTA where my only task was to finish the models for image classification and object detection meant that I was not able to procrastinate - I had nothing else to do but work, so I just had to force myself to complete it in time.
Additionally, since there was no 'authoritative figure' supervising me 24/7, I was left to my own devices most of the time. This meant that I had to become a lot more self-disciplined in order to finish all my work in time. This was coupled with the fact that my teammate, Ryan, was seated beside me and had a very strong work ethic. There was some positive peer pressure not to slack and to keep up with his progress, pushing me to develop and strengthen my work ethic.
These values are invaluable, be it in school or when I go to work in the future. Without self-discipline or a strong work ethic, much time would be mindlessly thrown away on frivolous distractions. Only through this one month in DSTA was I able to better appreciate and learn the importance of these two values.