The aim of this project is to detect courier packages arriving at your doorstep from images or videos captured by a Raspberry Pi camera. This is a computer vision task, specifically a form of object detection. The generic object-detection process is narrowed down to recognizing the logos of courier service providers such as FedEx, UPS, USPS, and Amazon on the packages, along with their bounding boxes. I will be using the YOLOv3 framework for object detection.
Human sight is the most important of our senses; we rely on it to perceive and comprehend the outside world. It is highly capable and involves many components, making it a complicated system [1]. Computer vision is the field concerned with mimicking and understanding this complex human skill with computers. Object detection is a task in computer vision and image processing that deals with detecting objects in images or videos. With the advent of technology in all sectors, it has found a prominent place in a variety of applications, including video surveillance, self-driving cars, and object tracking [2]. Object detection allows us to classify the types of things found in an image while also locating instances of them within it.
I will be making use of this field of artificial intelligence to implement my capstone project.
Detection of objects involves all the steps of a machine learning pipeline, so I will discuss them in detail in the sections below.
Data Source and collection:
In order to implement the project, we have to train the model to detect these specific logos in a given image or video. As the data is scenario-specific, I will be creating a custom image set for each of these companies containing the target logos.
Data collection: scraping images containing these logos from Google Images using Selenium.
Data augmentation: augmenting the real data to produce new synthetic images for training. (Neural networks perform better when trained on large amounts of data.)
Data labeling: annotations are vital in object detection, so I am planning to use an open-source tool (labelImg) to label the images.
Data configuration: once the training data is ready, it has to be converted into the format the YOLO framework expects.
Train and test config files have to be modified according to the number of classes we have.
File structure alignment as per the YOLO framework
YOLO architecture (Darknet): use the YOLOv3 network architecture from the original paper, implemented in the Darknet neural network framework.
Train: train the model on our dataset starting from pre-trained weights, saving the trained weights every 100 epochs.
Test: Predict bounding boxes on the test images and video files.
Installed ChromeDriver for the Google Chrome web browser
Installed the Selenium package in Python
Executed a Python script to download Google Images results for the classes amazon, fedex, ups, and usps, as sketched below
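Below is a minimal sketch of such a script. The scroll logic, the http filter, and the data/ output directory are illustrative assumptions, since Google changes its image-results page frequently.

```python
# Hypothetical scraping sketch: collect thumbnail URLs from Google Images
# and save them locally. Selectors and queries are illustrative.
import os, time, urllib.request
from selenium import webdriver
from selenium.webdriver.common.by import By

os.makedirs("data", exist_ok=True)
driver = webdriver.Chrome()  # assumes ChromeDriver is on PATH

for query in ["amazon", "fedex", "ups", "usps"]:
    driver.get(f"https://www.google.com/search?q={query}+package+logo&tbm=isch")
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # allow lazy-loaded thumbnails to appear
    for i, img in enumerate(driver.find_elements(By.TAG_NAME, "img")):
        src = img.get_attribute("src")
        if src and src.startswith("http"):  # skip inline base64 thumbnails
            urllib.request.urlretrieve(src, f"data/{query}_{i}.jpg")

driver.quit()
```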
Looking at the images (EDA) - Colab notebook
Along with identifying the logos in the image, we will add another class for the courier boxes themselves.
Note: as Google has updated its policies on web scraping, I was not able to download more than 80 images from Google Images.
Neural networks are, in general, data-hungry and require large datasets to learn and perform well.
Data augmentation is the process of applying different transformation techniques such as horizontal flips, cropping, and shearing to the available data in order to synthesize new data.
For this, I have applied three kinds of transformations to the available data using the skimage package, sketched in code after the list below - Colab Notebook
a. horizontal flip (left-right)
b. vertical flip (up-down)
c. rotation by 45 degrees
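A minimal sketch of these three transformations with skimage (the data/ folder and file naming are assumptions; since labeling happens after augmentation, the new images get annotated afterwards like any other):

```python
# Augment every image in data/ with a horizontal flip, a vertical flip,
# and a 45-degree rotation, writing the results alongside the originals.
import os
import numpy as np
from skimage import io, transform, img_as_ubyte

for name in os.listdir("data"):
    img = io.imread(os.path.join("data", name))
    stem, _ = os.path.splitext(name)
    io.imsave(f"data/{stem}_hflip.jpg", np.fliplr(img))    # left-right flip
    io.imsave(f"data/{stem}_vflip.jpg", np.flipud(img))    # up-down flip
    rot = transform.rotate(img, angle=45, resize=True)     # returns floats in [0, 1]
    io.imsave(f"data/{stem}_rot45.jpg", img_as_ubyte(rot))
```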
After augmentation, we have 1,280 images across the five classes, which can be split into training and validation sets.
Original Image
Horizontal flip
Vertical flip
Rotation by 45 degrees
Used the open-source annotation tool labelImg [3]
LabelImg is a graphical image annotation tool.
It is written in Python and uses Qt for its graphical interface.
Annotations are saved as XML files in PASCAL VOC format, the format used by ImageNet. It also supports the YOLO format.
For YOLO, the annotation file should be a .txt file with one line per bounding box:
(class) (X_CENTER_NORM) (Y_CENTER_NORM) (WIDTH_NORM) (HEIGHT_NORM)
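The normalization behind these values is simple enough to sketch. The voc_to_yolo function below is a hypothetical helper, not part of labelImg, converting pixel-space PASCAL VOC corners into YOLO's normalized form:

```python
# Convert a PASCAL VOC box (pixel corners) to YOLO's normalized format.
def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    x_center = (xmin + xmax) / 2 / img_w   # box center, as a fraction of width
    y_center = (ymin + ymax) / 2 / img_h   # box center, as a fraction of height
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return x_center, y_center, width, height

# A 200x100 box in the top-left corner of a 640x480 image, class 0:
print(0, *voc_to_yolo(0, 0, 200, 100, 640, 480))
# -> 0 0.15625 0.104166... 0.3125 0.208333...
```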
class labels
labeled image with bounding boxes
label file
Before we delve into YOLO, let us look at some of the previously used models for object detection.
Region-based detectors and the Deformable Parts Model were two-stage detectors:
The first step was to find regions of interest in the input image.
The second step was to pass those regions to a classifier for detection.
DPMv5 (Deformable Parts Model) was considered the gold standard in object detection, taking 14 seconds to process a single image with an mAP of 33.7, which is extremely far from real-time.
Then came R-CNN [4] (Region-based Convolutional Neural Networks), which extracts around 2,000 region proposals via selective search, runs a CNN over each warped region, and classifies the resulting features.
*source: https://arxiv.org/pdf/1311.2524.pdf
Fast R-CNN [5], instead of extracting CNN feature vectors for each region proposal separately, passes the entire image through one conv net, and the region proposals share the resulting feature matrix. That same feature matrix is then branched out to learn the object classifier and the bounding-box regressor. In short, computation sharing speeds up R-CNN.
*source:https://arxiv.org/pdf/1504.08083.pdf
Faster R-CNN [6] constructs a single, unified model composed of an RPN (region proposal network) and Fast R-CNN with shared convolutional feature layers.
*source:https://arxiv.org/abs/1506.01497
Summary of all the R-CNN family models
*source:Weng, L. (2017). Object Detection for Dummies Part 3: R-CNN Family. lilianweng.github.io/lil-log.
Here is the comparison chart for the different models, with their mAP and speed on the COCO dataset.
YOLO skips the region proposal stage discussed above and runs detection directly on the image using deep conv nets, which makes it faster and simpler. It is a state-of-the-art model that marked a major breakthrough in computer vision, mainly because of its speed and accuracy.
Unlike models such as Faster R-CNN, YOLO uses one complete neural network both to detect the objects in an image and to locate them. Each input image is divided into an S x S grid. For each grid cell, bounding box predictions are generated simultaneously with class probabilities/scores for objects associated with that cell. Each score reflects how confident the model is that the box contains an object of a certain class. Another major advantage is that it can process 45 frames per second.
Darknet is the name of the underlying architecture of YOLO.
In total, one image yields S × S × B bounding boxes; each box carries 4 location coordinates and 1 confidence score, while each grid cell additionally predicts C conditional class probabilities for object classification.
The total number of prediction values for one image is therefore S × S × (5B + C), which is the tensor shape of the final conv layer of the model.
So for a 7 × 7 grid with 20 classes and 2 bounding boxes per cell, the output is a 7 × 7 × 30 tensor (5 × 2 + 20 = 30).
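A quick sanity check of that arithmetic:

```python
# Output tensor depth for a YOLOv1-style head: 5 values per box plus C class scores.
def output_shape(S, B, C):
    return (S, S, 5 * B + C)

print(output_shape(S=7, B=2, C=20))  # (7, 7, 30), the original paper's setup
print(output_shape(S=7, B=5, C=5))   # (7, 7, 30), our 5-class configuration
```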
Here is the workflow of YOLO from the original paper[9]:
*source:https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf
Network Architecture
Classes = 5, bounding boxes per cell = 5, i.e. depth = 5 × 5 + 5 = 30
YOLO has 24 convolutional layers followed by 2 fully connected (FC) layers. Some convolution layers alternately use 1 × 1 reduction layers to shrink the depth of the feature maps. The last convolution layer outputs a tensor with shape (7, 7, 1024). The tensor is then flattened. Using 2 fully connected layers as a form of linear regression, it outputs 7 × 7 × 30 parameters, which are reshaped to (7, 7, 30), i.e. 2 boundary box predictions per location.
*source:https://towardsdatascience.com/dive-really-deep-into-yolo-v3-a-beginners-guide-9e3d2666280e
*source:https://towardsdatascience.com/dive-really-deep-into-yolo-v3-a-beginners-guide-9e3d2666280e
It consists of four parts [12], written out in full after this list:
centroid (xy) loss
width and height (wh) loss
objectness loss (object or no object)
classification loss for conditional class probabilities
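For reference, here is the sum-squared loss from the original YOLO paper [9], annotated with those four parts; YOLOv3 keeps the same decomposition but uses binary cross-entropy for the objectness and classification terms [12]:

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[(x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2\right] && \text{(xy loss)} \\
 & + \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2 + (\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right] && \text{(wh loss)} \\
 & + \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2
   + \lambda_{noobj} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2 && \text{(objectness)} \\
 & + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} (p_i(c)-\hat{p}_i(c))^2 && \text{(classification)}
\end{aligned}
```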
mAP (mean average precision) is a popular metric for measuring the accuracy of object detectors. It is built up as follows:
1. Compute precision and recall.
Precision: how precise the model is, i.e. how many of the predicted positives are actual positives:
True positives / [Total predicted positives]
Recall: how many of the actual positives the model captured:
True positives / [Total actual positives]
2. Compute the average precision (AP), the area under the precision-recall curve.
3. Compute the mean of the average precision (mAP) across classes from the AP values above.
4. Intersection over Union (IoU) measures the overlap between two boundaries.
There are predefined IoU thresholds for specific datasets.
A detection is a true positive if its IoU with a ground-truth box is greater than some threshold (usually 0.5; the metric is then written mAP@0.5), as in the sketch below.
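A minimal IoU sketch for axis-aligned boxes given as (x1, y1, x2, y2) corners, with an example at the 0.5 threshold:

```python
# Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

pred, truth = (50, 50, 150, 150), (60, 60, 170, 160)
print(iou(pred, truth))         # ~0.63
print(iou(pred, truth) >= 0.5)  # True -> counts as a true positive at mAP@0.5
```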
A number of enhancements were introduced in the version 2 model, also known as YOLO9000:
BatchNorm applied to all the convolutional layers, improving convergence.
High-resolution images used to fine-tune the base model, which improves detection performance.
Introduced convolutional anchor-box detection.
The cfg file describes the layout of the network, block by block.
There are mainly five types of layers used in YOLOv3:
1. convolutional layer
The network has 75 convolutional layers in total; a typical block specifies the number of filters (e.g. 64), a kernel size of 3, a stride of 2, and the leaky ReLU activation function.
2. shortcut layer
This is the skip connection: the feature maps from the layer the shortcut points back to (e.g. 3 layers earlier) are added to the feature maps of the previous layer to give the output.
3. route layer
The layers parameter here can hold one or two values.
These values indicate which feature maps are to be given as output: with a single value, the feature maps of that layer are forwarded; with two values, the feature maps of both layers are concatenated and then given out.
4. upsample
This block upsamples the previous feature map by a stride (factor) of 2
5. yolo layer, the detection layer, which applies the anchor boxes and produces the final predictions.
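For concreteness, here is an illustrative excerpt with one block of each type, abridged from a typical yolov3 cfg file (values shortened; filters=30 before the yolo layer reflects our five classes via (classes + 5) x 3):

```
[convolutional]
batch_normalize=1
filters=64
size=3
stride=2
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[route]
layers=-1, 61

[upsample]
stride=2

# (classes + 5) * 3 = 30 filters feed each yolo layer
[convolutional]
size=1
stride=1
pad=1
filters=30
activation=linear

[yolo]
mask=3,4,5
anchors=10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=5
num=9
```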
This is a simpler version of the YOLOv3 architecture.
reduced number of convolutional layers - only 7 convolutional layers
uses pooling layer
*source:https://www.researchgate.net/figure/You-Only-Look-One-v3-tiny-YOLOv3-tiny-network-structure_tbl1_335043703
improved detection performance and superior speed
Backbone: CSPDarknet53,
Neck: Spatial Pyramid Pooling additional module, PANet path-aggregation,
Head: YOLOv3
It is an efficient and powerful object detection model that allows anyone with a 1080 Ti or 2080 Ti GPU to train a very fast and accurate object detector.
The influence of state-of-the-art "bag-of-freebies" and "bag-of-specials" object detection techniques during detector training was verified.
State-of-the-art methods, including CBN (cross-iteration batch normalization) and PAN (path aggregation network), were modified to make them more efficient and suitable for single-GPU training.
This is a compressed version of the YOLOv4 model.
It has only two yolo head layers
29 convolutional layers
reduced number of anchor boxes predictions
In this step, we are configuring the data file structure as in the image.
Configuration files have to be changed as per the number of classes (5)
The number of filters in the convolutional layer preceding each yolo layer is given by
(classes + 5) × 3, i.e. (5 + 5) × 3 = 30
yolo.names - class names file
yolo.data - configuration details
train.txt - list of training image paths
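A minimal sketch of what the two small files contain (paths and the exact class spellings are illustrative):

```
# yolo.names -- one class label per line
amazon
fedex
ups
usps
box

# yolo.data -- points Darknet at the classes, image lists, names file, and
# the backup/ directory where trained weights are periodically saved
classes = 5
train = data/train.txt
valid = data/test.txt
names = data/yolo.names
backup = backup/
```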
I trained the datasets on YOLOv3, YOLOv4, YOLOv3 tiny, and YOLOv4 tiny models.
Each of the models was trained for 4000 epochs.
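Inside each notebook, training and evaluation reduce to Darknet commands roughly of this shape (file names are illustrative; the .conv file holds the pre-trained backbone weights):

```
./darknet detector train data/yolo.data cfg/yolov3.cfg darknet53.conv.74
./darknet detector map   data/yolo.data cfg/yolov3.cfg backup/yolov3_final.weights
./darknet detector test  data/yolo.data cfg/yolov3.cfg backup/yolov3_final.weights test.jpg
```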
Yolo_V3 - Colab Notebook
trained model weights file - drive.google.com/file/d/1FBFf_6thDNevuc9qzQS_-LU6mWWgjuln/view?usp=sharing
Yolo_V3 tiny - Colab Notebook
trained model weights file - drive.google.com/file/d/1MfVpxBrd0GIAaF0Shf-zufVi-g-_8zEp/view?usp=sharing
Yolo_V4 - Colab Notebook
trained model weights file - drive.google.com/file/d/16bQwz1DsTbDlfdJMsWnGwGdnkyyWmdB1/view?usp=sharing
Yolo_v4 tiny - Colab Notebook
trained model weights file - drive.google.com/file/d/1ArUkn6WvakV_7mdRzVjCw6FAlv9I1-Jw/view?usp=sharing
Training loss
mAP@0.5 is 99.90%
mAP@0.5 is 98.07%
The model didn't give any prediction
Comparing the time taken to predict on the same test image
From the results below we can see that YOLOv3 tiny is very fast compared to YOLOv3.
YOLOv3_tiny
YOLOv3
training loss
mAP@0.5 is 80.99%
training loss
mAP@0.5 is 82.05%
Comparing the time taken to predict on the same test image
From the results below we can see that YOLOv4 tiny is very fast compared to YOLOv4.
YOLOv4
YOLOv4_tiny
The video was captured using a PiCamera V2, which has a framerate of 40 fps.
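A minimal capture sketch, assuming the picamera Python package (resolution, duration, and file name are illustrative):

```python
# Record a short doorstep clip on the Raspberry Pi at 40 fps.
from picamera import PiCamera

camera = PiCamera(resolution=(1280, 720), framerate=40)
camera.start_recording("doorstep.h264")
camera.wait_recording(10)  # keep recording for 10 seconds
camera.stop_recording()
camera.close()
```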
As we saw, there are tiny alternatives for both versions. The tiny versions are designed to train the model on machines that have less computing power (mobile and embedded devices) by simplifying the network and reducing the parameters.
Although accuracy decreases a bit in these tiny versions, their detection speed is very high compared to other models to date.
Therefore, this leads to the concept of edge computing. Edge computing is a method of optimizing cloud computing systems by performing data processing and inference at the edge of the network, near the source of the data.
Training the models in the cloud (where there are enough GPUs) and then deploying just the trained weights onto edge devices to get the inference would help in reducing the cost, time, and much more for real-world applications.
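One way to run such a forward pass on an edge device is OpenCV's DNN module, which can load Darknet cfg/weights pairs directly; a sketch with illustrative file names:

```python
# Load a trained Darknet model and run one forward pass with OpenCV.
import cv2

net = cv2.dnn.readNetFromDarknet("yolov3-tiny.cfg", "yolov3-tiny_final.weights")
layer_names = net.getLayerNames()
out_layers = [layer_names[i - 1] for i in net.getUnconnectedOutLayers().flatten()]

img = cv2.imread("test.jpg")
blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(out_layers)  # one prediction array per YOLO head
```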
Anchor boxes: in predicting multiple objects, the network makes thousands of bounding-box predictions around the objects, templated on a small set of prior box shapes called anchor boxes. The anchor box with the highest IoU against a ground-truth box (typically required to exceed 50%) becomes responsible for predicting that object, as in the sketch below.
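A toy sketch of that assignment, comparing a ground-truth box shape against a few (illustrative) anchors as if centered at the same point:

```python
# Pick the anchor whose shape best matches a ground-truth box (width, height).
def wh_iou(w1, h1, w2, h2):
    inter = min(w1, w2) * min(h1, h2)  # boxes compared as if sharing a center
    return inter / (w1 * h1 + w2 * h2 - inter)

anchors = [(10, 13), (33, 23), (116, 90)]
gt_w, gt_h = 100, 80
best = max(anchors, key=lambda a: wh_iou(gt_w, gt_h, *a))
print(best, wh_iou(gt_w, gt_h, *best))  # (116, 90), ~0.77 -> responsible anchor
```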
Object detection is a core component of most computer vision systems, and the current advancements are used in many real-time applications.
As we know, training a model is compute-intensive, but obtaining inference does not require nearly as many resources. Hence the future lies in running inference on edge devices that have limited or no access to cloud or high-compute resources. Deploying the trained model's weights on edge devices and detecting images captured on the go by running a forward pass with those weights will surely enable many more applications.
Smart Vision Labs. "Why Vision Is the Most Important Sense Organ." Smart Vision Labs, 29 Jan. 2017, www.smartvisionlabs.com/blog/why-vision-is-the-most-important-sense-organ/.
Brownlee, Jason. “A Gentle Introduction to Computer Vision.” Machine Learning Mastery, 5 July 2019, machinelearningmastery.com/what-is-computer-vision/.
Tzutalin. LabelImg. Git code (2015). https://github.com/tzutalin/labelImg
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. “Rich feature hierarchies for accurate object detection and semantic segmentation.” In Proc. IEEE Conf. on computer vision and pattern recognition (CVPR), pp. 580-587. 2014.
Ross Girshick. “Fast R-CNN.” In Proc. IEEE Intl. Conf. on computer vision, pp. 1440-1448. 2015.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. (2016). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.
Weng, L. (2017). Object Detection for Dummies Part 3: R-CNN Family. lilianweng.github.io/lil-log.
Redmon, Joseph, and Ali Farhadi. “YOLO9000: Better, Faster, Stronger.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, doi:10.1109/cvpr.2017.690.
Redmon, Joseph, et al. “You Only Look Once: Unified, Real-Time Object Detection.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, doi:10.1109/cvpr.2016.91.
Redmon, J. (n.d.). Darknet: Open Source Neural Networks in C. https://pjreddie.com/darknet/.
Christiansen, A. (2021, January 4). Anchor Boxes - The key to quality object detection. Medium. https://towardsdatascience.com/anchor-boxes-the-key-to-quality-object-detection-ddf9d612d4f9.
Li, E. Y. (2021, April 21). Dive Really Deep into YOLO v3: A Beginner's Guide. Medium. https://towardsdatascience.com/dive-really-deep-into-yolo-v3-a-beginners-guide-9e3d2666280e.
He, & Huang, Chang-Wei & Wei, Liqing & Li, Lingling & Anfu, Guo. (2019). TF-YOLO: An Improved Incremental Network for Real-Time Object Detection. Applied Sciences. 9. 3225. 10.3390/app9163225.
Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. "YOLOv4: Optimal Speed and Accuracy of Object Detection." arXiv preprint arXiv:2004.10934, 2020.