This case study focuses on incorporating an image processing machine learning (ML) model onto a mobile robotic platform to provide information about the surroundings the robot encounters. The robotic platform used for this case study is the RockBot robotics learning platform developed by the Mayhem Outreach Lab at New Mexico Tech. The study ties together the concepts learned throughout this program into a practical application. Its goal is to give readers insight and inspiration for applying their new and improved image processing and machine learning skills to tasks within their own lives.
Introduce you to the planning and preparation behind incorporating image processing models into practical applications.
Introduce you to the decisions driving aspects of the image processing model to be used.
Introduce you to the physical requirements of the robotic platform depending on the data being processed.
Introduce you to the RockBot robotics platform used as the driver of the study.
Provide insight on alternatives to the chosen model for the chosen application.
Before you can efficiently begin any new project, it is important to create a plan that guides the flow of the project. For the development of a machine learning (ML) model to integrate with one of our mobile robotics platforms, a plan was developed and refined before any of the programming work for the project began.
In some cases it is possible to navigate a full project without a solid plan. However, doing so when working with machine learning can become complicated and confusing rather quickly. When designing a plan for an ML-related project, it is important to address at least the following three questions:
How does the machine learning model contribute to the overall project goal?
Is the machine learning model the primary portion of your project?
Is your machine learning model considered to be a supporting portion of the project as a whole?
What will be the input or inputs for your machine learning model?
How does your input data come to your model? How is it formatted?
Is your input data in the form of videos, images, numerical datapoints, etc.?
Are there multiple streams of input data being processed in your model?
What will be the output or outputs for your machine learning model?
What is the output of your model?
Is it a prediction, a segmented image, a numerical output, etc.?
Are there multiple outputs required?
How does your model output need to be formatted?
Do you need an image with bounding boxes?
Do you need a numerical prediction?
To help you develop your own machine learning project, the following sections discuss our answers to these three vital planning questions. It is important to note that each question and answer is closely interconnected, and decisions made in each step will impact the remaining questions.
To begin the discussion of the machine learning model for our mobile robotics application, we need to introduce you to the robotic platform we used and will continue using. The robotic platform that drove the study, and was therefore used in data collection, is the RockBot wheeled robotic platform, pictured on the right.
The RockBot system is considered a robotic platform because it was developed to be usable for various applications. This means that RockBot is designed to be easily adapted for different purposes, including the implementation of an ML model. One of RockBot's initial applications was teaching various levels of learners about the wiring and mechanical design of inspection robots. The inspection-focused RockBot streamed a real-time camera feed to a screen so an operator could remotely navigate through a wooden maze, as can be seen in the image to the left.
More recently, a version of the RockBot has been under development that uses specialized computational hardware and software to navigate areas autonomously using waypoint navigation. This work sparked further interest in creating a computer vision system to assist the autonomous navigation of the RockBot. That interest became the purpose of the ML-focused project discussed in this case study: creating an image-based system, built around a machine learning model, to integrate with the autonomous RockBot system.
For a previous application of the RockBot platform, a small real-time camera streamed a video feed to a separate screen, allowing the operator to remotely navigate various scenarios. For our autonomous RockBot application, real-time data processing is not strictly necessary, but it would be preferable once the model has been fully developed, trained, and implemented. In fact, the most important aspect of the image input that needs to be accounted for is the quality of the images used. Planning your training dataset is a vital task in creating a reasonable and accurate machine learning model for your project.
Before we could begin building our dataset we needed to have a plan for what our model would be attempting to identify. For our purposes, the objective of the autonomous RockBot is to be able to navigate around the New Mexico Tech campus autonomously while avoiding occasional obstacles. The waypoint navigation already in use on the autonomous RockBot helps with the navigation around campus, but cannot account for the obstacles that the robot will ultimately encounter. This means that we will need our ML model to detect and identify obstacles in its path. At this point, we have successfully determined the subject matter that we need to create our ML model around, and now we just need to obtain the dataset for training and testing the model. We have an entire section in which we discuss the process we followed to create our image dataset for the model we created. Planning the input and overall goal of your model takes a lot of forethought and can take a large amount of time in the beginning of a project. Putting in the time and effort to do this planning will save you hours, if not days, of work once you begin building your dataset, ML model, and overall project.
There are a variety of output characteristics that could be used in our autonomous RockBot model to provide beneficial information to our application. To keep the example in this study manageable, the output classes were chosen to be common obstacles the robot may encounter while driving around the NMT campus: a bike and a water bottle. These two items were chosen because they are easy to distinguish from the background. Ultimately, our goal in this case study is to provide a complete example of applying and integrating an image processing machine learning model into our project, and to provide guidance for doing the same in your own projects.
At this point, we had chosen our objects for training on detection in our model. This choice does not mean we are done developing our plan, though. We need to determine the format or layout of our model output. There are multiple paths that could be taken in this decision, and we chose to output classified bounding boxes from our model. A bounding box is defined to be a rectangle (or square) around an object in an image, similar to the region of interest (ROI) in other applications. The image to the right demonstrates a bounding box (with a label of 0) around the bike.
With the previous work done during the project planning phase of the process, we know what our goals are while setting up and creating our dataset. We know that we are attempting to teach our model to recognize and categorize water bottles and bikes in a variety of environments. This provided a framework for the images that we needed to capture for our dataset. The RockBot needed a few slight modifications from its previous configuration to be able to operate repeatedly and efficiently in gathering images. The main modification was the addition of a mirrorless camera mounted to the top. An image taken of the robot during data collection is shown on the left.
Dataset images were taken using the mobile RockBot system, with image captures being triggered remotely. The objects of interest were placed in an assortment of places and positions around NMT campus and the camera was maneuvered to have the objects in frame before acquiring an image. The team went through three rounds of data collection while deciding on the final configuration of our dataset. A final dataset of 138 images was developed. Of these images, 72 images depicted one or more water bottles, 52 images depicted one or more bikes, and 14 images depicted a mixture of both objects.
In order to provide the ML model with a training dataset, we need to provide it with image labels to learn from. The required label format depends on the chosen path between input data and output predictions. For our specific example, we use labeled bounding boxes to annotate each training image. Each bounding box provides a label (describing the object as either a water bottle or a bike) and a set of four numerical values corresponding to the (x, y) coordinates of the top-left and bottom-right corners of the box. The image to the right demonstrates a bounding box placed around the water bottle with a label of 1 (which corresponds to the class 'water bottle').
To speed up the labeling process, we created a semi-automated system to label our images. For the dataset used in this case study, we used MATLAB, a paid numerical computing platform from MathWorks that many engineers and data scientists use to evaluate large datasets. The remainder of this section discusses the MATLAB script we used to label our dataset. However, we have also developed and attached a Python script that allows you to follow a similar process without purchasing new software. It is important to note that the attached Python script cannot be run in Google Colab due to Colab's graphical user interface restrictions.
The MATLAB code we developed for labeling our images begins by reading images from a file folder. The images read into the software were all of the raw images extracted from the camera after data collection. For each image in the dataset, the code iterates through each of the given classes, water bottles and bikes. The bike and water bottle classes were given index values of zero and one, respectively. For each class index, the code prompts the user to select regions of interest around the objects of that class. These regions of interest are the bounding boxes of the training dataset and are saved to a CSV file as a list of values corresponding to the following variables: class label, minimum X value, minimum Y value, maximum X value, and maximum Y value.
A CSV file is created for each image in the dataset, and each file contains one line of bounding box coordinates for every region of interest selected in that image. The Python data labeling script follows the same process for saving the bounding box data.
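If you would like a sense of how such a labeling workflow can be scripted, the sketch below mirrors the process just described using Python and OpenCV. It is only a minimal illustration, not the attached script itself: the folder paths, file extension, and class dictionary are placeholder assumptions, and, like the attached script, it needs a local graphical display (so it will not run in Google Colab).

```python
# Minimal sketch of a bounding box labeling workflow similar to the one described
# above. Assumes OpenCV is installed and images live in a local "images/" folder;
# paths and class indices are illustrative only.
import csv
import glob
import os

import cv2

CLASSES = {0: "bike", 1: "water bottle"}  # class index -> label name
IMAGE_DIR = "images"
LABEL_DIR = "labels"
os.makedirs(LABEL_DIR, exist_ok=True)

for image_path in sorted(glob.glob(os.path.join(IMAGE_DIR, "*.jpg"))):
    image = cv2.imread(image_path)
    rows = []
    for class_idx, class_name in CLASSES.items():
        # selectROIs opens a window; draw a box for each object of this class,
        # then press Esc to move on (this is why the script needs a local GUI).
        print(f"Select all '{class_name}' objects, then press Esc.")
        boxes = cv2.selectROIs(f"Label: {class_name}", image)
        for (x, y, w, h) in boxes:
            # Store as: class label, min X, min Y, max X, max Y
            rows.append([class_idx, x, y, x + w, y + h])
    cv2.destroyAllWindows()

    # One CSV per image, one line per selected region of interest
    name = os.path.splitext(os.path.basename(image_path))[0] + ".csv"
    with open(os.path.join(LABEL_DIR, name), "w", newline="") as f:
        csv.writer(f).writerows(rows)
```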
Two major learning paths were needed for our model to work accurately: bounding box regression and class label classification. Bounding box regression is the process by which the model infers, or predicts, the locations of the boxes within each image. In other words, the bounding box regression path strives to correctly predict the boxes around objects within the image. A bounding box regression model can learn from labeled bounding boxes, comparing its initial box location guesses to the ground truths set by the labeled boxes. Our model needed to both predict bounding box locations effectively and label these boxes.
The class label classification learning path determines the contents of a given bounding box by examining the pixel intensity distributions within the box. The classification model strives to correctly predict the class of the object located within the bounding box. The model learns by making an initial guess at the object within the bounding box and comparing the predicted label with the ground truth label provided in the model's training input. For our application, we only have two classes to differentiate between, but it is possible to train a classification model on more than two class labels.
We chose to further train a pre-trained model, an approach known as transfer learning, to create a model that is effective at both classification and regression. There are a variety of pre-trained models available for transfer learning exercises; the one we used needed to perform well on object detection and classification tasks. While searching for the optimal model for our application, we found two reasonable contenders: the VGG16 model and the MobileNetV2 model. We decided to build upon and train both of these models on our dataset and compare the results. During training, we set the same parameters for both models, as shown in the table on the left.
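The case study does not prescribe a specific framework, but a minimal sketch of the kind of two-headed transfer learning model described above might look like the following in Keras. The input size, layer sizes, losses, and optimizer settings are illustrative assumptions rather than the exact architecture or training parameters we used; swapping MobileNetV2 for VGG16 gives the second contender.

```python
# Illustrative two-head transfer learning model; layer sizes, losses, and the
# 224x224 input size are assumptions, not the exact architecture used in the study.
import tensorflow as tf
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import MobileNetV2  # swap in VGG16 to compare

def build_detector(num_classes=2, input_shape=(224, 224, 3)):
    backbone = MobileNetV2(include_top=False, weights="imagenet", input_shape=input_shape)
    backbone.trainable = False  # freeze the pre-trained convolutional base

    x = layers.GlobalAveragePooling2D()(backbone.output)
    x = layers.Dense(128, activation="relu")(x)

    # Head 1: bounding box regression (normalized x_min, y_min, x_max, y_max)
    bbox_output = layers.Dense(4, activation="sigmoid", name="bbox")(x)
    # Head 2: class label classification (bike vs. water bottle)
    class_output = layers.Dense(num_classes, activation="softmax", name="label")(x)

    model = Model(inputs=backbone.input, outputs=[bbox_output, class_output])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-4),
        loss={"bbox": "mse", "label": "sparse_categorical_crossentropy"},
        metrics={"label": "accuracy"},
    )
    return model

model = build_detector()
model.summary()
```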
The VGG series of pre-trained models is named for the group that developed it, the Visual Geometry Group (VGG) at the University of Oxford. The VGG16 model has sixteen layers in total, which accounts for the 16 in its name; 13 of these are convolutional layers and three are fully connected layers. The model was designed for image classification and object detection applications. Its known limitations are that it takes a long time to train and has roughly 138 million parameters, which can occasionally lead to exploding gradients.
The MobileNet series of models (V1, V2, and V3) was designed by Google primarily to provide state-of-the-art performance for machine learning on mobile (resource-constrained) devices across multiple tasks. One of the major benefits of the MobileNet models is the significantly reduced memory footprint required to run inference on a set of data, making them usable in a variety of embedded system applications. As the name suggests, the MobileNetV2 model is the second version in the series, with improvements over the previous version.
After training of each model was completed, there were a couple of immediate key differences between the models and their performance. The first was the time it took the models to complete the same amount of training on the same dataset: the VGG16 model took roughly three times longer, at a minimum, than the MobileNetV2 model. This large time difference is likely due to the memory requirements of the two models. Another key difference was found in the prediction images obtained from each model, which we discuss shortly.
To make numerical comparisons between the two models, various performance metrics were obtained for the classification performance of each model. An interesting outcome of training was that both models performed similarly (almost identically, in fact) in their classification of the objects within each image they predicted on. If you remember the Evaluating ML Models activity, you will recognize the metrics we obtained from each of our models after training them. We recorded the accuracy, precision, recall, and F1 score for comparison between the models. The table on the right provides the values of each of these performance metrics after training.
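As a reference, the short sketch below shows one common way to compute these four metrics with scikit-learn; the y_true and y_pred arrays are placeholder examples, not our actual predictions.

```python
# Computing the classification metrics mentioned above from predicted and true
# class labels; y_true and y_pred are placeholders for your own label arrays.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1]  # ground truth class indices (0 = bike, 1 = water bottle)
y_pred = [0, 1, 0, 0, 1]  # model predictions on the same images

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 score: ", f1_score(y_true, y_pred))
```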
With the numerical performance metrics we obtained, we knew that our classification learning path was about as optimized as we could make it without increasing the size of our dataset. However, these metrics did not provide much insight into which model to use for implementation. We did not have the right libraries at the time to obtain performance metrics for the bounding box regression of the models, so we analyzed this by examining the predicted bounding box locations and sizes.
As a visual analysis of the class label performance of both models, we plotted the confusion matrices of the trained models. A confusion matrix is a table comparing the correct and incorrect predictions made by a classification model. Confusion matrices can be plotted for almost any classification model, even one with more than two classes (or labels). In most confusion matrix plots, each cell is given a darker hue as the value within it rises; good performance appears as dark cells along the diagonal and nowhere else. We achieved this for both of our trained models. However, just as with the numerical classification metrics, the confusion matrices for the two models ended up being exactly the same. The matrix on the left is the VGG16 confusion matrix and the matrix on the right is the MobileNetV2 confusion matrix.
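If you want to produce a similar plot for your own model, the sketch below shows one way to do it with scikit-learn and matplotlib; the example label arrays are placeholders.

```python
# One way to plot a confusion matrix for a two-class model; the label names and
# example predictions are illustrative placeholders.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 0, 0]

ConfusionMatrixDisplay.from_predictions(
    y_true, y_pred, display_labels=["bike", "water bottle"], cmap="Blues"
)
plt.title("Confusion matrix")
plt.show()
```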
We chose to analyze the loss of each model over training to see if there were any patterns that were either concerning or promising. We found that the VGG16 model produced a concerning pattern in its loss over epochs: its training loss decreased steadily over the 10 epochs of training, but its validation loss fluctuated wildly. On the other hand, the MobileNetV2 model showed a steady decrease in both training and validation loss over the 10 epochs of training. The following two plots are the loss plots we used to come to these conclusions; the top plot is the VGG16 loss plot and the bottom plot is the MobileNetV2 loss plot.
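A plot like this can be produced directly from the training history. The sketch below assumes a Keras-style workflow in which history is the object returned by model.fit() with validation data supplied; that name is an assumption about your own training code.

```python
# Sketch of the loss-over-epochs plots described above; "history" is assumed to
# be the object returned by model.fit(...) when validation data was provided.
import matplotlib.pyplot as plt

plt.plot(history.history["loss"], label="Training loss")
plt.plot(history.history["val_loss"], label="Validation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.title("Loss over 10 training epochs")
plt.show()
```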
The final measure of the models' performance that we analyzed was the predictions made by each model. We used the same images as prediction inputs for both models and compared the output predictions. These predictions provided us with a rather obvious choice as to which model we should integrate into our application. We already knew that classification performed very similarly in both models, so by analyzing the prediction images we could make inferences about the performance of the bounding box regression for each model. The following four images are the prediction images used in our analysis; the first set was produced by the VGG16 model and the second set by the MobileNetV2 model.
Both models provided the correct classifications, but one of them, the MobileNetV2 model, clearly produced better bounding box locations. The MobileNetV2 model captured portions of each object within its boxes, while the VGG16 model's bounding box predictions were nowhere near either object. Remember also that the MobileNetV2 model showed the favorable pattern in its loss plots earlier. With all of these aspects in mind, the team decided that the MobileNetV2 model would be the one to incorporate into our project.
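For reference, the sketch below shows one way such prediction images can be generated: run an image through the model, scale the predicted box back to pixel coordinates, and draw it with OpenCV. It assumes a model shaped like the earlier sketch (normalized box coordinates plus class probabilities); the file names and preprocessing are placeholder assumptions.

```python
# Drawing a predicted bounding box and label on an image with OpenCV; assumes the
# model outputs normalized (x_min, y_min, x_max, y_max) coordinates as in the
# sketch model above. File names and preprocessing are illustrative.
import cv2
import numpy as np

CLASS_NAMES = {0: "bike", 1: "water bottle"}

image = cv2.imread("test_image.jpg")
h, w = image.shape[:2]

# Run the model on a resized, batched copy of the image
inp = cv2.resize(image, (224, 224))[np.newaxis].astype("float32") / 255.0
(box,), (probs,) = model.predict(inp)

# Scale normalized coordinates back to pixel values
x_min, y_min, x_max, y_max = [int(v) for v in box * [w, h, w, h]]
label = CLASS_NAMES[int(np.argmax(probs))]

cv2.rectangle(image, (x_min, y_min), (x_max, y_max), (0, 255, 0), 2)
cv2.putText(image, label, (x_min, max(y_min - 10, 0)),
            cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
cv2.imwrite("prediction.jpg", image)
```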
With our base model chosen, there are a few next steps that the team will be taking to best incorporate the project with this model. One of the main checklist items that is left is fine-tuning the model before implementing it onto the RockBot. To best fine-tune our model, we will also need to collect a larger dataset. The team will be continuing to collect data in the same fashion detailed in this case study, and we will be using this data to further train and optimize our model for our specific application. Fine-tuning will likely involve slight tweaks of various model hyper-parameters such as learning rate, number of epochs, and batch sizes. It will be a long trial and error process, but the model will be optimal for integration once it is complete!
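To give a flavor of what that fine-tuning might look like in code, the sketch below unfreezes part of the pre-trained backbone and retrains with a smaller learning rate. The layer count, learning rate, epoch count, and batch size are examples of hyper-parameters to experiment with, not values we have settled on, and the dataset variables are placeholders.

```python
# Example fine-tuning step: unfreeze the last layers of the backbone and retrain
# with a lower learning rate. All values here are illustrative; train_images,
# train_targets, val_images, and val_targets are placeholders for your own data.
import tensorflow as tf

for layer in model.layers[-20:]:   # unfreeze roughly the last block of layers
    layer.trainable = True

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-5),  # smaller rate for fine-tuning
    loss={"bbox": "mse", "label": "sparse_categorical_crossentropy"},
    metrics={"label": "accuracy"},
)

history = model.fit(
    train_images, train_targets,
    validation_data=(val_images, val_targets),
    epochs=20, batch_size=16,
)
```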
In order to implement the model on the RockBot, we will need to add computational power to the RockBot. Since our model does not take as much memory as other models, we don't need a huge computer on board; a Raspberry Pi or similar hardware may do the trick. We will keep you updated as more progress is made in this regard!
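One common route for running a trained Keras model on hardware like a Raspberry Pi, though not necessarily the approach we will ultimately take, is to convert it to TensorFlow Lite, which further shrinks the model's footprint. A minimal sketch of that conversion is shown below.

```python
# One common route (not necessarily the team's final plan) for running a Keras
# model on a Raspberry Pi: convert it to TensorFlow Lite to shrink its footprint.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable weight quantization
tflite_model = converter.convert()

with open("rockbot_detector.tflite", "wb") as f:
    f.write(tflite_model)
```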