Advanced Applied Deep Learning

Lecture Course

Sheng Yun Wu

Week 11: Mask R-CNN – Instance Segmentation

Objective:

To introduce students to Mask R-CNN, a powerful model that extends Faster R-CNN to perform both object detection and instance segmentation. Students will learn how Mask R-CNN works, how it builds on Faster R-CNN, and how it can be used to detect and segment objects in an image at the pixel level. By the end of the week, students will understand the structure of Mask R-CNN and be able to implement it for object detection and instance segmentation tasks.

Lecture 1: Introduction to Instance Segmentation

11.1 What is Instance Segmentation?

Definition:
- Instance segmentation is the task of detecting objects in an image and simultaneously segmenting each object at the pixel level. It not only predicts a bounding box and class label for each object but also identifies exactly which pixels belong to that particular object instance.
Difference Between Object Detection, Semantic Segmentation, and Instance Segmentation:
- Object Detection: Locates objects in an image by predicting bounding boxes and class labels.
- Semantic Segmentation: Classifies each pixel in an image as belonging to a specific class but does not distinguish between different instances of the same class.
- Instance Segmentation: Combines both, detecting each object and segmenting it at the pixel level while distinguishing between multiple instances of the same class.
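
The difference between the three outputs can be made concrete on a toy image. The sketch below is illustrative only: it assumes NumPy and a hand-made 4x6 scene containing two objects of the same class, and all variable and function names are hypothetical.

```python
import numpy as np

# Semantic segmentation: one class id per pixel (0 = background, 1 = object class).
# The two objects share the same id, so they cannot be told apart.
semantic_map = np.array([
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

# Instance segmentation: one binary mask per object, plus a class label per mask.
instance_masks = np.zeros((2, 4, 6), dtype=bool)
instance_masks[0, 0:2, 1:3] = True   # first object
instance_masks[1, 1:3, 4:6] = True   # second object
instance_labels = [1, 1]


def mask_to_box(mask):
    """Tightest box (x_min, y_min, x_max, y_max) around a mask, inclusive indices."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


# Object detection: one box per object, with no pixel-level information.
boxes = [mask_to_box(m) for m in instance_masks]

# Collapsing the instances recovers the semantic map; the reverse is not possible
# without extra information about which pixels belong to which object.
collapsed = np.zeros((4, 6), dtype=int)
for mask, label in zip(instance_masks, instance_labels):
    collapsed[mask] = label
```

Here `boxes` holds one box per object, `semantic_map` holds one class id per pixel, and `instance_masks` holds one mask per object, which is the output Mask R-CNN is designed to produce.
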
Applications of Instance Segmentation:
- Autonomous driving (detecting and segmenting pedestrians, vehicles).
- Medical imaging (detecting and segmenting tumors).
- Augmented reality (segmenting objects for interaction).

Lecture 2: Mask R-CNN Architecture and Workflow

11.2 Mask R-CNN Overview

What is Mask R-CNN?
- Mask R-CNN extends Faster R-CNN by adding a branch that predicts a segmentation mask for each detected object, in parallel with the existing bounding box and class label branches.
Key Components of Mask R-CNN:
1. Feature Extraction with CNNs: Similar to Faster R-CNN, the input image is passed through a backbone CNN (e.g., ResNet or ResNeXt) to generate feature maps.
2. Region Proposal Network (RPN): A Region Proposal Network generates candidate object proposals (as in Faster R-CNN).
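3. RoI Align: Instead of RoI Pooling (used in Faster R-CNN), Mask R-CNN uses RoI Align, which avoids the coordinate quantization of RoI Pooling. This keeps the extracted features aligned with the input pixels, which matters most for mask prediction and also benefits bounding box accuracy.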
4. Bounding Box and Class Prediction: As in Faster R-CNN, the model predicts bounding boxes and class labels for each region proposal.
5. Segmentation Mask Prediction: A new branch, a small fully convolutional network applied to each region of interest, predicts a binary mask (1s and 0s) for each object, indicating which pixels belong to the object and which do not.

11.3 Mask R-CNN Workflow

Step-by-step Workflow:
- Feature Extraction: The image is passed through a backbone CNN to generate feature maps.
- Region Proposal: The RPN generates a set of region proposals for potential objects.
- RoI Align: The region proposals are projected onto the feature maps, and RoI Align is applied to extract features for each proposal.
- Bounding Box and Class Prediction: Each proposal is classified, and the bounding box is refined.
- Segmentation Mask Prediction: A binary mask is predicted for each object to segment it at the pixel level.
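
The last step of the workflow can be sketched in a few lines. In the original Mask R-CNN design the mask branch outputs, for each region, one small fixed-resolution mask per class; the mask of the predicted class is selected, resized to the box, and binarized at a threshold. The sketch below assumes NumPy, uses a nearest-neighbour resize as a simplification, and the function name `paste_mask` and its arguments are hypothetical.

```python
import numpy as np


def paste_mask(mask_logits, label, box, image_shape, threshold=0.5):
    """Turn the mask-branch output for one region into a full-image binary mask.

    mask_logits: array of shape (K, m, m), raw scores, one channel per class.
    label:       predicted class index for this region.
    box:         (x1, y1, x2, y2) integer pixel coordinates, x2 and y2 exclusive.
    image_shape: (H, W) of the input image.
    """
    # Per-pixel sigmoid on the channel of the predicted class only.
    probs = 1.0 / (1.0 + np.exp(-mask_logits[label]))

    x1, y1, x2, y2 = box
    box_h, box_w = y2 - y1, x2 - x1
    m = probs.shape[0]

    # Nearest-neighbour resize from m x m to the box size (an assumption made
    # for brevity; a smoother interpolation is normally used in practice).
    rows = np.arange(box_h) * m // box_h
    cols = np.arange(box_w) * m // box_w
    resized = probs[rows[:, None], cols[None, :]]

    # Binarize and paste into an empty image-sized mask.
    full = np.zeros(image_shape, dtype=bool)
    full[y1:y2, x1:x2] = resized >= threshold
    return full


# Toy example: 3 classes, a 2x2 mask, predicted class 1.
logits = np.full((3, 2, 2), -5.0)
logits[1] = [[5.0, -5.0], [-5.0, 5.0]]
full_mask = paste_mask(logits, label=1, box=(2, 1, 6, 5), image_shape=(6, 8))
```

The resulting `full_mask` has the size of the image and is nonzero only inside the predicted box, which is why the quality of the box and the alignment of the region features both affect the final segmentation.
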
RoI Align vs. RoI Pooling:
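- RoI Pooling (used in Faster R-CNN) rounds the real-valued coordinates of each region to the discrete cells of the feature map, and rounds again when dividing the region into bins. These quantization steps can shift the extracted features relative to the region they are meant to describe.
- RoI Align removes the quantization: it keeps the real-valued coordinates and computes feature values at exact sampling locations using bilinear interpolation, improving the accuracy of both bounding box and mask predictions.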
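
The effect of quantization can be seen on a toy feature map. The sketch below is a simplified, pure-Python illustration, not the implementation used by any library. It assumes a backbone stride of 16, a feature map whose value equals its x coordinate (so each output directly shows where the features were sampled), and average pooling in both variants so that only the quantization differs (RoI Pooling normally uses max pooling). All names are hypothetical.

```python
import math


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at real-valued (y, x)."""
    h, w = len(feature), len(feature[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_size=2, samples=2):
    """RoI Align sketch: roi = (y1, x1, y2, x2) in feature-map units, never rounded."""
    y1, x1, y2, x2 = roi
    bin_h, bin_w = (y2 - y1) / out_size, (x2 - x1) / out_size
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sampling points inside bin (i, j).
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    total += bilinear_sample(feature, y, x)
            row.append(total / samples ** 2)
        out.append(row)
    return out


def roi_pool_avg(feature, roi, out_size=2):
    """RoI Pooling sketch with quantized coordinates (average instead of max)."""
    y1, x1, y2, x2 = (int(round(v)) for v in roi)      # first quantization
    h, w = y2 - y1 + 1, x2 - x1 + 1
    out = []
    for i in range(out_size):
        ys = y1 + math.floor(i * h / out_size)         # second quantization
        ye = y1 + math.ceil((i + 1) * h / out_size)
        row = []
        for j in range(out_size):
            xs = x1 + math.floor(j * w / out_size)
            xe = x1 + math.ceil((j + 1) * w / out_size)
            cells = [feature[y][x] for y in range(ys, ye) for x in range(xs, xe)]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out


STRIDE = 16                                            # assumed backbone stride
feature = [[float(x) for x in range(8)] for _ in range(8)]
box_xyxy = (26.0, 19.0, 90.0, 83.0)                    # proposal in image pixels
x1, y1, x2, y2 = (v / STRIDE for v in box_xyxy)        # project onto the feature map
roi = (y1, x1, y2, x2)

pooled = roi_pool_avg(feature, roi)
aligned = roi_align(feature, roi)
shift_px = (pooled[0][0] - aligned[0][0]) * STRIDE     # misalignment in image pixels
```

In this example the aligned bins are centred at x = 2.625 and 4.625 on the feature map, which are the true bin centres of the projected region, while the quantized bins are centred at 3 and 5. The difference of 0.375 feature cells corresponds to a shift of 6 image pixels at stride 16, which is small for a bounding box but significant for a pixel-level mask.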

Lecture 3: Differences Between Mask R-CNN and Faster R-CNN

11.4 Mask R-CNN vs. Faster R-CNN

Instance Segmentation:
- The main difference between Mask R-CNN and Faster R-CNN is the addition of a mask prediction branch. Mask R-CNN can perform pixel-level segmentation, while Faster R-CNN only predicts bounding boxes and class labels.
RoI Align:
- Mask R-CNN introduces RoI Align to improve the precision of object detection and mask prediction. Faster R-CNN uses RoI Pooling, whose coordinate quantization may result in misalignments between a region of interest and the features extracted for it.
Speed and Complexity:
- Faster R-CNN is somewhat faster than Mask R-CNN because Mask R-CNN runs an additional mask prediction branch, although the original Mask R-CNN paper describes this overhead as small. In return, Mask R-CNN provides a more detailed output by segmenting each object.

11.5 Use Cases of Mask R-CNN

Medical Imaging:
- Segmenting different types of cells or tumors at the pixel level for accurate diagnosis.
Autonomous Vehicles:
- Segmenting pedestrians, cyclists, and vehicles to improve object recognition and safety.
Robotics and Augmented Reality:
- Object manipulation and interaction, where the exact boundaries of objects are important.

Practical Session: Implementing Mask R-CNN for Instance Segmentation

Objective: Implement Mask R-CNN for instance segmentation using a pre-trained model and evaluate its performance on detecting and segmenting objects in an image.

Dataset: COCO or PASCAL VOC dataset (or a custom dataset with segmentation labels).

Key Steps:

Step 1: Load a Pre-trained Mask R-CNN Model
- Use a deep learning framework like PyTorch or TensorFlow to load a pre-trained Mask R-CNN model (e.g., from Detectron2, which is built on PyTorch, or Matterport’s Mask R-CNN implementation, which is built on Keras and TensorFlow).
Step 2: Perform Instance Segmentation
- Apply the Mask R-CNN model to test images to detect and segment objects.
- Visualize the predicted bounding boxes, class labels, and segmentation masks.
Step 3: Fine-tune the Mask R-CNN Model
- Fine-tune the pre-trained Mask R-CNN model on a custom dataset with fewer classes or specific domain data (e.g., medical images, industrial datasets).
- Adjust the RoI Align parameters (for example, the output resolution and the number of sampling points per bin) and the segmentation mask threshold to improve performance.
Step 4: Evaluate the Model
- Evaluate the performance of the Mask R-CNN model using metrics like mean Average Precision (mAP, reported separately for bounding boxes and masks), Intersection over Union (IoU), and pixel accuracy.
- Compare the accuracy and speed of Mask R-CNN with Faster R-CNN in terms of both bounding box prediction and segmentation quality.
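
The building blocks of these metrics are simple to compute by hand. The sketch below assumes NumPy and shows IoU for boxes, IoU for masks, and pixel accuracy on a toy example; the function names are hypothetical, and mAP itself is not computed here because it additionally requires matching predictions to ground truth across score thresholds, which is normally delegated to a dataset's evaluation toolkit.

```python
import numpy as np


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mask_iou(pred, target):
    """IoU of two binary masks of the same shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    return float(np.logical_and(pred, target).sum() / union) if union > 0 else 0.0


def pixel_accuracy(pred, target):
    """Fraction of pixels (object and background) labelled identically."""
    return float((pred.astype(bool) == target.astype(bool)).mean())


# Toy example: a 4x4 predicted square offset by 2 pixels from the ground truth.
pred_mask = np.zeros((8, 8), dtype=bool)
true_mask = np.zeros((8, 8), dtype=bool)
pred_mask[0:4, 0:4] = True
true_mask[2:6, 2:6] = True

iou_boxes = box_iou((0, 0, 4, 4), (2, 2, 6, 6))
iou_masks = mask_iou(pred_mask, true_mask)
accuracy = pixel_accuracy(pred_mask, true_mask)
```

In this example both IoU values are 4/28 (about 0.14), while pixel accuracy is 0.625 because the many correctly labelled background pixels are counted as well. This is one reason IoU-based metrics are usually preferred over pixel accuracy when objects cover a small part of the image.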

Assignment for Week 11:

Coding Assignment:

Implement Mask R-CNN using a pre-trained model and apply it to a custom dataset for instance segmentation.
Fine-tune the model and experiment with different RoI Align parameters and segmentation mask thresholds.
Measure the model’s performance in terms of both object detection (bounding box) and instance segmentation (mask).

Analysis:

Compare the performance of Mask R-CNN and Faster R-CNN in terms of accuracy and speed.
Analyze the effect of RoI Align on the accuracy of both bounding box and mask predictions.

Reading Assignment:

Read Chapter 12 of "Advanced Applied Deep Learning" by Umberto Michelucci.
- Focus on understanding the architecture of Mask R-CNN and how it extends Faster R-CNN to perform instance segmentation.

Summary of Key Concepts:

Instance Segmentation: The task of detecting and segmenting each object in an image at the pixel level.
Mask R-CNN: An extension of Faster R-CNN that adds a branch for predicting segmentation masks in addition to bounding boxes and class labels.
RoI Align: A technique used in Mask R-CNN to improve the accuracy of bounding box and mask predictions by eliminating quantization errors.
Comparison with Faster R-CNN: Mask R-CNN is slower than Faster R-CNN but provides more detailed outputs by performing instance segmentation.

This week introduces students to Mask R-CNN, a widely used model for instance segmentation. Students will gain practical experience implementing Mask R-CNN and understanding how it extends Faster R-CNN to provide pixel-level object segmentation. They will also explore the use of RoI Align to improve the accuracy of both bounding boxes and segmentation masks.
