Manu S Pillai, Abhijeet Bhattacharya, Tanmay Baweja, Rohit Gupta, Mubarak Shah
ABSTRACT
Unlike traditional optoelectronic satellite imaging, Synthetic Aperture Radar (SAR) allows remote sensing applications to operate under all weather conditions. This makes it uniquely valuable for detecting ships/vessels involved in illegal, unreported, and unregulated (IUU) fishing. While recent work has shown significant improvement in this domain, detecting small objects using noisy point annotations remains an unexplored area. To meet the unique challenges of this problem, we propose a progressive training methodology that utilizes two different spatial sampling strategies. First, we use stochastic sampling of background points to reduce the impact of class imbalance and missing labels; second, during the refinement stage, we use hard negative sampling to improve the model. Experimental results on the challenging xView3 dataset show that our method outperforms conventional small object localization methods on a large, noisy dataset of SAR images.
Table 1: Comparison of our approach with state-of-the-art methods for small object localization and generic object localization on the validation split of the xView3 dataset.
[G] : Generic method not adapted for small objects
Our proposed training pipeline. The input SAR image bands are converted to false-color RGB, and then the FPN architecture is used to extract multi-resolution features combined using the "Upsampling Stage" to create high-resolution 128-dimensional pixel-level feature vectors. The extracted features are then used to predict foreground regions using a Foreground-Background (FB) classifier head (class-agnostic detector). Finally, the predicted foreground regions and feature vectors are used to predict the semantic segmentation map using a Semantic Segmentation (SS) head.
DATASET
In this work we use the recently released xView3 dataset. The SAR images in the dataset were obtained from the Copernicus Sentinel-1 mission of the European Space Agency (ESA), which comprises a constellation of two polar-orbiting satellites (S1A and S1B) with a repeat cycle of 6 days, operating in all weather conditions, day and night. These SAR images were annotated by Global Fishing Watch (GFW), which created a database of ship detections and offshore infrastructure using the Automatic Identification System (AIS) and Vessel Monitoring System (VMS). The SAR images were acquired over areas with different demographic situations: near-shore areas with intense vessel traffic, open-ocean and island environments, and both low and high latitudes. The coverage spans the North Sea, the Bay of Biscay, Iceland, the Adriatic Sea, and other parts of Europe, as well as West Africa, a region that leads in IUU activity.
The dataset contains a total of 991 SAR scenes, which can be broken down into three categories: scenes where the majority of vessels matched to AIS and could thus be identified (514), scenes where the majority of vessels did not match to vessels broadcasting AIS (70), and scenes with a large amount of offshore infrastructure such as wind turbines or oil platforms (407). A single SAR image/scene consists of two polarization bands, VH (vertical-horizontal) and VV (vertical-vertical). In addition to the polarization data, each SAR image/scene also includes a set of low-resolution (500-meter spatial resolution) ancillary rasters taken from the Sentinel-1 Level-2 Ocean (OCN) product and the General Bathymetric Chart of the Oceans (GEBCO). These rasters provide bathymetry, wind speed and direction, wind quality, and land/ice masks.
The full dataset yields a total of 260,000 positive detections, annotated using two methods. The first used GFW's automated approach of detecting objects in SAR imagery and matching these detections to vessels broadcasting AIS and VMS; it produced 554 SAR images containing a total of 60,000 detections. The second used professional labelers to annotate the given SAR scenes; it produced 437 SAR images containing a total of 200,000 detections. The latter scenes were prioritised for the validation, public leaderboard, and holdout sets because of their likely higher quality.
The dataset is partitioned into four subsets: a train set, containing GFW labels only, with a total of 500 images; a validation set, containing both hand and GFW labels, with a total of 50 SAR images; a public set, containing both hand and GFW labels, with a total of 100 SAR images; and a holdout set, containing both hand and GFW labels, with a total of 200 SAR images.
Samples from the xView3 dataset annotated with bounding boxes. Because the dataset provides no bounding-box information, we treat every object as 10x10 pixels when drawing the boxes (note that all boxes are the same size). {red: "non-vessel", green: "fishing vessel", blue: "non-fishing vessel"}
CHALLENGES
Along with the challenges inherent to small object detection and SAR imagery, this dataset poses additional challenges that cause traditional object detection methods to fail:
1. Severe Class Imbalance
For a given detection in the dataset, we only have its center pixel location. Reframing the problem as a segmentation task therefore results in a huge class imbalance: for a 300x300 image containing 2 maritime objects, only 2 pixels constitute the object ground truth, leaving 300x300 - 2 = 89,998 background pixels.
To address this issue, we prepare the ground truth labels using a Gaussian kernel centered at each object centroid. Instead of a single pixel corresponding to an object, we spread the center-pixel label around the center location so that more pixels belong to the foreground class. Precisely, we generate a Gaussian kernel of the same size as the input image, with its mean set to the centroid location of an object. For multiple objects in an image, we sum all the generated individual kernels to obtain a final kernel with multiple modes at the centroid locations present in that image. We then normalize the kernel to the range [0, 1] and consider all pixels with a value greater than a threshold T to be foreground. In our experiments, we used T = 0.5. This label-generation procedure ensures that the number of pixels belonging to the foreground class is considerably larger and that the majority of true object pixels are assigned to the foreground class. It should be noted, however, that this procedure alone was not enough to address the severe class imbalance, as sketched below.
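A minimal sketch of this label-generation procedure, assuming an illustrative kernel width sigma (the text does not fix this hyper-parameter):

import numpy as np

def gaussian_labels(centroids, height, width, sigma=2.0, threshold=0.5):
    # Spread single-pixel centroid labels into Gaussian blobs and
    # threshold to obtain a binary foreground mask.
    ys, xs = np.mgrid[0:height, 0:width]
    kernel = np.zeros((height, width), dtype=np.float64)
    for cy, cx in centroids:
        # One image-sized Gaussian per object, summed into a multi-modal kernel.
        kernel += np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.max() + 1e-8        # normalize to [0, 1]
    return (kernel > threshold).astype(np.uint8)

# Example: two objects in a 300x300 chip now contribute far more
# than 2 foreground pixels.
mask = gaussian_labels([(50, 60), (200, 220)], 300, 300)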
2. SAR Domain for Computer Vision
Little computer vision work has used SAR datasets, so pretrained vision models cannot be used directly as feature extractors in this setting. Although the data is copious enough that a pretrained model may seem unnecessary, pretrained feature extractors are valuable for faster experimentation and convergence across many different tasks.
Keeping in mind the practicality and usefulness of incorporating pretrained feature extractors, we convert the dual-polarization SAR scenes into false-color RGB optical scenes.
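As an illustrative sketch, assuming a common Sentinel-1 false-color recipe (R = VV, G = VH, B = VH/VV, with a percentile contrast stretch) rather than the exact formulation:

import numpy as np

def sar_to_false_color(vh, vv, eps=1e-6):
    # Robust per-band contrast stretch to [0, 1].
    def norm(x):
        lo, hi = np.percentile(x, (2, 98))
        return np.clip((x - lo) / (hi - lo + eps), 0.0, 1.0)
    r = norm(vv)                   # assumed: VV drives the red channel
    g = norm(vh)                   # assumed: VH drives the green channel
    b = norm(vh / (vv + eps))      # assumed: band ratio as the blue channel
    return np.stack([r, g, b], axis=-1)   # HxWx3 false-color image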
3. Small Size of Objects
As previously mentioned, the small size of the objects makes detection hard for conventional methods, and harder still in images with sea clutter and noise. This motivates using feature extractors that do not reduce the spatial resolution of the image. In our proposed methodology, we use a Feature Pyramid Network (FPN), which pools features across different scales, as the feature extractor.
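A minimal sketch using torchvision's off-the-shelf FeaturePyramidNetwork, assuming ResNet-style channel sizes and an upsample-and-sum combination for the "Upsampling Stage" described in the pipeline:

import torch
import torch.nn.functional as F
from collections import OrderedDict
from torchvision.ops import FeaturePyramidNetwork

# FPN that merges multi-resolution backbone features into 128-d maps.
fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024, 2048],
                            out_channels=128)

# Dummy backbone features at strides 4, 8, 16, 32 for an 800x800 chip.
feats = OrderedDict(
    c2=torch.randn(1, 256, 200, 200),
    c3=torch.randn(1, 512, 100, 100),
    c4=torch.randn(1, 1024, 50, 50),
    c5=torch.randn(1, 2048, 25, 25),
)
pyramid = fpn(feats)

# "Upsampling Stage" (assumed): bring every level to the finest
# resolution and sum, giving one 128-d feature vector per pixel.
target = pyramid["c2"].shape[-2:]
pixel_feats = sum(F.interpolate(p, size=target, mode="bilinear",
                                align_corners=False)
                  for p in pyramid.values())
# pixel_feats: (1, 128, 200, 200)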
4. High Resolution of the SAR Images
The input SAR images are approximately 20,000 pixels in height and width, which makes processing them at full size computationally expensive for most algorithms. Fortunately, these high-resolution images can be chipped into non-overlapping patches without losing much generality. In our proposed solution, we chip the input SAR scenes into non-overlapping 800x800 patches. Although the competition organisers suggest using the entire SAR scene as a whole for better performance, in our experiments we observed no significant performance boost commensurate with the increased computational requirements.
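A minimal chipping sketch, assuming zero-padding of the scene borders up to a multiple of the chip size:

import numpy as np

def chip_scene(scene, chip=800):
    # Pad bottom/right so height and width are multiples of the chip
    # size, then cut the (H, W, C) scene into non-overlapping chips.
    h, w = scene.shape[:2]
    pad_h, pad_w = (-h) % chip, (-w) % chip
    padded = np.pad(scene, ((0, pad_h), (0, pad_w), (0, 0)), mode="constant")
    return [padded[r:r + chip, c:c + chip]
            for r in range(0, padded.shape[0], chip)
            for c in range(0, padded.shape[1], chip)]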
5. Noisy Label Annotations
The dataset contains noisy label annotations, i.e., labels for objects that are not discernible in the input image, as well as missing labels where the SAR scene contains a visually noticeable object. Both types of noisy labels can affect training negatively.
To address this challenging problem, we propose an end-to-end, two-stage progressive training methodology. The two stages are differentiated primarily by the loss-sampling technique employed. During the first training stage, which uses a higher learning rate, Stochastic Sampling is employed to reduce the impact of mislabelled data. In the second, refinement stage, which uses a lower learning rate, Hard Negative mining improves the model further.
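A minimal sketch of the two sampling strategies as a per-pixel cross-entropy wrapper, assuming illustrative values for the background keep-probability and hard-negative ratio:

import torch
import torch.nn.functional as F

def sampled_ce_loss(logits, labels, stage, bg_keep=0.1, hn_ratio=3):
    # logits: (N, C) per-pixel class scores; labels: (N,), 0 = background.
    per_pixel = F.cross_entropy(logits, labels, reduction="none")
    fg = labels > 0
    if stage == 1:
        # Stochastic sampling: keep all foreground pixels, but only a
        # random fraction of the background pixels.
        keep = torch.rand_like(per_pixel) < bg_keep
        mask = fg | (~fg & keep)
    else:
        # Hard negative mining: keep the highest-loss background pixels,
        # up to hn_ratio times the number of foreground pixels.
        k = min(hn_ratio * max(int(fg.sum()), 1), int((~fg).sum()))
        bg_loss = per_pixel.masked_fill(fg, float("-inf"))
        mask = fg.clone()
        mask[bg_loss.topk(k).indices] = True
    return per_pixel[mask].mean()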
EVALUATION
1. Post Processing
2. Evaluation Metrics
1. Maritime Object Detection
This metric uses the standard definitions of precision, recall, and F1 score. The F1 score is computed from the true positives, false positives, and false negatives of the detected ships in a batch of SAR images.
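For reference, the standard definitions used by all of the F1-based metrics below are:

\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}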
2. Close-to-Shore Object Detection
This metric also uses the standard definitions of precision, recall, and F1 score. Here the F1 score is computed from the true positives, false positives, and false negatives of the detected ships that lie within 2 km of the shoreline in a batch of SAR images.
3. Vessel Classification
This metric also uses the standard definitions of precision, recall, and F1 score. Here the F1 score is computed from the true positives, false positives, and false negatives based on whether a detected object is a "vessel/ship" or a "non-vessel", the latter comprising man-made structures such as oil rigs, small islands, and wind turbines, in a batch of SAR images.
4. Fishing Classification
This metric also uses the standard definitions of precision, recall, and F1 score. Here the F1 score is computed from the true positives, false positives, and false negatives based on whether a detected ship is a "fishing ship" or a "non-fishing ship" in a batch of SAR images.
5. Vessel Length Estimation
This metric uses the percentage error between the predicted length of a detection and its labelled length in a batch of SAR images.
6. Overall Metric
The overall metric, scoring the whole pipeline between 0 and 1, is an amalgam of all the individual metrics.
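Following the xView3 challenge documentation (notation matching the metrics above), the aggregate can be written as:

M = F1_D \cdot \frac{1 + F1_S}{2} \cdot \frac{1 + F1_V}{2} \cdot \frac{1 + F1_F}{2} \cdot \frac{1 + PE_L}{2}

where F1_D, F1_S, F1_V, and F1_F are the detection, close-to-shore, vessel-classification, and fishing-classification F1 scores, and PE_L is the length-estimation score derived from the percentage error above.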
RESULTS
Once the segmentation network is trained, the predicted segmentation map must be converted into object centroids for evaluation. All evaluations are performed according to the metrics provided by the xView3 Challenge. We test both small object localization models and generic object localization models: for small objects we report scores for FarSeg, FactSeg, and PSPNet, while for generic objects we test DeepLabv3, DeepLabv3+, and DenseASPP. As expected, the generic object localization models perform poorly on the xView3 dataset, with the state-of-the-art segmentation model DenseASPP achieving a Detection F1 of only 0.1893. The results are presented in Table 1.
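A minimal sketch of this map-to-centroid conversion, assuming a simple threshold-and-connected-components post-processing with scipy:

import numpy as np
from scipy import ndimage

def segmap_to_centroids(fg_prob, threshold=0.5):
    # Threshold the predicted foreground map, group foreground pixels
    # into connected components, and return one (row, col) centroid
    # per component; the 0.5 threshold is an assumption.
    binary = fg_prob > threshold
    labeled, n = ndimage.label(binary)
    if n == 0:
        return []
    return ndimage.center_of_mass(binary, labeled, range(1, n + 1))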
1. Ablation Study
To establish the effectiveness of each component of our strategy, we train and evaluate our model architecture at every stage. The scores are computed on the validation split of the dataset and are given in Table 2. To evaluate Stage 1, we train the softmax classifier directly on the dataset with the Random Sampling procedure, while Stage 2 is evaluated both with and without the Hard Negative Sampling procedure. The results show that the scores improve after each stage for almost all metrics, with Stage 2 plus Hard Negative Sampling achieving the highest detection F1 of 0.6207.
Table 2: Performance at each of our progressive training stages. HN denotes Hard Negative sampling.
2. Method Robustness
Method robustness across three settings: (top) an input scene with multiple maritime objects; (middle) a challenging scene with high sea clutter and noise; (bottom) scenes containing landmass. For each scene we show the predicted foreground-background segmentation map, the predicted semantic segmentation map, and the ground truth segmentation map {red: "non-vessel", green: "fishing vessel", blue: "non-fishing vessel"}.
3. Qualitative Comparison
Qualitative comparison of our approach (Ours) with PSPNet, FarSeg, and FactSeg for small object localization on the validation split of the xView3 dataset, shown alongside the input SAR image. {red: "non-vessel", green: "fishing vessel", blue: "non-fishing vessel"; red circles mark false detections or misclassifications by the model}
CONCLUSION AND FUTURE SCOPE
In this work, we propose a model architecture, DeepSAR, for maritime object localization and present a novel two-stage progressive training methodology involving two loss-sampling techniques (Stochastic and Hard Negative sampling). The first stage focuses on training the model to detect foreground regions, and the second stage refines the foreground predictions into the respective object classes while pushing the model to reduce false positives. Our experimental results on the challenging xView3 dataset show that our method outperforms conventional small object and generic object localization methods on Synthetic Aperture Radar images.