Detecting features of the Earth's surface accurately and in a timely manner is essential for understanding how natural phenomena and human activity interact, and for supporting good decision-making. With the continued development of remote sensing technology, change detection methods based on remotely sensed images have found use in land resource planning, disaster monitoring, and urban infrastructure expansion, among other fields.
We aim to build a change detection model for aerial images of the same geographical location, detecting the changes between two images captured years apart.
The code can be found at https://github.com/mridulk97/change_detection
Change detection is essentially the computer-vision equivalent of spot-the-difference: given two images, the model must identify all the points at which they differ. In a remote sensing context, these are satellite or aerial images of the same geographical location at two different time instances.
Change detection is an active research area and the literature is rich with algorithms, ranging from image differencing and principal component analysis to spectral mixture analysis and artificial neural networks. Each of the proposed algorithms has its own benefits, and no single algorithm is optimal or applicable to every scenario. In our case, we aim to detect the changes between two images while dealing with image variations due to occlusion, changes in viewpoint, and illumination effects.
To effectively detect the changes between two images of the same scene while overcoming variations due to viewpoint and illumination, our approach employs the U-Net based encoder-decoder architecture described in the recent paper "The Change You Want to See" [1]. The original implementation of their approach can be found here. The model follows a Siamese architecture consisting of a U-Net that generates feature descriptors. The descriptors are modulated by a co-attention module [2] that captures the correspondences between the two images. The descriptors generated by the encoder and the modulated descriptors from the co-attention module are concatenated in the decoder. The paper also describes a no-attention variant that skips the co-attention modulation and combines the descriptors generated by the two encoders directly, passing them on to the decoder. We use this variant as a baseline for comparing against co-attention in our approach.
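The overall structure can be summarized with a minimal PyTorch sketch. This is only an illustration of the wiring described above, not the authors' actual code: `encoder`, `co_attention`, and `decoder` are placeholder modules, and the no-attention branch simply skips the modulation step.

```python
import torch
import torch.nn as nn

class SiameseChangeNet(nn.Module):
    """Rough sketch of the Siamese U-Net with optional co-attention.
    `encoder`, `co_attention`, and `decoder` are placeholder modules,
    not the classes from the original implementation."""

    def __init__(self, encoder, co_attention, decoder, use_attention=True):
        super().__init__()
        self.encoder = encoder            # shared weights -> Siamese
        self.co_attention = co_attention
        self.decoder = decoder
        self.use_attention = use_attention

    def forward(self, img_t1, img_t2):
        # Shared encoder produces a feature descriptor for each image.
        f1 = self.encoder(img_t1)
        f2 = self.encoder(img_t2)

        if self.use_attention:
            # Co-attention modulates one descriptor using correspondences
            # with the other image.
            f2_mod = self.co_attention(f1, f2)
        else:
            # No-attention baseline: skip the modulation step entirely.
            f2_mod = f2

        # Encoder descriptors and (possibly modulated) descriptors are
        # concatenated and passed to the decoder.
        return self.decoder(torch.cat([f1, f2_mod], dim=1))
```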
The original implementation of this change detection method detects changes in the form of bounding boxes. This is, however, not well suited to aerial images, where the changes are more nuanced and can span a wide region. In some cases, a bounding box would have to cover more than a third of the entire image in order to enclose the region of change, as illustrated in Figure 2, which could lead the model to produce poor results. Therefore, in our approach, we instead detect changes using segmentation masks.
For this purpose, we modify the original architecture by replacing the Bbox head (used for predicting bounding boxes) with a segmentation head at the end of the U-Net, which produces binary masks of the regions that have changed over time. This head lets the model predict change (1) versus no change (0) at the pixel level. The segmentation head is a single convolutional layer that generates a binary mask from the feature representation produced by the U-Net. We further experimented with a segmentation head with multiple convolutional layers; the results are provided in Table 1.
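As a rough illustration, such a head can be a single 1x1 convolution that maps the U-Net feature map to per-pixel change/no-change logits. The channel count below is an assumed value for the sketch, not the one used in our code.

```python
import torch
import torch.nn as nn

# Minimal sketch of the segmentation head: a single convolution mapping the
# U-Net feature map to per-pixel change / no-change logits. `feat_channels`
# is an assumed value; the real count depends on the U-Net decoder output.
feat_channels = 64
seg_head = nn.Conv2d(feat_channels, 2, kernel_size=1)     # 2 classes: no change / change

features = torch.randn(1, feat_channels, 256, 256)        # dummy U-Net output
logits = seg_head(features)                                # shape (1, 2, 256, 256)
pred_mask = logits.argmax(dim=1)                           # shape (1, 256, 256), values in {0, 1}
```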
We use pixel-level cross-entropy as our loss function. Alongside it, we compute the Dice coefficient as a metric to quantify the similarity between the predicted and ground-truth masks. Dice is widely used for segmentation tasks because it is well suited to imbalanced data, and most segmentation tasks are imbalanced since the foreground to be detected is much smaller than the background. In our case, the changed pixels are far fewer than the unchanged pixels.
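For reference, the Dice coefficient we report can be computed from a predicted and a ground-truth binary mask as in the sketch below; the small smoothing term is an assumed detail to avoid division by zero on empty masks.

```python
import torch

def dice_coefficient(pred_mask: torch.Tensor, gt_mask: torch.Tensor, eps: float = 1e-6) -> float:
    """Dice = 2 * |pred ∩ gt| / (|pred| + |gt|), computed over the change pixels.
    Both inputs are binary masks of the same shape; `eps` avoids division by zero."""
    pred = pred_mask.float().flatten()
    gt = gt_mask.float().flatten()
    intersection = (pred * gt).sum()
    return ((2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)).item()
```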
Figure 1 describes the overall model architecture.
Figure 1: Model Architecture of the Co-attention segmentation module
We use the SEmantic Change detectiON Dataset (SECOND) 2022 [3], which contains aerial image pairs. Each pair consists of images of the same region photographed at different times, together with pixel-level segmentation maps denoting the changes for each image in the pair. The images were captured over the cities of Hangzhou, Chengdu, and Shanghai in China.
The dataset contains a total of 4662 image pairs, of which 2968 are used for training and 1694 for testing. We further divide the training pairs into a 90-10 split for training and validation.
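A minimal sketch of such a split is shown below; `paired_dataset` is a placeholder standing in for our paired-image dataset object, and the fixed seed is an assumed detail for reproducibility rather than a documented choice.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Placeholder dataset standing in for the 2968 training image pairs.
paired_dataset = TensorDataset(torch.zeros(2968, 1))

n_val = int(0.1 * len(paired_dataset))                     # 10% held out for validation
train_set, val_set = random_split(
    paired_dataset,
    [len(paired_dataset) - n_val, n_val],
    generator=torch.Generator().manual_seed(42),           # assumed seed for reproducibility
)
```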
Figure 2: Original Image and the corresponding ground truth with bounding box
The originally provided segmentation maps have 6 class labels for different types of change, such as changes in vegetation, water bodies, and buildings. In our approach, we convert the segmentation map into a binary mask, turning the task into binary image segmentation, i.e., detecting only whether a change occurred between the image pair.
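The conversion itself is straightforward: every pixel carrying a change class label is mapped to 1 and the rest to 0. A minimal sketch, assuming label 0 marks unchanged pixels in the provided maps:

```python
import numpy as np

def to_binary_mask(label_map: np.ndarray, no_change_label: int = 0) -> np.ndarray:
    """Collapse the 6-class semantic change map into a binary change mask.
    Assumes `no_change_label` marks unchanged pixels (our reading of the
    annotation format); every other label (vegetation, water, buildings, ...)
    is simply treated as 'change'."""
    return (label_map != no_change_label).astype(np.uint8)
```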
The images in the dataset are 512x512 pixels. Due to limited computing resources, most of our experiments were run after resizing the images to 256x256 pixels, and these are the models we use for comparison. Note that the model is robust enough to run on higher-resolution images: in a few experiments run on the full 512x512 images, it gives slightly better results, as shown in Table 3.
We initially started our experiments with ResNet50 as the backbone and achieved a Dice score of 0.87 on the test dataset. As we noticed the validation loss diverging after a few epochs, we experimented with a learning-rate scheduler that halves the learning rate every time the validation loss plateaus or increases. This did not improve the test Dice score and only caused the training loss to saturate, so we continued with a constant learning rate of 0.0001. We later tried a ResNet18 backbone and found it to perform marginally better than ResNet50, so we used ResNet18 for the remaining experiments.
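This scheduling behavior corresponds to PyTorch's ReduceLROnPlateau with a halving factor; the sketch below shows how it is wired up, with a stand-in model and an assumed patience value for illustration.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 2, kernel_size=1)                     # stand-in for the full model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # the constant LR we settled on

# Scheduler variant we experimented with: halve the learning rate whenever the
# validation loss plateaus or increases (the patience value is an assumption).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)

val_loss = 0.42                                            # dummy value for illustration
scheduler.step(val_loss)                                   # called once per epoch in practice
```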
Furthermore, we tried a model with multiple convolutional layers in the segmentation head instead of a single convolutional layer. As this did not improve performance, we continued with a single convolutional layer in the segmentation head.
Table 1: Evaluating the Results for different backbone architectures
Comparing the model with co-attention against the model without any attention mechanism, we observed the co-attention model to perform marginally worse, which matched the trend observed in the original paper. Building on this, we evaluated the models with data augmentations consisting of affine transformations (a combination of rotation, translation, and scaling) applied to both images of the pair. We chose the rotation to be a random angle between -30° and +30°, the scaling factor to be between 0.8 and 1.5 along the x and y axes, and the translation to be between 0 and 20% along the x and y axes (as specified in the original implementation).
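A sketch of this augmentation using torchvision's RandomAffine with the ranges listed above is shown below. Note that RandomAffine samples a single isotropic scale factor, whereas the original implementation scales x and y independently; keeping the transform consistent across both images of the pair and the ground-truth mask is handled in the data pipeline and omitted here.

```python
from torchvision import transforms

# Affine augmentation with the ranges described above: rotation in
# [-30°, +30°], scaling in [0.8, 1.5], and translation of up to 20%
# along each axis.
affine_aug = transforms.RandomAffine(
    degrees=(-30, 30),
    translate=(0.2, 0.2),
    scale=(0.8, 1.5),
)
```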
The results after these transformations were not what we expected for co-attention (see the table below). The original paper reported that the co-attention model works better under transformations when predicting bounding boxes, but in our case the no-attention model performs better when predicting the segmentation map.
We further tried a rotation-only transformation, since aerial images from different times are generally unlikely to be scaled by different factors along the x and y axes. Using random rotations between -30° and +30°, we observed the same trend: the co-attention model performed no better than the no-attention model.
Table 2: Results for Co-attention vs No-attention for different affine transformations
Lastly, we tested the model on high-resolution images, i.e., 512x512 pixels. The results improved for both the co-attention and no-attention models, which suggests the model is robust to higher-resolution inputs. Due to the lack of computational resources, we could not test the other models on 512x512 images.
Table 3: Results for higher-resolution images of size 512
The following images illustrate the predictions of our best-performing model with the co-attention module on the test set:
Before Change
After Change
Original Image
Ground Truth
Model Prediction
Figure 3: Results of the model run with the co-attention network
The following images illustrate the predictions of our best-performing model without attention on the test set:
Before Change
After Change
Original Image
Ground Truth
Model Prediction
Figure 4: Results of the model run with no attention
We observe that our model performs reasonably well, with predictions closely resembling the ground truth despite the many landscape changes.
In addition, our model is able to cope with illumination differences, as is evident from the images.
However, we can also see that the model without attention captures finer details a little better than the co-attention model.
From our quantitative and qualitative observations, it appears that the co-attention method is not performing any better than the method without attention when it comes to remote sensing change detection.
Based on the results obtained from our experiments, we infer that co-attention is not particularly advantageous for change detection in aerial images. Although the co-attention model was claimed to perform better on image pairs with a large affine transformation between them, we did not observe any such improvement over our baseline, even after applying affine transformations.
We identify two possible explanations for these results:
First, co-attention was designed and evaluated specifically for bounding box prediction, and it may not be an ideal method for detecting changes in the form of segmentation masks.
Second, co-attention might inherently be unsuited to detecting changes in aerial images, which are generally very high-resolution. Downsizing the images may cause the loss of critical low-level information that would help the model make better predictions. We observed performance to improve slightly when we increased the image resolution to 512x512, as can be seen in Table 3, but we could not perform further experiments to validate this claim due to limited computing resources.
Furthermore, we would like to expand on this work by verifying our findings on higher-resolution images and by trying other aerial datasets such as CLCD [8].
We would like to express our gratitude to Dr. Lynn Abbott for his instruction and for equipping us with the skills needed to work on computer vision at this level. The project has also given us an in-depth understanding of the fundamental yet crucial algorithms and concepts necessary to tackle real-world computer vision challenges. Furthermore, we would like to thank our coursemates for being a part of this learning process.
We are also grateful for the work of Ragav Sachdeva and Andrew Zisserman, authors of the paper "The Change You Want to See", which served as the foundation for our project.
[1] Ragav Sachdeva, Andrew Zisserman. The Change You Want to See. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2023.
[2] Olivia Wiles, Sébastien Ehrhardt, and Andrew Zisserman. Co-attention for conditioned image matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15920–15929, 2021.
[3] Yang, Kunping, et al. "Semantic change detection with asymmetric Siamese networks." arXiv preprint arXiv:2010.05687 (2020).
[4] Yang W, Song H, Du L, Dai S, Xu Y. A Change Detection Method for Remote Sensing Images Based on Coupled Dictionary and Deep Learning. Comput Intell Neurosci. 2022 Jan 17;2022:3404858. doi: 10.1155/2022/3404858. PMID: 35082842; PMCID: PMC8786482.
[5] https://www.azavea.com/blog/2022/04/18/change-detection-with-raster-vision/
[6] Change detection techniques. International Journal of Remote Sensing (Taylor & Francis), vol. 25, no. 12, pp. 2365-2407, 2003.
[7] You, Y.; Cao, J.; Zhou, W. A Survey of Change Detection Methods Based on Remote Sensing Images for Multi-Source and Multi-Objective Scenarios. Remote Sens. 2020, 12, 2460. https://doi.org/10.3390/rs12152460
[8] M. Liu, Z. Chai, H. Deng and R. Liu, "A CNN-Transformer Network With Multiscale Context Aggregation for Fine-Grained Cropland Change Detection," in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 4297-4306, 2022, doi: 10.1109/JSTARS.2022.3177235.