Wide-angle fisheye cameras are common in a variety of settings such as traffic intersections, home security systems, surround action cameras, and increasingly in autonomous vehicles. The images captured by these cameras have a wide field of view, reducing the cost of deploying multiple cameras for surveillance. However, the major challenge in processing such images is the nonlinear distortion induced by fisheye lenses, which grows towards the image boundaries. This project extends existing work on object detection in fisheye images: we use the UNet deep learning architecture for vehicle segmentation on unrectified fisheye images, and we additionally check whether a perspective transformation of the spherical images improves segmentation accuracy.
This project extends existing work in the field of object detection on fisheye images: we compare the performance of the UNet deep learning algorithm on unrectified fisheye images with the same algorithm applied to fisheye images after a perspective transformation.
Fig.1 - Fisheye cameras on a vehicle in a parking lot
Fig.2 - Fisheye cameras at a traffic intersection
Fig.3 - (Desired outcome) - object detection on fisheye images
Fig.4 - Perspective Transformation from spherical to rectangular [5]
Fig.5 - UNet Architecture
Fisheye lenses are used to capture very wide-angle or 360° images. Such image-capturing devices find application in a variety of places such as traffic intersections, home security cameras, surround action cameras, and increasingly in autonomous vehicles. The images captured by these cameras have a wide field of view over which semantic segmentation, object detection, and classification techniques can be applied. However, the major challenge in processing these images is the nonlinearity induced by fisheye lenses; the distortion increases especially towards the boundaries.
Successful implementation of object detection algorithms on this kind of image would facilitate a variety of applications, by augmenting or improving existing systems. To name a few:
Object detection in 3D stitched images for efficient or autonomous parking
Wide field object detection at traffic signals
Augmentation for LiDAR in autonomous vehicles
Surveying using Drone cameras
We found a few notable approaches applied to this problem.
Cohen et al. (2018) proposed a different type of cross-correlation that can be used in deep learning networks designed for fisheye images. Instead of the 2D cross-correlation filters that are the building block of conventional convolutional neural networks for planar images, they defined spherical cross-correlation filters that are both expressive and rotation equivariant, and hence can serve as the building block of neural networks for fisheye images. They built a simple neural network based on spherical CNNs to classify the MNIST dataset and obtained comparable results.
Yaozu et al. (2020) employed a neural network named SwiftNet-18 on a synthetic dataset of fisheye images and obtained satisfactory results [1]. SwiftNet [2] has an architecture similar to UNet, but it uses a pretrained ResNet-18 model as the encoder network. The model shows satisfactory results, but the authors focused mainly on exploring various data augmentation methods for generating a synthetic fisheye image dataset from 2D planar images.
Yogamani et al. (2019) at Valeo AI introduced WoodScape, the first fisheye image dataset dedicated to autonomous driving [4]. It contains images from four cameras and annotations for 3D object detection and semantic segmentation.
We solved the vehicle detection problem by modeling it as an image segmentation problem and using the UNet architecture [3]. UNet can localize and differentiate borders by performing classification on each pixel, allowing the input and output to be the same size. Its symmetric design comprises an encoder and a decoder network. The original study demonstrated that this architecture can be trained end-to-end from very few images.
We investigated two approaches for recognizing vehicles in unrectified fisheye images using the UNet model. The first is to train a deep learning model to recognize vehicles directly on fisheye images. Several articles have stated that rectangular convolution is unsuitable for extracting features from distorted fisheye images. Our second approach is therefore to transform the unrectified fisheye images into equirectangular images before training and testing the UNet model on the transformed images.
Out of the entire WoodScape dataset [8], which contains 8,234 images in total, we used only 250 images because of the high processing cost. The 250 images were divided at random into training, test, and holdout sets: 150 images were used for training, 80 images were used to tune the hyper-parameter (learning rate), and 20 images served as a holdout set for the various metric calculations. We employed a random-cropping data augmentation strategy to prevent the model from overfitting.
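For illustration, a minimal sketch of this random split (the file list and seed are our own illustrative assumptions, not the project's actual code):

```python
import random

random.seed(0)                        # seed chosen arbitrarily, for reproducibility
files = list(all_image_files)         # all_image_files: hypothetical list of the 250 selected paths
random.shuffle(files)
train_files   = files[:150]           # 150 images for training
test_files    = files[150:230]        # 80 images for tuning the learning rate
holdout_files = files[230:250]        # 20 images held out for final metric calculations
```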
To train the model, a pretrained UNet was first downloaded from Torch Hub, a model library from which researchers can obtain a variety of pretrained models. Since it is simple to use through its Python API, researchers can immediately download models and adapt them to their needs. The downloaded model was then modified so that its input and output shapes were appropriate for our use case. After creating the training loop, the model was trained on our dataset with binary cross entropy as the loss function. During training we used a simple data augmentation strategy to prevent overfitting: the training loop randomly selects a portion of each image every epoch and trains on that crop. Because of this random-cropping strategy, the model's input shape was 3x640x640 when working on unrectified images; we attempted 1024x1024 random crops, but the image height was less than 1024 pixels, so we chose 640x640 instead. No augmentation was applied to the test and holdout sets. To update the model parameters we used the RMSProp optimizer with a learning rate of 1e-05; we chose RMSProp over the Adam optimizer because we found that it made convergence faster. We trained the models for a significant number of epochs and computed the Dice coefficient on the test set after each epoch. The best Dice score, and the model that achieved it, were recorded and saved with PyTorch's save function so that metrics could later be calculated on the holdout set.
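The sketch below outlines this training procedure. The Torch Hub entry point shown is one publicly available pretrained UNet (mateuszbuda/brain-segmentation-pytorch); it, along with train_loader, test_loader, the epoch count, and the evaluate_dice helper, is an assumption standing in for the project's actual code:

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# One publicly available pretrained UNet on Torch Hub (this model applies a
# final sigmoid in its forward pass, so plain BCELoss is used below).
model = torch.hub.load('mateuszbuda/brain-segmentation-pytorch', 'unet',
                       in_channels=3, out_channels=1, init_features=32,
                       pretrained=True)

criterion = nn.BCELoss()                                     # binary cross entropy
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-5)
crop = T.RandomCrop(640)                                     # random 640x640 crop as augmentation

best_dice = 0.0
for epoch in range(50):                                      # "significant number of epochs" (count illustrative)
    model.train()
    for image, mask in train_loader:                         # hypothetical DataLoader over the 150 training images
        # Apply the same random crop to image and mask by cropping them stacked.
        cropped = crop(torch.cat([image, mask], dim=1))
        image, mask = cropped.split([3, 1], dim=1)
        optimizer.zero_grad()
        loss = criterion(model(image), mask)
        loss.backward()
        optimizer.step()

    dice = evaluate_dice(model, test_loader)                 # hypothetical helper; formula given under Results
    if dice > best_dice:                                     # keep the best model seen so far
        best_dice = dice
        torch.save(model.state_dict(), 'best_unet.pt')
```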
Model finetuning on unrectified images
Model finetuning on rectified images
Training loss and validation dice score during training for unrectified images
Training loss and validation dice score during training for rectified images
Figure 13: UNET model
Video 1: What is UNET architecture? Video 2: How to Implement UNET?
The UNet architecture is symmetric and consists of two major parts: the left part is called the contracting path, built from the usual convolutional blocks, and the right part is the expansive path, built from transposed 2D convolution layers. The two videos above give a general description of the UNet model and how to implement one.
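As a minimal illustration of these two paths, the following sketch shows a toy two-level UNet in PyTorch (our own simplified code, not the full architecture from [3]):

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    """The basic convolutional block repeated along both paths."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, out_ch=1):
        super().__init__()
        self.enc1 = double_conv(in_ch, 64)           # contracting path, level 1
        self.pool = nn.MaxPool2d(2)
        self.enc2 = double_conv(64, 128)             # contracting path, level 2
        self.up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)  # expansive path
        self.dec1 = double_conv(128, 64)             # 128 channels = 64 upsampled + 64 skip
        self.head = nn.Conv2d(64, out_ch, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))  # skip connection
        return self.head(d1)                          # per-pixel output, same HxW as input
```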
As with the implementation of any segmentation methodology, the two most common challenges are:
High GPU memory requirement - Deep learning models are memory intensive. A single pass consumes a lot of memory if the model is big and the image has a high resolution. Techniques such as randomly cropping to a smaller size and choosing a smaller batch size are often used to counter this problem.
False positive detection - A false positive occurs when the model labels a background pixel as a vehicle pixel. Figures 14(a), (b), and (c) depict this issue, where a false positive identification is made on the building.
False negative detection - A false negative occurs when the model fails to identify a true vehicle pixel present in the ground truth. Figures 15(a), (b), and (c) depict this issue, where a false negative identification is made on the truck. (A sketch of how these error masks can be derived follows the figure captions below.)
Figure 14(a): Ground truth
Figure 14(b): Prediction of UNET model
Figure 14(c): False positives
Figure 15(a): Ground Truth
Figure 15(b): Prediction of UNET model
Figure 15(c): False negatives
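Given boolean prediction and ground-truth masks, the error overlays of Figures 14(c) and 15(c) can be derived as in this minimal sketch (pred_mask and gt_mask are assumed NumPy boolean arrays):

```python
import numpy as np

false_positives = pred_mask & ~gt_mask   # predicted vehicle, but background in ground truth
false_negatives = ~pred_mask & gt_mask   # vehicle in ground truth, missed by the model
```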
Figure 6: Transformations (source: [7] https://github.com/xtile/py-fisheye-dewarp/blob/master/README_img/entire_transformation.jpg)
The transformation of a fisheye image to a rectangular image is done in two steps:
Step 1: Conversion of fisheye to the spherical coordinate system
θ = atan2(y′, x′)
φ = (r · aperture) / 2
P(x, y, z) = (sin φ cos θ, cos φ, sin φ sin θ)
where
P′(x′, y′) are the fisheye pixel coordinates, normalised to [-1, 1], with radius r = √(x′² + y′²)
P is the corresponding point on the unit sphere; following the convention of [5], the optical axis of the lens points along the y axis
Step 2: Converting spherical coordinates to rectangular coordinates
latitude = asin(z/R)
longitude = atan2(x, y)
where R is the radius of the sphere (R = 1 for the unit sphere).
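The following NumPy sketch implements this forward mapping under the assumptions described later (a 180° isometric-aperture lens, optical axis along y); the function name and output scaling are our own illustrative choices:

```python
import numpy as np

def fisheye_to_equirect_forward(fish, aperture=np.pi):
    """Forward-map a square fisheye image onto an equirectangular canvas.
    Loops (vectorised) over input pixels, so it leaves the gaps discussed
    below under 'Forward mapping information loss'."""
    h, w = fish.shape[:2]
    out_h, out_w = h, 2 * h                        # ideally (R, 2*pi*R); scaled down here
    out = np.zeros((out_h, out_w) + fish.shape[2:], dtype=fish.dtype)

    ys, xs = np.mgrid[0:h, 0:w]
    xn = 2.0 * xs / (w - 1) - 1.0                  # normalise pixel coordinates to [-1, 1]
    yn = 2.0 * ys / (h - 1) - 1.0
    r = np.sqrt(xn ** 2 + yn ** 2)
    valid = r <= 1.0                               # keep only pixels inside the image circle

    # Step 1: fisheye -> point on the unit sphere (optical axis along y)
    theta = np.arctan2(yn, xn)
    phi = r * aperture / 2.0
    px = np.sin(phi) * np.cos(theta)
    py = np.cos(phi)
    pz = np.sin(phi) * np.sin(theta)

    # Step 2: sphere -> longitude/latitude -> output pixel
    lon = np.arctan2(px, py)                       # [-pi, pi]
    lat = np.arcsin(np.clip(pz, -1.0, 1.0))        # [-pi/2, pi/2]
    u = np.round((lon + np.pi) / (2 * np.pi) * (out_w - 1)).astype(int)
    v = np.round((lat + np.pi / 2) / np.pi * (out_h - 1)).astype(int)

    out[v[valid], u[valid]] = fish[ys[valid], xs[valid]]
    return out
```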
Figure 7: Input Fisheye Image
Figure 8: Unscaled fisheye transformation
Figure 9: Forward mapping
Figure 10: Forward mapping with wrong aperture
Figure 11: Inverse mapping attempt
Figure 12: Final output image
This presented many challenges, some of which are listed below:
Dimension of rectified images - As is evident from the formulae, the information available while unwrapping a fisheye image depends on the intrinsic properties of the lens, i.e., the radius of the fisheye lens (R), the vertical field of view (vfov), and the horizontal field of view (hfov). Ideally, the mapped output image is of size (R, 2πR).
Simplifying assumptions were made to limit the scope of program development: we developed a program for isometric-aperture lenses (those having equal vfov and hfov) and assumed a lens fov of 180°. The program works for other fovs as well but needs a modified centre-point definition. Upscaling is needed to deliver an output image of the same size as the input; Figure 8 depicts an unrectified and unscaled output of the transformation.
Aperture identification - Figure 10 depicts the transformation with the wrong aperture value.
Our dataset [8] includes images taken from four cameras: FV (Front View), RV (Rear View), MVL (Mirror View Left), and MVR (Mirror View Right). Of these, only images taken by the FV camera were chosen, as it is the primary camera for oncoming-vehicle detection. The camera intrinsic properties were constant across all these images, so the aperture was determined once for a single image and hardcoded into the program.
Forward mapping information loss - Forward mapping loops over all points of the input image and finds the corresponding point in the output image. The values of corresponding points can then be copied across, but this method leaves gaps in the output image, as is evident from Figure 9.
This can be overcome by inverse mapping, i.e., looping over points of the output image to find the corresponding points in the input image. Due to time constraints, we were unable to develop a working algorithm using inverse mapping. Such an algorithm would solve the information-loss issue, as is shown by the work of [6] (Scott, Fish eye lens dewarping and panorama stitching), [5] (Bourke, Converting a fisheye image into a panoramic, spherical or perspective projection), and [7] (Ha, xtile/py-fisheye-dewarp: Fisheye image dewarp / equirectangular rotation).
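For reference, a rough sketch of what such an inverse mapping could look like, following the sphere-to-fisheye equations from [5] (our own untested code, using the same naming and scaling conventions as the forward-mapping sketch above):

```python
import numpy as np

def equirect_from_fisheye_inverse(fish, aperture=np.pi):
    """Inverse mapping: loop (vectorised) over output equirectangular pixels
    and sample the fisheye input, so no output pixel is left unfilled."""
    h, w = fish.shape[:2]
    out_h, out_w = h, 2 * h
    vs, us = np.mgrid[0:out_h, 0:out_w]

    # Output pixel -> longitude/latitude
    lon = us / (out_w - 1) * 2 * np.pi - np.pi
    lat = vs / (out_h - 1) * np.pi - np.pi / 2

    # Longitude/latitude -> point on the unit sphere (optical axis along y)
    px = np.cos(lat) * np.sin(lon)
    py = np.cos(lat) * np.cos(lon)
    pz = np.sin(lat)

    # Sphere -> fisheye angle/radius -> input pixel (inverse of Step 1)
    theta = np.arctan2(pz, px)
    phi = np.arctan2(np.sqrt(px ** 2 + pz ** 2), py)
    r = 2.0 * phi / aperture
    xs = np.round((r * np.cos(theta) + 1.0) / 2.0 * (w - 1)).astype(int)
    ys = np.round((r * np.sin(theta) + 1.0) / 2.0 * (h - 1)).astype(int)

    valid = (r <= 1.0) & (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
    out = np.zeros((out_h, out_w) + fish.shape[2:], dtype=fish.dtype)
    out[valid] = fish[ys[valid], xs[valid]]
    return out
```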
Additionally, it was observed that these losses are concentrated towards the periphery of the output image; since our ground-truth data contains a mask, these small losses towards the periphery can be glossed over by using the mask.
Mirror inversion - Output images from the transformation are flipped (mirrored) along the vertical axis.
This can be easily remedied by flipping the output image matrix.
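With NumPy, for instance, this is a one-line fix:

```python
import numpy as np

output_image = np.fliplr(output_image)   # undo the mirror inversion along the vertical axis
```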
The performance of both approaches was compared and measured using predefined metrics such as Dice score, IOU score, Accuracy, Precision, Recall, and F1 score.
Dice Score - The Dice score is the metric used to assess model performance. The score ranges from 0 to 1, where 1 corresponds to a pixel-perfect match between the deep learning model output and the ground truth annotation. In terms of pixel counts, the Dice score is
Dice = 2·TP / (2·TP + FP + FN)
where TP is true positives, FP is false positives, TN is true negatives, and FN is false negatives.
IOU Score - The IOU (intersection over union) score is another metric commonly used in object detection. It is the ratio of the area of overlap to the area of union; at the pixel level, IOU = TP / (TP + FP + FN).
Accuracy, precision, recall, and F1 score are the most commonly used metrics for classification tasks; here they are computed as pixel-level classification metrics. Of these, the recall score is the most significant because it represents what fraction of true vehicle pixels is correctly predicted.
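For illustration, the following sketch shows how all of these pixel-level metrics follow from the four counts (masks are assumed to be boolean NumPy arrays; a small epsilon guards against empty denominators):

```python
import numpy as np

def segmentation_metrics(pred_mask, gt_mask, eps=1e-7):
    """Compute pixel-level metrics from two boolean masks."""
    tp = np.sum(pred_mask & gt_mask)        # vehicle pixels correctly predicted
    fp = np.sum(pred_mask & ~gt_mask)       # background predicted as vehicle
    fn = np.sum(~pred_mask & gt_mask)       # vehicle pixels missed
    tn = np.sum(~pred_mask & ~gt_mask)      # background correctly predicted
    precision = tp / (tp + fp + eps)
    recall    = tp / (tp + fn + eps)        # fraction of true vehicle pixels found
    return {
        'dice':      2 * tp / (2 * tp + fp + fn + eps),
        'iou':       tp / (tp + fp + fn + eps),
        'accuracy':  (tp + tn) / (tp + fp + fn + tn + eps),
        'precision': precision,
        'recall':    recall,
        'f1':        2 * precision * recall / (precision + recall + eps),
    }
```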
Figure 16: Metrics for UNET model on unrectified images.
Figure 17: Metrics for UNET model on rectified images.
The deep learning model performs better when trained and tested on the unrectified fisheye images. Although the accuracy metric appears marginally higher, it does not reflect actual model performance because there are far fewer vehicle pixels than background pixels.
Moving forward, a number of steps can be taken to further improve the detection rate and metric scores:
Inverse transformation mapping - With inverse mapping we would obtain better transformed images, as there would be no forward-mapping information loss. This method would also avoid upscaling the output images, and the output image would be the same size as the input image.
Training with a bigger dataset - Training with more data would yield a better model. With an improved GPU, we could also feed larger randomly cropped images into the UNet model.
Experimenting with other models such as DeepLabv3, LeNet, etc. to see whether they perform better than the general UNet model.
Making the rectified output images the same size as the unrectified ones.
[1] Y. Ye, K. Yang, K. Xiang, J. Wang and K. Wang, "Universal Semantic Segmentation for Fisheye Urban Driving Images," 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2020, pp. 648-655, doi: 10.1109/SMC42975.2020.9283099.
[2] M. Orsic, I. Kreso, P. Bevandic, and S. Segvic, "In defense of pre-trained ImageNet architectures for real-time semantic segmentation of road-driving images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12607-12616.
[3] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9351, 234-241. https://doi.org/10.1007/978-3-319-24574-4_28
[4] Yogamani, S., Hughes, C., Horgan, J., Sistu, G., Varley, P., O'Dea, D., Uřičář, M., Milz, S., Simon, M., Amende, K., Witt, C., Rashed, H., Chennupati, S., Nayak, S., Mansoor, S., Perrotton, X., & Pérez, P. (2019). WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). https://github.com/valeoai/WoodScape
[5] Bourke, P. (n.d.). Converting a fisheye image into a panoramic, spherical or perspective projection. Retrieved October 15, 2022, from http://paulbourke.net/dome/fish2/#1
[6] Scott, K.A. (no date) Fish eye lens dewarping and Panorama Stiching, kscottz. Available at: https://www.kscottz.com/fish-eye-lens-dewarping-and-panorama-stiching/ (Accessed: November 29, 2022).
[7] Ha, S. (no date) xtile/py-fisheye-dewarp: Fisheye image dewarp / equirectangular rotation, GitHub. Available at: https://github.com/xtile/py-fisheye-dewarp.git (Accessed: November 29, 2022).
[8] Valeo (no date) Dataset, Valeo Woodscape. Available at: https://woodscape.valeo.com/dataset (Accessed: November 29, 2022).