[per-ˌī-ˈdō-lē-ə] | noun
The tendency to perceive a specific, often meaningful image in a random or ambiguous visual pattern
URL: https://sites.google.com/andrew.cmu.edu/pareidolia-in-clouds
Team: Adhokshaja Madhwaraj, Deepti Upmaka, Neha Boloor, Vineeth Reddy Vatti
In this project "Pareidolia in Clouds" we aim to model pattern abstractions that cannot be inherently modeled by mathematical equations. By focusing on learning patterns in clouds we aim to learn latent pattern representations in a creative fashion. We will be using CLIPasso as our backbone which will enable us to abstract sketch like representations of the cloud images. We will be focusing on finding animal images within the clouds. One of the parts we explored in depth was generating cloud like images of animals since there does not exist an extensive cloud database to train on. Our results show that it is possible to learn patterns in clouds that might not be obvious.
Patterns are one of the most intriguing things we observe in nature, especially those that repeat themselves in different contexts. We are surrounded by a plethora of visual patterns: some, such as symmetry, fractals, and spirals, can be modeled mathematically, while others are more abstract.
As humans, we are able to recognize these patterns with ease and even take it one step further by imagining things that are not explicitly there. Now imagine what it would be like to give the computer this power of imagination. Without any predefined patterns or features, are we still able to learn latent representations? This project explores what kind of low-level features we can learn from natural scenes, in our case, cloud formations. Much like the childhood pastime of finding shapes and animals within cloud formations, we seek to recreate this experience in a visual learning context.
The novelty of our idea is detecting something that is not there. We are attempting to model pattern abstractions that cannot be inherently modeled by mathematical equations. The challenge in this problem is that the computer needs to learn these latent pattern representations creatively.
Image Credits: Adobe Stock
Image Credits: Shutterstock
Figure 1: Examples of animal like cloud formations
In this section we introduce a few related works. They cover the theory behind line drawings, finding a low-level sketch representation for images, and other technical concepts we have implemented in our pipeline.
A fundamental idea in computer vision is to use features like edges to represent or segment an image. This paper [2] discusses the nuances of why this might not always be the best representation for line drawings. Humans perceive beyond just edges, taking into consideration illumination and negative space. This also means an image can be represented by additional features, such as depth, that are not encoded in edge representations. This is applicable to our use case since we are not just trying to detect the edges of clouds, but also to incorporate the specularity and opacity of the clouds themselves to uncover the underlying image.
CLIPasso [4] is a very recent work, published in February 2022, that generates a high-fidelity, doodle-like abstract depiction of an input image. By following an optimization-based photo-to-sketch generation technique, it achieves different levels of complexity without requiring an explicit sketch dataset, where complexity is defined by the number of strokes used in the sketch. It uses an image encoder to inform the geometric structure and outputs a sketch at varying levels of complexity, while still maintaining both the semantics and structure of the subject depicted. What makes this approach unique is that it is not trained on a sketch dataset and is class-agnostic.
Neural Style Transfer was first introduced in 2015 in the paper A Neural Algorithm of Artistic Style [9]. This revolutionary paper introduced the idea of blending a content image and a style image such that the output resembles the content image but has the stylistic features of the style image. This synergy between style and content creates new art. Built on a VGG-19 backbone, the method uses a deep convolutional network to take interesting features from both the content and style images. The content loss preserves what is captured in the higher layers, while the style loss captures lower-level details like brush strokes [10]. The choice of which layers are used for the content and style features is one of the ways to control what the output image looks like.
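The sketch below illustrates these two losses, assuming PyTorch; the feature-map names and the Gram-matrix normalization are illustrative rather than taken from the project code.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    # feat: (batch, channels, height, width) activations from a VGG layer
    b, c, h, w = feat.size()
    flat = feat.view(b, c, h * w)
    # Channel-to-channel correlations capture "style" (textures, brush strokes).
    return flat @ flat.transpose(1, 2) / (c * h * w)

def content_loss(generated_feat, content_feat):
    # Deeper-layer activations of the output should match the content image.
    return F.mse_loss(generated_feat, content_feat)

def style_loss(generated_feat, style_feat):
    # Gram matrices of shallower-layer activations should match the style image.
    return F.mse_loss(gram_matrix(generated_feat), gram_matrix(style_feat))
```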
Neural Style Transfer works very well with images that do not have noisy backgrounds. As mentioned before, it blends a content and a style image in an artistic fashion. In our problem, we deal with highly noisy images of the sky, onto which animals are painted in the style of clouds. To ensure that the training images are not noisy, we use Mask R-CNN [7] to select only those regions of the base image we need to consider. Mask R-CNN is the current state of the art for instance segmentation and returns both bounding boxes and region masks for each instance of the required class. The architecture has a feature-extractor backbone and a region-proposal network. Mask R-CNN improves upon Faster R-CNN by adding an additional branch that predicts the instance segmentation mask of the queried class over each Region of Interest (RoI). In our implementation, we use a Mask R-CNN pretrained on the COCO dataset and consider the classes that are common to both COCO and our animal dataset.
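A minimal sketch of this masking step, assuming torchvision's pretrained Mask R-CNN; the score threshold and the hard-coded COCO category ids for our five shared animal classes are our own choices, not part of the original pipeline.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

# COCO category ids for the classes shared with Animals-10
# (cat, dog, horse, sheep, elephant in COCO's label indexing).
KEEP_CLASSES = {17, 18, 19, 20, 22}

def animal_mask(image_pil, score_thresh=0.7):
    with torch.no_grad():
        pred = model([to_tensor(image_pil)])[0]
    for label, score, mask in zip(pred["labels"], pred["scores"], pred["masks"]):
        if score > score_thresh and label.item() in KEEP_CLASSES:
            return mask[0] > 0.5  # boolean mask of the first confident animal instance
    return None
```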
ResNet [8] is one of the most widely used backbones in computer vision today. As the field progressed toward deeper networks, it was observed that these networks were harder to train because of vanishing gradients during backpropagation through many layers: repeated multiplication of small derivative values leads to infinitesimally small gradients. ResNet bypasses this issue with a shortcut connection (also called a skip connection), in which the input to a ResNet block is also passed through and added to the block's output. We use a pretrained ResNet model with an extra linear output layer to train our classifier on the outputs of CLIPasso.
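A minimal sketch of how such a classifier head can be attached, assuming torchvision; the specific ResNet depth is an assumption on our part.

```python
import torch.nn as nn
import torchvision

def build_sketch_classifier(num_classes: int = 5) -> nn.Module:
    # Pretrained ResNet backbone; the exact depth (18/34/50) is an assumption here.
    model = torchvision.models.resnet18(pretrained=True)
    # Replace the final fully connected layer with a new linear output layer
    # sized for our animal classes (dog, cat, elephant, horse, sheep).
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```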
In the related works above, the input image is given, whereas in our use case we do not know what we can "see" in the clouds. Part of our goal is to teach the computer how to be imaginative. Much like illusions, we do not always see what the underlying structure is. We are therefore combining the childhood pastime of finding animals in cloud formations with visual learning. The main concept of our project is to identify animal structures in clouds that are not always evident and might even differ from person to person. Our proposed solution first finds latent representations of curves and high-level features in the clouds, and then labels them with the closest animal.
Below we have identified the key steps in our execution pipeline:
Building our dataset: In order to conduct our various experiments, we need labeled data. Since there are no readily available datasets for animal cloud representations, we need to build our own. We first collect a series of animal images with a plain background, or crop the animals out of their backgrounds. Then we apply cloud style transfer, which gives us both our training dataset and the annotations from the ground-truth pre-transformed images.
Run Inference with CLIPasso: Using the annotated dataset that we built in the previous step, we run it through CLIPasso to get sketch-like representations for a set of animals.
Train Network: Train a network to identify sketch-like curves and match them to the closest animal class. The final output will be the original image with an overlay of the detected animal sketch.
Now that we have a better idea about our proposed solution, let us discuss our dataset before jumping into the approach.
The main dataset that we used was the Animals-10 Kaggle dataset [5]. This dataset, as the name suggests, includes 10 categories of animals: dog, cat, horse, spider, butterfly, chicken, sheep, cow, squirrel, and elephant. For this reason, we were limited in the types of animals we could predict in our clouds. One caveat of this dataset is that the images were collected via Google Images and then looked over by a human annotator. To simulate in-the-wild scenarios that arise when web scraping for images, the authors included some datapoints with interesting characteristics, such as drawings of animals instead of photographs, or images with watermarks on them. We did not do any pre-processing on this dataset since we wanted to use Style Transfer to change the entire appearance of each image.
Figure 2: A block diagram representing our training pipeline
Figure 3: A block diagram representing our testing pipeline
We have split our approach into two sections, training and testing. During the training phase, shown in Figure 2, we need to prepare the data, get the CLIPasso output, and train our classifier. One of the main challenges of this project was the data preparation step. Many of the clouds we see do not look like animals, and there are no cloud datasets readily available, so we decided to create our own. This step encompasses the first three parts of the training pipeline, where we needed to be thoughtful about how we use Style Transfer to convert internet images of animals into cloud-like images.
We tested out three different methods for Neural Style Transfer [6]. The first method is the vanilla implementation: it takes in a style image and a content image, uses mean squared error to compute the style and content losses, and runs on a VGG-19 backbone. The parameters we have control over are which convolutional layer(s) to use for the style and content, as well as how much to weight the style image and the content image during the style transfer. The content is taken from the fourth convolutional layer and the style is taken from all five layers. In Method 1, the vanilla implementation, the weights are set to 1,000,000 for style and 1 for content. As seen in the leftmost image in Figure 5, the result is clearly a horse without much representation from the cloud. In Method 2, we weighted the content lower at 0.001, which definitely helps with retaining more of the cloud imagery. However, the issue still stands that the style-transferred image is highly similar to the input animal image. Our goal here is to generate cloud images that have a slight resemblance to an animal in our dataset while still retaining some realistic qualities.
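The configuration below sketches these choices, following the structure of the PyTorch style-transfer tutorial [6]; the layer names and the total-loss helper are illustrative, not the project's exact code.

```python
# VGG-19 layers used for the two losses (content from the 4th conv layer,
# style from all five), mirroring the tutorial's naming convention.
content_layers = ["conv_4"]
style_layers = ["conv_1", "conv_2", "conv_3", "conv_4", "conv_5"]

# Method 1 (vanilla): the output stays close to the animal photo.
style_weight_m1, content_weight_m1 = 1_000_000, 1
# Method 2: down-weight the content to keep more of the cloud texture.
style_weight_m2, content_weight_m2 = 1_000_000, 0.001

def total_loss(style_score, content_score, style_weight, content_weight):
    # The optimizer (L-BFGS on the input image in the tutorial) minimizes this sum.
    return style_weight * style_score + content_weight * content_score
```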
In Method 3, we made the largest upgrade to our style transfer. Since we want to represent the entire cloud as an animal, it was important for us to have a clear mask separating the animal from the background. Through our experiments we also found that cloud images containing mostly a single cloud with some depth to it work the best. The cloud images we used in Methods 1 and 2 either lacked depth, as in the left image in Figure 4, or had too many smaller clouds, as in the center image in Figure 4. This brings us to the addition of Mask R-CNN as a precursor to feeding the images into Neural Style Transfer. As shown in the third column, the cloud we chose had depth and did not contain many smaller clouds. The output image bears a slight resemblance to a sheep while still looking like a cloud we could find in real life.
Method 1
Neural Style Transfer (Vanilla)
Method 2
Neural Style Transfer (Changed Hyperparameters)
Method 3
Mask R-CNN + Neural Style Transfer (Changed Hyperparameters)
Image Credits: Unsplash
Image Credits: iStock
Image Credits: Unsplash
Figure 4: Input cloud images to Style Transfer
Horse
Dog
Sheep
Figure 5: Output cloud-like Style Transferred animal images
This is then passed into CLIPasso. As mentioned earlier, CLIPasso generates a class-agnostic sketch from a given masked input (an additional masking step is performed if an image with a background is fed into the network), allowing for varying levels of abstraction. Essentially, given an input image, we synthesize the corresponding sketch while maintaining both the semantic and geometric attributes of the subject at a high level of abstraction. We set the number of strokes to 10 in our case, to ensure a high level of abstraction while still preserving the semantics of the subject. A sketch in our case is a set of 10 black strokes (Bezier curves) whose initialization is based on the activation map generated from the input image, giving us a rough idea of the locations of the salient features. A differentiable rasterizer renders the strokes, and the stroke parameters are optimized with respect to a CLIP-based perceptual loss that preserves semantics and an L2 loss that preserves the geometric aspects of the sketch. At each step, the stroke parameters are fed into the rasterizer to render a sketch; the resulting sketch, as well as the original image, are then fed into CLIP to define the CLIP-based perceptual loss. One of the important ideas in this method is the use of intermediate layers of the pre-trained CLIP model to constrain the geometry of the output sketch. The loss is backpropagated through the differentiable rasterizer and the strokes' control points are updated directly at each step until convergence. In this way we generate abstract representations (sketches) of the animal-like cloud images, thereby enabling the model to learn these abstract patterns creatively.
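The loop below is a pseudocode-level sketch of this optimization as we use it; `init_strokes_from_saliency`, `rasterize`, `clip_perceptual_loss`, and `geometric_loss` stand in for the actual CLIPasso, differentiable-rasterizer, and CLIP components and are not real library calls.

```python
import torch

NUM_STROKES = 10        # few strokes -> highly abstract sketch
NUM_ITERATIONS = 2000   # illustrative; run until the sketch converges

# Bezier control points initialized from the input image's saliency/activation map.
strokes = init_strokes_from_saliency(image, num_strokes=NUM_STROKES)
optimizer = torch.optim.Adam([p for s in strokes for p in s.parameters()], lr=1.0)

for step in range(NUM_ITERATIONS):
    sketch = rasterize(strokes)  # differentiable rasterizer renders the current strokes
    # Semantic term: CLIP embeddings of sketch and image should agree.
    # Geometric term: intermediate CLIP activations of the two should agree (L2).
    loss = clip_perceptual_loss(sketch, image) + geometric_loss(sketch, image)
    optimizer.zero_grad()
    loss.backward()              # gradients flow back to the stroke control points
    optimizer.step()
```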
In this way we generate sketches for all of the input images and then train a classifier network to classify the sketches so produced.
The pipeline for testing, detailed in Figure 3, is straightforward. Here we assume we already have cloud images, whether from the style-transferred data or real cloud photographs. These are passed into CLIPasso, which includes the improved masking, and the resulting sketches are classified.
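A hedged end-to-end inference sketch matching Figure 3 is shown below; `run_clipasso` is a hypothetical wrapper around the CLIPasso step, `animal_mask` refers to the Mask R-CNN sketch above, and the tensor shapes are assumptions.

```python
import torch
from PIL import Image

CLASS_NAMES = ("cat", "dog", "elephant", "horse", "sheep")

def classify_cloud(image_path: str, classifier: torch.nn.Module) -> str:
    cloud = Image.open(image_path).convert("RGB")
    mask = animal_mask(cloud)                                # optional masking step
    sketch = run_clipasso(cloud, mask=mask, num_strokes=10)  # hypothetical wrapper; assumed to return a CHW tensor
    with torch.no_grad():
        logits = classifier(sketch.unsqueeze(0))             # e.g. the ResNet head from build_sketch_classifier()
    return CLASS_NAMES[logits.argmax(dim=1).item()]
```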
In this section you can find our results from running CLIPasso on the cloud-transferred images as well as on real images. Since we used a Mask R-CNN trained on the COCO dataset, we needed to eliminate some of the categories in Animals-10; our final list of animals is dog, cat, elephant, horse, and sheep. We then take the outputs from CLIPasso and pass them through our ResNet-based classifier to label them as one of these five classes. In order to make the classifier more reliable, we trained the model on ground-truth labels carried over from the original (pre-style-transfer) animal images.
In Figure 6 we show a few examples of the synthetic data. On these synthetic images, we are able to generate reliable sketches of the clouds and predict the animal classes with high accuracy. In some cases, such as Image 1 and Image 5, the model also picks up finer details.
Image 1: Dog
Image 2: Dog
Image 3: Elephant
Image 4: Cat
Image 5: Elephant
Image 6: Horse
Image 7: Cat
Image 8: Sheep
Figure 6: Output Sketches from CLIPasso overlaid on Input Cloud Image with Class Label for Synthetic Data
Figure 7 shows results on the real data: images of clouds that we took ourselves and then passed into CLIPasso. The main difference between this data and the synthetic data is that there are additional smaller cloud pieces outside the main cloud. While some of the clouds look very much like animals, not all of those animals are included in our dataset. For example, Image 3 can look like a rabbit to us (and we observed that many clouds do), but since it was not one of the categories available during training, it could not be included in the final list. We also included images that are visibly outside the training data distribution. Image 6 and Image 8 are not traditional cloud images with a blue sky and a white cloud; instead, they represent clouds with shadows and negative space. We define negative space here as trying to find an animal in the sky region instead of in the clouds themselves. These two were interesting use cases, and we wanted to see how well CLIPasso would perform. It was able to capture high-level detail, with the outline falling in the region we expected it to highlight.
Image 1: Dog
Image 2: Cat
Image 3: Sheep
Image 4: Elephant
Image 5: Cat
Image 6: Dog
Image 7: Sheep
Image 8: Cat
Figure 7: Output Sketches from CLIPasso overlaid on Input Cloud Image with Class Label for Real Data
Figure 8 details the classification loss and accuracy for different hyperparameters. We achieved the highest accuracy with a learning rate of 0.0001 and a batch size of 16 (green curve). Looking closely, we can also observe that the loss is lowest for the green curve. We obtain a training accuracy of 88.24% after this hyperparameter tuning.
Classification Loss Plot
Classification Accuracy Plot
Figure 8: Classification Loss and Accuracy Plots
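For completeness, here is a minimal training-loop sketch for the ResNet-based sketch classifier using the best hyperparameters above (learning rate 1e-4, batch size 16); the optimizer choice, number of epochs, and dataset object are assumptions.

```python
import torch
from torch.utils.data import DataLoader

def train_classifier(model, sketch_dataset, epochs=30, device="cuda"):
    # Batch size 16 and learning rate 1e-4 gave the best accuracy (green curve above).
    loader = DataLoader(sketch_dataset, batch_size=16, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam is an assumption
    criterion = torch.nn.CrossEntropyLoss()
    model.to(device).train()
    for epoch in range(epochs):
        correct, total = 0, 0
        for sketches, labels in loader:
            sketches, labels = sketches.to(device), labels.to(device)
            optimizer.zero_grad()
            logits = model(sketches)
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
            correct += (logits.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
        print(f"epoch {epoch}: train accuracy {correct / total:.2%}")
```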
While we have a solid foundation for the basic pipeline, there are a few things we would like to consider for the future. Many of our experiments used the synthetic dataset that we generated. We mostly focused on single-animal images, so a natural extension would be to include multiple animals in an image. This is not something CLIPasso can currently handle, so it would be an interesting idea to explore. Ideally, we would like to extend this to a robust system that can handle realistic data. During inference we did test on images that we collected by photographing promising clouds when we saw them. However, not every cloud contains a pattern, so it is also important to consider instances that are less likely to represent an animal in order to truly understand how well the computer can learn creativity. Below we summarize the experiments that we would conduct in the future.
Experiment 1
Images with multiple labeled animals with and without background
Experiment 2
Inverted images where we try to find patterns in the negative space (in between clouds, or in the sky). We did try out a few examples in this category.
Experiment 3
Run on random, unlabeled internet cloud images that may not have anything resembling an animal
Experiment 4
Extend this to categories other than animals.
Overall, we showed that it is possible to find patterns in something that inherently cannot be described by a mathematical equation. CLIPasso has been extended to handle natural cloud images and output a sketch that lines up with one of the animals in our classification list. Our classification results were reliable and can easily be extended to include more categories. There is much scope for extending this idea into more use cases, but as an initial product, we can say that we were able to train the network to model a level of imagination.
[1] Alexander Mordvintsev, Christopher Olah, and Mike Tyka. 2015. Inceptionism: Going Deeper into Neural Networks. https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html
[2] Aaron Hertzmann. 2021. The Role of Edges in Line Drawing Perception. arXiv:2101.09376. https://arxiv.org/abs/2101.09376
[3] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1–9.
[4] Yael Vinker, Ehsan Pajouheshgar, Jessica Y. Bo, Roman Christian Bachmann, Amit Haim Bermano, Daniel Cohen-Or, Amir Zamir, and Ariel Shamir. 2022. CLIPasso: Semantically-Aware Object Sketching. arXiv:2202.05822 [cs.GR].
[5] Corrado Alessio. Animals-10. https://www.kaggle.com/datasets/alessiocorrado99/animals10
[6] Alexis Jacq and Winston Herring. Neural Transfer Using PyTorch. https://pytorch.org/tutorials/advanced/neural_style_tutorial.html
[7] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. arXiv:1703.06870. https://arxiv.org/abs/1703.06870
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://ieeexplore.ieee.org/document/7780459
[9] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2015. A Neural Algorithm of Artistic Style. arXiv:1508.06576.
[10] Shubham Jha. A Brief Introduction to Neural Style Transfer. https://towardsdatascience.com/a-brief-introduction-to-neural-style-transfer-d05d0403901d