Background
Clouds are a vital part of the atmosphere. They help regulate Earth's energy balance, redistribute heat from the equator to the rest of the globe, and are an essential part of the water cycle. Different clouds affect the world around them in different ways, so it is useful to be able to identify each cloud's type.
Historically, identifying a cloud's type (such as cumulus, stratus, or cirrus) has been done manually, using the cloud's altitude and physical features. Cloud Computing students working on this project follow in the footsteps of past computer scientists and train computers to identify cloud types based on photographs.
NASA's GLOBE Cloud Project
The GLOBE Cloud project is a service developed by NASA that allows citizens to participate in gathering scientific data. Anyone can get involved by downloading the GLOBE app and photographing the sky around them. Each photo is then sent to a NASA database, where it is cross-referenced with satellite images from the time and location of the photo.
Our Goal
The GLOBE Cloud database is a valuable resource for any scientist working with images of clouds. We can train image identification models on the GLOBE images to teach them to recognize cloud types.
This can be done in one of two ways. In supervised learning, the machine is given labeled example images of each cloud type, which act as a reference when it classifies other clouds. In unsupervised learning, the machine is given a large number of unlabeled images and left to work out the distinctions between them itself.
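To make the distinction concrete, here is a minimal sketch in Python using scikit-learn, assuming each image has already been reduced to a numeric feature vector; the `features` and `labels` arrays below are random placeholders, not GLOBE data.

```python
# A minimal sketch contrasting supervised and unsupervised learning,
# assuming each image has already been reduced to a feature vector.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.random((200, 64))       # placeholder image feature vectors
labels = rng.integers(0, 3, size=200)  # placeholder cloud-type labels

# Supervised: learn from labeled examples, then predict on held-out images.
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("supervised accuracy:", clf.score(X_test, y_test))

# Unsupervised: no labels at all; the model groups similar images itself.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print("cluster sizes:", np.bincount(clusters))
```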
There are many easily accessible machine learning programs available, some of which are discussed below. This process comes with complications, however, such as:
The sheer number of machine learning programs makes it hard to pick one that will best meet our goal.
The GLOBE photos may contain objects other than clouds, such as trees or buildings. This could confuse a machine learning program made specifically to identify clouds.
The GLOBE service relies on its users to accurately identify the type of cloud in their photo. Incorrectly labeled photographs will confuse our machine learning program, feeding it false information.
Much of the research our team has done has been aimed at addressing these problems.
Pursuing Accuracy
2021 cohort members Alyssa Mazzone, Jaxon Ko, Jo Crooks, and Tyler Gordon performed research to identify an accurate machine learning program. They pitted two programs against each other: the professionally made VGG-16 image identification model and a homebrew model the students built themselves. Both models used unsupervised learning. In the end, they found that their own model was slightly more accurate, most likely because it was made specifically to identify clouds. Each point on the cluster plot to the left represents an image, and the dot's color represents the cloud type the VGG-16 model assigned to it. A more accurate model would produce better-defined clusters of images. Their research can be seen here.
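Their exact code differs, but a rough sketch of this kind of pipeline might look like the following, assuming Keras's pretrained VGG16 as a feature extractor and PCA plus k-means to produce the cluster plot; the `globe_photos/` folder and the cluster count are placeholders.

```python
# Sketch: extract VGG-16 features from GLOBE photos, cluster them, and draw
# the cluster plot. Assumes a globe_photos/ folder with at least a handful
# of .jpg images; TensorFlow/Keras and scikit-learn must be installed.
import glob

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image

model = VGG16(weights="imagenet", include_top=False, pooling="avg")

def vgg_features(path):
    """Return a 512-dimensional VGG-16 feature vector for one image."""
    img = image.load_img(path, target_size=(224, 224))
    arr = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(arr, verbose=0).flatten()

paths = sorted(glob.glob("globe_photos/*.jpg"))  # placeholder image folder
feats = np.array([vgg_features(p) for p in paths])

# Assign each image to a cluster, then project to 2D so clusters can be drawn.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(feats)
coords = PCA(n_components=2).fit_transform(feats)
plt.scatter(coords[:, 0], coords[:, 1], c=clusters)
plt.title("VGG-16 feature clusters (one dot per image)")
plt.show()
```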
Supervised Classification of Cloud Type
The goal of this project was to use supervised machine learning to classify images of clouds taken by citizen scientists within the NASA GLOBE Observations data set. As explained in detail below, we accomplished this by using tools such as Google Colaboratory and Google Teachable Machine to clean our data set, extract images from it, and train and analyze our model. The images used to train and test the model each contained only a single cloud type. We discovered that the data has many potential issues, such as obstructions (trees or buildings, for example) and dimly lit nighttime images. These images may need to be filtered out, or the model may need to be tested on a larger amount of data, so that they do not skew the results. You can see the details of this research here.
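As an illustration of the workflow (not the team's exact code), a Teachable Machine model exported in its Keras format can be loaded and run in Colab roughly like this; the file names follow Teachable Machine's default export, and `globe_photo.jpg` is a placeholder.

```python
# Sketch: classify one GLOBE photo with a model exported from Teachable
# Machine in its Keras format (keras_model.h5 plus labels.txt).
import numpy as np
from PIL import Image
from tensorflow.keras.models import load_model

model = load_model("keras_model.h5", compile=False)
class_names = [line.strip() for line in open("labels.txt")]  # "0 Cumulus", ...

# Teachable Machine image models expect 224x224 inputs scaled to [-1, 1].
img = Image.open("globe_photo.jpg").convert("RGB").resize((224, 224))
arr = (np.asarray(img, dtype=np.float32) / 127.5) - 1.0
probs = model.predict(arr[np.newaxis, ...], verbose=0)[0]

best = int(np.argmax(probs))
print(f"{class_names[best]}: {probs[best]:.0%} confidence")
```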
Examining Misclassification Errors for the Cirrus and Cirrostratus Cloud Types
It appears that one of the misclassifications isn't actually a misclassification: the image was labeled as a cirrus/cirrostratus image, but it is actually a clear-sky image, so the model marked it as something else. The others do appear to be genuine misclassifications. Some of them contain clouds that are not cirrus or cirrostratus, which is probably why the model thought they were something else.
It also appears that some of the images flagged as misclassified are not actually misclassified. Though those who submitted the data marked them as clear, these images do contain some clouds. Some of the others captured lens flare, which might lead to misclassification even if the skies really were clear. Overall, the model did a good job of classifying clear images: some of the images marked as clear were not really clear, and the model picked up on that.
This team created a Teachable Machine model that determined whether an image contained the cloud types of interest, was clear, or contained some other cloud type. Based on these findings, our team discovered that many of the current issues lie not with the model itself but with the data being fed into it. Many of the images have incorrect truth labels (for example, a clear image labeled as a cloud image) and were placed in the wrong categories. The model was actually able to pick up on this: many of its apparent misclassifications were correct classifications of mislabeled images. You can see more of the team's analysis here.
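A hedged sketch of how such label problems can be surfaced: compare the model's predictions against the citizen-supplied labels with a confusion matrix and pull out the disagreements for manual review. The class names, labels, and predictions below are placeholders.

```python
# Sketch: compare model predictions to citizen-supplied labels with a
# confusion matrix and list the disagreements for manual review.
import numpy as np
from sklearn.metrics import confusion_matrix

class_names = ["cirrus/cirrostratus", "clear", "other"]  # placeholder classes
y_true = np.array([0, 0, 1, 2, 1, 0])  # citizen labels (placeholder)
y_pred = np.array([0, 1, 1, 2, 2, 0])  # model predictions (placeholder)

print(confusion_matrix(y_true, y_pred))

# Disagreements are review candidates: the "truth" label itself may be wrong.
for i in np.flatnonzero(y_true != y_pred):
    print(f"image {i}: labeled {class_names[y_true[i]]}, "
          f"predicted {class_names[y_pred[i]]}")
```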
Quality Control for Obstructed Images
2020 cohort members Dhvani Patel and Charles Karafotias developed a quality-control system to allow our machine learning programs to identify obstructions in GLOBE photos. They used two different models to report whether an image contained a cloud, no cloud, or an object that wasn't a cloud. They found that the supervised Teachable Machine performed better at this task than an unsupervised model. The bar plot to the left depicts the accuracy of successive Teachable Machines, each building on the last to become more accurate. Their research can be found here.
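A minimal sketch of that kind of comparison, assuming each model version is scored against the same hand-labeled test set; all labels and predictions below are invented placeholders.

```python
# Sketch: score successive quality-control models on one labeled test set
# and chart the results. Labels and predictions here are placeholders.
import matplotlib.pyplot as plt
import numpy as np

y_true = np.array([0, 1, 2, 0, 1, 2, 0, 0])  # 0=cloud, 1=no cloud, 2=obstruction
model_predictions = {
    "v1": np.array([0, 1, 0, 0, 2, 2, 1, 0]),
    "v2": np.array([0, 1, 2, 0, 1, 2, 1, 0]),
    "v3": np.array([0, 1, 2, 0, 1, 2, 0, 0]),
}

accuracies = {name: float((pred == y_true).mean())
              for name, pred in model_predictions.items()}
plt.bar(list(accuracies), list(accuracies.values()))
plt.ylim(0, 1)
plt.ylabel("Test accuracy")
plt.title("Successive quality-control models")
plt.show()
```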
Cloud Coverage
2020 cohort member Aastha Senjalia focused her research on building a model to identify how much of the sky in a photo is covered by clouds. To do this, she used Google's Teachable Machine, a supervised image identification service. She found that the Teachable Machine was able to identify cloud coverage, but its accuracy was handicapped by incorrectly identified clouds in the training data. The image to the left depicts Teachable Machine's attempt at identifying a GLOBE image: it believes the image depicts isolated clouds, with 71% confidence.
Finding Features
2021 cohort members Eesha Kaul, Greg Roden, Zainab Siddique, and John Lauterbach used a student-made unsupervised model to identify which features of a GLOBE image were most important in deciding its cloud type. After running the model many times with different features, they found that the amount of blue and the amount of shadow in an image were the deciding features. The bar plot to the left depicts the relative importance of the different features. Their research can be seen here.
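As a rough illustration of such hand-crafted features (not the team's actual code), the fraction of blue-dominated pixels and of dark, shadowed pixels can be computed like this; the shadow threshold is an illustrative guess.

```python
# Sketch: two hand-crafted image features of the kind the team examined,
# the fraction of blue-dominated pixels and the fraction of dark (shadow)
# pixels. The shadow threshold is an illustrative guess.
import numpy as np
from PIL import Image

def blue_and_shadow_fractions(path):
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    blue = (b > r) & (b > g)         # pixels where blue dominates
    shadow = rgb.mean(axis=-1) < 60  # dark pixels on a 0-255 scale
    return float(blue.mean()), float(shadow.mean())

blue_frac, shadow_frac = blue_and_shadow_fractions("globe_photo.jpg")
print(f"blue: {blue_frac:.1%}, shadow: {shadow_frac:.1%}")
```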
Cloud Coverage Variation
2019 cohort member Emma Bonnano explored differences in cloud coverage observed from three satellites (GEO, Aqua, and Terra) using Python's Pandas and Numpy libraries. From this work, several graphs were produced showing low, mid, high, and total cloud coverage averages for each satellite. The plot of total cloud cover for all three satellites, shown to the left, depicts how the average total cloud coverage was similar across the satellites, with Aqua having the highest percentage and Terra the lowest. Overall, the combined average for all three was 70% total cloud cover. This work has advanced the stream's understanding of the GLOBE Cloud dataset and provided insight into the differences in each satellite's cloud coverage averages. Her other graphs and research notebook can be seen here.
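A minimal sketch of that kind of per-satellite averaging with Pandas; the file and column names are assumptions about the data export, not the actual schema.

```python
# Sketch: average total cloud cover per satellite with Pandas.
import pandas as pd

df = pd.read_csv("satellite_cloud_cover.csv")  # placeholder file name
print(df.groupby("satellite")["total_cloud_cover_pct"].mean())  # per satellite
print("combined average:", df["total_cloud_cover_pct"].mean())
```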
Characterizing Ground-Satellite Cloud Cover Discrepancies
2020 cohort members worked to reproduce the cloud cover figures from the Eclipse Across America paper (Dodson et al., 2019), but with more data. The figure shows a box plot of the satellite-derived cloud cover distribution for each qualitative cloud cover category reported from the ground. Each horizontal line represents the threshold between the qualitative categories' cloud cover ranges. Overall, the median for each category falls within the corresponding range of cloud cover observed from satellite. You can see their fully detailed results here.
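A sketch of how such a figure can be built with Pandas and Matplotlib, assuming a merged table of ground categories and satellite percentages; the file and column names are placeholders, and the threshold values follow the GLOBE category boundaries as we understand them.

```python
# Sketch: box plots of satellite-derived cloud cover grouped by the ground
# observer's qualitative category, with lines at the category thresholds.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("matched_observations.csv")  # placeholder merged data set
order = ["none", "few", "isolated", "scattered", "broken", "overcast"]
data = [df.loc[df["ground_category"] == c, "satellite_cover_pct"] for c in order]

fig, ax = plt.subplots()
ax.boxplot(data, labels=order)
for threshold in (10, 25, 50, 90):  # GLOBE category boundaries, in percent
    ax.axhline(threshold, linestyle="--", linewidth=0.8)
ax.set_ylabel("Satellite cloud cover (%)")
plt.show()
```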
2020 cohort members Rahul Joshi, Jack Larcome, Josh Smith, and Michelle Badalov explored how to characterize discrepancies between satellite and citizen observations. With the help of Python's Pandas, Matplotlib, and Numpy packages, they were able to identify these discrepancies and categorize them by cloud altitude. The heatmap (a 2D histogram) on the left shows the frequency of each combination of satellite and ground-observer cloud coverage, restricted to observations where the two sources agree well. Most of the observations with good agreement reported either 0% or 100% cloud coverage. Being able to identify discrepancies between citizen and satellite observations can help us identify conditions where disagreement is more prominent. Their research can be seen here.
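A minimal sketch of such a heatmap using Matplotlib's 2D histogram; the file and column names are assumptions about the merged data set.

```python
# Sketch: 2D histogram (heatmap) of satellite vs. ground cloud cover.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("matched_observations.csv")  # placeholder merged data set
plt.hist2d(df["ground_cover_pct"], df["satellite_cover_pct"],
           bins=10, range=[[0, 100], [0, 100]], cmap="viridis")
plt.colorbar(label="Number of observations")
plt.xlabel("Ground-observer cloud cover (%)")
plt.ylabel("Satellite cloud cover (%)")
plt.show()
```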
Demonstration of the Distribution of Cloud Type and Cover for a Particular Geographic Location
This code was designed to focus on a particular geographic region with a high concentration of observations in the GLOBE database and examine the distribution of cloud types there. The motivation for this work was to determine whether the region would be suitable for solar power generation.
For the same region, the distribution of cloud cover from both ground and satellite observations was also plotted, for the same purpose of determining how well the region is suited to solar power generation.
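A hedged sketch of this kind of regional analysis with Pandas; the bounding-box coordinates and column names are placeholders, not the actual region studied.

```python
# Sketch: restrict GLOBE observations to a bounding box and plot the
# distribution of reported cloud types within it.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("globe_observations.csv")  # placeholder file name
lat_min, lat_max, lon_min, lon_max = 29.0, 31.0, -99.0, -97.0  # placeholder box

region = df[df["latitude"].between(lat_min, lat_max)
            & df["longitude"].between(lon_min, lon_max)]
region["cloud_type"].value_counts().plot(kind="bar")
plt.ylabel("Number of observations")
plt.title("Cloud types reported in the selected region")
plt.show()
```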
Further Reading
Amos, H. M., Starke, M. J., Rogerson, T. M., Colón Robles, M., Andersen, T., Boger, R., Campbell, B. A., Low, R. D., Nelson, P., Overoye, D., Taylor, J. E., Weaver, K. L., Ferrell, T. M., Kohl, H., & Schwerin, T. G. (2020). GLOBE Observer Data: 2016–2019. Earth and Space Science, 7(8). https://doi.org/10.1029/2020ea001175
Colón Robles, M., Amos, H. M., Dodson, J. B., Bouwman, J., Rogerson, T., Bombosch, A., Farmer, L., Burdick, A., Taylor, J., & Chambers, L. H. (2020). Clouds around the World: How a Simple Citizen Science Data Challenge Became a Worldwide Success. Bulletin of the American Meteorological Society, 101(7). https://doi.org/10.1175/bams-d-19-0295.1
Dodson, J. B., Colón Robles, M., Taylor, J. E., DeFontes, C. C., & Weaver, K. L. (2019). Eclipse across America: Citizen Science Observations of the 21 August 2017 Total Solar Eclipse. Journal of Applied Meteorology and Climatology, 58(11), 2363–2385. https://doi.org/10.1175/jamc-d-18-0297.1
Huertas-Tato, J., Rodríguez-Benítez, F. J., Arbizu-Barrena, C., Aler-Mur, R., Galvan-Leon, I., & Pozo-Vázquez, D. (2017). Automatic Cloud-Type Classification Based On the Combined Use of a Sky Camera and a Ceilometer. Journal of Geophysical Research: Atmospheres, 122(20). https://doi.org/10.1002/2017jd027131