Student Projects

Spring 2021

The following are projects completed by students in EEPS 1960D: Machine Learning for the Earth and Environment at Brown University in Spring 2021.

The projects below are shared with permission of the student authors and should not be redistributed.

Climate & Environment

A Machine Learning Approach to Synoptic-Scale Waterspout Prediction in the Florida Keys by Jonny Benoit


Machine learning and statistical methods are used to predict waterspout occurrence in the Florida Keys from a global weather model (NCAR). The results match the skill of previous regression models that used in situ data from Key West. Bayes' theorem is used to assess the significance of the new waterspout prediction method and to convert model outputs to waterspout probabilities. Training models on a pixelwise basis indicates that the pixels nearest the Florida Keys are most predictive of waterspout occurrence, and ensemble methods are used to increase predictive power. [report]
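The Bayes' theorem step described above converts a raw model detection into a waterspout probability by combining the model's hit and false-alarm rates with a climatological base rate. A worked toy example (all rates here are hypothetical, not the report's values):

```python
# Bayes' theorem applied to a binary forecast: given that the model
# flags a waterspout day, what is the probability one actually occurs?
p_spout = 0.05             # hypothetical prior: climatological waterspout frequency
p_pos_given_spout = 0.8    # hypothetical hit rate of the model
p_pos_given_none = 0.1     # hypothetical false-alarm rate of the model

# Total probability of the model flagging a day.
p_pos = p_pos_given_spout * p_spout + p_pos_given_none * (1 - p_spout)

# Posterior probability of a waterspout given a flagged day.
p_spout_given_pos = p_pos_given_spout * p_spout / p_pos
```

Even a model with an 80% hit rate yields a modest posterior here, because waterspouts are rare: the prior dominates unless the false-alarm rate is very low.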

Predicting Streamflow Using Machine Learning in Northern Canada by Sarah Esenther & Ekaterina Lezine


Predicting streamflow under varying climate conditions is critically important for managing water resources. This is especially true as climate change intensifies the hydrologic cycle in certain regions, such as the northern latitudes. In the past, physical models have been used to predict streamflow based on complex equations describing how water moves across a landscape. These models are difficult to implement and require large amounts of validation data that are not always available, especially in remote, northern watersheds. Machine learning offers a way to predict streamflow without significant expertise in physical modeling. In this study, we demonstrate the utility of three machine learning methods (random forest, multilayer perceptron, and long short-term memory) for predicting streamflow at three river gauges in northern Canada. We find that none of the models produces consistently accurate discharge predictions across all stations. We emphasize the need for larger training datasets, better training data, and more precise tuning methods to build more accurate and robust models. [report] [code]

Identifying Moulins on the Greenland Ice Sheet using Ice Sheet Surface and Bedrock Topography by Carolyn Lober & Emma Perkins


Vertical drainage columns in the ice sheet, called moulins, account for a significant portion of surface water drainage on the Greenland Ice Sheet. Here, digital elevation model (DEM) data were used to train a model to identify moulins based on the topography of the ice sheet surface (Greenland Ice Mapping Project/GIMP) and bedrock (IceBridge BedMachine Greenland/BedMachine). The features used were elevation, aspect, slope, and planform and profile curvature for both GIMP and BedMachine; moulin labels were taken from a Landsat-based mapping study. All data were sampled at a 30 m pixel size. Undersampling of non-moulin pixels was combined with SMOTE oversampling of moulins to account for an extremely unbalanced dataset (only 872 moulins in approximately 41 million pixels). All combinations of resampling and algorithms resulted in either an extremely high rate of false positives or no predicted moulins at all. Among these poor-performing models, weighted logistic regression on data with random undersampling of the majority class performed best on the validation data, with an overall model accuracy of 0.0898, a true positive rate of 90.783%, and 9,755,154 false positives. Surprisingly, this model performed substantially better on the test set than the validation set, with a total model accuracy of 0.1668, a true positive rate of 96.92%, and 4,964,118 false positives. [report]
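The resampling strategy above relies on SMOTE's core idea: synthesizing minority-class samples by interpolating between a real moulin pixel and one of its nearest minority-class neighbors in feature space. A minimal numpy sketch of that idea (not the library implementation the authors may have used; the brute-force neighbor search is purely illustrative):

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, rng=None):
    """Generate synthetic minority samples by interpolating between
    each chosen minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Brute-force pairwise distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    neighbors = np.argsort(d, axis=1)[:, :k]    # k nearest neighbors per sample
    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(n)                     # pick a minority sample at random
        nb = neighbors[i, rng.integers(k)]      # pick one of its neighbors
        lam = rng.random()                      # interpolation weight in [0, 1)
        synthetic[j] = X_min[i] + lam * (X_min[nb] - X_min[i])
    return synthetic
```

Each synthetic point lies on the segment between two real minority samples, which is what lets SMOTE densify the minority class without duplicating pixels verbatim.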

Investigating cheatgrass extent in the western US by Tamara Rudic


I use a time series of Landsat data spanning 2011-2020 and a comprehensive set of ground-truthed cheatgrass percent cover training data to investigate the efficacy of machine learning models for detecting cheatgrass presence and expansion in the western US. I select six different feature sets, including both annual and seasonal spectral-temporal metrics, and three different pairs of percent cover thresholds for defining cheatgrass absence versus presence, to test two baseline classifier models, random forest models, and the impact of synthetic data augmentation using the borderline SMOTE algorithm. I find that synthetic data augmentation is crucial for increasing random forest performance relative to the baseline, and that a threshold of 0%/20% for defining cheatgrass absence versus presence (respectively) is most effective for increasing model performance. Results comparing annual and seasonal metrics are inconclusive: annual metrics are preferred if high cheatgrass presence recall is the optimal performance metric, but seasonal metrics are preferred if high cheatgrass presence precision is. Finally, I suggest important future steps and limitations that need to be addressed to improve machine learning performance on cheatgrass detection, including gathering ground-truthed datasets that are more representative of the study region at large and correcting for the class imbalance between cheatgrass presence and absence. [report]
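The 0%/20% thresholding described above maps continuous percent-cover measurements to binary labels while discarding ambiguous intermediate samples. A minimal sketch (the function name and the exclusion of the in-between band are illustrative assumptions, not the report's exact implementation):

```python
def cover_to_label(pct_cover, absent_max=0.0, present_min=20.0):
    """Map percent cover to an absence/presence label using the 0%/20%
    thresholds the report found most effective; samples falling between
    the two thresholds are treated as ambiguous and excluded (None)."""
    if pct_cover <= absent_max:
        return 0          # cheatgrass absent
    if pct_cover >= present_min:
        return 1          # cheatgrass present
    return None           # ambiguous cover: drop from training
```

Widening or narrowing the excluded band trades labeled-sample count against label purity, which is one way the choice of threshold pair affects model performance.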

Detecting Deforestation with a Convolutional Neural Network Ensemble by Henry Talbott


Detecting deforestation from satellite imagery is a crucial problem both for monitoring ecosystem health and for predicting global and regional climate change. I trained multiple convolutional neural networks (CNNs) on year-2000 and year-2019 satellite images and accompanying deforestation data, with separate models trained on western Canada and the Amazon. I then created an ensemble model for each region by training a random forest classifier on the CNN outputs, with the classification task of determining whether a given 2.4 km by 2.4 km patch of land was over 20% deforested. The Amazon ensemble model performed moderately well, with a test accuracy of 85% and a false positive rate of 10%. Both models had false negative rates lower than 10% and outperformed baseline random forest and multilayer perceptron models. However, both ensemble models performed very poorly when tested on regions other than the ones they had been trained on, suggesting that the features learned during training were highly ecoregion-specific. Further research should focus on increasing the models' ability to generalize across distinct ecoregions. [report] [dataset] [data viz tool]
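The ensemble step described above is a form of stacking: a random forest learns from the outputs of the CNNs rather than from raw imagery. A hedged sketch with randomly generated stand-ins for the per-patch CNN probabilities (all shapes, numbers, and the data-generating process are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-ins for per-patch outputs of several trained CNNs: each row holds
# the predicted deforestation probability from each CNN for one patch.
n_patches, n_cnns = 200, 3
y = rng.integers(0, 2, n_patches)   # 1 = patch is over 20% deforested
cnn_probs = np.clip(
    y[:, None] * 0.6 + rng.normal(0.2, 0.15, (n_patches, n_cnns)), 0.0, 1.0
)

# Stacking step: a random forest learns to combine the CNN outputs.
stacker = RandomForestClassifier(n_estimators=100, random_state=0)
stacker.fit(cnn_probs, y)
acc = stacker.score(cnn_probs, y)   # training accuracy on the synthetic data
```

In practice the stacker would be evaluated on held-out patches; the appeal of this design is that the second-stage model can learn which CNN to trust where the CNNs disagree.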

Using machine learning to diagnose teleconnection patterns of different ENSO modes by Rosa Xu & Karen Wang


El Niño Southern Oscillation (ENSO) affects the lives of millions of people. However, its periodicity and mechanisms are still not fully understood. There are two modes of El Niño generated through different physical processes: the eastern Pacific (EP) and the central Pacific (CP) El Niño. In this study, we analyzed the skin temperature, sea level pressure, geopotential height, and precipitation from ERA-Interim reanalysis monthly data during 1978-2020 to better understand the two modes of ENSO. We applied K-means clustering to the global dataset of typical El Niño years since 1978 to identify differences in teleconnection patterns between EP and CP El Niño. We also applied dimensionality reduction to the time series data of all 42 years since 1978 to characterize key changes through time. Our results suggest that the generation of EP El Niño is related to a positive Indian Ocean Dipole while CP El Niño is not. Geopotential height shows zonal Rossby wave patterns and Antarctic amplification, which are responsible for the teleconnections of the ENSO cycle. [report]
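The clustering step above groups whole anomaly fields: each event's (lat, lon) map is flattened into one sample vector before K-means is applied. A toy sketch of that setup with synthetic stand-in patterns (all data here are hypothetical, not ERA-Interim):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Stand-ins for anomaly fields: each row is one El Niño event's
# skin-temperature anomaly map, flattened from a (lat, lon) grid.
n_events, n_gridpoints = 30, 10 * 20
pattern_ep = rng.normal(size=n_gridpoints)   # hypothetical EP-like pattern
pattern_cp = rng.normal(size=n_gridpoints)   # hypothetical CP-like pattern
labels_true = rng.integers(0, 2, n_events)
fields = np.where(labels_true[:, None] == 0, pattern_ep, pattern_cp)
fields = fields + rng.normal(scale=0.3, size=(n_events, n_gridpoints))

# Cluster events into two groups, as in the report's EP-vs-CP analysis.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(fields)
clusters = km.labels_
```

Inspecting each cluster's mean field (reshaped back to the grid) is then one way to visualize the teleconnection pattern associated with each mode.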

Geology & Geophysics

New approaches to understanding the structure of Alaska using machine learning by Isabella Gama


Seismology is a data-rich science that is growing quickly, driven by advances in instrumentation and computational power. In the past five years, hundreds of seismometers were deployed in Alaska, and today we have an unprecedented amount of information about the Alaskan subsurface. As scientists, we need to keep up with this rapid growth by finding robust techniques that allow us to interpret these new data more accurately. The goal of this project is to use unsupervised learning techniques to understand patterns in Earth's subsurface from the large volume of seismic data recorded in Alaska. The data for this project were processed to facilitate interpretation of the results calculated here. I standardized the data, tested the K-means and DBSCAN clustering algorithms, and implemented the PCA and NMF dimensionality reduction methods. However, given the difficulty of interpreting the results of these unsupervised methods in terms of Earth's characteristics, the main takeaway is that the project served as an opportunity for exploratory data analysis. [report]

Utilizing Machine Learning to Predict Volcanic Eruption Times from Seismic Waves by Erin Lincoln


Even with modern detection systems, humans remain at risk from unpredicted eruptions, especially where volcanoes are not routinely monitored. As a volcano approaches eruption, movement within the volcanic system can generate seismic waves, which can be recorded and analyzed. Using a Kaggle dataset of seismic recordings from around 600 volcanoes, this study trains machine learning algorithms to detect when an eruption may occur. The potential for eruption is measured in three tasks: predicting the time until eruption (regression), determining whether the time until eruption is in the lower quartile of the data (classification), and determining whether it is in the lower 50% of the data (classification). To do this, tsfresh is used both to reduce the dimensionality of the data and to create features for three algorithms: random forest, MLP neural network, and KNN. Although this study is not yet practically applicable because the Kaggle dataset lacks units, predicting eruptions with more data and more complex algorithms is a promising extension of this work. [report] [dataset]
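The tsfresh step above automates the extraction of many summary statistics from each raw seismic trace, turning a variable-length signal into a fixed-length feature vector. A tiny hand-rolled analogue of that step, not the tsfresh API itself (the specific features chosen here are illustrative):

```python
import numpy as np

def basic_ts_features(signal):
    """A tiny hand-rolled analogue of tsfresh-style feature extraction:
    collapse a raw time series into a fixed-length summary vector."""
    x = np.asarray(signal, dtype=float)
    return np.array([
        x.mean(),                            # mean level
        x.std(),                             # overall variability
        np.abs(x).max(),                     # peak amplitude
        np.mean(np.abs(np.diff(x))),         # mean absolute first difference
        (np.diff(np.sign(x)) != 0).mean(),   # zero-crossing rate proxy
    ])
```

Vectors like this (tsfresh computes hundreds of such features, with built-in relevance filtering) are what the random forest, MLP, and KNN models would consume.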

Planetary Science

An Exploration of Unsupervised Machine Learning Algorithms for Planetary Remote Sensing Datasets by Carol Hundal & Cody Schultz


The image products of hyperspectral remote sensing missions often contain an immense amount of information that is difficult to display, analyze, and interpret. Unsupervised machine learning algorithms, including dimensionality reduction and clustering techniques, offer effective means by which these datasets can be more thoroughly understood. Here, we show the results of a variety of dimensionality reduction and clustering techniques applied to hyperspectral images from the Moon Mineralogy Mapper (M3) and the Compact Reconnaissance Imaging Spectrometer for Mars (CRISM). We find that, contrary to our initial expectations, the simplest techniques (PCA and KMeans) offer the best results for broad use in spectroscopy in terms of computational speed, ease of use, and interpretability. [report]
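The PCA + KMeans pipeline the authors favor amounts to reshaping the image cube so each pixel's spectrum is one sample, reducing its dimensionality, and clustering the resulting scores. A minimal sketch on a random stand-in cube (the shapes and parameters are illustrative, not M3 or CRISM values):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Stand-in for a hyperspectral image: a (rows, cols, bands) cube of spectra.
rows, cols, bands = 20, 20, 50
cube = rng.normal(size=(rows, cols, bands))

# Reshape so each pixel's spectrum is one sample, then reduce and cluster.
pixels = cube.reshape(-1, bands)                  # (rows*cols, bands)
scores = PCA(n_components=5).fit_transform(pixels)
classes = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scores)
class_map = classes.reshape(rows, cols)           # per-pixel cluster labels
```

The reshaped `class_map` can be displayed as a false-color image, which is what makes this simple pipeline so interpretable for spectroscopy.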

Sustainability

Smart Models, Smart Meters by Matthew Park


Balancing the energy grid is crucial for preventing grid failure and blackouts. This balancing requires cooperation from consumers and suppliers of energy, making it important for energy providers to understand consumer behavior. This project aims to predict the daily overall energy consumption of households from a collection of weather metrics and a socio-demographic classification, referred to as the Acorn group. Daily energy measurements for 5,567 London households from November 2011 to February 2014 were merged with daily numerical weather metrics and a series of engineered features. The data were used to train and evaluate a series of regression models. The results demonstrate that it is possible to generate predictive models of daily energy consumption with accuracies of 0.8492 and 0.8812. The trained models generalized over the behavior of each Acorn classification to account for the variability in individual household behaviors, and they revealed that the key predictive feature was the Acorn classification itself. Thus, understanding the relation between energy consumption and socio-demographic background is important, and this feature can be further broken down to distinguish and understand the behavior of an individual household. [report] [dataset]
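The modeling setup described above (weather metrics plus a categorical Acorn group predicting daily consumption) can be sketched roughly as follows; the column names, group effects, and model choice here are illustrative assumptions, not the report's exact pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# Hypothetical stand-in for the merged table: daily weather metrics plus
# an Acorn socio-demographic group for each household-day.
n = 300
df = pd.DataFrame({
    "temperature": rng.normal(10, 6, n),
    "humidity": rng.uniform(40, 95, n),
    "acorn": rng.choice(["A", "B", "C"], n),
})
# Synthetic target: group level sets the baseline, temperature modulates it.
group_effect = df["acorn"].map({"A": 12.0, "B": 9.0, "C": 6.0})
df["kwh"] = group_effect - 0.2 * df["temperature"] + rng.normal(0, 0.5, n)

# One-hot encode the categorical Acorn group, then fit a regression model.
X = pd.get_dummies(df[["temperature", "humidity", "acorn"]], columns=["acorn"])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, df["kwh"])
r2 = model.score(X, df["kwh"])   # fit quality on the synthetic training data
```

Feature importances from a model like this are one way the report's finding, that the Acorn classification was the key predictor, could be surfaced.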

GuildGuide: Collaborative Filtering Recommender for Permaculture Practices by Spencer Small


Permaculture is an approach to land management and agriculture based on leveraging the ecosystems found in the wild. Since its conception in the 1970s, the practice has gained considerable traction in light of recent attention to the harm of industrial agriculture. Already, over 75% of the planet’s land is degraded from climate change and monoculture farming, and this number is expected to reach 90% by 2050. Permaculture has been seen as a viable solution to this global desertification, as it enriches soil through its inherent biodiversity, without diminishing yield. However, switching over to a more sustainable operation is easier said than done for most farmers. The proper implementation of permaculture requires a deep knowledge of the site, as well as the symbiotic relationships within a given polyculture (guild). As exemplified by Masanobu Fukuoka’s The One Straw Revolution, this knowledge can take a lifetime to learn. Thus, the motivation for this project was to leverage machine learning in developing a tool that might help accelerate the adoption of permaculture worldwide. My tool generates novel species suggestions to a farmer based on his or her existing polyculture; in other words, a tool for intelligently increasing the biodiversity of a farming operation. Through testing a variety of approaches, I settled on a matrix factorization recommender system using collaborative filtering for this task. [report] [code]
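The matrix factorization approach chosen above represents farms and species with low-dimensional latent vectors whose inner products reconstruct observed plantings; unplanted species can then be scored for each farm from the reconstruction. A bare-bones gradient descent sketch (the data and hyperparameters are hypothetical, not GuildGuide's implementation):

```python
import numpy as np

def factorize(R, k=2, steps=2000, lr=0.05, reg=0.01, rng=None):
    """Plain matrix factorization by gradient descent: approximate the
    farm-by-species matrix R as U @ V.T, using only observed entries."""
    rng = np.random.default_rng(rng)
    n_farms, n_species = R.shape
    U = 0.1 * rng.normal(size=(n_farms, k))
    V = 0.1 * rng.normal(size=(n_species, k))
    mask = ~np.isnan(R)                       # only observed entries drive updates
    for _ in range(steps):
        E = np.where(mask, R - U @ V.T, 0.0)  # reconstruction error, observed only
        U += lr * (E @ V - reg * U)
        V += lr * (E.T @ U - reg * V)
    return U @ V.T

# Hypothetical farms x species presence matrix (NaN = unobserved).
R = np.array([[1.0, 1.0, 0.0, np.nan],
              [1.0, np.nan, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
scores = factorize(R, rng=0)
```

High reconstructed scores at unobserved (NaN) positions are the recommendations: species that farms with similar latent profiles tend to grow together.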

Physical Sciences

Detecting Exoplanet Transits in Kepler Light Curves by Leah Zuckerman & Anna Zuckerman


Identifying exoplanets via transit photometry is increasingly important as new telescope observations become available at an ever-growing rate. However, visually identifying transits can be time-consuming and tedious. In this work, we develop and evaluate several methods for identifying stellar light curves that are likely to contain transit signals, which would then be flagged for visual inspection. We use light curves from Kepler Campaign 3. We evaluate two datasets obtained in different ways, as well as several methods of pre-processing and a variety of machine learning algorithms. We find that no algorithm achieves high accuracy on the first dataset, but on the second dataset we achieve 83% accuracy using a Random Forest classifier. This model has the added benefit of favoring false positives over false negatives, which is preferred when flagging candidate exoplanet transits. [report]

Health and Biological Sciences

Classification of Fetal Health using Machine Learning Methods by Karen Robles [report] [dataset]

Music

Unsupervised Musical Genre Classification: Cluster Analysis of the Million Song Subset by Matt Jones [dataset]