Student Projects

Fall 2023

The following are projects completed by students in EEPS-DATA 1340: Machine Learning for the Earth and Environment at Brown University in Fall 2023. 

The projects below are shared with permission of the student authors and should not be redistributed. 

Climate and Environment

Forecasting Corn Yield in the United States Using NDVI, Cropland and Climate Data by Annie Herring, Serena Vu, and Gabriel Traietti

In the past decade, corn production in the United States has averaged 14.35 billion bushels per year, surpassing all other countries. Understanding how crop yield responds to climatological variation is essential to protecting American and worldwide interests in food and economic stability. With year-to-year variation increasing because of climate change, the ability to predict the crop yield for a given year is becoming both more important and more difficult. Current yield forecasts, such as the United States Department of Agriculture World Agricultural Supply and Demand Estimates (WASDE) report, are based on subjective expert analysis. Over the past 22 years, these midyear reports were accurate to within 5% in only six years. This project investigates the possibility of building a predictive machine learning model that could predict yields more accurately and efficiently by relying more on objective climate variables than on subjective analysis. We trained random forest and boosted tree models on data compiled from 1984 to 2021, which include state-level crop yield and weather features. Although the analysis does not produce highly accurate models, this project explores the issues that arise, including the effect of improvements in agricultural technology over this time period. [report]

The Effects of Arctic Temperature and Atmospheric CO2 Change on Arctic Sea Ice Concentration  by Celia Kong-Johnson and Lauren Renna

The Arctic is home to many species that are especially susceptible to change. Temperature changes, which can in turn affect sea ice concentrations in these regions, can therefore have detrimental impacts on Arctic ecosystems and their inhabitants. To investigate this issue, we implemented linear (Ridge and linear regression) and nonlinear (decision tree) machine learning techniques to determine what type of correlation, if any, temperature and atmospheric CO2 have with sea ice concentration in the Arctic, and how distance from the temperature source affects these correlations. Overall, all of our models performed relatively poorly, with much lower performance on test data than on training/validation data, suggesting substantial overfitting. This is not surprising considering the small number of observations and features in our dataset. Our results did show that our nonlinear model performed better than our linear models. We also found potential evidence that the correlation decreases as the distance between the sea ice concentrations and the temperature data source increases, with model performance more similar across training, validation, and test data for the farthest sea. Future machine learning research on sea ice concentrations will need additional features and more observations to achieve higher model performance and accuracy. Improved sea ice prediction models will improve our understanding of the role of climate change in sea ice melting and inform policymakers on actions to mitigate these changes and protect Arctic ecosystems. [report]

Reconstruction of Antarctic Sea Ice  by Yusen Liu

Sea ice is an important component of the climate system and a critical factor influencing future changes in the energy balance, sea level, and polar ecosystems. Because observations of sea ice are limited (satellite observation began in the 1980s), the temporal evolution of sea ice in both the Arctic and Antarctica is not fully understood. In Antarctica, sea ice variability is more complex: it showed a continuous increase before 2014 and a drastic decline in recent years, rather than the steady decreasing trend seen over the Arctic. The low-frequency variability of Antarctic sea ice is not clear because the data record is short. Here, I try to reconstruct Antarctic sea ice variability back to 1948 using machine learning methods. The results shed light on sea ice variability in the early period, which is absent from satellite observation. Both linear and nonlinear models are constructed based on the relationship between sea ice and environmental factors. Results show that the artificial neural network (ANN) is the most effective at capturing the sea ice variability, which exhibits significant interannual fluctuations superimposed on decadal-scale variations. [report]

Predicting Canopy Disturbances on Barro Colorado Island Using Random Forests  by Ruth Ukubay and Mia Mitchell

Assessing tree mortality is crucial for understanding carbon storage limitations and their relationship with global climate change. Tropical forests play a pivotal role in sequestering carbon. Data from long-term observational studies of tree mortality are necessary to improve the modeling of carbon gains and losses in Earth-System models under future climate scenarios. Despite the importance of tree mortality, its causes and drivers in tropical forest ecosystems, particularly compounding mortality effects, remain poorly understood. This study leverages advancements in drone photogrammetry and lidar data from large-scale data collection of canopy disturbance rates on Barro Colorado Island (BCI). Preliminary analyses reveal patterns related to the likelihood of tree mortality events, specifically that disturbances are more likely to occur near existing disturbances and on particular soil forms and parent types. Here, we use data extracted from photogrammetric point clouds to predict the location of disturbances in 2023 with a Random Forest Classifier. After exploring multiple algorithms, we achieved a 79% accuracy rate after randomly sampling equal numbers of observations from each target class (1 = disturbance; 0 = no disturbance). [report]
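The class balancing described above can be sketched as follows. This is a minimal illustration of random undersampling to equal class sizes, not the authors' pipeline: the function name balanced_undersample, the seed, and the toy data are all hypothetical.

```python
import random

def balanced_undersample(features, labels, seed=0):
    """Randomly keep an equal number of rows from class 0 and class 1.

    Hypothetical helper for illustration; the smaller class sets the size.
    """
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for i, y in enumerate(labels):
        by_class[y].append(i)
    n = min(len(by_class[0]), len(by_class[1]))
    # Sample without replacement from each class, then mix the rows.
    keep = rng.sample(by_class[0], n) + rng.sample(by_class[1], n)
    rng.shuffle(keep)
    return [features[i] for i in keep], [labels[i] for i in keep]

# Made-up toy data: 2 disturbed cells (label 1) out of 10.
toy_labels = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
toy_features = [[float(i)] for i in range(10)]
X_bal, y_bal = balanced_undersample(toy_features, toy_labels)
print(sum(y_bal), len(y_bal))  # prints "2 4"
```

The balanced rows could then be passed to any classifier. Note that accuracy measured on a balanced sample has a 50% chance baseline, which is the context in which a figure such as 79% should be read.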

Testing the Capability of Various Machine Learning Algorithms to Increase the Resolution of Satellite Precipitation Data over a Deforested Region of the Amazon Rainforest  by Caleb Ukaonu

With the continued expansion of Brazil’s agriculture industry, deforestation continues to alter the Amazonian climate. To better understand how local deforestation affects precipitation over the deforested region and its surroundings, researchers studying the region need precipitation observations at ever higher resolution. In this study, we looked at the capability of various machine learning methods to artificially enhance the resolution of satellite precipitation data over a deforested region and its surroundings. We found that, of the methods used, the Random Forest algorithm yielded the best results and would be the best model to use for this purpose. However, none of the algorithms produced results accurate enough to use for deeper analysis of how deforestation influences convection.

Predicting Active Layer Thickness in the CESM 2 LSM  by Bradley Lockhart

Ecology

Enhancing Zebrafish Migration Pattern Prediction Through Machine Learning  by Alana Cho and Kevin Hsu

Understanding zebrafish migration patterns and behaviors is crucial for medical research and clinical studies. However, manual analysis of zebrafish movement is labor-intensive and subjective. In this paper, we propose a novel approach to automate the identification of zebrafish locations in video frames within a noisy environment, addressing challenges such as a tilted fish tank, additional elements in the frame, and varying fish density. We employ a Convolutional Neural Network (CNN) Binary Classifier to predict the presence of zebrafish in tiles of varying sizes within each frame. This automated approach offers efficiency compared to manual methods, potentially enabling large-scale studies. As zebrafish are vital model organisms, our work contributes to advancing biomedical research, providing a foundation for predicting migration patterns and behaviors. Future work may extend this model to estimate 3D trajectories, further addressing challenges posed by erratic movements and shoaling behaviors of zebrafish. The proposed methodology showcases promise in advancing machine learning analysis of zebrafish behavior within complex environments.  [report]

Using Data from New England Aquarium Whale Watches to Predict Humpback Whale Hotspot Probabilities in Stellwagen Bank  by Max Ferguson and Anjali Shah

This project used unsupervised and supervised learning techniques to predict the location of humpback whale ‘hotspots’ in the Stellwagen Bank National Marine Sanctuary. Understanding where and when ‘hotspots’ occur can have strong conservation and economic benefits, such as assisting in fishery management and improving the success of whale watching trips. Using a data set from Boston Harbor City Cruises, we clustered data from 2013-2020 with the DBSCAN algorithm based on latitude and longitude as well as month of the year. To predict hotspot presence, we performed binary classification with a baseline KNN model, followed by logistic regression and random forest. We compared the models’ predictions using cross-validation scores and precision and recall values. After tuning DBSCAN and random forest parameters, we tested our models on 2021 data to see how well they predicted new hotspot locations. We found that our random forest model performed very well on the training data, but poorly on the test data. We also found that logistic regression performed consistently across both data sets, with extremely poor recall for non-hotspot locations. [report]

A Machine Learning Species Distribution Model to Guide Restoration Efforts in the Face of Climate Change  by Aaron Freeman, Luke Randall, and Shaw Miller

Species distribution models are an established tool used to predict species ranges from environmental, climatic, and human impact variables. Classical implementation of these models involves correlative work between these variables and species range maps produced by networks of professional biologists. Machine learning (ML) offers an improvement over these classical methods by enabling identification of geographically small “niches” in which a species is most likely to occur, allowing finer-scale prediction of species occurrence within its overall range. These sorts of predictive models may be valuable for local conservation efforts that can protect only a small geographic area. This project uses a database of species occurrences and environmental variables from mainland Europe to build a predictive model of habitat suitability for one plant species with a widespread range across France and northern Italy. We show that ML offers strong potential for identifying small-scale variability in species occurrences, but that the small number of observations of any individual species is a barrier to accurate modeling. We attempt to overcome this barrier by augmenting the presence-absence data set with an additional random sample of “present” observations drawn from the presence-only data, resulting in better model performance for “present” observations. While limited in scope, this model demonstrates the effectiveness of this strategy and of a data preparation pipeline that can be applied to other plant species in mainland Europe. [report]

Binary Prediction of Bat Sonification with Ensemble Method  by Finnegan Keller, Kieren Leif Dykstra, and Andrew Kim

Cute, crafty, and at increasing risk of extinction, bats are ecologically important in plant and insect life histories (U.S. Fish & Wildlife Service 2021) while also being a significant disease vector for humans and livestock (Letko et al. 2020). A machine learning model capable of predicting specific bat calls could be used to determine bat diversity in a given area and thus aid in predicting the risk of disease spread or other ecological damage. In this report, we use crowdsourced bat sonification audio recordings from the Xeno-Canto online database to build a model that performs binary classification on bat call audio recordings belonging to the families Phyllostomidae or Vespertilionidae, with a test accuracy of over 91% despite substantial class imbalance. This model is limited to binary classification due to a lack of accessible data representing other bat families. Additional data would allow for the creation of a model capable of predicting the sonifications of multiple bat families, greatly expanding the potential ecological and public health applications of the model. [report]

The Butterfly Effect: Predicting Strontium Isotope Ratios Across North America  by Sydney Roberts and Calvin Kirk

As monarch butterflies develop, their wings adopt a unique strontium isotope ratio that is specific to their region of birth and temporally stable. To understand their migration, scientists predict their birthplaces by matching their strontium isotope ratio to an isoscape, or map of isotope ratios. With the data and scientific context from Reich et al. (2021), we use 400 plant samples from eastern North America and geolocated physical features to predict an isoscape that can be used as a reference for butterfly tracking. Our best model, a Random Forest with standardized data, has an R2 score of 0.5438 and RMSE value of 0.0018 on unseen test data, which is 0.8 standard deviations of the strontium isotope ratios in our sample. This is in line with Reich et al. (R2 = 0.44 and RMSE = 0.0017).  [report]

Geophysics and Geochemistry

Systematic Classification of Calcic Amphiboles using Machine Learning Algorithms  by César A. Bucheli-Olaya

Amphibole systematic classification requires knowledge of the mineral’s composition and the specific distribution of elements within its crystalline structure. The high variability in amphibole compositions, the numerous categories within its systematic classification schemes, and the intricate details that must be known before assigning a name to a specific crystal make classification a tedious but necessary task in a petrologist’s job. However, this task can be taken off human shoulders by implementing machine learning classification methods. This contribution explores this possibility by training classifier models using three different algorithms: K-nearest neighbors (KNNC), random forests (RFC), and support vector machines with radial basis function kernel (SVM+RBF). In terms of model performance, SVM+RBF classifiers showed the highest training (0.89), testing (0.77), and validation (0.72) accuracies of all trained models, followed by KNNC (0.86, 0.75, and 0.68, respectively) and RFC based on raw compositions (0.80, 0.73, and 0.68). The observed inaccuracy of the models is attributed to label imbalance within the dataset. This contribution also discusses the effects on model accuracy and long-term generalization ability of (i) obtaining more experimental data for underrepresented categories, or (ii) reducing the number of highly represented labels within the dataset.

Is it an Earthquake? Earthquake detection for Earthquake Early Warning  by Abigail Case

On average, 39,500 people die annually as a result of infrastructure collapse caused by earthquake shaking and the resulting loss of basic resources. There is currently no way to predict exactly where and when a large earthquake will occur, but the growing area of earthquake early warning seeks to give people who live in areas of high earthquake hazard valuable seconds to seek safety before shaking begins. These systems rely on seismometers placed near faults to record the initial P wave arrival and relay a warning ahead of shaking. Here, we train several machine learning models on ~40,000 examples of short earthquake and noise waveforms from the STEAD dataset to detect earthquake shaking. The traditional earthquake detection method, STA/LTA, which serves as our baseline model, correctly classifies an impressive 78% of waveforms. We find that logistic regression cannot improve on this result, but a random forest and a gradient boosted random forest model are able to slightly improve upon the baseline, correctly classifying about 80% of waveforms. While the machine learning methods improve on the traditional method, they are not sufficiently reliable as they stand to be put to use for earthquake early warning, and future efforts will focus on improving reliability through feature engineering and anomaly detection methods. [STEAD data set]
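For readers unfamiliar with the STA/LTA baseline mentioned above, the idea can be sketched as the ratio of a short-term average (STA) to a long-term average (LTA) of signal energy, with a detection declared when the ratio crosses a threshold. This is a minimal sketch of the classic form, not the author's implementation; the window lengths, the threshold, and the toy waveforms are illustrative assumptions.

```python
def sta_lta_ratio(signal, n_sta, n_lta):
    """STA/LTA energy ratio at each sample (classic form).

    Both windows end at the current sample and n_sta <= n_lta is assumed.
    Samples before the first full long window get a ratio of 0.0.
    """
    # Prefix sums of squared amplitude make each window average O(1).
    prefix = [0.0]
    for x in signal:
        prefix.append(prefix[-1] + x * x)
    ratios = [0.0] * len(signal)
    for i in range(n_lta - 1, len(signal)):
        sta = (prefix[i + 1] - prefix[i + 1 - n_sta]) / n_sta
        lta = (prefix[i + 1] - prefix[i + 1 - n_lta]) / n_lta
        ratios[i] = sta / lta if lta > 0 else 0.0
    return ratios

def detect(signal, n_sta=5, n_lta=50, threshold=4.0):
    """Flag a waveform as an event if the ratio ever reaches the threshold.

    The default window lengths and threshold are assumed values.
    """
    return max(sta_lta_ratio(signal, n_sta, n_lta), default=0.0) >= threshold

# Made-up waveforms: low-amplitude noise, and noise followed by a burst.
noise = [0.1 if i % 2 == 0 else -0.1 for i in range(120)]
quake = noise[:100] + [1.0 if i % 2 == 0 else -1.0 for i in range(20)]
print(detect(noise), detect(quake))  # prints "False True"
```

In practice the window lengths are usually chosen in seconds and converted to samples using the sampling rate, and the threshold trades missed events against false alarms.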

Physical Sciences

Predicting Satellite Position and Velocities using Deep Learning  by Christopher Pellinger

Tracking the orbits of satellites and other orbital objects is becoming an increasingly important problem: as more satellites are launched into orbit each year, the chance of collisions between objects increases, resulting in more orbital debris. Long-term predictions would allow collisions to be detected and avoided far in advance. Currently, future orbits are predicted using physics-driven models known as simplified perturbations models (SGP), which fail when attempting long-term predictions. In this paper, a recurrent neural network is proposed as an alternative method for predicting future orbits. [report]