The following are projects completed by students in EEPS-DATA 1340: Machine Learning for the Earth and Environment at Brown University in Spring 2025.
The projects below are shared with permission of the student authors and should not be redistributed.
Examining Bias Correction on Water Level Forecasts for the Providence River by Aalyaan Ali and Fiona Harrington
Water level forecasts are necessary to prevent damage from flooding and storm surges, chart shipping routes, and understand how a body of water is changing. This study aims to improve water level forecasts for the Providence River by using machine learning, together with meteorological data, to correct bias in the traditional physical model. We trained and evaluated three models (Linear Regression, Random Forest, and Multi-Layer Perceptron) on a dataset of hourly water level observations from 2020-2024 and on a dataset including only observations during storms. The Multi-Layer Perceptron achieved the best results on both the overall dataset and the storm dataset, with time series cross-validated mean absolute errors of 7 cm and 21 cm, respectively. However, all three models performed similarly, with errors of 7-9 cm, compared to the physical model’s 17 cm mean absolute error. Overall, we achieved a significant improvement over the physical water level model.
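As a rough illustration of the evaluation described in this abstract (not the authors' code; the data file and column names below are hypothetical placeholders), a time-series cross-validated mean absolute error for a bias-correction model can be computed with scikit-learn roughly as follows:

    # Illustrative sketch only: bias-correct a physical water level forecast
    # with a regression model and score it by time-series cross-validated MAE.
    # The file name and columns (obs, physical_model, wind_speed, pressure)
    # are hypothetical placeholders.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import TimeSeriesSplit, cross_val_score

    df = pd.read_csv("providence_river_hourly.csv")
    X = df[["physical_model", "wind_speed", "pressure"]]  # forecast + meteorology
    y = df["obs"]                                          # observed water level (cm)

    tscv = TimeSeriesSplit(n_splits=5)  # folds respect temporal order
    scores = cross_val_score(RandomForestRegressor(n_estimators=300), X, y,
                             cv=tscv, scoring="neg_mean_absolute_error")
    print(f"Time-series CV MAE: {-scores.mean():.1f} cm")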
Detecting and Forecasting Hydrogen Sulfide in the Salton Sea, CA using Satellite Imagery and Machine Learning by Alejandra Lopez
Detecting hydrogen sulfide (H2S) exceedance days is crucial for maintaining air and water quality near the Salton Sea, California. However, detection is difficult due to sparse monitoring networks, inconsistent measurements, and noisy data. Remote sensing provides an alternative, as gypsum (a product of sulfur cycling) exhibits a distinct spectral signature that can be used to detect H2S. This study presents a machine learning framework to predict H2S exceedance events using MODIS surface reflectance data, in-situ measurements, and engineered band ratios. Five supervised learning models were tested: KNN, Logistic Regression, Random Forest, XGBoost, and SVM. The Random Forest model outperformed the others, achieving the highest F1 and PR AUC scores while maintaining good generalization. Results indicate that raw reflectance summaries alone are insufficient for detecting subtle water changes, and that band ratios are needed to improve H2S detection. This work lays the foundation for using satellite imagery and machine learning to detect air quality anomalies.
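For readers unfamiliar with the band-ratio approach, the sketch below shows one way such features could be engineered and a Random Forest scored by F1 and PR AUC with scikit-learn; it is not the authors' pipeline, and the file name, target column, bands, and ratio choices are assumptions.

    # Illustrative sketch only: engineer simple MODIS surface-reflectance band
    # ratios and evaluate a Random Forest on H2S exceedance days. The file name,
    # target column, and specific ratios are hypothetical.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score, average_precision_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("salton_sea_modis.csv")
    df["b2_b1"] = df["sur_refl_b02"] / df["sur_refl_b01"]   # example band ratios;
    df["b4_b3"] = df["sur_refl_b04"] / df["sur_refl_b03"]   # the project's may differ

    X = df[["sur_refl_b01", "sur_refl_b02", "b2_b1", "b4_b3"]]
    y = df["h2s_exceedance"]          # 1 = exceedance day, 0 = normal

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    rf = RandomForestClassifier(n_estimators=500, class_weight="balanced")
    rf.fit(X_tr, y_tr)
    print("F1:    ", f1_score(y_te, rf.predict(X_te)))
    print("PR AUC:", average_precision_score(y_te, rf.predict_proba(X_te)[:, 1]))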
Analyzing SWOT’s Potential for Ice Detection on Arctic Lakes by Thomas Howard and Andy Ye
Machine Learning Classification of Ocean Island Basalts using Radiogenic Isotope Ratios by Tianyi Li
ML methods are applied to the Harðardottir & Jackson (2024) Ocean Island Basalt (OIB) database, comprising ~1300 filtered data points with five features (the radiogenic isotope ratios 87Sr/86Sr, 143Nd/144Nd, 176Hf/177Hf, 206Pb/204Pb, and 208Pb/204Pb), to visualize geochemically close clusters across different hotspots and to provide an ML-based model that classifies OIBs into five endmembers (PREMA, DMM, EM1, EM2, HIMU) after labeling a portion of the hotspots. The t-SNE projection of the clusters shows relationships broadly similar to traditionally interpreted endmembers, while the decision boundary provides comparable first-order endmember predictions for unlabeled hotspots. KNN performs better than random forest, while the overlap of sample data points between different endmembers in isotope space poses a significant challenge to the selected methods. [report]
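A minimal sketch of this kind of workflow is shown below (not the author's code; the file and column names are assumptions): standardize the five isotope ratios, project them with t-SNE for visualization, and cross-validate a KNN endmember classifier on the labeled hotspots.

    # Illustrative sketch only: t-SNE projection of the isotope ratios plus a
    # KNN endmember classifier. The file name and column names are hypothetical.
    import pandas as pd
    from sklearn.manifold import TSNE
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("oib_isotopes.csv")
    features = ["Sr87_Sr86", "Nd143_Nd144", "Hf176_Hf177",
                "Pb206_Pb204", "Pb208_Pb204"]

    X_scaled = StandardScaler().fit_transform(df[features])
    embedding = TSNE(n_components=2, perplexity=30).fit_transform(X_scaled)  # 2-D coordinates for cluster plots

    labeled = df["endmember"].notna()   # PREMA, DMM, EM1, EM2, HIMU labels on some hotspots
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    print(cross_val_score(knn, df.loc[labeled, features],
                          df.loc[labeled, "endmember"], cv=5).mean())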
Identifying Lunar Craters with Machine Learning by Destiny Wilson, Tiffany Gao, and WaTae Mickey
Craters hold the key to a wide variety of important geologic information used to understand the age and impact history of planets. The importance of this information extends beyond the planets themselves and places constraints on their solar systems and the universe as a whole. Counting and classifying craters has traditionally been a manual task, but this is a tedious and error-prone process. In this study, we present a machine learning approach to automate crater classification, reducing human error and increasing the accuracy of crater size-frequency distribution analysis. Using the Lunar Reconnaissance Orbiter dataset, consisting of 502 fresh craters, 869 old craters, and 3,629 images containing neither, we performed feature extraction to obtain 768-dimensional embeddings, then applied dimensionality reduction, retaining 95% of the variance in the data. Several models were tested on the dataset, the two most notable being SVM and Random Forest. The accuracy of SVM was 80%, while Random Forest achieved 82% with higher overall precision. The models positively identified craters effectively but struggled to detect all craters in the dataset, highlighting areas for future improvement. They did, however, correctly identify nearly all images containing no craters at all. [report]
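The sketch below illustrates the dimensionality-reduction and classification steps described above, assuming precomputed embeddings saved as NumPy arrays; it is not the authors' code, the file names and label encoding are hypothetical, and PCA is only one way to retain 95% of the variance.

    # Illustrative sketch only: reduce 768-dimensional image embeddings while
    # keeping 95% of the variance, then compare SVM and Random Forest accuracy.
    # The input files and label encoding are hypothetical.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X = np.load("crater_embeddings.npy")   # shape (n_images, 768)
    y = np.load("crater_labels.npy")       # e.g. 0 = no crater, 1 = fresh, 2 = old

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    pca = PCA(n_components=0.95).fit(X_tr)  # keep components explaining 95% of variance
    X_tr_p, X_te_p = pca.transform(X_tr), pca.transform(X_te)

    for model in (SVC(kernel="rbf"), RandomForestClassifier(n_estimators=300)):
        model.fit(X_tr_p, y_tr)
        print(type(model).__name__, accuracy_score(y_te, model.predict(X_te_p)))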
Evaluating Feature Engineering Techniques and Machine Learning Methods for the Classification of Anuran Vocalizations by Max Newman, Nuttanon Tangsunthornwiwat, and Yoshihiro Yajima
Amphibians are one of the most threatened taxa globally, which makes them a top priority in population monitoring research. However, it is challenging to track amphibians in their native habitats, as they are often cryptic. Passive acoustic monitoring (PAM) allows large-scale, passive population monitoring at low cost and is especially effective for species that rely on acoustic communication, such as frogs. However, PAM tends to generate massive datasets, posing a challenge for data processing. Machine learning (ML) techniques are increasingly popular among ecologists as a way to automate these computationally expensive classification tasks. In this study, we aimed to identify the best feature sets and ML models for audio classification tasks using the anuran vocalization data of Cañas et al. (2023). We developed two feature sets from spectrogram data and four ML models (KNN, RF, AdaBoost, and SVM), assessed their performance on single-species vocalization data, and evaluated their generalizability on multispecies data. Overall, we found that AdaBoost performed robustly on both single- and multi-species data, which was expected given its ability to reweight misclassified samples during training. Our study highlights the importance of proper feature and model selection for classification tasks, especially on cluttered datasets such as those generated by PAM. [report]
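As a rough illustration of such a pipeline (not the authors' feature sets; librosa MFCC summaries stand in here for the project's spectrogram features, and the metadata file and columns are hypothetical), one could summarize each clip and cross-validate an AdaBoost classifier as follows:

    # Illustrative sketch only: summarize each recording's spectrogram with MFCC
    # statistics and classify species with AdaBoost. File and column names are
    # hypothetical placeholders.
    import numpy as np
    import pandas as pd
    import librosa
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score

    def mfcc_summary(path, sr=22050, n_mfcc=20):
        """Mean and std of MFCCs as a fixed-length feature vector per clip."""
        y, sr = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    meta = pd.read_csv("anuran_clips.csv")          # columns: path, species
    X = np.vstack([mfcc_summary(p) for p in meta["path"]])
    print(cross_val_score(AdaBoostClassifier(n_estimators=200),
                          X, meta["species"], cv=5).mean())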
Predicting the Impact of Wildfire Incidence on Insurance Premiums Using Machine Learning by Keya Kilachand, John Wilkinson, and Stella Ljung
In this investigation, we explore the relationship between wildfire risk and insurance premiums. With rising wildfire tragedies and subsequent increases in the premiums homeowners are charged, understanding how climate-related incidents translate into financial consequences could help the people most affected make more informed decisions. To do this, we used wildfire and insurance premium data from nearly 2,000 ZIP codes across California and investigated how these data could be used to predict the insurance premium for each ZIP code. Our exploratory data analysis included clustering to explore possible geographic correlations and plotting our data on a map to examine the geographic distribution of our data points. We trained and tuned three types of models for our predictive task: ElasticNet, Support Vector Machine with linear and non-linear kernels, and Random Forest. The idea was to try a range of models of increasing complexity to find which could best capture the underlying structure of our data. The support vector regression with a polynomial kernel performed best in terms of its RMSE and R² metrics, but we ultimately chose the random forest model as most suited to our task, a close second on the same metrics. This is because our exploratory analysis shows differing geographic concentrations of our data points, making the random forest model more likely to generalize to new data. We identify our nuanced train-test split and the differing performance of our models across target variable values as areas for further exploration. [report]
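A hedged sketch of this model comparison is shown below; the file, feature, and target column names are assumptions rather than the project's actual variables, and each regressor is scored by RMSE and R² on a held-out split.

    # Illustrative sketch only: compare ElasticNet, polynomial-kernel SVR, and
    # Random Forest on ZIP-code-level premiums. File, feature, and target names
    # are hypothetical placeholders.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import ElasticNet
    from sklearn.svm import SVR
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error, r2_score

    df = pd.read_csv("ca_wildfire_premiums.csv")
    X = df[["fire_count", "acres_burned", "median_home_value"]]
    y = df["avg_premium"]

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    models = {
        "ElasticNet": make_pipeline(StandardScaler(), ElasticNet(alpha=0.1)),
        "SVR (poly)": make_pipeline(StandardScaler(), SVR(kernel="poly", degree=3, C=10)),
        "RandomForest": RandomForestRegressor(n_estimators=300),
    }
    for name, model in models.items():
        pred = model.fit(X_tr, y_tr).predict(X_te)
        rmse = np.sqrt(mean_squared_error(y_te, pred))
        print(f"{name}: RMSE={rmse:.0f}, R2={r2_score(y_te, pred):.2f}")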
Post-Fire Assessment for Forest Fires in Oregon by Helena Soares Barros, Marisol Jimenez, and anonymous student
The study of forest fires has become increasingly critical due to their unprecedented surge in frequency and intensity as a result of climate change. By analyzing data around forest fire events, emergency response teams can gain critical insights into how to better respond after forest fires occur. This response could be enhanced, specifically by improving the investigative process of identifying fire origins and assessing damage. This process depends on accurate predictions of a number of factors, including forest fire cause and total estimated acres burned (Gorbett 2015). Machine learning has the potential to augment this process by making robust predictions based on a comprehensive analysis of relevant factors. By improving this process, firefighting units could redirect resources traditionally invested in making these predictions toward other equally important aspects of post-fire assessment. These redistributed resources could have powerful impacts on communities, ensuring quicker accountability for liable parties and quicker rebuilding assistance for impacted areas.
ML-Based Prediction of Diabetes from Individual Health Metrics by Wonjin Ko, Ketan Pamurthy and anonymous student
Type 2 diabetes mellitus remains a major global health burden, with millions of cases going undiagnosed due to limited access to early screening. In this study, we explored the ability of machine learning models to distinguish between diabetic/pre-diabetic and healthy individuals based on health data collected in the 2015 CDC Behavioral Risk Factor Surveillance System (BRFSS) telephone survey. We trained five classifiers – Logistic Regression, K-Nearest Neighbors, Random Forest, XGBoost, and Multilayer Perceptron – and evaluated their performance based on accuracy and recall for the diabetic/pre-diabetic class. Ensemble methods (voting and stacking) were also explored as a way to improve performance, and SHAP analysis was used to interpret feature importance and identify key predictive variables. Our findings highlight that class imbalance significantly hinders the models’ ability to predict diabetes or pre-diabetes cases, but that oversampling can substantially enhance recall. We also found that the baseline model (logistic regression) trained on the oversampled data achieved the best performance of any model. Finally, we identified BMI, General Health, and High Blood Pressure as the most important features across models. These results underscore the importance of balanced data for predictive models, as well as the potential of interpretable, data-driven tools for improving early diabetes detection in public health contexts.
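As a minimal sketch of the oversampling remedy described above (the abstract does not specify the exact oversampling method; imbalanced-learn's RandomOverSampler is one common option, and the file and column names are assumptions), the minority class can be oversampled on the training split only before fitting the logistic regression baseline:

    # Illustrative sketch only: oversample the diabetic/pre-diabetic class on the
    # training split, fit logistic regression, and report recall for that class.
    # The file name and target column are hypothetical placeholders.
    import pandas as pd
    from imblearn.over_sampling import RandomOverSampler
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import recall_score

    df = pd.read_csv("brfss_2015.csv")
    X = df.drop(columns="Diabetes_binary")
    y = df["Diabetes_binary"]           # 1 = diabetic/pre-diabetic, 0 = healthy

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)

    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    print("Recall (diabetic/pre-diabetic):", recall_score(y_te, clf.predict(X_te)))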