Student Projects

Spring 2022

The following are projects completed by students in EEPS 1960D: Machine Learning for the Earth and Environment at Brown University in Spring 2022. 

The projects below are shared with permission of the student authors and should not be redistributed. 

Climate and Environment

Predicting Causes of Deforestation Using Machine Learning Classification Techniques  by Anvita Bhagavathula

Deforestation poses one of the largest threats to our climate, and one of the fundamental challenges we face today is accurately identifying its drivers. Addressing this challenge is crucial for policy-makers and governments who develop forest-conservation laws. There has been a growing use of machine learning and deep learning techniques on satellite imagery to predict the causes of deforestation. In this project, we evaluated the performance of three different machine learning models in predicting the drivers of deforestation for deforested regions in Indonesia. Specifically, we used the ForestNet dataset (curated by the Stanford ML Group), which contains Landsat 8 images of deforested regions with associated deforestation driver labels, to train our models. We implemented a Random Forest classifier, a Multilayer Perceptron classifier run with features extracted from the VGG19 pre-trained neural network, and a Convolutional Neural Network. From our analysis, we observed that the neural-network models outperform the Random Forest model. We then discussed further steps, such as data augmentation, to improve the performance of our models. Ultimately, we emphasized the role of deep learning techniques for feature extraction from, and classification of, image data. [report] [data source]

Biome Clustering and Classifying Land Ecoregions in the United States  by Dan Fiume and Poom Yoosiri

Climate classification is limited by data availability. As a result, historical climate classification models often considered only the most rudimentary climate data, such as average temperature and precipitation. This project incorporates up-to-date and detailed climate data from locations across the United States to create a data-driven climate classification model. Supervised and unsupervised machine learning algorithms are applied to collected precipitation data and to remotely sensed temperature and elevation data to build a predictive model that classifies the climate of any given location. The unsupervised learning techniques K-Means and DBSCAN are first applied to the climate data to reveal any underlying patterns. Climate data is then trained on the EPA’s Land Ecoregions model, a robust classification model built upon vegetation data. Various supervised learning methods are then applied to the dataset and compared to the baseline K-Nearest Neighbors (KNN) method. Neural network (NN) methods such as Multilayer Perceptron (MLP) generally perform the best in terms of accuracy, precision, and recall on the original imbalanced data. However, support vector machines (SVM) perform better than NN methods on algorithmically balanced data. Further work is needed to improve the accuracy of the predictive model. [report]

Time Series Analysis of Italian Aquifers  by Brendan Ho and Katie Orchard

As populations grow worldwide and climate change patterns continue to deplete freshwater resources, water demand is increasing while water supply is decreasing. It is therefore becoming increasingly important to predict the water level in bodies of water to balance daily consumption. Our project focuses on bodies of water managed by the Acea group, which supply water for 9 million people in Italy. In this project we use four models to predict the depth to groundwater levels for an aquifer managed by the Acea group: a time series random forest with absolute depths, a time series random forest with difference in depth, a long short-term memory (LSTM) network, and an autoregressive integrated moving average (ARIMA) model. The time series random forest with difference in depth performed the best out of all our models. We attribute this to the model's ability to detect how different features affected depths in the past few days and to use that context to predict the next day's depth. [report: Colab notebook] [data source]
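
The "difference in depth" framing described above can be sketched as follows. This is a minimal illustration, not the authors' notebook: the function names, the lag count, and the depth values are hypothetical, and only lagged depth changes are used as features, whereas the project also drew on other features.

```python
def make_difference_dataset(depths, n_lags=3):
    """Turn a daily depth series into (lagged differences, next-day difference) pairs."""
    # Day-to-day changes in depth: diffs[i] = depths[i + 1] - depths[i]
    diffs = [b - a for a, b in zip(depths, depths[1:])]
    features, targets = [], []
    for i in range(n_lags, len(diffs)):
        features.append(diffs[i - n_lags:i])  # changes over the previous n_lags days
        targets.append(diffs[i])              # the change the model should predict
    return features, targets


def reconstruct_depth(last_depth, predicted_diff):
    """Convert a predicted day-to-day change back into an absolute depth."""
    return last_depth + predicted_diff


# Hypothetical depth-to-groundwater readings (not taken from the Acea data).
depths = [10, 12, 11, 15, 14, 18]
X, y = make_difference_dataset(depths, n_lags=2)
print(X)  # one row of lagged changes per training example
print(y)  # the matching next-day changes
```

The rows of `X` and the values in `y` could then be passed to any regressor, such as a random forest, and `reconstruct_depth` turns each predicted change back into a depth.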

Supervised Learning Techniques for Predicting Rogue Waves and Maximum Wave Height  by Isabel Horst

Rogue waves often come with little to no warning and pose a serious threat to offshore structures and ocean vessels. The ability to predict these rare phenomena is crucial to the safety of boat crews and offshore workers and gives interesting insight into complex ocean dynamics. However, ocean waves have proven to be incredibly difficult to predict because they are influenced by physical factors ranging from wind speed to large-scale wave patterns. I use sea state data taken at Gladstone, Australia to train several supervised learning algorithms to predict both the maximum wave height and the occurrence of a rogue wave. Rogue waves only make up ~7% of the observations, so I use recall and F-scores alongside accuracy to gauge the performance of my models. Although no one model performs well across all metrics, I achieve a recall of 0.52 for predicting rogue waves using a Linear Support Vector Machine. [report] [data source]
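
To illustrate why accuracy alone can mislead when only about 7% of observations are positive, the sketch below computes accuracy, precision, recall, and F1 from scratch. The labels and both sets of predictions are hypothetical and are not results from the project.

```python
def classification_scores(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = rogue wave)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical data: 7 rogue waves among 100 observations.
y_true = [1] * 7 + [0] * 93

# A model that never predicts a rogue wave.
always_negative = [0] * 100
naive = classification_scores(y_true, always_negative)

# A model that finds 4 of the 7 rogue waves but raises 10 false alarms.
detector = [1] * 4 + [0] * 3 + [1] * 10 + [0] * 83
useful = classification_scores(y_true, detector)

print(naive)
print(useful)
```

In this made-up case the model that never predicts a rogue wave has the higher accuracy but zero recall, which is why recall and F-scores are reported alongside accuracy.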

Bedrocks / Chokepoints Detection on Rivers of Canadian Shield using Very High-Resolution Imagery by Nimisha Wagle

Chokepoints, the bedrock outcrops in a river, are a major control on the flow of water and play a large role in hydrological connectivity. At low flow, the bedrock obstructs the river; at high flow, the river overtops it, forming whitewater riffles. Identifying chokepoints manually is very tedious. In this project, I applied several object detection models to detect bedrock in rivers: Single Shot Detector (SSD), Faster Region-based Convolutional Neural Network (Faster RCNN), and You Only Look Once (YOLO). I used panchromatic WorldView imagery at 0.5 m resolution. None of the models produced high accuracy; however, the Faster RCNN model outperformed the others in terms of precision and recall. The poor results may stem from the use of panchromatic images, in which the models confused objects with the background. More training data, pansharpened images, and more careful parameter tuning might improve the results.

Forest Structure Variation in Amazon Rainforests  by Dafeng Zhang

Geophysics and Geochemistry

Using Machine Learning to Identify SS Precursors  by Yiran Huang

SS precursors are a useful phase for determining the depth of the lithosphere-asthenosphere boundary (LAB). The arrival times and amplitudes of SS precursors can provide much information about the LAB, but identifying this signal is subjective and time-consuming. Machine learning can be used to pick this phase. In this project, seven datasets are generated, including synthetics and individual or stacked real seismograms with different features. The features comprise the amplitude ratio between SS and potential points in acceleration, the amplitude ratio between SS and potential points in displacement, and the arrival time difference between SS and the potential points. The SS precursors of whole-event stacks are used as a reference to label every seismogram or local stack. Five models are tested to classify the potential points as precursors or not. Almost all models perform well on the synthetic datasets, but logistic regression does not work well on the real datasets. The other models do not work on messy individual seismograms. Decision tree and random forest methods work better than KNN on real datasets with more features. This project also offers some tips for studying SS precursors, such as the necessity of stacks, feature selection, and the data size needed for these studies. [report]

Machine Learning for Receiver Function Quality Control  by Hannah Krueger

Observing seismic phases from discontinuities in the upper mantle is challenging. The primary method for these observations is S-to-P receiver functions, which tend to have low signal-to-noise ratios and a relatively low quantity of data at a single station. Picking high-quality traces can be tedious, and traditional methods for automating data culling often throw out valuable data. In previous work, I utilized an unsupervised algorithm, K-means, to provide S-to-P receiver function data from global cratons with labels based on quality. In this project, I both (1) reevaluate the unsupervised algorithm used to initially label this data set and (2) use the previously labeled data as training data for a supervised learning algorithm in order to label new seismic data based on its quality. These labels can be used to weight individual receiver functions to create higher-quality receiver function stacks. (1) Reevaluation of clustering algorithms reveals that for future development of training datasets (for other tectonic settings), hierarchical clustering is beneficial for data culling. (2) A dense neural network using a mean-absolute error loss function adequately labels a withheld test data set, maintaining a large amount of data while rating highly the best-quality receiver functions. We also investigate using logistic regression for initial data culling but do not find satisfactory results. Future work will result in a complete, multi-step algorithm for culling and weighting new receiver function data. [report]

Geochemical Patterns in Subcatchments of the Wulik River, AK  by Sebastian Muñoz

The source of chemical variation in water samples was investigated in the Wulik River in Northwestern Alaska. Principal component analysis and factor analysis were performed iteratively on three datasets: one with data from multiple geographic locations, one from a single location with added discharge measurements to investigate within-site variability, and a final one with discharge and chemical concentrations from multiple locations. Data was compiled from measurements taken by the Teck Corporation as part of the Red Dog mine project, as required by the EPA. Additional discharge data was taken from USGS streamgauge measurements, as well as from a series of instruments maintained by the mine. Results from the principal component analysis indicate that geographic variability was weighted most heavily, with variations in discharge explaining the within-site variability in the data. Results also indicate the potential influence of sulfuric-acid-induced weathering or evaporite dissolution. Temperature explains a large amount of variability, suggesting, together with discharge, that seasons influence the differences in water chemistry. Further work to characterize the factors that result in spatial and seasonal variability will help better understand the seasonal source of solutes in the Wulik River catchment.

Physical Sciences

Determining Photometric Redshifts via Machine Learning Analysis of Photometric Measurements  by Rachel Hemmer

With the advent of large surveys and increased data collection techniques in astronomy and astrophysics, it becomes more important to have an accurate method of calculating photometric redshifts rather than relying on obtaining spectral data. This investigation aims to use AB magnitudes in the u, g, r, i, and z bands to predict photometric redshift of galaxies, and utilizes data from galaxy clusters A85, A3911, A3921, and A2029 processed by the Brown University Astrolensing group to do so. Several regression techniques are optimized and evaluated to achieve this aim, and in the end a Gaussian Process Regressor with a mean average percent error from the true redshift of 35.6% is selected. This model provides a significantly more accurate method of photometric redshift evaluation and can give uncertainties in prediction. [report: Colab notebook]

Predicting Critical Temperatures of Superconductors  by Chenyu Zhang

Inspired by the work of Hamidieh et al., this project aims to explore machine learning methods for the prediction of critical temperatures of superconductors. XGBoost models that train on the information of both molecular structures and element components are used. The test performance is MSE = 74.6982 K^2 and R-squared = 0.9351, equal to or slightly better than the model performance in Hamidieh et al.’s work. [report] [data source]
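
For readers unfamiliar with the two metrics quoted above, the following sketch computes mean squared error and R-squared from their standard definitions. The temperatures are hypothetical values in kelvin, not data or predictions from the project.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared residual; units are the square of the target's units (K^2 here)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot: the fraction of variance in y_true explained by y_pred."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical critical temperatures and model predictions, in kelvin.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

mse = mean_squared_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
print(mse, r2)
```

Taking the square root of the MSE gives a typical error in kelvin, which is often easier to interpret than the squared quantity.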

Health Sciences

Extend Tuberculosis Prediction Model in Africa with Machine Learning  by Tingyi Li

According to the WHO, tuberculosis (TB) is the ninth leading cause of death worldwide, and more than 25% of those deaths are in Africa. Additionally, TB affects those with positive HIV status more seriously, as 40% of HIV deaths were related to TB. Baik et al. (2020) also suggested that it is difficult for under-resourced clinics in Africa to get same-day testing results for active TB. In this project, the study of Baik et al. (2020) is extended by using machine learning algorithms to predict the TB result based on self-reported and biological characteristics. The models could help physicians and nurses in under-resourced areas decide whether or not to treat a patient before getting the test result. This project also compared two different imputation methods, mode imputation and Random Forest imputation, combined with seven different machine learning algorithms for prediction. Additionally, different model evaluation metrics are discussed, especially for the particular diagnostic test. [report] [data source]

Developing a Classification Algorithm for Predicting West Nile Virus Occurrence in Chicago  by Hillel Rosenshine

West Nile Virus has posed a greater threat globally in recent years due to climate change. Better understanding the complex relationship between weather variables and West Nile Virus occurrence is critical for mitigating the spread of the virus. This study employs several machine learning algorithms to classify mosquito traps with and without West Nile Virus throughout the city of Chicago, given weather and human intervention data. Since baseline KNN and logistic regression models exhibit poor performance due to class imbalance, the majority class is undersampled and the minority class is oversampled, leading to better performance by Random Forest and KNN classifiers with this resampled dataset. An SVM classifier trained with balanced class weights also improves performance. KNN trained on resampled data demonstrates the strongest performance, indicated by high recall and ROC AUC scores as well as temporal and spatial distributions matching observations. This study provides a starting point for predicting West Nile Virus occurrence spatially and temporally, but more hyperparameter tuning, other methods of dealing with class imbalance, and feature engineering using human intervention data are suggested to improve predictions. [report] [data source]
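
The resampling step described above, undersampling the majority class and oversampling the minority class, can be sketched with the standard library alone. This is an illustrative assumption about how such a step might look, not the author's code: the function name, the target size, and the toy data are hypothetical.

```python
import random


def resample_balanced(rows, labels, target_per_class, seed=0):
    """Return a dataset with exactly target_per_class examples of each label."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        if len(members) >= target_per_class:
            # Majority class: undersample without replacement.
            chosen = rng.sample(members, target_per_class)
        else:
            # Minority class: keep every example, then draw extras with replacement.
            extra = rng.choices(members, k=target_per_class - len(members))
            chosen = members + extra
        out_rows.extend(chosen)
        out_labels.extend([label] * target_per_class)
    return out_rows, out_labels


# Hypothetical trap records: 5 positive (label 1) and 95 negative (label 0).
rows = list(range(100))
labels = [1 if i < 5 else 0 for i in rows]

new_rows, new_labels = resample_balanced(rows, labels, target_per_class=20)
print(new_labels.count(0), new_labels.count(1))
```

A common precaution is to resample only the training split, so that the evaluation data keeps the original class proportions.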

Natural Hazards

Data-Driven Fire Behavior Modeling with Machine Learning  by Mason Lee and Will Kattrup

Catastrophic wildfires are considered among the most dangerous and devastating disasters, causing over $10.38 billion in damages in 2020 alone. Predicting wildfire behavior is a computationally complex task because a plethora of vegetative and meteorological factors influence spread rate and direction. Additionally, with a strained fire corps in California, it is imperative that initial suppression efforts are optimized in order to reduce fire size. Traditionally, fire spread is modeled with physics-based models. While such models are ubiquitous, spread prediction is improved by a comprehensive set of environmental covariates. Encompassing eight years of historic California fire data, WildfireDB is a dataset based on remote sensing data related to vegetation and meteorological conditions, as well as fire indicators. This study aims to predict the behavior of wildfires at a daily temporal resolution through the creation of a machine learning model. Our model seeks to predict next-day fire intensity in an area of interest given an initial condition of environmental data. Two major approaches were used to solve this problem. The first is a combination of unsupervised clustering methods coupled with traditional supervised machine learning algorithms. The second is a time series analysis using a deep recurrent neural network. Experimental results gave mixed prediction accuracy in fire type classification. A random forest classifier achieved 68% accuracy in predicting fire and 65% accuracy in predicting non-fire observations. Random forest regression achieved relatively low mean squared error and exhibited an ability to detect large outlier wildfires. Difficulties in dataset structure require further processing and increased computation power; however, this study provides the framework for continued research into next-day wildfire spread using WildfireDB (v1/v2). [data source]

Next-Day Wildfire Forecasting  by anonymous student

Accurate models of wildfire spread have the potential to save lives and property by informing decisions about where to build homes and when to evacuate residents. Over the last decade, the availability of worldwide remote sensing data from Google Earth Engine has enabled data-driven approaches to the task of predicting where and how quickly wildfires will spread. In this project, a variety of machine learning methods—logistic regression, k nearest neighbors, random forest, and adaptive boosting—are applied in an attempt to forecast where a wildfire will spread 24 hours after it is observed. Almost all models outperform the baseline precision and recall, but none performs well enough to be of use in practical applications. Suggestions for future work include consultation with experts about potentially relevant features to add to the dataset, acquisition of data at finer temporal resolution, and specialization of machine learning models to particular geographical contexts.

Other Topics

Classifying News Headlines as Satirical or Real with Machine Learning  by Dan Wexler

Machine learning methods are used to classify news headlines as satirical or real. Two popular classification methods for natural language processing, Multinomial Naive Bayes (MNB) and Stochastic Gradient Descent (SGD), are used and their results are compared. Different representations for the headlines are also used, such as Bag of Words and Word2Vec, and the results between these representations are compared. The effects that different methods of data cleaning had on the models are also discussed. [report] [data source]

Environmental News Sentiment Analysis and Classification  by TzuHwan Seet

Bank Marketing Classification using Machine Learning  by Jiaxin Tang