The effects of climate change are far-reaching, and the oceans are not immune to the impact of human activities on the Earth's natural patterns. Approximately 25% of annual CO2 emissions are absorbed by the world's oceans[1], and current predictions estimate that by 2100, ocean acidity may be 150% higher than at the beginning of the Industrial Revolution[2].
The cause of this increased acidity is atmospheric carbon dioxide dissolving in ocean water, where it reacts with water and carbonate ions to form bicarbonate (Equation 1).
CO2 + H2O + CO3^2- → 2 HCO3^- (Equation 1)
Decreasing the pH of ocean water inhibits the ability of calcifying species to form calcium carbonate exoskeletons[2]. Damage to corals and to species such as oysters and clams could cause ocean food chains to collapse and limit human access to seafood. A two-minute video explanation of ocean acidification from the North Carolina Aquarium at Fort Fisher is shown to the right[3].
The Global Ocean Data Analysis Project (GLODAP) is a database of ocean water samples with datetime stamps, locations, and complete water chemistry analysis. This database is publicly available and regularly updated with new data from scientific cruises. The goal of this project was to fill the gaps in the total carbon dioxide measurements (TCO2) in the v2.2020 version of the dataset. The GLODAP v2.2020 dataset has approximately 65% of entries missing TCO2 values. This project used a regression ML model to fill the missing entries in the dataset, using measurements of pH, alkalinity, geolocation, temperature, date, etc. as model inputs.
Increasing the robustness of this dataset could help identify net carbon dioxide absorption areas that may be at higher risk of earlier-than-predicted ocean acidification. It may be possible to protect these areas in ways that slow or mitigate the changes in TCO2 that damage marine life and a valuable human food source.
The overall project process map is shown below. Raw CSVs split by ocean region are downloaded from the GLODAP v2.2020 repository and imported into Jupyter Notebooks. The CSVs are combined and filtered, and the data quality columns, among others, are removed from the dataset. The input variables of station, potential temperature (θ), longitude, latitude, depth, date, salinity, temperature, pressure, potential density (σ), bottom depth, and max sample depth are used to train and test machine learning (ML) regression models: random forest (RF), decision tree (DT), and linear regression (LR). A model is trained and tested for each input variable missing values in >5% of rows (oxygen, nitrate, nitrite, silicate, phosphate, apparent oxygen utilization (AOU), pH at standard temperature and pressure (STP), in situ pH, neutral density (γ), and alkalinity) needed for carbon dioxide modeling.
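The combine-and-filter step can be sketched in pandas; the file names below are placeholders, not the actual GLODAPv2.2020 file names:

```python
import pandas as pd

# Hypothetical regional file names; the real GLODAPv2.2020 CSVs are named differently.
region_files = ["atlantic.csv", "pacific.csv", "indian.csv"]

def combine_regions(paths):
    """Read each regional CSV and concatenate into a single DataFrame."""
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True)
```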
The best-performing model for each variable column is saved (via Python pickling) and applied to fill that column's missing values, completing the input variables for carbon dioxide modeling. The filled dataset is then used to train and test DT, RF, and LR regression models with carbon dioxide concentration as the target value. The GitHub project repository is connected to the Heroku platform as a service (PaaS) to create a user-interactive dashboard of the dataset with Jupyter Voila notebooks.
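A minimal sketch of the pickle-and-fill step; the function and column names are illustrative, not the project's actual code:

```python
import pickle

def save_model(model, path):
    """Serialize a fitted model to disk with Python pickling."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def fill_missing(df, target, features, model_path):
    """Load a pickled model and fill null rows of `target` with its predictions."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    mask = df[target].isna()
    if mask.any():
        df.loc[mask, target] = model.predict(df.loc[mask, features])
    return df
```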
Complete notebooks, data files, and interactive dashboard can be found using the buttons to the right.
Phase I notebooks contain exploratory analysis and cleaning of the dataset to remove quality control columns (from the dataset owners) and mostly empty columns. The dataset is split into smaller CSVs, and a notebook is provided for combining the files into a single dataset.
Phase II notebooks contain ML regression model testing of random forest, decision tree, and linear regression, as well as the building of user-interactive dashboards. The models are used to fill gaps in the dataset and to achieve the project goal of filling total carbon dioxide concentrations.
Phase III notebooks split the dataset into ocean regions in order to make final conclusions about the dataset. Attempts were made to project future CO2 concentrations using Facebook Prophet, but data granularity was too limited for the API. The final dashboard was formatted and published via Heroku.
Carbon dioxide in the atmosphere is not absorbed evenly across all ocean regions and depths, as shown in work by Becker et al. and Broullón et al. Overall findings of Broullón et al. included that surface total carbon dioxide (TCO2) decreases from high to low latitudes, that the Indian and Pacific oceans have lower concentrations of TCO2 at high latitudes than the Atlantic, and that TCO2 increases with depth and plateaus at certain ocean depths[4]. In addition, large concentrations of surface TCO2 were identified in the North Atlantic, Nordic Sea, and Mediterranean Sea. These findings align with those of Becker et al., who found that most (northern sea) regions were net sinks for CO2[5]. However, Becker et al. also found that the Southern North Sea and Baltic Sea emit CO2 to the atmosphere[5].
The project data source is the Global Ocean Data Analysis Project (GLODAPv2.2020), a collection of bottle data from scientific cruises (Figure 1) spanning 1972-2019[6]. The data consists of several CSV files totaling one million rows by 104 columns (740 MB). Complete documentation can be found using the link below.
Columns that are mostly null (>90% of rows) are removed from the dataset. The definitions of the cruise IDs are not clear in the dataset documentation. It was determined that the cruise IDs and locations do not repeat the same areas from year to year. Therefore, cruise ID should not be used in the ML models as a categorical input variable, as the locations are not repeated over the lifetime of the dataset.
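The removal of mostly null columns can be sketched as follows (column names are hypothetical):

```python
import pandas as pd

def drop_mostly_null(df, threshold=0.90):
    """Drop columns whose fraction of null entries exceeds `threshold`."""
    null_frac = df.isna().mean()  # per-column fraction of nulls
    return df.loc[:, null_frac <= threshold]
```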
Figure 2. Cruise locations by year. Each color represents a scientific cruise ID. (Bottom) Cruises video link; each color represents a year.
Phase II notebooks contain ML regression model testing of random forest (RF), decision tree (DT), and linear regression (LR), as well as the building of user-interactive dashboards. The models are used to fill gaps in the dataset (Figure 3) and to achieve the project goal of filling total carbon dioxide concentrations. Missing input variables [O2, AOU, nitrate, pH in situ and at STP, silicate, neutral density (γ), phosphate, alkalinity] are filled using individual random forest regression models (Figure 4). The models are saved using Python pickling and applied to the rows missing data.
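Fitting one random forest per gap-filled variable might look like the following sketch; the column names are placeholders for the dataset's actual columns:

```python
from sklearn.ensemble import RandomForestRegressor

def fit_fill_models(df, fill_targets, features, n_estimators=100):
    """Fit one RF regressor per variable to be gap-filled,
    training each on the rows where that variable is present."""
    models = {}
    for target in fill_targets:
        complete = df.dropna(subset=features + [target])
        model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
        model.fit(complete[features], complete[target])
        models[target] = model
    return models
```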
Figure 3. Null or unavailable values are shown in white for each column variable of the GLODAP dataset. Rows with null values in columns missing <5% of data are removed from the dataset.
Figure 4. Performance of LR, RF, and DT models on each variable column.
After these input variables were filled, the dataset was used to train and test models for total carbon dioxide concentration (tco2). A random forest regression model performed best, with a coefficient of determination and explained variance of approximately 0.99 for all sets. The mean square errors of the train, test, and validation sets were 9.84, 77.0, and 68.7, respectively. No model tuning was performed after fitting. The model performance on the shuffled train, test, and validation sets [0.6/0.2/0.2 split] is shown in Figure 5.
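The shuffled 0.6/0.2/0.2 split can be reproduced with two calls to sklearn's train_test_split; this is a sketch, not the project's exact code:

```python
from sklearn.model_selection import train_test_split

def split_60_20_20(X, y, seed=0):
    """Shuffle and split into 60% train, 20% test, 20% validation."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.4, random_state=seed, shuffle=True)
    X_test, X_val, y_test, y_val = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=seed)
    return (X_train, y_train), (X_test, y_test), (X_val, y_val)
```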
Figure 5. Machine learning random forest regression model performance on Train, Test, and Validation sets (R2~0.99 and explained variance ~0.99 for all sets).
Figure 6. Views of feature importance of ML random forest regression model for total carbon dioxide concentration.
Feature Importance
The most important features of the RF regression ML model for carbon dioxide concentration (Figure 6) were phosphate, total alkalinity, apparent oxygen utilization (AOU), pH at STP, and silicate.
Prior to model training, it was expected that time of year and geolocation would be the most important features. However, γ is a common oceanographic value calculated from salinity, temperature, pressure, and geolocation[7], so location information is partially encoded in that input.
The model was retrained without the filled input variables [O2, AOU, nitrate, pH in situ and at STP, silicate, neutral density (γ), phosphate, alkalinity] to ensure no bias was introduced to the data; the feature importance results were the same.
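Extracting and ranking the feature importances, as visualized in Figure 6, can be sketched as follows (the feature names here are illustrative):

```python
import pandas as pd

def ranked_importances(model, feature_names):
    """Return a fitted RF model's feature importances, sorted descending."""
    return pd.Series(model.feature_importances_,
                     index=feature_names).sort_values(ascending=False)
```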
So why is phosphate such an important feature in the model?
The relationship between ocean nutrient composition and atmospheric carbon dioxide is the subject of several theories in the scientific community[8]. The Redfield Ratio represents the nearly constant C:N:P ratio [106:16:1] in plankton organisms, which matches ocean water composition[8]. This relationship, and the possible global warming feedback loops that impact the phosphorus cycle, could explain the feature importance.
Phosphorus is deposited into the ocean by surface runoff, which increases with the strength and frequency of storms[9] driven by the 'greenhouse' effect. In addition, changes in upwelling and currents from increased ocean temperatures may change the patterns of absorption of phosphorus in the water column[10].
After the ML regression model was trained and tested, the missing tco2 rows in the GLODAPv2.2020 dataset were filled. As the complete dataset is too large for efficient user interaction, it was trimmed to every 25th row for use with the Heroku platform as a service[11]. The data and notebooks were committed to the project GitHub repository, and the dashboard notebook was deployed to the Heroku project with automatic deploys from GitHub. The first visual allows users to interact with the shortened data by selecting a year and variable. Out of 936 cruises, 33 are left with <100 data points; this data loss is acceptable to make the dashboard respond to users in real time (1.2E5 rows before the trim, 4.9E3 after).
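The every-25th-row trim is a one-liner in pandas:

```python
def trim_for_dashboard(df, step=25):
    """Keep every `step`-th row to shrink the dataset for the hosted dashboard."""
    return df.iloc[::step].reset_index(drop=True)
```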
The second visual shows the change in a chosen variable between a start and end year. The complete dataset is first grouped by latitude and longitude and averaged per year. The latitude and longitude of the ending year in the query are then used with sklearn's BallTree to find (by haversine distance) the closest cruise geolocation in the starting year. The difference in the chosen variable is displayed (end year value - start year value). One shortcoming of this method is that the depth of the averaged samples is not taken into account.
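The BallTree lookup can be sketched as follows; note that sklearn's haversine metric expects [latitude, longitude] pairs in radians:

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0

def nearest_start_locations(start_latlon_deg, end_latlon_deg):
    """For each end-year location, find the nearest start-year location.
    Inputs are (n, 2) arrays of [lat, lon] in degrees; returns distances
    in km and indices into the start-year array."""
    tree = BallTree(np.radians(start_latlon_deg), metric="haversine")
    dist, idx = tree.query(np.radians(end_latlon_deg), k=1)
    return dist[:, 0] * EARTH_RADIUS_KM, idx[:, 0]
```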
Ocean Region Definition
In order to evaluate the filled dataset and make final conclusions, the data needed to be divided into representative groups. The work of Beaulieu et al. was used to approximate rectangular polygons for ocean regions[12]. These polygons were used with the geopandas sjoin function to merge the cruise coordinate locations to an ocean region. The grouped ocean regions were used for later statistical analysis and visualization.
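Because the approximated regions are axis-aligned rectangles, the effect of the geopandas sjoin can be illustrated with a plain bounding-box test; the region bounds below are hypothetical, not the ones derived from Beaulieu et al.:

```python
# Hypothetical region bounds as (lat_min, lat_max, lon_min, lon_max).
REGIONS = {
    "North Atlantic": (30, 65, -80, 0),
    "Southern Ocean Pacific": (-70, -45, -180, -70),
}

def assign_region(lat, lon, regions=REGIONS):
    """Return the first region whose bounding box contains the point,
    mimicking a geopandas sjoin for axis-aligned rectangular polygons."""
    for name, (lat0, lat1, lon0, lon1) in regions.items():
        if lat0 <= lat <= lat1 and lon0 <= lon <= lon1:
            return name
    return None
```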
Figure 7. Ocean biome regions[12].
Figure 8. Rectangular polygon approximations of the ocean biome regions.
Facebook Prophet
After the ocean regions were applied to the dataset, Facebook Prophet was tested for predicting carbon dioxide concentration values one year into the future. The Prophet API was unable to project with low uncertainty because it expects repeated, near-daily measurements[13], while the dataset consists of cruise measurements concentrated at certain times of the year.
Visualizations and Dashboard Update
A 3D plot with latitude, longitude, and depth was added to the dashboard. Users can select a variable for the color hue of the points. Box plots were added with the ability to group data by different region types and the change over 1 or 4 years.
Figure 9. Visualization examples from Heroku Dashboard; (1) 3D depth plot of CO2 concentration, (2) Percent change of CO2 concentration from 1993-2001, (3) Box plot of the percent change in CO2 concentration every four years grouped by latitude range and (4) major ocean region.
Statistical Analysis Results
Visual conclusions about the CO2 concentration changes over time, region, and depth could not be made. A statistical analysis was therefore performed using independent two-sample t-tests. The data was grouped by region and season (winter, fall, etc.) and iterated through the available years for each group.
The later year in each iteration was compared to the earlier year using two-sided and one-sided t-tests. A Levene test for equal variance was conducted first in order to select the correct type of t-test.
The normalized counts are presented as final results because the ocean regions are not sampled equally. The null hypotheses for the two-tailed and one-tailed t-tests were as follows, respectively.
H0: Mean of Start Year = Mean of End Year | Two-tailed
H0: Mean of Start Year >= Mean of End Year | One-tailed
For the variable t-tests, rejection of the null for the two-tailed test would mean there was a difference between the average variable values for the years compared. Rejection of the null hypothesis for the one-tailed t-test would mean the average value of the earlier year is likely less than the average value of the later year.
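The Levene-then-t-test procedure can be sketched with scipy; this is a sketch assuming independent samples, with the project's region/season grouping logic omitted:

```python
from scipy import stats

def compare_years(start_vals, end_vals, alpha=0.05):
    """Levene test for equal variance, then two-sided and one-sided
    independent t-tests comparing start-year and end-year samples.
    The one-sided H0 is mean(start) >= mean(end)."""
    equal_var = stats.levene(start_vals, end_vals).pvalue > alpha
    two_sided = stats.ttest_ind(start_vals, end_vals, equal_var=equal_var)
    one_sided = stats.ttest_ind(start_vals, end_vals,
                                equal_var=equal_var, alternative="less")
    return two_sided.pvalue, one_sided.pvalue
```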
Table 1. Normalized count of the (yearly) results to reject or fail to reject the null hypothesis (H0 ) at α=.05 for CO2 concentration and pH.
The use of additional datasets to increase the granularity and frequency of measurements would improve modeling capabilities and analysis. Without repeated measurements on a regular schedule, the ability to project carbon dioxide concentrations is greatly diminished. Ocean region grouping proved challenging, as limited scientific resources exist that establish definitive borders between ocean biomes[12]. While making large-scale climate change conclusions is beyond the scope of this project, several points were identified during the statistical analysis:
• For more than ½ of observations in all ocean regions, the average TCO2 and average pH from year to year are likely not equal.
• More than ½ of observations in the Arabian Sea, Oligotrophic North Atlantic, Bay of Bengal, and the Southern Ocean Pacific rejected the null hypothesis that the average TCO2 concentration of earlier years is greater than or equal to that of later years.
• More than ½ of observations for all ocean regions (except the Arabian Sea) cannot reject the null hypothesis that the average pH of earlier years is higher than that of later years.
[1] H. C. Bittig et al., “An Alternative to Static Climatologies: Robust Estimation of Open Ocean CO2 Variables and Nutrient Concentrations From T, S, and O2 Data Using Bayesian Neural Networks,” Front. Mar. Sci., vol. 5, 2018, doi: 10.3389/fmars.2018.00328.
[2] PMEL Carbon Program, “What is Ocean Acidification?,” NOAA. https://www.pmel.noaa.gov/co2/story/What+is+Ocean+Acidification%3F (accessed Sep. 03, 2020).
[3] “Fort Fisher: NC Aquariums | Kure Beach, NC 28449.” https://www.ncaquariums.com/fort-fisher (accessed Sep. 22, 2020).
[4] D. Broullón et al., “A global monthly climatology of oceanic total dissolved inorganic carbon: a neural network approach,” Earth System Science Data, vol. 12, no. 3, pp. 1725–1743, Aug. 2020.
[5] M. Becker et al., “The northern European shelf as increasing net sink for CO2,” Biogeosciences Discussions, pp. 1–28, Jan. 2020.
[6] A. Olsen et al., “GLODAPv2.2020: A data product of internally consistent ocean biogeochemical observations,” p. 1, Jun. 2020.
[7] L. Talley, “SIO 210 Talley Topic 2: Properties of seawater,” University of California San Diego, 2000. http://sam.ucsd.edu/sio210/lect_2/lecture_2.html (accessed Oct. 13, 2020).
[8] A. Paytan and K. McLaughlin, “The Oceanic Phosphorus Cycle,” Chem. Rev., vol. 107, no. 2, pp. 563–576, Feb. 2007, doi: 10.1021/cr0503613.
[9] J. Watson, T. M. Lenton, and B. J. W. Mills, “Ocean deoxygenation, the global phosphorus cycle and the possibility of human-caused large-scale ocean anoxia,” Philos Trans A Math Phys Eng Sci, vol. 375, no. 2102, Sep. 2017, doi: 10.1098/rsta.2016.0318.
[10] D. M. Sigman and E. A. Boyle, “Glacial/interglacial variations in atmospheric carbon dioxide,” Nature, vol. 407, no. 6806, p. 859, Oct. 2000, doi: 10.1038/35038000.
[11] P. D. Kazarinoff, “Deploy a Jupyter Notebook Online with Voila and Heroku,” Python for Undergraduate Engineers, Apr. 08, 2020. ./deploy-jupyter-notebook-voila-heroku.html (accessed Oct. 13, 2020).
[12] C. Beaulieu et al., “Factors challenging our ability to detect long-term trends in ocean chlorophyll,” BIOGEOSCIENCES, vol. 10, no. 4, pp. 2711–2724, 2013, doi: 10.5194/bg-10-2711-2013.
[13] Facebook Open Source, “Quick Start,” Prophet Quick Start. http://facebook.github.io/prophet/docs/quick_start.html (accessed Nov. 08, 2020).