Data Exploration

What Our Data Is, Where It's From, and What It Says

The Community Earth System Model (CESM)

Climate modeling can provide information on potential future climate and weather conditions under greenhouse gas warming scenarios. While Earth system models are simplified representations of the Earth system and their future climate projections are likely imperfect, models provide valuable information to the public, government, and other researchers that can be used to guide adaptation and mitigation strategies under a changing climate (Mycoo et al. 2022). Models produce an evolution of future temperature anomalies in response to increased CO2 and other factors that have arisen since the Industrial Revolution, as well as natural forcing agents (Xie et al. 2010). Differences in the initial conditions of individual model ensemble members can produce different evolutions of climate variability and weather on decadal timescales. Projected climate states are highly sensitive to small differences in initial conditions, similar to the dependence seen in synoptic weather prediction (Lorenz 1963). Different initial ocean-atmosphere states can produce very different decadal SST patterns in the future, with implications for climate projections under global warming over the 21st century (Abraham et al. 2022).

The Community Earth System Model (CESM) is a numerical model that includes atmospheric, ocean, ice, land surface, and carbon cycle components and can be used to simulate past climate or to produce climate projections (Hurrell et al. 2013). The CESM can be used to analyze projections on short-term (e.g., weeks to years) and longer (e.g., decadal to century) timescales to assess global trends and global warming hiatus periods, and to conduct statistical evaluations of specified meteorological, oceanic, and land surface components (Kay et al. 2015). The newest version, CESM2, has a grid resolution of 1.25° x 0.94° (Danabasoglu et al. 2020). CESM2 improvements relative to the previous version include a better historical simulation and better representation of global oscillations, circulations, and teleconnection patterns (Rodgers et al. 2021); CESM2 has also been shown to represent these oscillations better than CESM1 (Danabasoglu et al. 2020). Shared Socioeconomic Pathways (SSPs) incorporate different assumptions about economic activity and mitigation and adaptation strategies to model greenhouse gas (GHG) emissions (Riahi et al. 2017; Mycoo et al. 2022). Thus, climate model ensembles with different initial conditions allow each member to follow a different future evolution of decadal climate variability. While future climate predictions are complicated by decadal variability in the climate system (Diffenbaugh and Barnes 2023), ensembles provide opportunities to characterize this uncertainty.

Even though the CESM is useful, it still has limitations. For instance, modeling errors can introduce biases into objective analyses of predicted climate patterns. One study shows that model mean-state biases lead to errors in variability patterns that reduce the validity of future projections (Richter and Doi 2019). Models like CESM2 have some biases, with more warming toward the eastern Pacific than in the observed record, and most models have not been able to replicate the recent La Niña-like warming trend in observations (Yang et al. 2018). This warming preference can greatly impact the simulated weather in those locations.
Further, the relatively coarse model resolution gives average spatial conditions over each grid box, which may not be truly reflective of reality and can lead to an inadequate representation of the impacts of future climate change on islands (Mycoo et al. 2022). In turn, this can lead to misguided climate action strategies for vulnerable communities in the region.

Data Selection And Chosen Variables

The CESM2 produces a vast quantity of high-quality, usable data that lacks NaN values by default, along with clear documentation of each variable's role in the overall atmospheric simulation. However, the vastness of that data was also a major challenge when choosing variables to evaluate against our questions and hypotheses, given the sheer volume of data required to simulate historical, current, and future conditions for something as complex as Earth's atmosphere. Per the CESM2 website, the model outputs a total of 3,208 variables, each with its own function (Community Earth System Model, 2020). Our task was to select variables that would answer our questions while remaining manageable in size. To address our hypotheses, we had to choose variables that would produce a reasonable estimation of the climate over as small a range as possible, given the limitations of an average computer in processing and chunking large-scale data. We also needed a date range that would let us examine our hypotheses about the climate of Polynesia without producing similar issues. Our data therefore covers the range 1970 to 2065, and we selected a total of three variables for the area: Precipitation (total convective and large-scale precipitation), Surface Temperature (radiative), and Surface Wind (horizontal wind speed averaged at the surface). We chose the date range of 1970 to 2065 to obtain a reasonable estimate of these effects over historical, current, and future periods, while also limiting the data volume for processing reasons, as stated earlier. These variables allow us to see and simulate changes in temperature, precipitation, and wind speed. Our questions are concerned primarily with the effects of the El Niño Southern Oscillation (ENSO) and climate change, and with how current efforts to halt climate change could affect the region in the future, if they do so at all. As noted in our introduction, the ENSO cycle primarily affects the trade winds and Sea Surface Temperatures (SSTs), which in turn alter the global climate, most notably precipitation and rainfall in the area. Therefore, our chosen variables will showcase ENSO in action, allow us to address hypotheses regarding its effects on the climate, and show how that climate has changed over time.

Data Preparation (Overview of Code)

The CESM2 data for temperature, surface wind, and precipitation is downloaded from the Climate Data Gateway at NCAR under the Earth System Grid. The first step in obtaining the data is to access this free website; the only requirement is to sign up and state the intended purpose of the data usage. The CESM2 data is made openly available to everyone, which helps increase accessibility to weather data and predictions. The data is downloaded via a Python script from Group 1231, using Ensemble Members 1-10 for each of the three variables.

The next step is to import the required packages in order to read the data. Running the Python script in the Jupyter notebook opens the data and places it in one's workspace. A function is created to read each file in the Python script and save that specific ensemble member into a folder named after the variable. This step is repeated in a for loop over the number of ensemble members to be analyzed. It is worth noting that reading the dataset in the Python script rearranged the time components, so decode_cf and cf_units.num2date are used to convert the time axis to datetimes for each ensemble member before it is exported as a NetCDF file. Additionally, the data is cropped from 30°S to 30°N, because the tropics are the region of interest for this study and contain the ocean basins and their associated Pacific islands.
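A minimal sketch of that per-member preprocessing step is shown below. It uses xarray's built-in time decoding rather than cf_units, and the file names, variable ID, and folder layout are illustrative assumptions rather than the exact ones from our script.

import xarray as xr

def preprocess_member(in_path, out_path, var_name="TS"):
    # decode_cf/use_cftime convert the raw CESM2 time axis into datetime objects
    ds = xr.open_dataset(in_path, decode_cf=True, use_cftime=True)
    # Keep only the tropics (30°S-30°N), the region of interest for this study
    ds = ds.sel(lat=slice(-30, 30))
    # Export the cropped ensemble member as its own NetCDF file
    ds[[var_name]].to_netcdf(out_path)

# Repeat for ensemble members 1-10 (hypothetical file naming)
for member in range(1, 11):
    preprocess_member(f"TS_member_{member:02d}.nc", f"TS_cropped_{member:02d}.nc")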

Then, the data is read in again by a function whose purpose is to concatenate all the ensemble member files and apply chunking, since the dataset is very large. Chunking makes reading the data and applying methods faster; however, the data must be loaded into memory before figures can be drawn. The CESM2 output, with post-processing from NCAR, has little to no missing, duplicated, or incorrect values. Even so, to make sure NaNs are not counted, the skipna option in xarray is used, which skips missing values (as marked by NaN).
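One way this concatenation-with-chunking step might look is sketched below, assuming the cropped per-member files from the previous step; the file pattern, chunk size, and variable name are placeholders.

import glob
import xarray as xr

# Open all cropped ensemble-member files lazily, stacking them along a new
# "ensemble" dimension; the chunk size is an assumption to tune for memory
files = sorted(glob.glob("TS_cropped_*.nc"))
ds = xr.open_mfdataset(files, combine="nested", concat_dim="ensemble",
                       chunks={"time": 120})

# skipna=True ignores any NaN values when averaging across members
ens_mean = ds["TS"].mean(dim="ensemble", skipna=True)
ens_mean.load()  # the data must be loaded before plotting figures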


Acquiring and Processing The Data

The data was taken from the CESM2 model for each variable individually. The data, in its raw format, was a set of 10 files per variable in NetCDF format, otherwise known as .nc files. While this raw data was extremely clean, lacking NaN values, and consistent (the only change from file to file was the variable in question), the format was not directly workable. Our main challenge in exploring this data was to convert it into a readable format on which pandas functions and seaborn plots could be used to produce meaningful visualizations. The first step was to import all of the required libraries and then create a series of folders in the Colab environment into which the raw data could be loaded. Once that was done and the data was uploaded, each folder contained 10 files of raw data that had to be concatenated and transformed into data frames. The following images are printouts of the first .nc file for each variable, printed using the netCDF library in Python, showing the given variables along with the 3-dimensional data points for every entry ('Before' Snippet):
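For reference, a printout like the ones shown can be produced along the following lines (the path is a placeholder):

from netCDF4 import Dataset

# Open the first raw file for one variable and print its header information
nc = Dataset("raw_data/PRECT/PRECT_member_01.nc")
print(nc)                    # dimensions, global attributes, and variable list
print(nc.variables.keys())   # the variable names contained in the file
nc.close()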

The first step was to concatenate all 10 files for each variable into a single NetCDF file. Each of the initial 10 files covered a given date range, such as 1970-1979, and they had to be put together to be of any use in later visualizations. The first function after the creation of the folders that store the raw data serves this purpose by opening each file, decoding the time data, and converting the 'time' (date) coordinate to a datetime format that can be used later when concatenating the variables together. It also crops the data to our chosen latitude and longitude ranges for the region of Polynesia, as all other ranges are superfluous to our investigation.

Next, the data had to be fully concatenated into usable datasets for each individual variable. However, there was a major challenge in producing code that could fully process the data within the constraints of a single Colab session, which has limits on compute power and RAM. Potential processing issues were avoided using a 'chunking' method. 'Chunking' the data into smaller portions allowed it to be read in pieces that Colab could handle, while also enabling some parallel processing so that the data would process faster. A visualization of the dictionary format used to hold and concatenate the data is shown below, along with a visualization of the chunking as read through xr.concat:
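The dictionary-plus-chunking pattern might look roughly like this; the directory layout, chunk size, and CESM2 variable IDs (PRECT, TS, WSPDSRFAV) are assumptions for illustration.

import glob
import xarray as xr

chunks = {"time": 120}  # chunk size chosen to fit Colab's RAM limits
var_patterns = {
    "PRECT": "raw_data/PRECT/*.nc",       # precipitation
    "TS": "raw_data/TS/*.nc",             # surface temperature
    "WSPDSRFAV": "raw_data/WIND/*.nc",    # surface wind speed
}

combined = {}
for var, pattern in var_patterns.items():
    parts = [xr.open_dataset(p, chunks=chunks) for p in sorted(glob.glob(pattern))]
    # Lazy, dask-backed concatenation along time keeps memory use manageable
    combined[var] = xr.concat(parts, dim="time")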

Next, the spatial data had to be reduced to a 2-dimensional format that was workable on our systems. This was accomplished by taking the mean of each of the 3 variables over the latitude, longitude, and ensemble dimensions. The 'ensemble' dimension was created by the functions above as a way to average the spatial data and remove the extra dimensions without compromising the data's informational value. Finally, each of the chunked and processed datasets was converted into a DataFrame, allowing libraries such as pandas, matplotlib, and seaborn to be used to visualize and analyze the data's tendencies and behaviors:
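That reduction step could look like the following sketch for one variable, reusing the chunked datasets from above (the dimension and variable names are the assumed ones):

# Average over latitude, longitude, and (if present) the ensemble dimension to get
# a single time series, then convert it to a pandas DataFrame for plotting
reduce_dims = [d for d in ("lat", "lon", "ensemble") if d in combined["TS"].dims]
ts_series = combined["TS"]["TS"].mean(dim=reduce_dims, skipna=True)
ts_df = ts_series.to_dataframe(name="TS").reset_index()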

This produced a set of three datasets, one per variable. These were combined with each other as needed to produce datasets that could be used to create graphs, while still keeping the original data intact. Here is the final snippet of what the processed data looks like, as shown in this 'Weather' dataset of all the variables ('After' Snippet):
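A hypothetical version of that merge, joining the three per-variable DataFrames on their shared time column and adding the 3-letter month label used later, is sketched here (the DataFrame and column names are assumptions):

import pandas as pd

# ts_df, precip_df, and wind_df are the spatially averaged DataFrames from above
weather = (ts_df.merge(precip_df, on="time", how="inner")
                .merge(wind_df, on="time", how="inner"))

# Month label in 3-letter format with a trailing period, e.g. "Jan."
weather["month"] = pd.to_datetime(weather["time"]).dt.strftime("%b") + "."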

Data Shape and Tendency

Now that the data is processed, some initial exploration of the shape of the data, as well as its central tendencies, is in order. First, a check of the NaN value count. This should produce a value of 0, as there were no NaN values in the original data, but such values can occur during transformation if the data is not carefully processed:
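The check itself is a one-liner per DataFrame, something like the following (using the DataFrame names from above):

# Count NaN values in each processed DataFrame; every count should be 0
for name, df in {"temperature": ts_df, "precipitation": precip_df, "wind": wind_df}.items():
    print(name, int(df.isna().sum().sum()))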

We can see that all 3 base datasets have no NaN values, meaning that all the data is processed and usable.


Next, we check the shape of the datasets to see the differing dimensions of each set, if they differ at all:
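Using the same DataFrame names as above, for example:

# Print the (rows, columns) shape of each processed DataFrame
for name, df in {"temperature": ts_df, "precipitation": precip_df, "wind": wind_df}.items():
    print(name, df.shape)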

We see that each dataset is exactly the same size. This was expected, not only because the date ranges pulled for each variable were of equal length and would thus produce equal amounts of data, but also because the concatenating function would refuse to work with uneven or unequal datasets. Each variable has 1020 entries, with 2 columns each. If combined into a single file, there would still be 1020 rows, but there would be 4 columns in total, for time, temperature, precipitation, and wind speed, plus an additional column for the month. This shows that the data is evenly processed, but its behavior requires further insight.

The next step is to use .dtypes to showcase the different types of data, and identify numerical vs. categorical data:
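On the combined 'Weather' DataFrame this is simply:

print(weather.dtypes)
# Expected: float32 for the three weather variables, datetime64[ns] for time,
# and object (string) for the month column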

Each of the three main variables is a float32, as expected given the long decimal entries in each row. This indicates numerical data, so all of our significant data, and areas of interest, are numeric values. The time column is a datetime64 value, as expected thanks to our earlier transformation during the initial loading and creation of the datasets. The final area of interest is the dtype: object entry, which refers to the month column, a string entry for each month in a 3-letter format with a period at the end. This is categorical data, and we use it in several of our later exploratory graphs to group each variable by month and examine the overall differences in behavior for those periods. So, we have a total of one datetime variable, three numeric variables, and one categorical variable.


Next, an examination of the central tendencies, which will help to highlight the behaviors of each variable, as well as showcase the spread of the datasets:
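The summary statistics come from pandas' .describe(), for example (column names as assumed above):

# Summary statistics (count, mean, std, quartiles, min/max) for each variable
print(weather[["TS", "PRECT", "WSPDSRFAV"]].describe())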

The .describe outputs for each dataset are promising. The minimal spread indicates a lack of extreme outliers, as shown by the relatively small gap between the minimum and maximum values for each set, which is explored further below by calculating the range and IQR. The even spread of the quartiles also indicates a lack of outlier data. The mean for temperature is the largest, which is to be expected given that it is reported in degrees Celsius, compared to precipitation and wind speed, which are reported in millimeters per day (mm/day) and meters per second (m/s). The low standard deviations, especially that of precipitation, indicate that the datasets are largely centered around the mean, and are consistent and precise. This is likely due to the extremely precise computations involved in this level of atmospheric simulation, which leaves little room for superfluous errors.

Next, the Range and IQR:
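These can be computed directly from the extrema and quantiles (column names as assumed above):

# Range and interquartile range for each numeric column
for col in ["TS", "PRECT", "WSPDSRFAV"]:
    rng = weather[col].max() - weather[col].min()
    iqr = weather[col].quantile(0.75) - weather[col].quantile(0.25)
    print(f"{col}: range={rng:.2f}, IQR={iqr:.2f}")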

The largest range again belongs to temperature, at 3.96, and the smallest to precipitation, at 0.4; the same ordering holds for the IQR. These values provide another benchmark for the lack of outliers, which is crucial to our evaluation of the predictive data, as the general 'trend' of the data is what matters most when considering long-term evaluations of current and future climate change reduction efforts, or potential changes in the ENSO cycle.

Finally, we want to consider a measure of similarity, i.e., how correlated, or 'similar', one variable is to another. By computing Pearson correlation coefficients between each pair of our three chosen variables, we can evaluate potential relationships to examine later in our visualizations:
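One way to run these pairwise tests, assuming the 'weather' DataFrame and column names used above:

from itertools import combinations
from scipy.stats import pearsonr

# Pairwise Pearson correlation coefficient and p-value for the three variables
for a, b in combinations(["TS", "PRECT", "WSPDSRFAV"], 2):
    r, p = pearsonr(weather[a], weather[b])
    print(f"{a} vs {b}: r = {r:.2f}, p = {p:.3f}")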

Based on the p-values of these tests, none of the three variables has a statistically significant correlation with another. A correlation can only be considered significant here if its p-value is less than 0.05, i.e., at the standard 95% confidence level. Additionally, only the correlation between wind and precipitation would be strong enough to matter even if it were significant: Pearson's coefficient ranges from -1 to 1 according to the strength and direction of the relationship, and the value of 0.7 for wind and precipitation is the only one indicating a strong relationship. These results could indicate a lack of relationship, or at least a lack of a 'linear' relationship, suggesting that modeling the data should likely take forms beyond a standard linear regression model. Given the complexity of the model from which these values were taken, it is also entirely possible that a single Pearson test does not accurately capture the relationships between the variables.

As for other preparation steps, such as data integration, reduction, and transformation, these were handled during the initial loading and formatting of our data. The data was already clean and well-defined; it simply required some work. It was reduced and transformed into a set of dataframes limited to the Polynesian region and aligned with one another through the concatenation function.

Now, to visualize our data.

Visualizations

While creating our visualizations, we formed some basic hypotheses about what relationships and values we would uncover as we went. They are by no means the formal modeling of our 10 questions, but rather some inferred estimates of how the data would behave.

First, we examined the values for each variable when grouped and indexed by month, using histogram cat plots, to showcase the changes month-by-month over the whole period from 1970 to 2065:
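A seaborn sketch of one such month-grouped plot (for precipitation) might look like the following; it assumes the 'weather' DataFrame and column names introduced earlier, and the exact styling differs from our figures.

import seaborn as sns
import matplotlib.pyplot as plt

month_order = ["Jan.", "Feb.", "Mar.", "Apr.", "May.", "Jun.",
               "Jul.", "Aug.", "Sep.", "Oct.", "Nov.", "Dec."]

# Bar-style catplot of precipitation grouped by month
sns.catplot(data=weather, x="month", y="PRECT", kind="bar", order=month_order)
plt.ylabel("Precipitation (mm/day)")
plt.xticks(rotation=45)
plt.show()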

Hypothesis #1 states that there is higher rainfall in the winter months, but warmer temperatures in the summer months. This hypothesis was correct, because there is a reduced amount of precipitation from June to August and colder temperatures from December to February. The histograms are also color coded: in the precipitation histogram, blue represents the months with more rainfall while brown represents the drier months, and the temperature maximum and minimum histograms show the colder months in blue and the warmer months in red. One interesting feature is that the temperature maximum and minimum tend to follow each other. However, the coldest months tend to have the most precipitation, which matches up with potential atmospheric river activity. A step beyond this dataset would be to find the years in which El Niño, an interannual mode of climate variability, is strongest, because those seasons tend to amplify atmospheric rivers. Another interesting feature is that July tends to have the warmest temperature maximum, but August tends to have the warmest temperature minimum. We can also infer that temperature gradients during July are the largest of all the months. Furthermore, the highest amount of precipitation in the dataset tends to occur in November.

Next, we created box plots of each variable to examine the data's spread by month:
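The precipitation panel, with the logarithmic scale discussed below, could be produced roughly as follows (same assumed DataFrame and column names):

import seaborn as sns
import matplotlib.pyplot as plt

month_order = ["Jan.", "Feb.", "Mar.", "Apr.", "May.", "Jun.",
               "Jul.", "Aug.", "Sep.", "Oct.", "Nov.", "Dec."]

ax = sns.boxplot(data=weather, x="month", y="PRECT", order=month_order)
ax.set_yscale("log")   # log scale handles the wide range of rain amounts
ax.set_ylabel("Precipitation (mm/day, log scale)")
plt.xticks(rotation=45)
plt.show()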

Hypothesis #2 states that there is a wider spread of the distribution in the winter months than in the summer months, due to colder temperatures and more rainfall. This hypothesis is mostly correct, because the spread for precipitation and temperature minimum is greater in the winter months. However, the temperature maximum tends to have the same spread in every month and looks almost uniform. Box plots were important here to show the spread of the distribution in each month and to see whether some months are more skewed than others. It is worth noting that there are more outliers in the summer months for precipitation and temperature minimum, but more outliers in the winter months for temperature maximum. This spread can be strongly influenced by each month's temperature gradients. Moreover, precipitation was plotted on a logarithmic scale to normalize the data, since rain amounts can vary greatly from day to day and month to month. One interesting feature is that July and August do not have an apparent box because it is rare for rain to occur during these summer months; all plotted points are therefore automatically treated as outliers, since the box collapses to zero. Evaluating the density of the data within each month is the next step of the analysis.

For further analysis of Hypothesis #2 (a wider spread of the distribution in the winter months than in the summer months, due to colder temperatures and more rainfall), we turned to violin plots, which closely mirror the box plots. One assumption from the box plot was that the temperature maximum has the same, almost uniform spread in every month; the violin plot supports this, as the density spread for each month looks the same. The precipitation and temperature minimums, by contrast, are far more spread out and skewed. Even with the logarithmic scaling applied to precipitation, the data still appears skewed from month to month. This helps highlight that there is more density in the distribution in March and November, even though the box plot showed a greater spread. One interesting feature is that the temperature minimum has a wider density spread in the colder winter months. This could be because rain is usually associated with more cloud cover, and cloud cover can block more of the sunlight needed to warm the atmosphere and the Earth's surface. Evaluating the relationships between variables can help us infer how they interact with each other, especially across the different months.

Next, we created a scatterplot to evaluate the relationship between temperature maximums and minimums, colored by month:
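A sketch of such a month-colored scatterplot is below; the TMAX/TMIN column names are placeholders for whatever the temperature maximum and minimum columns are called in the 'weather' DataFrame.

import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(data=weather, x="TMAX", y="TMIN", hue="month")
plt.xlabel("Temperature maximum (°C)")
plt.ylabel("Temperature minimum (°C)")
plt.show()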

Hypothesis #3 states that there is a positive relationship between temperature maximums and minimums, with the winter months having lower temperatures for both and the summer months having higher temperatures overall. This hypothesis was correct, because warmer temperature maximums tend to come with warmer temperature minimums. Furthermore, the scatterplot shows that the winter months tend to have lower temperature maxima and minima than the summer months. This suggests that the daily temperature gradient is not the same in every month. In tropical locations the relationship would be expected to be less pronounced, because such locations tend to be warm all year long; they also tend to be more humid, which keeps temperatures warm overnight and strongly influences the temperature minimum. Temperature minimums tend to occur at night, when it is usually cooler than during the daytime. Overall, a positive relationship can be identified from the scatterplot. Another way we can examine part of this relationship is to visualize the percentage of precipitation falling in each month:

The hypothesis for this was that the winter months, which would be June-August because of Polynesia's location in the southern hemisphere's tropics, would have a higher percentage of rainfall due to cooler temperatures causing increased condensation and hence higher rainfall rates. This is not shown to be true: nearly all of the months, when turned into sums and then percentages, are almost exactly the same. This could be due to a programming error, but it could also be due to the area's geography. The fact that all of the months are similar, with February and April having the highest percentages and September the lowest, may simply reflect that tropical regions tend to receive large amounts of rainfall year-round, so the monthly differences are negligible. We now return to the relationship between temperature and precipitation, with the goal of applying linear regression techniques as well as a statistical significance test of the correlation through a Pearson correlation analysis:
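The quadratic fit discussed below can be produced with seaborn's regplot by setting order=2; this sketch assumes the 'weather' DataFrame and column names used earlier.

import seaborn as sns
import matplotlib.pyplot as plt

# Second-order polynomial fit between temperature and precipitation
sns.regplot(data=weather, x="TS", y="PRECT", order=2,
            scatter_kws={"s": 10}, line_kws={"color": "red"})
plt.xlabel("Surface temperature (°C)")
plt.ylabel("Precipitation (mm/day)")
plt.show()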

Based on our earlier analysis of the Pearson correlation values, the hypothesis for the regplot of the relationship between temperature and precipitation was that, if the two variables did in fact have a noticeable relationship, the observed behavior would be non-linear. This hypothesis is borne out by the above graph, which displays a parabola, indicative of a quadratic relationship. We can see from the regplot that, despite some outliers at the edges, precipitation is generally higher at the temperature extremes, whether minimum or maximum, and lowest near the median values. This could be because colder weather produces more precipitation, while warmer weather is often associated with higher humidity, summer rainstorms, and monsoon seasons. To further illustrate the relationship, the next set of graphs shows a jointplot for temperature and precipitation, as well as one for wind and precipitation, to see whether there are any similarities between the two:
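A minimal version of those two jointplots (same assumed DataFrame and column names):

import seaborn as sns
import matplotlib.pyplot as plt

# Joint distributions with marginal histograms for the two variable pairs
sns.jointplot(data=weather, x="TS", y="PRECT")
sns.jointplot(data=weather, x="WSPDSRFAV", y="PRECT")
plt.show()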

The hypothesis here was that there would be a relationship similar to what was shown by the regplot for temperature and precipitation, and a positive linear relationship between precipitation and wind speed. Possibly due to the quadratic nature of the temperature/precipitation correlation, there is no visible pattern in the graph for those two variables, although there are noted concentrations around 3.5 mm/day and just under 3.4 mm/day of precipitation, both at temperatures around 26 degrees Celsius. These concentrations could be due to the frequency of certain weather patterns around that temperature, or it may simply be the most common temperature for the area. For wind speed and precipitation, there is in fact a somewhat 'linear' positive relationship, with the data trending upward as wind speed increases. Interestingly, the data for wind/precipitation is more spread out, based on the distributions at the top and sides, than the data for temperature and precipitation. Next, we compare the correlations for all three variables using a correlation matrix plot:
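The matrix plot can be built from pandas' .corr() and seaborn's heatmap, for example (column names assumed as before):

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix for the three variables, drawn as a heat map
corr = weather[["TS", "PRECT", "WSPDSRFAV"]].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()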

The hypothesis for this graph was that there would be no visible correlation between temperature and precipitation due to the quadratic nature of their relationship, which is confirmed by the near-zero values shown. The mild negative correlation between wind and temperature is interesting and makes sense, as high wind speeds often produce wind chill, which lowers the perceived temperature of an area; high-wind days are often colder than days with minimal wind. The high correlation between precipitation and wind speed is also intriguing and makes sense, as high wind often accompanies storms and monsoons, which produce large amounts of precipitation. Finally, the next 3 sets of graphs evaluate how the data changes over time, especially with regard to geography. The following set of line graphs showcases the variable changes over time:

The hypothesis here is that, due to climate change and human activity, temperature and potentially precipitation would show a positive, roughly linear increase over time. This is correct, as both variables show an increase over time. The precipitation increase is potentially caused by a simulated increase in extreme weather patterns, such as more frequent storms, hurricanes, and monsoons. It should be noted that the unusual line in the middle is due to the gap between the historical, current, and future portions of the data. Temperature shows the largest increase overall, going up by 4 degrees Celsius over the course of the graph. Wind shows little increase over time, which makes sense because the baseline wind speed is not affected as strongly by climate change; it could increase due to more frequent storms, which tend to have higher wind speeds, but that would require further investigation. Regardless, this is a good sign for our future work in modeling the effects of the ENSO cycle over time. Finally, a series of global maps made with cartopy from the raw, worldwide data showcases the variables over time:
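One panel of such a map might be drawn along these lines; the file name, variable name, decade slice, and dimension names are placeholders, not the exact ones from our notebook.

import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import xarray as xr

# Decade-mean surface temperature plotted on a global map with coastlines
ds = xr.open_dataset("ts_all.nc")
decade = ds["TS"].sel(time=slice("2056", "2065"))
decade_mean = decade.mean(dim=[d for d in decade.dims if d not in ("lat", "lon")])

ax = plt.axes(projection=ccrs.PlateCarree())
decade_mean.plot(ax=ax, transform=ccrs.PlateCarree(), cmap="coolwarm")
ax.coastlines()
plt.show()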

For our final visualizations, we hypothesized that both temperature and precipitation would increase relative to the past, and that the magnitude of the increase would grow over time. Both were proven correct, with the first part of our hypothesis being especially dramatic. Temperature increased all over the globe, and precipitation showed an increase in the region of Polynesia, visible in the middle of the map near Papua New Guinea. The 4-degree Celsius increase also matches our earlier line graph. Based on the future map, we could expect a 5-6 degree Celsius increase in the worst-case scenarios, with a less dramatic but noticeable increase in precipitation in some areas, and no increase in others, likely because of more frequent droughts under climate change. The increase in surface wind change could be due to the cumulative effects of the ENSO cycle, which has been hypothesized to become more erratic and extreme in future years due to climate change.

Model Implementation

For the Linear Regression and Logistic Regression models, the data had to be normalized using the fit_transform method of sklearn.preprocessing's StandardScaler, similar to our past assignments and classwork. Below is a snapshot of the non-normalized data:

And below is the snapshot of the data once it's been normalized:

Finally, we have a printed snippet of the data after its division into a training set and a testing set, with the predictors (the weather features) indexed separately from the outcome, precipitation, which was not scaled since it is the target being predicted:
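A sketch of the scaling and splitting steps, following the order described above; the feature and outcome column names, test fraction, and random seed are assumptions.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Scale the predictor features; the outcome (precipitation) is left unscaled
features = weather[["TS", "WSPDSRFAV"]]
outcome = weather["PRECT"]

scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    features_scaled, outcome, test_size=0.2, random_state=42)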

Logistic Regression Model

For the logistic regression model, the base dataset was split into a new training and testing pair and indexed by weather features and the outcome. Precipitation in particular was binarized around the mean of the training set, and a new column was created to hold the classification outcome of the model. The following is the output of the training data, and the x and y sets for the training data:
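The binarization and model fit might look roughly like this, reusing the split from the previous step; the threshold follows the description above, while the hyperparameters are illustrative.

from sklearn.linear_model import LogisticRegression

# Binarize precipitation around the training-set mean: 1 = above-average rainfall
threshold = y_train.mean()
y_train_bin = (y_train > threshold).astype(int)
y_test_bin = (y_test > threshold).astype(int)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train_bin)
print("Test accuracy:", clf.score(X_test, y_test_bin))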

Neural Networks

First, a training set, testing set, and validation set were created from the collected data, which were then split into features and outcomes. Then, the data was normalized. Below is a view of the Neural Network data:

The FNN is created using stacks of 'dense' and 'dropout' layers, which are fed the input data flattened into vectors. Below is a snippet of one such flattened 'vector', though it isn't very exciting:

And below is a printout of one model's summary from the function that created it, listing its layers and parameters. There are several models; this is just one to illustrate the output:
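A minimal Keras sketch of a model like this, assuming a small number of flattened input features; the layer sizes, dropout rates, and loss are illustrative, not the exact values we used.

import tensorflow as tf
from tensorflow.keras import layers

def build_fnn(n_features=3):
    # Stacked dense + dropout layers, with a single regression output
    model = tf.keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(32, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

build_fnn().summary()   # prints the layer and parameter listing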

Convolutional Neural Network

The CNN is trained very similarly to an FNN, but with more emphasis on robust dropout layers. Additionally, the data isn't flattened, but is instead reshaped to approximate the original raw 3D form in which we imported it. A snippet of the resulting arrays, as interpreted by Colab, is below:

Below there's also an output for the CNN, to demonstrate it in contrast to the FNN:
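For contrast, a minimal Keras sketch of a CNN over gridded inputs; the input shape (latitude x longitude x channels) and layer sizes are assumptions for illustration.

import tensorflow as tf
from tensorflow.keras import layers

def build_cnn(input_shape=(32, 64, 1)):
    # Convolution + pooling blocks with heavier dropout, then a regression head
    model = tf.keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.3),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.GlobalAveragePooling2D(),
        layers.Dropout(0.3),
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

build_cnn().summary()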

The GitHub repository for this project can be found through this link, or by clicking the icon below: https://github.com/LeoNgamkam/Data-Mining-Weather-Patterns

GitHub