Here I will be adding a new data set to the Climate Change data environment. This data set contains temperature, precipitation and evaporation data mainly for the USA.
The data set is available from Kaggle and described here.
As a first step, I have loaded the data set into a staging table in my Postgres database. Below is a screenshot of this table. It has almost 156 million rows (one row for each day for each weather station for a period of 30 years.
This table is too big for my system, especially if multiple users would access it at once.
Therefore, I created a Materialized View called mv_USAdata_US_2020_2021, which only has the rows for the USA for the years 2020 and 2021. Below you can see the definition of this materialzed view in Postgres. The materialized view has about 16 million rows for two years of USA data.
The workflow below will take the yearly maximum for each weather station and write it to the USAdata_yearly_max table. (I have modified the definition of the materialized view mentioned above to incorporate data from 2018, 2019, 2020 and 2021).
I have also created a database table with the USA weather station information from Kaggle. However, I would like to perform analysis by State, but the state information is missing from the stations input file.
I developed the workflow below to solve this issue. The GeoFile reader takes the cb_2018_us_state_5m.shp shapefile as input. I downloaded this file from https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html (under the 'States') category. This gives me all the USA states as you can see in the output of the Geospatial View node. Now I can read my original stations database table and using the Spatial Join node, obtain the State each weather station lies within. I write this infomation to a new database table, called usa_stations_with_state.
Now I have everything in place to perform analysis by State.
Below you can see an extract from the usa_stations_with_state table.
Note I have been a bit sloppy here, I should have used the Column Renamer node to rename the NAME column to state_name. Mixing uppercase and lowercase column names is not a good practice.
The workflow below calculates, for each year, the maximum temperature recorded grouped by State and writes this to a database table:
Below you can see the resulting table, ordered by maximum temperature in descending order.
The temperature values are in tenths of degrees Centigrade. The top value for the state TN (Tennessee ) and the year 2019 is 1117, which would be 111,7 degrees Centigrade. This value is not correct (I hope). Let's use SQL to investigate a bit further.
I will run the following query against our base table:
select * from "USAdata" where "TMAX" = 1117;
Below is the output. We are interested in the first row, which is for 2019 (as we haven't loaded the data for 2007 which is the second row)
Let's now look up the station USR0000TCOK:
SELECT * FROM usa_stations_with_state WHERE station_id = 'USR0000TCOK'
This tells us the name of the weather station is COKER CREEK TENNESSEE. I asked ChatGPT and it tells me the maximum temperature on that day was around 11 degrees Centigrade. So this looks like a data entry error in our source date (one '1' too many ?)
La temperatura máxima en Coker Creek, Tennessee, el 1 de febrero de 2019 fue de 52 °F (aproximadamente 11 °C).
So, for what follows we have to keep in mind there is very likely some erroneous data in our data source. Unfortunately, this is not unusual. A possible solution would be to determine some 'reasonable real' maximum temperature and use this as a threshold to filter out values that lie beyond it.
I will use 48 degrees as the threshold value for now and adapt the previous workflows accordingly by adding Row Filter nodes. Note that I chose this value for no specific reason. The best practice would be to check with the 'owner' of the original data set to determine the reason for the 'extreme' values and deal with them at the source.
Below you can see the new usa_max_per_state table, 'capped' at 48 degrees Centigrade. California seems to be the hottest state...
The data loaded from the Kaggle data set stops at 2021. On the next page I will try to add more recent data to our system.