Preview of clean dataset used for clustering methods
Link to code: https://github.com/Rokkaan5/5622-PublishedCode/blob/main/data/API-NOAA/NOAA-clust-data-cleaning.py
CDSD: Cooling Degree Days (season-to-date). Running total of monthly cooling degree days through the end of the season
CLDD: Cooling degree days - computed when daily average temperature is more than 65 degrees Fahrenheit/18.3 degrees Celsius
DP01: Number of days with >= 0.01 inch/0.254 mm of precipitation in the year
DP10: Number of days with >= 0.1 inch/2.54 mm of precipitation in the year
DP1X: Number of days with >= 1.0 inch/25.4 mm of precipitation in the year
DT00: Number of days with minimum temperature <= 0 degrees Fahrenheit/-17.8 degrees Celsius
DT32: Number of days with minimum temperature <= 32 degrees Fahrenheit/0 degrees Celsius
DX32: Number of days with maximum temperature <= 32 degrees Fahrenheit/0 degrees Celsius
DX70: Number of days with maximum temperature >= 70 degrees Fahrenheit/21.1 degrees Celsius
DX90: Number of days with maximum temperature >= 90 degrees Fahrenheit/32.2 degrees Celsius
EMNT: Extreme minimum temperature for year. Lowest daily minimum temperature for the year. (in Fahrenheit)
EMXP: Highest daily total of precipitation in the year. (in inches)
EMXT: Extreme maximum temperature for year. Highest daily maximum temperature for the year. (in Fahrenheit)
FZF0: Temperature value of first freeze (<= 32 degrees Fahrenheit/0 degrees Celsius) during August - December. (in Fahrenheit)
FZF1: Temperature value of first freeze (<= 28 degrees Fahrenheit/-2.2 degrees Celsius) during August - December. (in Fahrenheit)
FZF2: Temperature value of first freeze (<= 24 degrees Fahrenheit/-4.4 degrees Celsius) during August - December. (in Fahrenheit)
FZF3: Temperature value of first freeze (<= 20 degrees Fahrenheit/-6.7 degrees Celsius) during August - December. (in Fahrenheit)
FZF4: Temperature value of first freeze (<= 16 degrees Fahrenheit/-8.9 degrees Celsius) during August - December. (in Fahrenheit)
FZF5: Temperature value of last freeze (<= 32 degrees Fahrenheit/0 degrees Celsius) during January - July. (in Fahrenheit)
FZF6: Temperature value of last freeze (<= 28 degrees Fahrenheit/-2.2 degrees Celsius) during January - July. (in Fahrenheit)
FZF7: Temperature value of last freeze (<= 24 degrees Fahrenheit/-4.4 degrees Celsius) during January - July. (in Fahrenheit)
FZF8: Temperature value of last freeze (<= 20 degrees Fahrenheit/-6.7 degrees Celsius) during January - July. (in Fahrenheit)
FZF9: Temperature value of last freeze (<= 16 degrees Fahrenheit/-8.9 degrees Celsius) during January - July. (in Fahrenheit)
HDSD: Heating Degree Days (season-to-date). Running total of monthly heating degree days through the end of the season.
HTDD: Heating Degree Days. Computed when daily average temperature is less than 65 degrees Fahrenheit/18.3 degrees Celsius.
PRCP: Total Annual Precipitation. (in inches)
TAVG: Average Annual Temperature - computed by adding the unrounded TMAX and TMIN and dividing by 2. (Fahrenheit)
TMAX: Average Annual Maximum Temperature - average of mean monthly maximum temperatures (Fahrenheit)
TMIN: Average Annual Minimum Temperature - average of mean monthly minimum temperatures (Fahrenheit)
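To make the degree-day variables (CLDD, HTDD) more concrete, here is a minimal sketch of the usual degree-day arithmetic with a 65 degrees Fahrenheit base. The function names and the sample temperatures are hypothetical and only for illustration; this is not how NOAA's pipeline is implemented, just the standard definition applied to a few daily averages.

```python
BASE_F = 65.0  # base temperature from the CLDD/HTDD definitions above

def cooling_degree_days(daily_avg_temps_f):
    """Sum of degrees above the base, counting only days warmer than the base."""
    return sum(max(0.0, t - BASE_F) for t in daily_avg_temps_f)

def heating_degree_days(daily_avg_temps_f):
    """Sum of degrees below the base, counting only days cooler than the base."""
    return sum(max(0.0, BASE_F - t) for t in daily_avg_temps_f)

# Hypothetical daily average temperatures (Fahrenheit), not real station data.
sample = [60.0, 65.0, 70.0, 80.0]
cdd = cooling_degree_days(sample)  # 0 + 0 + 5 + 15
hdd = heating_degree_days(sample)  # 5 + 0 + 0 + 0
print(cdd, hdd)
```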
Documentation to explain the attribute labels in the GSOY data: GSOY attribute documentation, GSOY README link
Requirements for clustering: numeric and unlabeled data ONLY
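One way to check this requirement in pandas is to keep only the numeric columns, which also drops label-like text columns. The sketch below assumes a small wide-format dataframe; the column names and values are placeholders, not the project's actual data.

```python
import pandas as pd

# Placeholder frame: "station" acts as a label, the rest are numeric features.
df = pd.DataFrame({
    "station": ["A", "B"],
    "PRCP": [10.5, 20.1],
    "TAVG": [55.0, 61.2],
})

# Keep numeric columns only; the text label column is dropped.
features = df.select_dtypes(include="number")
print(list(features.columns))
```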
Starting dataframe of data from initial API request.
Each row in the dataframe has a unique combination of date, "datatype" (explained in the documentation), station, value, and attribute, where the value is the measurement for that row's data type.
To get a better understanding of exactly what the data tells us, we'll first break down the information of an example row.
I would like to use the 4th row (index = 3) as the example:
The 4th row has a data type of "DP10", which, according to the documentation, is the "number of days with >=0.1 inch/2.54 mm [of precipitation] in the year". The quantity in the "value" column is therefore that number of days.
In other words, the 4th row tells us that in 1990 there were 45 days with >=0.1 inch/2.54 mm of precipitation at station "GHCND: AG000060390".
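The same reading of a row can be sketched in code. Here a plain dict stands in for one dataframe row; the field names follow the columns listed above, while the exact date string, the station ID written without a space, and the `describe_row` helper are assumptions made for illustration.

```python
# Short descriptions keyed by data type; only DP10 is needed for this example.
DESCRIPTIONS = {"DP10": "days with >= 0.1 inch/2.54 mm of precipitation"}

def describe_row(row):
    """Turn one long-format row into a readable sentence (hypothetical helper)."""
    year = int(row["date"][:4])  # assumes the date string starts with the year
    meaning = DESCRIPTIONS.get(row["datatype"], row["datatype"])
    return f"{year}: {row['value']} {meaning} at station {row['station']}"

# Values taken from the example row discussed above; date format is assumed.
example = {
    "date": "1990-01-01",
    "datatype": "DP10",
    "station": "GHCND:AG000060390",
    "value": 45,
}
summary = describe_row(example)
print(summary)
```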
There are still some NaNs in the data, but here they can be replaced with zeros: if the API returned no value for a data type, there were no days in the year matching that data type.
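A minimal sketch of how such NaNs can arise and be filled, assuming the long-format rows are pivoted so that each data type becomes its own column. The station name "S1" and every value except the DP10 count of 45 for 1990 are placeholders; the linked cleaning script may do this differently.

```python
import pandas as pd

# Long format: one row per (station, date, datatype). Mostly placeholder values.
long_df = pd.DataFrame({
    "date": ["1990-01-01", "1990-01-01", "1991-01-01"],
    "datatype": ["DP10", "DX90", "DP10"],
    "station": ["S1", "S1", "S1"],
    "value": [45, 12, 50],
})

# Wide format: one row per station and date, one column per data type.
# A data type missing for a given station and date shows up as NaN.
wide = long_df.pivot_table(
    index=["station", "date"],
    columns="datatype",
    values="value",
    aggfunc="first",
)

# Replace the missing entries with zero, as described above.
filled = wide.fillna(0)
print(filled)
```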
The cleaned dataset has no missing values and contains only numeric columns.