Our dataset, pulled from Kaggle, was originally compiled from the NASA Langley Research Center (LaRC) POWER Project, funded through the NASA Earth Science/Applied Science Program, and from the U.S. Drought Monitor (USDM). It contains daily measurements of weather variables, indexed by U.S. county FIPS code, together with the corresponding weekly USDM drought score, from 1997 to 2020. The weather variables include temperature, humidity, wind speed, precipitation, and pressure; a full list can be found in the appendix. The latitude and longitude of each county were included as well.
From the original dataset, we used only counties located within California, i.e., those with FIPS codes in the 6000-6999 range. Daily weather variables were then averaged for each week, yielding a dataset of 63,568 rows, each pairing a weekly drought score with the corresponding weekly weather-variable averages: 1,096 weeks of data for each of the 58 California counties.
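The filtering and weekly averaging described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not our exact preprocessing code: the row keys (`fips`, `date`, `score`), the helper names, and the choice to group days by ISO calendar week are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def is_california(fips: int) -> bool:
    """California county FIPS codes fall in the 6000-6999 range."""
    return 6000 <= fips <= 6999


def weekly_averages(rows, weather_vars):
    """Average daily weather values per (county, ISO week).

    `rows` is an iterable of dicts with the hypothetical keys 'fips',
    'date' (a datetime.date), and 'score' (None on days without a USDM
    score), plus one key per name in `weather_vars`.
    Assumes at most one USDM score per county per week.
    """
    groups = defaultdict(list)
    for row in rows:
        if is_california(row["fips"]):
            iso = row["date"].isocalendar()  # (ISO year, ISO week, weekday)
            groups[(row["fips"], iso[0], iso[1])].append(row)

    weekly = []
    for (fips, year, week), days in sorted(groups.items()):
        scores = [d["score"] for d in days if d["score"] is not None]
        if not scores:
            continue  # no drought score available for this week
        record = {"fips": fips, "year": year, "week": week, "score": scores[0]}
        for var in weather_vars:
            record[var] = mean(d[var] for d in days)
        weekly.append(record)
    return weekly
```

Each returned record corresponds to one row of the weekly dataset: one county, one week, one drought score, and the weekly mean of each weather variable. How weeks are aligned with the USDM release day is a design choice; ISO weeks are used here only for simplicity.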
The dataset was split chronologically into training, validation, and test sets in roughly a 70%/10%/20% ratio: training used data from 2000-2014, validation 2015-2016, and test 2017-2020. This split ensured that the model would encounter the full range of possible drought scores in training. However, the validation set contained mostly severe drought scores (greater than 2.5), and the test set contained no exceptional drought scores (greater than 4.5).
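A year-based split of this kind can be sketched as below. This is only an illustration under stated assumptions: the function name `split_by_year` and the `year` key on each record are hypothetical, and the boundary years are taken from the ranges given above.

```python
def split_by_year(records, train_end=2014, val_end=2016):
    """Split records chronologically by their 'year' field.

    Records up to `train_end` go to training, records after that up to
    `val_end` go to validation, and everything later goes to test.
    The 'year' key is a hypothetical field name used for illustration.
    """
    train, val, test = [], [], []
    for rec in records:
        if rec["year"] <= train_end:
            train.append(rec)
        elif rec["year"] <= val_end:
            val.append(rec)
        else:
            test.append(rec)
    return train, val, test


# One placeholder record per year, just to show the proportions.
records = [{"year": y} for y in range(2000, 2021)]
train, val, test = split_by_year(records)
fractions = [len(part) / len(records) for part in (train, val, test)]
```

With one record per year for 2000-2020, the three parts cover 15, 2, and 4 of the 21 years, or about 71%, 10%, and 19%, which is where the approximate 70%/10%/20% ratio comes from. Splitting by time rather than at random keeps later weeks out of the training set.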
Figure 2: Variables from our dataset (NASA Langley Research Center)
Figure 3: California USDM drought scores over time, split into our training (2000-2014), validation (2015-2016), and test (2017-2020) datasets