Analyzing Ann Arbor's Weather Using Data Mining Techniques: Humidity, Spring Length and Precipitation Prediction
Mengying Zhang, Victor Yang, Hyojin Kim
Challenges: Big Data, Machine Learning, Time Series, Data Manipulation
Tools: R
According to a climate study, global warming has increased the average global temperature by 1.4 degrees Fahrenheit since 1880, and the last two decades of the 20th century were the hottest in the last 400 years (“35 Facts About Global Warming”). But how does this affect local weather? In this project, we use data from the Weather Underground website to explore how the weather has changed in Ann Arbor over the past 50 years.
Our project consists of three parts. The first and second parts are exploratory data analyses of how Ann Arbor’s weather has changed over the past 50 years. Besides temperature, we look at how the humidity level has changed and whether the timing of the seasons has shifted. In the third part, we develop three models that use the previous day’s weather information to predict whether there will be precipitation the next day. We use variables selected by the Gini index through random forest as predictors, and we apply logistic regression, linear discriminant analysis and random forest, respectively, to construct three parallel precipitation prediction models. We then compare the prediction performance of the three models on held-out test data.
Our analysis finds that the driest time in Ann Arbor is around the beginning of May, and that this has not changed over the years. However, compared to 30 years ago, the humidity level has become less stable within a year and the weather has become drier. Our analysis also finds that the length of spring varies significantly from year to year. All three precipitation prediction models perform impressively well, with an average testing error of around 17%. We hope that our project can contribute to the meteorological study of the Ann Arbor area and shed light on how data mining techniques can be applied to weather forecasting.
The data comes from Weather Underground, and we use an R package called “weatherData” to extract the daily weather summaries. The project was intended to study Ann Arbor’s weather over the past 50 years, but we found that the nearest weather station, KARB (the weather station code), did not record the weather until 1999. Therefore, we use a bigger, much older weather station in Detroit (airport code: DTW) to approximate Ann Arbor’s weather. The difference in temperature and humidity between the two places is very small, since they are only 57 km (about an hour’s drive) apart. The original data goes back much further, but in this project we only use the most recent 50 years (1967-2016). For each day, the data contains temperature, humidity, dew point, sea level pressure, visibility and wind speed, each recorded as a maximum, mean and minimum value. The data also records precipitation in inches, cloud cover, wind direction in degrees and the weather events of the day. Some data was missing for the year 2000; the other years had no missing data.
a. What's the driest month in Ann Arbor?
We compare the humidity level of recent years with that of 30 years ago to study how the driest month has changed in Ann Arbor. Because the definition of “driest time” differs between contexts, and the data only contains the maximum, mean and minimum humidity percentage for each day, we define the driest day of the year as the day with the lowest maximum humidity percentage, and the driest time of the year as the period around that day. For simplicity, we compare humidity levels day by day rather than over time intervals, because we assume the humidity level does not vary much across consecutive days. In other words, we assume the humidity level of a day represents the humidity level around that day. Note that we did not choose mean humidity as the indicator because the mean is sensitive to extreme values. For example, it can rain even during the driest season, and when it rains the humidity reaches 100%, which raises the mean humidity of the day even if the day is very dry most of the time.
We first compare the humidity data from 2012-2016 with the data from 1980-1984 to analyze how the driest time of the year has changed (Figure 1.1).
The plots support our assumption that the humidity level does not vary much across consecutive days. For example, in early May the maximum humidity levels are lower than in the rest of the year. So, for simplicity, it is sufficient to look at the single day with the minimum Max_Humidity level. We find the minimum of the maximum humidity level in each of these years and find that the driest month did not differ much between the 1980s and the 2010s.
The driest day falls between late April and mid-May in every year except 2015, when the driest day was October 20th. Therefore, we draw the rough conclusion that the driest season in Ann Arbor is spring, and specifically May.
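The driest-day rule used in this section can be sketched in a few lines. This is an illustrative Python sketch, not the R code used in the project; the record layout of (date, maximum humidity) pairs and the sample values are made up for the example.

```python
from datetime import date

def driest_day_by_year(records):
    """Return {year: (date, max_humidity)} for the day with the lowest
    daily maximum humidity in each year.

    `records` is an iterable of (date, max_humidity) pairs; this layout is
    an assumption for the sketch, not the project's actual data format.
    """
    driest = {}
    for day, max_humidity in records:
        best = driest.get(day.year)
        # Strict comparison keeps the first record seen when values tie.
        if best is None or max_humidity < best[1]:
            driest[day.year] = (day, max_humidity)
    return driest

# Made-up sample values, for illustration only.
sample = [
    (date(2014, 1, 15), 92),
    (date(2014, 5, 3), 48),
    (date(2014, 8, 20), 88),
    (date(2015, 5, 10), 61),
    (date(2015, 10, 20), 55),
]
result = driest_day_by_year(sample)
```

Because the rule picks a single day, one unusual reading can move the result to a different season, which is one way to read the 2015 exception.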
b. How has the humidity level changed in Ann Arbor?
From Figure 1.1, we find that the humidity was on average higher in the 1980s than in recent years. In the 1980s the maximum humidity level stayed quite high for most of the year, whereas in recent years it varies more. On most days, in both the 1980s and the 2010s, the maximum humidity level is around 90%, but its variability has increased considerably in recent years.
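One way to put numbers on "higher on average" and "more variable" is to compare the mean and standard deviation of the daily maximum humidity in the two periods. The sketch below is in Python rather than the project's R, and the two short lists are made-up values chosen only to show the calculation, not figures from the data.

```python
from statistics import mean, pstdev

def summarize(max_humidity_values):
    """Return (mean, population standard deviation) of daily maximum humidity."""
    values = list(max_humidity_values)
    return mean(values), pstdev(values)

# Made-up daily maximum humidity percentages, for illustration only.
period_1980s = [93, 90, 88, 91, 89, 92]
period_2010s = [97, 78, 94, 70, 96, 87]

mean_80s, sd_80s = summarize(period_1980s)
mean_10s, sd_10s = summarize(period_2010s)
```

A larger standard deviation for the later period would correspond to the increased variability described above.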
a. Has the length of spring become shorter?
These days, many people think that summer and winter are getting longer because of global warming, and many media outlets report that spring is disappearing. Over the past few years in Ann Arbor, we have felt that spring has become shorter: winter suddenly transitions into summer, and we are left wondering where spring went.
To see whether this is true, we computed the length of spring for each of the past 50 years, from 1967 to 2016. We define the start of spring as the point when the maximum temperature is greater than 46 degrees Fahrenheit for at least 7 days. We obtained this threshold by averaging the temperature on the “traditional” start of spring, March 20th. Similarly, we define the end of spring as the start of summer (traditionally around June 20th), when the maximum temperature is greater than 76 degrees Fahrenheit for at least a week.
Our analysis shows no obvious trend in the length of spring over the years, as shown in Figure 2.1. The longest spring is 115 days, in 1979, and the shortest is 29 days, in 1982. The average length is 71.86 days, or about 2.5 months, which is close to the 3 months usually assumed for each season. Therefore, we cannot conclude that the length of spring is decreasing.
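The threshold rule for the start and end of spring can be sketched as follows. This is an illustrative Python sketch, not the project's R code. It assumes that the 7 days must be consecutive and that the season boundary is the first day of the run; the text does not state either choice. The temperature series is made up.

```python
def first_sustained_day(max_temps, threshold, run_length=7):
    """Return the index of the first day of the first run of at least
    `run_length` consecutive days whose maximum temperature exceeds
    `threshold`, or None if there is no such run."""
    run = 0
    for i, temp in enumerate(max_temps):
        run = run + 1 if temp > threshold else 0
        if run == run_length:
            return i - run_length + 1
    return None

def spring_length(max_temps, start_threshold=46, end_threshold=76):
    """Days between the sustained start of spring and the sustained
    start of summer, or None if either boundary is not found."""
    start = first_sustained_day(max_temps, start_threshold)
    end = first_sustained_day(max_temps, end_threshold)
    if start is None or end is None:
        return None
    return end - start

# Made-up daily maximum temperatures (degrees Fahrenheit), illustration only:
# 10 cold days, 3 mild days, 1 cold day, 20 mild days, then 10 warm days.
temps = [30] * 10 + [50] * 3 + [40] + [55] * 20 + [80] * 10
```

In the made-up series, the three mild days followed by a cold day do not count as the start of spring, because the run is shorter than a week.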
b. How has the starting time of spring changed over the years?
If the length of spring has not changed over the past 50 years, then what about its starting date? Is spring starting earlier or later? Typically, spring starts in late March. Figure 2.2 shows the starting week of spring by year. The starting week varies between week 9 and week 16, that is, from late February to mid-April. The earliest starting date is February 27th, 1983, and the latest is April 18th, 1981. On average, spring has started between week 12 and week 13, which is late March or the beginning of April. Therefore, we conclude that the starting date of spring has not shifted systematically over the past 50 years.
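The text does not say how dates are converted to week numbers. One convention, assumed here, counts weeks in 7-day blocks from January 1st, so that days 1 to 7 of the year are week 1. Under that assumption the two extreme dates quoted above fall in weeks 9 and 16, matching the reported range. The sketch is in Python rather than the project's R.

```python
from datetime import date

def week_of_year(day):
    """Week number counted in 7-day blocks from January 1st (days 1-7 are
    week 1). This convention is an assumption, not stated in the report."""
    day_of_year = day.timetuple().tm_yday
    return (day_of_year - 1) // 7 + 1

earliest = week_of_year(date(1983, 2, 27))  # earliest start quoted in the text
latest = week_of_year(date(1981, 4, 18))    # latest start quoted in the text
```

Other conventions, such as ISO weeks, can give a week number that differs by one for the same date.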
a. Variable Selection
There are many aspects of weather that can be predicted. In our analysis, we predict whether there will be precipitation on a given day from the previous day’s weather variables. Since our data contains 13 explanatory variables, and because of the “curse of dimensionality”, we decided to look at the random forest approach first. This method allowed us to drop variables that contribute little, reducing the number of variables needed to fit our models. We then looked at other classification methods, namely logistic regression and linear discriminant analysis, and compared the performance of all three methods.
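Predicting from the previous day means pairing each day's predictors with the following day's precipitation status. A minimal Python sketch of that pairing is given below; it is not the project's R code, the field names "features" and "precip" are hypothetical, and the records are made up. It also assumes the records are consecutive calendar days with no gaps.

```python
def make_lagged_pairs(days):
    """Pair each day's predictors with the next day's precipitation flag.

    `days` is a chronologically ordered list of dicts with a "features"
    dict and a boolean "precip" flag (hypothetical field names).
    """
    pairs = []
    for today, tomorrow in zip(days, days[1:]):
        pairs.append((today["features"], tomorrow["precip"]))
    return pairs

# Made-up records, for illustration only.
days = [
    {"features": {"CloudCover": 2, "Mean_Humidity": 60}, "precip": False},
    {"features": {"CloudCover": 7, "Mean_Humidity": 85}, "precip": False},
    {"features": {"CloudCover": 8, "Mean_Humidity": 90}, "precip": True},
]
pairs = make_lagged_pairs(days)
```

The last day has no following day, so n days yield n - 1 training pairs.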
b. Model Comparison: Random Forest, Logistic Regression Prediction Model, Linear Discriminant Analysis Prediction Model
We first looked at a tree-based approach, using random forest to predict the next day’s precipitation status. We used five years of training data (2011 to 2015) and one year of testing data (2016), and found the test error to be ~15.8%. Next, we measured the Gini index of each explanatory variable to find which variables are important in our model. To avoid dependence among our predictors, we kept only the most important variable from each group of similar explanatory variables. We found that the four most important variables were “Min_VisibilityMiles”, “CloudCover”, “Mean_Humidity” and “Min_Sea_Level_PressureIn”, in that order. Using these four explanatory variables, we fit a logistic regression model and a linear discriminant analysis model to see how well these other methods performed.
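Random forest importance based on the Gini index rests on how much a split on a variable reduces the Gini impurity of the class labels; the forest accumulates such decreases over the splits and trees where the variable is used. The Python sketch below shows the calculation for a single split. It is not the project's R code, and the cloud cover values, labels and threshold are made up.

```python
def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_decrease(values, labels, threshold):
    """Impurity decrease from splitting the data on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = ((len(left) / n) * gini_impurity(left)
                + (len(right) / n) * gini_impurity(right))
    return gini_impurity(labels) - weighted

# Made-up cloud cover values and next-day rain labels, for illustration only.
cloud_cover = [1, 2, 3, 6, 7, 8]
rained = [False, False, False, True, True, False]
decrease = gini_decrease(cloud_cover, rained, threshold=4)
```

A variable whose splits produce large decreases separates wet days from dry days well, which is why it ranks high in the importance list.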
In our logistic regression model, we again used five years’ worth of training data (2011 to 2015) and one year of testing data (2016). We found its test error to be ~18.0%. In our linear discriminant analysis, under the same conditions, we found the test error to be ~17.2%. We conclude that all three models performed surprisingly well, with random forest having the best prediction performance.
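The test error quoted for each model is taken here to be the misclassification rate on the 2016 test year, that is, the share of days whose precipitation status was predicted incorrectly. The Python sketch below illustrates that calculation with made-up predictions; it is not the project's R code. It also averages the three reported error rates.

```python
def misclassification_rate(predicted, actual):
    """Fraction of test days whose precipitation status was misclassified."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

# Made-up predictions and outcomes, for illustration only.
predicted = [True, False, False, True, False]
actual = [True, False, True, True, False]
error = misclassification_rate(predicted, actual)

# Test errors as reported in the text for the three models.
reported = {"random forest": 0.158, "logistic regression": 0.180, "LDA": 0.172}
average_error = sum(reported.values()) / len(reported)
```

The three reported errors average 17.0%, which agrees with the figure of around 17% given in the overview.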
One limitation of our exploratory data analysis is the way we define the driest day and the length of spring. We did not use formal meteorological definitions of these concepts, so our definitions may differ from those a meteorologist would use. However, we applied the same standard consistently across years, so our conclusions should not be far from what the formal definitions would give. A major concern for our prediction models is the selection of predictors. We only included the previous day’s information as predictors and did not consider the weather of earlier days. Also, because of the limits of our data source, we did not consider any variables that are not present in our dataset. We did not consult the meteorology literature when constructing the models, so they are open to criticism from meteorologists.
This report analyzed several aspects of Ann Arbor’s weather: humidity, spring length and the seasonal shift of spring, and precipitation. We first compared the humidity level of 2012-2016 with that of 1980-1984 to see whether the driest month has changed in Ann Arbor. The driest month is defined as the month containing the day with the lowest maximum humidity percentage. In 1.a, we explored shifts in the driest month and found that it does not differ much between the past and the present; the driest month in Ann Arbor is May. In 1.b, we investigated how the humidity level has changed. The maximum humidity level is about 90% on most days in both the 1980s and the 2010s data. The maximum humidity level stayed high throughout the 1980s, but in the 2010s it varied more. We conclude that the variability of the maximum humidity level has increased considerably over the years.
The second part of our report analyzed Ann Arbor’s spring. In 2.a, we studied the length of spring and whether it is shortening. We found no obvious trend or pattern in the length of spring; the average length is about 2.5 months, so we cannot conclude that it has become shorter. We also examined whether the starting time of spring has been delayed. The starting date fluctuated between late February and mid-April and averaged late March, which is consistent with the traditional start date of spring. Therefore, we found no evidence that Ann Arbor’s spring has changed.
In the third part of our report, we predicted whether there will be precipitation on a given day using the previous day’s predictor variables. We used 4 variables selected by random forest and then applied 3 classification methods: random forest, logistic regression and LDA. We used five years of training data and one year of testing data. The testing error rates of random forest, logistic regression and LDA are 15.8%, 18.0% and 17.2%, respectively. The models perform surprisingly well, and random forest has the best performance of the three.