West Nile Virus is a mosquito-borne disease that first appeared in the U.S. in 1999. A major outbreak occurred in 2012, and since then, California has had the most human infections of any state, along with growing numbers of positive mosquito samples trapped. When a human is infected, there is a 20% chance of developing debilitating symptoms.
My goal was to create a predictive model that can use environmental data to forecast the presence of West Nile Virus in certain California counties. I used positive mosquito sample data from the California West Nile Virus website and climate data from the National Oceanic and Atmospheric Administration's (NOAA) database. To carry out the modeling, I used the programming language Python. The type of model I used was logistic regression. Virus presence was determined by whether the overall weekly frequency of positive mosquito samples in a given county exceeded the 5-year average.
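As a minimal sketch of the labeling rule described above, the snippet below marks a week as "virus present" when its count of positive mosquito samples exceeds a 5-year average. It assumes that the average is taken over the same calendar week in the five preceding years; the abstract does not specify this, and the function name and the counts are hypothetical.

```python
from statistics import mean

def label_week(current_count, prior_year_counts):
    """Return 1 if this week's positive-sample count exceeds the
    average of the prior years' counts for the same week, else 0."""
    baseline = mean(prior_year_counts)
    return 1 if current_count > baseline else 0

# Hypothetical positive-sample counts for one county, for the same
# calendar week in each of the five preceding years (mean is 5).
prior_five_years = [4, 7, 2, 5, 7]

# Label three made-up weekly counts against that baseline.
labels = [label_week(count, prior_five_years) for count in (3, 5, 9)]
print(labels)  # [0, 0, 1]
```

A count equal to the average is labeled 0 here because the rule says "exceeded"; that boundary choice is also an assumption.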
In the preliminary stage of my project, I learned the programming language and researched the various counties in the state. I narrowed my study down to three counties: Los Angeles, Orange, and Riverside. I then restructured and merged the climatic datasets and mosquito datasets into a format I could readily use. To develop the model, I split the merged data into a training set and a testing set. I created a logistic regression model using the former, and then used the testing set and other statistical measures to evaluate the model. If the model was not as accurate as hoped, I altered the environmental variables and went through the developmental process again.
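The workflow above (merge, split, fit, evaluate) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the data are synthetic, the features are assumed to be standardized, the helper names are hypothetical, and the model is fitted with a simple hand-written gradient descent rather than whatever library the project actually used.

```python
import math
import random

def merge_by_key(climate, labels):
    """Join climate features with virus labels on (county, week)."""
    return [(climate[k], labels[k]) for k in climate if k in labels]

def split_rows(rows, test_fraction, seed):
    """Shuffle the rows and hold out a fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def probability(weights, bias, x):
    """Logistic model: predicted probability that the label is 1."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, learning_rate=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(rows)
    weights, bias = [0.0] * len(rows[0][0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in rows:
            error = probability(weights, bias, x) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

# Synthetic stand-in data (NOT the project's data): standardized wind
# speed, precipitation, and temperature per county and week, with a
# made-up rule that the virus is "present" in warmer-than-average weeks.
rng = random.Random(0)
climate, labels = {}, {}
for county in ("Los Angeles", "Orange", "Riverside"):
    for week in range(1, 41):
        wind, precip, temp = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        climate[(county, week)] = [wind, precip, temp]
        labels[(county, week)] = 1 if temp > 0 else 0

rows = merge_by_key(climate, labels)
train, test = split_rows(rows, test_fraction=0.25, seed=1)
weights, bias = fit_logistic(train)

# Accuracy on the held-out weeks, using a 0.5 probability threshold.
correct = sum((probability(weights, bias, x) >= 0.5) == (y == 1) for x, y in test)
test_accuracy = correct / len(test)
print(len(train), len(test), round(test_accuracy, 2))
```

Changing which features go into each row and refitting corresponds to the step of altering the environmental variables and repeating the process.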
I modeled and tested various combinations of climatic variables. My final model used wind speed, precipitation level, and temperature to generate predictions. All of the data came in the form of weekly averages. When tested on all three counties together, the train/test accuracy was 0.7, the cross-validation accuracy was 0.63, the area under the Receiver Operating Characteristic (ROC) curve was 0.58, and the F1 score was also 0.58. The model was most effective in Orange County, where it scored 0.77, 0.68, 0.65, and 0.71, respectively, on the same four metrics.
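To make the metrics above concrete, here is a small sketch that computes accuracy, F1 score, and area under the ROC curve by hand. The labels and predicted probabilities are invented for illustration and are unrelated to the reported results, and the function names are hypothetical. Cross-validation accuracy is the same accuracy averaged over several train/test splits, so it is not shown separately.

```python
def accuracy(y_true, y_pred):
    """Fraction of weeks whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(y_true, scores):
    """Probability that a random positive week is scored higher than a
    random negative week (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented example: true labels and model probabilities for 8 weeks.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.8, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(y_true, y_pred))  # 0.75
print(f1_score(y_true, y_pred))  # 0.75
print(roc_auc(y_true, scores))   # 0.875
```

An area under the ROC curve of 0.5 corresponds to ranking weeks at random, which gives a reference point for reading a value such as 0.58.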
Multiple conclusions and inferences can be drawn from these results. The model produced an excess of false negatives, which indicates that this combination of climatic variables may not have a very strong correlation with the presence of West Nile Virus. A possibly influential variable omitted from this experiment was humidity, for which no data were available in the NOAA database. Furthermore, climatic fluctuations in the horizontally stretched Riverside County may have diminished the potency of the model.
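A false negative here is a week in which the virus was present but the model predicted it was absent. The sketch below, using invented labels and a hypothetical helper name, shows one way to count them and why a model can have moderate accuracy while still missing most virus-present weeks.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            counts["tp"] += 1
        elif t == 0 and p == 1:
            counts["fp"] += 1
        elif t == 0 and p == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts

# Invented example: 4 virus-present weeks, 6 virus-absent weeks.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
# Share of virus-present weeks that the model missed.
false_negative_rate = counts["fn"] / (counts["fn"] + counts["tp"])
overall_accuracy = (counts["tp"] + counts["tn"]) / len(y_true)

print(counts)               # {'tp': 1, 'fp': 1, 'tn': 5, 'fn': 3}
print(false_negative_rate)  # 0.75
print(overall_accuracy)     # 0.6
```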
My finished model predicts whether West Nile Virus will be present in the specified California counties based on wind speed, precipitation level, and temperature. It is most successful in Orange County.