Wealth and Population Density Influences on COVID-19 Cases and Vaccinations Using Machine Learning
by
This research investigates the influence that wealth and population density have had on the COVID-19 pandemic in the United States.
Since March 2020, the COVID-19 virus has raged through the US and around the world. To date, there have been over 175 million cases worldwide and nearly 35 million cases within the US alone. The US death toll is over 600 thousand. Health organizations have placed significant emphasis on social distancing and wearing masks to prevent the virus from spreading through airborne droplets. Federal and state governments put lockdowns in place and, after a year, have only recently lifted them as vaccines have become available and the population approaches herd immunity.
The goal of this study is to apply machine learning to US county data to uncover possible correlations of wealth and population density with COVID-19 cases and vaccinations.
The study will serve to answer the following questions:
Have wealthy counties been impacted differently by COVID-19?
Has county population density played a role in COVID-19?
Have county wealth and population density together influenced COVID-19 infections and vaccinations?
Indonesian Study
"The Application of K-Means Clustering for Province Clustering in Indonesia of the Risk of COVID-19 Pandemic Based on COVID-19 Data" [1]
This study used K-Means to cluster provinces in Indonesia based on confirmed COVID-19 cases, deaths, and recoveries.
The clustering generated 3 groups that provided input to the Indonesian government in making policies on lockdowns to overcome the spread of the virus.
India Study
“An Analysis of COVID-19 Clusters in India.” [2]
This study used K-Means to cluster the 50 worst affected districts by case counts in India.
Data studied were COVID-19 cases, population density, and the number of specialty hospitals.
The clustering generated 3 groups that provided insights:
Cluster 1 burdened hospitals most heavily.
Cluster 3 was most effective in controlling the disease.
United States Study
“Pattern Recognition of the COVID-19 Pandemic in the United States: Implications for Disease Mitigation” [3]
This study used K-Means and PCA to investigate three time periods to uncover seasonal trends of cases during the COVID-19 pandemic.
County-level total case data were used, clustered by state.
Results:
Early phase – 3 clusters
Mid-Phase – 2 clusters
Late Phase – 4 clusters
Whole Period – 3 clusters
US County is the unit of analysis for this research.
This unit is represented in the data by the FIPS code.
The 5-digit Federal Information Processing Standards (FIPS) code uniquely identifies counties and county equivalents in the United States.
Data sets are joined using the FIPS code (with corresponding county names).
The FIPS code is used to group data features such as population density, income, COVID cases, deaths, and vaccinations during analysis and machine learning.
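Because the FIPS code is the join key, its representation matters: when a CSV reader parses the column as an integer, codes for states such as California lose their leading zero. The sketch below shows one way to restore the 5-character form with pandas. The column names and the helper name normalize_fips are assumptions made for illustration, not taken from the study's notebooks.

```python
import pandas as pd

def normalize_fips(values: pd.Series) -> pd.Series:
    """Return FIPS codes as zero-padded 5-character strings.

    Assumes the codes are integers or integer-like strings (not floats).
    """
    # Casting to string and left-padding restores the leading zero that
    # numeric parsing drops (e.g. 6037 -> "06037").
    return values.astype(str).str.strip().str.zfill(5)

# Hypothetical sample: Los Angeles County, CA and Pinellas County, FL.
sample = pd.DataFrame({"fips": [6037, 12103],
                       "county": ["Los Angeles", "Pinellas"]})
sample["fips"] = normalize_fips(sample["fips"])
print(sample)
```

Keeping the key as a string in every data set avoids silent mismatches when the sets are joined.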
Merging Data Set Strategy
1. Population and Land Area Data are merged on FIPS code column
2. Vaccine Data merged with Median Incomes Data on County column
3. Results from 1 & 2 merged with each other on FIPS code column
4. Results from 3 merged with Virus Data on FIPS code column
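The merge order above can be sketched with pandas as follows. Every frame, column name, and value here is made up for illustration. In particular, the sketch assumes the income data carries the FIPS column so that the third merge has a key on both sides, which the source does not state.

```python
import pandas as pd

# Tiny hypothetical inputs; real data sets would be read from files.
population = pd.DataFrame({"fips": ["06037", "12103"], "population": [1000, 500]})
land_area = pd.DataFrame({"fips": ["06037", "12103"], "land_area_sq_mi": [10.0, 5.0]})
vaccines = pd.DataFrame({"county": ["Los Angeles", "Pinellas"], "vax_pct": [50.0, 45.0]})
incomes = pd.DataFrame({"county": ["Los Angeles", "Pinellas"],
                        "fips": ["06037", "12103"],   # assumed to be present
                        "median_income": [68000, 54000]})
virus = pd.DataFrame({"fips": ["06037", "12103"], "cases": [120, 70], "deaths": [3, 2]})

# Population and land area, joined on the FIPS code.
step1 = population.merge(land_area, on="fips", how="inner", validate="one_to_one")
# Vaccine and income data, joined on the county name.
step2 = vaccines.merge(incomes, on="county", how="inner", validate="one_to_one")
# The two intermediate results, joined on the FIPS code.
step3 = step1.merge(step2, on="fips", how="inner", validate="one_to_one")
# Finally, the virus data, joined on the FIPS code.
final = step3.merge(virus, on="fips", how="inner", validate="one_to_one")
print(final)
```

The validate argument makes pandas raise an error if a key is duplicated, which is a cheap guard when joining on county names that may repeat across states.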
Key Features in Data Sets
Final Consolidated Data Set
Of the variety of machine learning techniques, clustering is widely used for revealing structure in data, as it works for both labeled and unlabeled data. A particular strength of clustering is that it works very well on datasets where simple relationships among data items are unknown. This aspect makes clustering an ideal choice for modeling the data for this study.
Various clustering algorithms are available, but simple centroid-based k-Means clustering will be used for this study. Centroid-based clustering organizes the data into non-hierarchical clusters using a Euclidean distance based clustering mechanism. This method is typically faster than other clustering techniques.
Basic k-Means Machine Learning Implementation[4]:
Select k points at random as centroids.
Assign data points to the closest cluster based on Euclidean distance
Calculate centroid of all points within the cluster
Repeat these steps iteratively until convergence.
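The four steps above can be sketched in NumPy as follows. This is a minimal illustration under simple assumptions (random data points as initial centroids, an empty cluster keeps its previous centroid) and is not the implementation used in the study, which relies on a library model. The function name and the sample data are hypothetical.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-Means: returns (labels, centroids, iterations used)."""
    rng = np.random.default_rng(seed)
    # Select k distinct data points at random as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for iteration in range(1, max_iter + 1):
        # Assign every point to the closest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged once the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids, iteration

# Two well-separated synthetic groups of 20 points each.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
                    rng.normal(5.0, 0.1, size=(20, 2))])
labels, centroids, n_iter = kmeans(points, k=2)
print(labels, n_iter)
```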
The elbow method will be used to determine the optimal number of clusters. To do this, the k-Means implementation will be executed for multiple k values, and the sum of squared distances from each point to its centroid (the loss function)[1] will be plotted against k. The elbow is the point where the curve visibly bends, and this k will be selected as the optimum.
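A minimal sketch of the elbow sweep, assuming scikit-learn's KMeans, whose inertia_ attribute holds the within-cluster sum of squared distances. The synthetic three-group data and the helper name wcss_by_k are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

def wcss_by_k(features, k_values, seed=0):
    """Fit K-Means once per k and return {k: within-cluster sum of squares}."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)
        results[k] = model.inertia_
    return results

# Synthetic data with three obvious groups, so the bend should appear at k = 3.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
features = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

wcss = wcss_by_k(features, range(1, 8))
for k, value in wcss.items():
    print(k, round(value, 1))
```

Plotting the values against k (for example with matplotlib) gives the elbow curve. On real county data the bend is usually less sharp than in this synthetic case, so the choice of k remains a judgment call.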
The process above will be applied first to income data, then population data, and the various COVID data. Combinations of the data will then be attempted to glean knowledge from consolidated data clusterings.
Key Feature Statistics
Correlation Matrix
Investigate Relationships:
Population Per Square Mile and Cases Per Square Mile
Income and Cases Per Square Mile
Income and Vaccination Percentage
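The density features and the correlation matrix can be derived along these lines with pandas. The four rows and all column names are invented for illustration and do not come from the study's data set.

```python
import pandas as pd

# Hypothetical county rows; values are made up.
counties = pd.DataFrame({
    "population": [10_000, 50_000, 200_000, 1_000_000],
    "land_area_sq_mi": [500.0, 400.0, 300.0, 250.0],
    "cases": [800, 4_500, 21_000, 110_000],
    "median_income": [42_000, 51_000, 58_000, 75_000],
    "vax_pct": [28.0, 35.0, 44.0, 57.0],
})

# Per-square-mile densities used in the relationships listed above.
counties["pop_per_sq_mi"] = counties["population"] / counties["land_area_sq_mi"]
counties["cases_per_sq_mi"] = counties["cases"] / counties["land_area_sq_mi"]

features = ["pop_per_sq_mi", "cases_per_sq_mi", "median_income", "vax_pct"]
corr = counties[features].corr()  # Pearson correlation by default
print(corr.round(2))
```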
Focus Study on Subset
Use Only Two States to reduce data set for better K-Means results
California – 58 Counties
Florida – 67 Counties
Note that these states handled the pandemic very differently
California – Locked Down
Florida – Open
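One way to restrict the data to these two states is to use the first two digits of the county FIPS code, which identify the state (06 for California, 12 for Florida). The sketch below assumes a string-typed fips column; the helper name and sample rows are hypothetical.

```python
import pandas as pd

# State FIPS prefixes for the two states in the subset.
STATE_FIPS = {"CA": "06", "FL": "12"}

def filter_states(counties, state_codes, fips_col="fips"):
    """Keep only counties whose FIPS code starts with one of the state prefixes."""
    prefixes = [STATE_FIPS[code] for code in state_codes]
    mask = counties[fips_col].str[:2].isin(prefixes)
    return counties[mask].copy()

# Hypothetical rows: San Francisco (CA), Pinellas (FL), Fairfax (VA), Los Angeles (CA).
sample = pd.DataFrame({"fips": ["06075", "12103", "51059", "06037"],
                       "county": ["San Francisco", "Pinellas", "Fairfax", "Los Angeles"]})
subset = filter_states(sample, ["CA", "FL"])
print(subset)
```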
Correlation Matrix (limited to CA and FL)
Investigate Relationships:
Population Per Square Mile and Cases Per Square Mile
Income and Cases Per Square Mile (note stronger correlation)
Income and Vaccination Percentage (note stronger correlation)
The Florida and California data sets are explored further below in some choropleth and scatter/bubble charts to highlight different aspects of features in the data.
Florida and California County Choropleth Maps
Florida and California County Bubble Charts
The exploratory data analysis indicates the K-Means cluster modeling should focus on the following correlations:
Population Density and Case Density
Median Income and Vaccination Percentages
K-Means clustering was performed on both correlations for Florida and California separately. It was then run on a combined Florida and California data set to see the cross-state effect of putting them together. Lastly, K-Means clustering was performed on the full US counties data set for a national perspective.
Below is the algorithm used for the K-Means clustering on each data set:
Filter data to counties in subject state
Scale features using StandardScaler
Loop to determine optimal clusters using Elbow Method
Define K-Means clustering machine learning model
Fit the model with the scaled features
Plot the WCSS (within cluster sum of squares) to show the elbow
Define the K-Means model for the optimal K (i.e., number of clusters)
Fit and predict cluster labels using the scaled features
Scatter plot and choropleth map the resulting clusters
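The steps above, apart from the plotting, can be sketched with scikit-learn as shown below. The synthetic data, column names, and function names are assumptions for illustration. The sketch reports inertia_ and n_iter_, which correspond to the inertia and convergence figures listed in the results that follow.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def elbow_sweep(scaled, max_k=8, seed=0):
    """Return {k: WCSS} for k = 1..max_k so the elbow can be inspected."""
    return {k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(scaled).inertia_
            for k in range(1, max_k + 1)}

def fit_final(scaled, k, seed=0):
    """Fit K-Means for the chosen k and return (labels, fitted model)."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = model.fit_predict(scaled)
    return labels, model

# Synthetic stand-in for one state's counties: four groups of ten.
rng = np.random.default_rng(42)
centers = np.array([[200.0, 20.0], [2000.0, 200.0], [4000.0, 400.0], [6000.0, 600.0]])
rows = np.vstack([c + rng.normal(scale=[50.0, 5.0], size=(10, 2)) for c in centers])
counties = pd.DataFrame(rows, columns=["pop_per_sq_mi", "cases_per_sq_mi"])

# Scale features so that neither dominates the Euclidean distance.
scaled = StandardScaler().fit_transform(counties[["pop_per_sq_mi", "cases_per_sq_mi"]])

wcss = elbow_sweep(scaled)             # inspect or plot to choose k
labels, model = fit_final(scaled, k=4)
counties["cluster"] = labels
print("inertia:", round(model.inertia_, 3), "iterations:", model.n_iter_)
```

Scaling before clustering matters here because population density and case density differ by an order of magnitude in raw units.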
Florida Counties
K-Means -- Population Density and COVID-19 Case Density
Results
Optimal K - 4 clusters
Inertia - 9.5
Convergence - 5 iterations
The blue county represents Pinellas (i.e., Tampa/St. Petersburg) as the outlier with the highest population and case densities.
Orange counties (Miami, Orlando, and Jacksonville) are concerning with high population and case densities as well.
California Counties
K-Means -- Population Density and COVID-19 Case Density
Results
Optimal K - 4 clusters
Inertia - 1.4
Convergence - 2 iterations
Cyan shows SF as the outlier and the area of most concern with very high population and case densities.
Blue counties in the surrounding SF Bay Area and southern counties exhibit the same high population and case density characteristics.
California and Florida Counties
K-Means -- Population Density and COVID-19 Case Density
Results
Optimal K - 4 clusters
Inertia - 8.7
Convergence - 4 iterations
Note: Cluster 1 was removed since it represented a single county outlier and skewed the visualization.
Cyan county (i.e., SF outlier) indicates the area of most significant concern with the highest population density and case density.
Otherwise, blue and orange counties reflect other areas of concern with both high population and case density.
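Dropping a single-county cluster before plotting, as described in the note above, could be done along these lines. This is a hypothetical helper that assumes a cluster column of labels; the study's own filtering step is not shown in the source.

```python
import pandas as pd

def drop_singleton_clusters(clustered, cluster_col="cluster"):
    """Remove rows whose cluster contains only one county."""
    sizes = clustered.groupby(cluster_col)[cluster_col].transform("size")
    return clustered[sizes > 1].copy()

# Hypothetical labels: cluster 1 holds a single outlier county.
clustered = pd.DataFrame({"county": ["A", "B", "C", "D", "E"],
                          "cluster": [0, 0, 1, 2, 2]})
plot_data = drop_singleton_clusters(clustered)
print(plot_data)
```

The filtered frame is intended for visualization only; the outlier is still part of the clustering result.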
Florida Counties
K-Means -- Median Income and COVID-19 Vaccination Percentage
Results
Optimal K - 4 clusters
Inertia - 20.8
Convergence - 3 iterations
Red counties are poorer with low vaccination percentages while blue counties are wealthy with high vaccination percentages.
California Counties
K-Means -- Median Income and COVID-19 Vaccination Percentage
Results
Optimal K - 4 clusters
Inertia - 10.1
Convergence - 5 iterations
Red counties have lower income with low vaccination percentages while cyan counties are wealthy with high vaccination percentages.
California & Florida Counties
K-Means -- Median Income and COVID-19 Vaccination Percentage
Results
Optimal K - 5 clusters
Inertia - 29.9
Convergence - 7 iterations
Yellow counties have lower income with lower vaccination percentages
Blue and Cyan counties are wealthier with higher vaccination percentages.
Orange counties appear to have not reported vaccinations to the CDC.
The Phase II K-Means model execution demonstrated clustering relationships between population density and case density, as well as between median income and vaccination percentage, at the county level for the states of California and Florida. The clustering results appear to answer the key hypothesis questions: wealthy counties have demonstrated higher vaccination percentages, and population density is a strong driver of case density. It remains an open question whether the clustering results for these two states fairly represent the nation. Initially, the stretch goal was to train a supervised model and predict clusters for counties. However, this approach could be significantly flawed if the California and Florida clusters are not a good representation of the nation as a whole.
An alternative is to look at the clustering of the select features in regional areas of the country by grouping states. To allow for this flexibility in analysis, an interactive web page has been created to allow a researcher to select any two features in the dataset to cluster using K-Means, plot the clusters, and display a choropleth cluster map.
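The core of such a tool, clustering any two selected features for a chosen group of states, might look like the sketch below. It is not the web application's code. The region-to-state mapping, the state column, and all sample values are assumptions, since the source does not list which states belong to each region.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative region groupings only; the study's actual lists are not given.
REGIONS = {
    "Mid-Atlantic": ["DE", "MD", "NJ", "VA"],
    "Northeast": ["MA", "ME", "VT"],
}

def cluster_region(counties, region, x_col, y_col, k, seed=0):
    """Cluster the counties of one region on two user-selected features."""
    for col in (x_col, y_col):
        if col not in counties.columns:
            raise KeyError(f"unknown feature: {col}")
    in_region = counties["state"].isin(REGIONS[region])
    # Drop counties missing either selected feature before scaling.
    subset = counties[in_region].dropna(subset=[x_col, y_col]).copy()
    scaled = StandardScaler().fit_transform(subset[[x_col, y_col]])
    subset["cluster"] = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(scaled)
    return subset

# Made-up county rows.
counties = pd.DataFrame({
    "state": ["VA", "VA", "NJ", "NJ", "MA", "VT"],
    "median_income": [40_000, 42_000, 90_000, 95_000, 80_000, 50_000],
    "vax_pct": [15.0, 18.0, 55.0, 60.0, 65.0, 35.0],
})
result = cluster_region(counties, "Mid-Atlantic", "median_income", "vax_pct", k=2)
print(result)
```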
Example Run
Compare Income and Vaccination Percentage Clusters for Mid-Atlantic and Northeast States
Select Income and Vax Pct as cluster features
Show income on logarithmic scale
Resulting scatter plot shows All US Counties
Shows general increase in county vaccination percentages as income increases
Select focus regions to look at more closely...
Mid-Atlantic
Virginia counties show poor vaccination percentage regardless of median income and case density
Northeast
Vaccinations are good overall and, with the exception of one outlier county, the vaccination percentage increases with income level
Run - Mid-Atlantic Region States
Cluster Analysis
Cluster 0 - represents low to mid income counties with vaccinated population between 20% and 50%
Cluster 1 - represents mid to upper income counties with vaccinated population mostly in 40% to 60% range
Cluster 2 - represents upper income counties with centroid vaccinated population at 50% (note some outliers at less)
Cluster 3 - represents low to mid income counties with vaccinated population below 20%
Cluster 4 - represents mid to upper income counties with vaccinated population below 40% (centroid a little over 10%)
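Cluster descriptions like those above can be backed by a per-cluster summary of income and vaccination ranges. The sketch below uses a pandas groupby with named aggregations; the column names and sample values are hypothetical.

```python
import pandas as pd

def profile_clusters(clustered, income_col="median_income",
                     vax_col="vax_pct", cluster_col="cluster"):
    """Summarize each cluster's size, income range, and vaccination range."""
    return clustered.groupby(cluster_col).agg(
        counties=(income_col, "size"),
        income_min=(income_col, "min"),
        income_max=(income_col, "max"),
        vax_min=(vax_col, "min"),
        vax_mean=(vax_col, "mean"),
        vax_max=(vax_col, "max"),
    )

# Made-up clustered counties.
clustered = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "median_income": [40_000, 45_000, 80_000, 90_000, 85_000],
    "vax_pct": [25.0, 45.0, 50.0, 55.0, 60.0],
})
profile = profile_clusters(clustered)
print(profile)
```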
Run Interpretation
Clusters 3 and 4 should be the focus area for enhanced vaccination efforts; these are the more rural counties in VA
Choropleth shows I-95 corridor counties from northern VA through NJ have highest income and vaccination percentages
Run - Northeast Region States
Cluster Analysis
Cluster 0 - represents upper income counties with vaccinated population above 50%
Cluster 1 - represents low to mid income counties with vaccinated population between 30% and 40%
Cluster 2 - represents low to mid income counties with vaccinated population between 40% and 55%
Cluster 3 - represents an outlier county in Massachusetts with high income, low vaccination percentage, and case density
Cluster 4 - represents mid to upper income counties with vaccinated population between 50% and 70%
Run Interpretation
Cluster 1 counties should be focus area for enhanced vaccination efforts
Choropleth shows Cluster 1 concentrated in northern Vermont
Choropleth shows inland Maine counties highly represented in Cluster 2, which also would benefit from enhanced vaccination efforts
Returning to the hypothesis questions the following findings surfaced from this study:
Have wealthy counties been impacted differently by COVID-19?
Neither the correlation nor the clustering reflected wealth having an impact on COVID-19 case density. However, county wealth does appear to impact the vaccination percentage of the population. The K-Means clustering runs show that both the scaling and the cluster meanings will vary in different regions of the country. This was demonstrated with the K-Means web application in the section above.
Has county population density played a role in COVID-19?
The K-Means clustering reflected the strong linear correlation of population density and COVID-19 case density. Basically, the study shows case density increases with higher county population density. Population density (case density was actually used) did not appear as a major influencing factor in the bubble charting showing the effect of wealth on vaccination percentage.
Have county wealth and population density together influenced COVID-19 infections and vaccinations?
This question was not answered directly through a K-Means clustering as part of this study and was left as a future exercise. An enhancement could be made to the K-Means web application to allow selection of three features from the data set. Scenarios to investigate would be clustering on population density, median income, and case density as well as population density, median income, and vaccination percentage.
An overall takeaway from this study is that K-Means clustering proved useful to group counties having similar attributes regarding the features studied in the dataset (i.e., median income, population, cases, vaccinations, density, etc.). It became apparent after crafting many Jupyter notebooks that dynamic clustering and selection of these features to address specific questions is valuable.
The development of the K-Means web application provided a flexible platform for completing the analysis and machine learning steps needed for this study. This tool has value beyond the study and could be used to address "what if" scenarios, benefiting public health workers as they react to changes in COVID-19 cases and vaccinations throughout the pandemic.
[1] Abdullah, D., Susilo, S., Ahmar, A. S., Rusli, R., and Hidayat, R. “The Application of K-Means Clustering for Province Clustering in Indonesia of the Risk of the COVID-19 Pandemic Based on COVID-19 Data.” Quality & Quantity, U.S. National Library of Medicine, https://pubmed.ncbi.nlm.nih.gov/34103768.
[2] Sengupta, Pooja, et al. “An Analysis of COVID-19 Clusters in India.” BMC Public Health, BioMed Central, 31 Mar. 2021, https://bmcpublichealth.biomedcentral.com/articles/10.1186/s12889-021-10491-8.
[3] Wu, J., and Sha, S. “Pattern Recognition of the COVID-19 Pandemic in the United States: Implications for Disease Mitigation.” International Journal of Environmental Research and Public Health, 18(5), Mar. 2021, DOI: 10.3390/ijerph18052493.
[4] alifia2. “Centroid Based Clustering : A Simple Guide with Python Code.” Analytics Vidhya, 27 Jan. 2021, www.analyticsvidhya.com/blog/2021/01/a-simple-guide-to-centroid-based-clustering-with-python-code/.
[5] Girgin, Samet. “K-Means Clustering Model in 6 Steps with Python.” Medium, PursuitData, 26 July 2020, https://medium.com/pursuitnotes/k-means-clustering-model-in-6-steps-with-python-35b532cfa8ad.
[6] “Use Sklearn StandardScaler() Only on Certain Feature Columns.” ThiscodeWorks, www.thiscodeworks.com/use-sklearn-standardscaler-only-on-certain-feature-columns-python/605cc55c3c8db10014203c0e.
Linkedin Profile - https://www.linkedin.com/in/ken-noppinger-b0020a
GitHub Repo - https://github.com/knoppin1/DATA-606