What is the issue? Why is it important?
Chronic diseases and other conditions are growing at an alarming rate in the United States. Diabetes, for example, grew from 5.5 million cases in 1980 to 12.1 million cases in 2000. In many cases, adequate access to healthcare and the use of preventive health services can prevent illness and substantially reduce morbidity and mortality. Unfortunately, healthcare in the United States is distributed inequitably, and as a result, so is illness. Improving the healthcare system and ensuring it is distributed equitably requires policymakers to identify, prioritize, and decide how to address the social and economic problems that lead to inequity in healthcare.
This project aims to visualize the distribution of healthcare and illness in the United States to support national efforts in addressing the issues stated above. The main purpose is to create an interactive dashboard that displays how states compare to each other in terms of illness and access to healthcare, and how the counties of each state compare on the same aspects. These findings will help decide where, and in what, to invest the most at both the federal and state levels.
The data used in this project was retrieved from the CDC website: https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/CHDI/chsi_dataset.zip (Link to download the dataset)
Below are the datasets used from the zip file:
• Demographics: 44 columns; 3141 rows
• Risk factors and access to healthcare: 31 columns; 3141 rows
• Preventive service use: 43 columns; 3141 rows
• Measures of birth and death: 141 columns; 3141 rows
After merging the datasets and selecting the needed columns, we have 34 columns and 3141 rows.
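The merge-and-select step can be sketched with pandas. This is only a minimal sketch, not the project's actual code: the small in-memory tables stand in for the CSV files in the zip, the helper `merge_tables` is hypothetical, and the column names (including the join keys `State_FIPS_Code` and `County_FIPS_Code`) are assumptions made for illustration.

```python
from functools import reduce

import pandas as pd

# Assumed join keys; adjust to the identifier columns in the real files.
KEYS = ["State_FIPS_Code", "County_FIPS_Code"]


def merge_tables(tables, keep_columns):
    """Inner-join county-level tables on KEYS, then keep the chosen columns."""
    merged = reduce(
        lambda left, right: left.merge(right, on=KEYS, how="inner"), tables
    )
    return merged[KEYS + keep_columns]


# Toy stand-ins for two of the CSV files (all values are made up).
demographics = pd.DataFrame({
    "State_FIPS_Code": [1, 1, 2],
    "County_FIPS_Code": [1, 3, 13],
    "Population_Size": [1000, 2500, 400],
    "Poverty": [10.0, 12.5, 8.0],
})
risk_factors = pd.DataFrame({
    "State_FIPS_Code": [1, 1, 2],
    "County_FIPS_Code": [1, 3, 13],
    "Uninsured": [120, 300, 60],
    "Prim_Care_Phys_Rate": [55.0, 70.2, 40.1],
})

county_data = merge_tables(
    [demographics, risk_factors],
    keep_columns=["Population_Size", "Uninsured", "Prim_Care_Phys_Rate"],
)
print(county_data.shape)
```

An inner join keeps only counties present in every table. Since each source table has the same 3141 rows, that choice should not drop any counties, but it is worth checking the row count after the merge.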
Source: https://www.americashealthrankings.org/explore/annual/measure/PCP/state/ALL
The America’s Health Rankings dashboard shows how the United States is doing as a whole on different measures (behavior, healthcare access, outcomes, chronic diseases, education, etc.), and how each state is doing on the same measures, along with its ranking compared to the other states.
The picture below shows an example for Virginia.
• This dashboard will focus on access to healthcare and diseases only.
• In addition to how states are doing compared to each other, I will also add how counties in each state are doing. As a result, we will have outcomes that show us where to invest not only at the federal level, but also at the state level.
• This dashboard will also include a ranking of diseases (most urgent to least urgent) to show which disease to tackle first in each state and in each county.
In this phase, we perform exploratory data analysis, showing several trends in the dataset. We also perform K-means clustering to group our data.
States with the lowest number of PCPs per 100,000 population:
• Missouri
• Oklahoma
• Texas
• Iowa
• Idaho
States with the highest number of PCPs per 100,000 population:
• District of Columbia
• Massachusetts
• Vermont
• New Hampshire
• Connecticut
States with the lowest % of uninsured:
• District of Columbia
• Rhode Island
• New Hampshire
• Iowa
• Massachusetts
States with the highest % of uninsured:
• Hawaii
• New Mexico
• Arizona
• Texas
• Montana
Our dataset does not have any labels, so we use K-means clustering, an unsupervised machine learning method, to group the counties. The results of the grouping are shown below.
Additional details are on the GitHub repository.
Below is a radar chart that shows the difference in features among the different clusters.
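The K-means grouping described above can be sketched with scikit-learn. This is an illustrative sketch on synthetic data, not the project's code. The feature matrix is made up, standardizing the features first is an assumption, and `n_clusters=4` is borrowed from the four county groups reported later in this write-up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the county feature matrix (rows = counties).
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(50, 3))
    for center in (0.0, 4.0, 8.0, 12.0)
])

# Put every feature on the same scale so no single column dominates
# the Euclidean distances that K-means relies on (an assumed step).
scaled = StandardScaler().fit_transform(features)

# Fit K-means and assign each row to one of four clusters.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)

# Number of rows assigned to each cluster.
print(np.bincount(labels))
```

The cluster centers in `kmeans.cluster_centers_` are the per-cluster feature means, which is the kind of per-cluster profile a radar chart can display.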
In addition to K-means, we fit two other models to our dataset to group our counties:
• K-means with Principal Component Analysis (PCA)
• Partition Around Medoids (PAM)
We then compared the models using the silhouette coefficient and used the results of the best model to build our dashboard in Tableau.
PCA performs dimensionality reduction. On our dataset, we reduced the 31 feature dimensions to 2 principal components (PC1 and PC2), which helped us better visualize our clusters (see plot on the left).
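The reduction from 31 features to two components, followed by K-means, can be sketched as follows. This is a hedged sketch on random data. Standardizing before PCA and clustering with `n_clusters=4` are assumptions, since the write-up does not state those details.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 200 "counties" with 31 features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 31))

# Standardize so each feature contributes comparably (assumed step).
scaled = StandardScaler().fit_transform(features)

# Project onto the first two principal components (PC1 and PC2).
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

# Cluster in the reduced 2-D space, which is also easy to scatter-plot.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(components)

print(components.shape)
print(pca.explained_variance_ratio_)
```

`explained_variance_ratio_` reports how much of the total variance PC1 and PC2 each retain, which helps judge how much information is lost by reducing to two dimensions.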
PAM is an unsupervised machine learning algorithm used to perform clustering. It is very similar to K-means. The main difference is that PAM uses medoids, which are always actual points in the dataset, while K-means uses centroids, which are cluster means and usually do not coincide with any actual data point.
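To make the medoid idea concrete, here is a simplified swap-based sketch of PAM in NumPy. It is not the implementation used in this project. It starts from random medoids instead of PAM's usual BUILD phase, accepts any swap that lowers the total cost, and the function name `pam` and the toy data are hypothetical.

```python
import numpy as np


def pam(points, k, max_iter=100, seed=0):
    """Simplified PAM: random initial medoids, then cost-reducing swaps."""
    n = len(points)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, size=k, replace=False)

    def total_cost(candidates):
        # Sum of distances from every point to its nearest medoid.
        return dist[:, candidates].min(axis=1).sum()

    best = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for slot in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                # Try replacing one medoid with a non-medoid point.
                trial = medoids.copy()
                trial[slot] = candidate
                cost = total_cost(trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
        if not improved:
            break  # no single swap lowers the cost any further
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels


# Two well-separated toy blobs of 20 points each.
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(20, 2)),
])
medoids, labels = pam(points, k=2)
print(points[medoids])  # each medoid is a row taken from the data
```

Because `medoids` holds row indices, every cluster representative is an observed data point. This is the property that distinguishes PAM from K-means.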
Since the ground truth labels are not known, evaluation must be performed using the model itself.
A higher Silhouette Coefficient indicates a model with better-defined clusters. Scores range from -1 to 1.
The PAM model had the best score, so we used its results in Tableau to build our dashboard.
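The model comparison described above can be sketched with scikit-learn's `silhouette_score`. This is an illustrative sketch on synthetic data, not a reproduction of the reported scores. It compares only two candidates, and scoring each model in the feature space it was fitted in is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the county features: three separated blobs.
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(50, 5))
    for center in (0.0, 4.0, 8.0)
])

# Candidate 1: K-means on the original features.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Candidate 2: K-means on the first two principal components.
reduced = PCA(n_components=2).fit_transform(features)
pca_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)

# Score each labeling in the space the model was fitted in (assumption).
scores = {
    "kmeans": silhouette_score(features, kmeans_labels),
    "kmeans_pca": silhouette_score(reduced, pca_labels),
}

# Pick the model with the highest silhouette coefficient.
best_model = max(scores, key=scores.get)
print(scores, best_model)
```

Labels from any other clustering model, such as PAM, can be added to the `scores` dictionary in the same way.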
Please note that this is a PowerPoint, a static version of my dashboard.
If you would like to access the interactive Tableau dashboard, you can download the .twb file by clicking the following link > Download. You will need access to Tableau to view it.
Partition Around Medoids (PAM) worked best to group our dataset.
Counties in Group 2 have the fewest Primary Care Physicians per 100,000 population, followed by counties in Group 1, Group 4, then Group 3.
Counties in Group 4 have the highest number of uninsured individuals.
Counties in Groups 1 and 2 have the highest rates of chronic diseases.
Coronary Heart Disease (CHD) is the leading cause of death in the US. It is a significantly higher cause of death in Group 1 (216 per 100,000) and Group 2 (217 per 100,000) than in Group 3 (171 per 100,000) and Group 4 (171 per 100,000).
Use of preventive services is not significantly different among our county groups.
Conclusion:
Counties in Groups 1 and 2 (mostly in the southern US) have fewer primary care physicians (PCPs) and more chronic diseases than counties in Group 3 (mostly on the West Coast and in the Northeast) and Group 4 (mostly in the Midwest). Counties in Groups 1 and 2 also have considerably more deaths due to coronary heart disease (CHD). The high CHD numbers are likely linked to the high rates of chronic diseases, as those lead to many other health issues. Therefore, to lower CHD numbers, it is important to manage chronic diseases in the US by educating the public and emphasizing the importance of healthy eating, physical activity, and a low-stress lifestyle. It is also important to improve access to healthcare. With the dashboard we created, we can now make targeted decisions about the distribution of PCPs based on each county's conditions or circumstances.
The counties we need to prioritize the most (in my opinion) are the ones in Groups 1 and 2, and the condition we need to focus on the most is coronary heart disease.