The aim of this project is to explore seasonal influenza data to glean insights regarding trends, risk factors (such as population, vaccine coverage, and vaccine effectiveness), and more. Where applicable, I will also determine the statistical significance of any trends or correlations. Finally, I will create an interactive dashboard geared towards healthcare professionals and policy makers. This dashboard will enable healthcare professionals and researchers to draw inferences and make decisions informed by their domain expertise, for example about preventative measures, preparedness, and outreach.
Part I Presentation Video
It is important to note that this presentation was made before COVID-19 was well understood or prevalent. I made comments comparing seasonal influenza to COVID-19 that are no longer true.
Dataset
The main data set I will be exploring is Influenza Laboratory-Confirmed Cases By County: Beginning 2009-10 Season [1]. The data set can be acquired from HealthData.gov [1], or from the original source, the New York State Department of Health (health.data.ny.gov) [2]. The dataset contains 62,286 rows and nine columns. The nine columns are: Season (flu season ranging from October through the following May), Region of New York State (region of lab-confirmed cases, such as Central, Western, Capital District, etc.), County (county in New York State, such as Madison County, NY), CDC Week (week number in season), Week Ending Date, Disease (influenza strain - A or B), Count (number of cases), County Centroid (map coordinates), and FIPS (Federal Information Processing Standard - five-digit FIPS code that uniquely identifies counties and county equivalents in the United States).
I plan to use other data sets to supplement my research. For example, I plan to use State Specific Influenza Vaccination Coverage (Centers for Disease Control, provided by Kaggle) [3] to analyze New York’s vaccine coverage over time, with respect to the number of flu cases and flu strains. I may also use data to determine county populations and overall vaccine effectiveness per year (provided by CDC) [4].
Rationale
Seasonal flu is deadly and widespread [5]. So far in the 2019-2020 flu season, an estimated 10,000 people have died and 180,000 have been hospitalized in the US [5]. Additionally, seasonal flu and pandemic flu are quite similar, except that pandemic flu is much more deadly. Seasonal flu data can therefore help us better understand and prepare for the next flu pandemic.
New York State is an ideal state to analyze because it has locations that range from highly populated to rural. New York State also has high influenza-like illness activity [5]. Lastly, plentiful and up-to-date data is available for this State.
Related Work and Literature Review
There are various data dashboards that visualize flu data [6-8] as well as the spread of other viruses [9]. From these dashboards, I learned what makes a dashboard intuitive, visually appealing, and easy to navigate. The literature focuses more on the importance of properly disseminating surveillance data, which optimizes a dashboard’s impact on health officials’ and policy makers’ decisions and actions [10-11]. For example, Cheng et al. developed and implemented an influenza surveillance dashboard that displayed intuitive figures from multiple surveillance data streams per panel [10]. Their dashboard was applied to influenza surveillance data in Hong Kong, while the dashboard proposed in this project will use data from New York State. The current New York State Flu Tracker Dashboard [6] only displays cases and trends in the more recent seasons (2016 - 2020). My dashboard will include all seasons (2010 - 2020). I will also attempt to include supplemental data in the dashboard, such as the vaccination rate of healthcare workers with patient contact [12], or overall vaccine effectiveness per year [13]. Because the developers of the NYS Flu Tracker dashboard do not provide their methods, the main challenge will be developing a dashboard that works as seamlessly as theirs. I will provide the methods and code for creating the dashboard, allowing others to replicate my work and make improvements. Publishing the methodology will also give transparency to users who seek a deeper understanding of the data and dashboard.
Preliminary Exploratory Data Analysis
The preliminary exploratory data analysis revealed that the sum of all confirmed influenza cases (influenza type A, B, and unspecified) was highest in the 2017-2018 flu season (128,247 cases). The 2018-2019 season had the second highest number of cases (107,805 cases). Earlier flu seasons (2009-2010 to 2016-2017) had fewer cases, with the highest being 64,765 cases in the 2016-2017 season. A variety of factors may contribute to this difference, including (but not limited to) virulence, vaccine effectiveness, and vaccination rates. Another important factor to consider is the number (or availability) of laboratory tests for the flu. The flu is often diagnosed based on symptoms alone, meaning there are more flu cases than those reported in this data set. Regardless, this preliminary exploration revealed one insight that will be represented in my interactive dashboard.
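The season totals described above can be obtained with a simple group-by. The following is a minimal sketch in pandas, not the exact code used in this project. The column names follow the dataset description, but the rows, counts, and "A"/"B" labels are made-up placeholders, and `season_totals` is a hypothetical helper name.

```python
import pandas as pd

# Toy rows with made-up counts (NOT real data); column names follow the
# dataset description, and the Disease labels are placeholders.
cases = pd.DataFrame({
    "Season": ["2016-2017", "2016-2017", "2017-2018", "2017-2018", "2017-2018"],
    "County": ["ALBANY", "BRONX", "ALBANY", "BRONX", "BRONX"],
    "Disease": ["A", "B", "A", "A", "B"],
    "Count": [10, 25, 40, 60, 15],
})


def season_totals(df: pd.DataFrame) -> pd.Series:
    """Sum confirmed cases over all counties, weeks, and flu types per season."""
    return df.groupby("Season")["Count"].sum().sort_values(ascending=False)


totals = season_totals(cases)
print(totals)
```

With the real dataset, the same function would be applied to the full table after loading it, and the first entry of the result would be the season with the most confirmed cases.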
Part II Presentation Video
Methods Overview
For more details, please refer to my presentation or my code provided on GitHub (Delivery 3 Link). A summary of my methods is as follows:
Created subsets for each influenza season.
Each resulting subset has 62 rows representing 62 unique counties, and the “Count” column containing the sum of all confirmed influenza cases for that county.
Created an interactive map visualizing the total number of confirmed influenza cases per county for each season.
Used Python/Jupyter Notebook (Pandas and NumPy)
For the map, I used Plotly Express:
Mapbox Choropleth Maps [14].
Based the code on the Plotly Express documentation [14]. The resulting map can be seen in fig. 1(a).
Added population data to each season subset to make an interactive map - illustrating county-level prevalence rates.
Extracted population data from “Annual Population Estimates for New York State and Counties: Beginning 1970” dataset [15]. Estimates are based on census counts (base population), intercensal and postcensal estimates [15].
Added the population data as a new column in my season subsets.
Defined and applied a function to calculate the prevalence rate of confirmed influenza cases (per 10,000 people) for each county in each subset.
Created an interactive map visualizing the prevalence rate of confirmed influenza cases (per 10,000 people) at a county-level.
Followed the same code format as before, but used the subset containing the prevalence rates.
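The aggregation and prevalence steps listed above can be sketched as follows. This is an illustrative version under stated assumptions rather than the project's actual code: the function names `county_totals` and `add_prevalence`, the `Population` column name, and all counts and populations in the toy tables are hypothetical. The only parts taken from the text are the per-county sum of `Count`, the join with population estimates, and the rate per 10,000 people.

```python
import pandas as pd


def county_totals(cases: pd.DataFrame, season: str) -> pd.DataFrame:
    """Return one row per county with the summed case count for one season."""
    subset = cases[cases["Season"] == season]
    return subset.groupby(["County", "FIPS"], as_index=False)["Count"].sum()


def add_prevalence(totals: pd.DataFrame, population: pd.DataFrame,
                   per: int = 10_000) -> pd.DataFrame:
    """Join population estimates on FIPS and compute cases per `per` residents."""
    merged = totals.merge(population, on="FIPS", how="left")
    merged["Prevalence"] = merged["Count"] / merged["Population"] * per
    return merged


# Toy inputs with made-up counts and populations (NOT real data).
cases = pd.DataFrame({
    "Season": ["2017-2018"] * 4,
    "County": ["ALBANY", "ALBANY", "BRONX", "BRONX"],
    "FIPS": ["36001", "36001", "36005", "36005"],
    "Count": [40, 20, 100, 40],
})
population = pd.DataFrame({
    "FIPS": ["36001", "36005"],
    "Population": [300_000, 1_400_000],
})

season_subset = add_prevalence(county_totals(cases, "2017-2018"), population)
print(season_subset)
```

In this toy example the county with the larger raw count ends up with the lower rate, which is the kind of reversal the prevalence map is meant to expose. The resulting table has one row per county, so its `FIPS` and `Prevalence` columns are in the shape a county-level choropleth expects.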
Results/Figures
Figure 1. Side-by-side comparison of the two density maps from the 2017-2018 influenza season. The density map illustrating the total count of confirmed cases (a) is quite different from the map illustrating the prevalence rate of confirmed influenza cases per 10,000 people (b).
Discussion/Conclusion
After I created interactive maps for all influenza seasons (visualizing the sum of all confirmed influenza cases per county - fig. 1(a)), it was brought to my attention that looking at the rate of confirmed cases relative to population might show something different and more meaningful. I defined a function to calculate the prevalence rate for each row (based on census-based population estimates [15]), and added this new column to all of the seasonal subsets. I chose to express the prevalence rate per 10,000 residents (rather than 100,000), because some counties had populations of fewer than 50,000 people. After visualizing the prevalence rates across all counties for the 2017-2018 season, the counties that once seemed to be concerning (Queens, Bronx, etc.) had relatively low prevalence rates in comparison to many counties clustered in the center of the state (fig. 1(b)). Table 1 shows a brief comparison between the sum and prevalence rate of laboratory-confirmed influenza for a few counties of interest.
The previous goal of my project was to make an interactive dashboard, but the focus has now shifted to answering some questions. I will try to determine whether there is a trend in the prevalence rates over all the seasons (in each county). I will also try to use inferential statistics to identify what the relationship may be. Although my dataset has few features that could help identify factors affecting each county, it does include the flu type (A or B). Influenza type A viruses are the only influenza viruses known to cause pandemics because of their ability to change in two ways (antigenic drift and shift) rather than one [16]. Thus, I want to see if there is a relationship between the number of cases of each type of influenza (type A and B) and the prevalence rates in counties. My hypothesis is: If there are more influenza type A cases, the prevalence rate will be higher. This is because research indicates that influenza type A viruses are responsible for approximately 75% of confirmed influenza cases [17]. It will be interesting to see if New York State reflects a similar trend.
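One possible way to start examining this hypothesis is to compute each county's share of type A cases and correlate it with the county's prevalence rate. The sketch below uses a Pearson correlation purely as an illustration; the statistical test has not been decided in this write-up, the column names `A_cases`, `B_cases`, and `Prevalence` are hypothetical, and every number in the toy table is made up.

```python
import numpy as np
import pandas as pd

# Toy county-level table with made-up values (NOT real data).
counties = pd.DataFrame({
    "County": ["C1", "C2", "C3", "C4"],
    "A_cases": [50, 60, 70, 80],
    "B_cases": [50, 40, 30, 20],
    "Prevalence": [10.0, 13.0, 12.0, 18.0],  # cases per 10,000 people
})

# Fraction of confirmed cases in each county that were type A.
counties["A_share"] = counties["A_cases"] / (
    counties["A_cases"] + counties["B_cases"]
)

# Pearson correlation between type A share and prevalence rate.
r = np.corrcoef(counties["A_share"], counties["Prevalence"])[0, 1]
print(f"Pearson r between type A share and prevalence: {r:.2f}")
```

A positive coefficient would be consistent with the hypothesis, but with only 62 counties per season and no adjustment for testing availability or vaccination rates, it would be a starting point rather than evidence of a causal relationship.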
Part III Presentation Video
Background image for title and final slide is from: https://www.nfid.org/infectious-diseases/influenza-flu/
References Cited
1. HealthData.gov. Influenza Laboratory-Confirmed Cases By County: Beginning 2009-10 Season. (2020).
2. New York State Department of Health. Influenza Activity, Surveillance, and Reports. (2020).
3. Centers for Disease Control and Prevention. State Specific Influenza Vaccination Coverage. (2019).
4. Centers for Disease Control and Prevention. Vaccine Effectiveness Studies. (2019).
5. Centers for Disease Control and Prevention. Weekly U.S. Influenza Surveillance Report. (2020).
6. NYS Health Connector, “New York State Flu Tracker”, 2020. [Online], Available: https://nyshc.health.ny.gov/web/nyapd/new-york-state-flu-tracker. [Accessed Feb.18, 2020].
7. FluView Interactive, “National, Regional, and State Level Outpatient Illness and Viral Surveillance”, Centers For Disease Control and Prevention, 2020. [Online], Available: https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html. [Accessed Feb. 18, 2020].
8. L. VanWhy, and P. Galebach, “The athenaInsight Flu Dashboard,” athenahealth Inc., September 20, 2019. [Online], Available: https://www.athenahealth.com/insight/flu-dashboard-2017-2018 [Accessed Feb. 18, 2020].
9. L. Gardner, “Mapping 2019-nCoV”, Johns Hopkins Whiting School of Engineering | Center for Systems Science and Engineering, January 23, 2020. [Online], Available: https://systems.jhu.edu/research/public-health/ncov/. [Accessed Feb. 18, 2020]. Full dashboard Available: https://gisanddata.maps.arcgis.com. [Accessed Feb. 14, 2020].
10. C. Cheng, D. Ip, B. Cowling, L. Ho, and E. Lau, “Digital dashboard design using multiple data streams for surveillance with influenza surveillance as an example”, J Med Internet Res. 2011 Oct-Dec; 13(4): e85. Published online 2011 Oct 14, doi: 10.2196/jmir.1658. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3222192/. [Accessed Feb. 19, 2020].
11. S. Hamid, L. Bell, and E. Dueger, “Digital dashboards as tools for regional influenza monitoring”, WPSAR Vol 8, No 3, 2017, doi: 10.5365/wpsar.2017.8.2.003.
12. Health Data NY, “Influenza Vaccination Rates for Health Care Personnel: Beginning 2012-13”, 2019. [Online Dataset], Available: https://health.data.ny.gov/Health/Influenza-Vaccination-Rates-for-Health-Care-Person/jpkp-z76p. [Accessed Feb. 19, 2020].
13. Centers for Disease Control and Prevention. “Vaccine Effectiveness Studies”, 2019. [Online], Available: https://www.cdc.gov/flu/vaccines-work/effectiveness-studies.htm. [Accessed January 28, 2020].
14. Plotly Graphing Libraries, “Mapbox Choropleth Maps in Python”, 2020. [Online], Available: https://plotly.com/python/mapbox-county-choropleth/. [Accessed March 15, 2020].
15. New York State Department of Labor. “Annual Population Estimates for New York State and Counties: Beginning 1970”, 2020. Data.ny.gov [Online]. Available: https://data.ny.gov/Government-Finance/Annual-Population-Estimates-for-New-York-State-and/krt9-ym2k. [Accessed March 30, 2020].
16. Centers for Disease Control and Prevention, “How Flu Viruses Can Change”, 2019. [Online], Available: https://www.cdc.gov/flu/about/viruses/change.htm. [Accessed March 29, 2020].
17. M. Nyirenda, R. Omori, H. Tessmer, H. Arimura, and K. Ito, “Estimating the Lineage Dynamics of Human Influenza B Viruses”, PLoS One. 2016;11(11): e0166107. [Online], Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5102436/. [Accessed April 3, 2020].