INTRODUCTION
Since its initial outbreak in late 2019, the COVID-19 pandemic has spread quickly around the world. The virus has had significant health and economic consequences around the globe, causing unprecedented levels of disruption and uncertainty and becoming the 7th deadliest pandemic in the history of the world.
This data analytics project aims to provide an in-depth analysis of the number of cases and deaths associated with the COVID-19 virus in all countries around the world, as well as to investigate the virus's prevalence rate, mortality rate, and fatality rate. To provide a comprehensive overview of the virus's impact, the report will rely on two data sources, including statistics on confirmed cases and deaths as well as population data. The report will provide insights into how the virus has affected various regions and populations by examining these data points.
The study will be a valuable resource for public health officials, policymakers, and researchers attempting to comprehend the virus's impact and devise effective interventions and policies to limit its spread.
For the purpose of this project, Microsoft Excel was utilized for the pre-processing and initial cleaning, SQL was utilized for further cleaning and analysis, and Tableau was used for the visualization.
DATA COLLECTION
The COVID-19 data was downloaded from the Our World in Data website, while the population data was downloaded from Kaggle.
They were extracted from their archives and saved on my local machine as an Excel spreadsheet file and a CSV file, respectively.
DATA PROCESSING (EXCEL)
At this stage, I needed to view the data to see what it contained, and to do this, I loaded the COVID-19 dataset into Microsoft Excel. There were 67 columns and 299,665 data entries containing COVID-19 information for every country in the world from the 1st of January 2020 until the 16th of March 2023.
I deleted the columns that were not needed for the purpose of the analysis and was left with the following columns:
Location
Date
newcases
newdeaths
Then, I loaded the population data into Microsoft Excel and changed the column names from ‘country’ and ‘2023_last_updated’ to ‘Location’ and ‘Population’ respectively.
DATA PROCESSING (SQL SERVER)
I loaded both datasets individually into SQL Server using the Import and Export Wizard:
After this was completed, I viewed the tables, then, I proceeded to delete all columns that had null values from the COVID table:
Then, I found the total number of COVID cases for each country using the GROUP BY statement:
I saved the result of this query as a new table and loaded it back into SQL Server.
DATA ANALYSIS (SQL SERVER)
At this point, I started to perform exploratory data analysis on the dataset to see some key insights that could be drawn from it. First, I found the countries with the highest and lowest number of cases using subqueries:
I repeated the process, but for the highest and lowest number of deaths:
Then, I had to find the total number of cases, deaths, and the total population of the world. To do this, I had to employ the SQL INNER JOIN statement:
Finally, I found the prevalence rate (per 10,000 people), the mortality rate (per 10,000 people), and the fatality rate (in percentage) using simple mathematical calculations:
INSIGHTS - TABLEAU
My exploration had now been completed, and it was time to draw insights from the data about how the COVID-19 pandemic has had and continues to have an impact on our world. The following other insights were found:
With data from 234 countries, the analysis reveals that 760,289,269 cases and 6,872,081 deaths have been recorded between January 3rd, 2020, and March 16th, 2023.
Being the 3rd largest population in the world and a major hub for international travel, it is unsurprising that the United States has been the hardest impacted by the virus, with over 102 million reported cases. With one of the highest testing capacities in the world, diagnosis of infected patients is also very easy.
North Korea and Turkmenistan, on the other hand, appear to have no documented instances, though this could be due to an intentional lack of data accessibility to the world.
The data suggests that there is a strong positive correlation between a country's population and the number of COVID-19 cases (r ≈ 0.70) and a moderate positive correlation between the population and the number of deaths (r ≈ 0.47) experienced by the country. This means that the higher the population of a country, the higher the tendency for the number of COVID-19 cases and deaths to increase as well.
The data indicates a strong positive correlation (r ≈ 0.76) between the number of confirmed COVID-19 cases and the number of deaths in a nation. This indicates that as the number of reported cases rises, so will the number of fatalities.
It is important to note that correlation does not always imply causation. Other factors, such as population age distribution, healthcare quality, and virus control measures, can all have an impact on the number of COVID-19 cases and deaths in a particular country.
Unfortunately, the United States also holds the unfortunate record for the most deaths, with 1,115,666 recorded. Being that COVID-19 is particularly deadly for older adults and those with underlying health conditions, the United States has a large population of older people and people living with chronic health conditions, which has contributed to a higher mortality rate. However, some nations, including the Falkland Islands and Vatican City, have reported no virus-related deaths (probably due to their relatively small population and strict quarantine measures), providing a ray of hope in these dark times.
According to the analysis, approximately 95 out of every 1,000 people worldwide have been infected with the virus since its emergence in 2020, with a mortality rate of approximately 9 out of every 10,000 people. The prevalence rate varies greatly by nation, with San Marino having the highest rate (7 out of every 10 individuals infected). Yemen, on the other hand, has the lowest prevalence rate, with only about 3 out of every 10,000 individuals infected.
The highest mortality rate is in Peru, where every 64 out of 1,000 people in the population have died from the virus, highlighting the importance of proper healthcare facilities and support during the pandemic.
Perhaps most worrisome is the fact that the countries with the highest fatality rates are mainly underdeveloped/developing countries, such as Yemen, where the fatality rate is a staggering 18.07%. This emphasizes the need for global support to ensure that all countries can provide the necessary healthcare facilities and resources to successfully fight the pandemic
CONCLUSION
The COVID-19 analysis was an eye-opening journey that honed my analytical abilities, allowing me to uncover hidden trends and patterns within the data. Amidst the project's excitement, the Tableau dashboard stood out as an exceptional tool for me to delve deeper into the data's dynamism and interactivity, uncovering profound insights that would have been unnoticeable otherwise.
This experience has proven to be a stimulating blend of learning and entertainment, enabling me to apply my analytical skills to a pressing global issue and granting me a comprehensive outlook on the pandemic's worldwide impact. This practical knowledge will undoubtedly come in handy for future projects and critical business decisions.
The full SQL query file can be found on GitHub using this link, and the interactive dashboard can be found on Tableau here.
Check out the dashboard below: