Data Quality :
Organization: The data is structured into two CSV files, one for cases and the other for vaccinations, ensuring clarity and organization.
Common Columns: Both files share common columns such as continent and country, enabling seamless integration for analysis purposes.
Cleaning: Prior to analysis, the data underwent cleaning to remove null values and duplicates in the continent and country columns, enhancing accuracy.
Filtering: Null and repeated values were filtered out during analysis using SQL and Tableau, maintaining consistency and reliability.
Documentation: Links to the cleaned data sheets are provided to ensure transparency and reproducibility in analysis. Access the cleaned data sheets here.
Analysis Methodology as follows:
Data Acquisition: Downloaded Our World in Data's COVID-19 dataset in XLSX format.
Initial Assessment: Reviewed the dataset in Excel, focusing on pertinent columns related to COVID cases and vaccinations.
Column Selection: Identified specific columns relevant to the project objectives for simplicity and focus.
Data Segmentation: Split the dataset into two CSV files: one for COVID cases and another for vaccinations.
Common Columns: Ensured consistency by including common columns (continent, location, population, date) in both files for seamless analysis.
Inconsistency Resolution: Addressed inconsistencies where the continent column was null while location contained continent names.
Data Integrity: Deleted rows with null continent values to prevent duplication issues during analysis.
Final Review: Conducted a thorough check to ensure data consistency and cleanliness.
Further Refinement: Utilized SQL (BigQuery) for additional cleaning and analysis as needed.
In the following SQL queries, I conducted analyses on the COVID-19 dataset to extract insights at both country and continent levels. These queries provide valuable information regarding the overall COVID-19 situation based on total cases, deaths, population, and percentage of population infected.
Country-Based Overall Data: This query aggregates COVID-19 data at the country level, providing insights into the total population, confirmed cases, total deaths, and percentage of population infected. It filters out null continent values to ensure data accuracy.
Continent-Based Overall Data: This query aggregates COVID-19 data at the continent level, showcasing the population, total cases, and total deaths for each continent. It excludes null continent values to ensure consistency in the analysis.
**All codes mentioned below right image**
These SQL queries delve into various aspects of the COVID-19 dataset, exploring the relationship between total cases, total deaths, population, and infection rates across different countries. By filtering data country-wise, these queries offer insights into the impact of the pandemic on individual nations.
Total Case vs Total Death:
This query compares the total number of COVID-19 cases with the total number of deaths recorded over time.
It calculates the death percentage to understand the severity of the pandemic in each location.
The coalesce() function ensures that null values are replaced with 0 for better visualization.
Population vs Total Cases:
This query examines the relationship between a country's population and the total number of COVID-19 cases.
It calculates the infection rate, representing the percentage of the population infected by the virus.
The results provide insights into the scale of the outbreak relative to the population size.
Population vs Highest Total Cases:
This query identifies the country with the highest recorded total cases of COVID-19.
It compares the population of each country with its highest infection count and calculates the corresponding infection rate.
The results highlight countries with the most significant impact of the pandemic relative to their population size, aiding in understanding the spread and severity of COVID-19 outbreaks globally.
**All codes mentioned below left image**
In the following SQL queries, I've analyzed the COVID-19 dataset to identify countries and continents with the highest death counts, as well as to explore global case-death ratios. These insights are crucial for understanding the severity of the pandemic's impact at both country and continent levels.
Countries with Highest Death Count:
This query provides a list of countries with the highest death counts due to COVID-19.
It filters out null continent values to ensure data accuracy.
Results are grouped by country and ordered based on the highest death count, revealing the most severely affected countries.
Continent-Based Overall Data:
This query determines the continent with the highest death count attributed to COVID-19.
It filters out null continent values to ensure data integrity.
Results are grouped by continent and ordered by the highest death count, providing insights into the continents most impacted by the pandemic.
Global Case-Death Data with Ratio:
This query computes the case-death ratio for each continent, providing insights into the severity of COVID-19.
It includes total cases, total deaths, and the calculated death ratio, rounded for clarity.
Results are grouped by continent and ordered by the total death count, offering a comprehensive overview of the pandemic's impact on a global scale.
**All codes mentioned below right image**
The following SQL queries analyze the global COVID-19 dataset to provide insights into the total number of new cases and new deaths. By examining these metrics both cumulatively and by date, we gain a comprehensive understanding of the pandemic's progression and its impact over time.
Global Total New Cases & New Deaths by Date:
This query aggregates data to calculate the total number of new cases and new deaths globally for each date.
The coalesce() function ensures that null values are replaced with 0 for visualization clarity.
It also computes the death percentage, representing the proportion of new deaths relative to new cases on each date.
Global Total New Cases & New Deaths:
This query provides the cumulative total number of new cases and new deaths globally.
It computes the death percentage, representing the proportion of new deaths relative to new cases globally.
The results offer a comprehensive overview of the pandemic's impact, aiding in understanding its severity and progression.
**All codes mentioned below left image**
The following SQL queries focus on joining two tables, Case_data and Vacci_data, to analyze the relationship between total population and vaccination data. By examining this relationship, we gain insights into vaccination coverage and its progression over time, essential for understanding the pandemic's trajectory and response efforts.
Joining Tables Case_data and Vacci_data:
This query combines data from the Case_data and Vacci_data tables to compare total population with vaccination statistics.
The coalesce() function ensures that null values are replaced with 0 for visualization clarity.
Results are ordered by location and date to facilitate analysis and interpretation.
Using Common Table Expression (CTE) for Vaccination Percentage:
This query employs a Common Table Expression (CTE) named popVSvac to calculate vaccination percentages by date.
The coalesce() function ensures that null values are replaced with 0 for visualization clarity.
The percentage of population vaccinated is computed based on the rolling vaccination population divided by the total population, providing insights into vaccination coverage over time.
**All codes mentioned below right image**
For access to the detailed documentation and interactive dashboard related to this project, please visit - Summary doc , SQL queries & Tableau report, Datasets
These insights demonstrate strong analytical skills and a comprehensive understanding of the pandemic's dynamics, guiding future research and decision-making in healthcare and public policy.
Global Mortality Rates:
Developed nations show declining mortality rates, reflecting effective healthcare systems and rapid responses.
Less developed countries face challenges with higher mortality rates due to economic disparities and limited resources.
Seasonal Patterns in COVID-19 Cases:
Analysis reveals seasonal spikes in cases, particularly during winter, emphasizing the importance of seasonal preparedness.
Europe and parts of North America experience severe outbreaks during these periods.
Global Pandemic Trends:
2023 marked the pandemic's peak, but by 2024, efforts led to significant improvements, showcasing healthcare system adaptability.
Impact on High-Income Countries:
Despite economic affluence, the US faces challenges, highlighting the complex interplay of socioeconomic factors and healthcare infrastructure.
Vaccination Success Stories:
India and China lead in vaccination efforts, leveraging their large populations effectively.
Rapid global deployment, with China leading in vaccine administration followed by India.
Disparities in Mortality Rates:
Significant disparities exist across regions, with Yemen and African countries reporting the highest rates.
In conclusion, my portfolio project on COVID-19 data analysis has unearthed profound insights into the global pandemic landscape. From discerning seasonal trends in case occurrences to shedding light on disparities in mortality rates, this analysis offers invaluable knowledge for decision-makers navigating the complexities of public health crises.
Driven by a fervent dedication to harnessing the power of data analysis, I am committed to fostering a deeper understanding of intricate issues and empowering informed decision-making processes. This project has ignited my passion for delving into data-driven exploration further, and I eagerly anticipate the opportunities to continue learning and refining my skills in this dynamic field.
Thank you for taking the time to explore my project, and I am grateful for your interest in the insights it has unveiled.