COVID-19 has had a severe effect on people around the world. To combat the disease, we need to know how the virus is spreading. With the right data, individuals and organizations can make better-informed decisions to keep people safe. Many open-source datasets are available for pandemic analytics.
I have performed a data analysis of COVID-19 data in Python. Combining it with other data can yield further insights and help raise public awareness. For this project I combined it with the World Happiness Report data. You can find the data here.
Let's start!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
from shapely.geometry import Point,Polygon
We need to install geopandas to plot the world map. To install it, run the following line in your Python notebook: pip install geopandas. Similarly, install shapely for plotting geometric points: pip install Shapely.
confirm_data= pd.read_csv('covid19_Confirmed_dataset.csv') #read csv file
lat=[i for i in confirm_data['Lat']] #latitudes of the confirmed cases
lon=[i for i in confirm_data['Long']] #longitudes of the confirmed cases
death_data= pd.read_csv('covid data analysis/covid19_deaths_dataset.csv')
happiness_report= pd.read_csv('worldwide_happiness_report.csv')
confirm_data.drop(columns=['Province/State','Lat','Long'],axis=1,inplace=True)
confirm_agg= confirm_data.groupby('Country/Region').sum()
death_data.drop(columns=['Province/State','Lat','Long'],axis=1,inplace=True)
death_agg= death_data.groupby('Country/Region').sum()
useless=['Overall rank','Score','Generosity','Perceptions of corruption']
happiness_report.drop(useless,axis=1,inplace=True)
happiness_report.set_index("Country or region",inplace=True)
Here we dropped unnecessary features such as latitude, longitude and province/state (which had null values). From the happiness report we dropped a few columns that were not needed. To read a CSV file we write pd.read_csv('filename.csv'), and to drop columns we use dataframe.drop(columns=[list],axis=1,inplace=True). axis=1 states that we want to delete whole columns, and inplace=True applies the change to the data frame itself instead of returning a modified copy. Before dropping them, I stored the latitude and longitude values in lists so that the world map can be plotted later.
We grouped the data by country using groupby('attribute').sum(); other aggregations such as .mean() work the same way.
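As a minimal sketch of the drop and groupby steps described above, the snippet below builds a tiny made-up table in the same shape as the confirmed-cases file, drops the location columns, and sums the provinces of each country. The values and the names toy, toy_clean and toy_agg are illustrative and do not come from the dataset.

```python
import pandas as pd

# Made-up rows that mimic the layout of the confirmed-cases CSV.
toy = pd.DataFrame({
    'Province/State': ['Hubei', 'Beijing', None],
    'Country/Region': ['China', 'China', 'India'],
    'Lat': [30.97, 40.18, 21.0],
    'Long': [112.27, 116.41, 78.0],
    '1/22/20': [444, 14, 0],
    '1/23/20': [444, 22, 0],
})

# Without inplace=True, drop() returns a new frame and leaves `toy` unchanged.
toy_clean = toy.drop(columns=['Province/State', 'Lat', 'Long'])

# One row per country: the provinces are summed date by date.
toy_agg = toy_clean.groupby('Country/Region').sum()
print(toy_agg)
```

Returning a new frame, as done here, keeps the raw table available for later steps such as reading the coordinates.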
plt.figure(figsize=(10,4)) #adjusting figure size
ax =confirm_agg.loc['India'].plot(color='orange',label='India')
confirm_agg.loc['China'].plot(color='red',label='China')
confirm_agg.loc['Russia'].plot(color='blue',label='Russia')
confirm_agg.loc['United Kingdom'].plot(color='green',label='UK')
ax.tick_params(axis='x', colors='white') #colour the axis to adjust the dark theme
ax.tick_params(axis='y', colors='white')
plt.xlabel('DATE',color='white') #setting labels
plt.ylabel('No.Of Cases',color='white')
plt.legend()
plt.show()
confirm_agg.loc['India'].diff().plot(color='orange',label='Bharat')
confirm_agg.loc['China'].diff().plot(color='darkblue',label='China')
x=confirm_agg.loc['China'].diff().max()
plt.plot(22,x,'r^')
confirm_agg.loc['Russia'].diff().plot(color='deeppink',label='Russia')
plt.xticks(color='azure')
plt.yticks(color='azure')
plt.legend()
plt.show()
The pandas diff() method computes the first discrete difference of the elements along the given axis, that is, each value minus the previous one. A periods value can be passed to take the difference over a larger shift.
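To make the behaviour of diff() concrete, here is a small sketch on a hypothetical cumulative series. The numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical cumulative counts for four consecutive days.
cumulative = pd.Series([10, 15, 30, 32],
                       index=['2/1/20', '2/2/20', '2/3/20', '2/4/20'])

daily = cumulative.diff()             # each value minus the previous day's value
two_day = cumulative.diff(periods=2)  # each value minus the value two days earlier

print(daily.tolist())
print(two_day.tolist())
```

The first entry of daily is NaN because the first day has no earlier value to subtract. With periods=2 the first two entries are NaN.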
We observe that on 12 February 2020 there were 373 new cases in China, and within 24 hours the number shot up to 15,136!
df=pd.DataFrame(confirm_agg.loc['China']['2/11/20':'2/29/20'])
df['diff']=df.diff()
df.style.background_gradient(cmap='Dark2')
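The plotting code above marks the peak at a hard-coded x position of 22. As a hedged alternative, the position of the largest one-day jump could be looked up from the data with idxmax(). The series below is made up; in the notebook it would be confirm_agg.loc['China'].

```python
import pandas as pd

# Made-up cumulative series standing in for one country's row.
cumulative = pd.Series([100, 120, 150, 900, 950],
                       index=['day1', 'day2', 'day3', 'day4', 'day5'])

daily = cumulative.diff()
peak_value = daily.max()                    # size of the largest one-day jump
peak_label = daily.idxmax()                 # index label at which it occurs
peak_pos = daily.index.get_loc(peak_label)  # integer position of that label

print(peak_label, peak_pos, peak_value)
```

peak_pos and peak_value could then be passed to plt.plot in place of the literal 22 and x, so the marker follows the data if the date range changes.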
US, FRANCE and CHINA are on the top
geo=[Point(xy) for xy in zip(lon,lat)]
g=gpd.GeoDataFrame(geometry=geo)
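#note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0; on newer versions, download the Natural Earth file and pass its path to gpd.read_file
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))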
fig,ax=plt.subplots(figsize=(15,10))
world.plot(ax=ax,color='green')
g.plot(ax=ax,color='pink')
plt.title('Confirmed cases in the world',color='white')
plt.axis('off')
plt.show()
The goal of GeoPandas is to make working with geospatial data in Python easier. It combines the capabilities of pandas and shapely, providing geospatial operations in pandas and a high-level interface to multiple geometries to shapely. GeoPandas lets you easily do operations in Python that would otherwise require a spatial database such as PostGIS. Refer to the GeoPandas documentation for more details.
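As a small sketch of how pandas and shapely meet in GeoPandas, the snippet below turns a table of coordinates into a GeoDataFrame of points with gpd.points_from_xy. The coordinates are made up, and the EPSG:4326 reference system is an assumption (plain latitude/longitude) that the article does not state.

```python
import pandas as pd
import geopandas as gpd

# Made-up coordinates; in the article they come from the 'Lat' and 'Long' columns.
coords = pd.DataFrame({'Lat': [21.0, 35.0], 'Long': [78.0, 105.0]})

# points_from_xy expects x (longitude) first, then y (latitude).
points = gpd.GeoDataFrame(
    coords,
    geometry=gpd.points_from_xy(coords['Long'], coords['Lat']),
    crs='EPSG:4326',  # assumption: WGS 84 latitude/longitude
)
print(points.geometry.iloc[0])
```

Each entry of the geometry column is a shapely Point, so points.plot(ax=ax) can draw them on top of a world map in the same way as the GeoDataFrame built from a list of Point objects.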
country=[i for i in confirm_agg.index]
maximum_rate=[]
for i in country:
    maximum_rate.append(confirm_agg.loc[i].diff().max())
confirm_agg['maximum_rate']=maximum_rate
country=[i for i in death_agg.index]
max_death_rate=[]
for i in country:
    max_death_rate.append(death_agg.loc[i].diff().max())
death_agg['max_dead']=max_death_rate
corona_data=pd.DataFrame(confirm_agg["maximum_rate"])
death_data=pd.DataFrame(death_agg["max_dead"])
data=corona_data.join(happiness_report,how='inner')
data=death_data.join(data,how='inner')
We found the maximum daily increase in confirmed cases and in deaths for each country and stored each in a new column. Using join() we combined the three datasets. An inner join is the most common type of join. It returns a data frame containing only the rows whose keys appear in both inputs; here the key is the index, i.e. the country name. This is similar to the intersection of two sets.
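The two steps above can be sketched on toy data. The snippet computes the largest one-day increase per country without an explicit loop, by taking diff() across the date columns, and then inner-joins the result with a second table. All values, the country labels A to D, and the indicators frame are hypothetical.

```python
import pandas as pd

# Made-up cumulative counts: one row per country, one column per date.
cases = pd.DataFrame(
    {'1/22/20': [0, 5, 1], '1/23/20': [2, 9, 1], '1/24/20': [10, 11, 4]},
    index=['A', 'B', 'C'],
)

# Largest one-day increase per country: difference along the columns, then row-wise max.
max_increase = cases.diff(axis=1).max(axis=1).rename('maximum_rate')

# Made-up indicator table that lacks country 'C' and adds country 'D'.
indicators = pd.DataFrame({'GDP per capita': [1.2, 0.8, 1.0]}, index=['A', 'B', 'D'])

# The inner join on the index keeps only countries present in both tables.
joined = max_increase.to_frame().join(indicators, how='inner')
print(joined)
```

Countries C and D drop out because each appears in only one table. The same happens in the real data for any country whose name is spelled differently in the two sources.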
sns.heatmap(data.corr(),cmap='spring',annot=True, vmin=-1, vmax=1, center= 0)
plt.xticks(color='White')
plt.yticks(color='white')
plt.show()
That's it for now. The happiness report includes healthy life expectancy, a health measure that combines age-specific mortality with morbidity or health status to estimate the expected years of life in good health for persons at a given age. If we look at the plot, we observe that a few countries have a high life expectancy (LE) but a high death rate as well. One possible reason is that our COVID data covers only three months. In the beginning, a lack of awareness of COVID may have driven up both the death rate and the number of confirmed cases.