ANALYTICS ZONE - COVID-19 DATA ANALYSIS

COVID-19 DATA ANALYSIS

COVID-19 has had a severe effect on people around the world. To combat the disease, we need to know how the virus is spreading. With the right data, individuals and organizations can make informed decisions to keep people safe. Many open-source datasets are available for pandemic analytics.

I have performed a data analysis of COVID-19 data in Python. Combining this data with other datasets can yield more insights and help raise public awareness. For this project I have combined it with the World Happiness Report data. You can find the data here.

Let's start!

Import Useful Libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

import geopandas as gpd

from shapely.geometry import Point,Polygon

We need to install geopandas to plot the world map. To install it, run the following line in your Python notebook: pip install geopandas. Similarly, install shapely for plotting geometric points: pip install Shapely.

Load the data and clean it

confirm_data= pd.read_csv('covid19_Confirmed_dataset.csv') #read csv file

lat=[i for i in confirm_data['Lat']] #latitude of confirmed cases

lon=[i for i in confirm_data['Long']] #longitude of confirmed cases

death_data= pd.read_csv('covid data analysis/covid19_deaths_dataset.csv') #read the deaths csv file

happiness_report= pd.read_csv('worldwide_happiness_report.csv')

confirm_data.drop(columns=['Province/State','Lat','Long'],axis=1,inplace=True)

confirm_agg= confirm_data.groupby('Country/Region').sum()

death_data.drop(columns=['Province/State','Lat','Long'],axis=1,inplace=True)

death_agg= death_data.groupby('Country/Region').sum()

useless=['Overall rank','Score','Generosity','Perceptions of corruption']

happiness_report.drop(useless,axis=1,inplace=True)

happiness_report.set_index("Country or region",inplace=True)

Here we have dropped unnecessary features such as latitude, longitude and province/state, as they had null values. In the happiness report we have dropped a few columns that were not necessary. To read a csv file we write pd.read_csv('filename.csv'), and to drop columns we use dataframe.drop(columns=[list],axis=1,inplace=True). axis=1 states that we want to delete whole columns rather than rows. The inplace=True argument makes the change on the data frame itself instead of returning a modified copy. I have stored latitude and longitude in lists before dropping them so that we can plot the world map later.

We have grouped the data by country using groupby('attribute') followed by an aggregation such as .sum() or .mean().
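To make the effect of inplace=True concrete, here is a minimal sketch on a toy data frame. The values are made up and the frame only imitates the shape of the confirmed-cases file; it is not the real dataset.

```python
import pandas as pd

# Toy frame with made-up values, shaped like the confirmed-cases file.
df = pd.DataFrame({
    "Province/State": [None, "Hubei"],
    "Country/Region": ["India", "China"],
    "Lat": [21.0, 30.9],
    "Long": [78.0, 112.2],
    "1/22/20": [0, 444],
})

unwanted = ["Province/State", "Lat", "Long"]

# Without inplace=True, drop() returns a new frame and leaves df untouched.
trimmed = df.drop(columns=unwanted)
columns_before = list(df.columns)

# With inplace=True, df itself is modified and drop() returns None.
result = df.drop(columns=unwanted, inplace=True)

print(list(trimmed.columns))
print(list(df.columns))
print(result)
```

Both styles end up with the same columns; the difference is only whether the original variable is changed.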
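As an illustration of the grouping step, the sketch below sums two province rows into one country row. The numbers are invented for the example and do not come from the dataset.

```python
import pandas as pd

# Made-up counts: two provinces of one country and a single row for another.
rows = pd.DataFrame({
    "Country/Region": ["China", "China", "India"],
    "1/22/20": [444, 14, 0],
    "1/23/20": [444, 22, 0],
})

# One row per country; each date column holds the sum over its provinces.
by_country = rows.groupby("Country/Region").sum()

print(by_country)
```

After grouping, the country name becomes the index, which is why later code can write confirm_agg.loc['India'].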

Plot the rise in cases of India with respect to China, Russia and the UK

The graph shows how cases started to rise from January onwards, but it does not show the peak value, which we need in order to get the rate of spread of the virus. For example, cases in China were increasing rapidly from January to March and then levelled off or might have decreased. So we need to find the date on which the cases were at their peak.

plt.figure(figsize=(10,4)) #adjusting figure size

ax =confirm_agg.loc['India'].plot(color='orange',label='India')

confirm_agg.loc['China'].plot(color='red',label='China')

confirm_agg.loc['Russia'].plot(color='blue',label='Russia')

confirm_agg.loc['United Kingdom'].plot(color='green',label='UK')

ax.tick_params(axis='x', colors='white') #colour the axis to adjust the dark theme

ax.tick_params(axis='y', colors='white')

plt.xlabel('DATE',color='white') #setting labels

plt.ylabel('No.Of Cases',color='white')

plt.legend()

plt.show()

Plot the rate of spread of the virus across the countries

We observe that China reached its peak in one day. On 12 Feb 2020 the cases were below 2,000, and on 13 Feb 2020 the cases shot up within 24 hours.

confirm_agg.loc['India'].diff().plot(color='orange',label='Bharat')

confirm_agg.loc['China'].diff().plot(color='darkblue',label='China')

x=confirm_agg.loc['China'].diff().max()

plt.plot(22,x,'r^')

confirm_agg.loc['Russia'].diff().plot(color='deeppink',label='Russia')

plt.xticks(color='azure')

plt.yticks(color='azure')

plt.legend()

plt.show()

Pandas dataframe.diff() finds the first discrete difference of elements along the given axis. The periods argument sets how many positions to shift when forming the difference.

We observe that on 12 Feb 2020 there were 373 new cases in China; within 24 hours the daily count shot up to 15,136!
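To show what diff() does to a cumulative count, here is a small sketch on a made-up series. The labels d1 to d5 and the numbers are placeholders, not real case counts.

```python
import pandas as pd

# Made-up cumulative totals over five days.
cumulative = pd.Series([10, 15, 15, 40, 100],
                       index=["d1", "d2", "d3", "d4", "d5"])

# Default periods=1: each value minus the previous one (daily new cases).
daily = cumulative.diff()

# periods=2: each value minus the value two positions earlier.
two_day = cumulative.diff(periods=2)

print(daily)
print(two_day)
```

The first entry of daily is NaN because there is no earlier value to subtract, and two_day starts with two NaN entries for the same reason.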

df=pd.DataFrame(confirm_agg.loc['China']['2/11/20':'2/29/20'])

df['diff']=df.diff()

df.style.background_gradient(cmap='Dark2')
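The marker in the earlier plot was placed with a hard-coded position, plt.plot(22,x,'r^'). As a hedged alternative, the peak date can be looked up with idxmax() instead of being typed in by hand. The series below uses invented numbers and only five dates; peak_date, peak_value and peak_position are names introduced for this sketch.

```python
import pandas as pd

# Made-up cumulative counts with date labels in the dataset's format.
china = pd.Series(
    [100, 130, 170, 900, 950],
    index=["2/10/20", "2/11/20", "2/12/20", "2/13/20", "2/14/20"],
)

daily = china.diff()

peak_date = daily.idxmax()    # label of the largest daily increase
peak_value = daily.max()      # size of that increase
peak_position = daily.index.get_loc(peak_date)  # position on the x axis

print(peak_date, peak_value, peak_position)
```

With the position computed this way, plt.plot(peak_position, peak_value, 'r^') would put the marker on the peak without a hard-coded number.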

Similarly, we plot the death cases.

Maximum death cases

The US, France and China are at the top.

Maximum rate of spread

The US, France and China are at the top.

Plotting world map

geo=[Point(xy) for xy in zip(lon,lat)]

g=gpd.GeoDataFrame(geometry=geo)

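world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres')) #built-in low-resolution world map; gpd.datasets was removed in geopandas 1.0, so this line needs an older version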

fig,ax=plt.subplots(figsize=(15,10))

world.plot(ax=ax,color='green')

g.plot(ax=ax,color='pink')

plt.title('confirm cases in the world',color='white')

plt.axis('off')

plt.show()

The goal of GeoPandas is to make working with geospatial data in Python easier. It combines the capabilities of pandas and shapely, providing geospatial operations in pandas and a high-level interface to multiple geometries for shapely. GeoPandas lets you easily do operations in Python that would otherwise require a spatial database such as PostGIS.

Combining happiness report with covid data set

country=[i for i in confirm_agg.index]

maximum_rate=[]

for i in country:
    maximum_rate.append(confirm_agg.loc[i].diff().max()) #largest one-day increase for this country

confirm_agg['maximum_rate']=maximum_rate

country=[i for i in death_agg.index]

max_death_rate=[]

for i in country:
    max_death_rate.append(death_agg.loc[i].diff().max()) #largest one-day increase in deaths for this country

death_agg['max_dead']=max_death_rate

corona_data=pd.DataFrame(confirm_agg["maximum_rate"])

death_data=pd.DataFrame(death_agg["max_dead"])

data=corona_data.join(happiness_report,how='inner')

data=death_data.join(data,how='inner')

We have found the maximum daily increase in confirmed cases and in deaths for each country and created one more column to store each. Using join() we have combined the three datasets. An inner join is the most common type of join you'll be working with. It returns a data frame with only those rows that are present in both inputs. Here join() matches rows on the index, which is the country name in all three data frames, so only countries that appear in every dataset are kept. This is similar to the intersection of two sets.
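The two steps described here can be sketched on toy data. Note that this sketch replaces the explicit loop with a vectorized diff(axis=1).max(axis=1), which is an alternative way of getting the same per-country value, not the code used in this post. All numbers are made up, and "Atlantis" is a fictional entry added to show a row that has no match.

```python
import pandas as pd

# Made-up cumulative counts: one row per country, one column per day.
confirmed = pd.DataFrame(
    {"d1": [0, 10, 5], "d2": [3, 50, 5], "d3": [4, 60, 25]},
    index=["India", "China", "Atlantis"],
)

# Made-up happiness values; "Norway" has no row in the case data.
happiness = pd.DataFrame(
    {"GDP per capita": [0.75, 1.03, 1.50]},
    index=["India", "China", "Norway"],
)

# diff(axis=1) takes day-to-day differences along each row,
# max(axis=1) keeps the largest daily increase per country.
max_rate = confirmed.diff(axis=1).max(axis=1).rename("maximum_rate")

# Inner join on the index keeps only countries present in both frames.
combined = max_rate.to_frame().join(happiness, how="inner")

print(combined)
```

Only India and China survive the join, because "Atlantis" is missing from the happiness data and "Norway" is missing from the case data.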

Plot correlation matrix

If you have a dataset with many columns, a good way to quickly check correlations among columns is to visualize the correlation matrix as a heatmap. From the image we can see that the death rate is correlated with HLE (healthy life expectancy). A correlation close to 0 indicates no linear relationship between the variables. The sign of the coefficient indicates the direction of the relationship: if both variables tend to increase or decrease together, the coefficient is positive, and the line that represents the correlation slopes upward.

sns.heatmap(data.corr(),cmap='spring',annot=True, vmin=-1, vmax=1, center= 0)

plt.xticks(color='White')

plt.yticks(color='white')

plt.show()
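To illustrate how to read the numbers in such a heatmap, the sketch below computes a correlation matrix for a tiny made-up data frame. The column names up, down and curve are invented for the example; curve depends strongly on x but not linearly, so its coefficient is close to 0.

```python
import pandas as pd

# Made-up columns with known relationships to x.
toy = pd.DataFrame({
    "x":     [1, 2, 3, 4, 5],
    "up":    [2, 4, 6, 8, 10],   # rises with x
    "down":  [10, 8, 6, 4, 2],   # falls as x rises
    "curve": [4, 1, 0, 1, 4],    # (x - 3) squared, symmetric around x = 3
})

# Pearson correlation between every pair of columns.
corr = toy.corr()

print(corr.round(2))
```

The same matrix is what sns.heatmap receives from data.corr() in the code above; a coefficient near 0 rules out a linear relationship only, not every relationship.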

That was it for now. Healthy life expectancy is a health measure that combines age-specific mortality with morbidity or health status to estimate the expected years of life in good health for persons at a given age. If we look at the plot, we observe that there are a few countries which have a high healthy life expectancy but where the death rate is high as well. One possible reason could be that we only have three months of COVID data. In the beginning, due to a lack of awareness about COVID, the death rate and the confirmed cases might have been increasing.
