This is a sentiment analysis project about the problems passengers report with each major U.S. airline. The Twitter data was scraped in February 2015, and the task is first to classify tweets as positive, negative, or neutral, and then to categorize the reasons behind negative tweets, such as "late flight", "rude service" or "bad flight". The data is available on Kaggle and on GitHub. This project should not take more than an hour to implement.
Let's Start!!
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import geopandas as gpd
from shapely.geometry import Point,Polygon
These are the basic libraries required for the data visualization tasks below. We can also use seaborn or plotly to make the plots more attractive.
To install shapely: pip install shapely
To install geopandas refer to this documentation: link
data=pd.read_csv('tweet.csv')
data.head()
data.shape
data.isnull().any()
data['airline'].value_counts()
These are basic checks for getting to know the data and for confirming whether the dataset has any null values. If it does, we can use data preprocessing techniques to handle the NaNs and other missing values, and to deal with outliers.
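As a rough sketch of that check, the snippet below counts missing values per column and drops the rows that lack a sentiment label. The toy frame and the decision to drop on `airline_sentiment` are assumptions for illustration; the real `tweet.csv` may call for a different strategy.

```python
import numpy as np
import pandas as pd

# Toy stand-in for tweet.csv (values are made up for illustration)
sample = pd.DataFrame({
    'airline': ['Delta', 'United', 'United', 'Southwest'],
    'airline_sentiment': ['positive', 'negative', np.nan, 'neutral'],
    'latitude': [40.7, np.nan, 34.0, np.nan],
})

# Number of missing values in each column
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Drop only the rows where the sentiment label itself is missing
cleaned = sample.dropna(subset=['airline_sentiment'])
print(cleaned.shape)
```

Rows with missing coordinates are kept here on purpose, since they are still usable for the sentiment plots.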
plt.figure(figsize=(15,10))
plt.subplot(331)
data[data['airline']=='Delta']['airline_sentiment'].value_counts().plot(kind='barh',color='teal')
plt.title('delta airline')
plt.subplot(332)
data[data['airline']=='American']['airline_sentiment'].value_counts().plot(kind='barh',color='darkturquoise')
plt.title('American airline')
plt.subplot(333)
data[data['airline']=='United']['airline_sentiment'].value_counts().plot(kind='barh',color='cadetblue')
plt.title('united airline')
plt.subplot(334)
data[data['airline']=='US Airways']['airline_sentiment'].value_counts().plot(kind='barh',color='powderblue')
plt.title('US Airways')
plt.subplot(335)
data[data['airline']=='Southwest']['airline_sentiment'].value_counts().plot(kind='barh',color='mediumaquamarine')
plt.title('Southwest airline')
plt.subplot(336)
data[data['airline']=='Virgin America']['airline_sentiment'].value_counts().plot(kind='barh',color='steelblue')
plt.title('Virgin America')
We have used subplots to place the bar graphs side by side. kind='barh' draws a horizontal bar graph; for a vertical one, use kind='bar'. value_counts() returns the unique values in a column together with the number of times each one occurs.
These bar graphs illustrate the sentiment of passengers travelling with the different airlines. From them we can conclude that US Airways has the highest number of negative reviews. Southwest, United and American also have more negative tweets than positive ones. Delta and Virgin America have more negative tweets than positive ones too, but compared to the rest of the airlines they have a higher proportion of positive and neutral reviews. Keep in mind that the number of tweets differs from airline to airline, so raw counts cannot be compared directly: only 504 tweets about Virgin America were collected, while US Airways has more than 3000. It is therefore better to say that the proportion of positive tweets is higher for Virgin America and Delta than for the rest.
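To compare proportions rather than raw counts, one option is a normalized cross-tabulation. This is a minimal sketch on made-up data, assuming the same `airline` and `airline_sentiment` columns as above; it is not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for tweet.csv (counts are made up for illustration)
sample = pd.DataFrame({
    'airline': ['Delta'] * 4 + ['US Airways'] * 5,
    'airline_sentiment': ['negative', 'negative', 'positive', 'neutral',
                          'negative', 'negative', 'negative', 'negative', 'positive'],
})

# normalize='index' makes each row sum to 1, so airlines with
# different tweet counts become directly comparable
proportions = pd.crosstab(sample['airline'], sample['airline_sentiment'], normalize='index')
print(proportions.round(2))
```

On the real data, `proportions.plot(kind='barh', stacked=True)` would draw one stacked bar per airline.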
geo=[Point(xy) for xy in zip(data[data['airline']=='Delta']['longitude'][:200],data[data['airline']=='Delta']['latitude'][:200])]
geo1=[Point(xy) for xy in zip(data[data['airline']=='Southwest']['longitude'][:200],data[data['airline']=='Southwest']['latitude'][:200])]
geo2=[Point(xy) for xy in zip(data[data['airline']=='American']['longitude'][:200],data[data['airline']=='American']['latitude'][:200])]
geo3=[Point(xy) for xy in zip(data[data['airline']=='United']['longitude'][:200],data[data['airline']=='United']['latitude'][:200])]
geo4=[Point(xy) for xy in zip(data[data['airline']=='US Airways']['longitude'][:200],data[data['airline']=='US Airways']['latitude'][:200])]
geo5=[Point(xy) for xy in zip(data[data['airline']=='Virgin America']['longitude'][:200],data[data['airline']=='Virgin America']['latitude'][:200])]
We first convert each latitude and longitude pair into a Point. Note that Point takes longitude (x) first and latitude (y) second. The tweets are grouped by airline, and only the first 200 rows for each airline are used.
g=gpd.GeoDataFrame(geometry=geo)
g1=gpd.GeoDataFrame(geometry=geo1)
g2=gpd.GeoDataFrame(geometry=geo2)
g3=gpd.GeoDataFrame(geometry=geo3)
g4=gpd.GeoDataFrame(geometry=geo4)
g5=gpd.GeoDataFrame(geometry=geo5)
Next, we convert the points into GeoDataFrames so that they can be plotted on the US map.
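The per-airline Point lists and GeoDataFrames above can also be built in a loop. This is a minimal sketch, assuming the same `airline`, `longitude` and `latitude` columns; the `points_for` helper and the toy frame are hypothetical, and dropping rows without coordinates is an extra step that the original code does not take.

```python
import geopandas as gpd
import pandas as pd

# Toy stand-in for tweet.csv (coordinates are made up for illustration)
sample = pd.DataFrame({
    'airline': ['Delta', 'Delta', 'United'],
    'longitude': [-97.0, None, -118.2],
    'latitude': [35.5, None, 34.0],
})

def points_for(df, airline, limit=200):
    """Build a GeoDataFrame of tweet locations for one airline (hypothetical helper)."""
    rows = df[df['airline'] == airline].dropna(subset=['longitude', 'latitude']).head(limit)
    # points_from_xy takes x (longitude) first, then y (latitude)
    geometry = gpd.points_from_xy(rows['longitude'], rows['latitude'])
    return gpd.GeoDataFrame(rows, geometry=geometry)

# One GeoDataFrame per airline, keyed by airline name
frames = {name: points_for(sample, name) for name in sample['airline'].unique()}
print(len(frames['Delta']))
```

Each entry in `frames` can then be drawn with `.plot(ax=ax, label=name)` in the same way as `g` to `g5`.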
US = gpd.read_file('States_shapefile-shp/States_shapefile.shp')
You can download the US shapefile from here. A shapefile is a simple, nontopological format for storing the geometric location and attribute information of geographic features.
fig,ax=plt.subplots(figsize=(15,10))
US.boundary.plot(ax=ax,color='slategrey')
g.plot(ax=ax,color='lightblue',label='delta')
g1.plot(ax=ax,color='plum',label='southwest')
g2.plot(ax=ax,color='mediumslateblue',label='American')
g3.plot(ax=ax,color='darkgoldenrod',label='United')
g4.plot(ax=ax,color='lightcoral',label='US Airways')
g5.plot(ax=ax,color='forestgreen',label='Virgin America')
plt.legend()
This is how we plot the points. We first create a figure and axes, then overlay the geometric points on those axes.
This plot shows that tweets about Delta and US Airways came mostly from the central region, while tweets about the other airlines were concentrated in the southwest, in states such as California and Nevada.
That's it for now. We can make many more visualizations from this dataset. It also has time and date attributes, so we could measure the frequency of tweets over the course of the day.
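As a starting point for that idea, here is a rough sketch that counts tweets per hour of the day. The column name `tweet_created` and the timestamp format are assumptions, so check the actual column in `tweet.csv` before using it.

```python
import pandas as pd

# Toy stand-in for tweet.csv; 'tweet_created' is an assumed column name
sample = pd.DataFrame({
    'tweet_created': ['2015-02-24 11:35:52', '2015-02-24 11:15:59',
                      '2015-02-23 18:03:10', '2015-02-22 09:41:00'],
})

# Parse the strings into timestamps, then count tweets per hour of the day
created = pd.to_datetime(sample['tweet_created'])
tweets_per_hour = created.dt.hour.value_counts().sort_index()
print(tweets_per_hour)
```

Calling `tweets_per_hour.plot(kind='bar')` would turn the counts into a bar graph like the ones above.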
See you jolly soon!