Exercise 3.2

About dataset

Advertising.csv consists of 200 rows and 5 columns:

Unnamed: 0, TV, Radio, Newspaper, Sales (the Unnamed: 0 column is only a row counter and needs to be removed).

It shows the sales achieved with different combinations of TV, Radio and Newspaper advertising.

Data Preprocessing

#Load libraries

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

%matplotlib inline

import seaborn as sns

#Dataset import

dataexe = pd.read_csv('Advertising.csv')

#Check Information

dataexe.info()

RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  200 non-null    int64
 1   TV          200 non-null    float64
 2   Radio       200 non-null    float64
 3   Newspaper   200 non-null    float64
 4   Sales       200 non-null    float64
dtypes: float64(4), int64(1)
memory usage: 7.9 KB

#Remove unnecessary column

dataexe.drop(dataexe.columns[[0]], axis=1, inplace=True)
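As an alternative sketch, the counter column can be consumed while loading instead of being dropped afterwards. This assumes the first column of the file has an empty header, which is why pandas labels it Unnamed: 0. The inline CSV text below is a hypothetical stand-in that copies a few rows of the data so the snippet runs without the file; the 1-based row numbers in it are an assumption.

```python
import io

import pandas as pd

# Hypothetical stand-in for Advertising.csv: the empty first header and the
# 1-based row numbers are assumptions; the remaining values are rows of the data.
csv_text = """,TV,Radio,Newspaper,Sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
"""

# Default load: the unlabeled first column shows up as 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# index_col=0 uses that column as the row index, so no drop() is needed.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(clean.columns))
```

With the real file the same idea would be pd.read_csv('Advertising.csv', index_col=0), provided the file has that layout.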

Plot Graph

#Pairplot to find the relationship between features

sns.pairplot(dataexe, kind="reg");

The graph above shows that TV and Radio both have a positive linear relationship with Sales: as TV or Radio increases, Sales increases. Newspaper shows the weakest relationship. TV appears to be the most effective channel for boosting sales, so the focus should be on TV and Radio.
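The visual impression from the pairplot can be checked numerically with a correlation table. The sketch below is only illustrative: the name sample is hypothetical, and it holds just the nine rows of the data that are printed later in this exercise, so its numbers will differ from those of the full 200-row dataset. On the real data the same call would be applied to dataexe.

```python
import pandas as pd

# Hypothetical sample: the nine rows of the dataset that are printed in this
# exercise (indices 0-4 and 196-199), not the full 200 rows.
sample = pd.DataFrame(
    {
        "TV": [230.1, 44.5, 17.2, 151.5, 180.8, 94.2, 177.0, 283.6, 232.1],
        "Radio": [37.8, 39.3, 45.9, 41.3, 10.8, 4.9, 9.3, 42.0, 8.6],
        "Newspaper": [69.2, 45.1, 69.3, 58.5, 58.4, 8.1, 6.4, 66.2, 8.7],
        "Sales": [22.1, 10.4, 9.3, 18.5, 12.9, 9.7, 12.8, 25.5, 13.4],
    }
)

# Pearson correlation of every feature with Sales, strongest first.
corr_with_sales = (
    sample.corr()["Sales"].drop("Sales").sort_values(ascending=False)
)
print(corr_with_sales)
```

A value close to 1 indicates a strong positive linear relationship with Sales, and a value close to 0 indicates a weak one.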

#Regression plot

sns.regplot(data = dataexe, x = 'TV', y = 'Sales');

#Scatterplot

sns.scatterplot(data = dataexe, x = 'Radio', y = 'Sales');

#Boxplot to check for outliers in TV

sns.boxplot(x = dataexe['TV']);

#Boxplot to check for outliers in Radio

sns.boxplot(x = dataexe['Radio']);

A boxplot is a standardized way of displaying the distribution of data based on a five-number summary ("minimum", first quartile [Q1], median, third quartile [Q3] and "maximum"). It shows whether the data contains outliers and what their values are.
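The rule a boxplot uses to mark outliers can be reproduced by hand. The sketch below assumes the common 1.5 x IQR convention, which is also seaborn's default whisker length. The function name iqr_outliers and the demo values are hypothetical and are not taken from the dataset.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return the quartiles, the whisker fences and the points outside them."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower = q1 - k * iqr  # points below this fence are flagged
    upper = q3 + k * iqr  # points above this fence are flagged
    outliers = values[(values < lower) | (values > upper)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower": lower,
        "upper": upper,
        "outliers": outliers,
    }


# Hypothetical demo values with one obvious outlier.
result = iqr_outliers([10, 12, 13, 15, 16, 18, 95])
print(result["outliers"])
```

On the real data the function could be called as iqr_outliers(dataexe['TV']) or iqr_outliers(dataexe['Radio']); an empty result means the boxplot draws no points beyond the whiskers.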

Data with the sum of each column added as a Total row

This shows which component receives the most investment.

dataexe.loc['Total'] = dataexe.sum(numeric_only=True, axis=0)

print(dataexe)

OUTPUT

            TV   Radio  Newspaper   Sales
0        230.1    37.8       69.2    22.1
1         44.5    39.3       45.1    10.4
2         17.2    45.9       69.3     9.3
3        151.5    41.3       58.5    18.5
4        180.8    10.8       58.4    12.9
...        ...     ...        ...     ...
196       94.2     4.9        8.1     9.7
197      177.0     9.3        6.4    12.8
198      283.6    42.0       66.2    25.5
199      232.1     8.6        8.7    13.4
Total  29408.5  4652.8     6110.8  2804.5

#Keep only the Total row

data2 = dataexe.loc[['Total']]

data2

#Pie chart of the column totals (values taken from the Total row above)

y = np.array([29408.5, 4652.8, 6110.8])

mylabels = ["TV", "Radio", "Newspaper"]

plt.pie(y, labels = mylabels)

plt.show()
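The pie chart shows proportions only visually, so it can help to print each channel's share as a percentage as well. The sketch below reuses the three totals from the Total row; the names totals and shares are hypothetical.

```python
import pandas as pd

# Column totals copied from the Total row of the output above.
totals = pd.Series({"TV": 29408.5, "Radio": 4652.8, "Newspaper": 6110.8})

# Share of each channel in the combined total, in percent.
shares = (totals / totals.sum() * 100).round(1)
print(shares)
```

The same percentages can be drawn on the chart itself by passing autopct='%1.1f%%' to plt.pie.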