ABSTRACT
Dataset Title: Contraceptive Method Choice (Source: http://archive.ics.uci.edu/ml/machine-learning-databases/cmc/cmc.data)
Dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey.
Data Set Information:
This dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The samples are married women who were either not pregnant or did not know whether they were pregnant at the time of interview.
The problem is to predict the current contraceptive method choice (no use, long-term methods, or short-term methods) of a woman based on her demographic and socio-economic characteristics.
Number of Instances: 1473
Number of Attributes: 10 (including the class attribute)
Missing Attribute Values: None
Attributes of Dataset:
1. Wife's age (numerical)
2. Wife's education (categorical) 1=low, 2, 3, 4=high
3. Husband's education (categorical) 1=low, 2, 3, 4=high
4. Number of children ever born (numerical)
5. Wife's religion (binary) 0=Non-Islam, 1=Islam
6. Wife's now working? (binary) 0=Yes, 1=No
7. Husband's occupation (categorical) 1, 2, 3, 4
8. Standard-of-living index (categorical) 1=low, 2, 3, 4=high
9. Media exposure (binary) 0=Good, 1=Not good
10. Contraceptive method used (class attribute) 1=No-use, 2=Long-term, 3=Short-term
Data Retrieving - PremGeorge
1. Importing Packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import urllib2
1.a Loading CSV File from web
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/cmc/cmc.data'
set1 = urllib2.Request(url)
cmc_p = urllib2.urlopen(set1)
Assigning the dataset
cmc = pd.read_csv(cmc_p, sep=',', decimal='.', header=None, names=['Wife_Age', 'Wife_Education', 'Husband_Education', 'Number_Of_Children_Ever_Born', 'Wife_Religion', 'Wife_Now_Working', 'Husband_Occupation', 'Standard_Of_Living_Index', 'Media_Exposure','Contraceptive_Method_Used'])
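For reference, a minimal Python 3 sketch of the same load (urllib2 does not exist there; this assumes a current pandas, which can fetch the CSV directly from the URL):
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/cmc/cmc.data'
cols = ['Wife_Age', 'Wife_Education', 'Husband_Education',
        'Number_Of_Children_Ever_Born', 'Wife_Religion', 'Wife_Now_Working',
        'Husband_Occupation', 'Standard_Of_Living_Index', 'Media_Exposure',
        'Contraceptive_Method_Used']
cmc = pd.read_csv(url, header=None, names=cols)  # same frame as above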
Listing Column Names : cmc.columns
Listing the Column values : cmc.head()
Checking the Data Types: cmc.dtypes
Checking Datatypes : cmc['Wife_Age'].dtype
Checking the attributes :
cmc['Wife_Age'].describe()
cmc['Wife_Age'].unique()
cmc['Wife_Age'].value_counts()
Summary on Data Preparation:
Data cleaned and checked for:
1. Missing values
2. Negative values
3. Impossible values
4. Extra whitespace
5. Typos
A minimal sketch of these checks follows this list.
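A sketch of such checks, assuming the frame loaded above (in this dataset every column parses as an integer):
# Missing values: the dataset documentation reports none
print(cmc.isnull().sum())
# Negative or impossible values in the numeric columns
print((cmc['Wife_Age'] <= 0).sum())
print((cmc['Number_Of_Children_Ever_Born'] < 0).sum())
# Extra whitespace and typos can only occur in string columns;
# cmc.dtypes shows every column was parsed as an integer, ruling both out
print(cmc.dtypes)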
Data Exploration
Wife's age :
cmc['Wife_Age'].plot(kind='hist',bins=10,color='red')
plt.title("Wife's Age Distribution")
plt.xlabel("Wife's Age")
plt.ylabel("Number of wives in the particular age interval")
plt.show()
Number of children ever born :
cmc['Number_Of_Children_Ever_Born'].plot(kind='hist',bins=10,color='brown')
plt.title("Distribution of Number of Children Ever Born")
plt.xlabel("Number of Children Ever Born ")
plt.ylabel("Number of Times the Children Born for each category")
plt.show()
Wife's education :
x_WEaxis=['1 Low', '2','3','4 High']
y_WEaxis=[152,334,410,577]
freqWE=np.arange(len(x_WEaxis))
plt.bar(freqWE,y_WEaxis,align='center', color ='purple')
plt.xticks(freqWE,x_WEaxis)
plt.ylabel("Number of wives per education level")
plt.title("BAR CHART: Wife's Education Frequency Distribution")
plt.xlabel("Wife's Education, 1 being Low and 4 Highly Educated")
plt.show()
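The y-axis counts above are typed in by hand; as a sketch, the same chart can also be derived from the frame itself, so the numbers cannot drift out of sync with the data (the same pattern applies to the other categorical charts below):
# value_counts gives the frequency of each education level;
# sort_index orders the bars 1 (low) to 4 (high)
we_counts = cmc['Wife_Education'].value_counts().sort_index()
we_counts.plot(kind='bar', color='purple')
plt.title("BAR CHART: Wife's Education Frequency Distribution")
plt.xlabel("Wife's Education, 1 being Low and 4 Highly Educated")
plt.ylabel("Number of wives per education level")
plt.show()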
Husband's education :
x_HEaxis=['1 Low', '2','3','4 High']
y_HEaxis=[44,178,352,899]
freqHE=np.arange(len(x_HEaxis))
plt.bar(freqHE,y_HEaxis,align='center', color ='orange')
plt.xticks(freqHE,x_HEaxis)
plt.ylabel("Number of husbands per education level")
plt.title("BAR CHART: Husband's Education Frequency Distribution")
plt.xlabel("Husband's Education, 1 being Low and 4 Highly Educated")
plt.show()
Husband's Occupation :
cmc['Husband_Occupation'].value_counts().plot(kind='pie',autopct='%.2f')
plt.title("PIE-Chart: 'Husband’s_Occupation' Percentage Distribution")
plt.show()
Standard-of-living index show :
x_SIaxis=['1 Low', '2','3','4 High']
y_SIaxis=[129,229,431,684]
freqSI=np.arange(len(x_SIaxis))
plt.bar(freqSI,y_SIaxis,align='center',color ='pink')
plt.xticks(freqSI,x_SIaxis)
plt.ylabel("Standard of Living Index on Category")
plt.title("BAR CHART: Standard of Living Index Frequency Distribution")
plt.xlabel("Standard of Living Index 1 being Low and 4 is High Standard")
plt.show()
Wife's Religion :
x_WRaxis=['0 Non Islam','1 Islam']
y_WRaxis=[220,1253]
freqWR=np.arange(len(x_WRaxis))
plt.bar(freqWR,y_WRaxis,align='center',color=['yellow', 'red'])
plt.xticks(freqWR,x_WRaxis)
plt.ylabel("Number of wives per religion category")
plt.title("BAR CHART: Wife's Religion Frequency Distribution")
plt.xlabel("Wife's Religion 0 is Non-Islamic and 1 being Islamic")
plt.show()
Wife Now Working :
x_WWaxis=['0 Working','1 Not Working']
y_WWaxis=[369,1104]
freqWW=np.arange(len(x_WWaxis))
plt.bar(freqWW,y_WWaxis,align='center',color=['violet','green'])
plt.xticks(freqWW,x_WWaxis)
plt.ylabel("Number of wives by working status")
plt.title("BAR CHART: Wife Working or Not Frequency Distribution")
plt.xlabel("Wife's Work 0 is Working and 1 being Not-Working")
plt.show()
Media Exposure :
x_MEaxis=['0 Good','1 Not Good']
y_MEaxis=[1364,109]
freqME=np.arange(len(x_MEaxis))
plt.bar(freqME,y_MEaxis,align='center',color=['purple', 'yellow'])
plt.xticks(freqME,x_MEaxis)
plt.ylabel('Media Exposure')
plt.title("BAR CHART: Media Exposure Frequency Distribution")
plt.xlabel("Media Exposure 0 is Good Exposure and 1 being Not-Good")
plt.show()
Contraceptive Method Used :
x_cmcaxis=['1 No Use','2 Long Term', '3 Short Term']
y_cmcaxis=[629,333,511]
freqcmc=np.arange(len(x_cmcaxis))
plt.bar(freqcmc,y_cmcaxis,align='center',color =['green','red','blue'])
plt.xticks(freqcmc,x_cmcaxis)
plt.ylabel('Contraceptive Method Used')
plt.title("BAR CHART: Contraceptive Method Used Frequency Distribution")
plt.xlabel("Contraceptive Method Used 1 being Not Used, 2 Long Term used and 3 Short Term used")
plt.show()
Relationships between attributes
Husband's education vs Wife's education
EducationLevel=['4 High','3','2','1 Low']
Husband=[899,352,178,44]
Wife=[577,410,334,152]
bar_width=0.4
x=np.arange(len(EducationLevel))
plt.bar(x,Husband,bar_width,color='red',label="Husband's Education")
plt.bar(x+bar_width,Wife,bar_width,color='pink',label="Wife's Education")
plt.legend()
plt.xlabel("Husband's vs Wife's Education 1:Low and 4:Highly Education")
plt.ylabel('Number of Educated peoples')
plt.title("Husband's Education Vs Wife's Education")
plt.xticks(x+bar_width,EducationLevel)
plt.tight_layout()
plt.show()
Husband's occupation vs Standard-of-living index
Grouping the Standard of Living Index for each category of Husband’s Occupation
SOLIndexHO = cmc.groupby('Husband_Occupation').Standard_Of_Living_Index.value_counts().sort_index()
SOLIndexHO
SOLIndexHO.unstack()
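Since the unstacked result is already a 4x4 frequency table (occupation level by SOL index), pandas can draw the grouped bars directly; a sketch:
# Rows (Husband_Occupation) become bar groups, columns (SOL index) the bars
SOLIndexHO.unstack().plot(kind='bar')
plt.xlabel("Husband's Occupation in different Levels 1 to 4")
plt.ylabel('Standard of Living Index Frequency')
plt.show()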
BARCHART
HO=['1','2','3','4']
SOLIHO1=[7,40,77,5]
SOLIHO2=[29,75,121,4]
SOLIHO3=[107,121,198,5]
SOLIHO4=[293,189,189,13]
bar_width=0.2
x=np.arange(len(HO))
plt.bar(x,SOLIHO1,bar_width,color='red',label="SOL Index 1")
plt.bar(x+bar_width,SOLIHO2,bar_width,color='pink',label="SOL Index 2")
plt.bar(x+bar_width+bar_width,SOLIHO3,bar_width,color='green',label="SOL Index 3")
plt.bar(x+bar_width+bar_width+bar_width,SOLIHO4,bar_width,color='orange',label="SOL Index 4")
plt.legend()
plt.xlabel("Husband's Occupation in different Levels 1 to 4")
plt.ylabel('Standard of Living Index Frequency')
plt.title("Standard of Living Index variation over Husband Occupation Category")
plt.xticks(x+bar_width+bar_width,HO)
plt.tight_layout()
plt.show()
Wife's now working? vs Standard-of-living index
Grouping the Standard of Living Index for Wife’s working condition
SOLIndexWW = cmc.groupby('Wife_Now_Working').Standard_Of_Living_Index.value_counts().sort_index()
SOLIndexWW.unstack()
BARCHART
WW=['0 Working','1 Not Working']
SOLIWW1=[20,109]
SOLIWW2=[57,172]
SOLIWW3=[98,333]
SOLIWW4=[194,490]
bar_width=0.2
x=np.arange(len(WW))
plt.bar(x,SOLIWW4,bar_width,color='brown',label="SOL Index 4")
plt.bar(x+bar_width,SOLIWW3,bar_width,color='purple',label="SOL Index 3")
plt.bar(x+bar_width+bar_width,SOLIWW2,bar_width,color='yellow',label="SOL Index 2")
plt.bar(x+bar_width+bar_width+bar_width,SOLIWW1,bar_width,color='orange',label="SOL Index 1")
plt.legend()
plt.xlabel('Wife Working Status')
plt.ylabel('Standard of Living Index Frequency')
plt.title('Standard of Living Index variation over Wife Working Status')
plt.xticks(x+bar_width+bar_width,WW)
plt.tight_layout()
plt.show()
Wife's education vs Standard-of-living index
Grouping Wife’s Education with Standard of Living Index
SOLIndexWE = cmc.groupby('Wife_Education').Standard_Of_Living_Index.value_counts().sort_index()
SOLIndexWE.unstack()
BARCHART
WE=['4 High','3','2','1 Low']
SOLIWE1=[8,37,55,29]
SOLIWE2=[38,81,72,38]
SOLIWE3=[145,141,100,45]
SOLIWE4=[386,151,107,40]
bar_width=0.2
x=np.arange(len(WE))
plt.bar(x,SOLIWE4,bar_width,color='red',label="SOL Index 4")
plt.bar(x+bar_width,SOLIWE3,bar_width,color='pink',label="SOL Index 3")
plt.bar(x+bar_width+bar_width,SOLIWE2,bar_width,color='green',label="SOL Index 2")
plt.bar(x+bar_width+bar_width+bar_width,SOLIWE1,bar_width,color='orange',label="SOL Index 1")
plt.legend()
plt.xlabel("Wife Education")
plt.ylabel('Standard of Living Index Frequency')
plt.title("Standard of Living Index variation over Wife Education")
plt.xticks(x+bar_width+bar_width,WE)
plt.tight_layout()
plt.show()
Wife's religion vs Contraceptive method used
Grouping the Wife’s Religion with their Contraceptive Method Use
CMCWR = cmc.groupby('Wife_Religion').Contraceptive_Method_Used.value_counts().sort_index()
CMCWR.unstack()
BARCHART
WR=['0 Non-Islam','1 Islam']
CMCWR1=[75,554]
CMCWR2=[76,257]
CMCWR3=[69,442]
bar_width=0.3
x=np.arange(len(WR))
plt.bar(x,CMCWR1,bar_width,color='brown',label="No-Use")
plt.bar(x+bar_width,CMCWR2,bar_width,color='purple',label="Long-Term")
plt.bar(x+bar_width+bar_width,CMCWR3,bar_width,color='yellow',label="Short-Term")
plt.legend()
plt.xlabel('Wife Religious Status')
plt.ylabel('Contraceptive Method Use Frequency')
plt.title('Contraceptive Method Use with Wife Religious Status')
plt.xticks(x+bar_width+bar_width,WR)
plt.tight_layout()
plt.show()
Wife's religion vs Media Exposure
Grouping the Wife’s Religion with their Media Exposure
WRME = cmc.groupby('Wife_Religion').Media_Exposure.value_counts().sort_index()
WRME.unstack()
BARCHART
WR=['0 Non-Islam','1 Islam']
MEWR0=[212,1152]
MEWR1=[8,101]
bar_width=0.3
x=np.arange(len(WR))
plt.bar(x,MEWR0,bar_width,color='blue',label="Good")
plt.bar(x+bar_width,MEWR1,bar_width,color='yellow',label="Not Good")
plt.legend()
plt.xlabel('Wife Religious Status')
plt.ylabel('Media Exposure Frequency')
plt.title('Media Exposure with Wife Religious Status')
plt.xticks(x+bar_width,WR)
plt.tight_layout()
plt.show()
Wife's now working? vs Number of children ever born
Grouping the Number of Children Born against Wife’s Working Condition
NCBWW = cmc.groupby('Wife_Now_Working').Number_Of_Children_Ever_Born.value_counts().sort_index()
NCBWW.unstack()
HISTOGRAM
cmc.groupby('Wife_Now_Working').Number_Of_Children_Ever_Born.plot(kind="hist", alpha=0.5)
plt.xlabel("Number of Children Ever Born")
plt.ylabel('Frequency')
plt.title("Number of Children Born against Wife Working Status (0 Working, 1 Not Working)")
plt.legend()
plt.show()
Contraceptive Method Used vs Number of children ever born
Grouping the Number of Children Born against Contraceptive Method Used
NCBCMC = cmc.groupby('Contraceptive_Method_Used').Number_Of_Children_Ever_Born.value_counts().sort_index()
NCBCMC.unstack()
BOXPLOT
cmc.boxplot(column='Number_Of_Children_Ever_Born',by='Contraceptive_Method_Used')
plt.xlabel("Contraceptive Method Used 1 No-Use, 2 Long-Term, 3 Short-Term")
plt.ylabel('Number of Children Ever Born')
plt.title("Number of Children Born against Contraceptive Method Used")
plt.show()
Contraceptive Method Used vs Standard-of-living index
Grouping the Contraceptive Method Used for different Standard of Living Index
SOLCMC = cmc.groupby('Standard_Of_Living_Index').Contraceptive_Method_Used.value_counts().sort_index()
SOLCMC.unstack()
BARCHART
SCMU=['4 High','3','2','1 Low']
SOLICMU1=[248,184,117,80]
SOLICMU2=[204,90,30,9]
SOLICMU3=[232,157,82,40]
bar_width=0.2
x=np.arange(len(SCMU))
plt.bar(x, SOLICMU1,bar_width,color='blue',label="No-Use")
plt.bar(x+bar_width, SOLICMU2,bar_width,color='red',label="Long-Term")
plt.bar(x+bar_width+bar_width,SOLICMU3,bar_width,color='yellow',label="Short-Term")
plt.legend()
plt.xlabel(' Standard of Living Index')
plt.ylabel('Contraceptive Methods Used Frequency')
plt.title('Contraceptive Methods Used over different Standard of Living')
plt.xticks(x+bar_width+bar_width,SCMU)
plt.tight_layout()
plt.show()
Wife’s Education vs Contraceptive Method Used
Grouping the Contraceptive Method Used based on Wife’s Education
WECMC = cmc.groupby('Wife_Education').Contraceptive_Method_Used.value_counts().sort_index()
WECMC.unstack()
BARCHART
CMUWE=['4 High','3','2','1 Low']
CMUWE1=[175,175,176,103]
CMUWE2=[207,80,37,9]
CMUWE3=[195,155,121,40]
bar_width=0.2
x=np.arange(len(CMUWE))
plt.bar(x, CMUWE1,bar_width,color='green',label="No-Use")
plt.bar(x+bar_width, CMUWE2,bar_width,color='purple',label="Long-Term")
plt.bar(x+bar_width+bar_width, CMUWE3,bar_width,color='pink',label="Short-Term")
plt.legend()
plt.xlabel('Wife Education 1 being Low and 4 High')
plt.ylabel('Contraceptive Methods Used Frequency')
plt.title('Contraceptive Methods Used over Wife Education')
plt.xticks(x+bar_width+bar_width,CMUWE)
plt.tight_layout()
plt.show()
Contraceptive Method Used vs Wife’s Age
Grouping the Contraceptive Method Used vs different age groups of Wives.
WACMC = cmc.groupby('Contraceptive_Method_Used').Wife_Age.value_counts().sort_index()
WACMC.unstack()
BOXPLOT
cmc.boxplot(column='Wife_Age',by='Contraceptive_Method_Used')
plt.xlabel("Contraceptive Method Used 1 No-Use, 2 Long-Term, 3 Short-Term")
plt.ylabel('Wife Age')
plt.title("Wife Age against Contraceptive Method Used")
plt.show()
Data Modelling
Engineering Feature and Selecting a Model
The Question is to predict the current contraceptive method choice (no use, long-term methods, or short-term methods) of a woman based on her demographic and socio-economic characteristics.
The Contraceptive Method Choice dataset poses a classification task: distinguishing three types of choices (No-Use, Long-Term and Short-Term).
Class Distribution: 1 No-Use: 42.70%, 2 Long-Term: 22.61%, 3 Short-Term: 34.69%
Loading the Data
Choosing the Classification Metric: "Accuracy"
The class distribution is reasonably balanced across the three choices, so accuracy is a sensible metric here; on a heavily skewed distribution it would mostly reward models that predict the most frequent class.
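As a sanity check, the accuracy of a constant classifier that always predicts the most frequent class ('No-Use') sets the floor any real model should beat; a sketch:
# Majority-class baseline: the largest class share, about 0.427
baseline = cmc['Contraceptive_Method_Used'].value_counts(normalize=True).max()
print(baseline)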
Generating Train/Test Set
from sklearn.cross_validation import train_test_split
cmc.shape
(1473, 10)
Selecting Feature Columns
cmc_copy=cmc
cmc_copy = cmc_copy.drop('Contraceptive_Method_Used',1)
cmc_copy.shape
Out[7]: (1473, 9)
Selecting Target Columns with Class Variable
target = cmc['Contraceptive_Method_Used']
target.shape
(1473,)
np.unique(target)
Out[90]: array([1, 2, 3])
X_train, X_test, y_train, y_test = train_test_split(cmc_copy,target, test_size=0.4, random_state=0)
X_train.shape,y_train.shape
Out[13]: ((883, 9), (883,))
X_test.shape,y_test.shape
Out[14]: ((590, 9), (590,))
Checking classification accuracy of KNN with K=5 by the train/test split method
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform')
y_pred = knn.predict(X_test)
print metrics.accuracy_score(y_test, y_pred)
Output: 0.4763
Obtaining classification accuracy by cross_val_score (Score Method)
from sklearn.cross_validation import cross_val_score
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, cmc_copy, target, cv=10, scoring='accuracy')
scores
Output: array([ 0.56375839, 0.46621622, 0.49324324, 0.50340136, 0.49659864,
0.52380952, 0.57823129, 0.54421769, 0.56462585, 0.45890411])
scores.mean()
Output: 0.5193
Discussion
Comparing Cross Validation and Train/Test Split
Accuracy on CV = 0.5193, on train/test split = 0.4763
Cross-validation gives a more reliable estimate of out-of-sample accuracy than a single train/test split.
Obtaining classification accuracy with different Scoring Parameter
from sklearn import metrics
from sklearn.cross_validation import cross_val_predict
predicted = cross_val_predict(knn, cmc_copy, target, cv=10)
metrics.precision_score(target, predicted)
Output: 0.5122
metrics.recall_score(target, predicted)
Output: 0.5193
metrics.f1_score(target, predicted)
Output: 0.5128
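These calls work here because this (older) scikit-learn defaults multiclass precision/recall/F1 to a weighted average (note that the recall above equals the cross-validated accuracy, as weighted recall does). Newer scikit-learn requires the averaging to be spelled out; a sketch assuming a current version:
from sklearn import metrics
# With 3 classes the average must be chosen explicitly
print(metrics.precision_score(target, predicted, average='weighted'))
print(metrics.recall_score(target, predicted, average='weighted'))
print(metrics.f1_score(target, predicted, average='weighted'))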
Performance Accuracy
Obtaining Performance Score by train_test_split : 0.4763
Obtaining Performance Score by cross_val_score (Score Method) : 0.5193
Obtaining Performance Score by Scoring Parameter
Precision Score : 0.5122
Recall Score : 0.5193
F1 Score : 0.5128
Cross-validation gives the more reliable estimate of out-of-sample accuracy; note that the (weighted) recall equals the cross-validated accuracy here.
k-folds Cross Validation
Computing the score 10 consecutive times
import numpy as np
from sklearn import datasets, svm
X_folds = np.array_split(cmc_copy, 10)
y_folds = np.array_split(target, 10)
kfclf = svm.SVC(kernel='linear', C=1)
scores = list()
for k in range(10):
    X_train = list(X_folds)
    X_test = X_train.pop(k)
    X_train = np.concatenate(X_train)
    y_train = list(y_folds)
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    scores.append(kfclf.fit(X_train, y_train).score(X_test, y_test))
print(scores)
[0.39864864864864863, 0.42567567567567566, 0.46621621621621623, 0.0, 0.25850340136054423, 0.013605442176870748, 0.37414965986394561, 0.47619047619047616, 0.17006802721088435, 0.0]
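Several folds score at or near zero, most likely because np.array_split keeps the rows in file order, so a fold's class mix can differ sharply from its training data. A sketch of the same evaluation with stratified folds, assuming a scikit-learn recent enough to ship StratifiedKFold in sklearn.model_selection:
from sklearn.model_selection import StratifiedKFold
from sklearn import svm

# Each fold now preserves the overall class proportions
skf = StratifiedKFold(n_splits=10)
scores = []
for train_idx, test_idx in skf.split(cmc_copy, target):
    clf = svm.SVC(kernel='linear', C=1)
    clf.fit(cmc_copy.iloc[train_idx], target.iloc[train_idx])
    scores.append(clf.score(cmc_copy.iloc[test_idx], target.iloc[test_idx]))
print(scores)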
Simulating KFold by splitting 25 observations into 5 folds
from sklearn.cross_validation import KFold
kf = KFold(25, n_folds=5, shuffle=False)
print '{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations')
for iteration, data in enumerate(kf, start=1):
    print '{:^9} {} {:^25}'.format(iteration, data[0], data[1])
Iteration Training set observations Testing set observations
1 [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4]
2 [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9]
3 [ 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14]
4 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19]
5 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24]
Finding the optimal value of k for k-nearest neighbour
from sklearn.neighbors import KNeighborsClassifier
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, cmc_copy, target, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())
print k_scores
[0.46094351110834114, 0.46969970995437571, 0.48610035018084768, 0.50439011683282042, 0.51930063120517422, 0.53288377875842907, 0.53769612215874985, 0.51997214613055376, 0.53089296411969422, 0.53355454444449568, 0.54306009289756119, 0.53831601352352876, 0.54441106899538183, 0.54847444057769912, 0.55051072343589968, 0.55864225066046735, 0.55261608125679096, 0.55461987425862702, 0.56206659337052434, 0.56139116828645197, 0.5641720116207567, 0.55394457130098529, 0.55193555855634435, 0.55942855809185799, 0.55195872961669734, 0.54789095049572611, 0.54923310771299327, 0.5458453488415137, 0.54857153362551381, 0.55263005818305999]
Plotting the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
import matplotlib.pyplot as plt
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.show()
10 fold cross-validation gives the best KNN model with k=21
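The same conclusion can be read off programmatically rather than from the plot; a sketch:
# k_range starts at 1, so the best k sits at the position of the maximum
best_k = k_range[int(np.argmax(k_scores))]
print(best_k, max(k_scores))  # 21, ~0.564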
knn = KNeighborsClassifier(n_neighbors=21)
scores= cross_val_score(knn, cmc_copy, target, cv=10, scoring='accuracy')
print scores
[0.60402685 0.4527027 0.53378378 0.59183673 0.57823129 0.56462585
0.59183673 0.57823129 0.61904762 0.52739726]
scores.mean()
Output: 0.5641
With k=21, 10-fold cross-validation yields the best KNN accuracy (about 0.564).
Data Modelling
k-nearest neighbour Classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import cross_val_score
X_train, X_test, y_train, y_test = train_test_split(cmc_copy,target, test_size=0.4, random_state=0)
Fitting the Model
k21clf = KNeighborsClassifier(21)
k21fit = k21clf.fit(X_train, y_train)
k21fit
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=21, p=2,
weights='uniform')
print cross_val_score(k21fit, cmc_copy, target, cv=10, scoring='accuracy').mean()
Output 0.56417
Predict on Unseen Data
k21predicted = k21fit.predict(X_test)
k21predicted
Out[101]:
array([1, 2, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2, 3, 2, 2, 2, 1, 2, 3, 1, 3, 3, 1,
1, 3, 2, 3, 3, 1, 2, 3, 1, 3, 1, 1, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 3,
3, 1, 1, 1, 3, 3, 1, 1, 1, 1, 1, 3, 1, 1, 3, 3, 1, 1, 3, 1, 1, 2, 1,
1, 1, 3, 3, 2, 3, 3, 1, 1, 1, 1, 3, 1, 3, 2, 1, 3, 3, 3, 3, 1, 1, 3,
3, 3, 1, 3, 2, 2, 1, 2, 1, 2, 1, 3, 1, 1, 3, 1, 2, 1, 1, 1, 2, 3, 1,
1, 3, 3, 1, 1, 2, 1, 2, 1, 2, 3, 1, 3, 1, 1, 3, 1, 3, 3, 3, 3, 1, 3,
1, 3, 2, 1, 3, 3, 2, 3, 1, 1, 1, 1, 1, 1, 2, 2, 1, 3, 1, 1, 1, 1, 3,
3, 1, 2, 1, 3, 1, 1, 3, 2, 2, 1, 1, 3, 3, 3, 3, 3, 2, 1, 1, 1, 2, 1,
1, 1, 3, 3, 2, 2, 3, 3, 3, 3, 2, 3, 2, 1, 3, 1, 1, 1, 1, 3, 3, 3, 3,
1, 1, 3, 1, 3, 3, 1, 2, 3, 1, 3, 2, 1, 1, 1, 1, 3, 1, 1, 1, 2, 3, 1,
1, 3, 3, 1, 3, 2, 1, 2, 3, 3, 1, 3, 1, 1, 1, 2, 3, 2, 1, 2, 1, 2, 1,
1, 1, 1, 3, 1, 3, 1, 1, 1, 2, 1, 1, 1, 3, 1, 1, 2, 1, 1, 1, 3, 2, 1,
3, 3, 1, 2, 3, 1, 1, 1, 3, 2, 1, 3, 2, 2, 1, 2, 2, 3, 1, 3, 1, 2, 1,
1, 3, 2, 3, 2, 1, 3, 2, 2, 2, 3, 3, 1, 3, 3, 3, 2, 3, 1, 3, 1, 1, 1,
2, 1, 2, 3, 3, 1, 1, 1, 1, 3, 1, 3, 2, 3, 3, 1, 2, 3, 1, 1, 3, 2, 3,
3, 3, 3, 2, 3, 3, 3, 1, 2, 1, 3, 1, 1, 3, 2, 3, 1, 2, 1, 1, 3, 2, 2,
2, 2, 2, 1, 2, 3, 3, 2, 1, 3, 1, 1, 3, 3, 2, 2, 2, 1, 2, 1, 2, 3, 2,
1, 3, 2, 2, 1, 1, 3, 3, 3, 1, 3, 3, 2, 1, 2, 1, 3, 1, 2, 3, 3, 3, 1,
2, 1, 1, 1, 3, 2, 3, 1, 1, 1, 3, 2, 1, 1, 2, 2, 1, 3, 1, 1, 3, 1, 3,
3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 1, 1, 3, 3, 3, 1, 1, 3, 3, 1,
2, 1, 1, 1, 1, 1, 2, 3, 1, 1, 3, 3, 1, 2, 3, 1, 2, 1, 3, 3, 3, 1, 2,
2, 3, 2, 3, 1, 2, 1, 3, 1, 3, 3, 3, 2, 3, 2, 2, 3, 2, 2, 2, 3, 3, 3,
3, 3, 1, 3, 3, 3, 2, 3, 1, 1, 1, 1, 2, 3, 3, 1, 3, 3, 1, 1, 2, 3, 1,
3, 2, 1, 1, 1, 3, 1, 3, 1, 3, 2, 2, 2, 1, 1, 3, 1, 1, 1, 1, 3, 1, 1,
2, 1, 3, 1, 3, 3, 1, 3, 1, 3, 1, 1, 3, 1, 1, 3, 2, 3, 1, 3, 1, 3, 1,
1, 2, 2, 1, 1, 3, 2, 3, 3, 1, 3, 2, 1, 1, 3])
k21predicted.shape
(590,)
Methods for classification prediction results
Confusion Matrix
from sklearn.metrics import confusion_matrix
k21cm = confusion_matrix(y_test,k21predicted)
k21cm
Out[105]:
array([[152, 29, 61],
[ 43, 58, 45],
[ 65, 37, 100]])
Class Distribution: 1 No-Use : 42.70% (629), 2 Long-Term : 22.61% (333) 3 Short-term : 34.69% (511)
Classification system has been trained to distinguish between No-Use, Long-Term and Short-Term.
The confusion matrix summarizes the results of testing the algorithm on the 590 test choices — 242 No-Use, 146 Long-Term, 202 Short-Term:
             Predicted
              1    2    3  | Total
Actual 1    152   29   61  |  242
       2     43   58   45  |  146
       3     65   37  100  |  202
Total       260  124  206  |  590
Discussion
The diagonal entries are the correct guesses: 152+58+100 = 310
Of these, 49.03% are 'No-Use', 18.71% 'Long-Term' and 32.26% 'Short-Term'.
Errors are represented by the values outside the diagonal.
By considering No-Use vs Other Choices:
               Predicted No-Use   Predicted Other | Total
Actual No-Use       TP 152             FN 90      |  242
Actual Other        FP 108             TN 240     |  348
Total                  260                330     |  590
TP: True Positive, TN: True Negative,
FP: False Positive (Type I Error), FN: False Negative (Type II Error).
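These one-vs-rest counts can be read mechanically off the full 3x3 matrix; a NumPy sketch:
import numpy as np
cm = np.array([[152, 29, 61],
               [ 43, 58, 45],
               [ 65, 37, 100]])
TP = np.diag(cm)                 # correct predictions per class
FN = cm.sum(axis=1) - TP         # actual class, predicted elsewhere
FP = cm.sum(axis=0) - TP         # predicted class, actually elsewhere
TN = cm.sum() - (TP + FN + FP)   # everything else
print(TP, FN, FP, TN)            # class 1: TP=152, FN=90, FP=108, TN=240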
The classifier made a total of 590 predictions.
Out of 590, it predicted 260 women as not using any contraceptive method and 330 as using one.
In reality, 242 were not using and 348 were using.
How often is the classifier correct (on the No-Use vs Other task)?
Accuracy of the Classifier: (TP+TN)/total = (152+240)/590 = 0.6644
How often is the classifier wrong?
Classification Error Rate: (FP+FN)/total = (108+90)/590 = 0.3356 (equivalent to 1-Accuracy)
When it is actually 'No-Use', how often does the classifier predict 'No-Use'? (Recall or Sensitivity)
Recall (True Positive Rate): TP/actual No-Use = 152/242 = 0.6281
When it is actually 'Other Choices', how often does the classifier predict 'No-Use'?
False Positive Rate: FP/actual Other = 108/348 = 0.3103
When it is actually 'Other Choices', how often does it predict 'Other Choices'?
Specificity: TN/actual Other = 240/348 = 0.6897 (equivalent to 1 minus the False Positive Rate)
When it predicts 'No-Use', how often is it correct?
Precision: TP/predicted No-Use = 152/260 = 0.5846
How often does 'Other Choices' actually occur in our sample?
Prevalence: 348/590 = 0.5898
How often does 'No-Use' actually occur in our sample?
Prevalence: 242/590 = 0.4102
Classification Report
from sklearn.metrics import classification_report
print classification_report(y_test, k21predicted)
Class Label precision recall f1-score support
1 0.58 0.63 0.61 242
2 0.47 0.40 0.43 146
3 0.49 0.50 0.49 202
avg / total 0.52 0.53 0.52 590
Out of 242 actual 'No-Use' cases, the system predicted 152 correctly (true positives); the remaining 90 were classified as Long-Term or Short-Term. And of the 260 predicted as 'No-Use', 108 actually belong to the other choices.
Precision measures quality; recall measures quantity.
High precision means the classifier returned substantially more relevant results than irrelevant ones, while high recall means it returned most of the relevant results.
‘No-Use’
Classification Error Rate: (FN+FP)/total = (90+108)/590 = 0.3356
Precision is 152/260=0.58, Recall is 152/242=0.63
F1 Score = 2x((0.58x0.63)/(0.58+0.63)) = 0.61
Here recall is higher than precision: the KNN classifier found most of the women not using any contraceptive method.
‘Long-Term’
Classification Error Rate: (FN+FP)/total = (88+66)/590 = 0.2610
Precision is 58/124=0.47, Recall is 58/146=0.40
F1 Score = 2x((0.47x0.40)/(0.47+0.40)) = 0.43
Here precision is higher than recall: the 'Long-Term' labels the classifier assigned were comparatively reliable, but it missed many of the actual long-term users.
‘Short-Term’
Classification Error Rate: (FN+FP)/total = (102+106)/590 = 0.3525
Precision is 100/206=0.49, Recall is 100/202=0.50
F1 Score = 2x((0.49x0.50)/(0.49+0.50)) = 0.49
Here recall is slightly higher than precision: the classifier returned most of the short-term users.
Accuracy = 310/590 = 0.5254
Overall, the KNN model returned most of the relevant results.
So, based on our objective, the KNN model predicts that, given their demographic and socio-economic characteristics, most of these women prefer 'No-Use' of any contraceptive method.
Decision Tree classifier
from sklearn.cross_validation import train_test_split
target = cmc['Contraceptive_Method_Used']
cmc_copy=cmc
cmc_copy = cmc_copy.drop('Contraceptive_Method_Used',1)
X_train, X_test, y_train, y_test = train_test_split(cmc_copy,target, test_size=0.4, random_state=0)
Fitting the model
from sklearn.tree import DecisionTreeClassifier
Dclf = DecisionTreeClassifier()
DTfit = Dclf.fit(X_train, y_train)
DTfit
Output:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
from sklearn.cross_validation import cross_val_score
print cross_val_score(DTfit, cmc_copy, target, cv=10, scoring='accuracy').mean()
Out: 0.467
Predicting the Class Samples
y_pre = DTfit.predict(X_test)
y_pre
Out[118]:
array([1, 1, 2, 3, 2, 2, 1, 3, 3, 2, 3, 2, 1, 2, 3, 2, 1, 1, 3, 3, 3, 1, 1,
1, 3, 1, 1, 1, 1, 3, 2, 1, 3, 1, 1, 1, 3, 3, 1, 2, 2, 1, 1, 1, 2, 3,
2, 1, 1, 3, 3, 2, 1, 1, 1, 3, 1, 3, 1, 3, 3, 3, 1, 1, 1, 1, 1, 2, 3,
2, 3, 3, 1, 2, 2, 2, 1, 3, 1, 1, 3, 3, 3, 2, 1, 3, 1, 1, 3, 1, 2, 1,
3, 3, 1, 3, 2, 2, 1, 2, 1, 3, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 3, 3,
1, 3, 1, 1, 1, 2, 3, 2, 1, 3, 3, 3, 2, 1, 3, 1, 1, 3, 1, 3, 3, 1, 3,
1, 1, 2, 1, 3, 3, 3, 1, 2, 1, 1, 1, 3, 1, 2, 2, 1, 2, 1, 1, 1, 1, 2,
1, 1, 3, 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 3, 2, 2, 2, 2, 2, 1, 1, 1, 1,
1, 3, 2, 1, 2, 3, 3, 3, 1, 3, 2, 2, 2, 1, 3, 1, 1, 1, 3, 1, 3, 3, 1,
1, 3, 2, 1, 3, 3, 1, 2, 3, 1, 3, 3, 1, 3, 3, 1, 3, 1, 2, 1, 2, 3, 1,
1, 2, 1, 1, 2, 2, 3, 2, 3, 3, 1, 2, 1, 1, 2, 1, 3, 2, 2, 2, 2, 2, 1,
3, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 3, 3, 3, 2, 2, 1, 2, 1,
3, 2, 3, 3, 3, 2, 1, 1, 2, 3, 3, 3, 1, 3, 3, 2, 3, 1, 1, 3, 1, 1, 2,
1, 1, 2, 2, 2, 1, 1, 3, 3, 2, 3, 3, 1, 2, 1, 2, 2, 3, 1, 3, 3, 1, 1,
2, 1, 2, 3, 1, 1, 1, 1, 1, 3, 3, 2, 2, 3, 1, 1, 2, 3, 2, 1, 3, 1, 1,
3, 1, 1, 2, 1, 1, 3, 1, 2, 1, 2, 1, 1, 1, 2, 3, 1, 2, 3, 2, 1, 1, 2,
2, 1, 1, 1, 1, 3, 3, 3, 1, 3, 1, 1, 1, 3, 3, 3, 1, 3, 1, 1, 1, 3, 2,
3, 2, 3, 2, 1, 3, 3, 3, 3, 1, 2, 3, 3, 3, 2, 1, 1, 3, 2, 3, 3, 3, 1,
2, 2, 1, 2, 1, 2, 3, 2, 1, 1, 3, 2, 2, 1, 3, 3, 1, 3, 1, 3, 3, 2, 1,
2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 2, 1, 1, 3, 1, 3, 1, 1, 3, 2, 1,
2, 3, 1, 1, 1, 1, 2, 3, 1, 1, 1, 3, 1, 2, 2, 1, 2, 1, 2, 3, 3, 2, 2,
2, 3, 2, 3, 3, 2, 2, 1, 1, 2, 2, 3, 1, 3, 3, 2, 3, 2, 2, 1, 1, 3, 2,
3, 3, 1, 2, 3, 1, 2, 1, 1, 3, 1, 3, 3, 3, 1, 3, 2, 3, 2, 3, 3, 3, 1,
3, 2, 1, 2, 2, 3, 1, 2, 3, 3, 1, 2, 3, 1, 1, 1, 2, 1, 3, 1, 1, 1, 2,
3, 2, 1, 1, 2, 2, 3, 3, 1, 3, 1, 1, 1, 1, 1, 3, 2, 1, 1, 2, 2, 3, 1,
1, 2, 2, 1, 1, 2, 1, 3, 3, 1, 3, 2, 1, 1, 3])
y_pre.shape
Output: (590,)
Predicting the probability of each class, which is the fraction of training samples of the same class in a leaf:
y_pre_prob = DTfit.predict_proba(X_test)
y_pre_prob
array([[ 1., 0., 0.], [ 1., 0., 0.],
[ 1., 0., 0.], [ 1., 0., 0.],
[ 0., 1., 0.], ..., [ 0., 0., 1.]])
Confusion Matrix
from sklearn.metrics import confusion_matrix
Dcm = confusion_matrix(y_test, y_pre)
Dcm
Out[122]:
array([[142, 44, 56],
[ 37, 61, 48],
[ 68, 55, 79]])
Class Distribution: 1 No-Use : 42.70% (629), 2 Long-Term : 22.61% (333) 3 Short-term : 34.69% (511)
The resulting confusion matrix:
             Predicted
              1    2    3  | Total
Actual 1    142   44   56  |  242
       2     37   61   48  |  146
       3     68   55   79  |  202
Total       247  160  183  |  590
The diagonal entries are the correct guesses: 142+61+79 = 282
Of these, 50.35% are 'No-Use', 21.63% 'Long-Term' and 28.01% 'Short-Term'.
Errors are represented by the values outside the diagonal.
We can see from the matrix that the system has particular trouble separating 'No-Use' from 'Short-Term' (68 actual Short-Term cases were predicted as No-Use, and 56 No-Use cases as Short-Term), while it separates 'No-Use' from 'Long-Term' somewhat better.
Classification Report
from sklearn.metrics import classification_report
print classification_report(y_test, y_pre)
Output:
Class precision recall f1-score support
1 0.57 0.59 0.58 242
2 0.38 0.42 0.40 146
3 0.43 0.39 0.41 202
avg / total 0.48 0.48 0.48 590
Out of 242 actual 'No-Use' cases, the system predicted 142 correctly (true positives); the remaining 100 were classified as Long-Term or Short-Term.
Classification Error Rate: (590-282)/590 = 308/590 = 0.5220
‘No-Use’
Precision is 142/247=0.57, Recall is 142/242=0.59
F1Score=2x((0.57x0.59)/(0.57+0.59))=0.58
Here recall is higher than precision: the Decision Tree classifier found most of the women not using any contraceptive method.
‘Long-Term’
Precision is 61/160=0.38, Recall is 61/146=0.42
F1Score=2x((0.38x0.42)/(0.38+0.42))=0.40
Here recall is higher than precision: the Decision Tree classifier returned most of the actual long-term users.
‘Short-Term’
Precision is 79/183=0.43, Recall is 79/202=0.39
F1Score=2x((0.43x0.39)/(0.43+0.39))=0.41
Here precision is higher than recall: the 'Short-Term' labels the Decision Tree assigned were comparatively reliable, but it missed many of the actual short-term users.
Accuracy = 282/590 = 0.48
Averaged over the classes, precision and recall are essentially equal (0.48 each), so the Decision Tree balances quality and coverage, but at a lower level than KNN.
The problem is to predict the current contraceptive method choice (no use, long-term methods, or short-term methods) of a woman based on her demographic and socio-economic characteristics.
Decision Tree Visualization
from sklearn import tree
from os import system
dtree = tree.DecisionTreeClassifier()
clf = dtree.fit(cmc_copy, target)
# Write the tree in Graphviz .dot format, then render it to PNG
dotfile = open("CMCtree.dot", 'w')
tree.export_graphviz(clf, out_file=dotfile)
dotfile.close()
system("dot -Tpng CMCtree.dot -o CMCtree.png")
Comparing Decision Tree Model with KNN model
Accuracy
KNN 0.53
DT 0.48
KNN classifies more test instances correctly, so KNN performs better.
Classification Error Rate
KNN 0.4746
DT 0.5220
KNN mislabels a smaller share of the observations in the test set.
So, KNN performs better.
Classification Report
Class Label precision recall f1-score support
KNN1 0.58 0.63 0.61 242
DT1 0.57 0.59 0.58 242
KNN2 0.47 0.40 0.43 146
DT2 0.38 0.42 0.40 146
KNN3 0.49 0.50 0.49 202
DT3 0.43 0.39 0.41 202
Precision and recall trade off differently per class for the two models, so the F1 score — their harmonic mean — is the better single figure for comparing classifier performance per class.
'No-Use'
The F1 score is higher for KNN: KNN classifies more 'No-Use' instances correctly than the Decision Tree.
'Long-Term'
The F1 score is higher for KNN: KNN classifies more 'Long-Term' instances correctly than the Decision Tree.
'Short-Term'
The F1 score is higher for KNN: KNN classifies more 'Short-Term' instances correctly than the Decision Tree.
KNN performs better.
Both models predict that, given their demographic and socio-economic characteristics, most of these women prefer 'No-Use' of any contraceptive method.
import pandas
import matplotlib.pyplot as plt
from sklearn import cross_validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import cross_val_score
array = cmc.values
models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('DTC', DecisionTreeClassifier()))
results = []
names = []
for name, model in models:
    cv_results = cross_val_score(model, cmc_copy, target, cv=10, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
Output: mean accuracy (standard deviation)
KNN: 0.519301 (0.040072)
DTC: 0.465609 (0.041911)
fig = plt.figure()
fig.suptitle('Model Comparison based on Accuracy')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
Boxplot shows the spread of accuracy scores across each classifier
These results suggest the KNN model is the one worth further study on this problem.
Nearest Centroid classifier
Each class is represented by its centroid, with test samples classified to the class with the nearest centroid.
A nearest centroid classifier or nearest prototype classifier is a classification model that assigns to observations the label of the class of training samples whose mean (centroid) is closest to the observation.
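To make the rule concrete, here is a minimal NumPy sketch of the centroid classifier (Euclidean metric, no shrinkage), assuming the X_train/y_train split generated just below:
import numpy as np

# One centroid (feature-wise mean) per class
classes = np.unique(y_train)
centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])

def predict_nc(X):
    # Distance from every sample to every centroid; pick the nearest
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]

y_hat = predict_nc(np.asarray(X_test))  # should essentially match NCfit.predict(X_test)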
Generating Train and Test Split
from sklearn.cross_validation import train_test_split
target = cmc['Contraceptive_Method_Used']
cmc_copy=cmc
cmc_copy = cmc_copy.drop('Contraceptive_Method_Used',1)
X_train, X_test, y_train, y_test = train_test_split(cmc_copy,target, test_size=0.4, random_state=0)
Selecting the Classifier
from sklearn.neighbors import NearestCentroid
NCclf = NearestCentroid()
Fitting the Model
NCfit = NCclf.fit(X_train, y_train)
NCfit
Output: NearestCentroid(metric='euclidean', shrink_threshold=None)
Predicting the Unseen Data
y_pre = NCfit.predict(X_test)
y_pre
array([3, 2, 2, 3, 2, 2, 1, 3, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 3, 2, 3, 3, 1,
2, 3, 2, 2, 3, 2, 3, 3, 2, 1, 1, 3, 2, 3, 2, 2, 2, 3, 2, 2, 2, 3, 3,
3, 2, 1, 3, 3, 3, 3, 2, 3, 1, 2, 3, 3, 2, 1, 3, 3, 2, 3, 2, 2, 2, 3,
2, 3, 3, 3, 3, 1, 2, 3, 3, 3, 3, 2, 3, 2, 2, 2, 3, 3, 1, 3, 2, 2, 3,
3, 3, 2, 3, 2, 2, 3, 2, 2, 2, 2, 3, 3, 2, 3, 1, 2, 2, 1, 3, 2, 3, 3,
2, 3, 1, 2, 2, 3, 3, 2, 3, 2, 3, 2, 3, 2, 2, 2, 3, 3, 3, 3, 3, 2, 3,
1, 3, 2, 2, 1, 3, 2, 3, 3, 3, 2, 2, 3, 3, 2, 2, 2, 3, 2, 2, 2, 2, 3,
2, 3, 2, 2, 2, 2, 3, 3, 2, 3, 3, 2, 2, 3, 1, 3, 3, 2, 2, 1, 2, 3, 3,
3, 2, 3, 3, 2, 2, 3, 3, 2, 3, 2, 3, 2, 2, 3, 3, 2, 2, 3, 3, 3, 1, 3,
2, 3, 3, 2, 1, 3, 1, 2, 2, 2, 2, 2, 2, 3, 3, 2, 1, 3, 2, 1, 2, 3, 2,
2, 3, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 3, 2, 2, 3, 2, 3, 3, 2, 3, 2, 3,
3, 2, 2, 3, 2, 1, 2, 2, 3, 2, 2, 3, 2, 2, 2, 1, 2, 3, 2, 2, 3, 2, 3,
3, 3, 3, 2, 3, 2, 3, 2, 3, 2, 3, 3, 2, 2, 1, 2, 2, 3, 3, 2, 2, 3, 2,
3, 3, 2, 3, 2, 2, 3, 2, 2, 2, 3, 3, 2, 1, 3, 3, 2, 3, 2, 2, 3, 3, 3,
2, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 3, 2, 2, 3,
3, 3, 2, 2, 3, 2, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 3, 2, 3, 2, 2, 2, 2,
3, 2, 2, 3, 2, 3, 2, 2, 1, 1, 2, 2, 3, 3, 2, 3, 2, 2, 2, 2, 2, 3, 2,
3, 3, 2, 3, 2, 2, 1, 3, 3, 3, 2, 2, 2, 3, 2, 1, 2, 2, 2, 2, 3, 3, 3,
2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 1, 2, 3, 2, 2, 2, 2, 2, 3, 3, 3, 3, 2,
3, 3, 3, 2, 3, 2, 2, 3, 2, 3, 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 2, 1, 2,
2, 3, 2, 3, 2, 2, 3, 3, 2, 3, 3, 1, 3, 2, 3, 2, 2, 3, 2, 3, 3, 2, 2,
3, 3, 2, 3, 2, 2, 2, 3, 3, 1, 3, 3, 2, 3, 3, 3, 3, 2, 2, 2, 2, 3, 3,
3, 1, 3, 3, 3, 1, 2, 3, 3, 3, 3, 2, 2, 3, 3, 3, 3, 3, 2, 3, 3, 2, 2,
3, 2, 3, 1, 2, 2, 2, 3, 3, 3, 2, 3, 2, 2, 3, 3, 3, 2, 3, 1, 2, 2, 2,
2, 3, 3, 2, 1, 3, 2, 3, 2, 3, 3, 1, 3, 2, 3, 1, 2, 3, 2, 2, 2, 1, 2,
2, 2, 2, 2, 3, 3, 2, 3, 2, 2, 1, 2, 3, 2, 3])
y_pre.shape
(590,)
Accuracy
from sklearn.cross_validation import cross_val_score
cross_val_score(NCfit, cmc_copy, target, cv=10, scoring='accuracy')
array([ 0.32214765, 0.39864865, 0.34459459, 0.41496599, 0.39455782,
0.38095238, 0.40136054, 0.36054422, 0.29931973, 0.4109589 ])
print cross_val_score(NCfit, cmc_copy, target, cv=10, scoring='accuracy').mean()
output: 0.3728
Confusion Matrix
from sklearn.metrics import confusion_matrix
Ccm = confusion_matrix(y_test, y_pre)
Ccm
array([[ 18, 113, 111],
[ 12, 88, 46],
[ 12, 73, 117]])
Accuracy = 223/ 590 =0.378
Classification Error Rate: (590-223)/590 = 367/590 = 0.622
Classification Report
from sklearn.metrics import classification_report
print classification_report(y_test, y_pre)
precision recall f1-score support
1 0.43 0.07 0.13 242
2 0.32 0.60 0.42 146
3 0.43 0.58 0.49 202
avg / total 0.40 0.38 0.32 590
'No-Use'
Here precision (0.43) is far higher than recall (0.07): the few cases the Nearest Centroid Classifier labels 'No-Use' are often correct, but it misses almost all of the actual non-users.
‘Long-Term’
Here Recall is higher than precision, that is the Nearest Centroid Classifier returned most of the relevant results.
‘Short-Term’
Here recall is higher than precision: the Nearest Centroid Classifier returned most of the actual short-term users.
Based on the F1 scores, NCC classifies 'Short-Term' best; it predicts short-term methods as the most common choice among these women.
Comparing the KNN, DT and NCC Models
Accuracy
KNN 0.53
DT 0.48
NCC 0.38
KNN classifies the most test instances correctly, so KNN performs best.
Classification Error Rate
KNN 0.4746
DT 0.5220
NCC 0.6220
KNN mislabels the smallest share of the observations in the test set.
So, KNN performs best.
Classification Report
Class Label precision recall f1-score support
KNN1 0.58 0.63 0.61 242
DT1 0.57 0.59 0.58 242
NCC1 0.43 0.07 0.13 242
KNN2 0.47 0.40 0.43 146
DT2 0.38 0.42 0.40 146
NCC2 0.32 0.60 0.42 146
KNN3 0.49 0.50 0.49 202
DT3 0.43 0.39 0.41 202
NCC3 0.43 0.58 0.49 202
'No-Use'
The F1 score is highest for KNN: KNN classifies more 'No-Use' instances correctly than the Decision Tree and NCC.
'Long-Term'
The F1 score is highest for KNN: KNN classifies more 'Long-Term' instances correctly than the Decision Tree and NCC.
'Short-Term'
KNN and NCC tie on F1 (0.49), both ahead of the Decision Tree.
Overall, KNN performs best.
KNN and the Decision Tree predict that, given their demographic and socio-economic characteristics, most of these women prefer 'No-Use' of any contraceptive method, while NCC leans towards 'Short-Term'.
import pandas
import matplotlib.pyplot as plt
from sklearn import cross_validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.cross_validation import cross_val_score
array = cmc.values
models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('DTC', DecisionTreeClassifier()))
models.append(('NCC', NearestCentroid()))
results = []
names = []
for name, model in models:
    cv_results = cross_val_score(model, cmc_copy, target, cv=10, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
Output: mean accuracy (standard deviation)
KNN: 0.519301 (0.040072)
DTC: 0.477132 (0.034676)
NCC: 0.372805 (0.037641)
fig = plt.figure()
fig.suptitle('Model Comparison based on Accuracy')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
Boxplot shows the spread of accuracy scores across each classifier
These results again suggest the KNN model is the most promising for further study on this problem.
Conclusion
Data modelling is a core step in the data science process. In this assignment we understood, developed and implemented the appropriate steps, in IPython, to complete the corresponding tasks.
This gave practical experience with the typical fifth and sixth steps of the data science process: data modelling, and presentation and automation.