ABSTRACT
Dataset Title: Contraceptive Method Choice (Source: http://archive.ics.uci.edu/ml/machine-learning-databases/cmc/cmc.data)
Dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey.
Data Set Information:
This dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The samples are married women who were either not pregnant or did not know whether they were pregnant at the time of interview.
The problem is to predict the current contraceptive method choice (no use, long-term methods, or short-term methods) of a woman based on her demographic and socio-economic characteristics.
Number of Instances: 1473
Number of Attributes: 10 (including the class attribute)
Missing Attribute Values: None
Attributes of Dataset:
1. Wife's age (numerical)
2. Wife's education (categorical) 1=low, 2, 3, 4=high
3. Husband's education (categorical) 1=low, 2, 3, 4=high
4. Number of children ever born (numerical)
5. Wife's religion (binary) 0=Non-Islam, 1=Islam
6. Wife's now working? (binary) 0=Yes, 1=No
7. Husband's occupation (categorical) 1, 2, 3, 4
8. Standard-of-living index (categorical) 1=low, 2, 3, 4=high
9. Media exposure (binary) 0=Good, 1=Not good
10. Contraceptive method used (class attribute) 1=No-use, 2=Long-term, 3=Short-term
Data Retrieving - PremGeorge
1. Importing Packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import urllib2
1.a Loading CSV File from web
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/cmc/cmc.data'
set1 = urllib2.Request(url)
cmc_p = urllib2.urlopen(set1)
Assigning the dataset
cmc = pd.read_csv(cmc_p, sep=',', decimal='.', header=None, names=['Wife_Age', 'Wife_Education', 'Husband_Education', 'Number_Of_Children_Ever_Born', 'Wife_Religion', 'Wife_Now_Working', 'Husband_Occupation', 'Standard_Of_Living_Index', 'Media_Exposure','Contraceptive_Method_Used'])
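For reference, a minimal Python 3 sketch of the same load (urllib2 does not exist there; this assumes a current pandas, which can fetch the CSV directly from the URL):
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/cmc/cmc.data'
cols = ['Wife_Age', 'Wife_Education', 'Husband_Education',
        'Number_Of_Children_Ever_Born', 'Wife_Religion', 'Wife_Now_Working',
        'Husband_Occupation', 'Standard_Of_Living_Index', 'Media_Exposure',
        'Contraceptive_Method_Used']
cmc = pd.read_csv(url, header=None, names=cols)  # same frame as above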
Listing Column Names : cmc.columns
Listing the Column values : cmc.head()
Checking the Data Types: cmc.dtypes
Checking Datatypes : cmc['Wife_Age'].dtype
Checking the attributes :
cmc['Wife_Age'].describe()
cmc['Wife_Age'].unique()
cmc['Wife_Age'].value_counts()
Summary on Data Preparation:
Data cleaned and checked for:
1. Missing values
2. Negative values
3. Impossible values
4. Extra whitespace
5. Typos
A minimal sketch of these checks follows this list.
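A sketch of such checks, assuming the frame loaded above (in this dataset every column parses as an integer):
# Missing values: the dataset documentation reports none
print(cmc.isnull().sum())
# Negative or impossible values in the numeric columns
print((cmc['Wife_Age'] <= 0).sum())
print((cmc['Number_Of_Children_Ever_Born'] < 0).sum())
# Extra whitespace and typos can only occur in string columns;
# cmc.dtypes shows every column was parsed as an integer, ruling both out
print(cmc.dtypes)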
Data Exploration
Wife's age :
cmc['Wife_Age'].plot(kind='hist',bins=10,color='red')
plt.title("Wife's Age Distribution")
plt.xlabel("Wife's Age")
plt.ylabel("Number of wives in the particular age interval")
plt.show()
Number of children ever born :
cmc['Number_Of_Children_Ever_Born'].plot(kind='hist',bins=10,color='brown')
plt.title("Distribution of Number of Children Ever Born")
plt.xlabel("Number of Children Ever Born ")
plt.ylabel("Number of Times the Children Born for each category")
plt.show()
Wife's education :
x_WEaxis=['1 Low', '2','3','4 High']
y_WEaxis=[152,334,410,577]
freqWE=np.arange(len(x_WEaxis))
plt.bar(freqWE,y_WEaxis,align='center', color ='purple')
plt.xticks(freqWE,x_WEaxis)
plt.ylabel("Number of wives per education level")
plt.title("BAR CHART: Wife's Education Frequency Distribution")
plt.xlabel("Wife's Education, 1 being Low and 4 Highly Educated")
plt.show()
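The y-axis counts above are typed in by hand; as a sketch, the same chart can also be derived from the frame itself, so the numbers cannot drift out of sync with the data (the same pattern applies to the other categorical charts below):
# value_counts gives the frequency of each education level;
# sort_index orders the bars 1 (low) to 4 (high)
we_counts = cmc['Wife_Education'].value_counts().sort_index()
we_counts.plot(kind='bar', color='purple')
plt.title("BAR CHART: Wife's Education Frequency Distribution")
plt.xlabel("Wife's Education, 1 being Low and 4 Highly Educated")
plt.ylabel("Number of wives per education level")
plt.show()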
Husband's education :
x_HEaxis=['1 Low', '2','3','4 High']
y_HEaxis=[44,178,352,899]
freqHE=np.arange(len(x_HEaxis))
plt.bar(freqHE,y_HEaxis,align='center', color ='orange')
plt.xticks(freqHE,x_HEaxis)
plt.ylabel("Number of husbands per education level")
plt.title("BAR CHART: Husband's Education Frequency Distribution")
plt.xlabel("Husband's Education, 1 being Low and 4 Highly Educated")
plt.show()
Husband's Occupation :
cmc['Husband_Occupation'].value_counts().plot(kind='pie',autopct='%.2f')
plt.title("PIE-Chart: 'Husband’s_Occupation' Percentage Distribution")
plt.show()
Standard-of-living index show :
x_SIaxis=['1 Low', '2','3','4 High']
y_SIaxis=[129,229,431,684]
freqSI=np.arange(len(x_SIaxis))
plt.bar(freqSI,y_SIaxis,align='center',color ='pink')
plt.xticks(freqSI,x_SIaxis)
plt.ylabel("Standard of Living Index on Category")
plt.title("BAR CHART: Standard of Living Index Frequency Distribution")
plt.xlabel("Standard of Living Index 1 being Low and 4 is High Standard")
plt.show()
Wife's Religion :
x_WRaxis=['0 Non Islam','1 Islam']
y_WRaxis=[220,1253]
freqWR=np.arange(len(x_WRaxis))
plt.bar(freqWR,y_WRaxis,align='center',color=['yellow', 'red'])
plt.xticks(freqWR,x_WRaxis)
plt.ylabel("Number of wives per religion category")
plt.title("BAR CHART: Wife's Religion Frequency Distribution")
plt.xlabel("Wife's Religion 0 is Non-Islamic and 1 being Islamic")
plt.show()
Wife Now Working :
x_WWaxis=['0 Working','1 Not Working']
y_WWaxis=[369,1104]
freqWW=np.arange(len(x_WWaxis))
plt.bar(freqWW,y_WWaxis,align='center',color=['violet','green'])
plt.xticks(freqWW,x_WWaxis)
plt.ylabel("Number of wives by working status")
plt.title("BAR CHART: Wife Working or Not Frequency Distribution")
plt.xlabel("Wife's Work 0 is Working and 1 being Not-Working")
plt.show()
Media Exposure :
x_MEaxis=['0 Good','1 Not Good']
y_MEaxis=[1364,109]
freqME=np.arange(len(x_MEaxis))
plt.bar(freqME,y_MEaxis,align='center',color=['purple', 'yellow'])
plt.xticks(freqME,x_MEaxis)
plt.ylabel('Media Exposure')
plt.title("BAR CHART: Media Exposure Frequency Distribution")
plt.xlabel("Media Exposure 0 is Good Exposure and 1 being Not-Good")
plt.show()
Contraceptive Method Used :
x_cmcaxis=['1 No Use','2 Long Term', '3 Short Term']
y_cmcaxis=[629,333,511]
freqcmc=np.arange(len(x_cmcaxis))
plt.bar(freqcmc,y_cmcaxis,align='center',color =['green','red','blue'])
plt.xticks(freqcmc,x_cmcaxis)
plt.ylabel('Contraceptive Method Used')
plt.title("BAR CHART: Contraceptive Method Used Frequency Distribution")
plt.xlabel("Contraceptive Method Used 1 being Not Used, 2 Long Term used and 3 Short Term used")
plt.show()
Relationships between attributes
Husband's education vs Wife's education
EducationLevel=['4 High','3','2','1 Low']
Husband=[899,352,178,44]
Wife=[577,410,334,152]
bar_width=0.4
x=np.arange(len(EducationLevel))
plt.bar(x,Husband,bar_width,color='red',label="Husband's Education")
plt.bar(x+bar_width,Wife,bar_width,color='pink',label="Wife's Education")
plt.legend()
plt.xlabel("Husband's vs Wife's Education 1:Low and 4:Highly Education")
plt.ylabel('Number of Educated peoples')
plt.title("Husband's Education Vs Wife's Education")
plt.xticks(x+bar_width,EducationLevel)
plt.tight_layout()
plt.show()
Husband's occupation vs Standard-of-living index
Grouping the Standard of Living Index for each category of Husband’s Occupation
SOLIndexHO = cmc.groupby('Husband_Occupation').Standard_Of_Living_Index.value_counts().sort_index()
SOLIndexHO
SOLIndexHO.unstack()
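Since the unstacked result is already a 4x4 frequency table (occupation level by SOL index), pandas can draw the grouped bars directly; a sketch:
# Rows (Husband_Occupation) become bar groups, columns (SOL index) the bars
SOLIndexHO.unstack().plot(kind='bar')
plt.xlabel("Husband's Occupation in different Levels 1 to 4")
plt.ylabel('Standard of Living Index Frequency')
plt.show()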
BARCHART
HO=['1','2','3','4']
SOLIHO1=[7,40,77,5]
SOLIHO2=[29,75,121,4]
SOLIHO3=[107,121,198,5]
SOLIHO4=[293,189,189,13]
bar_width=0.2
x=np.arange(len(HO))
plt.bar(x,SOLIHO1,bar_width,color='red',label="SOL Index 1")
plt.bar(x+bar_width,SOLIHO2,bar_width,color='pink',label="SOL Index 2")
plt.bar(x+bar_width+bar_width,SOLIHO3,bar_width,color='green',label="SOL Index 3")
plt.bar(x+bar_width+bar_width+bar_width,SOLIHO4,bar_width,color='orange',label="SOL Index 4")
plt.legend()
plt.xlabel("Husband's Occupation in different Levels 1 to 4")
plt.ylabel('Standard of Living Index Frequency')
plt.title("Standard of Living Index variation over Husband Occupation Category")
plt.xticks(x+bar_width+bar_width,HO)
plt.tight_layout()
plt.show()
Wife's now working? vs Standard-of-living index
Grouping the Standard of Living Index for Wife’s working condition
SOLIndexWW = cmc.groupby('Wife_Now_Working').Standard_Of_Living_Index.value_counts().sort_index()
SOLIndexWW.unstack()
BARCHART
WW=['0 Working','1 Not Working']
SOLIWW1=[20,109]
SOLIWW2=[57,172]
SOLIWW3=[98,333]
SOLIWW4=[194,490]
bar_width=0.2
x=np.arange(len(WW))
plt.bar(x,SOLIWW4,bar_width,color='brown',label="SOL Index 4")
plt.bar(x+bar_width,SOLIWW3,bar_width,color='purple',label="SOL Index 3")
plt.bar(x+bar_width+bar_width,SOLIWW2,bar_width,color='yellow',label="SOL Index 2")
plt.bar(x+bar_width+bar_width+bar_width,SOLIWW1,bar_width,color='orange',label="SOL Index 1")
plt.legend()
plt.xlabel('Wife Working Status')
plt.ylabel('Standard of Living Index Frequency')
plt.title('Standard of Living Index variation over Wife Working Status')
plt.xticks(x+bar_width+bar_width,WW)
plt.tight_layout()
plt.show()
Wife's education vs Standard-of-living index
Grouping Wife’s Education with Standard of Living Index
SOLIndexWE = cmc.groupby('Wife_Education').Standard_Of_Living_Index.value_counts().sort_index()
SOLIndexWE.unstack()
BARCHART
WE=['4 High','3','2','1 Low']
SOLIWE1=[8,37,55,29]
SOLIWE2=[38,81,72,38]
SOLIWE3=[145,141,100,45]
SOLIWE4=[386,151,107,40]
bar_width=0.2
x=np.arange(len(WE))
plt.bar(x,SOLIWE4,bar_width,color='red',label="SOL Index 4")
plt.bar(x+bar_width,SOLIWE3,bar_width,color='pink',label="SOL Index 3")
plt.bar(x+bar_width+bar_width,SOLIWE2,bar_width,color='green',label="SOL Index 2")
plt.bar(x+bar_width+bar_width+bar_width,SOLIWE1,bar_width,color='orange',label="SOL Index 1")
plt.legend()
plt.xlabel("Wife Education")
plt.ylabel('Standard of Living Index Frequency')
plt.title("Standard of Living Index variation over Wife Education")
plt.xticks(x+bar_width+bar_width,WE)
plt.tight_layout()
plt.show()
Wife's religion vs Contraceptive method used
Grouping the Wife’s Religion with their Contraceptive Method Use
CMCWR = cmc.groupby('Wife_Religion').Contraceptive_Method_Used.value_counts().sort_index()
CMCWR.unstack()
BARCHART
WR=['0 Non-Islam','1 Islam']
CMCWR1=[75,554]
CMCWR2=[76,257]
CMCWR3=[69,442]
bar_width=0.3
x=np.arange(len(WR))
plt.bar(x,CMCWR1,bar_width,color='brown',label="No-Use")
plt.bar(x+bar_width,CMCWR2,bar_width,color='purple',label="Long-Term")
plt.bar(x+bar_width+bar_width,CMCWR3,bar_width,color='yellow',label="Short-Term")
plt.legend()
plt.xlabel('Wife Religious Status')
plt.ylabel('Contraceptive Method Use Frequency')
plt.title('Contraceptive Method Use with Wife Religious Status')
plt.xticks(x+bar_width+bar_width,WR)
plt.tight_layout()
plt.show()
Wife's religion vs Media Exposure
Grouping the Wife’s Religion with their Media Exposure
WRME = cmc.groupby('Wife_Religion').Media_Exposure.value_counts().sort_index()
WRME.unstack()
BARCHART
WR=['0 Non-Islam','1 Islam']
MEWR0=[212,1152]
MEWR1=[8,101]
bar_width=0.3
x=np.arange(len(WR))
plt.bar(x,MEWR0,bar_width,color='blue',label="Good")
plt.bar(x+bar_width,MEWR1,bar_width,color='yellow',label="Not Good")
plt.legend()
plt.xlabel('Wife Religious Status')
plt.ylabel('Media Exposure Frequency')
plt.title('Media Exposure with Wife Religious Status')
plt.xticks(x+bar_width,WR)
plt.tight_layout()
plt.show()
Wife's now working? vs Number of children ever born
Grouping the Number of Children Born against Wife’s Working Condition
NCBWW = cmc.groupby('Wife_Now_Working').Number_Of_Children_Ever_Born.value_counts().sort_index()
NCBWW.unstack()
HISTOGRAM
cmc.groupby('Wife_Now_Working').Number_Of_Children_Ever_Born.plot(kind="hist", alpha=0.5)
plt.xlabel("Number of Children Ever Born")
plt.ylabel('Frequency')
plt.title("Number of Children Born against Wife Working Status (0 Working, 1 Not Working)")
plt.legend()
plt.show()
Contraceptive Method Used vs Number of children ever born
Grouping the Number of Children Born against Contraceptive Method Used
NCBCMC = cmc.groupby('Contraceptive_Method_Used').Number_Of_Children_Ever_Born.value_counts().sort_index()
NCBCMC.unstack()
BOXPLOT
cmc.boxplot(column='Number_Of_Children_Ever_Born',by='Contraceptive_Method_Used')
plt.xlabel("Contraceptive Method Used 1 No-Use, 2 Long-Term, 3 Short-Term")
plt.ylabel('Number of Children Ever Born')
plt.title("Number of Children Born against Contraceptive Method Used")
plt.show()
Contraceptive Method Used vs Standard-of-living index
Grouping the Contraceptive Method Used for different Standard of Living Index
SOLCMC = cmc.groupby('Standard_Of_Living_Index').Contraceptive_Method_Used.value_counts().sort_index()
SOLCMC.unstack()
BARCHART
SCMU=['4 High','3','2','1 Low']
SOLICMU1=[248,184,117,80]
SOLICMU2=[204,90,30,9]
SOLICMU3=[232,157,82,40]
bar_width=0.2
x=np.arange(len(SCMU))
plt.bar(x, SOLICMU1,bar_width,color='blue',label="No-Use")
plt.bar(x+bar_width, SOLICMU2,bar_width,color='red',label="Long-Term")
plt.bar(x+bar_width+bar_width,SOLICMU3,bar_width,color='yellow',label="Short-Term")
plt.legend()
plt.xlabel(' Standard of Living Index')
plt.ylabel('Contraceptive Methods Used Frequency')
plt.title('Contraceptive Methods Used over different Standard of Living')
plt.xticks(x+bar_width+bar_width,SCMU)
plt.tight_layout()
plt.show()
Wife’s Education vs Contraceptive Method Used
Grouping the Contraceptive Method Used based on Wife’s Education
WECMC = cmc.groupby('Wife_Education').Contraceptive_Method_Used.value_counts().sort_index()
WECMC.unstack()
BARCHART
CMUWE=['4 High','3','2','1 Low']
CMUWE1=[175,175,176,103]
CMUWE2=[207,80,37,9]
CMUWE3=[195,155,121,40]
bar_width=0.2
x=np.arange(len(CMUWE))
plt.bar(x, CMUWE1,bar_width,color='green',label="No-Use")
plt.bar(x+bar_width, CMUWE2,bar_width,color='purple',label="Long-Term")
plt.bar(x+bar_width+bar_width, CMUWE3,bar_width,color='pink',label="Short-Term")
plt.legend()
plt.xlabel('Wife Education 1 being Low and 4 High')
plt.ylabel('Contraceptive Methods Used Frequency')
plt.title('Contraceptive Methods Used over Wife Education')
plt.xticks(x+bar_width+bar_width,CMUWE)
plt.tight_layout()
plt.show()
Contraceptive Method Used vs Wife’s Age
Grouping the Contraceptive Method Used vs different age groups of Wives.
WACMC = cmc.groupby('Contraceptive_Method_Used').Wife_Age.value_counts().sort_index()
WACMC.unstack()
BOXPLOT
cmc.boxplot(column='Wife_Age',by='Contraceptive_Method_Used')
plt.xlabel("Contraceptive Method Used 1 No-Use, 2 Long-Term, 3 Short-Term")
plt.ylabel('Wife Age')
plt.title("Wife Age against Contraceptive Method Used")
plt.show()
Data Modelling
Engineering Feature and Selecting a Model
The Question is to predict the current contraceptive method choice (no use, long-term methods, or short-term methods) of a woman based on her demographic and socio-economic characteristics.
The Contraceptive Method Choice dataset poses a classification task: distinguishing three types of choices (No-Use, Long-Term and Short-Term).
Class Distribution: 1 No-Use: 42.70%, 2 Long-Term: 22.61%, 3 Short-Term: 34.69%
Loading the Data
Choosing the Classification Metric: "Accuracy"
The class distribution is reasonably balanced across the three choices, so accuracy is a sensible metric here; on a heavily skewed distribution it would mostly reward models that predict the most frequent class.
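As a sanity check, the accuracy of a constant classifier that always predicts the most frequent class ('No-Use') sets the floor any real model should beat; a sketch:
# Majority-class baseline: the largest class share, about 0.427
baseline = cmc['Contraceptive_Method_Used'].value_counts(normalize=True).max()
print(baseline)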
Generating Train/Test Set
from sklearn.cross_validation import train_test_split
cmc.shape
(1473, 10)
Selecting Feature Columns
cmc_copy=cmc
cmc_copy = cmc_copy.drop('Contraceptive_Method_Used',1)
cmc_copy.shape
Out[7]: (1473, 9)
Selecting Target Columns with Class Variable
target = cmc['Contraceptive_Method_Used']
target.shape
(1473,)
np.unique(target)
Out[90]: array([1, 2, 3])
X_train, X_test, y_train, y_test = train_test_split(cmc_copy,target, test_size=0.4, random_state=0)
X_train.shape,y_train.shape
Out[13]: ((883, 9), (883,))
X_test.shape,y_test.shape
Out[14]: ((590, 9), (590,))
Checking classification accuracy of KNN with K=5 by the train/test split method
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform')
y_pred = knn.predict(X_test)
print metrics.accuracy_score(y_test, y_pred)
Output: 0.4763
Obtaining classification accuracy by cross_val_score (Score Method)
from sklearn.cross_validation import cross_val_score
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, cmc_copy, target, cv=10, scoring='accuracy')
scores
Output: array([ 0.56375839, 0.46621622, 0.49324324, 0.50340136, 0.49659864,
0.52380952, 0.57823129, 0.54421769, 0.56462585, 0.45890411])
scores.mean()
Output: 0.5193
Discussion
Comparing Cross Validation and Train/Test Split
Accuracy on CV = 0.5193, on train/test split = 0.4763
Cross-validation gives a more reliable estimate of out-of-sample accuracy than a single train/test split.
Obtaining classification accuracy with different Scoring Parameter
from sklearn import metrics
from sklearn.cross_validation import cross_val_predict
predicted = cross_val_predict(knn, cmc_copy, target, cv=10)
metrics.precision_score(target, predicted)
Output: 0.5122
metrics.recall_score(target, predicted)
Output: 0.5193
metrics.f1_score(target, predicted)
Output: 0.5128
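These calls work here because this (older) scikit-learn defaults multiclass precision/recall/F1 to a weighted average (note that the recall above equals the cross-validated accuracy, as weighted recall does). Newer scikit-learn requires the averaging to be spelled out; a sketch assuming a current version:
from sklearn import metrics
# With 3 classes the average must be chosen explicitly
print(metrics.precision_score(target, predicted, average='weighted'))
print(metrics.recall_score(target, predicted, average='weighted'))
print(metrics.f1_score(target, predicted, average='weighted'))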
Performance Accuracy
Obtaining Performance Score by train_test_split : 0.4763
Obtaining Performance Score by cross_val_score (Score Method) : 0.5193
Obtaining Performance Score by Scoring Parameter
Precision Score : 0.5122
Recall Score : 0.5193
F1 Score : 0.5128
Cross-validation gives the more reliable estimate of out-of-sample accuracy; note that the (weighted) recall equals the cross-validated accuracy here.
k-folds Cross Validation
Computing the score 10 consecutive times
import numpy as np
from sklearn import datasets, svm
X_folds = np.array_split(cmc_copy, 10)
y_folds = np.array_split(target, 10)
kfclf = svm.SVC(kernel='linear', C=1)
scores = list()
for k in range(10):
    X_train = list(X_folds)
    X_test = X_train.pop(k)
    X_train = np.concatenate(X_train)
    y_train = list(y_folds)
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    scores.append(kfclf.fit(X_train, y_train).score(X_test, y_test))
print(scores)
[0.39864864864864863, 0.42567567567567566, 0.46621621621621623, 0.0, 0.25850340136054423, 0.013605442176870748, 0.37414965986394561, 0.47619047619047616, 0.17006802721088435, 0.0]
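Several folds score at or near zero, most likely because np.array_split keeps the rows in file order, so a fold's class mix can differ sharply from its training data. A sketch of the same evaluation with stratified folds, assuming a scikit-learn recent enough to ship StratifiedKFold in sklearn.model_selection:
from sklearn.model_selection import StratifiedKFold
from sklearn import svm

# Each fold now preserves the overall class proportions
skf = StratifiedKFold(n_splits=10)
scores = []
for train_idx, test_idx in skf.split(cmc_copy, target):
    clf = svm.SVC(kernel='linear', C=1)
    clf.fit(cmc_copy.iloc[train_idx], target.iloc[train_idx])
    scores.append(clf.score(cmc_copy.iloc[test_idx], target.iloc[test_idx]))
print(scores)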
Simulating KFold by splitting 25 observations into 5 folds
from sklearn.cross_validation import KFold
kf = KFold(25, n_folds=5, shuffle=False)
print '{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations')
for iteration, data in enumerate(kf, start=1):
    print '{:^9} {} {:^25}'.format(iteration, data[0], data[1])
Iteration Training set observations Testing set observations
1 [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4]
2 [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9]
3 [ 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14]
4 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19]
5 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24]
Finding the optimal value of k for k-nearest neighbour
from sklearn.neighbors import KNeighborsClassifier
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, cmc_copy, target, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())
print k_scores
[0.46094351110834114, 0.46969970995437571, 0.48610035018084768, 0.50439011683282042, 0.51930063120517422, 0.53288377875842907, 0.53769612215874985, 0.51997214613055376, 0.53089296411969422, 0.53355454444449568, 0.54306009289756119, 0.53831601352352876, 0.54441106899538183, 0.54847444057769912, 0.55051072343589968, 0.55864225066046735, 0.55261608125679096, 0.55461987425862702, 0.56206659337052434, 0.56139116828645197, 0.5641720116207567, 0.55394457130098529, 0.55193555855634435, 0.55942855809185799, 0.55195872961669734, 0.54789095049572611, 0.54923310771299327, 0.5458453488415137, 0.54857153362551381, 0.55263005818305999]
Plotting the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
import matplotlib.pyplot as plt
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.show()
10 fold cross-validation gives the best KNN model with k=21
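The same conclusion can be read off programmatically rather than from the plot; a sketch:
# k_range starts at 1, so the best k sits at the position of the maximum
best_k = k_range[int(np.argmax(k_scores))]
print(best_k, max(k_scores))  # 21, ~0.564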
knn = KNeighborsClassifier(n_neighbors=21)
scores= cross_val_score(knn, cmc_copy, target, cv=10, scoring='accuracy')
print scores
[0.60402685 0.4527027 0.53378378 0.59183673 0.57823129 0.56462585
0.59183673 0.57823129 0.61904762 0.52739726]
scores.mean()
Output: 0.5641
With k=21, 10-fold cross-validation yields the best KNN accuracy (about 0.564).
Data Modelling
k-nearest neighbour Classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import cross_val_score
X_train, X_test, y_train, y_test = train_test_split(cmc_copy,target, test_size=0.4, random_state=0)
Fitting the Model
k21clf = KNeighborsClassifier(21)
k21fit = k21clf.fit(X_train, y_train)
k21fit
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=21, p=2,
weights='uniform')
print cross_val_score(k21fit, cmc_copy, target, cv=10, scoring='accuracy').mean()
Output 0.56417
Predict on Unseen Data
k21predicted = k21fit.predict(X_test)
k21predicted
Out[101]:
array([1, 2, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2, 3, 2, 2, 2, 1, 2, 3, 1, 3, 3, 1,
1, 3, 2, 3, 3, 1, 2, 3, 1, 3, 1, 1, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 3,
3, 1, 1, 1, 3, 3, 1, 1, 1, 1, 1, 3, 1, 1, 3, 3, 1, 1, 3, 1, 1, 2, 1,
1, 1, 3, 3, 2, 3, 3, 1, 1, 1, 1, 3, 1, 3, 2, 1, 3, 3, 3, 3, 1, 1, 3,
3, 3, 1, 3, 2, 2, 1, 2, 1, 2, 1, 3, 1, 1, 3, 1, 2, 1, 1, 1, 2, 3, 1,
1, 3, 3, 1, 1, 2, 1, 2, 1, 2, 3, 1, 3, 1, 1, 3, 1, 3, 3, 3, 3, 1, 3,
1, 3, 2, 1, 3, 3, 2, 3, 1, 1, 1, 1, 1, 1, 2, 2, 1, 3, 1, 1, 1, 1, 3,
3, 1, 2, 1, 3, 1, 1, 3, 2, 2, 1, 1, 3, 3, 3, 3, 3, 2, 1, 1, 1, 2, 1,
1, 1, 3, 3, 2, 2, 3, 3, 3, 3, 2, 3, 2, 1, 3, 1, 1, 1, 1, 3, 3, 3, 3,
1, 1, 3, 1, 3, 3, 1, 2, 3, 1, 3, 2, 1, 1, 1, 1, 3, 1, 1, 1, 2, 3, 1,
1, 3, 3, 1, 3, 2, 1, 2, 3, 3, 1, 3, 1, 1, 1, 2, 3, 2, 1, 2, 1, 2, 1,
1, 1, 1, 3, 1, 3, 1, 1, 1, 2, 1, 1, 1, 3, 1, 1, 2, 1, 1, 1, 3, 2, 1,
3, 3, 1, 2, 3, 1, 1, 1, 3, 2, 1, 3, 2, 2, 1, 2, 2, 3, 1, 3, 1, 2, 1,
1, 3, 2, 3, 2, 1, 3, 2, 2, 2, 3, 3, 1, 3, 3, 3, 2, 3, 1, 3, 1, 1, 1,
2, 1, 2, 3, 3, 1, 1, 1, 1, 3, 1, 3, 2, 3, 3, 1, 2, 3, 1, 1, 3, 2, 3,
3, 3, 3, 2, 3, 3, 3, 1, 2, 1, 3, 1, 1, 3, 2, 3, 1, 2, 1, 1, 3, 2, 2,
2, 2, 2, 1, 2, 3, 3, 2, 1, 3, 1, 1, 3, 3, 2, 2, 2, 1, 2, 1, 2, 3, 2,
1, 3, 2, 2, 1, 1, 3, 3, 3, 1, 3, 3, 2, 1, 2, 1, 3, 1, 2, 3, 3, 3, 1,
2, 1, 1, 1, 3, 2, 3, 1, 1, 1, 3, 2, 1, 1, 2, 2, 1, 3, 1, 1, 3, 1, 3,
3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 1, 1, 3, 3, 3, 1, 1, 3, 3, 1,
2, 1, 1, 1, 1, 1, 2, 3, 1, 1, 3, 3, 1, 2, 3, 1, 2, 1, 3, 3, 3, 1, 2,
2, 3, 2, 3, 1, 2, 1, 3, 1, 3, 3, 3, 2, 3, 2, 2, 3, 2, 2, 2, 3, 3, 3,
3, 3, 1, 3, 3, 3, 2, 3, 1, 1, 1, 1, 2, 3, 3, 1, 3, 3, 1, 1, 2, 3, 1,
3, 2, 1, 1, 1, 3, 1, 3, 1, 3, 2, 2, 2, 1, 1, 3, 1, 1, 1, 1, 3, 1, 1,
2, 1, 3, 1, 3, 3, 1, 3, 1, 3, 1, 1, 3, 1, 1, 3, 2, 3, 1, 3, 1, 3, 1,
1, 2, 2, 1, 1, 3, 2, 3, 3, 1, 3, 2, 1, 1, 3])
k21predicted.shape
(590,)
Methods for classification prediction results
Confusion Matrix
from sklearn.metrics import confusion_matrix
k21cm = confusion_matrix(y_test,k21predicted)
k21cm
Out[105]:
array([[152, 29, 61],
[ 43, 58, 45],
[ 65, 37, 100]])
Class Distribution: 1 No-Use : 42.70% (629), 2 Long-Term : 22.61% (333) 3 Short-term : 34.69% (511)
Classification system has been trained to distinguish between No-Use, Long-Term and Short-Term.
The confusion matrix summarizes the results of testing the algorithm on the 590 test choices — 242 No-Use, 146 Long-Term, 202 Short-Term:
             Predicted
              1    2    3  | Total
Actual 1    152   29   61  |  242
       2     43   58   45  |  146
       3     65   37  100  |  202
Total       260  124  206  |  590
Discussion
The diagonal entries are the correct guesses: 152+58+100 = 310
Of these, 49.03% are 'No-Use', 18.71% 'Long-Term' and 32.26% 'Short-Term'.
Errors are represented by the values outside the diagonal.
By considering No-Use vs Other Choices:
               Predicted No-Use   Predicted Other | Total
Actual No-Use       TP 152             FN 90      |  242
Actual Other        FP 108             TN 240     |  348
Total                  260                330     |  590
TP: True Positive, TN: True Negative,
FP: False Positive (Type I Error), FN: False Negative (Type II Error).
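These one-vs-rest counts can be read mechanically off the full 3x3 matrix; a NumPy sketch:
import numpy as np
cm = np.array([[152, 29, 61],
               [ 43, 58, 45],
               [ 65, 37, 100]])
TP = np.diag(cm)                 # correct predictions per class
FN = cm.sum(axis=1) - TP         # actual class, predicted elsewhere
FP = cm.sum(axis=0) - TP         # predicted class, actually elsewhere
TN = cm.sum() - (TP + FN + FP)   # everything else
print(TP, FN, FP, TN)            # class 1: TP=152, FN=90, FP=108, TN=240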
The classifier made a total of 590 predictions.
Out of 590, it predicted 260 women as not using any contraceptive method and 330 as using one.
In reality, 242 were not using and 348 were using.
How often is the classifier correct (on the No-Use vs Other task)?
Accuracy of the Classifier: (TP+TN)/total = (152+240)/590 = 0.6644
How often is the classifier wrong?
Classification Error Rate: (FP+FN)/total = (108+90)/590 = 0.3356 (equivalent to 1-Accuracy)
When it is actually 'No-Use', how often does the classifier predict 'No-Use'? (Recall or Sensitivity)
Recall (True Positive Rate): TP/actual No-Use = 152/242 = 0.6281
When it is actually 'Other Choices', how often does the classifier predict 'No-Use'?
False Positive Rate: FP/actual Other = 108/348 = 0.3103
When it is actually 'Other Choices', how often does it predict 'Other Choices'?
Specificity: TN/actual Other = 240/348 = 0.6897 (equivalent to 1 minus the False Positive Rate)
When it predicts 'No-Use', how often is it correct?
Precision: TP/predicted No-Use = 152/260 = 0.5846
How often does 'Other Choices' actually occur in our sample?
Prevalence: 348/590 = 0.5898
How often does 'No-Use' actually occur in our sample?
Prevalence: 242/590 = 0.4102
Classification Report
from sklearn.metrics import classification_report
print classification_report(y_test, k21predicted)
Class Label precision recall f1-score support
1 0.58 0.63 0.61 242
2 0.47 0.40 0.43 146
3 0.49 0.50 0.49 202
avg / total 0.52 0.53 0.52 590
Out of 242 actual 'No-Use' cases, the system predicted 152 correctly (true positives); the remaining 90 were classified as Long-Term or Short-Term. And of the 260 predicted as 'No-Use', 108 actually belong to the other choices.
Precision measures quality; recall measures quantity.
High precision means the classifier returned substantially more relevant results than irrelevant ones, while high recall means it returned most of the relevant results.
‘No-Use’
Classification Error Rate: (FN+FP)/total = (90+108)/590 = 0.3356
Precision is 152/260=0.58, Recall is 152/242=0.63
F1 Score = 2x((0.58x0.63)/(0.58+0.63)) = 0.61
Here recall is higher than precision: the KNN classifier found most of the women not using any contraceptive method.
‘Long-Term’
Classification Error Rate: (FN+FP)/total = (88+66)/590 = 0.2610
Precision is 58/124=0.47, Recall is 58/146=0.40
F1 Score = 2x((0.47x0.40)/(0.47+0.40)) = 0.43
Here precision is higher than recall: the 'Long-Term' labels the classifier assigned were comparatively reliable, but it missed many of the actual long-term users.
‘Short-Term’
Classification Error Rate: (FN+FP)/total = (102+106)/590 = 0.3525
Precision is 100/206=0.49, Recall is 100/202=0.50
F1 Score = 2x((0.49x0.50)/(0.49+0.50)) = 0.49
Here recall is slightly higher than precision: the classifier returned most of the short-term users.
Accuracy = 310/590 = 0.5254
Overall, the KNN model returned most of the relevant results.
So, based on our objective, the KNN model predicts that, given their demographic and socio-economic characteristics, most of these women prefer 'No-Use' of any contraceptive method.
Decision Tree classifier
from sklearn.cross_validation import train_test_split
target = cmc['Contraceptive_Method_Used']
cmc_copy=cmc
cmc_copy = cmc_copy.drop('Contraceptive_Method_Used',1)
X_train, X_test, y_train, y_test = train_test_split(cmc_copy,target, test_size=0.4, random_state=0)
Fitting the model
from sklearn.tree import DecisionTreeClassifier
Dclf = DecisionTreeClassifier()
DTfit = Dclf.fit(X_train, y_train)
DTfit
Output:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
from sklearn.cross_validation import cross_val_score
print cross_val_score(DTfit, cmc_copy, target, cv=10, scoring='accuracy').mean()
Out: 0.467
Predicting the Class Samples
y_pre = DTfit.predict(X_test)
y_pre
Out[118]:
array([1, 1, 2, 3, 2, 2, 1, 3, 3, 2, 3, 2, 1, 2, 3, 2, 1, 1, 3, 3, 3, 1, 1,
1, 3, 1, 1, 1, 1, 3, 2, 1, 3, 1, 1, 1, 3, 3, 1, 2, 2, 1, 1, 1, 2, 3,
2, 1, 1, 3, 3, 2, 1, 1, 1, 3, 1, 3, 1, 3, 3, 3, 1, 1, 1, 1, 1, 2, 3,
2, 3, 3, 1, 2, 2, 2, 1, 3, 1, 1, 3, 3, 3, 2, 1, 3, 1, 1, 3, 1, 2, 1,
3, 3, 1, 3, 2, 2, 1, 2, 1, 3, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 3, 3,
1, 3, 1, 1, 1, 2, 3, 2, 1, 3, 3, 3, 2, 1, 3, 1, 1, 3, 1, 3, 3, 1, 3,
1, 1, 2, 1, 3, 3, 3, 1, 2, 1, 1, 1, 3, 1, 2, 2, 1, 2, 1, 1, 1, 1, 2,
1, 1, 3, 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 3, 2, 2, 2, 2, 2, 1, 1, 1, 1,
1, 3, 2, 1, 2, 3, 3, 3, 1, 3, 2, 2, 2, 1, 3, 1, 1, 1, 3, 1, 3, 3, 1,
1, 3, 2, 1, 3, 3, 1, 2, 3, 1, 3, 3, 1, 3, 3, 1, 3, 1, 2, 1, 2, 3, 1,
1, 2, 1, 1, 2, 2, 3, 2, 3, 3, 1, 2, 1, 1, 2, 1, 3, 2, 2, 2, 2, 2, 1,
3, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 3, 3, 3, 2, 2, 1, 2, 1,
3, 2, 3, 3, 3, 2, 1, 1, 2, 3, 3, 3, 1, 3, 3, 2, 3, 1, 1, 3, 1, 1, 2,
1, 1, 2, 2, 2, 1, 1, 3, 3, 2, 3, 3, 1, 2, 1, 2, 2, 3, 1, 3, 3, 1, 1,
2, 1, 2, 3, 1, 1, 1, 1, 1, 3, 3, 2, 2, 3, 1, 1, 2, 3, 2, 1, 3, 1, 1,
3, 1, 1, 2, 1, 1, 3, 1, 2, 1, 2, 1, 1, 1, 2, 3, 1, 2, 3, 2, 1, 1, 2,
2, 1, 1, 1, 1, 3, 3, 3, 1, 3, 1, 1, 1, 3, 3, 3, 1, 3, 1, 1, 1, 3, 2,
3, 2, 3, 2, 1, 3, 3, 3, 3, 1, 2, 3, 3, 3, 2, 1, 1, 3, 2, 3, 3, 3, 1,
2, 2, 1, 2, 1, 2, 3, 2, 1, 1, 3, 2, 2, 1, 3, 3, 1, 3, 1, 3, 3, 2, 1,
2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 2, 1, 1, 3, 1, 3, 1, 1, 3, 2, 1,
2, 3, 1, 1, 1, 1, 2, 3, 1, 1, 1, 3, 1, 2, 2, 1, 2, 1, 2, 3, 3, 2, 2,
2, 3, 2, 3, 3, 2, 2, 1, 1, 2, 2, 3, 1, 3, 3, 2, 3, 2, 2, 1, 1, 3, 2,
3, 3, 1, 2, 3, 1, 2, 1, 1, 3, 1, 3, 3, 3, 1, 3, 2, 3, 2, 3, 3, 3, 1,
3, 2, 1, 2, 2, 3, 1, 2, 3, 3, 1, 2, 3, 1, 1, 1, 2, 1, 3, 1, 1, 1, 2,
3, 2, 1, 1, 2, 2, 3, 3, 1, 3, 1, 1, 1, 1, 1, 3, 2, 1, 1, 2, 2, 3, 1,
1, 2, 2, 1, 1, 2, 1, 3, 3, 1, 3, 2, 1, 1, 3])
y_pre.shape
Output: (590,)
Predicting the probability of each class, which is the fraction of training samples of the same class in a leaf:
y_pre_prob = DTfit.predict_proba(X_test)
y_pre_prob
array([[ 1., 0., 0.], [ 1., 0., 0.],
[ 1., 0., 0.], [ 1., 0., 0.],
[ 0., 1., 0.], ..., [ 0., 0., 1.]])
Confusion Matrix
from sklearn.metrics import confusion_matrix
Dcm = confusion_matrix(y_test, y_pre)
Dcm
Out[122]:
array([[142, 44, 56],
[ 37, 61, 48],
[ 68, 55, 79]])
Class Distribution: 1 No-Use : 42.70% (629), 2 Long-Term : 22.61% (333) 3 Short-term : 34.69% (511)
The resulting confusion matrix:
             Predicted
              1    2    3  | Total
Actual 1    142   44   56  |  242
       2     37   61   48  |  146
       3     68   55   79  |  202
Total       247  160  183  |  590
The diagonal entries are the correct guesses: 142+61+79 = 282
Of these, 50.35% are 'No-Use', 21.63% 'Long-Term' and 28.01% 'Short-Term'.
Errors are represented by the values outside the diagonal.
We can see from the matrix that the system has particular trouble separating 'No-Use' from 'Short-Term' (68 actual Short-Term cases were predicted as No-Use, and 56 No-Use cases as Short-Term), while it separates 'No-Use' from 'Long-Term' somewhat better.
Classification Report
from sklearn.metrics import classification_report
print classification_report(y_test, y_pre)
Output:
Class precision recall f1-score support
1 0.57 0.59 0.58 242
2 0.38 0.42 0.40 146
3 0.43 0.39 0.41 202
avg / total 0.48 0.48 0.48 590
Out of 242 actual 'No-Use' cases, the system predicted 142 correctly (true positives); the remaining 100 were classified as Long-Term or Short-Term.
Classification Error Rate: (590-282)/590 = 308/590 = 0.5220
‘No-Use’
Precision is 142/247=0.57, Recall is 142/242=0.59
F1Score=2x((0.57x0.59)/(0.57+0.59))=0.58
Here recall is higher than precision: the Decision Tree classifier found most of the women not using any contraceptive method.
‘Long-Term’
Precision is 61/160=0.38, Recall is 61/146=0.42
F1Score=2x((0.38x0.42)/(0.38+0.42))=0.40
Here recall is higher than precision: the Decision Tree classifier returned most of the actual long-term users.
‘Short-Term’
Precision is 79/183=0.43, Recall is 79/202=0.39
F1Score=2x((0.43x0.39)/(0.43+0.39))=0.41
Here precision is higher than recall: the 'Short-Term' labels the Decision Tree assigned were comparatively reliable, but it missed many of the actual short-term users.
Accuracy = 282/590 = 0.48
Averaged over the classes, precision and recall are essentially equal (0.48 each), so the Decision Tree balances quality and coverage, but at a lower level than KNN.
The problem is to predict the current contraceptive method choice (no use, long-term methods, or short-term methods) of a woman based on her demographic and socio-economic characteristics.
Decision Tree Visualization
from sklearn import tree
from os import system
dtree = tree.DecisionTreeClassifier()
clf = dtree.fit(cmc_copy, target)
# Write the tree in Graphviz .dot format, then render it to PNG
dotfile = open("CMCtree.dot", 'w')
tree.export_graphviz(clf, out_file=dotfile)
dotfile.close()
system("dot -Tpng CMCtree.dot -o CMCtree.png")
Comparing Decision Tree Model with KNN model
Accuracy
KNN 0.53
DT 0.48
KNN classifies more test instances correctly, so KNN performs better.
Classification Error Rate
KNN 0.4746
DT 0.5220
KNN mislabels a smaller share of the observations in the test set.
So, KNN performs better.
Classification Report
Class Label precision recall f1-score support
KNN1 0.58 0.63 0.61 242
DT1 0.57 0.59 0.58 242
KNN2 0.47 0.40 0.43 146
DT2 0.38 0.42 0.40 146
KNN3 0.49 0.50 0.49 202
DT3 0.43 0.39 0.41 202
Precision and recall trade off differently per class for the two models, so the F1 score — their harmonic mean — is the better single figure for comparing classifier performance per class.
'No-Use'
The F1 score is higher for KNN: KNN classifies more 'No-Use' instances correctly than the Decision Tree.
'Long-Term'
The F1 score is higher for KNN: KNN classifies more 'Long-Term' instances correctly than the Decision Tree.
'Short-Term'
The F1 score is higher for KNN: KNN classifies more 'Short-Term' instances correctly than the Decision Tree.
KNN performs better.
Both models predict that, given their demographic and socio-economic characteristics, most of these women prefer 'No-Use' of any contraceptive method.
import pandas
import matplotlib.pyplot as plt
from sklearn import cross_validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import cross_val_score
array = cmc.values
models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('DTC', DecisionTreeClassifier()))
results = []
names = []
for name, model in models:
    cv_results = cross_val_score(model, cmc_copy, target, cv=10, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
Output: mean accuracy (standard deviation)
KNN: 0.519301 (0.040072)
DTC: 0.465609 (0.041911)
fig = plt.figure()
fig.suptitle('Model Comparison based on Accuracy')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
Boxplot shows the spread of accuracy scores across each classifier
These results suggest the KNN model is the one worth further study on this problem.
Nearest Centroid classifier
Each class is represented by its centroid, with test samples classified to the class with the nearest centroid.
A nearest centroid classifier or nearest prototype classifier is a classification model that assigns to observations the label of the class of training samples whose mean (centroid) is closest to the observation.
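To make the rule concrete, here is a minimal NumPy sketch of the centroid classifier (Euclidean metric, no shrinkage), assuming the X_train/y_train split generated just below:
import numpy as np

# One centroid (feature-wise mean) per class
classes = np.unique(y_train)
centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])

def predict_nc(X):
    # Distance from every sample to every centroid; pick the nearest
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]

y_hat = predict_nc(np.asarray(X_test))  # should essentially match NCfit.predict(X_test)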
Generating Train and Test Split
from sklearn.cross_validation import train_test_split
target = cmc['Contraceptive_Method_Used']
cmc_copy=cmc
cmc_copy = cmc_copy.drop('Contraceptive_Method_Used',1)
X_train, X_test, y_train, y_test = train_test_split(cmc_copy,target, test_size=0.4, random_state=0)
Selecting the Classifier
from sklearn.neighbors import NearestCentroid
NCclf = NearestCentroid()
Fitting the Model
NCfit = NCclf.fit(X_train, y_train)
NCfit
Output: NearestCentroid(metric='euclidean', shrink_threshold=None)
Predicting the Unseen Data
y_pre = NCfit.predict(X_test)
y_pre
array([3, 2, 2, 3, 2, 2, 1, 3, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 3, 2, 3, 3, 1,
2, 3, 2, 2, 3, 2, 3, 3, 2, 1, 1, 3, 2, 3, 2, 2, 2, 3, 2, 2, 2, 3, 3,
3, 2, 1, 3, 3, 3, 3, 2, 3, 1, 2, 3, 3, 2, 1, 3, 3, 2, 3, 2, 2, 2, 3,
2, 3, 3, 3, 3, 1, 2, 3, 3, 3, 3, 2, 3, 2, 2, 2, 3, 3, 1, 3, 2, 2, 3,
3, 3, 2, 3, 2, 2, 3, 2, 2, 2, 2, 3, 3, 2, 3, 1, 2, 2, 1, 3, 2, 3, 3,
2, 3, 1, 2, 2, 3, 3, 2, 3, 2, 3, 2, 3, 2, 2, 2, 3, 3, 3, 3, 3, 2, 3,
1, 3, 2, 2, 1, 3, 2, 3, 3, 3, 2, 2, 3, 3, 2, 2, 2, 3, 2, 2, 2, 2, 3,
2, 3, 2, 2, 2, 2, 3, 3, 2, 3, 3, 2, 2, 3, 1, 3, 3, 2, 2, 1, 2, 3, 3,
3, 2, 3, 3, 2, 2, 3, 3, 2, 3, 2, 3, 2, 2, 3, 3, 2, 2, 3, 3, 3, 1, 3,
2, 3, 3, 2, 1, 3, 1, 2, 2, 2, 2, 2, 2, 3, 3, 2, 1, 3, 2, 1, 2, 3, 2,
2, 3, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 3, 2, 2, 3, 2, 3, 3, 2, 3, 2, 3,
3, 2, 2, 3, 2, 1, 2, 2, 3, 2, 2, 3, 2, 2, 2, 1, 2, 3, 2, 2, 3, 2, 3,
3, 3, 3, 2, 3, 2, 3, 2, 3, 2, 3, 3, 2, 2, 1, 2, 2, 3, 3, 2, 2, 3, 2,
3, 3, 2, 3, 2, 2, 3, 2, 2, 2, 3, 3, 2, 1, 3, 3, 2, 3, 2, 2, 3, 3, 3,
2, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 3, 2, 2, 3,
3, 3, 2, 2, 3, 2, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 3, 2, 3, 2, 2, 2, 2,
3, 2, 2, 3, 2, 3, 2, 2, 1, 1, 2, 2, 3, 3, 2, 3, 2, 2, 2, 2, 2, 3, 2,
3, 3, 2, 3, 2, 2, 1, 3, 3, 3, 2, 2, 2, 3, 2, 1, 2, 2, 2, 2, 3, 3, 3,
2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 1, 2, 3, 2, 2, 2, 2, 2, 3, 3, 3, 3, 2,
3, 3, 3, 2, 3, 2, 2, 3, 2, 3, 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 2, 1, 2,
2, 3, 2, 3, 2, 2, 3, 3, 2, 3, 3, 1, 3, 2, 3, 2, 2, 3, 2, 3, 3, 2, 2,
3, 3, 2, 3, 2, 2, 2, 3, 3, 1, 3, 3, 2, 3, 3, 3, 3, 2, 2, 2, 2, 3, 3,
3, 1, 3, 3, 3, 1, 2, 3, 3, 3, 3, 2, 2, 3, 3, 3, 3, 3, 2, 3, 3, 2, 2,
3, 2, 3, 1, 2, 2, 2, 3, 3, 3, 2, 3, 2, 2, 3, 3, 3, 2, 3, 1, 2, 2, 2,
2, 3, 3, 2, 1, 3, 2, 3, 2, 3, 3, 1, 3, 2, 3, 1, 2, 3, 2, 2, 2, 1, 2,
2, 2, 2, 2, 3, 3, 2, 3, 2, 2, 1, 2, 3, 2, 3])
y_pre.shape
(590,)
Accuracy
from sklearn.cross_validation import cross_val_score
cross_val_score(NCfit, cmc_copy, target, cv=10, scoring='accuracy')
array([ 0.32214765, 0.39864865, 0.34459459, 0.41496599, 0.39455782,
0.38095238, 0.40136054, 0.36054422, 0.29931973, 0.4109589 ])
print cross_val_score(NCfit, cmc_copy, target, cv=10, scoring='accuracy').mean()
output: 0.3728
Confusion Matrix
from sklearn.metrics import confusion_matrix
Ccm = confusion_matrix(y_test, y_pre)
Ccm
array([[ 18, 113, 111],
[ 12, 88, 46],
[ 12, 73, 117]])
Accuracy = 223/ 590 =0.378
Classification Error Rate: (590-223)/590 = 367/590 = 0.622
Classification Report
from sklearn.metrics import classification_report
print classification_report(y_test, y_pre)
precision recall f1-score support
1 0.43 0.07 0.13 242
2 0.32 0.60 0.42 146
3 0.43 0.58 0.49 202
avg / total 0.40 0.38 0.32 590
'No-Use'
Here precision (0.43) is far higher than recall (0.07): the few cases the Nearest Centroid Classifier labels 'No-Use' are often correct, but it misses almost all of the actual non-users.
‘Long-Term’
Here Recall is higher than precision, that is the Nearest Centroid Classifier returned most of the relevant results.
‘Short-Term’
Here recall is higher than precision: the Nearest Centroid Classifier returned most of the actual short-term users.
Based on the F1 scores, NCC classifies 'Short-Term' best; it predicts short-term methods as the most common choice among these women.
Comparing the KNN, DT and NCC Models
Accuracy
KNN 0.53
DT 0.48
NCC 0.38
KNN classifies the most test instances correctly, so KNN performs best.
Classification Error Rate
KNN 0.4746
DT 0.5220
NCC 0.6220
KNN mislabels the smallest share of the observations in the test set.
So, KNN performs best.
Classification Report
Class Label precision recall f1-score support
KNN1 0.58 0.63 0.61 242
DT1 0.57 0.59 0.58 242
NCC1 0.43 0.07 0.13 242
KNN2 0.47 0.40 0.43 146
DT2 0.38 0.42 0.40 146
NCC2 0.32 0.60 0.42 146
KNN3 0.49 0.50 0.49 202
DT3 0.43 0.39 0.41 202
NCC3 0.43 0.58 0.49 202
'No-Use'
The F1 score is highest for KNN: KNN classifies more 'No-Use' instances correctly than the Decision Tree and NCC.
'Long-Term'
The F1 score is highest for KNN: KNN classifies more 'Long-Term' instances correctly than the Decision Tree and NCC.
'Short-Term'
KNN and NCC tie on F1 (0.49), both ahead of the Decision Tree.
Overall, KNN performs best.
KNN and the Decision Tree predict that, given their demographic and socio-economic characteristics, most of these women prefer 'No-Use' of any contraceptive method, while NCC leans towards 'Short-Term'.
import pandas
import matplotlib.pyplot as plt
from sklearn import cross_validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.cross_validation import cross_val_score
array = cmc.values
models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('DTC', DecisionTreeClassifier()))
models.append(('NCC', NearestCentroid()))
results = []
names = []
for name, model in models:
    cv_results = cross_val_score(model, cmc_copy, target, cv=10, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
Output: mean accuracy (standard deviation)
KNN: 0.519301 (0.040072)
DTC: 0.477132 (0.034676)
NCC: 0.372805 (0.037641)
fig = plt.figure()
fig.suptitle('Model Comparison based on Accuracy')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
Boxplot shows the spread of accuracy scores across each classifier
These results again suggest the KNN model is the most promising for further study on this problem.
Conclusion
Data modelling is a core step in the data science process. In this assignment we understood, developed and implemented the appropriate steps, in IPython, to complete the corresponding tasks.
This gave practical experience with the typical fifth and sixth steps of the data science process: data modelling, and presentation and automation.