from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
digits = datasets.load_digits() # Load the digits dataset
print(digits.keys()) # Print the keys of the dataset
print(digits['DESCR']) # Print the DESCR of the dataset
print(digits.images.shape) # Print the shape of the images
print(digits.data.shape) # Print the shape of the flattened data array
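# digits.data is just digits.images with each 8x8 image flattened into 64 features.
# As an optional sanity check (not part of the original snippet), confirm the reshape
# relationship between the two arrays.
print(digits.images.reshape(len(digits.images), -1).shape) # Should match digits.data.shape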
plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest') # Display the image at index 1010
plt.show()
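# A quick way to eyeball more of the data is to plot a small grid of samples with
# their labels. This is an illustrative sketch using the same matplotlib calls as
# above, not part of the original walkthrough.
fig, axes = plt.subplots(2, 5, figsize=(8, 4))
for ax, image, label in zip(axes.ravel(), digits.images, digits.target):
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest') # 8x8 grayscale image
    ax.set_title(f'Label: {label}')
    ax.axis('off')
plt.tight_layout()
plt.show()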
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
X = digits.data # Create feature arrays
y = digits.target # Create target arrays
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) # Split into training and test sets, preserving class proportions
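# The stratify=y argument keeps the per-digit proportions similar in both splits.
# An optional sanity check (not in the original code) is to compare the class
# counts in each split with numpy's bincount.
print(np.bincount(y_train)) # Number of samples per digit in the training set
print(np.bincount(y_test)) # Number of samples per digit in the test set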
knn = KNeighborsClassifier(n_neighbors=7) # Create a k-NN classifier with 7 neighbors
knn.fit(X_train, y_train) # Fit the classifier to the training data
print(knn.score(X_test, y_test)) # Print the accuracy
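# Accuracy alone hides which digits get confused with which. As an optional check
# (not in the original snippet), sklearn.metrics can break the score down per class.
# This is a minimal sketch reusing the fitted classifier and test split from above.
from sklearn.metrics import classification_report, confusion_matrix
y_pred = knn.predict(X_test) # Predict labels for the held-out test set
print(confusion_matrix(y_test, y_pred)) # Rows: true digits, columns: predicted digits
print(classification_report(y_test, y_pred)) # Per-class precision, recall, and F1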
neighbors = np.arange(1, 9) # Values of k to try
train_accuracy = np.empty(len(neighbors)) # Array to store training accuracies
test_accuracy = np.empty(len(neighbors)) # Array to store test accuracies
for i, k in enumerate(neighbors): # Loop over different values of k
    knn = KNeighborsClassifier(n_neighbors=k) # Setup a k-NN classifier with k neighbors
    knn.fit(X_train, y_train) # Fit the classifier to the training data
    train_accuracy[i] = knn.score(X_train, y_train) # Compute accuracy on the training set
    test_accuracy[i] = knn.score(X_test, y_test) # Compute accuracy on the test set
# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label='Testing Accuracy')
plt.plot(neighbors, train_accuracy, label='Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()
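# The plot above suggests where test accuracy peaks. An alternative sketch (not part
# of the original code) is to let cross-validation pick k directly via GridSearchCV,
# which avoids tuning against the single held-out test set.
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': np.arange(1, 9)} # Same range of k values as the loop above
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5) # 5-fold cross-validation
grid.fit(X_train, y_train) # Fit one model per k per fold on the training data
print(grid.best_params_) # k with the best mean cross-validated accuracy
print(grid.best_score_) # Corresponding mean cross-validation score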