Implementing KNN with sklearn in Python
🎯 What is this about?
We built our KNN classifier from scratch. KNN is used so frequently that there is, of course, a library that already implements it as KNeighborsClassifier().
Now we only need a few lines of code to get the same result.
Below you will find a tutorial on implementing KNN with Python. You may work alone or in pairs.
Tutorial
Dataset
We will work with the Iris dataset to test our KNN algorithm. You can download the dataset here: https://www.kaggle.com/datasets/arshid/iris-flower-dataset
import pandas as pd
iris = pd.read_csv('IRIS.csv')
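A quick look at the data confirms that the CSV loaded as expected (the column names used below match the Kaggle file; adjust them if yours differ):
print(iris.head())               # first five rows
print(iris["species"].unique())  # the three iris species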
Step 1
Features and Labels
As last time, we need X and y (features and labels). Split the data up, just like last time!
X = iris[["sepal_length", "sepal_width", "petal_length", "petal_width"]].values
y = iris["species"].values
Our labels are categorical (i.e., not numeric). For the KNeighborsClassifier() we are going to use, we encode them as the numbers 0, 1, 2.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
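If you want to see which number was assigned to which species, you can inspect the fitted encoder (an optional check; the exact class names depend on your CSV):
print(le.classes_)  # the encoded label i corresponds to le.classes_[i]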
Scale your X set, then split it again into X_train, X_test, y_train, y_test (note that StandardScaler needs to be imported as well):
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
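As a quick sanity check, the scaled features should now have zero mean and unit variance (an optional check, assuming X is the scaled array from above):
import numpy as np
print(np.round(X.mean(axis=0), 2))  # should be ~[0. 0. 0. 0.]
print(np.round(X.std(axis=0), 2))   # should be ~[1. 1. 1. 1.]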
Step 2
Now we want to use KNN for our classification.
For that, we need to import a few things:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import cross_val_score
Complete the following code:
classifier = KNeighborsClassifier(n_neighbors=❓)
classifier.fit(❓, ❓)
y_pred = classifier.predict(❓)
Next, we want to determine the accuracy:
accuracy = accuracy_score(❓, ❓)*100
print('Accuracy of our model is equal to ' + str(round(accuracy, 2)) + ' %.')
Solution:
# Instantiate the learning model (k = 3)
classifier = KNeighborsClassifier(n_neighbors=3)
# Fit the model
classifier.fit(X_train, y_train)
# Predict the test set results
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)*100
print('Accuracy of our model is equal to ' + str(round(accuracy, 2)) + ' %.')
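The confusion_matrix imported above gives a per-class view of the errors; a minimal sketch using the predictions from the solution:
cm = confusion_matrix(y_test, y_pred)
print(cm)  # rows = true classes, columns = predicted classes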
Step 3
We can also test what influence the choice of distance function has on the accuracy.
Which metric performs better for n_neighbors = 5: Manhattan or Euclidean?
classifier = KNeighborsClassifier(n_neighbors=5, metric="❓")
classifier.fit(❓, ❓)
y_pred = classifier.predict(❓)
Determine the accuracy:
accuracy = accuracy_score(❓, ❓)*100
print('Accuracy of our model is equal to ' + str(round(accuracy, 2)) + ' %.')
There are other metrics as well:
minkowski
chebyshev
Which one performs best?
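To compare several metrics at once, a minimal sketch (assuming X_train, X_test, y_train, y_test from above) could look like this:
# Loop over a few of the distance metrics supported by sklearn, with k = 5
for metric in ["euclidean", "manhattan", "minkowski", "chebyshev"]:
    clf = KNeighborsClassifier(n_neighbors=5, metric=metric)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test)) * 100
    print(metric + ': ' + str(round(acc, 2)) + ' %')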
Extras
Now compare the accuracies of your "from scratch" implementation with the sklearn implementation.
Test for k between 1 and 50.
# KNN
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
def most_common(data_list):
    '''Returns the most common element in a list'''
    return max(set(data_list), key=data_list.count)

def euclidean(point, data):
    '''Euclidean distance between a point and each row of data'''
    return np.sqrt(np.sum((point - data)**2, axis=1))

def predict(X_test, k):
    '''Predict labels for X_test using the k nearest neighbors in X_train'''
    neighbors = []
    for x in X_test:
        distances = euclidean(x, X_train)
        y_sorted = [y for _, y in sorted(zip(distances, y_train))]
        neighbors.append(y_sorted[:k])
    return list(map(most_common, neighbors))

def evaluate(X_test, y_test, k):
    '''Accuracy of the from-scratch predictions on X_test'''
    y_pred = predict(X_test, k)
    accuracy = sum(y_pred == y_test) / len(y_test)
    return accuracy
iris = pd.read_csv('IRIS.csv')
X = iris[["sepal_length", "sepal_width", "petal_length", "petal_width"]].values
y = iris["species"].values
le = LabelEncoder()
y = le.fit_transform(y)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# list of k values to test
k_list = list(range(1, 50, 2))
# accuracy per k, for both implementations
accuracies_sklearn = []
accuracies = []
for k in k_list:
    # from-scratch implementation
    accuracy = evaluate(X_test, y_test, k)
    accuracies.append(accuracy)
    # sklearn implementation
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies_sklearn.append(acc)
best_k = k_list[accuracies.index(max(accuracies))]
best_k_sklearn = k_list[accuracies_sklearn.index(max(accuracies_sklearn))]
print("The optimal number of neighbors with the from-scratch implementation is %d." % best_k)
print("The optimal number of neighbors with the sklearn implementation is %d." % best_k_sklearn)
print(accuracies, accuracies_sklearn)
fig, ax = plt.subplots()
ax.plot(k_list, accuracies_sklearn, color="blue", label="sklearn")
ax.plot(k_list, accuracies, color="orange", label="from scratch")
ax.set(xlabel="k",
       ylabel="Accuracy",
       title="Performance of KNN")
ax.legend()
plt.show()
The performance should be almost the same (up to a few decimal places… 😌).
Here is the complete script once more, including a cross-validation over k:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
iris = pd.read_csv('IRIS.csv')
X = iris[["sepal_length", "sepal_width", "petal_length", "petal_width"]].values
y = iris["species"].values
le = LabelEncoder()
y = le.fit_transform(y)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Instantiate learning model (k = 3)
classifier = KNeighborsClassifier(n_neighbors=3)
# Fitting the model
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)*100
print('Accuracy of our model, for k = 3, is equal to ' + str(round(accuracy, 2)) + ' %.')
cm = confusion_matrix(y_test, y_pred)
print(cm)
## Cross Validation
# creating list of K for KNN
k_list = list(range(1,50,2))
# creating list of cv scores
accuracies = []
# perform 10-fold cross-validation for each k
for k in k_list:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    accuracies.append(scores.mean())
best_k = k_list[accuracies.index(max(accuracies))]
print("The optimal number of neighbors is %d." % best_k)
fig, ax = plt.subplots()
ax.plot(k_list, accuracies)
ax.set(xlabel="k",
       ylabel="Accuracy",
       title="Performance of KNN")
plt.show()
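Once cross-validation has suggested the best k, you can refit on the training set and check the accuracy on the held-out test set; a minimal sketch using the variables from the script above:
final = KNeighborsClassifier(n_neighbors=best_k)
final.fit(X_train, y_train)
print(accuracy_score(y_test, final.predict(X_test)))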