Implementing KNN with sklearn in Python

🎯 What is this about?

We built our KNN classifier from scratch. KNN is used so frequently, however, that there is of course a library that already implements a KNeighborsClassifier() for us.

Now we only need a few lines of code to get the same result.

Below you will find a tutorial for implementing KNN in Python. You may work alone or in pairs.

Tutorial

Dataset 

We will use the Iris dataset to test our KNN algorithm. You can download it here: https://www.kaggle.com/datasets/arshid/iris-flower-dataset

import pandas as pd

iris = pd.read_csv('IRIS.csv')
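
Before going further, a quick look at the data helps; this small check assumes the CSV uses the column names referenced in the rest of this tutorial:

# Quick sanity check of the loaded data
print(iris.head())                # first few rows
print(iris["species"].unique())   # the three iris species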

Step 1

Features and Labels

As last time, we need X and y (features and labels). Split them up, just like last time!

Solution

X = iris[["sepal_length", "sepal_width", "petal_length", "petal_width"]].values

y = iris["species"].values

Our labels are categorical (i.e., not numeric). Before fitting the KNeighborsClassifier() we will use, we convert them into the numbers 0, 1, 2:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

y = le.fit_transform(y)
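
As a quick check, you can see which number was assigned to which species; classes_ and inverse_transform are part of LabelEncoder:

print(le.classes_)                      # species name behind each code 0, 1, 2
print(le.inverse_transform([0, 1, 2]))  # map codes back to species names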

Train and Test Set

Scale your X set, then split it again into X_train, X_test, y_train, y_test.

Solution

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
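
Note that for simplicity the scaler above was fit on all of X before splitting. A stricter variant, sketched here under the assumption that you start again from the unscaled X, fits the scaler on the training split only, so that no test-set information leaks into the scaling:

# Stricter variant (starting from the unscaled X): fit the scaler on
# the training data only, then apply it to both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)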

Step 2

Now we want to use KNN for our classification.

For that, we need to import a few libraries:

from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import confusion_matrix, accuracy_score

from sklearn.model_selection import cross_val_score

Complete the following code:

classifier = KNeighborsClassifier(n_neighbors=)

classifier.fit(, )

y_pred = classifier.predict()

Next, we want to determine the accuracy:

accuracy = accuracy_score(, )*100

print('Accuracy of our model is equal ' + str(round(accuracy, 2)) + ' %.')

Solution

# Instantiate the learning model (k = 3)
classifier = KNeighborsClassifier(n_neighbors=3)

# Fit the model
classifier.fit(X_train, y_train)

# Predict the test set results
y_pred = classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)*100
print('Accuracy of our model is equal to ' + str(round(accuracy, 2)) + ' %.')
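
The confusion_matrix imported above gives a per-class view of the same predictions (it reappears in the full code at the end of this page):

# Rows = true classes, columns = predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)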

Step 3

We can now also test how the choice of distance function affects the accuracy.

Which metric performs better for n_neighbors = 5: Manhattan or Euclidean?

classifier = KNeighborsClassifier(n_neighbors=5, metric="")

classifier.fit(, )

y_pred = classifier.predict()

Determine the accuracy:

accuracy = accuracy_score(, )*100

print('Accuracy of our model is equal ' + str(round(accuracy, 2)) + ' %.')

There are other metrics as well (see the metric parameter of KNeighborsClassifier in the sklearn documentation).

Which one performs best?
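
A minimal sketch of how such a comparison could look, assuming X_train, X_test, y_train, y_test from the steps above; all metric names used here are valid values for the metric parameter of KNeighborsClassifier:

# Compare several distance metrics for k = 5
for metric in ["euclidean", "manhattan", "chebyshev", "minkowski"]:
    clf = KNeighborsClassifier(n_neighbors=5, metric=metric)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test)) * 100
    print(metric + ': ' + str(round(acc, 2)) + ' %')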

Extras

Now compare the accuracies of your "from scratch" implementation and the "sklearn" implementation.

Test for k between 1 and 50.

Solution

# KNN

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

import matplotlib.pyplot as plt


def most_common(data_list):
    '''Returns the most common element in a list'''
    return max(set(data_list), key=data_list.count)


def euclidean(point, data):
    '''Euclidean distance between a point & data'''
    return np.sqrt(np.sum((point - data)**2, axis=1))


def predict(X_test, k):
    '''From-scratch KNN prediction, using the global X_train and y_train'''
    neighbors = []
    for x in X_test:
        distances = euclidean(x, X_train)
        y_sorted = [y for _, y in sorted(zip(distances, y_train))]
        neighbors.append(y_sorted[:k])
    return list(map(most_common, neighbors))


def evaluate(X_test, y_test, k):
    '''Accuracy of the from-scratch prediction on the test set'''
    y_pred = predict(X_test, k)
    accuracy = sum(y_pred == y_test) / len(y_test)
    return accuracy


iris = pd.read_csv('IRIS.csv')
X = iris[["sepal_length", "sepal_width", "petal_length", "petal_width"]].values
y = iris["species"].values

le = LabelEncoder()
y = le.fit_transform(y)

X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


# creating list of k values for KNN
k_list = list(range(1, 50, 2))

# test-set accuracies for both implementations
accuracies_sklearn = []
accuracies = []

for k in k_list:
    # from-scratch implementation
    accuracy = evaluate(X_test, y_test, k)
    accuracies.append(accuracy)

    # sklearn implementation
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies_sklearn.append(acc)

best_k = k_list[accuracies.index(max(accuracies))]
best_k_sklearn = k_list[accuracies_sklearn.index(max(accuracies_sklearn))]
print("The optimal number of neighbors with the from-scratch implementation is %d." % best_k)
print("The optimal number of neighbors with the sklearn implementation is %d." % best_k_sklearn)
print(accuracies, accuracies_sklearn)

fig, ax = plt.subplots()
ax.plot(k_list, accuracies_sklearn, color="blue", label="sklearn")
ax.plot(k_list, accuracies, color="orange", label="from scratch")
ax.set(xlabel="k",
       ylabel="Accuracy",
       title="Performance of KNN")
ax.legend()
plt.show()

The performance should be almost identical (up to a few decimal places… 😌).

Full code

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

import matplotlib.pyplot as plt


iris = pd.read_csv('IRIS.csv')
X = iris[["sepal_length", "sepal_width", "petal_length", "petal_width"]].values
y = iris["species"].values

le = LabelEncoder()
y = le.fit_transform(y)

X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


# Instantiate the learning model (k = 3)
classifier = KNeighborsClassifier(n_neighbors=3)

# Fit the model
classifier.fit(X_train, y_train)

# Predict the test set results
y_pred = classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)*100
print('Accuracy of our model, for k = 3, is equal to ' + str(round(accuracy, 2)) + ' %.')

cm = confusion_matrix(y_test, y_pred)
print(cm)


## Cross Validation

# creating list of k values for KNN
k_list = list(range(1, 50, 2))
# list of cross-validation scores
accuracies = []

# perform 10-fold cross-validation for each k
for k in k_list:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    accuracies.append(scores.mean())

best_k = k_list[accuracies.index(max(accuracies))]
print("The optimal number of neighbors is %d." % best_k)

fig, ax = plt.subplots()
ax.plot(k_list, accuracies)
ax.set(xlabel="k",
       ylabel="Accuracy",
       title="Performance of KNN")
plt.show()