GOAL: Learn the main steps of a supervised learning workflow: data visualization, feature scaling, training, prediction, and model evaluation, and learn to use Scikit-Learn to apply the k-nearest neighbors classifier to the Iris dataset.
Learning experience:
Through this homework, I gained hands-on experience in supervised learning, which involves training models on labeled data to make predictions. Utilizing Scikit-Learn, a Python library offering a comprehensive suite of machine learning algorithms, I delved into various stages of the classification task using the Iris dataset. From preprocessing data to visualizing insights, and ultimately employing the k-nearest neighbors algorithm for model training and prediction, I deepened my understanding of machine learning workflows and techniques.
Working environment:
OS: Windows 11 Home
CPU: Intel i9-13900K
GPU: Nvidia RTX 4090
Python version: 3.12.2
Development environment: Jupyter Notebook
4.0 In this chapter, we will learn about classification, regression, and Scikit-Learn. We start with a classification application through which we introduce and practice important machine learning concepts such as data splitting, normalization, training a classifier, prediction, and evaluation.
4.1 Supervised Learning
Supervised learning is perhaps the most common type of machine learning, in which the goal is to learn a mapping between a vector of input variables (also known as predictors or a feature vector, i.e., a vector of measurable variables in a problem) and output variables. The figure above [2] shows the workflow of supervised learning: we start with raw input data that is labeled. Labeled data is data that has been tagged with a correct answer or classification; for example, a labeled dataset of images of elephants, camels, and cows would have each image tagged as “Elephant”, “Camel”, or “Cow”. The machine then learns the relationship between the inputs (animal images) and the outputs (animal labels), and the trained machine can make predictions on new, unlabeled data.
In classification, the possible values of yi (the target value, or outcome, associated with xi) belong to a set of predefined finite categories called labels, and the goal is to assign a given (realization of a random) feature vector (also known as an observation or instance) to one of the class labels (in some applications, known as multilabel classification, multiple labels are assigned to one instance). In regression, on the other hand, yi represents realizations of a numeric random variable, and the goal is to estimate the target value for a given feature vector.
In machine learning, we refer to this estimation problem as prediction; that is, predicting y for a given x.
4.2 Scikit-Learn
Scikit-Learn is a Python package that provides an efficient and uniform API for implementing many machine learning (ML) methods. It was initially developed in 2007 by David Cournapeau and was made available to the public in 2010. The Scikit-Learn API is neatly designed around a number of classes. Three fundamental objects in Scikit-Learn are Estimators, Transformers, and Predictors. These were introduced in a previous exercise.
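As a reminder of how these three object types appear in code, below is a minimal sketch of the uniform Scikit-Learn API; the specific estimators and the toy data are illustrative choices, not part of the original exercise.

    from sklearn.preprocessing import StandardScaler      # a transformer
    from sklearn.neighbors import KNeighborsClassifier    # an estimator that is also a predictor
    import numpy as np

    X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])    # toy feature matrix
    y = np.array([0, 1, 0])                               # toy labels

    scaler = StandardScaler().fit(X)     # estimator: learns parameters from data via fit()
    X_scaled = scaler.transform(X)       # transformer: modifies data via transform()

    knn = KNeighborsClassifier(n_neighbors=1).fit(X_scaled, y)
    print(knn.predict(X_scaled))         # predictor: makes predictions via predict()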
4.3 The First Application: Iris Flower Classification
In this application, we would like to train an ML classifier that receives a feature vector containing morphologic measurements (length and width of petals and sepals in centimeters) of Iris flowers and classifies the given feature vector into one of the three Iris flower species: Iris setosa, Iris virginica, or Iris versicolor. The underlying hypothesis behind this application is that an Iris flower can be classified into its species based on its petal and sepal lengths and widths. The figure below shows the three types of Iris flowers [3]. An Iris flower dataset is also available on Kaggle.
Therefore, our feature vectors xi that are part of training data are four dimensional (p = 4), and y can take three values; therefore, we have a multiclass classification (three-class) problem.
Our training data is a well-known dataset in statistics and machine learning, namely the Iris dataset, which was collected by one of the most famous biologist-statisticians of the 20th century, Sir Ronald Fisher. This dataset is already part of scikit-learn and can be accessed by importing its datasets module, as shown in In 2.
sklearn.datasets.load_iris(*, return_X_y=False, as_frame=False), Load and return the iris dataset (classification) [4].
Datasets that are part of scikit-learn are generally stored as “Bunch” objects, that is, objects of class sklearn.utils.Bunch. This is an object that contains the actual data as well as some information about it. All this information is stored in Bunch objects similarly to dictionaries (i.e., using keys and values). As with dictionaries, we can use the keys() method to see all keys in a Bunch object, as in In 3.
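A minimal sketch of what In 2 and In 3 roughly correspond to (the variable name iris is my own choice):

    from sklearn import datasets

    iris = datasets.load_iris()   # returns a Bunch object
    print(type(iris))             # a sklearn.utils.Bunch
    print(iris.keys())            # e.g. dict_keys(['data', 'target', 'target_names', 'DESCR', ...])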
In 8, the value of the DESCR key gives some brief information about the dataset, where we can see the feature names, the class names (visible in the first 500 characters), and a summary.
In 4-6, we can also access values in a Bunch object as bunch.key (or, equivalently, as in dictionaries, bunch['key']). For example, we access the class names in In 4 and the target labels in In 5, and In 6 shows that all measurements (i.e., feature vectors) are stored as the value of the data key.
We refer to the matrix containing all feature vectors as the data matrix (also known as the feature matrix). By convention, scikit-learn assumes this matrix has the shape sample size × feature size; that is, the number of observations (also sometimes referred to as the number of samples) × the number of features. For example, in this dataset there are 150 Iris flowers and for each there are 4 features; therefore, the shape is 150 × 4, as shown in In 11.
The corresponding targets (in the same order as the feature vectors stored in data) can be accessed through the target field, as shown in In 12. The three classes in the dataset, namely setosa, versicolor, and virginica, are encoded as integers 0, 1, and 2, respectively. There are various encoding schemes for transforming categorical variables into numerical counterparts; this particular one is known as integer (ordinal) encoding.
Here we use the bincount function from NumPy to count the number of samples in each class, as shown in In 13; as seen there, there are 50 observations in each class.
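A sketch of the inspection steps described above (the exact cell contents may differ slightly):

    import numpy as np
    from sklearn import datasets

    iris = datasets.load_iris()
    print(iris.target_names)          # ['setosa' 'versicolor' 'virginica']
    print(iris.data.shape)            # (150, 4): 150 flowers x 4 features
    print(iris.target[:10])           # integer-encoded labels (0, 1, or 2)
    print(np.bincount(iris.target))   # [50 50 50]: 50 observations per class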
Here we check whether the types of the data matrix and the target are “array-like” (e.g., a NumPy array or a pandas DataFrame), which is the expected type of input data for scikit-learn estimators. In 20 shows that they are of type numpy.ndarray, and In 21 shows the feature names, which match those given earlier.
4.4 Test Set for Model Assessment
Sometimes before training a machine learning model, we need to think a bit in advance. Suppose we train a model based on the given training data. A major question to ask is how well the model performs. This is a critical question that shows the predictive capacity of the trained model; in other words, the entire practical utility of the trained model is summarized in its metrics of performance. Here we only have one dataset, so we have to simulate the effect of having a test set. In this regard, we randomly split the given data into a training set (used to train the model) and a test set, which is used for evaluation. Because the test set is held out from the original data, it is also commonly referred to as a holdout set. Next we use the train_test_split function from the sklearn.model_selection module. Remember that the test_size argument of the function represents the proportion of data that should be assigned to the test set; its default value is 0.25, which is a good rule of thumb if no other specific proportion is desired. It is also good practice to keep the proportion of classes in both the training and the test sets the same as in the whole data. This is done by setting stratify to the variable representing the target.
sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None), Split arrays or matrices into random train and test subsets. [5]
In 23, we use a stratified random split to divide the given data into 80% training and 20% test. In 24, we set test_size=0.5, so the data is split 75/75. X_train and y_train are the feature matrix and the target values used for training, respectively, and the feature matrix and the corresponding targets for evaluation are stored in X_test and y_test. In 26 counts the number of class-specific observations in the training data; this shows that the equal proportion of classes in the given data is kept in both the training and the test sets.
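A minimal sketch of the stratified 80/20 split described above (the random_state value is an assumption added here for reproducibility):

    import numpy as np
    from sklearn import datasets
    from sklearn.model_selection import train_test_split

    iris = datasets.load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target,
        test_size=0.2,            # 80% training, 20% test
        stratify=iris.target,     # keep class proportions the same in both sets
        random_state=42)          # assumed seed for reproducibility

    print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)
    print(np.bincount(y_train))          # [40 40 40]: equal class proportions preserved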
4.5 Data Visualization
For datasets in which the number of variables is not really large, visualization could be a good exploratory analysis—it could reveal possible abnormalities or could provide us with an insight into the hypothesis behind the entire experiment. In this regard, scatter plots can be helpful.
There are various ways to create pair plots in Python, but an easy and appealing way is to use the pairplot() function from the seaborn library. As this function expects a DataFrame as input, In 27 first converts the X_train and y_train arrays to DataFrames and concatenates them.
In 31 generates the pair plots; the output presents the scatter plot of every pair of features. The plots on the diagonal show the histogram of each feature across all classes. For example, we can observe that the class-specific histograms generated from the petal width feature are fairly distinct. Further inspection of these plots suggests that, for example, petal width could potentially be a better feature than sepal width for discriminating classes, because the class-specific histograms for sepal width are more mixed than the petal width histograms. However, it is not easy to infer much about higher-order feature dependency (i.e., multivariate relationships) from these plots. As a result, although visualization could also be used for selecting discriminating feature subsets, it is generally avoided because we are restricted to low-order dependencies among features, while in many problems higher-order feature dependencies lead to an acceptable level of prediction. Next we consider all four features to train a classifier.
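A sketch of the DataFrame conversion and pair plot generation, continuing from the split sketched earlier; the column name species and the mapping of integer labels to names are my own choices and may differ from the notebook:

    import pandas as pd
    import seaborn as sns

    # assemble training features and labels into one DataFrame for seaborn
    iris_df = pd.DataFrame(X_train, columns=iris.feature_names)
    iris_df['species'] = pd.Series(y_train).map(dict(enumerate(iris.target_names)))

    # pair plot: scatter plots for each pair of features, histograms on the diagonal
    sns.pairplot(iris_df, hue='species', diag_kind='hist')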
4.6 Feature Scaling (Normalization)
This section is similar to Ex4 4.3 (normalization). In 34 first finds the mean and the standard deviation of each feature in the training set, and after standardizing with these statistics we observe in In 35 that the mean of each feature is 0 and the standard deviation is 1.
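A sketch of manual standardization using only training-set statistics (continuing from the split above; the variable names are assumed):

    # compute per-feature statistics from the training set only
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)

    # standardize the training set: each feature now has mean 0 and std 1
    X_train_scaled = (X_train - mean) / std
    print(X_train_scaled.mean(axis=0).round(6))   # approximately [0. 0. 0. 0.]
    print(X_train_scaled.std(axis=0))             # [1. 1. 1. 1.]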
It is important to note that the “test set” should not be used in any stage involved in training of our classifier, not even in a preprocessing stage such as normalization. The main reason for this can be explained through the following four points:
1) The entire worth of the classifier depends on its performance on unseen observations.
2) Because “unseen” observations, as the name suggests, are not available to us during training, we use the (available) test set to simulate the effect of unseen observations when evaluating the trained classifier.
3) As a result of Point 2, in order to have an unbiased evaluation of the classifier using the test set, the classifier should classify observations in the test set in precisely the same way it would classify unseen future observations.
4) Unseen observations are not available to us and naturally cannot be used in any training stage such as normalization, feature selection, etc.; therefore, observations in the test set should not be used in any training stage either.
In this regard, a common mistake is to apply some data preprocessing, such as normalization, before splitting the entire data into training and test sets. It is important to note that here we refer to “normalizing the entire data before splitting” as an illegitimate practice from the standpoint of model evaluation using test data, which is the most common reason to split the data into two sets. However, a similar practice could be legitimate if we view it from the standpoint of:
1) training a classifier in a semi-supervised learning fashion. An objective in semi-supervised learning is to use both labeled and unlabeled data to train a classifier. Suppose we use the entire data for normalization and then divide it into two sets, namely Set A and Set B. Set A (its feature vectors and their labels) is then used to train a classifier, and Set B is not used at this stage. This classifier is trained using both labeled data (Set A) and unlabeled data (Set B): for normalization we used both Set A and Set B without using labels, but Set A (and the labels of observations within that set) was used to train the classifier. Although from the standpoint of semi-supervised learning we have used a labeled set and an unlabeled set in the process of training the classifier, it is perhaps a “dumb” training procedure because all the valuable labels of observations within Set B are discarded, and this set is only used for normalization purposes. Nevertheless, even after training the classifier using such a semi-supervised learning procedure, we still need an independent test set to evaluate the performance of the classifier.
2) model evaluation using a performance estimator with unknown properties. An important property of using a test set for model evaluation is that it is an unbiased estimator of model performance. This procedure is also known as the test-set estimator of performance. However, there are other performance estimators. For example, another estimator is known as the resubstitution estimator (discussed in Chapter 9), which simply reuses the training data to evaluate the performance of a trained model. Nonetheless, resubstitution has the undesirable property of usually being strongly optimistically biased. That being said, just because resubstitution has an undesirable property, we cannot refer to it as an illegitimate estimation rule. When we refer to the practice of “normalizing the data before splitting into two sets and then using one set for training and the other for evaluation” as an illegitimate practice, what is really meant is that following this procedure and then framing it as the test-set estimator of performance is illegitimate. If we accept it, however, as another performance metric, one that is less well known and is expected to be optimistically biased to some extent, then it is a legitimate performance estimator.
In 38, once the relevant statistics (here, the mean and the standard deviation) are estimated from the training set, they can be used to normalize the test set. In 39, observe that the normalized test set does not necessarily have a mean of 0 or a standard deviation of 1.
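A sketch of normalizing the test set with the training-set statistics computed earlier (continuing from the manual standardization sketch above):

    # normalize the test set using the mean and std estimated from the training set
    X_test_scaled = (X_test - mean) / std
    print(X_test_scaled.mean(axis=0))   # not necessarily 0
    print(X_test_scaled.std(axis=0))    # not necessarily 1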
Here we use StandardScaler to perform the standardization. To be able to use these transformers (or any other estimator), we first need to instantiate the class into an object.
class sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True), Standardize features by removing the mean and scaling to unit variance.
In 42, to estimate the mean and the standard deviation of each feature from the training set (i.e., from X_train), we call the fit() method of the scaler object.
The scaler object holds any information that the standardization algorithm implemented in the StandardScaler class extracts from X_train. The fit() method returns the scaler object itself and modifies it in place (i.e., stores the parameters estimated from the data). Next, we call the transform() method of the scaler object to transform the training and test sets based on the statistics extracted from the training set, and we use X_train_scaled and X_test_scaled to refer to the transformed training and test sets, respectively (In 43).
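A sketch of the StandardScaler usage described in In 42-43 (assuming X_train and X_test from the earlier split):

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    scaler.fit(X_train)                          # estimate per-feature mean and std from the training set only

    X_train_scaled = scaler.transform(X_train)   # transform the training set
    X_test_scaled = scaler.transform(X_test)     # transform the test set with the *training* statistics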
For later use and in order to avoid the above preprocessing steps, the training and testing arrays could be saved using numpy.save() to binary files. For NumPy arrays this is generally more efficient than the usual “pickling” supported by pickle module. For saving multiple arrays, we can use numpy.savez(). We can provide our arrays as keyword arguments to this function. In that case, they are saved in a binary file with arbitrary names that we provide as keywords. If we give them as positional arguments, then they will be stored with names being arr_0, arr_1, etc. Here we specify our arrays as keywords X and y.
Note: the error “No such file or directory: 'data/iris_train_scaled.npz'” occurs when the data/ directory does not exist (or the file has not been saved yet); making sure the directory exists and the arrays are saved first resolves it.
Note: np.savez saves the arrays into a single uncompressed .npz file, which is a zip archive containing the individual arrays. [6]
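A sketch of saving and reloading the preprocessed training arrays with explicit keyword names (the file path is taken from the note above; the test arrays could be saved analogously):

    import numpy as np

    # save arrays under the keyword names X and y (the data/ directory must already exist)
    np.savez('data/iris_train_scaled.npz', X=X_train_scaled, y=y_train)

    # later: reload them by the same keywords
    arrays = np.load('data/iris_train_scaled.npz')
    X_train_scaled, y_train = arrays['X'], arrays['y']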
4.7 Model Training
We are now in a position to train the actual machine learning model. For this purpose, we use the k-nearest neighbors (kNN) classification rule in its standard form. To classify a test point, one can think of the kNN classifier as growing a spherical region centered at the test point until it encloses k training samples, and then classifying the test point to the majority class among those k training samples. For example, in Fig. 4.2, 5NN assigns “green” to the test observation because, within the 5 nearest observations to this test point, three are from the green class. The kNN classifier is implemented in the KNeighborsClassifier class in the sklearn.neighbors module. Similar to the way we used the StandardScaler estimator earlier, we first instantiate the KNeighborsClassifier class into an object, as shown in In 51.
In 52: the constructors of many estimators take as arguments various hyperparameters, which can affect the estimator's performance. In the case of kNN, perhaps the most important one is k, the number of nearest neighbors of a test point. In the KNeighborsClassifier constructor this is determined by the n_neighbors parameter. In the above code snippet, we set n_neighbors to 3, which means we are using 3NN. The full specification of the KNeighborsClassifier class is found at (Scikit-kNNC, 2023). To train the model, we call the fit() method of the knn object. Because this is an estimator used for supervised learning, its fit() method expects both the feature matrix and the targets. As with the fit() method of StandardScaler, the fit() method here returns the modified knn object.
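A sketch of In 51-52, instantiating and fitting a 3NN classifier on the scaled training data:

    from sklearn.neighbors import KNeighborsClassifier

    knn = KNeighborsClassifier(n_neighbors=3)   # hyperparameter k = 3
    knn.fit(X_train_scaled, y_train)            # supervised fit: needs both features and targets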
4.8 Prediction Using the Trained Model
As said before, some estimators in scikit-learn are predictors; that is, they can make predictions by implementing the predict() method. KNeighborsClassifier is also a predictor and therefore implements predict(). Here we use this method to make a prediction on a new data point. Suppose we have the following data point measured on the original scale, as in the original training set X_train (In 53). Recall that scikit-learn always assumes two-dimensional NumPy arrays of shape sample size × feature size; this is why, in the code above, the test point x_test is placed in a two-dimensional NumPy array of shape (1, 4). However, before making a prediction, we need to scale the test data point using the same statistics used to scale the training set; after all, the knn classifier was trained using the transformed dataset and it classifies data points in the transformed space. This is achieved in In 54.
Now, we can predict the label by In 55.
In 56 we can also give several sample points as the argument to the predict() method. In that case, we receive the assigned label for each of them.
The above sequence of operations, namely instantiating the KNeighborsClassifier class, fitting, and predicting, can be combined into the following one-liner pattern (In 60), known as method chaining.
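A sketch covering In 53-55 and the method-chaining one-liner of In 60, continuing from the fitted scaler and knn above; the numeric values of the new point and the variable name x_new are illustrative, not necessarily those in the notebook:

    import numpy as np

    # a new flower measured on the original (unscaled) centimeter scale -- illustrative values
    x_new = np.array([[5.1, 3.5, 1.4, 0.2]])   # shape (1, 4): one sample, four features

    x_new_scaled = scaler.transform(x_new)     # scale with the training-set statistics
    print(knn.predict(x_new_scaled))           # e.g. array([0]) -> 'setosa'

    # the same workflow written as a single chained expression ("method chaining")
    y_pred = KNeighborsClassifier(n_neighbors=3).fit(X_train_scaled, y_train).predict(X_test_scaled)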
4.9 Model Evaluation (Error Estimation)
There are various rules and metrics to assess the performance of a classifier. Here we use the simplest and perhaps the most intuitive one; that is, the proportion of misclassified points in a test set. This is known as the test-set estimator of error rate. In other words, the proportion of misclassified observations in the test set is indeed an estimate of classification error rate denoted ε, which is defined as the probability of misclassification by the trained classifier.
Let us first formalize the definition of error rate for binary classification. Let X and Y represent a random feature vector and a binary random variable representing the class variable, respectively. Because Y is a discrete random variable and X is a continuous feature vector, we can characterize the joint distribution of X and Y (known as the joint feature-label distribution) as

P(X ∈ E, Y = i) = P(Y = i) ∫_E p(x | Y = i) dx,  i = 0, 1,   (4.1)
where p(x|Y = i) is known as the class-conditional probability density function, which shows the (relative) likelihood of X being close to a specific value x given Y = i, and where E represents an event, which is basically a subset of the sample space to which we can assign probability (in technical terms a Borel-measurable set). Intuitively, P(X,Y) shows the frequency of encountering particular pairs of (X,Y) in practice. Furthermore, in (4.1), P(Y = i) is the prior probability of class i, which quantifies the probability that a randomly drawn sample from the population of entities across all classes belongs to class i.
Given a training set Str, we train a classifier ψ : R^p → {0, 1}, which maps realizations of X to realizations of Y. Let ψ(X) denote the act of classifying (realizations of) X by a specific trained classifier. Because X is random, ψ(X), which is either 0 or 1, is random as well. Let E0 denote all events for which ψ(X) gives label 0. A probabilistic question to ask is: what is the joint probability of all those events and Y = 1? Formally, to answer this question, we need to find

P(X ∈ E0, Y = 1) = P(Y = 1) ∫_{E0} p(x | Y = 1) dx,   (4.2)

where the equality is a direct consequence of (4.1). Similarly, let E1 denote all events for which ψ(X) gives label 1. We can ask a similar probabilistic question; that is, what is the joint probability of Y = 0 and E1? In this case, we need to find

P(X ∈ E1, Y = 0) = P(Y = 0) ∫_{E1} p(x | Y = 0) dx.   (4.3)

At this stage, it is straightforward to see that the probability of misclassification ε is obtained by adding probabilities (4.2) and (4.3); that is,

ε = P(X ∈ E0, Y = 1) + P(X ∈ E1, Y = 0).
Nonetheless, in practical settings ε is almost always unknown because of the unknown nature of P(X ∈ Ei ,Y = i). The test-set error estimator is a way to estimate ε using a test set.
Suppose we apply the classifier on a test set Ste that contains m observations with their labels. Let k denote the number of observations in Ste that are misclassified by the classifier. The test-set error estimate, denoted εˆte, is then given by

εˆte = k / m.
Rather than reporting the error estimate, it is also common to report the accuracy estimate of a classifier. The accuracy, denoted acc, and its test-set estimate, denoted accˆte, are given by

acc = 1 − ε,  accˆte = 1 − εˆte = (m − k) / m.
For example, a classifier with an error rate of 15% has an accuracy of 85%.
Let us calculate the test-set error estimate of our trained kNN classifier. In this regard, we can compare the actual labels within the test set with the predicted labels and then find the proportion of misclassifications, as shown in In 67. The classifier misclassified three data points (three True values in the comparison) out of the 30 in the test set; therefore, εˆte = 3/30 = 0.1, as shown in In 68, where we use the placeholder { } and the format specifier .2f to specify the number of digits after the decimal point. The “:” before .2f separates the format specifier from the rest of the replacement field (if any option is set) within the { }.
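A sketch of the manual error-rate computation in In 67-68 (continuing with the fitted knn and the scaled test set from above):

    import numpy as np

    y_pred = knn.predict(X_test_scaled)

    # each True in the comparison marks one misclassified test point
    n_misclassified = np.sum(y_test != y_pred)
    error_estimate = n_misclassified / y_test.size

    print(f"test-set error estimate = {error_estimate:.2f}")   # e.g. 0.10 for 3 errors out of 30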
Using built-in functions from scikit-learn's metrics module, many performance metrics can be easily calculated. A complete list of the metrics supported by scikit-learn is found at (Scikit-eval, 2023). Here we only show how the accuracy estimate can be obtained in scikit-learn. For this purpose, there are two options: 1) using the accuracy_score function; and 2) using the score method of the classifier.
The accuracy_score function expects the actual labels and the predicted labels as arguments, as shown in In 70. All classifiers in scikit-learn also have a score method that, given test data and its labels, returns the classifier's accuracy; see, for example, In 71.
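A sketch of the two accuracy options (In 70-71), continuing from the fitted knn above:

    from sklearn.metrics import accuracy_score

    # option 1: compare actual and predicted labels explicitly
    print(accuracy_score(y_test, knn.predict(X_test_scaled)))

    # option 2: let the classifier score the test set itself
    print(knn.score(X_test_scaled, y_test))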
In 76 we can use load_digits to load the digits dataset.
In 77 we can print the keys and DESCR to see information about this dataset. In 82 we print the images shape and the data shape: (1797, 8, 8) indicates that there are 1797 images in total, each with dimensions of 8×8 pixels, and (1797, 64) indicates that there are 1797 data points, each containing 64 feature values, because each 8×8 image is flattened into a one-dimensional array of 64 elements.
In 83 we plot the first 100 images; the output is shown below.
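A sketch of loading the digits dataset and plotting the first 100 images in a grid (In 76-83); the grid layout and colormap here are my own choices:

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits

    digits = load_digits()
    print(digits.keys())
    print(digits.images.shape)   # (1797, 8, 8): 1797 images of 8x8 pixels
    print(digits.data.shape)     # (1797, 64): each image flattened into 64 features

    fig, axes = plt.subplots(10, 10, figsize=(8, 8))
    for ax, image in zip(axes.ravel(), digits.images[:100]):
        ax.imshow(image, cmap='gray_r')   # show one 8x8 pixel image
        ax.set_axis_off()                 # hide axes for a cleaner grid
    plt.show()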
In 124 we implement the exercise. First we import the libraries and use the train_test_split function to split the digits dataset into training (X_train, y_train) and testing (X_test, y_test) sets, with 25% of the data reserved for testing. Next we standardize the features using StandardScaler; fit_transform stores the result in X_train_scaled, and the test features are scaled in the same way. Then we train the kNN classifier with k = 3 on the standardized training features (X_train_scaled) and training labels (y_train) using the fit method. Finally, we compute the accuracy of the trained classifier on the standardized testing features (X_test_scaled) and testing labels (y_test) using the score method and print the result. The result is 0.96667.
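A hedged reconstruction of the pipeline described for In 124 (the random_state value is an assumption; with a different split the accuracy may differ slightly from 0.96667):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier

    digits = load_digits()
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.25, random_state=42)  # 25% held out for testing

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)   # fit on the training data, then transform it
    X_test_scaled = scaler.transform(X_test)         # transform the test data with the same statistics

    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train_scaled, y_train)

    print(knn.score(X_test_scaled, y_test))          # accuracy on the held-out test set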
We also plot accuracy versus the number of neighbors; we can see that as k becomes larger the accuracy gets worse, so this plot can be used to find the best k for prediction. As shown in In 126, we obtain an accuracy of 0.97777, which is better than the accuracy with k = 3.
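A sketch of the accuracy-versus-k sweep behind the plot (continuing from the digits pipeline above; the range of k values is my own choice):

    import matplotlib.pyplot as plt

    ks = range(1, 21)
    accuracies = []
    for k in ks:
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train_scaled, y_train)
        accuracies.append(model.score(X_test_scaled, y_test))   # test accuracy for this k

    plt.plot(ks, accuracies, marker='o')
    plt.xlabel('number of neighbors (k)')
    plt.ylabel('test accuracy')
    plt.show()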
[1] R1 - Chapter 4: Supervised Learning in Practice - the First Application Using Scikit-Learn, Machine Learning with Python: Theory and Implementation.
[2] Supervised and Unsupervised learning , geeksforgeeks, 2024.
[3] Start Your First Machine Learning Project with the Iris flower classification challenge , Ritwik Dalmia, linkedin , 2022.
[4] sklearn.datasets.load_iris , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[5] sklearn.model_selection.train_test_split , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[6] [Day18] NumPy file input and output! (Numpy檔案輸入與輸出!), plusone, 2018.