GOAL : Learn about Support Vector Machines (SVM) and scikit-learn's LinearSVC and SVC classes.
Learning experience: In this assignment, I discovered that the Support Vector Machine (SVM) is a powerful method. By learning to use the relevant functions in scikit-learn, I gained a deeper understanding of how to apply SVM models. I also used scikit-learn's GridSearchCV function to perform hyperparameter tuning, which produced better accuracy than my previous attempts. This deepened my understanding of how to tune machine learning models and improve their performance, and it enhanced my skills and knowledge.
Working environment:
OS: Windows 11 Home
CPU: Intel Core i9-13900K
GPU: NVIDIA RTX 4090
Python version: 3.12.2
Development environment: Jupyter Notebook
17.0 Introduction
To understand support vector machines, we must understand hyperplanes. Formally, a hyperplane is an (n − 1)-dimensional subspace of an n-dimensional space. While that sounds complex, it actually is pretty simple. For example, if we wanted to divide a two-dimensional space, we’d use a one-dimensional hyperplane (i.e., a line). If we wanted to divide a three-dimensional space, we’d use a two-dimensional hyperplane (i.e., a flat piece of paper or a bed sheet). A hyperplane is simply a generalization of that concept into n dimensions.
Support vector machines classify data by finding the hyperplane that maximizes the margin between the classes in the training data. In a two-dimensional example with two classes, we can think of a hyperplane as the widest straight “band” (i.e., line with margins) that separates the two classes.
The advantages of support vector machines are:
Effective in high dimensional spaces.
Still effective in cases where number of dimensions is greater than the number of samples.
Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
The disadvantages of support vector machines include:
If the number of features is much greater than the number of samples, avoiding over-fitting through the choice of kernel function and regularization term is crucial.
SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below). [2]
Scikit-learn provides the sklearn.svm module, which includes Support Vector Machine algorithms such as:
svm.LinearSVC([penalty, loss, dual, tol, C, ...]) : Linear Support Vector Classification.
svm.LinearSVR(*[, epsilon, tol, C, loss, ...]) : Linear Support Vector Regression.
svm.NuSVC(*[, nu, kernel, degree, gamma, ...]) : Nu-Support Vector Classification.
svm.NuSVR(*[, nu, C, kernel, degree, gamma, ...]) : Nu Support Vector Regression.
svm.OneClassSVM(*[, kernel, degree, gamma, ...]) : Unsupervised Outlier Detection.
svm.SVC(*[, C, kernel, degree, gamma, ...]) : C-Support Vector Classification.
svm.SVR(*[, kernel, degree, gamma, coef0, ...]) : Epsilon-Support Vector Regression.
This chapter will focus on LinearSVC (Section 17.1) and SVC.
class sklearn.svm.LinearSVC(penalty='l2', loss='squared_hinge', *, dual='warn', tol=0.0001, C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000) implements Linear Support Vector Classification. [3]
There are some important parameters:
penalty : {'l1', 'l2'}, default='l2' : Specifies the norm used in the penalization. The 'l2' penalty is the standard used in SVC. The 'l1' penalty leads to coef_ vectors that are sparse.
loss : {'hinge', 'squared_hinge'}, default='squared_hinge' : Specifies the loss function. 'hinge' is the standard SVM loss (used e.g. by the SVC class) while 'squared_hinge' is the square of the hinge loss. The combination of penalty='l1' and loss='hinge' is not supported.
C : float, default=1.0 : Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive.
multi_class : {'ovr', 'crammer_singer'}, default='ovr' : Determines the multi-class strategy if y contains more than two classes. 'ovr' trains n_classes one-vs-rest classifiers, while 'crammer_singer' optimizes a joint objective over all classes. While crammer_singer is interesting from a theoretical perspective as it is consistent, it is seldom used in practice as it rarely leads to better accuracy and is more expensive to compute. If 'crammer_singer' is chosen, the options loss, penalty and dual will be ignored.
fit_intercept : bool, default=True : Whether or not to fit an intercept. If set to True, the feature vector is extended to include an intercept term: [x_1, ..., x_n, 1], where 1 corresponds to the intercept. If set to False, no intercept will be used in calculations (i.e. data is expected to be already centered).
class_weight : dict or 'balanced', default=None : Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The 'balanced' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
And some commonly used methods:
decision_function(X) : Predict confidence scores for samples.
fit(X, y[, sample_weight]) : Fit the model according to the given training data.
predict(X) : Predict class labels for samples in X.
score(X, y[, sample_weight]) : Return the mean accuracy on the given test data and labels.
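Below is a minimal usage sketch of these LinearSVC methods on a small synthetic dataset; the dataset and variable names are illustrative, not taken from the original notebook.

# Minimal LinearSVC usage sketch (illustrative synthetic data)
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LinearSVC(C=1.0)
clf.fit(X, y)                         # fit(X, y): train the model
print(clf.predict(X[:5]))             # predict(X): class labels
print(clf.decision_function(X[:5]))   # decision_function(X): confidence scores
print(clf.score(X, y))                # score(X, y): mean accuracy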
class sklearn.svm.SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None) implements C-Support Vector Classification. [4]
There are some important parameters:
C : float, default=1.0 : Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.
kernel : {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'} or callable, default='rbf' : Specifies the kernel type to be used in the algorithm. If none is given, 'rbf' will be used. If a callable is given it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples). For an intuitive visualization of different kernel types see Plot classification boundaries with different SVM Kernels.
degree : int, default=3 : Degree of the polynomial kernel function ('poly'). Must be non-negative. Ignored by all other kernels.
gamma : {'scale', 'auto'} or float, default='scale' : Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.
if gamma='scale' (default) is passed then it uses 1 / (n_features * X.var()) as the value of gamma,
if 'auto', uses 1 / n_features,
if float, must be non-negative.
class_weight : dict or 'balanced', default=None : Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The 'balanced' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
And some commonly used methods:
decision_function(X) : Evaluate the decision function for the samples in X.
fit(X, y[, sample_weight]) : Fit the SVM model according to the given training data.
predict_proba(X) : Compute probabilities of possible outcomes for samples in X.
score(X, y[, sample_weight]) : Return the mean accuracy on the given test data and labels.
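Below is a minimal usage sketch of these SVC methods on the same kind of synthetic data; note that predict_proba requires the model to be constructed with probability=True. The example is illustrative, not from the original notebook.

# Minimal SVC usage sketch (illustrative synthetic data)
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = SVC(C=1.0, kernel="rbf", gamma="scale", probability=True, random_state=0)
clf.fit(X, y)                          # fit(X, y): train the model
print(clf.decision_function(X[:5]))    # decision_function(X): decision values
print(clf.predict_proba(X[:5]))        # predict_proba(X): class probabilities
print(clf.score(X, y))                 # score(X, y): mean accuracy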
Here we will briefly introduce the math behind the linear SVM. [4]
Use of Dot Product in SVM
Consider a random point X and we want to know whether it lies on the right side of the plane or the left side of the plane (positive or negative).
To find this, we first treat the point as a vector (X) and then construct a vector (w) that is perpendicular to the hyperplane. Let’s say the distance from the origin to the decision boundary along w is ‘c’. Now we take the projection of the vector X onto w.
We already know that the projection of one vector onto another is obtained with the dot product. Hence, we take the dot product of the x and w vectors. If the dot product is greater than ‘c’, the point lies on the right side; if it is less than ‘c’, the point lies on the left side; and if it is equal to ‘c’, the point lies on the decision boundary.
Why did we take this vector w perpendicular to the hyperplane? What we want is the distance of vector X from the decision boundary, and there are infinitely many points on the boundary from which to measure that distance. So we standardize: we take the perpendicular as a reference, project all the other data points onto this perpendicular vector, and then compare the distances.
Margin in Support Vector Machine
We all know the equation of a hyperplane is w.x+b=0, where w is a vector normal to the hyperplane and b is an offset.
To classify a point as negative or positive we need to define a decision rule. We can define decision rule as:
If the value of w.x+b>0 then we can say it is a positive point otherwise it is a negative point. Now we need (w,b) such that the margin has a maximum distance. Let’s say this distance is ‘d’.
To calculate ‘d’ we need the equations of L1 and L2. For this, we make a few assumptions: the equation of L1 is w.x+b=1 and for L2 it is w.x+b=-1.
The lines will move as we do changes in (w,b) and this is how this gets optimized. But what is the optimization function? Let’s calculate it.
We know that the aim of SVM is to maximize this margin that means distance (d). But there are few constraints for this distance (d). Let’s look at what these constraints are.
Optimization Function and its Constraints
In order to get our optimization function, there are a few constraints to consider. The constraint is that we’ll calculate the distance (d) in such a way that no positive or negative point can cross the margin line. Written mathematically, the two constraints are w.x+b >= 1 for positive points and w.x+b <= -1 for negative points.
Rather than taking 2 constraints forward, we’ll now try to simplify these two constraints into 1. We assume that negative classes have y=-1 and positive classes have y=1.
We can say that for every point to be correctly classified this condition should always be true: yi*(w.xi+b) >= 1.
Suppose a green point is correctly classified; that means it follows w.x+b >= 1. If we multiply this by y=1, we get the same combined equation mentioned above. Similarly, a red point follows w.x+b <= -1, and multiplying by y=-1 again gives that equation. Hence, we can say that we need to maximize (d) such that this constraint holds true.
We will take two support vectors, one from the negative class and one from the positive class. The distance between these two vectors x1 and x2 will be the (x2-x1) vector. What we need is the shortest distance between these two points, which can be found using the trick we used for the dot product: we take a vector ‘w’ perpendicular to the hyperplane and then find the projection of the (x2-x1) vector on ‘w’. Note: for this to work, the perpendicular vector must be a unit vector (the reason was explained in the dot-product section). To make ‘w’ a unit vector we divide it by the norm of ‘w’.
Finding Projection of a Vector on Another Vector Using Dot Product
We already know how to find the projection of a vector on another vector: we take the dot product of the two vectors, with the reference vector normalized. So the margin is
d = (x2 - x1) . w / ||w||   ... (1)
Since x2 and x1 are support vectors and lie on the margin hyperplanes, they satisfy yi*(w.xi+b)=1, so we can write:
w.x2 + b = 1   ... (2)
w.x1 + b = -1   ... (3)
Putting equations (2) and (3) into equation (1) we get:
d = ((1 - b) - (-1 - b)) / ||w|| = 2 / ||w||
Hence the quantity we have to maximize is 2/||w||, subject to the constraint yi*(w.xi+b) >= 1 for every point.
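For reference, the resulting hard-margin problem and its standard equivalent minimization form (a textbook restatement, not copied from the cited source) can be written in LaTeX as:

\max_{w,\,b} \; \frac{2}{\lVert w \rVert}
\quad \text{subject to} \quad y_i \,(w \cdot x_i + b) \ge 1 \;\; \forall i
\qquad \Longleftrightarrow \qquad
\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad y_i \,(w \cdot x_i + b) \ge 1 \;\; \forall i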
We have now found our optimization function, but there is a catch: we almost never find this kind of perfectly linearly separable data in practice, so we can rarely use the condition we proved here directly. The type of problem we just studied is called Hard Margin SVM; next we shall study the soft margin, which is similar but uses a few more interesting tricks.
Soft Margin SVM
In real-life applications, we rarely encounter datasets that are perfectly linearly separable. Instead, we often come across datasets that are either nearly linearly separable or entirely non-linearly separable. Unfortunately, the trick demonstrated above for linearly separable datasets is not applicable in these cases. This is where the soft margin formulation (and, later, kernels) comes into play, making SVM a powerful tool that can effectively handle both almost linearly separable and non-linearly separable datasets and providing a robust solution to classification problems in diverse real-world scenarios.
To tackle this problem what we do is modify that equation in such a way that it allows few misclassifications that means it allows few points to be wrongly classified.
We know that max[f(x)] can also be written as min[1/f(x)], and it is common practice to minimize a cost function in optimization problems; therefore, we invert the function and minimize ||w||/2 instead of maximizing 2/||w||.
To make the soft margin equation, we add a slack term zeta for each point to this objective and multiply the sum of these slacks by a hyperparameter ‘c’.
For all the correctly classified points, zeta will be equal to 0, and for all the incorrectly classified points, zeta is simply the distance of that particular point from its correct hyperplane; that means if we look at the wrongly classified green points, the value of zeta will be the distance of these points from the L1 hyperplane, and for a wrongly classified red point zeta will be the distance of that point from the L2 hyperplane.
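Written out, the standard soft-margin objective the text is describing (using the document's ‘c’ for the penalty hyperparameter) is:

\min_{w,\,b,\,\zeta} \; \frac{1}{2}\lVert w \rVert^{2} + c \sum_{i} \zeta_i
\quad \text{subject to} \quad y_i \,(w \cdot x_i + b) \ge 1 - \zeta_i, \;\; \zeta_i \ge 0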
So now we can say that SVM Error = Margin Error + Classification Error. The higher the margin, the lower the margin error, and vice versa.
Let’s say you take a high value of ‘c’, e.g. 1000; this would mean that you don’t want to focus on the margin error and just want a model that doesn’t misclassify any data point.
Look at the figure below:
If someone asks you which is a better model, the one where the margin is maximum and has 2 misclassified points or the one where the margin is very less, and all the points are correctly classified?
Well, there’s no single correct answer to this question, but we can use SVM Error = Margin Error + Classification Error to reason about it. If you don’t want any misclassification in the model, then you can choose figure 2; that means we’ll increase ‘c’ to decrease the Classification Error. But if you want your margin to be maximized, then the value of ‘c’ should be decreased. That’s why ‘c’ is a hyperparameter, and we find the optimal value of ‘c’ using GridSearchCV and cross-validation.
Kernels in Support Vector Machine
The most interesting feature of SVM is that it can even work with a non-linear dataset; for this, we use the “Kernel Trick”, which makes it easier to classify the points. Suppose we have a dataset like this:
Here we see that we cannot draw a single line, or hyperplane, that can classify the points correctly. So what we do is convert this lower-dimensional space to a higher-dimensional space using some non-linear (e.g., quadratic) functions, which allows us to find a decision boundary that clearly divides the data points. The functions that help us do this are called kernels, and which kernel to use is purely determined by hyperparameter tuning.
Different Kernel Functions
Some kernel functions which you can use in SVM are given below:
1. Polynomial Kernel
Following is the formula for the polynomial kernel: K(x1, x2) = (x1 . x2 + 1)^d
Here d is the degree of the polynomial, which we need to specify manually.
Suppose we have two features X1 and X2 and output variable Y; using a degree-2 polynomial kernel we can expand the product as sketched below. So we basically need X1^2, X2^2 and X1.X2 (together with the scaled original features), and now we can see that 2 dimensions got converted into 5 dimensions.
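As a worked example (assuming the (x . y + 1)^2 form of the kernel given above), the expansion for two input features is:

(1 + a \cdot b)^2 = 1 + 2a_1 b_1 + 2a_2 b_2 + a_1^2 b_1^2 + a_2^2 b_2^2 + 2 a_1 a_2 b_1 b_2
= \varphi(a) \cdot \varphi(b),
\qquad
\varphi(x) = \left(1,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\right)

Ignoring the constant component, the implicit feature map has the five components mentioned above.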
2. Sigmoid Kernel
We can use it as a proxy for neural networks. The equation is: K(x1, x2) = tanh(α(x1 . x2) + c), with slope α and constant c.
It just takes your inputs and maps them to values between 0 and 1 so that they can be separated by a simple straight line.
3. RBF Kernel
What it actually does is create non-linear combinations of our features to lift the samples onto a higher-dimensional feature space where we can use a linear decision boundary to separate the classes. It is the most used kernel in SVM classification; the following formula expresses it mathematically: K(X1, X2) = exp(-||X1 - X2||^2 / (2σ^2))
where,
1. ‘σ’ is the variance and our hyperparameter
2. ||X₁ – X₂|| is the Euclidean Distance between two points X₁ and X₂
4. Bessel function kernel
It is mainly used for eliminating the cross term in mathematical functions. Following is the formula of the Bessel function kernel:
5. Anova Kernel
It performs well on multidimensional regression problems. The formula for this kernel function is:
How to Choose the Right Kernel?
Choosing a kernel depends entirely on what kind of dataset you are working with. If it is linearly separable, then you should opt for the linear kernel function, since it is easy to use and its complexity is much lower than that of the other kernel functions. I’d recommend starting with the hypothesis that your data is linearly separable and choosing a linear kernel function.
You can then work your way up towards the more complex kernel functions. Usually, we use SVM with the RBF and linear kernel functions, because other kernels such as the polynomial kernel are rarely used due to poor efficiency. A small grid-search sketch for comparing kernels follows.
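The sketch below shows one way to compare kernels (and a few C values) with cross-validation; the dataset, pipeline and grid values are illustrative assumptions, not taken from the report's experiments.

# Sketch: compare kernels with a small grid search (illustrative)
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Standardize inside the pipeline so each CV fold is scaled correctly
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__kernel": ["linear", "rbf", "poly"],
              "svc__C": [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)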
How Does Support Vector Machine Algorithm Work?
The best way to understand the SVM algorithm is by focusing on its primary type, the SVM classifier. The idea behind the SVM classifier is to come up with a hyperplane in an N-dimensional space that divides the data points belonging to different classes. This hyperplane is chosen based on the margin: the hyperplane providing the maximum margin between the two classes is selected. These margins are calculated using data points known as Support Vectors. Support Vectors are those data points that are near the hyperplane and help in orienting it.
Mathematically, the functioning of the SVM classifier can be understood through the following steps:
Step 1: SVM algorithm predicts the classes. One of the classes is identified as 1 while the other is identified as -1.
Step 2: Like all machine learning algorithms, SVM converts the problem into a mathematical equation involving unknowns. These unknowns are then found by converting the problem into an optimization problem. As optimization problems always aim at maximizing or minimizing something by tweaking the unknowns, in the case of the SVM classifier a loss function known as the hinge loss is used and tweaked to find the maximum margin.
Step 3: For ease of understanding, this loss function can also be called a cost function whose cost is 0 when no class is incorrectly predicted. However, if this is not the case, then an error/loss is calculated. The issue is that there is a trade-off between maximizing the margin and the loss generated if the margin is maximized to a very large extent. To balance these two objectives, a regularization parameter is added.
Step 4: As is the case with most optimization problems, weights are optimized by calculating the gradients using advanced mathematical concepts of calculus viz. partial derivatives.
Step 5: The gradients are updated using only the regularization parameter when there is no classification error, while the loss function term also contributes when a misclassification happens. [5]
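The steps above can be illustrated with a small (sub)gradient-descent sketch on the regularized hinge loss. This is a hypothetical minimal implementation for intuition only; it is not the method used in the experiments below, which rely on scikit-learn.

# Sketch: linear SVM via subgradient descent on the regularized hinge loss
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.001, epochs=1000):
    """y must be encoded as -1/+1 (Step 1)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (np.dot(w, xi) + b)
            if margin >= 1:
                # Correct with enough margin: only the regularizer contributes (Step 5)
                w -= lr * (2 * lam * w)
            else:
                # Misclassified or inside the margin: the hinge term also contributes
                w -= lr * (2 * lam * w - yi * xi)
                b -= lr * (-yi)
    return w, b

# Tiny usage example on a linearly separable toy set
X = np.array([[2.0, 3.0], [1.0, 1.5], [-1.0, -1.0], [-2.0, -2.5]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))  # should recover [ 1.  1. -1. -1.]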
17.1 Training a Linear Classifier
This section will train a model to classify observations.
In 5, the code first loads the necessary modules and functions from the scikit-learn library. Then it loads the first 100 samples from the Iris dataset (which contain only two classes), keeping only two features. Subsequently, it standardizes the features using StandardScaler to ensure that the features have similar scales.
Next, a Linear Support Vector Classifier (LinearSVC) is created with the parameter C set to the default value of 1.0. Then, the fit method of the model is used to train the model using the standardized features and corresponding target values (class labels).
Finally, the code returns a trained SVM model (model), which can be used for subsequent predictions.
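A sketch of what this cell likely looks like, reconstructed from the description above (variable names are assumptions):

# Reconstruction sketch of In 5: linear SVC on two iris features
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# First 100 iris samples: two classes, keep only the first two features
iris = datasets.load_iris()
features = iris.data[:100, :2]
target = iris.target[:100]

# Standardize the features
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)

# Create a linear support vector classifier with the default C=1.0
svc = LinearSVC(C=1.0)

# Train the model
model = svc.fit(features_standardized, target)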
scikit-learn’s LinearSVC implements a simple SVC. To get an intuition behind what an SVC is doing, let’s plot out the data and hyperplane. While SVCs work well in high dimensions, in our solution we loaded only two features and took a subset of observations so that the data contains only two classes. This will let us visualize the model.
In 7, we plot the data points and the hyperplane defined by the SVM model.
Firstly, it loads the pyplot module from the matplotlib library. Then, it determines the color of each data point based on their class, and plots the standardized data points in a scatter plot in a two-dimensional space.
Next, it retrieves the weight vector w from the trained SVM model, and calculates the slope a and intercept of the hyperplane based on this weight vector. Then, it generates a range of values for the x-axis, and calculates the corresponding y-axis values based on the equation of the hyperplane.
Finally, it uses the plt.plot function to plot the calculated hyperplane on the scatter plot. Calling plt.axis("off") turns off the axis labels, and plt.show() displays the plot.
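A sketch of the plotting cell, reconstructed from the description and reusing the objects from the previous sketch:

# Reconstruction sketch of In 7: plot the observations and the hyperplane
import numpy as np
from matplotlib import pyplot as plt

# Color observations by class and plot the standardized points
color = ["black" if c == 0 else "lightgrey" for c in target]
plt.scatter(features_standardized[:, 0], features_standardized[:, 1], c=color)

# Derive the separating line from the learned weights: w0*x + w1*y + b = 0
w = svc.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-2.5, 2.5)
yy = a * xx - (svc.intercept_[0]) / w[1]

# Plot the hyperplane, hide the axes, and show the figure
plt.plot(xx, yy)
plt.axis("off")
plt.show()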
In this visualization, all observations of class 0 are black and observations of class 1 are light gray. The hyperplane is the decision boundary deciding how new observations are classified. Specifically, any observation above the line will be classified as class 0, while any observation below the line will be classified as class 1.
In 8, we can verify this by creating a new observation in the top-left corner of our visualization, meaning it should be predicted to be class 0.
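For example, a sketch of that check (the coordinates are illustrative):

# A new observation in the top-left corner should be assigned class 0
new_observation = [[-2, 3]]
print(svc.predict(new_observation))  # expected: [0]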
There are a few things to note about SVCs. First, for the sake of visualization, we limited our example to a binary example (i.e., only two classes); however, SVCs can work well with multiple classes. Second, as our visualization shows, the hyperplane is by definition linear (i.e., not curved). This was okay in this example because the data was linearly separable, meaning there was a hyperplane that could perfectly separate the two classes. Unfortunately, in the real world this is rarely the case.
More typically, we will not be able to perfectly separate classes. In these situations there is a balance between SVC maximizing the margin of the hyperplane and minimizing the misclassification. In SVC, the latter is controlled with the hyperparameter C. C is a parameter of the SVC learner and is the penalty for misclassifying a data point. When C is small, the classifier is okay with misclassified data points (high bias but low variance). When C is large, the classifier is heavily penalized for misclassified data and therefore bends over backward to avoid any misclassified data points (low bias but high variance).
In scikit-learn, C is determined by the parameter C and defaults to C=1.0. We should treat C as a hyperparameter of our learning algorithm, which we tune using model selection techniques.
17.2 Handling Linearly Inseparable Classes Using Kernels
This section will train a support vector classifier when the classes are linearly inseparable.
In 15, we create a Support Vector Machine (SVM) classifier and use it to classify sample data generated from two features. Here, an XOR gate (the details of which are not important) is used to generate linearly inseparable classes.
Firstly, necessary modules and functions are imported from the scikit-learn library. Then, a random seed is set to ensure reproducibility of results.
Next, 200 samples are generated, each containing two features. Then, an XOR gate is used to generate target classes, where the class labels are determined based on the signs of feature values, resulting in two linearly inseparable classes.
Then, a Support Vector Machine classifier (SVC) is created with a radial basis function (RBF) kernel, and hyperparameters such as random state, gamma, and C are set.
Finally, the fit method is used to train the model with the features and corresponding target values.
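A sketch of this cell, reconstructed from the description (the specific gamma and C values are assumptions beyond what the text states):

# Reconstruction sketch of In 15: linearly inseparable (XOR) data and an RBF SVC
import numpy as np
from sklearn.svm import SVC

# Set a random seed for reproducibility
np.random.seed(0)

# Generate two features for 200 observations
features = np.random.randn(200, 2)

# XOR of the feature signs yields two linearly inseparable classes
target_xor = np.logical_xor(features[:, 0] > 0, features[:, 1] > 0)
target = np.where(target_xor, 0, 1)

# Support vector classifier with a radial basis function kernel
svc = SVC(kernel="rbf", random_state=0, gamma=1, C=1)

# Train the classifier
model = svc.fit(features, target)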
In 16, we can understand the intuition behind kernels by visualizing a simple example. This function, based on one by Sebastian Raschka, plots the observations and decision boundary hyperplane of a two-dimensional space.
In our solution, we have data containing two features (i.e., two dimensions) and a target vector with the class of each observation. Importantly, the classes are assigned such that they are linearly inseparable. That is, there is no straight line we can draw that will divide the two classes.
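A sketch of such a plotting helper, based on Sebastian Raschka's plot_decision_regions; the marker and color choices are assumptions:

# Sketch of a decision-region plotting helper (after Sebastian Raschka)
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, resolution=0.02):
    markers = ("s", "x", "o")
    colors = ("red", "blue", "lightgreen")
    cmap = ListedColormap(colors[: len(np.unique(y))])

    # Evaluate the classifier on a grid covering the feature space
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)

    # Shade the predicted regions and overlay the observations
    plt.contourf(xx1, xx2, Z, alpha=0.3, cmap=cmap)
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(X[y == cl, 0], X[y == cl, 1],
                    alpha=0.8, color=colors[idx],
                    marker=markers[idx], label=cl)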
In 17, we first create a support vector machine classifier with a linear kernel.
In 18, since we have only two features, we are working in a two-dimensional space and can visualize the observations, their classes, and our model’s linear hyperplane:
As we can see, our linear hyperplane did very poorly at dividing the two classes!
In 19, we swap out the linear kernel for a radial basis function kernel and use it to train a new model.
In 20, we then visualize the observations and the hyperplane.
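A sketch of these four cells, reusing the XOR data and the plotting helper sketched above:

# Sketch of In 17–In 20: linear kernel vs. RBF kernel on the XOR data
svc_linear = SVC(kernel="linear", random_state=0, C=1)
svc_linear.fit(features, target)
plot_decision_regions(features, target, classifier=svc_linear)
plt.axis("off")
plt.show()

svc_rbf = SVC(kernel="rbf", random_state=0, gamma=1, C=1)
svc_rbf.fit(features, target)
plot_decision_regions(features, target, classifier=svc_rbf)
plt.axis("off")
plt.show()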
By using the radial basis function kernel we can create a decision boundary that is able to do a much better job of separating the two classes than the linear kernel. This is the motivation behind using kernels in support vector machines.
In 22, we change the kernel to poly; the figure shows another decision boundary, and it performs poorly.
In scikit-learn, we can select the kernel we want to use by using the kernel parameter. Once we select a kernel, we need to specify the appropriate kernel options, such as the value of d (using the degree parameter) in polynomial kernels, and the value of γ (using the gamma parameter) in radial basis function kernels. We will also need to set the penalty parameter, C. When training the model, in most cases we should treat all of these as hyperparameters and use model selection techniques to identify the combination of their values that produces the model with the best performance.
17.3 Creating Predicted Probabilities
This section covers how to obtain predicted class probabilities for an observation.
When using scikit-learn’s SVC, set probability=True, train the model, then use predict_proba to see the calibrated probabilities:
In 23, we use a Support Vector Classifier (SVC) to classify the Iris dataset and predict the class probabilities of new observations.
Firstly, necessary modules and functions are imported from the scikit-learn library. Then, the features and target values are loaded from the Iris dataset using the datasets module.
Next, the features are standardized using the StandardScaler function to ensure that features have similar scales.
Then, a Support Vector Classifier (SVC) object is created with the kernel set to "linear", probability set to True to compute class probabilities, and random_state set to 0.
Subsequently, the fit method is used to train the model with the standardized features and target values.
Finally, a new observation new_observation is created, and the predict_proba method is used to predict its class probabilities.
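A sketch of this cell, reconstructed from the description (the new observation's values are illustrative):

# Reconstruction sketch of In 23: SVC with calibrated class probabilities
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load the full iris data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Standardize the features
features_standardized = StandardScaler().fit_transform(features)

# Linear-kernel SVC with probability estimates enabled
svc = SVC(kernel="linear", probability=True, random_state=0)
model = svc.fit(features_standardized, target)

# Predict class probabilities for one new observation
new_observation = [[0.4, 0.4, 0.4, 0.4]]
print(model.predict_proba(new_observation))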
The new_observation is predicted to be class 1, because its predicted probability for class 1 is about 0.97.
In 32, we visualize the data using two features along with the new observation.
In more practical terms, creating predicted probabilities has two major issues. First, because we are training a second model with cross-validation, generating predicted probabilities can significantly increase the time it takes to train our model. Second, because the predicted probabilities are created using cross-validation, they might not always match the predicted classes. That is, an observation might be predicted to be class 1 but have a predicted probability of being class 1 of less than 0.5.
17.4 Identifying Support Vectors
This section will identify which observations are the support vectors of the decision hyperplane.
In 33, the necessary modules and functions are first imported from the scikit-learn library. Then, the features and target values are loaded from the first 100 samples of the Iris dataset, which contain only two classes. Next, the features are standardized using StandardScaler to ensure that the features have similar scales.
Then, a Support Vector Classifier (SVC) object is created with the kernel set to "linear". The random_state parameter is set to 0 to ensure reproducibility of results.
Subsequently, the fit method is used to train the model with the standardized features and target values.
Finally, the support_vectors_ attribute is used to view the support vectors in the trained model.
The output is a numpy array representing the support vectors used by the trained SVM model. Each row of the array corresponds to a support vector, and each column represents one of the four features.
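A sketch of this cell, reconstructed from the description:

# Reconstruction sketch of In 33: inspecting the support vectors
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# First 100 iris samples: two classes, all four features
iris = datasets.load_iris()
features = iris.data[:100, :]
target = iris.target[:100]

# Standardize the features
features_standardized = StandardScaler().fit_transform(features)

# Train a linear-kernel SVC
svc = SVC(kernel="linear", random_state=0)
model = svc.fit(features_standardized, target)

# View the support vectors themselves
print(model.support_vectors_)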
In 39, we change the kernel to rbf and get more support vectors; this is because the RBF kernel is more flexible and can capture complex relationships between features, allowing the decision boundary to adapt more closely to the data.
Support vector machines get their name from the fact that the hyperplane is being determined by a relatively small number of observations, called the support vectors. Intuitively, think of the hyperplane as being “carried” by these support vectors. These support vectors are therefore very important to our model. For example, if we remove an observation that is not a support vector from the data, the model does not change; however, if we remove a support vector, the hyperplane will not have the maximum margin.
In 42 we can view the indices of the support vectors using support_, and we can use n_support_ to find the number of support vectors belonging to each class.
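A small sketch of those two attributes, reusing the model above:

# Indices of the support vectors and the per-class counts
print(model.support_)    # row indices of the support vectors
print(model.n_support_)  # number of support vectors belonging to each class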
17.5 Handling Imbalanced Classes
This section will train a support vector machine classifier in the presence of imbalanced classes.
In 44, we perform binary classification on a subset of the Iris dataset using a Support Vector Machine (SVM) after creating highly imbalanced classes.
First, necessary modules and functions are imported from the scikit-learn library. Then, features and target values are loaded from the first 100 samples of the Iris dataset, and one class is made highly imbalanced by removing the first 40 observations. The target values are then transformed into a binary classification problem, where the value of class 0 remains 0, and the value of other classes is set to 1. Next, the features are standardized using the StandardScaler function to ensure that features have similar scales.
Then, a Support Vector Classifier (SVC) object is created with a linear kernel, balanced class weights, regularization parameter C set to 1.0, and a random seed set to 0.
Finally, the fit method is used to train the model with the standardized features and target values.
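A sketch of this cell, reconstructed from the description:

# Reconstruction sketch of In 44: SVC with balanced class weights on imbalanced data
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# First 100 iris samples, then drop the first 40 to create imbalance
iris = datasets.load_iris()
features = iris.data[:100, :]
target = iris.target[:100]
features = features[40:, :]
target = target[40:]

# Make the problem binary: class 0 stays 0, everything else becomes 1
target = np.where(target == 0, 0, 1)

# Standardize the features
features_standardized = StandardScaler().fit_transform(features)

# Linear-kernel SVC with class weights balanced by frequency
svc = SVC(kernel="linear", class_weight="balanced", C=1.0, random_state=0)
model = svc.fit(features_standardized, target)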
In support vector machines, C is a hyperparameter that determines the penalty for misclassifying an observation. One method for handling imbalanced classes in support vector machines is to weight C by class, so that Cj = C * Wj, where C is the penalty for misclassification, Wj is a weight inversely proportional to class j’s frequency, and Cj is the C value for class j. The general idea is to increase the penalty for misclassifying minority classes, to prevent them from being “overwhelmed” by the majority class.
In scikit-learn, when using SVC we can set the values for Cj automatically by setting class_weight="balanced". The balanced argument automatically weights the classes such that Wj = n / (k * nj), where Wj is the weight for class j, n is the number of observations, nj is the number of observations in class j, and k is the total number of classes.
Here we use the digits toy dataset provided by sklearn and apply SVM to it.
In 15, we first load the handwritten digits dataset using the load_digits() function from scikit-learn. Then, we split the dataset into training and testing sets using the train_test_split function, with the test set comprising 25% of the total dataset.
Next, we standardize the features using StandardScaler to ensure uniform scale across all features. Subsequently, a Linear Support Vector Classifier (LinearSVC) is instantiated with default parameters (C=1.0).
The classifier is then trained on the standardized training set using the fit method.
Following training, the model predicts the labels for the test set using the predict method. We compute the accuracy of the model using the accuracy_score function.
Finally, we generate a classification report using the classification_report function, which includes metrics such as precision, recall, and F1-score to assess the classification performance of the model. With this, we get an accuracy of 0.962222.
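A sketch of this pipeline, reconstructed from the description (the random_state for the split is an assumption; the 0.962222 accuracy is the value reported by the original run):

# Reconstruction sketch: LinearSVC baseline on the digits dataset
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Load the handwritten digits data and hold out 25% for testing
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Standardize the features (fit on the training split only)
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Train a LinearSVC with the default C=1.0 and evaluate it
clf = LinearSVC(C=1.0)
clf.fit(X_train_std, y_train)
y_pred = clf.predict(X_test_std)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))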
In 41, we use GridSearchCV to find the best C for LinearSVC. We compare C values [0.001, 0.01, 0.1, 1, 10, 100]; the best result is C = 0.001, so we use this parameter to train the model, and the test accuracy is 0.98. The table below shows the GridSearchCV results.
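A sketch of that search, reusing the standardized splits from the previous sketch (the number of CV folds is an assumption):

# Grid search over C for LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LinearSVC(), param_grid, cv=5)
grid.fit(X_train_std, y_train)

print(grid.best_params_)               # reported best: C = 0.001
print(grid.score(X_test_std, y_test))  # reported test accuracy: 0.98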
In 42, we change the model to SVC and set the kernel to linear to compare with LinearSVC.
The table below shows the difference.
In 46, we use GridSearchCV to find the best kernel; the result shows that 'rbf' is the best kernel. The table below shows the GridSearchCV results.
In 57, we use GridSearchCV to search over both the kernel and C; the result shows that C = 10.0 with kernel = poly is the best combination. The table below shows the GridSearchCV results.
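A sketch of the joint search (the grid values beyond those reported in the text are assumptions):

# Joint grid search over kernel and C for SVC
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"kernel": ["linear", "rbf", "poly"],
              "C": [0.1, 1.0, 10.0, 100.0]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train_std, y_train)

print(grid.best_params_)  # reported best: kernel='poly', C=10.0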
In 73, we use the above result to train the model and get an accuracy of 0.9866.
The confusion matrix shows the result of In 73.
The figure SVM Decision Boundaries (PCA-reduced) shows the boundaries of the model.
The table compares the previous work (P3) with this experiment; we can see that the SVC with the poly kernel achieves the best accuracy.
Complete code: GitHub
[1] Chapter 17. Support Vector Machines , Machine Learning with Python - Theory and Implementation
[2] Support Vector Machines , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[3] sklearn.svm.LinearSVC , Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[4] Guide on Support Vector Machine (SVM) Algorithm , Anshul Saini, 23 Jan, 2024, AnalyticsVidhya.
[5] Introduction to SVM – Support Vector Machine Algorithm of Machine Learning , Sumeet Bansal, JULY 7, 2021, analytixlabs.