Reference: https://datatron.com/what-is-a-support-vector-machine/
Machine learning beginners often start with simple regression and classification algorithms, which are easy to grasp. But as you progress, it's important to explore more advanced techniques that can tackle complex cases. Enter Support Vector Machines or SVMs.
SVMs are a powerful type of algorithm that can handle intricate data patterns and find optimal decision boundaries. While not as basic as regression or classification, SVMs are worth delving into for their unique capabilities. So, let's take a closer look at what SVMs are all about.
Support Vector Machines, commonly known as SVMs, are supervised machine learning models; that is, they are trained on labelled datasets. SVMs can solve both linear and nonlinear problems, and they separate classes using the concept of a margin. In practice, they are mostly employed to solve classification problems. The algorithm's goal is to identify the best line or decision boundary (the one with the widest margin) that divides n-dimensional space into classes, so that fresh data points can be placed in the appropriate class in the future. This decision boundary is called a hyperplane. SVMs often outperform Decision Trees, KNNs, Naive Bayes classifiers, logistic regression, etc. in terms of accuracy.
Moreover, SVMs have occasionally been observed to outperform neural networks. Because they are simpler to implement and can reach high accuracy with less computation, SVMs are often recommended.
SVMs primarily fall into one of two categories, based on the training data:
Linear SVM – the data points can be separated by a straight line (a linear decision boundary).
Non-Linear SVM – the data points cannot be separated by a straight line. (A minimal sketch of both cases follows below.)
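As a quick preview, here is a minimal sketch of the two cases using scikit-learn; the make_blobs and make_moons datasets are illustrative stand-ins, not data from this article:

```python
# Minimal sketch: the two SVM categories in scikit-learn.
# Assumes scikit-learn is installed; datasets and parameters are illustrative.
from sklearn.datasets import make_blobs, make_moons
from sklearn.svm import SVC

# Linear case: two well-separated blobs can be split by a straight line.
X_lin, y_lin = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)
linear_svm = SVC(kernel="linear").fit(X_lin, y_lin)
print("Linear SVM training accuracy:", linear_svm.score(X_lin, y_lin))

# Non-linear case: interleaving half-moons cannot be split by a straight line,
# so a non-linear kernel (here RBF) is used instead.
X_nl, y_nl = make_moons(n_samples=200, noise=0.1, random_state=0)
nonlinear_svm = SVC(kernel="rbf").fit(X_nl, y_nl)
print("Non-linear (RBF) SVM training accuracy:", nonlinear_svm.score(X_nl, y_nl))
```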
Now, in the next section, let's look at how SVMs work.
Before digging into how an SVM operates, let's quickly go over the following terminology (a short code sketch after the list makes these terms concrete).
Margin – Margin is the gap between the hyperplane and the support vectors.
Hyperplane – Hyperplanes are decision boundaries that aid in classifying the data points.
Support Vectors – Support Vectors are the data points that are on or nearest to the hyperplane and influence the position of the hyperplane.
Kernel function – These are the functions used to determine the shape of the hyperplane and decision boundary.
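To make these terms concrete, here is a small illustrative sketch, assuming scikit-learn and NumPy are available and using a made-up blob dataset, that fits a linear SVM and reads off the hyperplane, the support vectors, and the margin width:

```python
# Illustrative sketch: hyperplane, support vectors and margin of a linear SVM.
# Assumes scikit-learn and NumPy; the blob dataset is purely illustrative.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=6)

clf = SVC(kernel="linear", C=1000)  # large C -> a hard(er) margin
clf.fit(X, y)

w = clf.coef_[0]          # normal vector of the hyperplane w . x + b = 0
b = clf.intercept_[0]
print("Hyperplane: %.3f*x1 + %.3f*x2 + %.3f = 0" % (w[0], w[1], b))

# Support vectors: the points lying on (or inside) the margin.
print("Support vectors:\n", clf.support_vectors_)

# Margin: the gap between the two margin boundaries is 2 / ||w||.
print("Margin width:", 2 / np.linalg.norm(w))
```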
Reference: https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47
A linear SVM is used when the data can easily be divided into two different classes, as shown in the examples above. The SVM algorithm is applied and finds the best hyperplane that divides the two classes.
SVM considers all the data points and generates a line called a "hyperplane" that separates the two classes. This boundary is also known as a "decision boundary." Anything that falls on one side of it belongs to class A, and anything on the other side belongs to class B. There may be more than one possible hyperplane, but we can determine which one is best by looking at the margin. Finding the hyperplane that correctly classifies the data points with the widest margin is the basic goal of SVM.
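As a hedged sketch of how fresh data points get assigned to a class (scikit-learn assumed; the data and the "new" points are invented), the sign of the decision function tells us which side of the hyperplane a point falls on:

```python
# Sketch: the side of the hyperplane a new point falls on decides its class.
# Assumes scikit-learn; data and "new" points are invented for illustration.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=6)
clf = SVC(kernel="linear").fit(X, y)

new_points = X[:3]  # pretend these are fresh, unseen points
print("Signed distance to hyperplane:", clf.decision_function(new_points))
print("Predicted classes:            ", clf.predict(new_points))
# A positive decision value means one class, a negative value the other.
```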
Support Vector Machines always use a kernel function, whether the data is linearly separable or not, but its real power is leveraged only when the data is not separable in its present form.
In the case of nonlinear data, SVM makes use of the kernel trick. The idea is to map the non-linearly separable data from a lower-dimensional space into a higher-dimensional space in which a separating hyperplane can be found.
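To see why this mapping matters, here is a minimal comparison, assuming scikit-learn and using make_moons as a convenient non-linearly separable toy set: a linear kernel struggles on such data, while the RBF kernel handles it by implicitly working in a higher-dimensional space.

```python
# Sketch: linear vs. RBF kernel on non-linearly separable data.
# Assumes scikit-learn; make_moons is an illustrative stand-in for real data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_tr, y_tr)
    print(kernel, "kernel test accuracy:", clf.score(X_te, y_te))
# The RBF kernel is expected to score noticeably higher here, because it
# implicitly maps the moons into a space where they become separable.
```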
Reference: https://www.scaler.com/topics/machine-learning/non-linear-svm/
For example, the mapping function transforms the 2D nonlinear input space into a 3D output space using kernel functions.
From the example on the left, one can see that it is not possible to separate the two classes with a straight line, i.e., no linear hyperplane separates them.
Most real-world data cannot simply be separated by a straight line and thus requires some transformation into a space where it can be separated.
Reference: https://www.scaler.com/topics/machine-learning/non-linear-svm/
However, the classes can be separated by a circular boundary, so we can introduce a third coordinate Z, computed from X and Y as Z = X^2 + Y^2. After introducing this third dimension, the graph changes: the data points become linearly separable and can be separated by a flat hyperplane (a plane).
This representation is in 3-D, with the added Z-axis; viewed in 2-D, the graph looks as displayed.
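Here is a hedged sketch of that exact transformation, assuming scikit-learn and NumPy and using make_circles as a stand-in for the plotted data: adding the Z = X^2 + Y^2 feature is enough for a plain linear SVM.

```python
# Sketch: the explicit Z = X^2 + Y^2 mapping from 2-D to 3-D.
# Assumes scikit-learn and NumPy; make_circles stands in for the plotted data.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# In the original 2-D space a linear SVM cannot separate the circles well.
print("2-D linear SVM accuracy:", SVC(kernel="linear").fit(X, y).score(X, y))

# Add the third coordinate Z = X^2 + Y^2 and try again.
Z = (X ** 2).sum(axis=1).reshape(-1, 1)
X3d = np.hstack([X, Z])
print("3-D linear SVM accuracy:", SVC(kernel="linear").fit(X3d, y).score(X3d, y))
# With the extra dimension, a flat hyperplane (a plane) separates the classes.
```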
So, let's learn more about what a kernel is, and why the dot product is so critical to the way kernels are used.
We have seen how data can be divided using higher-dimensional transformations so that classification predictions can be made. It appears that, in order to train a support vector classifier and maximize our objective function, we would need to perform operations with the higher-dimensional vectors in the transformed feature space. In real applications the data may have many features, and applying transformations that involve numerous polynomial combinations of these features leads to prohibitively expensive and infeasible computational costs.
The kernel trick resolves this issue. The "trick" is that kernel methods never explicitly apply the transformation ϕ(x) or represent the data by its transformed coordinates in the higher-dimensional feature space; instead, they work only through a set of pairwise similarity comparisons between the original data observations x (with their original coordinates in the lower-dimensional space).
In kernel methods, the data set X is represented by an n x n kernel matrix of pairwise similarity comparisons, where entry (i, j) is defined by the kernel function k(xi, xj). This kernel function has a special mathematical property: it acts as a modified dot product. We have: k(xi, xj) = ϕ(xi) · ϕ(xj).
Reference: https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f
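As a small sketch of what this n x n kernel matrix looks like in code (NumPy and scikit-learn assumed; the data is random and the RBF kernel is just one possible choice of k):

```python
# Sketch: building the n x n kernel (Gram) matrix of pairwise similarities.
# Assumes NumPy and scikit-learn; the data is random and the RBF kernel is
# just one possible choice for k(xi, xj).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))       # n = 5 observations, 2 original features

K = rbf_kernel(X, X, gamma=0.5)   # K[i, j] = k(xi, xj) = exp(-gamma * ||xi - xj||^2)
print(K.shape)                    # (5, 5): one similarity per pair of points
print(np.round(K, 3))
```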
Our kernel function accepts inputs in the original lower-dimensional space and returns the dot product of the transformed vectors in the higher-dimensional space. There are also theorems (notably Mercer's theorem) which, under certain conditions, guarantee the existence of such kernel functions.
It helps to keep in mind that each coordinate of the transformed vector ϕ(x) is just a function of the coordinates of the corresponding lower-dimensional vector x; this makes it easier to see how the kernel function can equal the dot product of the transformed vectors.
The dot product is critical to the use of kernels in SVMs because it measures the similarity, or proximity, between pairs of data points in the transformed feature space. The kernel function computes this dot product directly, which allows SVMs to work implicitly in a higher-dimensional space without ever transforming the data explicitly. That pairwise similarity is exactly what is needed to find the optimal hyperplane that separates the classes or predicts the target values.
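One way to make the role of the dot product tangible is the following sketch (scikit-learn and NumPy assumed, toy data): an SVC trained with kernel='linear' and an SVC trained on the precomputed Gram matrix of plain dot products should give the same predictions, because all the classifier ever needs from the data is those pairwise dot products.

```python
# Sketch: an SVM only needs pairwise dot products (the Gram matrix) to train.
# Assumes scikit-learn and NumPy; the data is a toy blob set.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# Ordinary linear-kernel SVM.
clf_linear = SVC(kernel="linear").fit(X, y)

# Same model, but fed only the matrix of pairwise dot products.
gram = X @ X.T
clf_precomp = SVC(kernel="precomputed").fit(gram, y)

# The predictions should agree: the classifier never needed the raw
# coordinates, only the dot products between pairs of points.
same = np.array_equal(clf_linear.predict(X), clf_precomp.predict(X @ X.T))
print("Identical predictions:", same)
```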
Watch the video below for a visual representation of the kernel and how it transforms the vectors.
Reference: https://gfycat.com/cluelessdefinitiveblackandtancoonhound-howto-style-mathmatics-polynomial
Reference: https://data-flair.training/blogs/svm-kernel-functions/
Taking a deeper look at Polynomial Kernel and Radial Basis Function (RBF) kernel
Radial Basis Function (RBF) Kernel:
RBF is the default kernel used within sklearn's SVM classification algorithm (SVC) and can be described with the following formula: K(x1, x2) = exp(-gamma * ||x1 - x2||^2), where ||x1 - x2|| is the Euclidean distance between the two points.
Reference: https://towardsdatascience.com/svm-classifier-and-rbf-kernel-how-to-make-better-models-in-python-73bb4914af5b
where gamma can be set manually and has to be > 0. The default value for gamma in sklearn's SVM classification algorithm is gamma = 'scale', which is computed as 1 / (n_features * X.var()).
So, given the above setup, we can control individual points' influence on the overall algorithm. The larger gamma is, the closer other points must be to affect the model.
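A quick hedged sketch of that behaviour (scikit-learn assumed; the dataset and gamma values are arbitrary): with a larger gamma, each training point has a smaller radius of influence, which typically shows up as a more wiggly boundary that can overfit.

```python
# Sketch: how gamma changes an RBF SVM's behaviour.
# Assumes scikit-learn; the dataset and gamma values are illustrative.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for gamma in (0.01, 1, 100):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    print(f"gamma={gamma:<6} support vectors={clf.n_support_.sum():<4} "
          f"train acc={clf.score(X_tr, y_tr):.3f} test acc={clf.score(X_te, y_te):.3f}")
# Very small gamma -> every point influences far away (smoother, possibly underfit);
# very large gamma -> only very close points matter (can overfit the training set).
```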
Reference: Pham, Trong-Ton. (2010). MODELE DE GRAPHE ET MODELE DE LANGUE POUR LA RECONNAISSANCE DE SCENES VISUELLES.
Here, the RBF kernel maps the data points into a 3D space using a Gaussian function, as explained above.
Polynomial Kernel:
So what are polynomial features? Polynomial features are features derived from the existing features in the data set. For example, if we have a data set with a single feature x and we want polynomial features of degree 3, then the polynomial features will be x, x², and x³. If we have other features, each of them is expanded similarly.
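For illustration, scikit-learn's PolynomialFeatures can generate these derived features explicitly; the degree and the tiny input below are arbitrary:

```python
# Sketch: explicit polynomial features for a single input feature x.
# Assumes scikit-learn and NumPy; degree=3 matches the example in the text.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1.0], [2.0], [3.0]])           # one feature x
poly = PolynomialFeatures(degree=3, include_bias=False)
print(poly.fit_transform(x))                  # columns: x, x^2, x^3
# [[ 1.  1.  1.]
#  [ 2.  4.  8.]
#  [ 3.  9. 27.]]
```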
As an illustration, the kernel method for the second-degree polynomial is shown below. In a previous image, we showed this transformation in three dimensions. The coordinates of the transformed vectors are functions of the two components x1 and x2, so the dot product will involve only x1 and x2. The kernel function likewise accepts the inputs x1 and x2 and returns a real number.
Reference: https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f
The kernel function here is the polynomial kernel k(a, b) = (a^T b)^2.
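Here is a small numerical check of this identity (a NumPy sketch; the explicit map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) is the standard second-degree mapping and the sample vectors are made up):

```python
# Sketch: verify k(a, b) = (a . b)^2 equals the dot product of the
# explicitly transformed vectors phi(a) and phi(b).
# Assumes NumPy; phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) is the standard
# second-degree mapping and the sample vectors are made up.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    return float(np.dot(a, b)) ** 2

a = np.array([2.0, 3.0])
b = np.array([1.0, -4.0])

print("kernel value   :", poly_kernel(a, b))               # (2*1 + 3*(-4))^2 = 100
print("phi(a).phi(b)  :", float(np.dot(phi(a), phi(b))))   # should match
```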
We need a random dataset, so let's create a non-linearly separable dataset using sklearn.
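A hedged sketch of what that dataset creation might look like (the exact parameters are assumptions, since the original notebook code is not shown here):

```python
# Sketch: create a non-linearly separable toy dataset with scikit-learn.
# The exact parameters are assumptions; the original notebook code is not shown.
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", s=15)
plt.title("Non-linearly separable data (two concentric circles)")
plt.show()
```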
The images above represent the distribution of the data before transformation.
This is the transformed data, displayed in two dimensions.
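To close the loop, here is a sketch under the same assumed make_circles data: an SVC with the RBF kernel separates this data directly, without us ever building the transformed feature by hand.

```python
# Sketch: the RBF kernel separates the circles without manual feature engineering.
# Uses the same assumed make_circles data as above.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf").fit(X_tr, y_tr)   # kernel trick: no explicit z = x^2 + y^2
print("Test accuracy with RBF kernel:", clf.score(X_te, y_te))
```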